1. 23 8月, 2014 16 次提交
    • Y
      tcp: improve undo on timeout · 989e04c5
      Yuchung Cheng 提交于
      Upon timeout, undo (via both timestamps/Eifel and DSACKs) was
      disabled if any retransmits were still in flight.  The concern was
      perhaps that spurious retransmission sent in a previous recovery
      episode may trigger DSACKs to falsely undo the current recovery.
      
      However, this inadvertently misses undo opportunities (using either
      TCP timestamps or DSACKs) when timeout occurs during a loss episode,
      i.e.  recurring timeouts or timeout during fast recovery. In these
      cases some retransmissions will be in flight but we should allow
      undo. Furthermore, we should only reset undo_marker and undo_retrans
      upon timeout if we are starting a new recovery episode. Finally,
      when we do reset our undo state, we now do so in a manner similar
      to tcp_enter_recovery(), so that we require a DSACK for each of
      the outstsanding retransmissions. This will achieve the original
      goal by requiring that we receive the same number of DSACKs as
      retransmissions.
      
      This patch increases the undo events by 50% on Google servers.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      989e04c5
    • E
      net: remove dead code after sk_data_ready change · 884cf705
      Eric Dumazet 提交于
      As a followup to commit 676d2369 ("net: Fix use after free by
      removing length arg from sk_data_ready callbacks"), we can remove
      some useless code in sock_queue_rcv_skb() and rxrpc_queue_rcv_skb()
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      884cf705
    • E
      net: use ktime_get_ns() and ktime_get_real_ns() helpers · d2de875c
      Eric Dumazet 提交于
      ktime_get_ns() replaces ktime_to_ns(ktime_get())
      
      ktime_get_real_ns() replaces ktime_to_ns(ktime_get_real())
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d2de875c
    • M
      mac80211: fix channel switch for chanctx-based drivers · 47e4df94
      Michal Kazior 提交于
      The new_ctx pointer is set only for non-chanctx drivers.  This yielded a
      crash for chanctx-based drivers during channel switch finalization:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
        IP: ieee80211_vif_use_reserved_switch+0x71c/0xb00 [mac80211]
      
      Use an adequate chanctx pointer to fix this.
      Reported-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NMichal Kazior <michal.kazior@tieto.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      47e4df94
    • H
      af_decnet: Use time_after_eq · c0b80236
      Himangi Saraogi 提交于
      The functions time_before, time_before_eq, time_after, and time_after_eq
      are more robust for comparing jiffies against other values.
      
      A simplified version of the Coccinelle semantic patch making this change
      is as follows:
      
      @change@
      expression E1,E2,E3;
      @@
      - jiffies - E1 >= (E2*E3)
      + time_after_eq(jiffies, E1+E2*E3)
      Signed-off-by: NHimangi Saraogi <himangi774@gmail.com>
      Acked-by: NJulia Lawall <julia.lawall@lip6.fr>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c0b80236
    • H
      decnet: Use time_after_eq · 8b1b1eb5
      Himangi Saraogi 提交于
      The functions time_before, time_before_eq, time_after, and time_after_eq
      are more robust for comparing jiffies against other values.
      
      A simplified version of the Coccinelle semantic patch making this change
      is as follows:
      
      @change@
      expression E1,E2;
      @@
      - (jiffies - E1) >= E2
      + time_after_eq(jiffies, E1+E2)
      Signed-off-by: NHimangi Saraogi <himangi774@gmail.com>
      Acked-by: NJulia Lawall <julia.lawall@lip6.fr>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8b1b1eb5
    • H
      ipconfig: Use time_before · c72c95a0
      Himangi Saraogi 提交于
      The functions time_before, time_before_eq, time_after, and time_after_eq
      are more robust for comparing jiffies against other values.
      
      A simplified version of the Coccinelle semantic patch making this change
      is as follows:
      
      @change@
      expression E1,E2;
      @@
      - jiffies - E1 < E2
      + time_before(jiffies, E1+E2)
      Signed-off-by: NHimangi Saraogi <himangi774@gmail.com>
      Acked-by: NJulia Lawall <julia.lawall@lip6.fr>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c72c95a0
    • H
      dn_dev: Use time_before · b5c5c36d
      Himangi Saraogi 提交于
      The functions time_before, time_before_eq, time_after, and time_after_eq
      are more robust for comparing jiffies against other values.
      
      A simplified version of the Coccinelle semantic patch making this change
      is as follows:
      
      @change@
      expression E1,E2;
      @@
      
      (
      - (jiffies - E1) < E2
      + time_before(jiffies, E1+E2)
      )
      Signed-off-by: NHimangi Saraogi <himangi774@gmail.com>
      Acked-by: NJulia Lawall <julia.lawall@lip6.fr>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b5c5c36d
    • A
      br_multicast: Replace rcu_assign_pointer() with RCU_INIT_POINTER() · 0932997e
      Andreea-Cristina Bernat 提交于
      The use of "rcu_assign_pointer()" is NULLing out the pointer.
      According to RCU_INIT_POINTER()'s block comment:
      "1.   This use of RCU_INIT_POINTER() is NULLing out the pointer"
      it is better to use it instead of rcu_assign_pointer() because it has a
      smaller overhead.
      
      The following Coccinelle semantic patch was used:
      @@
      @@
      
      - rcu_assign_pointer
      + RCU_INIT_POINTER
        (..., NULL)
      Signed-off-by: NAndreea-Cristina Bernat <bernat.ada@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0932997e
    • A
      net/openvswitch/flow.c: Replace rcu_dereference() with rcu_access_pointer() · 8c6b00c8
      Andreea-Cristina Bernat 提交于
      The "rcu_dereference()" call is used directly in a condition.
      Since its return value is never dereferenced it is recommended to use
      "rcu_access_pointer()" instead of "rcu_dereference()".
      Therefore, this patch makes the replacement.
      
      The following Coccinelle semantic patch was used:
      @@
      @@
      
      (
       if(
       (<+...
      - rcu_dereference
      + rcu_access_pointer
        (...)
        ...+>)) {...}
      |
       while(
       (<+...
      - rcu_dereference
      + rcu_access_pointer
        (...)
        ...+>)) {...}
      )
      Signed-off-by: NAndreea-Cristina Bernat <bernat.ada@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8c6b00c8
    • A
      net/ipv4/igmp.c: Replace rcu_dereference() with rcu_access_pointer() · e6b68883
      Andreea-Cristina Bernat 提交于
      The "rcu_dereference()" call is used directly in a condition.
      Since its return value is never dereferenced it is recommended to use
      "rcu_access_pointer()" instead of "rcu_dereference()".
      Therefore, this patch makes the replacement.
      
      The following Coccinelle semantic patch was used:
      @@
      @@
      
      (
       if(
       (<+...
      - rcu_dereference
      + rcu_access_pointer
        (...)
        ...+>)) {...}
      |
       while(
       (<+...
      - rcu_dereference
      + rcu_access_pointer
        (...)
        ...+>)) {...}
      )
      Signed-off-by: NAndreea-Cristina Bernat <bernat.ada@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e6b68883
    • S
      ipv4: Restore accept_local behaviour in fib_validate_source() · 1dced6a8
      Sébastien Barré 提交于
      Commit 7a9bc9b8 ("ipv4: Elide fib_validate_source() completely when possible.")
      introduced a short-circuit to avoid calling fib_validate_source when not
      needed. That change took rp_filter into account, but not accept_local.
      This resulted in a change of behaviour: with rp_filter and accept_local
      off, incoming packets with a local address in the source field should be
      dropped.
      
      Here is how to reproduce the change pre/post 7a9bc9b8 commit:
      -configure the same IPv4 address on hosts A and B.
      -try to send an ARP request from B to A.
      -The ARP request will be dropped before that commit, but accepted and answered
      after that commit.
      
      This adds a check for ACCEPT_LOCAL, to maintain full
      fib validation in case it is 0. We also leave __fib_validate_source() earlier
      when possible, based on the same check as fib_validate_source(), once the
      accept_local stuff is verified.
      
      Cc: Gregory Detal <gregory.detal@uclouvain.be>
      Cc: Christoph Paasch <christoph.paasch@uclouvain.be>
      Cc: Hannes Frederic Sowa <hannes@redhat.com>
      Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
      Signed-off-by: NSébastien Barré <sebastien.barre@uclouvain.be>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1dced6a8
    • D
      net: sctp: fix suboptimal edge-case on non-active active/retrans path selection · aa4a83ee
      Daniel Borkmann 提交于
      In SCTP, selection of active (T.ACT) and retransmission (T.RET)
      transports is being done whenever transport control operations
      (UP, DOWN, PF, ...) are engaged through sctp_assoc_control_transport().
      
      Commits 4c47af4d ("net: sctp: rework multihoming retransmission
      path selection to rfc4960") and a7288c4d ("net: sctp: improve
      sctp_select_active_and_retran_path selection") have both improved
      it towards a more fine-grained and optimal path selection.
      
      Currently, the selection algorithm for T.ACT and T.RET is as follows:
      
      1) Elect the two most recently used ACTIVE transports T1, T2 for
         T.ACT, T.RET, where T.ACT<-T1 and T1 is most recently used
      2) In case primary path T.PRI not in {T1, T2} but ACTIVE, set
         T.ACT<-T.PRI and T.RET<-T1
      3) If only T1 is ACTIVE from the set, set T.ACT<-T1 and T.RET<-T1
      4) If none is ACTIVE, set T.ACT<-best(T.PRI, T.RET, T3) where
         T3 is the most recently used (if avail) in PF, set T.RET<-T.PRI
      
      Prior to above commits, 4) was simply a camp on T.ACT<-T.PRI and
      T.RET<-T.PRI, ignoring possible paths in PF. Camping on T.PRI is
      still slightly suboptimal as it can lead to the following scenario:
      
      Setup:
              <A>                                <B>
          T1: p1p1 (10.0.10.10) <==>  .'`)  <==> p1p1 (10.0.10.12)  <= T.PRI
          T2: p1p2 (10.0.10.20) <==> (_ . ) <==> p1p2 (10.0.10.22)
      
          net.sctp.rto_min = 1000
          net.sctp.path_max_retrans = 2
          net.sctp.pf_retrans = 0
          net.sctp.hb_interval = 1000
      
      T.PRI is permanently down, T2 is put briefly into PF state (e.g. due to
      link flapping). Here, the first time transmission is sent over PF path
      T2 as it's the only non-INACTIVE path, but the retransmitted data-chunks
      are sent over the INACTIVE path T1 (T.PRI), which is not good.
      
      After the patch, it's choosing better transports in both cases by
      modifying step 4):
      
      4) If none is ACTIVE, set T.ACT_new<-best(T.ACT_old, T3) where T3 is
         the most recently used (if avail) in PF, set T.RET<-T.ACT_new
      
      This will still select a best possible path in PF if available (which
      can also include T.PRI/T.RET), and set both T.ACT/T.RET to it.
      
      In case sctp_assoc_control_transport() *just* put T.ACT_old into INACTIVE
      as it transitioned from ACTIVE->PF->INACTIVE and stays in INACTIVE just
      for a very short while before going back ACTIVE, it will guarantee that
      this path will be reselected for T.ACT/T.RET since T3 (PF) is not
      available.
      
      Previously, this was not possible, as we would only select between T.PRI
      and T.RET, and a possible T3 would be NULL due to the fact that we have
      just transitioned T3 in sctp_assoc_control_transport() from PF->INACTIVE
      and would select a suboptimal path when T.PRI/T.RET have worse properties.
      
      In the case that T.ACT_old permanently went to INACTIVE during this
      transition and there's no PF path available, plus T.PRI and T.RET are
      INACTIVE as well, we would now camp on T.ACT_old, but if everything is
      being INACTIVE there's really not much we can do except hoping for a
      successful HB to bring one of the transports back up again and, thus
      cause a new selection through sctp_assoc_control_transport().
      
      Now both tests work fine:
      
      Case 1:
      
       1. T1 S(ACTIVE) T.ACT
          T2 S(ACTIVE) T.RET
      
       2. T1 S(ACTIVE) T.ACT, T.RET
          T2 S(PF)
      
       3. T1 S(ACTIVE) T.ACT, T.RET
          T2 S(INACTIVE)
      
       5. T1 S(PF) T.ACT, T.RET
          T2 S(INACTIVE)
      
      [ 5.1 T1 S(INACTIVE) T.ACT, T.RET
            T2 S(INACTIVE) ]
      
       6. T1 S(ACTIVE) T.ACT, T.RET
          T2 S(INACTIVE)
      
       7. T1 S(ACTIVE) T.ACT
          T2 S(ACTIVE) T.RET
      
      Case 2:
      
       1. T1 S(ACTIVE) T.ACT
          T2 S(ACTIVE) T.RET
      
       2. T1 S(PF)
          T2 S(ACTIVE) T.ACT, T.RET
      
       3. T1 S(INACTIVE)
          T2 S(ACTIVE) T.ACT, T.RET
      
       5. T1 S(INACTIVE)
          T2 S(PF) T.ACT, T.RET
      
      [ 5.1 T1 S(INACTIVE)
            T2 S(INACTIVE) T.ACT, T.RET ]
      
       6. T1 S(INACTIVE)
          T2 S(ACTIVE) T.ACT, T.RET
      
       7. T1 S(ACTIVE) T.ACT
          T2 S(ACTIVE) T.RET
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Acked-by: NVlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aa4a83ee
    • D
      net: sctp: spare unnecessary comparison in sctp_trans_elect_best · ea4f19c1
      Daniel Borkmann 提交于
      When both transports are the same, we don't have to go down that
      road only to realize that we will return the very same transport.
      We are guaranteed that curr is always non-NULL. Therefore, just
      short-circuit this special case.
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Acked-by: NVlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ea4f19c1
    • J
      openvswitch: fix panic with multiple vlan headers · 2ba5af42
      Jiri Benc 提交于
      When there are multiple vlan headers present in a received frame, the first
      one is put into vlan_tci and protocol is set to ETH_P_8021Q. Anything in the
      skb beyond the VLAN TPID may be still non-linear, including the inner TCI
      and ethertype. While ovs_flow_extract takes care of IP and IPv6 headers, it
      does nothing with ETH_P_8021Q. Later, if OVS_ACTION_ATTR_POP_VLAN is
      executed, __pop_vlan_tci pulls the next vlan header into vlan_tci.
      
      This leads to two things:
      
      1. Part of the resulting ethernet header is in the non-linear part of the
         skb. When eth_type_trans is called later as the result of
         OVS_ACTION_ATTR_OUTPUT, kernel BUGs in __skb_pull. Also, __pop_vlan_tci
         is in fact accessing random data when it reads past the TPID.
      
      2. network_header points into the ethernet header instead of behind it.
         mac_len is set to a wrong value (10), too.
      Reported-by: NYulong Pei <ypei@redhat.com>
      Signed-off-by: NJiri Benc <jbenc@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2ba5af42
    • B
      net: ipv6: fib: don't sleep inside atomic lock · 793c3b40
      Benjamin Block 提交于
      The function fib6_commit_metrics() allocates a piece of memory in mode
      GFP_KERNEL while holding an atomic lock from higher up in the stack, in
      the function __ip6_ins_rt(). This produces the following BUG:
      
      > BUG: sleeping function called from invalid context at mm/slub.c:1250
      > in_atomic(): 1, irqs_disabled(): 0, pid: 2909, name: dhcpcd
      > 2 locks held by dhcpcd/2909:
      >  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff81978e67>] rtnl_lock+0x17/0x20
      >  #1:  (&tb->tb6_lock){++--+.}, at: [<ffffffff81a6951a>] ip6_route_add+0x65a/0x800
      > CPU: 1 PID: 2909 Comm: dhcpcd Not tainted 3.17.0-rc1 #1
      > Hardware name: ASUS All Series/Q87T, BIOS 0216 10/16/2013
      >  0000000000000008 ffff8800c8f13858 ffffffff81af135a 0000000000000000
      >  ffff880212202430 ffff8800c8f13878 ffffffff810f8d3a ffff880212202c98
      >  0000000000000010 ffff8800c8f138c8 ffffffff8121ad0e 0000000000000001
      > Call Trace:
      >  [<ffffffff81af135a>] dump_stack+0x4e/0x68
      >  [<ffffffff810f8d3a>] __might_sleep+0x10a/0x120
      >  [<ffffffff8121ad0e>] kmem_cache_alloc_trace+0x4e/0x190
      >  [<ffffffff81a6bcd6>] ? fib6_commit_metrics+0x66/0x110
      >  [<ffffffff81a6bcd6>] fib6_commit_metrics+0x66/0x110
      >  [<ffffffff81a6cbf3>] fib6_add+0x883/0xa80
      >  [<ffffffff81a6951a>] ? ip6_route_add+0x65a/0x800
      >  [<ffffffff81a69535>] ip6_route_add+0x675/0x800
      >  [<ffffffff81a68f2a>] ? ip6_route_add+0x6a/0x800
      >  [<ffffffff81a6990c>] inet6_rtm_newroute+0x5c/0x80
      >  [<ffffffff8197cf01>] rtnetlink_rcv_msg+0x211/0x260
      >  [<ffffffff81978e67>] ? rtnl_lock+0x17/0x20
      >  [<ffffffff81119708>] ? lock_release_holdtime+0x28/0x180
      >  [<ffffffff81978e67>] ? rtnl_lock+0x17/0x20
      >  [<ffffffff8197ccf0>] ? __rtnl_unlock+0x20/0x20
      >  [<ffffffff819a989e>] netlink_rcv_skb+0x6e/0xd0
      >  [<ffffffff81978ee5>] rtnetlink_rcv+0x25/0x40
      >  [<ffffffff819a8e59>] netlink_unicast+0xd9/0x180
      >  [<ffffffff819a9600>] netlink_sendmsg+0x700/0x770
      >  [<ffffffff81103735>] ? local_clock+0x25/0x30
      >  [<ffffffff8194e83c>] sock_sendmsg+0x6c/0x90
      >  [<ffffffff811f98e3>] ? might_fault+0xa3/0xb0
      >  [<ffffffff8195ca6d>] ? verify_iovec+0x7d/0xf0
      >  [<ffffffff8194ec3e>] ___sys_sendmsg+0x37e/0x3b0
      >  [<ffffffff8111ef15>] ? trace_hardirqs_on_caller+0x185/0x220
      >  [<ffffffff81af979e>] ? mutex_unlock+0xe/0x10
      >  [<ffffffff819a55ec>] ? netlink_insert+0xbc/0xe0
      >  [<ffffffff819a65e5>] ? netlink_autobind.isra.30+0x125/0x150
      >  [<ffffffff819a6520>] ? netlink_autobind.isra.30+0x60/0x150
      >  [<ffffffff819a84f9>] ? netlink_bind+0x159/0x230
      >  [<ffffffff811f989a>] ? might_fault+0x5a/0xb0
      >  [<ffffffff8194f25e>] ? SYSC_bind+0x7e/0xd0
      >  [<ffffffff8194f8cd>] __sys_sendmsg+0x4d/0x80
      >  [<ffffffff8194f912>] SyS_sendmsg+0x12/0x20
      >  [<ffffffff81afc692>] system_call_fastpath+0x16/0x1b
      
      Fixing this by replacing the mode GFP_KERNEL with GFP_ATOMIC.
      Signed-off-by: NBenjamin Block <bebl@mageta.org>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      793c3b40
  2. 22 8月, 2014 3 次提交
  3. 20 8月, 2014 3 次提交
  4. 17 8月, 2014 2 次提交
  5. 15 8月, 2014 6 次提交
    • T
      netlink: Annotate RCU locking for seq_file walker · 9ce12eb1
      Thomas Graf 提交于
      Silences the following sparse warnings:
      net/netlink/af_netlink.c:2926:21: warning: context imbalance in 'netlink_seq_start' - wrong count at exit
      net/netlink/af_netlink.c:2972:13: warning: context imbalance in 'netlink_seq_stop' - unexpected unlock
      Signed-off-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9ce12eb1
    • N
      tcp: fix ssthresh and undo for consecutive short FRTO episodes · 0c9ab092
      Neal Cardwell 提交于
      Fix TCP FRTO logic so that it always notices when snd_una advances,
      indicating that any RTO after that point will be a new and distinct
      loss episode.
      
      Previously there was a very specific sequence that could cause FRTO to
      fail to notice a new loss episode had started:
      
      (1) RTO timer fires, enter FRTO and retransmit packet 1 in write queue
      (2) receiver ACKs packet 1
      (3) FRTO sends 2 more packets
      (4) RTO timer fires again (should start a new loss episode)
      
      The problem was in step (3) above, where tcp_process_loss() returned
      early (in the spot marked "Step 2.b"), so that it never got to the
      logic to clear icsk_retransmits. Thus icsk_retransmits stayed
      non-zero. Thus in step (4) tcp_enter_loss() would see the non-zero
      icsk_retransmits, decide that this RTO is not a new episode, and
      decide not to cut ssthresh and remember the current cwnd and ssthresh
      for undo.
      
      There were two main consequences to the bug that we have
      observed. First, ssthresh was not decreased in step (4). Second, when
      there was a series of such FRTO (1-4) sequences that happened to be
      followed by an FRTO undo, we would restore the cwnd and ssthresh from
      before the entire series started (instead of the cwnd and ssthresh
      from before the most recent RTO). This could result in cwnd and
      ssthresh being restored to values much bigger than the proper values.
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Fixes: e33099f9 ("tcp: implement RFC5682 F-RTO")
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0c9ab092
    • H
      tcp: don't allow syn packets without timestamps to pass tcp_tw_recycle logic · a26552af
      Hannes Frederic Sowa 提交于
      tcp_tw_recycle heavily relies on tcp timestamps to build a per-host
      ordering of incoming connections and teardowns without the need to
      hold state on a specific quadruple for TCP_TIMEWAIT_LEN, but only for
      the last measured RTO. To do so, we keep the last seen timestamp in a
      per-host indexed data structure and verify if the incoming timestamp
      in a connection request is strictly greater than the saved one during
      last connection teardown. Thus we can verify later on that no old data
      packets will be accepted by the new connection.
      
      During moving a socket to time-wait state we already verify if timestamps
      where seen on a connection. Only if that was the case we let the
      time-wait socket expire after the RTO, otherwise normal TCP_TIMEWAIT_LEN
      will be used. But we don't verify this on incoming SYN packets. If a
      connection teardown was less than TCP_PAWS_MSL seconds in the past we
      cannot guarantee to not accept data packets from an old connection if
      no timestamps are present. We should drop this SYN packet. This patch
      closes this loophole.
      
      Please note, this patch does not make tcp_tw_recycle in any way more
      usable but only adds another safety check:
      Sporadic drops of SYN packets because of reordering in the network or
      in the socket backlog queues can happen. Users behing NAT trying to
      connect to a tcp_tw_recycle enabled server can get caught in blackholes
      and their connection requests may regullary get dropped because hosts
      behind an address translator don't have synchronized tcp timestamp clocks.
      tcp_tw_recycle cannot work if peers don't have tcp timestamps enabled.
      
      In general, use of tcp_tw_recycle is disadvised.
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a26552af
    • N
      tcp: fix tcp_release_cb() to dispatch via address family for mtu_reduced() · 4fab9071
      Neal Cardwell 提交于
      Make sure we use the correct address-family-specific function for
      handling MTU reductions from within tcp_release_cb().
      
      Previously AF_INET6 sockets were incorrectly always using the IPv6
      code path when sometimes they were handling IPv4 traffic and thus had
      an IPv4 dst.
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Diagnosed-by: NWillem de Bruijn <willemb@google.com>
      Fixes: 563d34d0 ("tcp: dont drop MTU reduction indications")
      Reviewed-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4fab9071
    • S
      sit: Fix ipip6_tunnel_lookup device matching criteria · bc8fc7b8
      Shmulik Ladkani 提交于
      As of 4fddbf5d ("sit: strictly restrict incoming traffic to tunnel link device"),
      when looking up a tunnel, tunnel's underlying interface (t->parms.link)
      is verified to match incoming traffic's ingress device.
      
      However the comparison was incorrectly based on skb->dev->iflink.
      
      Instead, dev->ifindex should be used, which correctly represents the
      interface from which the IP stack hands the ipip6 packets.
      
      This allows setting up sit tunnels bound to vlan interfaces (otherwise
      incoming ipip6 traffic on the vlan interface was dropped due to
      ipip6_tunnel_lookup match failure).
      Signed-off-by: NShmulik Ladkani <shmulik.ladkani@gmail.com>
      Acked-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bc8fc7b8
    • A
      tcp: don't use timestamp from repaired skb-s to calculate RTT (v2) · 9d186cac
      Andrey Vagin 提交于
      We don't know right timestamp for repaired skb-s. Wrong RTT estimations
      isn't good, because some congestion modules heavily depends on it.
      
      This patch adds the TCPCB_REPAIRED flag, which is included in
      TCPCB_RETRANS.
      
      Thanks to Eric for the advice how to fix this issue.
      
      This patch fixes the warning:
      [  879.562947] WARNING: CPU: 0 PID: 2825 at net/ipv4/tcp_input.c:3078 tcp_ack+0x11f5/0x1380()
      [  879.567253] CPU: 0 PID: 2825 Comm: socket-tcpbuf-l Not tainted 3.16.0-next-20140811 #1
      [  879.567829] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [  879.568177]  0000000000000000 00000000c532680c ffff880039643d00 ffffffff817aa2d2
      [  879.568776]  0000000000000000 ffff880039643d38 ffffffff8109afbd ffff880039d6ba80
      [  879.569386]  ffff88003a449800 000000002983d6bd 0000000000000000 000000002983d6bc
      [  879.569982] Call Trace:
      [  879.570264]  [<ffffffff817aa2d2>] dump_stack+0x4d/0x66
      [  879.570599]  [<ffffffff8109afbd>] warn_slowpath_common+0x7d/0xa0
      [  879.570935]  [<ffffffff8109b0ea>] warn_slowpath_null+0x1a/0x20
      [  879.571292]  [<ffffffff816d0a05>] tcp_ack+0x11f5/0x1380
      [  879.571614]  [<ffffffff816d10bd>] tcp_rcv_established+0x1ed/0x710
      [  879.571958]  [<ffffffff816dc9da>] tcp_v4_do_rcv+0x10a/0x370
      [  879.572315]  [<ffffffff81657459>] release_sock+0x89/0x1d0
      [  879.572642]  [<ffffffff816c81a0>] do_tcp_setsockopt.isra.36+0x120/0x860
      [  879.573000]  [<ffffffff8110a52e>] ? rcu_read_lock_held+0x6e/0x80
      [  879.573352]  [<ffffffff816c8912>] tcp_setsockopt+0x32/0x40
      [  879.573678]  [<ffffffff81654ac4>] sock_common_setsockopt+0x14/0x20
      [  879.574031]  [<ffffffff816537b0>] SyS_setsockopt+0x80/0xf0
      [  879.574393]  [<ffffffff817b40a9>] system_call_fastpath+0x16/0x1b
      [  879.574730] ---[ end trace a17cbc38eb8c5c00 ]---
      
      v2: moving setting of skb->when for repaired skb-s in tcp_write_xmit,
          where it's set for other skb-s.
      
      Fixes: 431a9124 ("tcp: timestamp SYN+DATA messages")
      Fixes: 740b0f18 ("tcp: switch rtt estimations to usec resolution")
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NAndrey Vagin <avagin@openvz.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9d186cac
  6. 14 8月, 2014 6 次提交
  7. 12 8月, 2014 1 次提交
    • V
      net: Always untag vlan-tagged traffic on input. · 0d5501c1
      Vlad Yasevich 提交于
      Currently the functionality to untag traffic on input resides
      as part of the vlan module and is build only when VLAN support
      is enabled in the kernel.  When VLAN is disabled, the function
      vlan_untag() turns into a stub and doesn't really untag the
      packets.  This seems to create an interesting interaction
      between VMs supporting checksum offloading and some network drivers.
      
      There are some drivers that do not allow the user to change
      tx-vlan-offload feature of the driver.  These drivers also seem
      to assume that any VLAN-tagged traffic they transmit will
      have the vlan information in the vlan_tci and not in the vlan
      header already in the skb.  When transmitting skbs that already
      have tagged data with partial checksum set, the checksum doesn't
      appear to be updated correctly by the card thus resulting in a
      failure to establish TCP connections.
      
      The following is a packet trace taken on the receiver where a
      sender is a VM with a VLAN configued.  The host VM is running on
      doest not have VLAN support and the outging interface on the
      host is tg3:
      10:12:43.503055 52:54:00:ae:42:3f > 28:d2:44:7d:c2:de, ethertype 802.1Q
      (0x8100), length 78: vlan 100, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 27243,
      offset 0, flags [DF], proto TCP (6), length 60)
          10.0.100.1.58545 > 10.0.100.10.ircu-2: Flags [S], cksum 0xdc39 (incorrect
      -> 0x48d9), seq 1069378582, win 29200, options [mss 1460,sackOK,TS val
      4294837885 ecr 0,nop,wscale 7], length 0
      10:12:44.505556 52:54:00:ae:42:3f > 28:d2:44:7d:c2:de, ethertype 802.1Q
      (0x8100), length 78: vlan 100, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 27244,
      offset 0, flags [DF], proto TCP (6), length 60)
          10.0.100.1.58545 > 10.0.100.10.ircu-2: Flags [S], cksum 0xdc39 (incorrect
      -> 0x44ee), seq 1069378582, win 29200, options [mss 1460,sackOK,TS val
      4294838888 ecr 0,nop,wscale 7], length 0
      
      This connection finally times out.
      
      I've only access to the TG3 hardware in this configuration thus have
      only tested this with TG3 driver.  There are a lot of other drivers
      that do not permit user changes to vlan acceleration features, and
      I don't know if they all suffere from a similar issue.
      
      The patch attempt to fix this another way.  It moves the vlan header
      stipping code out of the vlan module and always builds it into the
      kernel network core.  This way, even if vlan is not supported on
      a virtualizatoin host, the virtual machines running on top of such
      host will still work with VLANs enabled.
      
      CC: Patrick McHardy <kaber@trash.net>
      CC: Nithin Nayak Sujir <nsujir@broadcom.com>
      CC: Michael Chan <mchan@broadcom.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: NVladislav Yasevich <vyasevic@redhat.com>
      Acked-by: NJiri Pirko <jiri@resnulli.us>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0d5501c1
  8. 09 8月, 2014 3 次提交
    • I
      libceph: set last_piece in ceph_msg_data_pages_cursor_init() correctly · 5f740d7e
      Ilya Dryomov 提交于
      Determining ->last_piece based on the value of ->page_offset + length
      is incorrect because length here is the length of the entire message.
      ->last_piece set to false even if page array data item length is <=
      PAGE_SIZE, which results in invalid length passed to
      ceph_tcp_{send,recv}page() and causes various asserts to fire.
      
          # cat pages-cursor-init.sh
          #!/bin/bash
          rbd create --size 10 --image-format 2 foo
          FOO_DEV=$(rbd map foo)
          dd if=/dev/urandom of=$FOO_DEV bs=1M &>/dev/null
          rbd snap create foo@snap
          rbd snap protect foo@snap
          rbd clone foo@snap bar
          # rbd_resize calls librbd rbd_resize(), size is in bytes
          ./rbd_resize bar $(((4 << 20) + 512))
          rbd resize --size 10 bar
          BAR_DEV=$(rbd map bar)
          # trigger a 512-byte copyup -- 512-byte page array data item
          dd if=/dev/urandom of=$BAR_DEV bs=1M count=1 seek=5
      
      The problem exists only in ceph_msg_data_pages_cursor_init(),
      ceph_msg_data_pages_advance() does the right thing.  The size_t cast is
      unnecessary.
      
      Cc: stable@vger.kernel.org # 3.10+
      Signed-off-by: NIlya Dryomov <ilya.dryomov@inktank.com>
      Reviewed-by: NSage Weil <sage@redhat.com>
      Reviewed-by: NAlex Elder <elder@linaro.org>
      5f740d7e
    • J
      rtnetlink: fix VF info size · 945a3676
      Jiri Benc 提交于
      Commit 1d8faf48 ("net/core: Add VF link state control") added new
      attribute to IFLA_VF_INFO group in rtnl_fill_ifinfo but did not adjust size
      of the allocated memory in if_nlmsg_size/rtnl_vfinfo_size. As the result, we
      may trigger warnings in rtnl_getlink and similar functions when many VF
      links are enabled, as the information does not fit into the allocated skb.
      
      Fixes: 1d8faf48 ("net/core: Add VF link state control")
      Reported-by: NYulong Pei <ypei@redhat.com>
      Signed-off-by: NJiri Benc <jbenc@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      945a3676
    • N
      ipv4: removed redundant conditional · b7a71b51
      Niv Yehezkel 提交于
      Since fib_lookup cannot return ESRCH no longer,
      checking for this error code is no longer neccesary.
      Signed-off-by: NNiv Yehezkel <executerx@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b7a71b51