1. 01 Jan 2015, 1 commit
  2. 24 Dec 2014, 1 commit
  3. 19 Dec 2014, 2 commits
  4. 17 Dec 2014, 2 commits
  5. 16 Dec 2014, 1 commit
    • gre: fix the inner mac header in nbma tunnel xmit path · 8a0033a9
      Authored by Timo Teräs
      The NBMA GRE tunnels temporarily push a GRE header that contains the
      per-packet NBMA destination onto the skb via header ops early in the
      xmit path. It is later pulled off before the real GRE header is
      constructed.
      
      The inner mac header was thus set differently in the NBMA case: the
      GRE header has been pushed by the neighbor layer, and the mac header
      points to the beginning of the temporary GRE header (set by
      dev_queue_xmit).
      
      Now that the offloads expect the mac header to point to the GRE
      payload, fix the xmit path to:
       - first pull the temporary GRE header away
       - then reset the mac header to point to the GRE payload
      
      This fixes TSO to work again with NBMA tunnels.
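      
      A minimal sketch of the two steps; the helper name here is
      hypothetical, only the skb accessors are standard:
      
        /* Pull off the temporary GRE header pushed by the neighbor layer,
         * then point the mac header at the GRE payload so offloads (TSO)
         * compute inner header lengths correctly. */
        static void gre_nbma_fixup_headers(struct sk_buff *skb, int pull_len)
        {
                skb_pull(skb, pull_len);        /* drop temporary GRE header */
                skb_reset_mac_header(skb);      /* mac header = GRE payload */
        }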
      
      Fixes: 14051f04 ("gre: Use inner mac length when computing tunnel length")
      Signed-off-by: Timo Teräs <timo.teras@iki.fi>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8a0033a9
  6. 12 Dec 2014, 1 commit
    • fib_trie: Fix trie balancing issue if new node pushes down existing node · e962f302
      Authored by Alexander Duyck
      This patch addresses an issue with the level compression of the
      fib_trie.  Specifically, adding a new leaf can trigger a new node to
      be added that takes the place of the old node.  The result is a trie
      where a one-child tnode is on one side and a single leaf is on the
      other, which gives you a very deep trie.  Below is the script I used
      to generate such a trie on dummy0 with a 10.X.X.X family of addresses.
      
        ip link add type dummy
        ipval=184549374
        bit=2
        for i in `seq 1 23`
        do
          ifconfig dummy0:$bit $ipval/8
          ipval=`expr $ipval - $bit`
          bit=`expr $bit \* 2`
        done
        cat /proc/net/fib_triestat
      
      Running the script before the patch:
      
      	Local:
      		Aver depth:     10.82
      		Max depth:      23
      		Leaves:         29
      		Prefixes:       30
      		Internal nodes: 27
      		  1: 26  2: 1
      		Pointers: 56
      	Null ptrs: 1
      	Total size: 5  kB
      
      After applying the patch and repeating:
      
      	Local:
      		Aver depth:     4.72
      		Max depth:      9
      		Leaves:         29
      		Prefixes:       30
      		Internal nodes: 12
      		  1: 3  2: 2  3: 7
      		Pointers: 70
      	Null ptrs: 30
      	Total size: 4  kB
      
      What this fix does is start the rebalance at the newly created tnode
      instead of at the parent tnode.  This way, if there is a gap between
      the parent and the new node, it doesn't prevent the new tnode from
      being coalesced with any pre-existing nodes that may have been pushed
      into one of the new node's child branches.
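      
      A one-line sketch of the idea; the identifiers are assumed (tn being
      the newly created tnode, tp its parent):
      
        -       trie_rebalance(t, tp);          /* rebalance from the parent */
        +       trie_rebalance(t, tn);          /* rebalance from the new tnode */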
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e962f302
  7. 11 Dec 2014, 2 commits
    • net: introduce helper macro for_each_cmsghdr · f95b414e
      Authored by Gu Zheng
      Introduce the helper macro for_each_cmsghdr as a wrapper for
      enumerating the cmsghdrs of a msghdr; this is just a cleanup.
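      
      The assumed definition, a thin wrapper over the standard CMSG_*
      accessors:
      
        #define for_each_cmsghdr(cmsg, msg) \
                for (cmsg = CMSG_FIRSTHDR(msg); \
                     cmsg; \
                     cmsg = CMSG_NXTHDR(msg, cmsg))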
      Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f95b414e
    • mm: memcontrol: lockless page counters · 3e32cb2e
      Authored by Johannes Weiner
      Memory is internally accounted in bytes, using spinlock-protected 64-bit
      counters, even though the smallest accounting delta is a page.  The
      counter interface is also convoluted and does too many things.
      
      Introduce a new lockless word-sized page counter API, then change all
      memory accounting over to it.  The translation from and to bytes then only
      happens when interfacing with userspace.
      
      The removed locking overhead is noticeable when scaling beyond the
      per-cpu charge caches - on a 4-socket machine with 144 threads, the
      following test shows the performance differences of 288 memcgs
      concurrently running a page fault benchmark:
      
      vanilla:
      
         18631648.500498      task-clock (msec)         #  140.643 CPUs utilized            ( +-  0.33% )
               1,380,638      context-switches          #    0.074 K/sec                    ( +-  0.75% )
                  24,390      cpu-migrations            #    0.001 K/sec                    ( +-  8.44% )
           1,843,305,768      page-faults               #    0.099 M/sec                    ( +-  0.00% )
      50,134,994,088,218      cycles                    #    2.691 GHz                      ( +-  0.33% )
         <not supported>      stalled-cycles-frontend
         <not supported>      stalled-cycles-backend
       8,049,712,224,651      instructions              #    0.16  insns per cycle          ( +-  0.04% )
       1,586,970,584,979      branches                  #   85.176 M/sec                    ( +-  0.05% )
           1,724,989,949      branch-misses             #    0.11% of all branches          ( +-  0.48% )
      
           132.474343877 seconds time elapsed                                          ( +-  0.21% )
      
      lockless:
      
         12195979.037525      task-clock (msec)         #  133.480 CPUs utilized            ( +-  0.18% )
                 832,850      context-switches          #    0.068 K/sec                    ( +-  0.54% )
                  15,624      cpu-migrations            #    0.001 K/sec                    ( +- 10.17% )
           1,843,304,774      page-faults               #    0.151 M/sec                    ( +-  0.00% )
      32,811,216,801,141      cycles                    #    2.690 GHz                      ( +-  0.18% )
         <not supported>      stalled-cycles-frontend
         <not supported>      stalled-cycles-backend
       9,999,265,091,727      instructions              #    0.30  insns per cycle          ( +-  0.10% )
       2,076,759,325,203      branches                  #  170.282 M/sec                    ( +-  0.12% )
           1,656,917,214      branch-misses             #    0.08% of all branches          ( +-  0.55% )
      
            91.369330729 seconds time elapsed                                          ( +-  0.45% )
      
      On top of improved scalability, this also gets rid of the icky long long
      types in the very heart of memcg, which is great for 32 bit and also makes
      the code a lot more readable.
      
      Notable differences between the old and new API:
      
      - res_counter_charge() and res_counter_charge_nofail() become
        page_counter_try_charge() and page_counter_charge() resp. to match
        the more common kernel naming scheme of try_do()/do()
      
      - res_counter_uncharge_until() is only ever used to cancel a local
        counter and never to uncharge bigger segments of a hierarchy, so
        it's replaced by the simpler page_counter_cancel()
      
      - res_counter_set_limit() is replaced by page_counter_limit(), which
        expects its callers to serialize against themselves
      
      - res_counter_memparse_write_strategy() is replaced by
        page_counter_memparse(), which rounds down to the nearest page size
        rather than up.  This is more reasonable for explicitly requested
        hard upper limits.
      
      - to keep charging light-weight, page_counter_try_charge() charges
        speculatively, only to roll back if the result exceeds the limit.
        Because of this, a failing bigger charge can temporarily lock out
        smaller charges that would otherwise succeed.  The error is bounded
        by the difference between the smallest and the biggest possible
        charge size, so for memcg, this means that a failing THP charge can
        send base page charges into reclaim up to 2MB (4MB) before the limit
        would have been reached.  This should be acceptable.
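      
      A minimal sketch of the lockless scheme, with an assumed structure
      shape (the real API lives in include/linux/page_counter.h and also
      tracks hierarchy and watermarks):
      
        struct page_counter {
                atomic_long_t   count;
                unsigned long   limit;
        };
      
        static bool page_counter_try_charge(struct page_counter *c,
                                            unsigned long nr_pages)
        {
                long new = atomic_long_add_return(nr_pages, &c->count);
      
                if (new > (long)c->limit) {
                        atomic_long_sub(nr_pages, &c->count);  /* roll back */
                        return false;
                }
                return true;
        }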
      
      [akpm@linux-foundation.org: add includes for WARN_ON_ONCE and memparse]
      [akpm@linux-foundation.org: add includes for WARN_ON_ONCE, memparse, strncmp, and PAGE_SIZE]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3e32cb2e
  8. 10 Dec 2014, 8 commits
    • tcp: fix more NULL deref after prequeue changes · 0f85feae
      Authored by Eric Dumazet
      When I cooked commit c3658e8d ("tcp: fix possible NULL dereference in
      tcp_vX_send_reset()"), I missed other spots where we could dereference
      a NULL skb_dst(skb).
      
      Again, if a socket is provided, we do not need skb_dst() to get a
      pointer to the network namespace: sock_net(sk) is good enough.
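      
      A sketch of the pattern applied at the remaining spots (hedged; the
      variable placement is illustrative):
      
        /* Prefer the socket's netns; only touch skb_dst() when no socket
         * is available, since the dst may be NULL here. */
        struct net *net = sk ? sock_net(sk) : dev_net(skb_dst(skb)->dev);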
      Reported-by: Dann Frazier <dann.frazier@canonical.com>
      Bisected-by: Dann Frazier <dann.frazier@canonical.com>
      Tested-by: Dann Frazier <dann.frazier@canonical.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Fixes: ca777eff ("tcp: remove dst refcount false sharing for prequeue mode")
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0f85feae
    • tcp: refine TSO autosizing · 605ad7f1
      Authored by Eric Dumazet
      Commit 95bd09eb ("tcp: TSO packets automatic sizing") tried to
      control TSO size, but did this at the wrong place (sendmsg() time).
      
      At sendmsg() time, we might have a pessimistic view of the flow rate,
      and we end up building very small skbs (with 2 MSS per skb).
      
      This is bad because:
      
       - It sends small TSO packets even in Slow Start, where the rate
         quickly increases.
       - It tends to make the socket write queue very big, increasing
         tcp_ack() processing time, but also increasing memory needs, not
         necessarily accounted for, as fast clones overhead is currently
         ignored.
       - Lower GRO efficiency and more ACK packets.
      
      Servers with a lot of short-lived connections suffer from this.
      
      Let's instead fill skbs as much as possible (64KB of payload), but
      split them at xmit time, when we have a precise idea of the flow rate.
      skb split is actually quite efficient.
      
      Patch looks bigger than necessary, because TCP Small Queue decision now
      has to take place after the eventual split.
      
      As Neal suggested, introduce a new tcp_tso_autosize() helper, so that
      tcp_tso_should_defer() can be synchronized on the same goal.
      
      Rename tp->xmit_size_goal_segs to tp->gso_segs, as this variable
      contains the number of MSS that we can put in a GSO packet, and is no
      longer related to the autosizing goal.
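      
      An assumed-shape sketch of the helper: size one skb to roughly 1 ms
      worth of payload at the current pacing rate, never below 2 segments.
      
        static u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now)
        {
                u32 bytes = sk->sk_pacing_rate >> 10;   /* ~1 ms of data */
      
                /* Never build an skb bigger than a full-sized GSO packet. */
                bytes = min_t(u32, bytes,
                              sk->sk_gso_max_size - 1 - MAX_TCP_HEADER);
      
                return max_t(u32, bytes / mss_now, 2);  /* at least 2 MSS */
        }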
      
      Tested:
      
      40 ms rtt link
      
      nstat >/dev/null
      netperf -H remote -l -2000000 -- -s 1000000
      nstat | egrep "IpInReceives|IpOutRequests|TcpOutSegs|IpExtOutOctets"
      
      Before patch :
      
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/s
      
       87380 2000000 2000000    0.36         44.22
      IpInReceives                    600                0.0
      IpOutRequests                   599                0.0
      TcpOutSegs                      1397               0.0
      IpExtOutOctets                  2033249            0.0
      
      After patch :
      
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       87380 2000000 2000000    0.36       44.27
      IpInReceives                    221                0.0
      IpOutRequests                   232                0.0
      TcpOutSegs                      1397               0.0
      IpExtOutOctets                  2013953            0.0
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      605ad7f1
    • put iov_iter into msghdr · c0371da6
      Authored by Al Viro
      Note that the code _using_ ->msg_iter at that point will be very
      unhappy with anything other than unshifted iovec-backed iov_iter.
      We still need to convert users to proper primitives.
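      
      The gist of the change, sketched from the commit title (comments
      abbreviated from the header):
      
        struct msghdr {
                void            *msg_name;      /* socket address */
                int              msg_namelen;
                struct iov_iter  msg_iter;      /* replaces msg_iov/msg_iovlen */
                void            *msg_control;   /* ancillary data */
                __kernel_size_t  msg_controllen;
                unsigned int     msg_flags;
        };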
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      c0371da6
    • f4362a2c
    • f69e6d13
    • raw.c: stick msghdr into raw_frag_vec · b61e9dcc
      Authored by Al Viro
      we'll want access to ->msg_iter
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      b61e9dcc
    • tcp_cubic: refine Hystart delay threshold · 42eef7a0
      Authored by Eric Dumazet
      In commit 2b4636a5 ("tcp_cubic: make the delay threshold of HyStart
      less sensitive"), HYSTART_DELAY_MIN was changed to 4 ms.
      
      The remaining problem is that using delay_min + (delay_min/16) as the
      threshold is too sensitive.
      
      6.25 % of variation is too small for RTTs above 60 ms, which are not
      uncommon.
      
      Let's use 12.5 % instead (delay_min + (delay_min/8)).
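      
      The change in one line (a hedged sketch of the comparison in
      hystart_update()):
      
        -       if (curr_rtt > delay_min + (delay_min >> 4))    /* +6.25 % */
        +       if (curr_rtt > delay_min + (delay_min >> 3))    /* +12.5 % */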
      
      Tested:
       80 ms RTT between peers, FQ/pacing packet scheduler on sender.
       10 bulk transfers of 10 seconds:
      
      nstat >/dev/null
      for i in `seq 1 10`
       do
         netperf -H remote -- -k THROUGHPUT | grep THROUGHPUT
       done
      nstat | grep Hystart
      
      With the 6.25 % threshold:
      
      THROUGHPUT=20.66
      THROUGHPUT=249.38
      THROUGHPUT=254.10
      THROUGHPUT=14.94
      THROUGHPUT=251.92
      THROUGHPUT=237.73
      THROUGHPUT=19.18
      THROUGHPUT=252.89
      THROUGHPUT=21.32
      THROUGHPUT=15.58
      TcpExtTCPHystartTrainDetect     2                  0.0
      TcpExtTCPHystartTrainCwnd       4756               0.0
      TcpExtTCPHystartDelayDetect     5                  0.0
      TcpExtTCPHystartDelayCwnd       180                0.0
      
      With the 12.5 % threshold:
      THROUGHPUT=251.09
      THROUGHPUT=247.46
      THROUGHPUT=250.92
      THROUGHPUT=248.91
      THROUGHPUT=250.88
      THROUGHPUT=249.84
      THROUGHPUT=250.51
      THROUGHPUT=254.15
      THROUGHPUT=250.62
      THROUGHPUT=250.89
      TcpExtTCPHystartTrainDetect     1                  0.0
      TcpExtTCPHystartTrainCwnd       3175               0.0
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Tested-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      42eef7a0
    • tcp_cubic: add SNMP counters to track how effective is Hystart · 6e3a8a93
      Authored by Eric Dumazet
      When deploying FQ pacing, one thing we noticed is that CUBIC Hystart
      triggers too soon.
      
      Having SNMP counters that give an idea of how often the various
      Hystart methods trigger is useful prior to any modifications.
      
      This patch adds SNMP counters tracking how many times the "ack train"
      or "delay" based Hystart triggers, and the cumulative sum of cwnd at
      the time Hystart decided to end SS (Slow Start).
      
      myhost:~# nstat -a | grep Hystart
      TcpExtTCPHystartTrainDetect     9                  0.0
      TcpExtTCPHystartTrainCwnd       20650              0.0
      TcpExtTCPHystartDelayDetect     10                 0.0
      TcpExtTCPHystartDelayCwnd       360                0.0
      
      ->
       Train detection was triggered 9 times, with an average cwnd of
       20650/9 = 2294.
       Delay detection was triggered 10 times, with an average cwnd of
       360/10 = 36.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6e3a8a93
  9. 09 Dec 2014, 3 commits
    • udp: Neaten and reduce size of compute_score functions · 60c04aec
      Authored by Joe Perches
      The compute_score functions are a bit difficult to read.
      
      Neaten them a bit to reduce object sizes and make them a
      bit more intelligible.
      
      Return early to avoid indentation and avoid unnecessary
      initializations.
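      
      An illustrative early-return shape (hypothetical checks; the real
      code compares addresses and ports):
      
        static int compute_score(struct sock *sk, struct net *net,
                                 unsigned short hnum)
        {
                if (!net_eq(sock_net(sk), net))
                        return -1;      /* bail out early ...          */
                if (udp_sk(sk)->udp_port_hash != hnum)
                        return -1;      /* ... instead of nesting ifs  */
                return 1;               /* base score; bonuses follow  */
        }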
      
      (allyesconfig, but w/ -O2 and no profiling)
      
      $ size net/ipv[46]/udp.o.*
         text    data     bss     dec     hex filename
        28680    1184      25   29889    74c1 net/ipv4/udp.o.new
        28756    1184      25   29965    750d net/ipv4/udp.o.old
        17600    1010       2   18612    48b4 net/ipv6/udp.o.new
        17632    1010       2   18644    48d4 net/ipv6/udp.o.old
      Signed-off-by: Joe Perches <joe@perches.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      60c04aec
    • net-timestamp: allow reading recv cmsg on errqueue with origin tstamp · 829ae9d6
      Authored by Willem de Bruijn
      Allow reading of timestamps and cmsg at the same time on all relevant
      socket families. One use is to correlate timestamps with egress
      device, by asking for cmsg IP_PKTINFO.
      
      On AF_INET sockets, call the relevant function (ip_cmsg_recv). To
      avoid changing legacy expectations, only do so if the caller sets a
      new timestamping flag, SOF_TIMESTAMPING_OPT_CMSG.
      
      On AF_INET6 sockets, IPV6_PKTINFO and all other recv cmsgs are
      already returned for all origins. The only change is to set the
      ifindex, which is not initialized for all error origins.
      
      In both cases, only generate the pktinfo message if an ifindex is
      known. This is not the case for ACK timestamps.
      
      The difference between the protocol families is probably a historical
      accident as a result of the different conditions for generating cmsg
      in the relevant ip(v6)_recv_error function:
      
      ipv4:        if (serr->ee.ee_origin == SO_EE_ORIGIN_ICMP) {
      ipv6:        if (serr->ee.ee_origin != SO_EE_ORIGIN_LOCAL) {
      
      At one time, this was the same test, bar the ICMP/ICMP6
      distinction. This is no longer true.
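      
      A hedged userspace sketch of the new knob (only the flag name comes
      from this patch; the rest is standard SO_TIMESTAMPING setup):
      
        /* Ask for TX timestamps and, with OPT_CMSG, for the usual recv
         * cmsgs (e.g. IP_PKTINFO) alongside them on the error queue. */
        int val = SOF_TIMESTAMPING_TX_SOFTWARE |
                  SOF_TIMESTAMPING_SOFTWARE |
                  SOF_TIMESTAMPING_OPT_CMSG;
        setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));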
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      
      ----
      
      Changes
        v1 -> v2
          large rewrite
          - integrate with existing pktinfo cmsg generation code
          - on ipv4: only send with new flag, to maintain legacy behavior
          - on ipv6: send at most a single pktinfo cmsg
          - on ipv6: initialize fields if not yet initialized
      
      The recv cmsg interfaces are also relevant to the discussion of
      whether looping packet headers is problematic. For v6, cmsgs that
      identify many headers are already returned. This patch expands
      that to v4. If it sounds reasonable, I will follow with patches
      
      1. request timestamps without payload with SOF_TIMESTAMPING_OPT_TSONLY
         (http://patchwork.ozlabs.org/patch/366967/)
      2. sysctl to conditionally drop all timestamps that have payload or
         cmsg from users without CAP_NET_RAW.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      829ae9d6
    • ipv4: warn once on passing AF_INET6 socket to ip_recv_error · 7ce875e5
      Authored by Willem de Bruijn
      One line change, in response to catching an occurrence of this bug.
      See also fix f4713a3d ("net-timestamp: make tcp_recvmsg call ...")
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7ce875e5
  10. 06 Dec 2014, 1 commit
  11. 27 Nov 2014, 3 commits
  12. 26 Nov 2014, 1 commit
  13. 25 Nov 2014, 1 commit
    • net/ping: handle protocol mismatching scenario · 91a0b603
      Authored by Jane Zhou
      ping_lookup() may return a wrong sock if the sk_buff's and sock's
      protocols don't match. For example, the sk_buff's protocol is
      ETH_P_IPV6 but the sock's sk_family is AF_INET; in that case, if
      sk->sk_bound_dev_if is zero, a wrong sock will be returned.
      The fix is to "continue" the search; if nothing matches, return NULL.
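      
      A hedged sketch of the mismatch test; the helper is hypothetical,
      the real patch open-codes the checks inside the lookup loop:
      
        /* True if this sock's family can service the packet's protocol;
         * mismatching socks are skipped ("continue") by the caller. */
        static bool ping_sock_matches_proto(const struct sock *sk,
                                            const struct sk_buff *skb)
        {
                if (skb->protocol == htons(ETH_P_IPV4))
                        return sk->sk_family == AF_INET;
                if (skb->protocol == htons(ETH_P_IPV6))
                        return sk->sk_family == AF_INET6;
                return false;
        }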
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: netdev@vger.kernel.org
      Cc: stable@vger.kernel.org
      Signed-off-by: Jane Zhou <a17711@motorola.com>
      Signed-off-by: Yiwei Zhao <gbjc64@motorola.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      91a0b603
  14. 24 Nov 2014, 4 commits
    • new helper: memcpy_to_msg() · 7eab8d9e
      Authored by Al Viro
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      7eab8d9e
    • new helper: memcpy_from_msg() · 6ce8e9ce
      Authored by Al Viro
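      The assumed shape of this helper and its memcpy_to_msg() sibling at
      introduction time (thin wrappers over the iovec copy primitives,
      prior to the iov_iter conversion):
      
        static inline int memcpy_from_msg(void *data, struct msghdr *msg,
                                          int len)
        {
                return memcpy_fromiovec(data, msg->msg_iov, len);
        }
      
        static inline int memcpy_to_msg(struct msghdr *msg, void *data,
                                        int len)
        {
                return memcpy_toiovec(msg->msg_iov, data, len);
        }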
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      6ce8e9ce
    • new helper: skb_copy_and_csum_datagram_msg() · 227158db
      Authored by Al Viro
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      227158db
    • ip_tunnel: the lack of vti_link_ops' dellink() cause kernel panic · 20ea60ca
      Authored by lucien
      Currently vti_link_ops does not set .dellink, so for the fb tunnel
      device (ip_vti0) the net_device will be removed, as the default
      .dellink is unregister_netdevice_queue, but the tunnel stays in the
      tunnel list. Then, if we add a new vti tunnel, in ip_tunnel_find():
      
              hlist_for_each_entry_rcu(t, head, hash_node) {
                      if (local == t->parms.iph.saddr &&
                          remote == t->parms.iph.daddr &&
                          link == t->parms.link &&
      ==>                 type == t->dev->type &&
                          ip_tunnel_key_match(&t->parms, flags, key))
                              break;
              }
      
      the panic will happen, because the dev of ip_tunnel *t is NULL:
      [ 3835.072977] IP: [<ffffffffa04103fd>] ip_tunnel_find+0x9d/0xc0 [ip_tunnel]
      [ 3835.073008] PGD b2c21067 PUD b7277067 PMD 0
      [ 3835.073008] Oops: 0000 [#1] SMP
      .....
      [ 3835.073008] Stack:
      [ 3835.073008]  ffff8800b72d77f0 ffffffffa0411924 ffff8800bb956000 ffff8800b72d78e0
      [ 3835.073008]  ffff8800b72d78a0 0000000000000000 ffffffffa040d100 ffff8800b72d7858
      [ 3835.073008]  ffffffffa040b2e3 0000000000000000 0000000000000000 0000000000000000
      [ 3835.073008] Call Trace:
      [ 3835.073008]  [<ffffffffa0411924>] ip_tunnel_newlink+0x64/0x160 [ip_tunnel]
      [ 3835.073008]  [<ffffffffa040b2e3>] vti_newlink+0x43/0x70 [ip_vti]
      [ 3835.073008]  [<ffffffff8150d4da>] rtnl_newlink+0x4fa/0x5f0
      [ 3835.073008]  [<ffffffff812f68bb>] ? nla_strlcpy+0x5b/0x70
      [ 3835.073008]  [<ffffffff81508fb0>] ? rtnl_link_ops_get+0x40/0x60
      [ 3835.073008]  [<ffffffff8150d11f>] ? rtnl_newlink+0x13f/0x5f0
      [ 3835.073008]  [<ffffffff81509cf4>] rtnetlink_rcv_msg+0xa4/0x270
      [ 3835.073008]  [<ffffffff8126adf5>] ? sock_has_perm+0x75/0x90
      [ 3835.073008]  [<ffffffff81509c50>] ? rtnetlink_rcv+0x30/0x30
      [ 3835.073008]  [<ffffffff81529e39>] netlink_rcv_skb+0xa9/0xc0
      [ 3835.073008]  [<ffffffff81509c48>] rtnetlink_rcv+0x28/0x30
      ....
      
      modprobe ip_vti
      ip link del ip_vti0 type vti
      ip link add ip_vti0 type vti
      rmmod ip_vti
      
      Do that one or more times and the kernel will panic.
      
      Fix it by assigning ip_tunnel_dellink to vti_link_ops' .dellink, in
      which we skip the unregister of the fb tunnel device. Do the same for
      ip6_vti.
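      
      A sketch of the fix as described (other ops fields elided):
      
        static struct rtnl_link_ops vti_link_ops __read_mostly = {
                .kind           = "vti",
                .newlink        = vti_newlink,
                .dellink        = ip_tunnel_dellink,    /* the missing hook */
        };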
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: Cong Wang <cwang@twopensource.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      20ea60ca
  15. 22 Nov 2014, 3 commits
    • tcp: Restore RFC5961-compliant behavior for SYN packets · 0c228e83
      Authored by Calvin Owens
      Commit c3ae62af ("tcp: should drop incoming frames without ACK
      flag set") was created to mitigate a security vulnerability in which a
      local attacker is able to inject data into locally-opened sockets by
      using TCP protocol statistics in procfs to quickly find the correct
      sequence number.
      
      This broke the RFC5961 requirement to send a challenge ACK in response
      to spurious RST packets, which was subsequently fixed by commit
      7b514a88 ("tcp: accept RST without ACK flag").
      
      Unfortunately, the RFC5961 requirement that spurious SYN packets be
      handled in a similar manner remains broken.
      
      RFC5961 section 4 states that:
      
         ... the handling of the SYN in the synchronized state SHOULD be
         performed as follows:
      
         1) If the SYN bit is set, irrespective of the sequence number, TCP
            MUST send an ACK (also referred to as challenge ACK) to the remote
            peer:
      
            <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>
      
            After sending the acknowledgment, TCP MUST drop the unacceptable
            segment and stop processing further.
      
         By sending an ACK, the remote peer is challenged to confirm the loss
         of the previous connection and the request to start a new connection.
         A legitimate peer, after restart, would not have a TCB in the
         synchronized state.  Thus, when the ACK arrives, the peer should send
         a RST segment back with the sequence number derived from the ACK
         field that caused the RST.
      
         This RST will confirm that the remote peer has indeed closed the
         previous connection.  Upon receipt of a valid RST, the local TCP
         endpoint MUST terminate its connection.  The local TCP endpoint
         should then rely on SYN retransmission from the remote end to
         re-establish the connection.
      
      This patch lets SYN packets through the discard added in c3ae62af,
      so that spurious SYN packets are properly dealt with as per the RFC.
      
      The challenge ACK is sent unconditionally and is rate-limited, so the
      original vulnerability is not reintroduced by this patch.
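      
      The change itself, as a hedged sketch of the discard test in
      tcp_validate_incoming():
      
        -       if (!th->ack && !th->rst)
        +       if (!th->ack && !th->rst && !th->syn)
                        goto discard;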
      Signed-off-by: Calvin Owens <calvinowens@fb.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0c228e83
    • vlan: introduce *vlan_hwaccel_push_inside helpers · 5968250c
      Authored by Jiri Pirko
      Use them to push skb->vlan_tci into the payload and avoid code
      duplication.
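      
      An assumed-shape sketch of the non-underscored helper (the __
      variant performs the actual tag insertion):
      
        static inline struct sk_buff *vlan_hwaccel_push_inside(struct sk_buff *skb)
        {
                if (vlan_tx_tag_present(skb))
                        skb = __vlan_hwaccel_push_inside(skb);
                return skb;
        }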
      Signed-off-by: Jiri Pirko <jiri@resnulli.us>
      Acked-by: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5968250c
    • vlan: rename __vlan_put_tag to vlan_insert_tag_set_proto · 62749e2c
      Authored by Jiri Pirko
      The name fits better. Plus, __vlan_insert_tag is going to be
      introduced later on.
      Signed-off-by: Jiri Pirko <jiri@resnulli.us>
      Acked-by: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      62749e2c
  16. 20 Nov 2014, 1 commit
    • tcp: make connect() mem charging friendly · 355a901e
      Authored by Eric Dumazet
      While working on sk_forward_alloc problems reported by Denys
      Fedoryshchenko, we found that tcp connect() (and fastopen) do not call
      sk_wmem_schedule() for SYN packet (and/or SYN/DATA packet), so
      sk_forward_alloc is negative while connect is in progress.
      
      We can fix this by calling the regular sk_stream_alloc_skb() both for
      the SYN packet (in tcp_connect()) and the syn_data packet in
      tcp_send_syn_data().
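      
      A hedged sketch of the allocation pattern (signature as of this era):
      
        /* Allocate through the accounted path so sk_wmem_schedule() runs
         * and sk_forward_alloc stays non-negative during connect(). */
        buff = sk_stream_alloc_skb(sk, 0, sk->sk_allocation);
        if (unlikely(!buff))
                return -ENOBUFS;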
      
      Then, tcp_send_syn_data() can avoid copying syn_data, as we can
      simply manipulate syn_data->cb[] to remove the SYN flag (and
      increment the seq).
      
      Instead of open coding memcpy_fromiovecend(), simply use this helper.
      
      This leaves clean fast-clone skbs in the socket write queue.
      
      This was tested against our fastopen packetdrill tests.
      Reported-by: Denys Fedoryshchenko <nuclearcat@nuclearcat.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      355a901e
  17. 19 Nov 2014, 1 commit
  18. 17 Nov 2014, 2 commits
    • ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs · feb91a02
      Authored by Daniel Borkmann
      It has been reported that generating an MLD listener report on
      devices with large MTUs (e.g. 9000) and a high number of IPv6
      addresses can trigger a skb_over_panic():
      
      skbuff: skb_over_panic: text:ffffffff80612a5d len:3776 put:20
      head:ffff88046d751000 data:ffff88046d751010 tail:0xed0 end:0xec0
      dev:port1
       ------------[ cut here ]------------
      kernel BUG at net/core/skbuff.c:100!
      invalid opcode: 0000 [#1] SMP
      Modules linked in: ixgbe(O)
      CPU: 3 PID: 0 Comm: swapper/3 Tainted: G O 3.14.23+ #4
      [...]
      Call Trace:
       <IRQ>
       [<ffffffff80578226>] ? skb_put+0x3a/0x3b
       [<ffffffff80612a5d>] ? add_grhead+0x45/0x8e
       [<ffffffff80612e3a>] ? add_grec+0x394/0x3d4
       [<ffffffff80613222>] ? mld_ifc_timer_expire+0x195/0x20d
       [<ffffffff8061308d>] ? mld_dad_timer_expire+0x45/0x45
       [<ffffffff80255b5d>] ? call_timer_fn.isra.29+0x12/0x68
       [<ffffffff80255d16>] ? run_timer_softirq+0x163/0x182
       [<ffffffff80250e6f>] ? __do_softirq+0xe0/0x21d
       [<ffffffff8025112b>] ? irq_exit+0x4e/0xd3
       [<ffffffff802214bb>] ? smp_apic_timer_interrupt+0x3b/0x46
       [<ffffffff8063f10a>] ? apic_timer_interrupt+0x6a/0x70
      
      mld_newpack() skb allocations are usually requested with dev->mtu
      in size; since commit 72e09ad1 ("ipv6: avoid high order allocations")
      we have changed the limit in order to be less likely to fail.
      
      However, in MLD/IGMP code, we have some rather ugly AVAILABLE(skb)
      macros, which determine if we may end up doing an skb_put() for
      adding another record. To avoid possible fragmentation, we check
      the skb's tailroom as skb->dev->mtu - skb->len, which is a wrong
      assumption as the actual max allocation size can be much smaller.
      
      The IGMP case doesn't have this issue as commit 57e1ab6e
      ("igmp: refine skb allocations") stores the allocation size in
      the cb[].
      
      Set a reserved_tailroom to make it fit into the MTU and use the
      skb_availroom() helper instead. This also allows us to get rid of
      igmp_skb_size().
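      
      The core of the fix in mld_newpack(), sketched (believed to match the
      upstream change; mtu is the clamped allocation size):
      
        /* Reserve everything past the intended MTU budget so that
         * skb_availroom() reflects what may actually be appended. */
        skb->reserved_tailroom = skb_end_offset(skb) -
                                 min(mtu, skb_end_offset(skb));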
      Reported-by: Wei Liu <lw1a2.jing@gmail.com>
      Fixes: 72e09ad1 ("ipv6: avoid high order allocations")
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: David L Stevens <david.stevens@oracle.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      feb91a02
    • ipv4: Fix incorrect error code when adding an unreachable route · 49dd18ba
      Authored by Panu Matilainen
      Trying to add an unreachable route incorrectly returns -ESRCH if
      custom FIB rules are present:
      
      [root@localhost ~]# ip route add 74.125.31.199 dev eth0 via 1.2.3.4
      RTNETLINK answers: Network is unreachable
      [root@localhost ~]# ip rule add to 55.66.77.88 table 200
      [root@localhost ~]# ip route add 74.125.31.199 dev eth0 via 1.2.3.4
      RTNETLINK answers: No such process
      [root@localhost ~]#
      
      Commit 83886b6b ("[NET]: Change "not found"
      return value for rule lookup") changed fib_rules_lookup()
      to use -ESRCH as a "not found" code internally, but for user space it
      should be translated into -ENETUNREACH. Handle the translation centrally in
      ipv4-specific fib_lookup(), leaving the DECnet case alone.
      
      On a related note, commit b7a71b51
      ("ipv4: removed redundant conditional") removed a similar translation from
      ip_route_input_slow() prematurely AIUI.
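      
      A hedged sketch of the centralized translation in ipv4's fib_lookup()
      (surrounding code abbreviated):
      
        err = __fib_lookup(net, flp, res);
        if (err == -ESRCH)
                err = -ENETUNREACH;     /* "no rule matched", for userspace */
        return err;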
      
      Fixes: b7a71b51 ("ipv4: removed redundant conditional")
      Signed-off-by: Panu Matilainen <pmatilai@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      49dd18ba
  19. 14 Nov 2014, 2 commits
    • tcp: limit GSO packets to half cwnd · d649a7a8
      Authored by Eric Dumazet
      In DC world, GSO packets initially cooked by tcp_sendmsg() are usually
      big, as sk_pacing_rate is high.
      
      When network is congested, cwnd can be smaller than the GSO packets
      found in socket write queue. tcp_write_xmit() splits GSO packets
      using the available cwnd, and we end up sending a single GSO packet,
      consuming all available cwnd.
      
      With GRO aggregation on the receiver, we might handle a single GRO
      packet, sending back a single ACK.
      
      1) This single ACK might be lost.
         TLP or RTO are then forced to attempt a retransmit.
      2) This ACK releases a full cwnd, and the sender sends another big
         GSO packet, in a ping-pong mode.
      
      This behavior does not fill the pipes in the best way, because of
      scheduling artifacts.
      
      Make sure we always have at least two GSO packets in flight.
      
      This allows us to safely increase GRO efficiency without risking
      spurious retransmits.
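      
      A hedged one-line sketch of the clamp, applied where the TSO segment
      budget is computed:
      
        /* Cap a GSO packet at half the congestion window so that at
         * least two packets can be in flight. */
        segs = min_t(u32, segs, tp->snd_cwnd >> 1);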
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d649a7a8
    • FOU: Fix no return statement warning for !CONFIG_NET_FOU_IP_TUNNELS · 882288c0
      Authored by Thomas Graf
      net/ipv4/fou.c: In function ‘ip_tunnel_encap_del_fou_ops’:
      net/ipv4/fou.c:861:1: warning: no return statement in function returning non-void [-Wreturn-type]
      
      Fixes: a8c5f90f ("ip_tunnel: Ops registration for secondary encap (fou, gue)")
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      882288c0