1. 01 Jan, 2015 9 commits
    • fib_trie: Update meaning of pos to represent unchecked bits · e9b44019
      Alexander Duyck committed
      This change moves the pos value to the other side of the "bits" field.  By
      doing this it actually simplifies a significant amount of code in the trie.
      
      For example, when halving a tree we know that the bit lost exists at
      oldnode->pos, and if we inflate the tree the new bit being added is at
      tn->pos.  Previously, to find those bits you would have to subtract pos and
      bits from the keylength or start with a value of (1 << 31) and then shift
      that.
      
      There are a number of spots throughout the code that benefit from this.  In
      the case of the hot-path searches the main advantage is that we can drop 2
      or more operations from the search path as we no longer need to compute the
      value for the index to be shifted by and can instead just use the raw pos
      value.
      
      In addition the tkey_extract_bits is now defunct and can be replaced by
      get_index since the two operations were doing the same thing, but now
      get_index does it much more quickly as it is only an xor and shift versus a
      pair of shifts and a subtraction.
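
      The difference between the two index computations can be sketched in
      user-space C. This is only an illustration: the struct, the function
      names and the 32-bit key width are assumptions, and the sketch relies
      on the node key having its low (pos + bits) bits cleared, as described
      in the 64c9b6fb entry further down.

```c
#include <assert.h>
#include <stdint.h>

#define KEYLENGTH_SKETCH 32

/* Hypothetical stand-in for the trie node fields the message mentions. */
struct node_sketch {
	uint32_t key;   /* prefix, low (pos + bits) bits assumed cleared */
	uint8_t  pos;   /* new meaning: number of unchecked bits below the index */
	uint8_t  bits;  /* width of the child index */
};

/* Old style: offset counted from the top bit, a pair of shifts and a subtraction. */
static uint32_t extract_bits_old_sketch(uint32_t key, unsigned offset, unsigned bits)
{
	assert(bits > 0 && offset + bits <= KEYLENGTH_SKETCH);
	return (uint32_t)(key << offset) >> (KEYLENGTH_SKETCH - bits);
}

/* New style: one xor and one shift by the raw pos value. */
static uint32_t get_index_sketch(uint32_t key, const struct node_sketch *n)
{
	assert(n->pos < KEYLENGTH_SKETCH);
	return (key ^ n->key) >> n->pos;
}
```

      With the xor form, a result that does not fit in `bits` bits also
      signals that the key does not share the node's prefix, so a single
      value can serve as both child index and prefix check.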
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fib_trie: Optimize fib_table_insert · 836a0123
      Alexander Duyck committed
      This patch updates the fib_table_insert function to take advantage of the
      changes made to improve the performance of fib_table_lookup.  As a result
      the code should be smaller and run faster than the original.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fib_trie: Optimize fib_find_node · 939afb06
      Alexander Duyck committed
      This patch makes use of the same features I made use of for
      fib_table_lookup to streamline fib_find_node.  The resultant code should be
      smaller and run faster than the original.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fib_trie: Optimize fib_table_lookup to avoid wasting time on loops/variables · 9f9e636d
      Alexander Duyck committed
      This patch is meant to reduce the complexity of fib_table_lookup by reducing
      the number of variables to the bare minimum while still keeping the same if
      not improved functionality versus the original.
      
      Most of this change was started off by the desire to rid the function of
      chopped_off and current_prefix_length as they actually added very little to
      the function since they only applied when computing the cindex.  I was able
      to replace them mostly with just a check for the prefix match.  As long as
      the prefix between the key and the node being tested was the same we know
      we can search the tnode fully versus just testing cindex 0.
      
      The second portion of the change ended up being a massive reordering.
      Originally the calls to check_leaf were up near the start of the loop, and
      the backtracking and descending into lower levels of tnodes was later.  This
      didn't make much sense as the structure of the tree means the leaves are
      always the last thing to be tested.  As such I reordered things so that we
      instead have a loop that will delve into the tree and only exit when we
      have either found a leaf or we have exhausted the tree.  The advantage of
      rearranging things like this is that we can fully inline check_leaf since
      there is now only one reference to it in the function.
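
      The "descend until a leaf is found or the tree is exhausted" shape of
      the loop can be sketched as follows. This is a simplified user-space
      illustration, not the kernel function: all names are hypothetical,
      backtracking for longest-prefix matching is left out, and the index is
      computed with the xor-and-shift form from the e9b44019 entry, which is
      an assumption rather than what this particular commit necessarily did.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical unified node: bits == 0 marks a leaf in this sketch. */
struct tnode_sketch {
	uint32_t key;
	uint8_t  pos;
	uint8_t  bits;
	struct tnode_sketch **child;   /* 1 << bits slots when bits != 0 */
};

/* Walk down the trie; return the matching leaf or NULL. */
static struct tnode_sketch *descend_sketch(struct tnode_sketch *n, uint32_t key)
{
	while (n) {
		uint32_t index;

		assert(n->pos < 32);
		index = (key ^ n->key) >> n->pos;

		/* Any bit above the index width means the prefix differs. */
		if (index >> n->bits)
			return NULL;   /* a full lookup would backtrack here */

		if (n->bits == 0)
			return n;      /* found a leaf */

		n = n->child[index];
	}
	return NULL;                   /* tree exhausted */
}
```

      Because the leaf test sits at the end of each iteration, it appears
      exactly once, which is the property the message credits for being able
      to inline check_leaf.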
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fib_trie: Merge leaf into tnode · adaf9816
      Alexander Duyck committed
      This change makes it so that leaf and tnode are the same struct.  As a
      result there is no need for rt_trie_node anymore since everything can be
      merged into tnode.
      
      On 32b systems this results in the leaf being 4 bytes larger, however I
      don't know if that is really an issue as this and an earlier patch that
      added bits & pos have increased the size from 20 to 28.  If I am not
      mistaken slub/slab allocate on power of 2 sizes so 20 was likely being
      rounded up to 32 anyway.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fib_trie: Merge tnode_free and leaf_free into node_free · 37fd30f2
      Alexander Duyck committed
      Both the leaf and the tnode had an rcu_head in them, but they had them in
      slightly different places.  Since we now have them in the same spot, and
      know that any node with bits == 0 is a leaf and the rest are either vmalloc
      or kmalloc tnodes depending on the value of bits, it is easy to combine
      the functions and reduce overhead.
      
      In addition I have taken advantage of the rcu_head pointer to go ahead and
      put together a simple linked list instead of using the tnode pointer as
      this way we can merge either type of structure for freeing.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fib_trie: Make leaf and tnode more uniform · 64c9b6fb
      Alexander Duyck committed
      This change makes some fundamental changes to the way leaves and tnodes are
      constructed.  The big differences are:
      1.  Leaves now populate pos and bits indicating their full key size.
      2.  Trie nodes now mask out their lower bits to be consistent with the leaf.
      3.  Both structures have been reordered so that rt_trie_node now consists
          of a much larger region including the pos, bits, and rcu portions of
          the tnode structure.
      
      On 32b systems this will result in the leaf being 4B larger as the pos and
      bits values were added to a hole created by the key as it was only 4B in
      length.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fib_trie: Update usage stats to be percpu instead of global variables · 8274a97a
      Alexander Duyck committed
      The trie usage stats were being shared by all threads that were
      calling fib_table_lookup.  As a result, when multiple threads were
      performing lookups simultaneously the trie would begin to cache bounce
      between those threads.
      
      In order to prevent this I have updated the usage stats to use a set of
      percpu variables.  By doing this we should be able to avoid the cache
      bouncing and still make use of these stats.
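
      The idea can be sketched in user-space C as one counter slot per CPU,
      each on its own cache line, summed only when the stats are read. The
      CPU count, the 64-byte line size and every name below are assumptions
      for illustration; the kernel uses its own percpu allocator rather than
      a static array.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS_SKETCH    4    /* assumed CPU count for the sketch */
#define CACHE_LINE_SKETCH 64   /* assumed cache line size in bytes */

/* One slot per CPU; the alignment keeps slots on separate cache lines. */
struct use_stats_sketch {
	_Alignas(CACHE_LINE_SKETCH) uint64_t gets;
	uint64_t backtrack;
};

static struct use_stats_sketch stats_sketch[NR_CPUS_SKETCH];

/* Hot path: each CPU touches only its own slot, so no line is shared. */
static void count_lookup_sketch(unsigned cpu)
{
	assert(cpu < NR_CPUS_SKETCH);
	stats_sketch[cpu].gets++;
}

/* Slow path: sum every slot when the stats are reported. */
static uint64_t total_gets_sketch(void)
{
	uint64_t sum = 0;

	for (unsigned cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		sum += stats_sketch[cpu].gets;
	return sum;
}
```

      The cost moves from the lookup path, which runs constantly, to the
      reporting path, which runs only when someone reads the statistics.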
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gre: allow live address change · bec94d43
      stephen hemminger committed
      The GRE tap device supports Ethernet over GRE, but doesn't
      care about the source address of the tunnel, therefore it
      can be changed without bringing the device down.
      Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 24 Dec, 2014 1 commit
  3. 19 Dec, 2014 2 commits
  4. 17 Dec, 2014 2 commits
  5. 16 Dec, 2014 1 commit
    • gre: fix the inner mac header in nbma tunnel xmit path · 8a0033a9
      Timo Teräs committed
      The NBMA GRE tunnels temporarily push a GRE header that contains the
      per-packet NBMA destination onto the skb via header ops early in the xmit
      path. It is then pulled before the real GRE header is constructed.
      
      The inner mac was thus set differently in the nbma case: the GRE header
      has been pushed by the neighbor layer, and the mac header points to the
      beginning of the temporary gre header (set by dev_queue_xmit).
      
      Now that the offloads expect the mac header to point to the gre payload,
      fix the xmit path to:
       - pull first the temporary gre header away
       - and reset mac header to point to gre payload
      
      This fixes tso to work again with nbma tunnels.
      
      Fixes: 14051f04 ("gre: Use inner mac length when computing tunnel length")
      Signed-off-by: Timo Teräs <timo.teras@iki.fi>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 12 Dec, 2014 1 commit
    • fib_trie: Fix trie balancing issue if new node pushes down existing node · e962f302
      Alexander Duyck committed
      This patch addresses an issue with the level compression of the fib_trie.
      Specifically in the case of adding a new leaf that triggers a new node to
      be added that takes the place of the old node.  The result is a trie where
      the 1 child tnode is on one side and one leaf is on the other which gives
      you a very deep trie.  Below is the script I used to generate a trie on
      dummy0 with a 10.X.X.X family of addresses.
      
        ip link add type dummy
        ipval=184549374
        bit=2
        for i in `seq 1 23`
        do
          ifconfig dummy0:$bit $ipval/8
          ipval=`expr $ipval - $bit`
          bit=`expr $bit \* 2`
        done
        cat /proc/net/fib_triestat
      
      Running the script before the patch:
      
      	Local:
      		Aver depth:     10.82
      		Max depth:      23
      		Leaves:         29
      		Prefixes:       30
      		Internal nodes: 27
      		  1: 26  2: 1
      		Pointers: 56
      	Null ptrs: 1
      	Total size: 5  kB
      
      After applying the patch and repeating:
      
      	Local:
      		Aver depth:     4.72
      		Max depth:      9
      		Leaves:         29
      		Prefixes:       30
      		Internal nodes: 12
      		  1: 3  2: 2  3: 7
      		Pointers: 70
      	Null ptrs: 30
      	Total size: 4  kB
      
      What this fix does is start the rebalance at the newly created tnode
      instead of at the parent tnode.  This way if there is a gap between the
      parent and the new node it doesn't prevent the new tnode from being
      coalesced with any pre-existing nodes that may have been pushed into one
      of the new node's child branches.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 11 Dec, 2014 2 commits
    • net: introduce helper macro for_each_cmsghdr · f95b414e
      Gu Zheng committed
      Introduce the helper macro for_each_cmsghdr as a wrapper for enumerating
      the cmsghdr entries of a msghdr. This is just a cleanup.
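
      A user-space analogue of such a wrapper can be built from the standard
      CMSG_FIRSTHDR and CMSG_NXTHDR macros. The macro and function names
      below carry a _sketch suffix because they are illustrative assumptions,
      not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical user-space counterpart of the helper: iterate every
 * control message header attached to a msghdr. */
#define for_each_cmsghdr_sketch(cmsg, msg)          \
	for ((cmsg) = CMSG_FIRSTHDR(msg);           \
	     (cmsg);                                \
	     (cmsg) = CMSG_NXTHDR((msg), (cmsg)))

/* Example user: count the control messages in msg. */
static size_t count_cmsgs_sketch(struct msghdr *msg)
{
	struct cmsghdr *cmsg;
	size_t n = 0;

	assert(msg != NULL);
	for_each_cmsghdr_sketch(cmsg, msg)
		n++;
	return n;
}
```

      Callers then write the loop body only once, instead of repeating the
      three-part for statement at every site that walks control messages.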
      Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mm: memcontrol: lockless page counters · 3e32cb2e
      Johannes Weiner committed
      Memory is internally accounted in bytes, using spinlock-protected 64-bit
      counters, even though the smallest accounting delta is a page.  The
      counter interface is also convoluted and does too many things.
      
      Introduce a new lockless word-sized page counter API, then change all
      memory accounting over to it.  The translation from and to bytes then only
      happens when interfacing with userspace.
      
      The removed locking overhead is noticeable when scaling beyond the per-cpu
      charge caches - on a 4-socket machine with 144 threads, the following test
      shows the performance differences of 288 memcgs concurrently running a
      page fault benchmark:
      
      vanilla:
      
         18631648.500498      task-clock (msec)         #  140.643 CPUs utilized            ( +-  0.33% )
               1,380,638      context-switches          #    0.074 K/sec                    ( +-  0.75% )
                  24,390      cpu-migrations            #    0.001 K/sec                    ( +-  8.44% )
           1,843,305,768      page-faults               #    0.099 M/sec                    ( +-  0.00% )
      50,134,994,088,218      cycles                    #    2.691 GHz                      ( +-  0.33% )
         <not supported>      stalled-cycles-frontend
         <not supported>      stalled-cycles-backend
       8,049,712,224,651      instructions              #    0.16  insns per cycle          ( +-  0.04% )
       1,586,970,584,979      branches                  #   85.176 M/sec                    ( +-  0.05% )
           1,724,989,949      branch-misses             #    0.11% of all branches          ( +-  0.48% )
      
           132.474343877 seconds time elapsed                                          ( +-  0.21% )
      
      lockless:
      
         12195979.037525      task-clock (msec)         #  133.480 CPUs utilized            ( +-  0.18% )
                 832,850      context-switches          #    0.068 K/sec                    ( +-  0.54% )
                  15,624      cpu-migrations            #    0.001 K/sec                    ( +- 10.17% )
           1,843,304,774      page-faults               #    0.151 M/sec                    ( +-  0.00% )
      32,811,216,801,141      cycles                    #    2.690 GHz                      ( +-  0.18% )
         <not supported>      stalled-cycles-frontend
         <not supported>      stalled-cycles-backend
       9,999,265,091,727      instructions              #    0.30  insns per cycle          ( +-  0.10% )
       2,076,759,325,203      branches                  #  170.282 M/sec                    ( +-  0.12% )
           1,656,917,214      branch-misses             #    0.08% of all branches          ( +-  0.55% )
      
            91.369330729 seconds time elapsed                                          ( +-  0.45% )
      
      On top of improved scalability, this also gets rid of the icky long long
      types in the very heart of memcg, which is great for 32 bit and also makes
      the code a lot more readable.
      
      Notable differences between the old and new API:
      
      - res_counter_charge() and res_counter_charge_nofail() become
        page_counter_try_charge() and page_counter_charge() resp. to match
        the more common kernel naming scheme of try_do()/do()
      
      - res_counter_uncharge_until() is only ever used to cancel a local
        counter and never to uncharge bigger segments of a hierarchy, so
        it's replaced by the simpler page_counter_cancel()
      
      - res_counter_set_limit() is replaced by page_counter_limit(), which
        expects its callers to serialize against themselves
      
      - res_counter_memparse_write_strategy() is replaced by
        page_counter_limit(), which rounds down to the nearest page size -
        rather than up.  This is more reasonable for explicitly requested
        hard upper limits.
      
      - to keep charging light-weight, page_counter_try_charge() charges
        speculatively, only to roll back if the result exceeds the limit.
        Because of this, a failing bigger charge can temporarily lock out
        smaller charges that would otherwise succeed.  The error is bounded
        to the difference between the smallest and the biggest possible
        charge size, so for memcg, this means that a failing THP charge can
        send base page charges into reclaim up to 2MB (4MB) before the limit
        would have been reached.  This should be acceptable.
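
      The speculative charge described in the last bullet can be sketched
      with C11 atomics. This is a flat, single-counter illustration under
      stated assumptions: the struct and function names are hypothetical,
      and the counter hierarchy of the real API is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical word-sized counter, in pages rather than bytes. */
struct page_counter_sketch {
	atomic_long count;  /* pages currently charged */
	long limit;         /* hard limit in pages */
};

static void counter_init_sketch(struct page_counter_sketch *c, long limit)
{
	atomic_init(&c->count, 0);
	c->limit = limit;
}

/* Charge first, check afterwards, roll back if the limit was exceeded. */
static bool try_charge_sketch(struct page_counter_sketch *c, long nr_pages)
{
	long new_count;

	assert(nr_pages > 0);
	new_count = atomic_fetch_add(&c->count, nr_pages) + nr_pages;
	if (new_count > c->limit) {
		atomic_fetch_sub(&c->count, nr_pages);
		return false;
	}
	return true;
}

/* Convert a byte limit from userspace to pages, rounding down. */
static long limit_pages_sketch(unsigned long long bytes, unsigned long page_size)
{
	assert(page_size > 0);
	return (long)(bytes / page_size);
}
```

      Between the add and the rollback the counter briefly overstates usage,
      which is the window in which a failing large charge can make a smaller
      concurrent charge fail as well.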
      
      [akpm@linux-foundation.org: add includes for WARN_ON_ONCE and memparse]
      [akpm@linux-foundation.org: add includes for WARN_ON_ONCE, memparse, strncmp, and PAGE_SIZE]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 10 Dec, 2014 8 commits
    • tcp: fix more NULL deref after prequeue changes · 0f85feae
      Eric Dumazet committed
      When I cooked commit c3658e8d ("tcp: fix possible NULL dereference in
      tcp_vX_send_reset()") I missed other spots where we could deref a NULL
      skb_dst(skb).
      
      Again, if a socket is provided, we do not need skb_dst() to get a
      pointer to the network namespace: sock_net(sk) is good enough.
      Reported-by: Dann Frazier <dann.frazier@canonical.com>
      Bisected-by: Dann Frazier <dann.frazier@canonical.com>
      Tested-by: Dann Frazier <dann.frazier@canonical.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Fixes: ca777eff ("tcp: remove dst refcount false sharing for prequeue mode")
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: refine TSO autosizing · 605ad7f1
      Eric Dumazet committed
      Commit 95bd09eb ("tcp: TSO packets automatic sizing") tried to
      control TSO size, but did this at the wrong place (sendmsg() time)
      
      At sendmsg() time, we might have a pessimistic view of flow rate,
      and we end up building very small skbs (with 2 MSS per skb).
      
      This is bad because :
      
       - It sends small TSO packets even in Slow Start where rate quickly
         increases.
       - It tends to make socket write queue very big, increasing tcp_ack()
         processing time, but also increasing memory needs, not necessarily
         accounted for, as fast clones overhead is currently ignored.
       - Lower GRO efficiency and more ACK packets.
      
      Servers with a lot of short-lived connections suffer from this.
      
      Let's instead fill skbs as much as possible (64KB of payload), but split
      them at xmit time, when we have a precise idea of the flow rate.
      skb split is actually quite efficient.
      
      Patch looks bigger than necessary, because TCP Small Queue decision now
      has to take place after the eventual split.
      
      As Neal suggested, introduce a new tcp_tso_autosize() helper, so that
      tcp_tso_should_defer() can be synchronized on same goal.
      
      Rename tp->xmit_size_goal_segs to tp->gso_segs, as this variable
      contains number of mss that we can put in GSO packet, and is not
      related to the autosizing goal anymore.
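
      A rate-based segment count computed at xmit time might look like the
      sketch below. It is not the kernel helper: the function name, the
      parameters, the budget of roughly one millisecond of data (rate >> 10)
      and the floor and ceiling arguments are all assumptions made for
      illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: how many MSS-sized segments to put in one TSO
 * packet, given the current pacing rate in bytes per second. */
static uint32_t tso_autosize_sketch(uint64_t pacing_rate, uint32_t mss,
				    uint32_t max_payload,
				    uint32_t min_segs, uint32_t max_segs)
{
	uint64_t budget;
	uint64_t segs;

	assert(mss > 0 && min_segs <= max_segs);

	budget = pacing_rate >> 10;      /* about 1 ms worth of bytes (assumed) */
	if (budget > max_payload)
		budget = max_payload;    /* never exceed one full skb payload */

	segs = budget / mss;
	if (segs < min_segs)
		segs = min_segs;         /* keep some batching on slow flows */
	if (segs > max_segs)
		segs = max_segs;
	return (uint32_t)segs;
}
```

      Evaluating this when the packet is about to leave, rather than at
      sendmsg() time, is what lets slow flows send small packets while fast
      flows keep full-sized ones.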
      
      Tested:
      
      40 ms rtt link
      
      nstat >/dev/null
      netperf -H remote -l -2000000 -- -s 1000000
      nstat | egrep "IpInReceives|IpOutRequests|TcpOutSegs|IpExtOutOctets"
      
      Before patch :
      
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/s
      
       87380 2000000 2000000    0.36         44.22
      IpInReceives                    600                0.0
      IpOutRequests                   599                0.0
      TcpOutSegs                      1397               0.0
      IpExtOutOctets                  2033249            0.0
      
      After patch :
      
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       87380 2000000 2000000    0.36       44.27
      IpInReceives                    221                0.0
      IpOutRequests                   232                0.0
      TcpOutSegs                      1397               0.0
      IpExtOutOctets                  2013953            0.0
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • put iov_iter into msghdr · c0371da6
      Al Viro committed
      Note that the code _using_ ->msg_iter at that point will be very
      unhappy with anything other than unshifted iovec-backed iov_iter.
      We still need to convert users to proper primitives.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • f4362a2c
    • f69e6d13
    • raw.c: stick msghdr into raw_frag_vec · b61e9dcc
      Al Viro committed
      We'll want access to ->msg_iter.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • tcp_cubic: refine Hystart delay threshold · 42eef7a0
      Eric Dumazet committed
      In commit 2b4636a5 ("tcp_cubic: make the delay threshold of HyStart
      less sensitive"), HYSTART_DELAY_MIN was changed to 4 ms.
      
      The remaining problem is that using delay_min + (delay_min/16) as the
      threshold is too sensitive.
      
      A 6.25 % variation is too small for RTTs above 60 ms, which are not
      uncommon.
      
      Let's use 12.5 % instead (delay_min + (delay_min/8)).
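
      The effect of the two thresholds can be worked through with a small
      sketch. The function names are hypothetical, and any clamping the
      kernel applies to the added slack is left out; only the
      delay_min + (delay_min >> shift) arithmetic from the message is shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Threshold above which the delay-based exit from slow start triggers.
 * shift = 4 gives delay_min/16 (6.25 %), shift = 3 gives delay_min/8 (12.5 %). */
static uint32_t hystart_delay_thresh_sketch(uint32_t delay_min, unsigned shift)
{
	assert(shift < 32);
	return delay_min + (delay_min >> shift);
}

/* True when the current RTT sample would end slow start. */
static bool hystart_delay_exit_sketch(uint32_t curr_rtt, uint32_t delay_min,
				      unsigned shift)
{
	return curr_rtt > hystart_delay_thresh_sketch(delay_min, shift);
}
```

      For an 80 ms minimum RTT, the old rule trips above 85 ms and the new
      one above 90 ms, so a sample of 87 ms ends slow start only under the
      old threshold.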
      
      Tested:
       80 ms RTT between peers, FQ/pacing packet scheduler on sender.
       10 bulk transfers of 10 seconds :
      
      nstat >/dev/null
      for i in `seq 1 10`
       do
         netperf -H remote -- -k THROUGHPUT | grep THROUGHPUT
       done
      nstat | grep Hystart
      
      With the 6.25 % threshold :
      
      THROUGHPUT=20.66
      THROUGHPUT=249.38
      THROUGHPUT=254.10
      THROUGHPUT=14.94
      THROUGHPUT=251.92
      THROUGHPUT=237.73
      THROUGHPUT=19.18
      THROUGHPUT=252.89
      THROUGHPUT=21.32
      THROUGHPUT=15.58
      TcpExtTCPHystartTrainDetect     2                  0.0
      TcpExtTCPHystartTrainCwnd       4756               0.0
      TcpExtTCPHystartDelayDetect     5                  0.0
      TcpExtTCPHystartDelayCwnd       180                0.0
      
      With the 12.5 % threshold
      THROUGHPUT=251.09
      THROUGHPUT=247.46
      THROUGHPUT=250.92
      THROUGHPUT=248.91
      THROUGHPUT=250.88
      THROUGHPUT=249.84
      THROUGHPUT=250.51
      THROUGHPUT=254.15
      THROUGHPUT=250.62
      THROUGHPUT=250.89
      TcpExtTCPHystartTrainDetect     1                  0.0
      TcpExtTCPHystartTrainCwnd       3175               0.0
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Tested-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp_cubic: add SNMP counters to track how effective is Hystart · 6e3a8a93
      Eric Dumazet committed
      When deploying FQ pacing, one thing we noticed is that CUBIC Hystart
      triggers too soon.
      
      Having SNMP counters to have an idea of how often the various Hystart
      methods trigger is useful prior to any modifications.
      
      This patch adds SNMP counters tracking how many times "ack train" or
      "Delay" based Hystart triggers, and the cumulative sum of cwnd at the time
      Hystart decided to end SS (Slow Start).
      
      myhost:~# nstat -a | grep Hystart
      TcpExtTCPHystartTrainDetect     9                  0.0
      TcpExtTCPHystartTrainCwnd       20650              0.0
      TcpExtTCPHystartDelayDetect     10                 0.0
      TcpExtTCPHystartDelayCwnd       360                0.0
      
      ->
       Train detection was triggered 9 times, and average cwnd was
       20650/9=2294,
       Delay detection was triggered 10 times and average cwnd was 36
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 09 Dec, 2014 3 commits
    • udp: Neaten and reduce size of compute_score functions · 60c04aec
      Joe Perches committed
      The compute_score functions are a bit difficult to read.
      
      Neaten them a bit to reduce object sizes and make them a
      bit more intelligible.
      
      Return early to avoid indentation and avoid unnecessary
      initializations.
      
      (allyesconfig, but w/ -O2 and no profiling)
      
      $ size net/ipv[46]/udp.o.*
         text    data     bss     dec     hex filename
        28680    1184      25   29889    74c1 net/ipv4/udp.o.new
        28756    1184      25   29965    750d net/ipv4/udp.o.old
        17600    1010       2   18612    48b4 net/ipv6/udp.o.new
        17632    1010       2   18644    48d4 net/ipv6/udp.o.old
      Signed-off-by: Joe Perches <joe@perches.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net-timestamp: allow reading recv cmsg on errqueue with origin tstamp · 829ae9d6
      Willem de Bruijn committed
      Allow reading of timestamps and cmsg at the same time on all relevant
      socket families. One use is to correlate timestamps with egress
      device, by asking for cmsg IP_PKTINFO.
      
      On AF_INET sockets, call the relevant function (ip_cmsg_recv). To
      avoid changing legacy expectations, only do so if the caller sets a
      new timestamping flag SOF_TIMESTAMPING_OPT_CMSG.
      
      On AF_INET6 sockets, IPV6_PKTINFO and all other recv cmsg are already
      returned for all origins. The only change is to set ifindex, which is
      not initialized for all error origins.
      
      In both cases, only generate the pktinfo message if an ifindex is
      known. This is not the case for ACK timestamps.
      
      The difference between the protocol families is probably a historical
      accident as a result of the different conditions for generating cmsg
      in the relevant ip(v6)_recv_error function:
      
      ipv4:        if (serr->ee.ee_origin == SO_EE_ORIGIN_ICMP) {
      ipv6:        if (serr->ee.ee_origin != SO_EE_ORIGIN_LOCAL) {
      
      At one time, this was the same test bar for the ICMP/ICMP6
      distinction. This is no longer true.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      
      ----
      
      Changes
        v1 -> v2
          large rewrite
          - integrate with existing pktinfo cmsg generation code
          - on ipv4: only send with new flag, to maintain legacy behavior
          - on ipv6: send at most a single pktinfo cmsg
          - on ipv6: initialize fields if not yet initialized
      
      The recv cmsg interfaces are also relevant to the discussion of
      whether looping packet headers is problematic. For v6, cmsgs that
      identify many headers are already returned. This patch expands
      that to v4. If it sounds reasonable, I will follow with patches
      
      1. request timestamps without payload with SOF_TIMESTAMPING_OPT_TSONLY
         (http://patchwork.ozlabs.org/patch/366967/)
      2. sysctl to conditionally drop all timestamps that have payload or
         cmsg from users without CAP_NET_RAW.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: warn once on passing AF_INET6 socket to ip_recv_error · 7ce875e5
      Willem de Bruijn committed
      One line change, in response to catching an occurrence of this bug.
      See also fix f4713a3d ("net-timestamp: make tcp_recvmsg call ...")
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 06 Dec, 2014 1 commit
  11. 27 Nov, 2014 3 commits
  12. 26 Nov, 2014 1 commit
  13. 25 Nov, 2014 1 commit
    • net/ping: handle protocol mismatching scenario · 91a0b603
      Jane Zhou committed
      ping_lookup() may return a wrong sock if sk_buff's and sock's protocols
      don't match. For example, if sk_buff's protocol is ETH_P_IPV6 but sock's
      sk_family is AF_INET, and sk->sk_bound_dev_if is zero, a wrong
      sock will be returned.
      The fix is to "continue" the search; if nothing matches, return NULL.
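
      The control flow of the fix can be sketched over a plain array of
      sockets. The types, field names and the array-based table are
      hypothetical simplifications; only the "skip on family mismatch, fall
      through to NULL" logic reflects the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum family_sketch { FAMILY_INET_SKETCH, FAMILY_INET6_SKETCH };

/* Hypothetical, heavily reduced socket record. */
struct sock_sketch {
	enum family_sketch family;
	uint16_t ident;      /* ICMP echo identifier the socket is bound to */
	int bound_dev_if;    /* 0 means not bound to a device */
};

/* Return the first socket matching the packet, or NULL if none does. */
static const struct sock_sketch *
ping_lookup_sketch(const struct sock_sketch *socks, size_t n,
		   enum family_sketch pkt_family, uint16_t ident, int dif)
{
	assert(socks != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		const struct sock_sketch *sk = &socks[i];

		if (sk->ident != ident)
			continue;
		if (sk->family != pkt_family)
			continue;    /* protocol mismatch: keep searching */
		if (sk->bound_dev_if && sk->bound_dev_if != dif)
			continue;
		return sk;
	}
	return NULL;                 /* nothing matched */
}
```

      Without the family check, an AF_INET socket with the same identifier
      and no bound device would be returned for an IPv6 packet, which is the
      wrong-sock case the message describes.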
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: netdev@vger.kernel.org
      Cc: stable@vger.kernel.org
      Signed-off-by: Jane Zhou <a17711@motorola.com>
      Signed-off-by: Yiwei Zhao <gbjc64@motorola.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 24 Nov, 2014 4 commits
    • new helper: memcpy_to_msg() · 7eab8d9e
      Al Viro committed
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • new helper: memcpy_from_msg() · 6ce8e9ce
      Al Viro committed
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • new helper: skb_copy_and_csum_datagram_msg() · 227158db
      Al Viro committed
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • ip_tunnel: the lack of vti_link_ops' dellink() cause kernel panic · 20ea60ca
      lucien committed
      Currently vti_link_ops does not set .dellink. For the fb tunnel device
      (ip_vti0), the net_device is therefore removed by the default .dellink,
      unregister_netdevice_queue, but the tunnel stays in the tunnel list.
      If we then add a new vti tunnel, in ip_tunnel_find():
      
              hlist_for_each_entry_rcu(t, head, hash_node) {
                      if (local == t->parms.iph.saddr &&
                          remote == t->parms.iph.daddr &&
                          link == t->parms.link &&
      ==>                 type == t->dev->type &&
                          ip_tunnel_key_match(&t->parms, flags, key))
                              break;
              }
      
      the panic happens, because dev of ip_tunnel *t is NULL:
      [ 3835.072977] IP: [<ffffffffa04103fd>] ip_tunnel_find+0x9d/0xc0 [ip_tunnel]
      [ 3835.073008] PGD b2c21067 PUD b7277067 PMD 0
      [ 3835.073008] Oops: 0000 [#1] SMP
      .....
      [ 3835.073008] Stack:
      [ 3835.073008]  ffff8800b72d77f0 ffffffffa0411924 ffff8800bb956000 ffff8800b72d78e0
      [ 3835.073008]  ffff8800b72d78a0 0000000000000000 ffffffffa040d100 ffff8800b72d7858
      [ 3835.073008]  ffffffffa040b2e3 0000000000000000 0000000000000000 0000000000000000
      [ 3835.073008] Call Trace:
      [ 3835.073008]  [<ffffffffa0411924>] ip_tunnel_newlink+0x64/0x160 [ip_tunnel]
      [ 3835.073008]  [<ffffffffa040b2e3>] vti_newlink+0x43/0x70 [ip_vti]
      [ 3835.073008]  [<ffffffff8150d4da>] rtnl_newlink+0x4fa/0x5f0
      [ 3835.073008]  [<ffffffff812f68bb>] ? nla_strlcpy+0x5b/0x70
      [ 3835.073008]  [<ffffffff81508fb0>] ? rtnl_link_ops_get+0x40/0x60
      [ 3835.073008]  [<ffffffff8150d11f>] ? rtnl_newlink+0x13f/0x5f0
      [ 3835.073008]  [<ffffffff81509cf4>] rtnetlink_rcv_msg+0xa4/0x270
      [ 3835.073008]  [<ffffffff8126adf5>] ? sock_has_perm+0x75/0x90
      [ 3835.073008]  [<ffffffff81509c50>] ? rtnetlink_rcv+0x30/0x30
      [ 3835.073008]  [<ffffffff81529e39>] netlink_rcv_skb+0xa9/0xc0
      [ 3835.073008]  [<ffffffff81509c48>] rtnetlink_rcv+0x28/0x30
      ....
      
      modprobe ip_vti
      ip link del ip_vti0 type vti
      ip link add ip_vti0 type vti
      rmmod ip_vti
      
      Do that one or more times and the kernel will panic.
      
      Fix it by assigning ip_tunnel_dellink to vti_link_ops' dellink, in
      which we skip the unregister of the fb tunnel device. Do the same on ip6_vti.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: Cong Wang <cwang@twopensource.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 22 Nov, 2014 1 commit
    • tcp: Restore RFC5961-compliant behavior for SYN packets · 0c228e83
      Calvin Owens committed
      Commit c3ae62af ("tcp: should drop incoming frames without ACK
      flag set") was created to mitigate a security vulnerability in which a
      local attacker is able to inject data into locally-opened sockets by
      using TCP protocol statistics in procfs to quickly find the correct
      sequence number.
      
      This broke the RFC5961 requirement to send a challenge ACK in response
      to spurious RST packets, which was subsequently fixed by commit
      7b514a88 ("tcp: accept RST without ACK flag").
      
      Unfortunately, the RFC5961 requirement that spurious SYN packets be
      handled in a similar manner remains broken.
      
      RFC5961 section 4 states that:
      
         ... the handling of the SYN in the synchronized state SHOULD be
         performed as follows:
      
         1) If the SYN bit is set, irrespective of the sequence number, TCP
            MUST send an ACK (also referred to as challenge ACK) to the remote
            peer:
      
            <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>
      
            After sending the acknowledgment, TCP MUST drop the unacceptable
            segment and stop processing further.
      
         By sending an ACK, the remote peer is challenged to confirm the loss
         of the previous connection and the request to start a new connection.
         A legitimate peer, after restart, would not have a TCB in the
         synchronized state.  Thus, when the ACK arrives, the peer should send
         a RST segment back with the sequence number derived from the ACK
         field that caused the RST.
      
         This RST will confirm that the remote peer has indeed closed the
         previous connection.  Upon receipt of a valid RST, the local TCP
         endpoint MUST terminate its connection.  The local TCP endpoint
         should then rely on SYN retransmission from the remote end to
         re-establish the connection.
      
      This patch lets SYN packets through the discard added in c3ae62af,
      so that spurious SYN packets are properly dealt with as per the RFC.
      
      The challenge ACK is sent unconditionally and is rate-limited, so the
      original vulnerability is not reintroduced by this patch.
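
      The resulting decision for a segment arriving on a synchronized
      connection can be sketched as below. This is an illustration only:
      the enum, struct and function are hypothetical, the rate limiter is
      reduced to a boolean supplied by the caller, and sequence-number
      validation is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum seg_action_sketch {
	SEG_DISCARD,        /* drop silently */
	SEG_CHALLENGE_ACK,  /* send a challenge ACK, then drop */
	SEG_CONTINUE        /* continue with normal processing */
};

struct tcp_flags_sketch {
	bool syn, ack, rst;
};

/* Decide what to do with a segment in a synchronized state. */
static enum seg_action_sketch
seg_action_sketch(const struct tcp_flags_sketch *f, bool challenge_ack_allowed)
{
	assert(f != NULL);

	/* The discard from c3ae62af, relaxed for RST (7b514a88) and,
	 * with this patch, for SYN as well. */
	if (!f->ack && !f->rst && !f->syn)
		return SEG_DISCARD;

	/* RFC 5961 section 4: a SYN is answered with a challenge ACK,
	 * subject to rate limiting, and is never processed further. */
	if (f->syn)
		return challenge_ack_allowed ? SEG_CHALLENGE_ACK : SEG_DISCARD;

	return SEG_CONTINUE;
}
```

      A bare SYN is never accepted as data in this sketch, which matches the
      message's point that letting it past the discard does not reopen the
      injection path.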
      Signed-off-by: Calvin Owens <calvinowens@fb.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>