1. 06 Jan, 2015 (8 commits)
    • net: tcp: add RTAX_CC_ALGO fib handling · ea697639
      Daniel Borkmann committed
      This patch adds the minimum necessary for the RTAX_CC_ALGO congestion
      control metric to be set up and dumped back to user space.
      
      While the internal representation of RTAX_CC_ALGO is a u32 key, we
      avoid exposing this implementation detail to user space: the netlink
      attribute exchanged with user space carries the actual congestion
      control algorithm name, just as in the setsockopt(2) API, in order
      to allow for maximum flexibility, even for 3rd party modules.
      
      It is a bit unfortunate that RTAX_QUICKACK used up a whole RTAX slot
      when it should have been stored in RTAX_FEATURES instead. We first
      thought about reusing it for the congestion control key, but that
      brings more complications and confusion than it is worth.
      
      Joint work with Florian Westphal.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
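
      As a rough illustration of the dump side described above, the u32
      key stored in the metrics is translated back to a name before being
      put into the netlink message. tcp_ca_get_name_by_key() is a helper
      this series introduces; the wrapper below and its exact signature
      are illustrative only:

        static int nla_put_cc_algo(struct sk_buff *skb, u32 key)
        {
                char buf[TCP_CA_NAME_MAX];
                char *name = tcp_ca_get_name_by_key(key, buf);

                if (!name)
                        return -EINVAL;  /* unknown or unloaded key */
                return nla_put_string(skb, RTAX_CC_ALGO, name);
        }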
    • net: tcp: add key management to congestion control · c5c6a8ab
      Daniel Borkmann committed
      This patch adds the necessary infrastructure to the congestion
      control framework for later per route congestion control support.

      For per route congestion control, our aim is to store a unique u32
      key identifier in the dst metrics, which can then be mapped to a
      tcp_congestion_ops struct. We argue that an RTAX key entry is the
      simplest and most generic way to manage this, and it also keeps the
      memory footprint of dst entries lower on 64 bit than storing a
      pointer directly, for example. Having a unique key id also decouples
      actual TCP congestion control module management from the FIB layer,
      i.e. we don't have to care about expensive module refcounting
      inside the FIB at this point.
      
      We first thought of using an IDR store for the realization, which
      takes over dynamic assignment of unused key space and also performs
      the key to pointer mapping under RCU. While doing so, we stumbled
      upon the issue that, due to the nature of dynamic key distribution,
      excessive module loads and unloads can, in arguably rare cases,
      lead to reuse of previously assigned key space. Stale keys in the
      dst metric would then be reassigned to a different congestion
      control algorithm, which might lead to unexpected behaviour. One
      way to resolve this would have been to walk the FIBs on the
      genuinely rare occasion of a module unload and reset the metric
      keys for each FIB in each netns, but that is very costly.
      
      Therefore, we argue a better solution is to reuse the unique
      congestion control algorithm name member and map it into u32 key
      space through jhash (see the sketch below). For that, we split the
      flags attribute (it currently uses only 2 bits anyway) into two u32
      members, flags and key, so that we keep the structure within a
      2-cacheline boundary on x86_64 and can cache the precalculated key
      at registration time for the fast path. On average we might expect
      2 to 4 modules loaded, worst case perhaps 15, so the probability of
      a key collision is extremely low, and the mapping is guaranteed
      collision-free on LE/BE for all in-tree modules. Overall this
      results in much simpler code, without the overhead of an IDR.
      Because the mapping is deterministic, modules can now be unloaded:
      the congestion control algorithm for an unloaded key falls back to
      the default one, and when the module is reloaded it transparently
      switches back to the expected algorithm.
      
      Joint work with Florian Westphal.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
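
      A minimal sketch of the name-to-key mapping described above, with a
      hypothetical helper name (the series exposes this functionality via
      tcp_ca_get_key_by_name(); the exact jhash arguments here are
      illustrative):

        #include <linux/jhash.h>
        #include <linux/string.h>

        /* Deterministic: the same name always yields the same key, which
         * is what lets a stale key resolve back to its algorithm after a
         * module unload/reload cycle. */
        static u32 cc_key_from_name(const char *name)
        {
                return jhash(name, strlen(name), 0);
        }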
    • net: fib6: convert cfg metric to u32 outside of table write lock · e715b6d3
      Florian Westphal committed
      Do the nla validation earlier, outside the write lock.
      
      This is needed by a follow-up patch, which must be able to call
      request_module() (which can sleep) if needed.
      
      Joint work with Daniel Borkmann.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
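
      A generic sketch of the resulting pattern, with hypothetical
      identifiers (not the actual fib6 code): anything that may sleep is
      done before the BH write lock is taken:

        static int table_insert(struct my_table *tbl, struct nlattr *metrics)
        {
                u32 mx[RTAX_MAX];
                int err;

                err = parse_metrics(metrics, mx);  /* may sleep: no lock held */
                if (err)
                        return err;

                write_lock_bh(&tbl->lock);  /* no sleeping from here on */
                err = insert_entry(tbl, mx);
                write_unlock_bh(&tbl->lock);
                return err;
        }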
    • ip: Add offset parameter to ip_cmsg_recv · ad6f939a
      Tom Herbert committed
      Add an ip_cmsg_recv_offset function which takes an offset argument
      indicating the starting offset in the skb where data is being
      received from. This will be useful for UDP when providing the
      checksum to user space.

      ip_cmsg_recv becomes an inline call to ip_cmsg_recv_offset with an
      offset of zero, as sketched below.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
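
      The wrapper relationship described above, roughly (the signature is
      approximate for this era of the tree):

        static inline void ip_cmsg_recv(struct msghdr *msg, struct sk_buff *skb)
        {
                ip_cmsg_recv_offset(msg, skb, 0);
        }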
    • ip: Add offset parameter to ip_cmsg_recv · 5961de9f
      Tom Herbert committed
      Add an ip_cmsg_recv_offset function which takes an offset argument
      indicating the starting offset in the skb where data is being
      received from. This will be useful for UDP when providing the
      checksum to user space.

      ip_cmsg_recv becomes an inline call to ip_cmsg_recv_offset with an
      offset of zero.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip: IP cmsg cleanup · c44d13d6
      Tom Herbert committed
      Move the IP_CMSG_* constants from ip_sockglue.c to inet_sock.h so that
      they can be referenced in other source files.
      
      Restructure ip_cmsg_recv so that it no longer walks the flags by
      shifting one bit per iteration, but instead tests each flag with a
      bitwise 'and'. This eliminates both the shift and a conditional
      check per flag; see the sketch below.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
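
      A simplified sketch of the 'and'-based dispatch (the ip_cmsg_recv_*
      handlers exist in ip_sockglue.c, but the demo function itself is
      hypothetical and abridged):

        static void ip_cmsg_recv_demo(struct msghdr *msg, struct sk_buff *skb,
                                      unsigned int flags)
        {
                /* Test each IP_CMSG_* bit directly instead of shifting
                 * through every flag position one at a time. */
                if (flags & IP_CMSG_PKTINFO)
                        ip_cmsg_recv_pktinfo(msg, skb);
                if (flags & IP_CMSG_TTL)
                        ip_cmsg_recv_ttl(msg, skb);
                if (flags & IP_CMSG_TOS)
                        ip_cmsg_recv_tos(msg, skb);
        }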
    • ip: Move checksum convert defines to inet · 224d019c
      Tom Herbert committed
      Move convert_csum from udp_sock to inet_sock. This opens up the
      possibility of using checksum conversion for other types of sockets,
      and also allows checksum conversion to be enabled from the inet
      layer (which we'll want when enabling the IP_CHECKSUM cmsg).
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
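
      A hedged sketch of the inet-level accessors such a move enables
      (the patch adds helpers along these lines, but the names and exact
      placement are unverified here):

        static inline void inet_set_convert_csum(struct sock *sk, bool val)
        {
                inet_sk(sk)->convert_csum = val;
        }

        static inline bool inet_get_convert_csum(struct sock *sk)
        {
                return !!inet_sk(sk)->convert_csum;
        }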
    • netlink: Warn on unordered or illegal nla_nest_cancel() or nlmsg_cancel() · 149118d8
      Thomas Graf committed
      Calling nla_nest_cancel() in a different order than the nesting was
      built up can lead to negative offsets being calculated, which
      results in skb_trim() being called with an underflowed unsigned
      int. Warn if mark < skb->data, as that is definitely a bug.
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
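
      A sketch of the added sanity check, close to what the patch does in
      nlmsg_trim(): only trim when the saved mark is sane, and warn when
      it lies before skb->data:

        static void nlmsg_trim(struct sk_buff *skb, const void *mark)
        {
                if (mark) {
                        WARN_ON((unsigned char *)mark < skb->data);
                        skb_trim(skb, (unsigned char *)mark - skb->data);
                }
        }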
  2. 05 Jan, 2015 (3 commits)
    • geneve: Remove socket hash table. · df5dba8e
      Jesse Gross committed
      The hash table for open Geneve ports is used only at creation and
      deletion time. It is not performance critical and is not likely to
      grow to a large number of items. Therefore, it can be changed to a
      simple linked list, as sketched below.
      Signed-off-by: Jesse Gross <jesse@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
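
      A sketch of the simpler structure, with hypothetical names (not the
      Geneve diff itself): a plain list protected by a mutex, walked only
      at create/delete time:

        struct geneve_port_entry {
                struct list_head list;
                __be16 port;
        };

        static LIST_HEAD(geneve_port_list);
        static DEFINE_MUTEX(geneve_mutex);  /* serializes create/delete only */

        static struct geneve_port_entry *geneve_find_port(__be16 port)
        {
                struct geneve_port_entry *e;

                list_for_each_entry(e, &geneve_port_list, list)
                        if (e->port == port)
                                return e;
                return NULL;
        }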
    • geneve: Simplify locking. · 829a3ada
      Jesse Gross committed
      The existing Geneve locking scheme was pulled over directly from
      VXLAN. However, VXLAN has a number of built-in mechanisms which make
      the locking more complex and are unlikely to be necessary with
      Geneve. This simplifies the locking to a basic scheme: a mutex when
      doing updates, plus RCU on receive (see the sketch below).
      
      In addition to making the code easier to read, this also avoids the
      possibility of a race when creating or destroying sockets since
      UDP sockets and the list of Geneve sockets are protected by different
      locks. After this change, the entire operation is atomic.
      Signed-off-by: Jesse Gross <jesse@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
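
      The general shape of that scheme, as an illustrative sketch with
      hypothetical identifiers: a mutex serializes updates, and readers
      only need rcu_read_lock() around the pointer dereference:

        struct cfg {
                struct rcu_head rcu;
                /* configuration payload */
        };

        static DEFINE_MUTEX(cfg_mutex);
        static struct cfg __rcu *active_cfg;

        static void publish_cfg(struct cfg *new_cfg)
        {
                struct cfg *old;

                mutex_lock(&cfg_mutex);
                old = rcu_dereference_protected(active_cfg,
                                                lockdep_is_held(&cfg_mutex));
                rcu_assign_pointer(active_cfg, new_cfg);
                mutex_unlock(&cfg_mutex);
                if (old)
                        kfree_rcu(old, rcu);  /* freed after readers drain */
        }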
    • geneve: Remove workqueue. · 61f3cade
      Jesse Gross committed
      The work queue is used only to free the UDP socket upon destruction.
      This is not necessary with Geneve and generally makes the code more
      difficult to reason about. It also introduces nondeterministic
      behavior: for example, rapidly deleting and recreating a socket
      could fail because the deletion happens asynchronously.
      Signed-off-by: Jesse Gross <jesse@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 01 Jan, 2015 (1 commit)
    • fib_trie: Push rcu_read_lock/unlock to callers · 345e9b54
      Alexander Duyck committed
      This change starts cleaning up some of the rcu_read_lock/unlock
      handling.  I realized while reviewing the code that there are several
      spots that I don't believe are being handled correctly, or that mask
      warnings by locally calling rcu_read_lock/unlock instead of calling
      them at the correct level.

      A common example is a call to fib_get_table followed by
      fib_table_lookup.  The rcu_read_lock/unlock ought to wrap both, but
      in several spots it did not (see the sketch below).
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
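
      A sketch of the corrected pattern (the wrapper function is
      illustrative; fib_get_table and fib_table_lookup are the real
      calls): one RCU read-side section covers both lookups:

        static int lookup_main_table(struct net *net, struct flowi4 *fl4,
                                     struct fib_result *res)
        {
                struct fib_table *tb;
                int err = -ENOENT;

                rcu_read_lock();
                tb = fib_get_table(net, RT_TABLE_MAIN);
                if (tb)
                        err = fib_table_lookup(tb, fl4, res, FIB_LOOKUP_NOREF);
                rcu_read_unlock();
                return err;
        }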
  4. 27 Dec, 2014 (6 commits)
  5. 21 Dec, 2014 (2 commits)
  6. 20 Dec, 2014 (4 commits)
  7. 19 Dec, 2014 (6 commits)
  8. 11 Dec, 2014 (5 commits)
    • mm: memcontrol: lockless page counters · 3e32cb2e
      Johannes Weiner committed
      Memory is internally accounted in bytes, using spinlock-protected 64-bit
      counters, even though the smallest accounting delta is a page.  The
      counter interface is also convoluted and does too many things.
      
      Introduce a new lockless word-sized page counter API, then change all
      memory accounting over to it.  The translation from and to bytes then only
      happens when interfacing with userspace.
      
      The removed locking overhead is noticeable when scaling beyond the
      per-cpu charge caches: on a 4-socket machine with 144 threads, the
      following test shows the performance differences of 288 memcgs
      concurrently running a page fault benchmark:
      
      vanilla:
      
         18631648.500498      task-clock (msec)         #  140.643 CPUs utilized            ( +-  0.33% )
               1,380,638      context-switches          #    0.074 K/sec                    ( +-  0.75% )
                  24,390      cpu-migrations            #    0.001 K/sec                    ( +-  8.44% )
           1,843,305,768      page-faults               #    0.099 M/sec                    ( +-  0.00% )
      50,134,994,088,218      cycles                    #    2.691 GHz                      ( +-  0.33% )
         <not supported>      stalled-cycles-frontend
         <not supported>      stalled-cycles-backend
       8,049,712,224,651      instructions              #    0.16  insns per cycle          ( +-  0.04% )
       1,586,970,584,979      branches                  #   85.176 M/sec                    ( +-  0.05% )
           1,724,989,949      branch-misses             #    0.11% of all branches          ( +-  0.48% )
      
           132.474343877 seconds time elapsed                                          ( +-  0.21% )
      
      lockless:
      
         12195979.037525      task-clock (msec)         #  133.480 CPUs utilized            ( +-  0.18% )
                 832,850      context-switches          #    0.068 K/sec                    ( +-  0.54% )
                  15,624      cpu-migrations            #    0.001 K/sec                    ( +- 10.17% )
           1,843,304,774      page-faults               #    0.151 M/sec                    ( +-  0.00% )
      32,811,216,801,141      cycles                    #    2.690 GHz                      ( +-  0.18% )
         <not supported>      stalled-cycles-frontend
         <not supported>      stalled-cycles-backend
       9,999,265,091,727      instructions              #    0.30  insns per cycle          ( +-  0.10% )
       2,076,759,325,203      branches                  #  170.282 M/sec                    ( +-  0.12% )
           1,656,917,214      branch-misses             #    0.08% of all branches          ( +-  0.55% )
      
            91.369330729 seconds time elapsed                                          ( +-  0.45% )
      
      On top of the improved scalability, this also gets rid of the icky
      long long types in the very heart of memcg, which is great for
      32-bit systems and also makes the code a lot more readable.
      
      Notable differences between the old and new API:
      
      - res_counter_charge() and res_counter_charge_nofail() become
        page_counter_try_charge() and page_counter_charge() respectively,
        to match the more common kernel naming scheme of try_do()/do()
      
      - res_counter_uncharge_until() is only ever used to cancel a local
        counter and never to uncharge bigger segments of a hierarchy, so
        it's replaced by the simpler page_counter_cancel()
      
      - res_counter_set_limit() is replaced by page_counter_limit(), which
        expects its callers to serialize against themselves
      
      - res_counter_memparse_write_strategy() is replaced by
        page_counter_memparse(), which rounds down to the nearest page
        size - rather than up.  This is more reasonable for explicitly
        requested hard upper limits.
      
      - to keep charging light-weight, page_counter_try_charge() charges
        speculatively, only to roll back if the result exceeds the limit.
        Because of this, a failing bigger charge can temporarily lock out
        smaller charges that would otherwise succeed.  The error is bounded
        to the difference between the smallest and the biggest possible
        charge size, so for memcg, this means that a failing THP charge can
        send base page charges into reclaim up to 2MB (4MB) before the limit
        would have been reached.  This should be acceptable.
      
      [akpm@linux-foundation.org: add includes for WARN_ON_ONCE and memparse]
      [akpm@linux-foundation.org: add includes for WARN_ON_ONCE, memparse, strncmp, and PAGE_SIZE]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
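
      A hedged usage sketch of the try-charge pattern named above (return
      conventions taken from this changelog; the demo function is
      hypothetical, so verify against the tree before relying on it):

        static int demo_charge(struct page_counter *counter,
                               unsigned long nr_pages)
        {
                struct page_counter *fail;

                if (page_counter_try_charge(counter, nr_pages, &fail))
                        return -ENOMEM;  /* 'fail' hit its limit */

                /* ... use the charged pages ... */

                page_counter_uncharge(counter, nr_pages);
                return 0;
        }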
    • irda: Convert function pointer arrays and uses to const · 785c20a0
      Joe Perches committed
      Making things const is a good thing.
      
      (x86-64 defconfig with all irda)
      $ size net/irda/built-in.o*
         text	   data	    bss	    dec	    hex	filename
       109276	   1868	    244	 111388	  1b31c	net/irda/built-in.o.new
       108828	   2316	    244	 111388	  1b31c	net/irda/built-in.o.old
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • llc: Make llc_sap_action_t function pointer arrays const · 22bbf5f3
      Joe Perches committed
      It's better when function pointer arrays aren't modifiable.
      
      Net change:
      
      $ size net/llc/built-in.o.*
         text	   data	    bss	    dec	    hex	filename
        61193	  12758	   1344	  75295	  1261f	net/llc/built-in.o.new
        47113	  27030	   1344	  75487	  126df	net/llc/built-in.o.old
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
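
      An illustrative sketch of the change (the typedef mirrors the llc
      code; the handler and array are stand-ins): marking the table const
      moves it from .data to .rodata, which is exactly the text-up,
      data-down shift the size(1) output above shows:

        typedef int (*llc_sap_action_t)(struct llc_sap *sap,
                                        struct sk_buff *skb);

        static int demo_action(struct llc_sap *sap, struct sk_buff *skb)
        {
                return 0;  /* stand-in handler */
        }

        static const llc_sap_action_t demo_actions[] = {
                demo_action,
        };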
    • llc: Make llc_conn_ev_qfyr_t function pointer arrays const · 9b373069
      Joe Perches committed
      It's better when function pointer arrays aren't modifiable.
      
      Net change from original:
      
      $ size net/llc/built-in.o.*
         text	   data	    bss	    dec	    hex	filename
        61065	  12886	   1344	  75295	  1261f	net/llc/built-in.o.new
        47113	  27030	   1344	  75487	  126df	net/llc/built-in.o.old
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • llc: Make function pointer arrays const · 14b7d95f
      Joe Perches committed
      It's better when function pointer arrays aren't modifiable.
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 10 Dec, 2014 (5 commits)