1. 15 3月, 2016 2 次提交
  2. 14 3月, 2016 9 次提交
  3. 12 3月, 2016 6 次提交
  4. 11 3月, 2016 5 次提交
  5. 10 3月, 2016 10 次提交
    • W
      net: validate variable length ll headers · 2793a23a
      Willem de Bruijn 提交于
      Netdevice parameter hard_header_len is variously interpreted both as
      an upper and lower bound on link layer header length. The field is
      used as upper bound when reserving room at allocation, as lower bound
      when validating user input in PF_PACKET.
      
      Clarify the definition to be maximum header length. For validation
      of untrusted headers, add an optional validate member to header_ops.
      
      Allow bypassing of validation by passing CAP_SYS_RAWIO, for instance
      for deliberate testing of corrupt input. In this case, pad trailing
      bytes, as some device drivers expect completely initialized headers.
      
      See also http://comments.gmane.org/gmane.linux.network/401064Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2793a23a
    • J
      NFC: microread: Drop platform data header file · 87aca737
      Jean Delvare 提交于
      Originally I only wanted to drop the unneeded inclusion of
      <linux/i2c.h>, but then noticed that struct
      microread_nfc_platform_data isn't actually used, and
      MICROREAD_DRIVER_NAME is redefined in the only file where it is used,
      so we can get rid of the header file and dead code altogether.
      Signed-off-by: NJean Delvare <jdelvare@suse.de>
      Cc: Lauro Ramos Venancio <lauro.venancio@openbossa.org>
      Cc: Aloisio Almeida Jr <aloisio.almeida@openbossa.org>
      Signed-off-by: NSamuel Ortiz <sameo@linux.intel.com>
      87aca737
    • T
      kcm: Add receive message timeout · 29152a34
      Tom Herbert 提交于
      This patch adds receive timeout for message assembly on the attached TCP
      sockets. The timeout is set when a new messages is started and the whole
      message has not been received by TCP (not in the receive queue). If the
      completely message is subsequently received the timer is cancelled, if the
      timer expires the RX side is aborted.
      
      The timeout value is taken from the socket timeout (SO_RCVTIMEO) that is
      set on a TCP socket (i.e. set by get sockopt before attaching a TCP socket
      to KCM.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      29152a34
    • T
      kcm: Add memory limit for receive message construction · 7ced95ef
      Tom Herbert 提交于
      Message assembly is performed on the TCP socket. This is logically
      equivalent of an application that performs a peek on the socket to find
      out how much memory is needed for a receive buffer. The receive socket
      buffer also provides the maximum message size which is checked.
      
      The receive algorithm is something like:
      
         1) Receive the first skbuf for a message (or skbufs if multiple are
            needed to determine message length).
         2) Check the message length against the number of bytes in the TCP
            receive queue (tcp_inq()).
      	- If all the bytes of the message are in the queue (incluing the
      	  skbuf received), then proceed with message assembly (it should
      	  complete with the tcp_read_sock)
              - Else, mark the psock with the number of bytes needed to
      	  complete the message.
         3) In TCP data ready function, if the psock indicates that we are
            waiting for the rest of the bytes of a messages, check the number
            of queued bytes against that.
              - If there are still not enough bytes for the message, just
      	  return
              - Else, clear the waiting bytes and proceed to receive the
      	  skbufs.  The message should now be received in one
      	  tcp_read_sock
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7ced95ef
    • T
      kcm: Add statistics and proc interfaces · cd6e111b
      Tom Herbert 提交于
      This patch adds various counters for KCM. These include counters for
      messages and bytes received or sent, as well as counters for number of
      attached/unattached TCP sockets and other error or edge events.
      
      The statistics are exposed via a proc interface. /proc/net/kcm provides
      statistics per KCM socket and per psock (attached TCP sockets).
      /proc/net/kcm_stats provides aggregate statistics.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cd6e111b
    • T
      kcm: Kernel Connection Multiplexor module · ab7ac4eb
      Tom Herbert 提交于
      This module implements the Kernel Connection Multiplexor.
      
      Kernel Connection Multiplexor (KCM) is a facility that provides a
      message based interface over TCP for generic application protocols.
      With KCM an application can efficiently send and receive application
      protocol messages over TCP using datagram sockets.
      
      For more information see the included Documentation/networking/kcm.txt
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ab7ac4eb
    • T
      tcp: Add tcp_inq to get available receive bytes on socket · 473bd239
      Tom Herbert 提交于
      Create a common kernel function to get the number of bytes available
      on a TCP socket. This is based on code in INQ getsockopt and we now call
      the function for that getsockopt.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      473bd239
    • T
      net: Add MSG_BATCH flag · f092276d
      Tom Herbert 提交于
      Add a new msg flag called MSG_BATCH. This flag is used in sendmsg to
      indicate that more messages will follow (i.e. a batch of messages is
      being sent). This is similar to MSG_MORE except that the following
      messages are not merged into one packet, they are sent individually.
      sendmmsg is updated so that each contained message except for the
      last one is marked as MSG_BATCH.
      
      MSG_BATCH is a performance optimization in cases where a socket
      implementation can benefit by transmitting packets in a batch.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f092276d
    • T
      net: Make sock_alloc exportable · f4a00aac
      Tom Herbert 提交于
      Export it for cases where we want to create sockets by hand.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f4a00aac
    • T
      rcu: Add list_next_or_null_rcu · ff3c44e6
      Tom Herbert 提交于
      This is a convenience function that returns the next entry in an RCU
      list or NULL if at the end of the list.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ff3c44e6
  6. 09 3月, 2016 8 次提交
    • D
      ip_tunnel, bpf: ip_tunnel_info_opts_{get, set} depends on CONFIG_INET · e28e87ed
      Daniel Borkmann 提交于
      Helpers like ip_tunnel_info_opts_{get,set}() are only available if
      CONFIG_INET is set, thus add an empty definition into the header for
      the !CONFIG_INET case, where already other empty inline helpers are
      defined.
      
      This avoids ifdef kludge inside filter.c, but also vxlan and geneve
      themself where this facility can only be used with, depend on INET
      being set. For the !INET case TUNNEL_OPTIONS_PRESENT would never be
      set in flags.
      
      Fixes: 14ca0751 ("bpf: support for access to tunnel options")
      Reported-by: NFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e28e87ed
    • A
      bpf: convert stackmap to pre-allocation · 557c0c6e
      Alexei Starovoitov 提交于
      It was observed that calling bpf_get_stackid() from a kprobe inside
      slub or from spin_unlock causes similar deadlock as with hashmap,
      therefore convert stackmap to use pre-allocated memory.
      
      The call_rcu is no longer feasible mechanism, since delayed freeing
      causes bpf_get_stackid() to fail unpredictably when number of actual
      stacks is significantly less than user requested max_entries.
      Since elements are no longer freed into slub, we can push elements into
      freelist immediately and let them be recycled.
      However the very unlikley race between user space map_lookup() and
      program-side recycling is possible:
           cpu0                          cpu1
           ----                          ----
      user does lookup(stackidX)
      starts copying ips into buffer
                                         delete(stackidX)
                                         calls bpf_get_stackid()
      				   which recyles the element and
                                         overwrites with new stack trace
      
      To avoid user space seeing a partial stack trace consisting of two
      merged stack traces, do bucket = xchg(, NULL); copy; xchg(,bucket);
      to preserve consistent stack trace delivery to user space.
      Now we can move memset(,0) of left-over element value from critical
      path of bpf_get_stackid() into slow-path of user space lookup.
      Also disallow lookup() from bpf program, since it's useless and
      program shouldn't be messing with collected stack trace.
      
      Note that similar race between user space lookup and kernel side updates
      is also present in hashmap, but it's not a new race. bpf programs were
      always allowed to modify hash and array map elements while user space
      is copying them.
      
      Fixes: d5a3b1f6 ("bpf: introduce BPF_MAP_TYPE_STACK_TRACE")
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      557c0c6e
    • A
      bpf: pre-allocate hash map elements · 6c905981
      Alexei Starovoitov 提交于
      If kprobe is placed on spin_unlock then calling kmalloc/kfree from
      bpf programs is not safe, since the following dead lock is possible:
      kfree->spin_lock(kmem_cache_node->lock)...spin_unlock->kprobe->
      bpf_prog->map_update->kmalloc->spin_lock(of the same kmem_cache_node->lock)
      and deadlocks.
      
      The following solutions were considered and some implemented, but
      eventually discarded
      - kmem_cache_create for every map
      - add recursion check to slow-path of slub
      - use reserved memory in bpf_map_update for in_irq or in preempt_disabled
      - kmalloc via irq_work
      
      At the end pre-allocation of all map elements turned out to be the simplest
      solution and since the user is charged upfront for all the memory, such
      pre-allocation doesn't affect the user space visible behavior.
      
      Since it's impossible to tell whether kprobe is triggered in a safe
      location from kmalloc point of view, use pre-allocation by default
      and introduce new BPF_F_NO_PREALLOC flag.
      
      While testing of per-cpu hash maps it was discovered
      that alloc_percpu(GFP_ATOMIC) has odd corner cases and often
      fails to allocate memory even when 90% of it is free.
      The pre-allocation of per-cpu hash elements solves this problem as well.
      
      Turned out that bpf_map_update() quickly followed by
      bpf_map_lookup()+bpf_map_delete() is very common pattern used
      in many of iovisor/bcc/tools, so there is additional benefit of
      pre-allocation, since such use cases are must faster.
      
      Since all hash map elements are now pre-allocated we can remove
      atomic increment of htab->count and save few more cycles.
      
      Also add bpf_map_precharge_memlock() to check rlimit_memlock early to avoid
      large malloc/free done by users who don't have sufficient limits.
      
      Pre-allocation is done with vmalloc and alloc/free is done
      via percpu_freelist. Here are performance numbers for different
      pre-allocation algorithms that were implemented, but discarded
      in favor of percpu_freelist:
      
      1 cpu:
      pcpu_ida	2.1M
      pcpu_ida nolock	2.3M
      bt		2.4M
      kmalloc		1.8M
      hlist+spinlock	2.3M
      pcpu_freelist	2.6M
      
      4 cpu:
      pcpu_ida	1.5M
      pcpu_ida nolock	1.8M
      bt w/smp_align	1.7M
      bt no/smp_align	1.1M
      kmalloc		0.7M
      hlist+spinlock	0.2M
      pcpu_freelist	2.0M
      
      8 cpu:
      pcpu_ida	0.7M
      bt w/smp_align	0.8M
      kmalloc		0.4M
      pcpu_freelist	1.5M
      
      32 cpu:
      kmalloc		0.13M
      pcpu_freelist	0.49M
      
      pcpu_ida nolock is a modified percpu_ida algorithm without
      percpu_ida_cpu locks and without cross-cpu tag stealing.
      It's faster than existing percpu_ida, but not as fast as pcpu_freelist.
      
      bt is a variant of block/blk-mq-tag.c simlified and customized
      for bpf use case. bt w/smp_align is using cache line for every 'long'
      (similar to blk-mq-tag). bt no/smp_align allocates 'long'
      bitmasks continuously to save memory. It's comparable to percpu_ida
      and in some cases faster, but slower than percpu_freelist
      
      hlist+spinlock is the simplest free list with single spinlock.
      As expeceted it has very bad scaling in SMP.
      
      kmalloc is existing implementation which is still available via
      BPF_F_NO_PREALLOC flag. It's significantly slower in single cpu and
      in 8 cpu setup it's 3 times slower than pre-allocation with pcpu_freelist,
      but saves memory, so in cases where map->max_entries can be large
      and number of map update/delete per second is low, it may make
      sense to use it.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6c905981
    • A
      bpf: prevent kprobe+bpf deadlocks · b121d1e7
      Alexei Starovoitov 提交于
      if kprobe is placed within update or delete hash map helpers
      that hold bucket spin lock and triggered bpf program is trying to
      grab the spinlock for the same bucket on the same cpu, it will
      deadlock.
      Fix it by extending existing recursion prevention mechanism.
      
      Note, map_lookup and other tracing helpers don't have this problem,
      since they don't hold any locks and don't modify global data.
      bpf_trace_printk has its own recursive check and ok as well.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b121d1e7
    • M
      ipv6: per netns FIB garbage collection · 3dc94f93
      Michal Kubeček 提交于
      One of our customers observed issues with FIB6 garbage collectors
      running in different network namespaces blocking each other, resulting
      in soft lockups (fib6_run_gc() initiated from timer runs always in
      forced mode).
      
      Now that FIB6 walkers are separated per namespace, there is no more need
      for instances of fib6_run_gc() in different namespaces blocking each
      other. There is still a call to icmp6_dst_gc() which operates on shared
      data but this function is protected by its own shared lock.
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Reviewed-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3dc94f93
    • M
      ipv6: per netns fib6 walkers · 9a03cd8f
      Michal Kubeček 提交于
      The IPv6 FIB data structures are separated per network namespace but
      there is still only one global walkers list and one global walker list
      lock. This means changes in one namespace unnecessarily interfere with
      walkers in other namespaces.
      
      Replace the global list with per-netns lists (and give each its own
      lock).
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Reviewed-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9a03cd8f
    • D
      bpf, vxlan, geneve, gre: fix usage of dst_cache on xmit · db3c6139
      Daniel Borkmann 提交于
      The assumptions from commit 0c1d70af ("net: use dst_cache for vxlan
      device"), 468dfffc ("geneve: add dst caching support") and 3c1cb4d2
      ("net/ipv4: add dst cache support for gre lwtunnels") on dst_cache usage
      when ip_tunnel_info is used is unfortunately not always valid as assumed.
      
      While it seems correct for ip_tunnel_info front-ends such as OVS, eBPF
      however can fill in ip_tunnel_info for consumers like vxlan, geneve or gre
      with different remote dsts, tos, etc, therefore they cannot be assumed as
      packet independent.
      
      Right now vxlan, geneve, gre would cache the dst for eBPF and every packet
      would reuse the same entry that was first created on the initial route
      lookup. eBPF doesn't store/cache the ip_tunnel_info, so each skb may have
      a different one.
      
      Fix it by adding a flag that checks the ip_tunnel_info. Also the !tos test
      in vxlan needs to be handeled differently in this context as it is currently
      inferred from ip_tunnel_info as well if present. ip_tunnel_dst_cache_usable()
      helper is added for the three tunnel cases, which checks if we can use dst
      cache.
      
      Fixes: 0c1d70af ("net: use dst_cache for vxlan device")
      Fixes: 468dfffc ("geneve: add dst caching support")
      Fixes: 3c1cb4d2 ("net/ipv4: add dst cache support for gre lwtunnels")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NPaolo Abeni <pabeni@redhat.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      db3c6139
    • D
      bpf: support for access to tunnel options · 14ca0751
      Daniel Borkmann 提交于
      After eBPF being able to programmatically access/manage tunnel key meta
      data via commit d3aa45ce ("bpf: add helpers to access tunnel metadata")
      and more recently also for IPv6 through c6c33454 ("bpf: support ipv6
      for bpf_skb_{set,get}_tunnel_key"), this work adds two complementary
      helpers to generically access their auxiliary tunnel options.
      
      Geneve and vxlan support this facility. For geneve, TLVs can be pushed,
      and for the vxlan case its GBP extension. I.e. setting tunnel key for geneve
      case only makes sense, if we can also read/write TLVs into it. In the GBP
      case, it provides the flexibility to easily map the group policy ID in
      combination with other helpers or maps.
      
      I chose to model this as two separate helpers, bpf_skb_{set,get}_tunnel_opt(),
      for a couple of reasons. bpf_skb_{set,get}_tunnel_key() is already rather
      complex by itself, and there may be cases for tunnel key backends where
      tunnel options are not always needed. If we would have integrated this
      into bpf_skb_{set,get}_tunnel_key() nevertheless, we are very limited with
      remaining helper arguments, so keeping compatibility on structs in case of
      passing in a flat buffer gets more cumbersome. Separating both also allows
      for more flexibility and future extensibility, f.e. options could be fed
      directly from a map, etc.
      
      Moreover, change geneve's xmit path to test only for info->options_len
      instead of TUNNEL_GENEVE_OPT flag. This makes it more consistent with vxlan's
      xmit path and allows for avoiding to specify a protocol flag in the API on
      xmit, so it can be protocol agnostic. Having info->options_len is enough
      information that is needed. Tested with vxlan and geneve.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      14ca0751