1. 07 4月, 2016 1 次提交
    • J
      vxlan: implement GPE · e1e5314d
      Jiri Benc 提交于
      Implement VXLAN-GPE. Only COLLECT_METADATA is supported for now (it is
      possible to support static configuration, too, if there is demand for it).
      
      The GPE header parsing has to be moved before iptunnel_pull_header, as we
      need to know the protocol.
      
      v2: Removed what was called "L2 mode" in v1 of the patchset. Only "L3 mode"
          (now called "raw mode") is added by this patch. This mode does not allow
          Ethernet header to be encapsulated in VXLAN-GPE when using ip route to
          specify the encapsulation, IP header is encapsulated instead. The patch
          does support Ethernet to be encapsulated, though, using ETH_P_TEB in
          skb->protocol. This will be utilized by other COLLECT_METADATA users
          (openvswitch in particular).
      
          If there is ever demand for Ethernet encapsulation with VXLAN-GPE using
          ip route, it's easy to add a new flag switching the interface to
          "Ethernet mode" (called "L2 mode" in v1 of this patchset). For now,
          leave this out, it seems we don't need it.
      
          Disallowed more flag combinations, especially RCO with GPE.
          Added comment explaining that GBP and GPE cannot be set together.
      Signed-off-by: NJiri Benc <jbenc@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e1e5314d
  2. 05 4月, 2016 2 次提交
    • E
      sock_diag: add SK_MEMINFO_DROPS · 15239302
      Eric Dumazet 提交于
      Reporting sk_drops to user space was available for UDP
      sockets using /proc interface.
      
      Add this to sock_diag, so that we can have the same information
      available to ss users, and we'll be able to add sk_drops
      indications for TCP sockets as well.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      15239302
    • S
      sock: accept SO_TIMESTAMPING flags in socket cmsg · 3dd17e63
      Soheil Hassas Yeganeh 提交于
      Accept SO_TIMESTAMPING in control messages of the SOL_SOCKET level
      as a basis to accept timestamping requests per write.
      
      This implementation only accepts TX recording flags (i.e.,
      SOF_TIMESTAMPING_TX_HARDWARE, SOF_TIMESTAMPING_TX_SOFTWARE,
      SOF_TIMESTAMPING_TX_SCHED, and SOF_TIMESTAMPING_TX_ACK) in
      control messages. Users need to set reporting flags (e.g.,
      SOF_TIMESTAMPING_OPT_ID) per socket via socket options.
      
      This commit adds a tsflags field in sockcm_cookie which is
      set in __sock_cmsg_send. It only override the SOF_TIMESTAMPING_TX_*
      bits in sockcm_cookie.tsflags allowing the control message
      to override the recording behavior per write, yet maintaining
      the value of other flags.
      
      This patch implements validating the control message and setting
      tsflags in struct sockcm_cookie. Next commits in this series will
      actually implement timestamping per write for different protocols.
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3dd17e63
  3. 31 3月, 2016 1 次提交
    • D
      bpf: make padding in bpf_tunnel_key explicit · c0e760c9
      Daniel Borkmann 提交于
      Make the 2 byte padding in struct bpf_tunnel_key between tunnel_ttl
      and tunnel_label members explicit. No issue has been observed, and
      gcc/llvm does padding for the old struct already, where tunnel_label
      was not yet present, so the current code works, but since it's part
      of uapi, make sure we don't introduce holes in structs.
      
      Therefore, add tunnel_ext that we can use generically in future
      (f.e. to flag OAM messages for backends, etc). Also add the offset
      to the compat tests to be sure should some compilers not padd the
      tail of the old version of bpf_tunnel_key.
      
      Fixes: 4018ab18 ("bpf: support flow label for bpf_skb_{set, get}_tunnel_key")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c0e760c9
  4. 23 3月, 2016 3 次提交
    • D
      kernel: add kcov code coverage · 5c9a8750
      Dmitry Vyukov 提交于
      kcov provides code coverage collection for coverage-guided fuzzing
      (randomized testing).  Coverage-guided fuzzing is a testing technique
      that uses coverage feedback to determine new interesting inputs to a
      system.  A notable user-space example is AFL
      (http://lcamtuf.coredump.cx/afl/).  However, this technique is not
      widely used for kernel testing due to missing compiler and kernel
      support.
      
      kcov does not aim to collect as much coverage as possible.  It aims to
      collect more or less stable coverage that is function of syscall inputs.
      To achieve this goal it does not collect coverage in soft/hard
      interrupts and instrumentation of some inherently non-deterministic or
      non-interesting parts of kernel is disbled (e.g.  scheduler, locking).
      
      Currently there is a single coverage collection mode (tracing), but the
      API anticipates additional collection modes.  Initially I also
      implemented a second mode which exposes coverage in a fixed-size hash
      table of counters (what Quentin used in his original patch).  I've
      dropped the second mode for simplicity.
      
      This patch adds the necessary support on kernel side.  The complimentary
      compiler support was added in gcc revision 231296.
      
      We've used this support to build syzkaller system call fuzzer, which has
      found 90 kernel bugs in just 2 months:
      
        https://github.com/google/syzkaller/wiki/Found-Bugs
      
      We've also found 30+ bugs in our internal systems with syzkaller.
      Another (yet unexplored) direction where kcov coverage would greatly
      help is more traditional "blob mutation".  For example, mounting a
      random blob as a filesystem, or receiving a random blob over wire.
      
      Why not gcov.  Typical fuzzing loop looks as follows: (1) reset
      coverage, (2) execute a bit of code, (3) collect coverage, repeat.  A
      typical coverage can be just a dozen of basic blocks (e.g.  an invalid
      input).  In such context gcov becomes prohibitively expensive as
      reset/collect coverage steps depend on total number of basic
      blocks/edges in program (in case of kernel it is about 2M).  Cost of
      kcov depends only on number of executed basic blocks/edges.  On top of
      that, kernel requires per-thread coverage because there are always
      background threads and unrelated processes that also produce coverage.
      With inlined gcov instrumentation per-thread coverage is not possible.
      
      kcov exposes kernel PCs and control flow to user-space which is
      insecure.  But debugfs should not be mapped as user accessible.
      
      Based on a patch by Quentin Casasnovas.
      
      [akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
      [akpm@linux-foundation.org: unbreak allmodconfig]
      [akpm@linux-foundation.org: follow x86 Makefile layout standards]
      Signed-off-by: NDmitry Vyukov <dvyukov@google.com>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Cc: syzkaller <syzkaller@googlegroups.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Tavis Ormandy <taviso@google.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: David Drysdale <drysdale@google.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5c9a8750
    • A
      rapidio: add mport char device driver · e8de3701
      Alexandre Bounine 提交于
      Add mport character device driver to provide user space interface to
      basic RapidIO subsystem operations.
      
      See included Documentation/rapidio/mport_cdev.txt for more details.
      
      [akpm@linux-foundation.org: fix printk warning on i386]
      [dan.carpenter@oracle.com: mport_cdev: fix some error codes]
      Signed-off-by: NAlexandre Bounine <alexandre.bounine@idt.com>
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Tested-by: NBarry Wood <barry.wood@idt.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Andre van Herk <andre.van.herk@prodrive-technologies.com>
      Cc: Barry Wood <barry.wood@idt.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e8de3701
    • D
      ethtool: minor doc update · 5f2d4724
      David Decotigny 提交于
      Updates: commit 793cf87d ("ethtool: Set cmd field in
               ETHTOOL_GLINKSETTINGS response to wrong nwords")
      Signed-off-by: NDavid Decotigny <decot@googlers.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5f2d4724
  5. 22 3月, 2016 2 次提交
  6. 19 3月, 2016 1 次提交
  7. 18 3月, 2016 4 次提交
  8. 16 3月, 2016 2 次提交
  9. 15 3月, 2016 3 次提交
    • J
      openvswitch: Interface with NAT. · 05752523
      Jarno Rajahalme 提交于
      Extend OVS conntrack interface to cover NAT.  New nested
      OVS_CT_ATTR_NAT attribute may be used to include NAT with a CT action.
      A bare OVS_CT_ATTR_NAT only mangles existing and expected connections.
      If OVS_NAT_ATTR_SRC or OVS_NAT_ATTR_DST is included within the nested
      attributes, new (non-committed/non-confirmed) connections are mangled
      according to the rest of the nested attributes.
      
      The corresponding OVS userspace patch series includes test cases (in
      tests/system-traffic.at) that also serve as example uses.
      
      This work extends on a branch by Thomas Graf at
      https://github.com/tgraf/ovs/tree/nat.
      Signed-off-by: NJarno Rajahalme <jarno@ovn.org>
      Acked-by: NThomas Graf <tgraf@suug.ch>
      Acked-by: NJoe Stringer <joe@ovn.org>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      05752523
    • J
      netfilter: Remove IP_CT_NEW_REPLY definition. · bfa3f9d7
      Jarno Rajahalme 提交于
      Remove the definition of IP_CT_NEW_REPLY from the kernel as it does
      not make sense.  This allows the definition of IP_CT_NUMBER to be
      simplified as well.
      Signed-off-by: NJarno Rajahalme <jarno@ovn.org>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      bfa3f9d7
    • M
      tcp: Add RFC4898 tcpEStatsPerfDataSegsOut/In · a44d6eac
      Martin KaFai Lau 提交于
      Per RFC4898, they count segments sent/received
      containing a positive length data segment (that includes
      retransmission segments carrying data).  Unlike
      tcpi_segs_out/in, tcpi_data_segs_out/in excludes segments
      carrying no data (e.g. pure ack).
      
      The patch also updates the segs_in in tcp_fastopen_add_skb()
      so that segs_in >= data_segs_in property is kept.
      
      Together with retransmission data, tcpi_data_segs_out
      gives a better signal on the rxmit rate.
      
      v6: Rebase on the latest net-next
      
      v5: Eric pointed out that checking skb->len is still needed in
      tcp_fastopen_add_skb() because skb can carry a FIN without data.
      Hence, instead of open coding segs_in and data_segs_in, tcp_segs_in()
      helper is used.  Comment is added to the fastopen case to explain why
      segs_in has to be reset and tcp_segs_in() has to be called before
      __skb_pull().
      
      v4: Add comment to the changes in tcp_fastopen_add_skb()
      and also add remark on this case in the commit message.
      
      v3: Add const modifier to the skb parameter in tcp_segs_in()
      
      v2: Rework based on recent fix by Eric:
      commit a9d99ce2 ("tcp: fix tcpi_segs_in after connection establishment")
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Cc: Chris Rapier <rapier@psc.edu>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Marcelo Ricardo Leitner <mleitner@redhat.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a44d6eac
  10. 14 3月, 2016 2 次提交
  11. 12 3月, 2016 3 次提交
  12. 11 3月, 2016 5 次提交
  13. 10 3月, 2016 2 次提交
  14. 09 3月, 2016 4 次提交
    • A
      bpf: pre-allocate hash map elements · 6c905981
      Alexei Starovoitov 提交于
      If kprobe is placed on spin_unlock then calling kmalloc/kfree from
      bpf programs is not safe, since the following dead lock is possible:
      kfree->spin_lock(kmem_cache_node->lock)...spin_unlock->kprobe->
      bpf_prog->map_update->kmalloc->spin_lock(of the same kmem_cache_node->lock)
      and deadlocks.
      
      The following solutions were considered and some implemented, but
      eventually discarded
      - kmem_cache_create for every map
      - add recursion check to slow-path of slub
      - use reserved memory in bpf_map_update for in_irq or in preempt_disabled
      - kmalloc via irq_work
      
      At the end pre-allocation of all map elements turned out to be the simplest
      solution and since the user is charged upfront for all the memory, such
      pre-allocation doesn't affect the user space visible behavior.
      
      Since it's impossible to tell whether kprobe is triggered in a safe
      location from kmalloc point of view, use pre-allocation by default
      and introduce new BPF_F_NO_PREALLOC flag.
      
      While testing of per-cpu hash maps it was discovered
      that alloc_percpu(GFP_ATOMIC) has odd corner cases and often
      fails to allocate memory even when 90% of it is free.
      The pre-allocation of per-cpu hash elements solves this problem as well.
      
      Turned out that bpf_map_update() quickly followed by
      bpf_map_lookup()+bpf_map_delete() is very common pattern used
      in many of iovisor/bcc/tools, so there is additional benefit of
      pre-allocation, since such use cases are must faster.
      
      Since all hash map elements are now pre-allocated we can remove
      atomic increment of htab->count and save few more cycles.
      
      Also add bpf_map_precharge_memlock() to check rlimit_memlock early to avoid
      large malloc/free done by users who don't have sufficient limits.
      
      Pre-allocation is done with vmalloc and alloc/free is done
      via percpu_freelist. Here are performance numbers for different
      pre-allocation algorithms that were implemented, but discarded
      in favor of percpu_freelist:
      
      1 cpu:
      pcpu_ida	2.1M
      pcpu_ida nolock	2.3M
      bt		2.4M
      kmalloc		1.8M
      hlist+spinlock	2.3M
      pcpu_freelist	2.6M
      
      4 cpu:
      pcpu_ida	1.5M
      pcpu_ida nolock	1.8M
      bt w/smp_align	1.7M
      bt no/smp_align	1.1M
      kmalloc		0.7M
      hlist+spinlock	0.2M
      pcpu_freelist	2.0M
      
      8 cpu:
      pcpu_ida	0.7M
      bt w/smp_align	0.8M
      kmalloc		0.4M
      pcpu_freelist	1.5M
      
      32 cpu:
      kmalloc		0.13M
      pcpu_freelist	0.49M
      
      pcpu_ida nolock is a modified percpu_ida algorithm without
      percpu_ida_cpu locks and without cross-cpu tag stealing.
      It's faster than existing percpu_ida, but not as fast as pcpu_freelist.
      
      bt is a variant of block/blk-mq-tag.c simlified and customized
      for bpf use case. bt w/smp_align is using cache line for every 'long'
      (similar to blk-mq-tag). bt no/smp_align allocates 'long'
      bitmasks continuously to save memory. It's comparable to percpu_ida
      and in some cases faster, but slower than percpu_freelist
      
      hlist+spinlock is the simplest free list with single spinlock.
      As expeceted it has very bad scaling in SMP.
      
      kmalloc is existing implementation which is still available via
      BPF_F_NO_PREALLOC flag. It's significantly slower in single cpu and
      in 8 cpu setup it's 3 times slower than pre-allocation with pcpu_freelist,
      but saves memory, so in cases where map->max_entries can be large
      and number of map update/delete per second is low, it may make
      sense to use it.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6c905981
    • D
      bpf: support for access to tunnel options · 14ca0751
      Daniel Borkmann 提交于
      After eBPF being able to programmatically access/manage tunnel key meta
      data via commit d3aa45ce ("bpf: add helpers to access tunnel metadata")
      and more recently also for IPv6 through c6c33454 ("bpf: support ipv6
      for bpf_skb_{set,get}_tunnel_key"), this work adds two complementary
      helpers to generically access their auxiliary tunnel options.
      
      Geneve and vxlan support this facility. For geneve, TLVs can be pushed,
      and for the vxlan case its GBP extension. I.e. setting tunnel key for geneve
      case only makes sense, if we can also read/write TLVs into it. In the GBP
      case, it provides the flexibility to easily map the group policy ID in
      combination with other helpers or maps.
      
      I chose to model this as two separate helpers, bpf_skb_{set,get}_tunnel_opt(),
      for a couple of reasons. bpf_skb_{set,get}_tunnel_key() is already rather
      complex by itself, and there may be cases for tunnel key backends where
      tunnel options are not always needed. If we would have integrated this
      into bpf_skb_{set,get}_tunnel_key() nevertheless, we are very limited with
      remaining helper arguments, so keeping compatibility on structs in case of
      passing in a flat buffer gets more cumbersome. Separating both also allows
      for more flexibility and future extensibility, f.e. options could be fed
      directly from a map, etc.
      
      Moreover, change geneve's xmit path to test only for info->options_len
      instead of TUNNEL_GENEVE_OPT flag. This makes it more consistent with vxlan's
      xmit path and allows for avoiding to specify a protocol flag in the API on
      xmit, so it can be protocol agnostic. Having info->options_len is enough
      information that is needed. Tested with vxlan and geneve.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      14ca0751
    • D
      bpf: allow to propagate df in bpf_skb_set_tunnel_key · 22080870
      Daniel Borkmann 提交于
      Added by 9a628224 ("ip_tunnel: Add dont fragment flag."), allow to
      feed df flag into tunneling facilities (currently supported on TX by
      vxlan, geneve and gre) as a hint from eBPF's bpf_skb_set_tunnel_key()
      helper.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      22080870
    • D
      bpf: add flags to bpf_skb_store_bytes for clearing hash · 8afd54c8
      Daniel Borkmann 提交于
      When overwriting parts of the packet with bpf_skb_store_bytes() that
      were fed previously into skb->hash calculation, we should clear the
      current hash with skb_clear_hash(), so that a next skb_get_hash() call
      can determine the correct hash related to this skb.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8afd54c8
  15. 08 3月, 2016 1 次提交
  16. 06 3月, 2016 2 次提交
    • D
      nfit, libnvdimm: clear poison command support · d4f32367
      Dan Williams 提交于
      Add the boiler-plate for a 'clear error' command based on section
      9.20.7.6 "Function Index 4 - Clear Uncorrectable Error" from the ACPI
      6.1 specification, and add a reference implementation in nfit_test.
      Reviewed-by: NVishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      d4f32367
    • R
      usb: devio: Add ioctl to disallow detaching kernel USB drivers. · d883f52e
      Reilly Grant 提交于
      The new USBDEVFS_DROP_PRIVILEGES ioctl allows a process to voluntarily
      relinquish the ability to issue other ioctls that may interfere with
      other processes and drivers that have claimed an interface on the
      device.
      
      This commit also includes a simple utility to be able to test the
      ioctl, located at Documentation/usb/usbdevfs-drop-permissions.c
      
      Example (with qemu-kvm's input device):
      
          $ lsusb
          ...
          Bus 001 Device 002: ID 0627:0001 Adomax Technology Co., Ltd
      
          $ usb-devices
          ...
          C:  #Ifs= 1 Cfg#= 1 Atr=a0 MxPwr=100mA
          I:  If#= 0 Alt= 0 #EPs= 1 Cls=03(HID  ) Sub=00 Prot=02 Driver=usbhid
      
          $ sudo ./usbdevfs-drop-permissions /dev/bus/usb/001/002
          OK: privileges dropped!
          Available options:
          [0] Exit now
          [1] Reset device. Should fail if device is in use
          [2] Claim 4 interfaces. Should succeed where not in use
          [3] Narrow interface permission mask
          Which option shall I run?: 1
          ERROR: USBDEVFS_RESET failed! (1 - Operation not permitted)
          Which test shall I run next?: 2
          ERROR claiming if 0 (1 - Operation not permitted)
          ERROR claiming if 1 (1 - Operation not permitted)
          ERROR claiming if 2 (1 - Operation not permitted)
          ERROR claiming if 3 (1 - Operation not permitted)
          Which test shall I run next?: 0
      
      After unbinding usbhid:
      
          $ usb-devices
          ...
          I:  If#= 0 Alt= 0 #EPs= 1 Cls=03(HID  ) Sub=00 Prot=02 Driver=(none)
      
          $ sudo ./usbdevfs-drop-permissions /dev/bus/usb/001/002
          ...
          Which option shall I run?: 2
          OK: claimed if 0
          ERROR claiming if 1 (1 - Operation not permitted)
          ERROR claiming if 2 (1 - Operation not permitted)
          ERROR claiming if 3 (1 - Operation not permitted)
          Which test shall I run next?: 1
          OK: USBDEVFS_RESET succeeded
          Which test shall I run next?: 0
      
      After unbinding usbhid and restricting the mask:
      
          $ sudo ./usbdevfs-drop-permissions /dev/bus/usb/001/002
          ...
          Which option shall I run?: 3
          Insert new mask: 0
          OK: privileges dropped!
          Which test shall I run next?: 2
          ERROR claiming if 0 (1 - Operation not permitted)
          ERROR claiming if 1 (1 - Operation not permitted)
          ERROR claiming if 2 (1 - Operation not permitted)
          ERROR claiming if 3 (1 - Operation not permitted)
      Signed-off-by: NReilly Grant <reillyg@chromium.org>
      Acked-by: NAlan Stern <stern@rowland.harvard.edu>
      Signed-off-by: NEmilio López <emilio.lopez@collabora.co.uk>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d883f52e
  17. 05 3月, 2016 2 次提交