1. 29 Jun, 2018 1 commit
    • D
      bpf: Change bpf_fib_lookup to return lookup status · 4c79579b
      David Ahern authored
      For ACLs implemented using either FIB rules or FIB entries, the BPF
      program needs the FIB lookup status to be able to drop the packet.
      Since the bpf_fib_lookup API has not reached a released kernel yet,
      change the return code to contain an encoding of the FIB lookup
      result and return the nexthop device index in the params struct.
      
      In addition, inform the BPF program of any post FIB lookup reason as
      to why the packet needs to go up the stack.
      
      The fib result for unicast routes must have an egress device, so remove
      the check that it is non-NULL.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      4c79579b
  2. 21 Jun, 2018 1 commit
  3. 16 Jun, 2018 1 commit
  4. 15 Jun, 2018 3 commits
  5. 13 Jun, 2018 1 commit
  6. 12 Jun, 2018 3 commits
  7. 08 Jun, 2018 4 commits
  8. 07 Jun, 2018 1 commit
  9. 06 Jun, 2018 3 commits
    • M
      rseq: Introduce restartable sequences system call · d7822b1e
      Mathieu Desnoyers authored
      Expose a new system call allowing each thread to register one userspace
      memory area to be used as an ABI between kernel and user-space for two
      purposes: user-space restartable sequences and quick access to read the
      current CPU number value from user-space.
      
      * Restartable sequences (per-cpu atomics)
      
      Restartable sequences allow user-space to perform update operations on
      per-cpu data without requiring heavy-weight atomic operations.
      
      The restartable critical sections (percpu atomics) work has been started
      by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
      critical sections. [1] [2] The re-implementation proposed here brings a
      few simplifications to the ABI which facilitates porting to other
      architectures and speeds up the user-space fast path.
      
      Here are benchmarks of various rseq use-cases.
      
      Test hardware:
      
      arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
      x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading
      
      The following benchmarks were all performed on a single thread.
      
      * Per-CPU statistic counter increment
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                344.0                 31.4          11.0
      x86-64:                15.3                  2.0           7.7
      
      * LTTng-UST: write event 32-bit header, 32-bit payload into tracer
                   per-cpu buffer
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:               2502.0                 2250.0         1.1
      x86-64:               117.4                   98.0         1.2
      
      * liburcu percpu: lock-unlock pair, dereference, read/compare word
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                751.0                 128.5          5.8
      x86-64:                53.4                  28.6          1.9
      
      * jemalloc memory allocator adapted to use rseq
      
      Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
      rseq 2016 implementation):
      
      The production workload response-time has 1-2% gain avg. latency, and
      the P99 overall latency drops by 2-3%.
      
      * Reading the current CPU number
      
      Speeding up reading the current CPU number on which the caller thread is
      running is done by keeping the current CPU number up to date within the
      cpu_id field of the memory area registered by the thread. This is done
      by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
      current thread. Upon return to user-space, a notify-resume handler
      updates the current CPU value within the registered user-space memory
      area. User-space can then read the current CPU number directly from
      memory.
      
      Keeping the current cpu id in a memory area shared between kernel and
      user-space is an improvement over current mechanisms available to read
      the current CPU number, which has the following benefits over
      alternative approaches:
      
      - 35x speedup on ARM vs system call through glibc
      - 20x speedup on x86 compared to calling glibc, which calls vdso
        executing a "lsl" instruction,
      - 14x speedup on x86 compared to inlined "lsl" instruction,
      - Unlike vdso approaches, this cpu_id value can be read from an inline
        assembly, which makes it a useful building block for restartable
        sequences.
      - The approach of reading the cpu id through memory mapping shared
        between kernel and user-space is portable (e.g. ARM), which is not the
        case for the lsl-based x86 vdso.
      
      On x86, yet another possible approach would be to use the gs segment
      selector to point to user-space per-cpu data. This approach performs
      similarly to the cpu id cache, but it has two disadvantages: it is
      not portable, and it is incompatible with existing applications already
      using the gs segment selector for other purposes.
      
      Benchmarking various approaches for reading the current CPU number:
      
      ARMv7 Processor rev 4 (v7l)
      Machine model: Cubietruck
      - Baseline (empty loop):                                    8.4 ns
      - Read CPU from rseq cpu_id:                               16.7 ns
      - Read CPU from rseq cpu_id (lazy register):               19.8 ns
      - glibc 2.19-0ubuntu6.6 getcpu:                           301.8 ns
      - getcpu system call:                                     234.9 ns
      
      x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
      - Baseline (empty loop):                                    0.8 ns
      - Read CPU from rseq cpu_id:                                0.8 ns
      - Read CPU from rseq cpu_id (lazy register):                0.8 ns
      - Read using gs segment selector:                           0.8 ns
      - "lsl" inline assembly:                                   13.0 ns
      - glibc 2.19-0ubuntu6 getcpu:                              16.6 ns
      - getcpu system call:                                      53.9 ns
      
      - Speed (benchmark taken on v8 of patchset)
      
      Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
      expectations, that enabling CONFIG_RSEQ slightly accelerates the
      scheduler:
      
      Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
      2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
      saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
      kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
      restartable sequences series applied.
      
      * CONFIG_RSEQ=n
      
      avg.:      41.37 s
      std.dev.:   0.36 s
      
      * CONFIG_RSEQ=y
      
      avg.:      40.46 s
      std.dev.:   0.33 s
      
      - Size
      
      On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
      567 bytes, and the data size increase of vmlinux is 5696 bytes.
      
      [1] https://lwn.net/Articles/650333/
      [2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf

      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-api@vger.kernel.org
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
      Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
      Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
      d7822b1e
    • M
      uapi/headers: Provide types_32_64.h · b575e837
      Mathieu Desnoyers authored
      Provide helper macros for fields which represent pointers in
      kernel-userspace ABI. This facilitates handling of 32-bit
      user-space by 64-bit kernels by defining those fields as
      32-bit 0-padding and 32-bit integer on 32-bit architectures,
      which allows the kernel to treat those as 64-bit integers.
      The order of padding and 32-bit integer depends on the
      endianness.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: linux-api@vger.kernel.org
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: https://lkml.kernel.org/r/20180602124408.8430-2-mathieu.desnoyers@efficios.com
      b575e837
    • G
      ncpfs: remove uapi .h files · 05e98465
      Greg Kroah-Hartman authored
      Now that ncpfs is removed from the tree, there is no need to keep the
      uapi header files around as no one uses them, and it is not a feature
      that the kernel supports anymore.
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      05e98465
  10. 05 Jun, 2018 1 commit
  11. 04 Jun, 2018 4 commits
    • B
      xsk: new descriptor addressing scheme · bbff2f32
      Björn Töpel authored
      Currently, AF_XDP only supports a fixed frame-size memory scheme where
      each frame is referenced via an index (idx). A user passes the frame
      index to the kernel, and the kernel acts upon the data.  Some NICs,
      however, do not have a fixed frame-size model, instead they have a
      model where a memory window is passed to the hardware and multiple
      frames are filled into that window (referred to as the "type-writer"
      model).
      
      By changing the descriptor format from the current frame index
      addressing scheme, AF_XDP can in the future be extended to support
      these kinds of NICs.
      
      In the index-based model, an idx refers to a frame of size
      frame_size. Addressing a frame in the UMEM is done by offsetting the
      UMEM starting address by a global offset, idx * frame_size + offset.
      Communicating via the fill- and completion-rings is done by means of
      idx.
      
      In this commit, the idx is removed in favor of an address (addr),
      which is a relative address ranging over the UMEM. To convert an
      idx-based address to the new addr is simply: addr = idx * frame_size +
      offset.
      
      We also stop referring to the UMEM "frame" as a frame. Instead it is
      simply called a chunk.
      
      To transfer ownership of a chunk to the kernel, the addr of the chunk
      is passed in the fill-ring. Note that the kernel will mask addr to
      make it chunk aligned, so there is no need for userspace to do
      that. E.g., for a chunk size of 2k, passing an addr of 2048, 2050 or
      3000 to the fill-ring will refer to the same chunk.
      
      On the completion-ring, the addr will match that of the Tx descriptor,
      passed to the kernel.
      
      Changing the descriptor format to use chunks/addr will allow for
      future changes to move to a type-writer based model, where multiple
      frames can reside in one chunk. In this model passing one single chunk
      into the fill-ring, would potentially result in multiple Rx
      descriptors.
      
      This commit changes the uapi of AF_XDP sockets, and updates the
      documentation.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      bbff2f32
    • D
      bpf: flowlabel in bpf_fib_lookup should be flowinfo · bd3a08aa
      David Ahern authored
      As Michal noted the flow struct takes both the flow label and priority.
      Update the bpf_fib_lookup API to note that it is flowinfo and not just
      the flow label.
      
      Cc: Michal Kubecek <mkubecek@suse.cz>
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      bd3a08aa
    • Y
      bpf: implement bpf_get_current_cgroup_id() helper · bf6fa2c8
      Yonghong Song authored
      bpf has been used extensively for tracing. For example, bcc
      contains an almost full set of bpf-based tools to trace kernel
      and user functions/events. Most tracing tools are currently
      either filtered based on pid or system-wide.
      
      Containers have been used quite extensively in industry and
      cgroup is often used together to provide resource isolation
      and protection. Several processes may run inside the same
      container. It is often desirable to get container-level tracing
      results as well, e.g. syscall count, function count, I/O
      activity, etc.
      
      This patch implements a new helper, bpf_get_current_cgroup_id(),
      which will return cgroup id based on the cgroup within which
      the current task is running.
      
      A later patch will provide an example to show that
      userspace can get the same cgroup id so it could
      configure a filter or policy in the bpf program based on
      task cgroup id.
      
      The helper is currently implemented for tracing. It can
      be added to other program types as well when needed.
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      bf6fa2c8
    • S
      rpmsg: char: Switch to SPDX license identifier · 136200f4
      Suman Anna authored
      Use the appropriate SPDX license identifier in the rpmsg char driver
      source file and drop the previous boilerplate license text. The uapi
      header file already had the SPDX license identifier added as part of
      a mass update but the license text removal was deferred for later,
      and this patch drops the same.
      Signed-off-by: Suman Anna <s-anna@ti.com>
      Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
      136200f4
  12. 03 Jun, 2018 4 commits
    • D
      bpf: make sure to clear unused fields in tunnel/xfrm state fetch · 1fbc2e0c
      Daniel Borkmann authored
      Since the remaining bits are not filled in struct bpf_tunnel_key
      resp. struct bpf_xfrm_state and originate from uninitialized stack
      space, we should make sure to clear them before handing control
      back to the program.
      
      Also add a padding element to struct bpf_xfrm_state for future use,
      similar to what we have in struct bpf_tunnel_key, and clear it as well.
      
        struct bpf_xfrm_state {
            __u32                      reqid;            /*     0     4 */
            __u32                      spi;              /*     4     4 */
            __u16                      family;           /*     8     2 */
      
            /* XXX 2 bytes hole, try to pack */
      
            union {
                __u32              remote_ipv4;          /*           4 */
                __u32              remote_ipv6[4];       /*          16 */
            };                                           /*    12    16 */
      
            /* size: 28, cachelines: 1, members: 4 */
            /* sum members: 26, holes: 1, sum holes: 2 */
            /* last cacheline: 28 bytes */
        };
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      1fbc2e0c
    • D
      bpf: add bpf_skb_cgroup_id helper · cb20b08e
      Daniel Borkmann authored
      Add a new bpf_skb_cgroup_id() helper that allows retrieving the
      cgroup id from the skb's socket. This is useful in particular to
      enable bpf_get_cgroup_classid()-like behavior for cgroup v1 in
      cgroup v2 by allowing ID based matching on egress. This can in
      particular be used in combination with applying policy e.g. from
      map lookups, and also complements the older bpf_skb_under_cgroup()
      interface. In user space the cgroup id for a given path can be
      retrieved through the f_handle as demonstrated in [0] recently.
      
        [0] https://lkml.org/lkml/2018/5/22/1190

      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      cb20b08e
    • O
      PCI/DPC: Disable ERR_NONFATAL handling by DPC · 6927868e
      Oza Pawandeep authored
      PCIe ERR_NONFATAL errors mean a particular transaction is unreliable but
      the Link is otherwise fully functional (PCIe r4.0, sec 6.2.2).
      
      The AER driver handles these by logging the error details and calling
      driver-supplied pci_error_handlers callbacks.  It does not reset downstream
      devices, does not remove them from the PCI subsystem, does not re-enumerate
      them, and does not call their driver .remove() or .probe() methods.
      
      But DPC driver previously enabled DPC on ERR_NONFATAL, so if the hardware
      supports DPC, these errors caused a Link reset (performed automatically by
      the hardware), followed by the DPC driver removing affected devices (which
      calls their .remove() methods), bringing the Link back up, and
      re-enumerating (which calls driver .probe() methods).
      
      Disable ERR_NONFATAL DPC triggering so these errors will only be handled by
      AER.  This means drivers won't have to deal with different usage of their
      pci_error_handlers callbacks and .probe() and .remove() methods based on
      whether the platform has DPC support.
      Signed-off-by: Oza Pawandeep <poza@codeaurora.org>
      [bhelgaas: changelog]
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      6927868e
    • P
      netfilter: nf_tables: add connlimit support · 290180e2
      Pablo Neira Ayuso authored
      This feature allows you to limit the maximum number of connections
      per arbitrary key. The connlimit expression is stateful, so it can
      be used from meters to dynamically populate a set; this provides a
      mapping to the iptables connlimit match. This patch also allows you
      to define static connlimit policies.
      
      This extension depends on the nf_conncount infrastructure.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      290180e2
  13. 02 Jun, 2018 1 commit
    • D
      bpf: fix uapi hole for 32 bit compat applications · 36f9814a
      Daniel Borkmann authored
      In 64 bit, we have a 4 byte hole between ifindex and netns_dev in the
      case of struct bpf_map_info but also struct bpf_prog_info. In net-next
      commit b85fab0e ("bpf: Add gpl_compatible flag to struct bpf_prog_info")
      added a bitfield into it to expose some flags related to programs. Thus,
      add an unnamed __u32 bitfield for both so that alignment stays the same
      in both 32 and 64 bit cases, and can be naturally extended from there
      as in b85fab0e.
      
      Before:
      
        # file test.o
        test.o: ELF 32-bit LSB relocatable, Intel 80386, version 1 (SYSV), not stripped
        # pahole test.o
        struct bpf_map_info {
      	__u32                      type;                 /*     0     4 */
      	__u32                      id;                   /*     4     4 */
      	__u32                      key_size;             /*     8     4 */
      	__u32                      value_size;           /*    12     4 */
      	__u32                      max_entries;          /*    16     4 */
      	__u32                      map_flags;            /*    20     4 */
      	char                       name[16];             /*    24    16 */
      	__u32                      ifindex;              /*    40     4 */
      	__u64                      netns_dev;            /*    44     8 */
      	__u64                      netns_ino;            /*    52     8 */
      
      	/* size: 64, cachelines: 1, members: 10 */
      	/* padding: 4 */
        };
      
      After (same as on 64 bit):
      
        # file test.o
        test.o: ELF 32-bit LSB relocatable, Intel 80386, version 1 (SYSV), not stripped
        # pahole test.o
        struct bpf_map_info {
      	__u32                      type;                 /*     0     4 */
      	__u32                      id;                   /*     4     4 */
      	__u32                      key_size;             /*     8     4 */
      	__u32                      value_size;           /*    12     4 */
      	__u32                      max_entries;          /*    16     4 */
      	__u32                      map_flags;            /*    20     4 */
      	char                       name[16];             /*    24    16 */
      	__u32                      ifindex;              /*    40     4 */
      
      	/* XXX 4 bytes hole, try to pack */
      
      	__u64                      netns_dev;            /*    48     8 */
      	__u64                      netns_ino;            /*    56     8 */
      	/* --- cacheline 1 boundary (64 bytes) --- */
      
      	/* size: 64, cachelines: 1, members: 10 */
      	/* sum members: 60, holes: 1, sum holes: 4 */
        };
      Reported-by: Dmitry V. Levin <ldv@altlinux.org>
      Reported-by: Eugene Syromiatnikov <esyr@redhat.com>
      Fixes: 52775b33 ("bpf: offload: report device information about offloaded maps")
      Fixes: 675fc275 ("bpf: offload: report device information for offloaded programs")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      36f9814a
  14. 01 Jun, 2018 4 commits
  15. 31 May, 2018 5 commits
  16. 30 May, 2018 3 commits