1. 21 8月, 2018 1 次提交
  2. 15 8月, 2018 1 次提交
  3. 13 8月, 2018 3 次提交
    • H
      parisc: Drop architecture-specific ENOTSUP define · 93cb8e20
      Helge Deller 提交于
      parisc is the only Linux architecture which has defined a value for ENOTSUP.
      All other architectures #define ENOTSUP as EOPNOTSUPP in their libc headers.
      
      Having an own value for ENOTSUP which is different than EOPNOTSUPP often gives
      problems with userspace programs which expect both to be the same.  One such
      example is a build error in the libuv package, as can be seen in
      https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=900237.
      
      Since we dropped HP-UX support, there is no real benefit in keeping an own
      value for ENOTSUP. This patch drops the parisc value for ENOTSUP from the
      kernel sources. glibc needs no patch, it reuses the exported headers.
      Signed-off-by: NHelge Deller <deller@gmx.de>
      93cb8e20
    • D
      bpf: decouple btf from seq bpf fs dump and enable more maps · e8d2bec0
      Daniel Borkmann 提交于
      Commit a26ca7c9 ("bpf: btf: Add pretty print support to
      the basic arraymap") and 699c86d6 ("bpf: btf: add pretty
      print for hash/lru_hash maps") enabled support for BTF and
      dumping via BPF fs for array and hash/lru map. However, both
      can be decoupled from each other such that regular BPF maps
      can be supported for attaching BTF key/value information,
      while not all maps necessarily need to dump via map_seq_show_elem()
      callback.
      
      The basic sanity check which is a prerequisite for all maps
      is that key/value size has to match in any case, and some maps
      can have extra checks via map_check_btf() callback, e.g.
      probing certain types or indicating no support in general. With
      that we can also enable retrieving BTF info for per-cpu map
      types and lpm.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NYonghong Song <yhs@fb.com>
      e8d2bec0
    • L
      init: rename and re-order boot_cpu_state_init() · b5b1404d
      Linus Torvalds 提交于
      This is purely a preparatory patch for upcoming changes during the 4.19
      merge window.
      
      We have a function called "boot_cpu_state_init()" that isn't really
      about the bootup cpu state: that is done much earlier by the similarly
      named "boot_cpu_init()" (note lack of "state" in name).
      
      This function initializes some hotplug CPU state, and needs to run after
      the percpu data has been properly initialized.  It even has a comment to
      that effect.
      
      Except it _doesn't_ actually run after the percpu data has been properly
      initialized.  On x86 it happens to do that, but on at least arm and
      arm64, the percpu base pointers are initialized by the arch-specific
      'smp_prepare_boot_cpu()' hook, which ran _after_ boot_cpu_state_init().
      
      This had some unexpected results, and in particular we have a patch
      pending for the merge window that did the obvious cleanup of using
      'this_cpu_write()' in the cpu hotplug init code:
      
        -       per_cpu_ptr(&cpuhp_state, smp_processor_id())->state = CPUHP_ONLINE;
        +       this_cpu_write(cpuhp_state.state, CPUHP_ONLINE);
      
      which is obviously the right thing to do.  Except because of the
      ordering issue, it actually failed miserably and unexpectedly on arm64.
      
      So this just fixes the ordering, and changes the name of the function to
      be 'boot_cpu_hotplug_init()' to make it obvious that it's about cpu
      hotplug state, because the core CPU state was supposed to have already
      been done earlier.
      
      Marked for stable, since the (not yet merged) patch that will show this
      problem is marked for stable.
      Reported-by: NVlastimil Babka <vbabka@suse.cz>
      Reported-by: NMian Yousaf Kaukab <yousaf.kaukab@suse.com>
      Suggested-by: NCatalin Marinas <catalin.marinas@arm.com>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b5b1404d
  4. 11 8月, 2018 4 次提交
    • M
      bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT · 2dbb9b9e
      Martin KaFai Lau 提交于
      This patch adds a BPF_PROG_TYPE_SK_REUSEPORT which can select
      a SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY.  Like other
      non SK_FILTER/CGROUP_SKB program, it requires CAP_SYS_ADMIN.
      
      BPF_PROG_TYPE_SK_REUSEPORT introduces "struct sk_reuseport_kern"
      to store the bpf context instead of using the skb->cb[48].
      
      At the SO_REUSEPORT sk lookup time, it is in the middle of transiting
      from a lower layer (ipv4/ipv6) to a upper layer (udp/tcp).  At this
      point,  it is not always clear where the bpf context can be appended
      in the skb->cb[48] to avoid saving-and-restoring cb[].  Even putting
      aside the difference between ipv4-vs-ipv6 and udp-vs-tcp.  It is not
      clear if the lower layer is only ipv4 and ipv6 in the future and
      will it not touch the cb[] again before transiting to the upper
      layer.
      
      For example, in udp_gro_receive(), it uses the 48 byte NAPI_GRO_CB
      instead of IP[6]CB and it may still modify the cb[] after calling
      the udp[46]_lib_lookup_skb().  Because of the above reason, if
      sk->cb is used for the bpf ctx, saving-and-restoring is needed
      and likely the whole 48 bytes cb[] has to be saved and restored.
      
      Instead of saving, setting and restoring the cb[], this patch opts
      to create a new "struct sk_reuseport_kern" and setting the needed
      values in there.
      
      The new BPF_PROG_TYPE_SK_REUSEPORT and "struct sk_reuseport_(kern|md)"
      will serve all ipv4/ipv6 + udp/tcp combinations.  There is no protocol
      specific usage at this point and it is also inline with the current
      sock_reuseport.c implementation (i.e. no protocol specific requirement).
      
      In "struct sk_reuseport_md", this patch exposes data/data_end/len
      with semantic similar to other existing usages.  Together
      with "bpf_skb_load_bytes()" and "bpf_skb_load_bytes_relative()",
      the bpf prog can peek anywhere in the skb.  The "bind_inany" tells
      the bpf prog that the reuseport group is bind-ed to a local
      INANY address which cannot be learned from skb.
      
      The new "bind_inany" is added to "struct sock_reuseport" which will be
      used when running the new "BPF_PROG_TYPE_SK_REUSEPORT" bpf prog in order
      to avoid repeating the "bind INANY" test on
      "sk_v6_rcv_saddr/sk->sk_rcv_saddr" every time a bpf prog is run.  It can
      only be properly initialized when a "sk->sk_reuseport" enabled sk is
      adding to a hashtable (i.e. during "reuseport_alloc()" and
      "reuseport_add_sock()").
      
      The new "sk_select_reuseport()" is the main helper that the
      bpf prog will use to select a SO_REUSEPORT sk.  It is the only function
      that can use the new BPF_MAP_TYPE_REUSEPORT_ARRAY.  As mentioned in
      the earlier patch, the validity of a selected sk is checked in
      run time in "sk_select_reuseport()".  Doing the check in
      verification time is difficult and inflexible (consider the map-in-map
      use case).  The runtime check is to compare the selected sk's reuseport_id
      with the reuseport_id that we want.  This helper will return -EXXX if the
      selected sk cannot serve the incoming request (e.g. reuseport_id
      not match).  The bpf prog can decide if it wants to do SK_DROP as its
      discretion.
      
      When the bpf prog returns SK_PASS, the kernel will check if a
      valid sk has been selected (i.e. "reuse_kern->selected_sk != NULL").
      If it does , it will use the selected sk.  If not, the kernel
      will select one from "reuse->socks[]" (as before this patch).
      
      The SK_DROP and SK_PASS handling logic will be in the next patch.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      2dbb9b9e
    • M
      bpf: Introduce BPF_MAP_TYPE_REUSEPORT_SOCKARRAY · 5dc4c4b7
      Martin KaFai Lau 提交于
      This patch introduces a new map type BPF_MAP_TYPE_REUSEPORT_SOCKARRAY.
      
      To unleash the full potential of a bpf prog, it is essential for the
      userspace to be capable of directly setting up a bpf map which can then
      be consumed by the bpf prog to make decision.  In this case, decide which
      SO_REUSEPORT sk to serve the incoming request.
      
      By adding BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, the userspace has total control
      and visibility on where a SO_REUSEPORT sk should be located in a bpf map.
      The later patch will introduce BPF_PROG_TYPE_SK_REUSEPORT such that
      the bpf prog can directly select a sk from the bpf map.  That will
      raise the programmability of the bpf prog attached to a reuseport
      group (a group of sk serving the same IP:PORT).
      
      For example, in UDP, the bpf prog can peek into the payload (e.g.
      through the "data" pointer introduced in the later patch) to learn
      the application level's connection information and then decide which sk
      to pick from a bpf map.  The userspace can tightly couple the sk's location
      in a bpf map with the application logic in generating the UDP payload's
      connection information.  This connection info contact/API stays within the
      userspace.
      
      Also, when used with map-in-map, the userspace can switch the
      old-server-process's inner map to a new-server-process's inner map
      in one call "bpf_map_update_elem(outer_map, &index, &new_reuseport_array)".
      The bpf prog will then direct incoming requests to the new process instead
      of the old process.  The old process can finish draining the pending
      requests (e.g. by "accept()") before closing the old-fds.  [Note that
      deleting a fd from a bpf map does not necessary mean the fd is closed]
      
      During map_update_elem(),
      Only SO_REUSEPORT sk (i.e. which has already been added
      to a reuse->socks[]) can be used.  That means a SO_REUSEPORT sk that is
      "bind()" for UDP or "bind()+listen()" for TCP.  These conditions are
      ensured in "reuseport_array_update_check()".
      
      A SO_REUSEPORT sk can only be added once to a map (i.e. the
      same sk cannot be added twice even to the same map).  SO_REUSEPORT
      already allows another sk to be created for the same IP:PORT.
      There is no need to re-create a similar usage in the BPF side.
      
      When a SO_REUSEPORT is deleted from the "reuse->socks[]" (e.g. "close()"),
      it will notify the bpf map to remove it from the map also.  It is
      done through "bpf_sk_reuseport_detach()" and it will only be called
      if >=1 of the "reuse->sock[]" has ever been added to a bpf map.
      
      The map_update()/map_delete() has to be in-sync with the
      "reuse->socks[]".  Hence, the same "reuseport_lock" used
      by "reuse->socks[]" has to be used here also. Care has
      been taken to ensure the lock is only acquired when the
      adding sk passes some strict tests. and
      freeing the map does not require the reuseport_lock.
      
      The reuseport_array will also support lookup from the syscall
      side.  It will return a sock_gen_cookie().  The sock_gen_cookie()
      is on-demand (i.e. a sk's cookie is not generated until the very
      first map_lookup_elem()).
      
      The lookup cookie is 64bits but it goes against the logical userspace
      expectation on 32bits sizeof(fd) (and as other fd based bpf maps do also).
      It may catch user in surprise if we enforce value_size=8 while
      userspace still pass a 32bits fd during update.  Supporting different
      value_size between lookup and update seems unintuitive also.
      
      We also need to consider what if other existing fd based maps want
      to return 64bits value from syscall's lookup in the future.
      Hence, reuseport_array supports both value_size 4 and 8, and
      assuming user will usually use value_size=4.  The syscall's lookup
      will return ENOSPC on value_size=4.  It will will only
      return 64bits value from sock_gen_cookie() when user consciously
      choose value_size=8 (as a signal that lookup is desired) which then
      requires a 64bits value in both lookup and update.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      5dc4c4b7
    • Y
      bpf: btf: add pretty print for hash/lru_hash maps · 699c86d6
      Yonghong Song 提交于
      Commit a26ca7c9 ("bpf: btf: Add pretty print support to
      the basic arraymap") added pretty print support to array map.
      This patch adds pretty print for hash and lru_hash maps.
      The following example shows the pretty-print result of
      a pinned hashmap:
      
          struct map_value {
                  int count_a;
                  int count_b;
          };
      
          cat /sys/fs/bpf/pinned_hash_map:
      
          87907: {87907,87908}
          57354: {37354,57355}
          76625: {76625,76626}
          ...
      Signed-off-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      699c86d6
    • Y
      bpf: fix bpffs non-array map seq_show issue · dc1508a5
      Yonghong Song 提交于
      In function map_seq_next() of kernel/bpf/inode.c,
      the first key will be the "0" regardless of the map type.
      This works for array. But for hash type, if it happens
      key "0" is in the map, the bpffs map show will miss
      some items if the key "0" is not the first element of
      the first bucket.
      
      This patch fixed the issue by guaranteeing to get
      the first element, if the seq_show is just started,
      by passing NULL pointer key to map_get_next_key() callback.
      This way, no missing elements will occur for
      bpffs hash table show even if key "0" is in the map.
      
      Fixes: a26ca7c9 ("bpf: btf: Add pretty print support to the basic arraymap")
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      dc1508a5
  5. 10 8月, 2018 2 次提交
  6. 09 8月, 2018 2 次提交
  7. 07 8月, 2018 2 次提交
    • R
      bpf: introduce update_effective_progs() · 85fc4b16
      Roman Gushchin 提交于
      __cgroup_bpf_attach() and __cgroup_bpf_detach() functions have
      a good amount of duplicated code, which is possible to eliminate
      by introducing the update_effective_progs() helper function.
      
      The update_effective_progs() calls compute_effective_progs()
      and then in case of success it calls activate_effective_progs()
      for each descendant cgroup. In case of failure (OOM), it releases
      allocated prog arrays and return the error code.
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      85fc4b16
    • T
      cpu/hotplug: Fix SMT supported evaluation · bc2d8d26
      Thomas Gleixner 提交于
      Josh reported that the late SMT evaluation in cpu_smt_state_init() sets
      cpu_smt_control to CPU_SMT_NOT_SUPPORTED in case that 'nosmt' was supplied
      on the kernel command line as it cannot differentiate between SMT disabled
      by BIOS and SMT soft disable via 'nosmt'. That wreckages the state and
      makes the sysfs interface unusable.
      
      Rework this so that during bringup of the non boot CPUs the availability of
      SMT is determined in cpu_smt_allowed(). If a newly booted CPU is not a
      'primary' thread then set the local cpu_smt_available marker and evaluate
      this explicitely right after the initial SMP bringup has finished.
      
      SMT evaulation on x86 is a trainwreck as the firmware has all the
      information _before_ booting the kernel, but there is no interface to query
      it.
      
      Fixes: 73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
      Reported-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      bc2d8d26
  8. 06 8月, 2018 3 次提交
  9. 03 8月, 2018 12 次提交
  10. 02 8月, 2018 5 次提交
  11. 01 8月, 2018 3 次提交
  12. 31 7月, 2018 2 次提交
    • M
      arm64: perf: Add cap_user_time aarch64 · 9d2dcc8f
      Michael O'Farrell 提交于
      It is useful to get the running time of a thread.  Doing so in an
      efficient manner can be important for performance of user applications.
      Avoiding system calls in `clock_gettime` when handling
      CLOCK_THREAD_CPUTIME_ID is important.  Other clocks are handled in the
      VDSO, but CLOCK_THREAD_CPUTIME_ID falls back on the system call.
      
      CLOCK_THREAD_CPUTIME_ID is not handled in the VDSO since it would have
      costs associated with maintaining updated user space accessible time
      offsets.  These offsets have to be updated everytime the a thread is
      scheduled/descheduled.  However, for programs regularly checking the
      running time of a thread, this is a performance improvement.
      
      This patch takes a middle ground, and adds support for cap_user_time an
      optional feature of the perf_event API.  This way costs are only
      incurred when the perf_event api is enabled.  This is done the same way
      as it is in x86.
      
      Ultimately this allows calculating the thread running time in userspace
      on aarch64 as follows (adapted from perf_event_open manpage):
      
      u32 seq, time_mult, time_shift;
      u64 running, count, time_offset, quot, rem, delta;
      struct perf_event_mmap_page *pc;
      pc = buf;  // buf is the perf event mmaped page as documented in the API.
      
      if (pc->cap_usr_time) {
          do {
              seq = pc->lock;
              barrier();
              running = pc->time_running;
      
              count = readCNTVCT_EL0();  // Read ARM hardware clock.
              time_offset = pc->time_offset;
              time_mult   = pc->time_mult;
              time_shift  = pc->time_shift;
      
              barrier();
          } while (pc->lock != seq);
      
          quot = (count >> time_shift);
          rem = count & (((u64)1 << time_shift) - 1);
          delta = time_offset + quot * time_mult +
                  ((rem * time_mult) >> time_shift);
      
          running += delta;
          // running now has the current nanosecond level thread time.
      }
      
      Summary of changes in the patch:
      
      For aarch64 systems, make arch_perf_update_userpage update the timing
      information stored in the perf_event page.  Requiring the following
      calculations:
        - Calculate the appropriate time_mult, and time_shift factors to convert
          ticks to nano seconds for the current clock frequency.
        - Adjust the mult and shift factors to avoid shift factors of 32 bits.
          (possibly unnecessary)
        - The time_offset userspace should apply when doing calculations:
          negative the current sched time (now), because time_running and
          time_enabled fields of the perf_event page have just been updated.
      Toggle bits to appropriate values:
        - Enable cap_user_time
      Signed-off-by: NMichael O'Farrell <micpof@gmail.com>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      9d2dcc8f
    • Y
      audit: fix potential null dereference 'context->module.name' · b305f7ed
      Yi Wang 提交于
      The variable 'context->module.name' may be null pointer when
      kmalloc return null, so it's better to check it before using
      to avoid null dereference.
      Another one more thing this patch does is using kstrdup instead
      of (kmalloc + strcpy), and signal a lost record via audit_log_lost.
      
      Cc: stable@vger.kernel.org # 4.11
      Signed-off-by: NYi Wang <wang.yi59@zte.com.cn>
      Reviewed-by: NJiang Biao <jiang.biao2@zte.com.cn>
      Reviewed-by: NRichard Guy Briggs <rgb@redhat.com>
      Signed-off-by: NPaul Moore <paul@paul-moore.com>
      b305f7ed