1. 24 March 2018, 1 commit
    • bpf: Remove struct bpf_verifier_env argument from print_bpf_insn · abe08840
      Authored by Jiri Olsa
      We use print_bpf_insn in user space (bpftool and soon perf),
      so it'd be nice to keep it generic and strip it of the kernel-only
      struct bpf_verifier_env argument.
      
      This argument can be safely removed, because its users can
      use the struct bpf_insn_cbs::private_data to pass it.
      
      By changing the argument type we can no longer have a clean
      'verbose' alias to 'bpf_verifier_log_write' in verifier.c.
      Instead we add a 'verbose' cb_print callback and remove the alias.
      
      This way the new cb_print callback is in place, and all the
      'verbose(env, ...)' calls in verifier.c cleanly cast to
      'verbose(void *, ...)', so no other change is needed.
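      
      A minimal sketch of how a user-space caller might wire this up after the
      change (the print_insn() helper and the FILE-based private_data are
      illustrative, not taken from the patch):
      
        #include <stdarg.h>
        #include <stdio.h>
      
        /* cb_print now receives the caller's private_data, not a verifier env */
        static void print_insn(void *private_data, const char *fmt, ...)
        {
                FILE *out = private_data;
                va_list args;
      
                va_start(args, fmt);
                vfprintf(out, fmt, args);
                va_end(args);
        }
      
        /* ... inside the disassembly loop ... */
        const struct bpf_insn_cbs cbs = {
                .cb_print     = print_insn,
                .private_data = stdout,
        };
      
        print_bpf_insn(&cbs, &insns[i], true /* allow_ptr_leaks */);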
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      abe08840
  2. 21 March 2018, 2 commits
  3. 20 March 2018, 2 commits
    • bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data · 4f738adb
      Authored by John Fastabend
      This implements a BPF ULP layer to allow policy enforcement and
      monitoring at the socket layer. In order to support this a new
      program type BPF_PROG_TYPE_SK_MSG is used to run the policy at
      the sendmsg/sendpage hook. To attach the policy to sockets a
      sockmap is used with a new program attach type BPF_SK_MSG_VERDICT.
      
      As with previous sockmap usage, when a sock is added to a sockmap
      via a map update and the map has a BPF_SK_MSG_VERDICT program
      attached, the BPF ULP layer is created on the socket and the
      attached BPF_PROG_TYPE_SK_MSG program is run for every msg in the
      sendmsg case and for every page/offset in the sendpage case.
      
      BPF_PROG_TYPE_SK_MSG Semantics/API:
      
      BPF_PROG_TYPE_SK_MSG supports only two return codes, SK_PASS and
      SK_DROP. Returning SK_DROP frees the copied data in the sendmsg
      case and leaves the data untouched in the sendpage case; both cases
      return -EACCES to the user. Returning SK_PASS allows the msg to
      be sent.
      
      In the sendmsg case data is copied into kernel space buffers before
      running the BPF program. The kernel space buffers are stored in a
      scatterlist object where each element is a kernel memory buffer.
      Some effort is made to coalesce data from the sendmsg call here.
      For example a sendmsg call with many one byte iov entries will
      likely be pushed into a single entry. The BPF program is run with
      data pointers (start/end) pointing to the first sg element.
      
      In the sendpage case data is not copied. We opt not to copy the
      data by default here, because the BPF infrastructure does not
      know what bytes will be needed nor when they will be needed. So
      copying all bytes may be wasteful. Because of this the initial
      start/end data pointers are (0,0). Meaning no data can be read or
      written. This avoids reading data that may be modified by the
      user. A new helper is added later in this series if reading and
      writing the data is needed. The helper call will do a copy by
      default so that the page is exclusively owned by the BPF call.
      
      The verdict from the BPF_PROG_TYPE_SK_MSG applies to the entire msg
      in the sendmsg() case and the entire page/offset in the sendpage case.
      This avoids ambiguity on how to handle mixed return codes in the
      sendmsg case. Again, a helper is added later in the series if
      a verdict needs to apply to multiple system calls and/or only
      to a subpart of the message currently being processed.
      
      The helper msg_redirect_map() can be used to select the socket to
      send the data on. This is used similarly to existing redirect use
      cases and allows policy to redirect msgs.
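      
      A hedged sketch of what such a BPF_PROG_TYPE_SK_MSG program could look
      like, assuming the selftests-style bpf_helpers.h stubs; the map name and
      sizes are made up:
      
        #include <linux/bpf.h>
        #include "bpf_helpers.h"
      
        struct bpf_map_def SEC("maps") my_sock_map = {
                .type        = BPF_MAP_TYPE_SOCKMAP,
                .key_size    = sizeof(int),
                .value_size  = sizeof(int),
                .max_entries = 20,
        };
      
        SEC("sk_msg")
        int msg_verdict(struct sk_msg_md *msg)
        {
                int key = 0;
      
                /* redirect every msg to the socket stored at index 0,
                 * or drop it if the lookup/redirect fails
                 */
                return bpf_msg_redirect_map(msg, &my_sock_map, key, 0);
        }
      
        char _license[] SEC("license") = "GPL";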
      
      Pseudo code simple example:
      
      The basic logic to attach a program to a socket is as follows,
      
        // load the programs
        bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG,
      		&obj, &msg_prog);
      
        // lookup the sockmap
        bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map");
      
        // get fd for sockmap
        map_fd_msg = bpf_map__fd(bpf_map_msg);
      
        // attach program to sockmap
        bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0);
      
      Adding sockets to the map is done in the normal way,
      
        // Add a socket 'fd' to sockmap at location 'i'
        bpf_map_update_elem(map_fd_msg, &i, fd, BPF_ANY);
      
      After the above any socket attached to "my_sock_map", in this case
      'fd', will run the BPF msg verdict program (msg_prog) on every
      sendmsg and sendpage system call.
      
      For a complete example see BPF selftests or sockmap samples.
      
      Implementation notes:
      
      It seemed the simplest, to me at least, to use a refcnt to ensure
      psock is not lost across the sendmsg copy into the sg, the bpf program
      running on the data in sg_data, and the final pass to the TCP stack.
      Some performance testing may show a better method to do this and avoid
      the refcnt cost, but for now use the simpler method.
      
      Another item that will come after basic support is in place is
      support for the MSG_MORE flag. At the moment we call sendpages even
      if the MSG_MORE flag is set. An enhancement would be to collect the
      pages into a larger scatterlist and pass it down the stack. Notice that
      bpf_tcp_sendmsg() could support this with some additional state saved
      across sendmsg calls. I built the code to support this without having
      to do refactoring work. Other features TBD include ZEROCOPY and the
      TCP_RECV_QUEUE/TCP_NO_QUEUE support. This will follow initial series
      shortly.
      
      Future work could improve size limits on the scatterlist rings used
      here. Currently, we use MAX_SKB_FRAGS simply because this was being
      used already in the TLS case. Future work could extend the kernel sk
      APIs to tune this depending on workload. This is a trade-off
      between memory usage and throughput performance.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      4f738adb
    • sockmap: convert refcnt to an atomic refcnt · ffa35660
      Authored by John Fastabend
      The sockmap refcnt has up until now been protected by the
      sk_callback_lock(), so it has not actually needed any locking of its
      own. The counter itself tracks the lifetime of the psock object.
      Sockets in a sockmap have a lifetime that is independent of the
      map they are part of. This is possible because a single socket may
      be in multiple maps. When this happens we can only release the
      psock data associated with the socket when the refcnt reaches
      zero. There are three paths that decrement a sock reference:
      first, the normal sockmap path, where the user deletes the socket
      from the map; second, the map itself is removed and all sockets in
      it are removed, a delete path similar to case 1; and third, an
      asynchronous socket event such as the socket being closed. The
      last case handles removing sockets that are no longer available.
      For completeness, although inc does not pose any problems in this
      patch series, the inc case only happens when a psock is added to a
      map.
      
      Next we plan to add another socket prog type to handle policy and
      monitoring on the TX path. When we do this, however, we will need to
      keep a reference count open across the sendmsg/sendpage call, and
      holding the sk_callback_lock() there (on every send) seems less than
      ideal; it may also sleep in cases where we hit memory pressure.
      Instead of dealing with these issues in some clever way, simply make
      the reference count a refcount_t type and use proper atomic ops.
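      
      A rough sketch of the resulting pattern; the helper names below are
      illustrative, not necessarily the ones used in the patch:
      
        #include <linux/refcount.h>
      
        struct smap_psock {
                refcount_t refcnt;
                /* ... */
        };
      
        /* TX path: hold a reference across sendmsg/sendpage without
         * taking sk_callback_lock on every send
         */
        static void smap_psock_hold(struct smap_psock *psock)
        {
                refcount_inc(&psock->refcnt);
        }
      
        static void smap_psock_put(struct smap_psock *psock)
        {
                if (refcount_dec_and_test(&psock->refcnt))
                        smap_destroy_psock(psock);  /* hypothetical teardown helper */
        }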
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      ffa35660
  4. 15 March 2018, 1 commit
    • bpf: extend stackmap to save binary_build_id+offset instead of address · 615755a7
      Authored by Song Liu
      Currently, a bpf stackmap stores an address for each entry in the call
      trace. To map these addresses to user space files, it is necessary to
      maintain the mapping from these virtual addresses to symbols in the binary. Usually,
      the user space profiler (such as perf) has to scan /proc/pid/maps at the
      beginning of profiling, and monitor mmap2() calls afterwards. Given the
      cost of maintaining the address map, this solution is not practical for
      system wide profiling that is always on.
      
      This patch tries to solve this problem with a variation of stackmap. This
      variation is enabled by flag BPF_F_STACK_BUILD_ID. Instead of storing
      addresses, the variation stores ELF file build_id + offset.
      
      Build ID is a 20-byte unique identifier for ELF files. The following
      command shows the Build ID of /bin/bash:
      
        [user@]$ readelf -n /bin/bash
        ...
          Build ID: XXXXXXXXXX
        ...
      
      With BPF_F_STACK_BUILD_ID, bpf_get_stackid() tries to parse Build ID
      for each entry in the call trace, and translate it into the following
      struct:
      
        struct bpf_stack_build_id_offset {
                __s32           status;
                unsigned char   build_id[BPF_BUILD_ID_SIZE];
                union {
                        __u64   offset;
                        __u64   ip;
                };
        };
      
      The search for the build_id is limited to the first page of the file, and
      this page should be in the page cache. Otherwise, we fall back to storing
      the ip for this entry (the ip field in struct bpf_stack_build_id_offset).
      This requires the build_id to be stored in the first page. A quick survey
      of binary and dynamic library files in a few different systems shows that
      almost all binary and dynamic library files have the build_id in the first page.
      
      Build_id is only meaningful for a user stack. If a kernel stack is added
      to a stackmap with BPF_F_STACK_BUILD_ID, it will automatically fall back
      to storing only the ip (status == BPF_STACK_BUILD_ID_IP). Similarly, if
      the build_id lookup fails for some reason, it will also fall back to
      storing the ip.
      
      User space can access struct bpf_stack_build_id_offset with bpf
      syscall BPF_MAP_LOOKUP_ELEM. It is necessary for user space to
      maintain a mapping from build id to binary files. This mostly static
      mapping is much easier to maintain than per-process address maps.
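      
      A hedged user-space sketch of reading one such trace via
      BPF_MAP_LOOKUP_ELEM; the struct is repeated locally, and MAX_DEPTH plus
      the status handling are assumptions for illustration:
      
        #include <bpf/bpf.h>            /* libbpf bpf_map_lookup_elem() wrapper */
        #include <linux/types.h>
        #include <stdio.h>
      
        #define BPF_BUILD_ID_SIZE 20
        #define MAX_DEPTH         127   /* assumed depth used at map creation */
      
        struct bpf_stack_build_id_offset {
                __s32           status;
                unsigned char   build_id[BPF_BUILD_ID_SIZE];
                union {
                        __u64   offset;
                        __u64   ip;
                };
        };
      
        static void dump_trace(int stackmap_fd, __u32 stackid)
        {
                struct bpf_stack_build_id_offset trace[MAX_DEPTH] = {};
                int i, j;
      
                if (bpf_map_lookup_elem(stackmap_fd, &stackid, trace))
                        return;
      
                for (i = 0; i < MAX_DEPTH; i++) {
                        for (j = 0; j < BPF_BUILD_ID_SIZE; j++)
                                printf("%02x", trace[i].build_id[j]);
                        /* status tells whether .offset or the raw .ip is valid */
                        printf(" +0x%llx (status %d)\n",
                               (unsigned long long)trace[i].offset, trace[i].status);
                }
        }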
      
      Note: Stackmap with build_id only works in non-nmi context at this time.
      This is because we need to take mm->mmap_sem for find_vma(). If this
      changes, we would like to allow build_id lookup in nmi context.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      615755a7
  5. 14 March 2018, 3 commits
  6. 12 March 2018, 1 commit
  7. 10 March 2018, 1 commit
  8. 09 March 2018, 4 commits
    • rtmutex: Make rt_mutex_futex_unlock() safe for irq-off callsites · 6b0ef92f
      Authored by Boqun Feng
      When running rcutorture with TREE03 config, CONFIG_PROVE_LOCKING=y, and
      kernel cmdline argument "rcutorture.gp_exp=1", lockdep reports a
      HARDIRQ-safe->HARDIRQ-unsafe deadlock:
      
       ================================
       WARNING: inconsistent lock state
       4.16.0-rc4+ #1 Not tainted
       --------------------------------
       inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
       takes:
       __schedule+0xbe/0xaf0
       {IN-HARDIRQ-W} state was registered at:
         _raw_spin_lock+0x2a/0x40
         scheduler_tick+0x47/0xf0
      ...
       other info that might help us debug this:
        Possible unsafe locking scenario:
              CPU0
              ----
         lock(&rq->lock);
         <Interrupt>
           lock(&rq->lock);
        *** DEADLOCK ***
       1 lock held by rcu_torture_rea/724:
       rcu_torture_read_lock+0x0/0x70
       stack backtrace:
       CPU: 2 PID: 724 Comm: rcu_torture_rea Not tainted 4.16.0-rc4+ #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014
       Call Trace:
        lock_acquire+0x90/0x200
        ? __schedule+0xbe/0xaf0
        _raw_spin_lock+0x2a/0x40
        ? __schedule+0xbe/0xaf0
        __schedule+0xbe/0xaf0
        preempt_schedule_irq+0x2f/0x60
        retint_kernel+0x1b/0x2d
       RIP: 0010:rcu_read_unlock_special+0x0/0x680
        ? rcu_torture_read_unlock+0x60/0x60
        __rcu_read_unlock+0x64/0x70
        rcu_torture_read_unlock+0x17/0x60
        rcu_torture_reader+0x275/0x450
        ? rcutorture_booster_init+0x110/0x110
        ? rcu_torture_stall+0x230/0x230
        ? kthread+0x10e/0x130
        kthread+0x10e/0x130
        ? kthread_create_worker_on_cpu+0x70/0x70
        ? call_usermodehelper_exec_async+0x11a/0x150
        ret_from_fork+0x3a/0x50
      
      This happens with the following event sequence:
      
      	preempt_schedule_irq();
      	  local_irq_enable();
      	  __schedule():
      	    local_irq_disable(); // irq off
      	    ...
      	    rcu_note_context_switch():
      	      rcu_note_preempt_context_switch():
      	        rcu_read_unlock_special():
      	          local_irq_save(flags);
      	          ...
      		  raw_spin_unlock_irqrestore(...,flags); // irq remains off
      	          rt_mutex_futex_unlock():
      	            raw_spin_lock_irq();
      	            ...
      	            raw_spin_unlock_irq(); // accidentally set irq on
      
      	    <return to __schedule()>
      	    rq_lock():
      	      raw_spin_lock(); // acquiring rq->lock with irq on
      
      which means rq->lock becomes a HARDIRQ-unsafe lock, which can cause
      deadlocks in scheduler code.
      
      This problem was introduced by commit 02a7c234 ("rcu: Suppress
      lockdep false-positive ->boost_mtx complaints"). That change
      introduced a user of rt_mutex_futex_unlock() with irqs off.
      
      To fix this, replace the *lock_irq() in rt_mutex_futex_unlock() with
      *lock_irq{save,restore}() to make it safe to call rt_mutex_futex_unlock()
      with irq off.
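      
      The shape of that change, as a hedged sketch with the wakeup details
      elided:
      
        void rt_mutex_futex_unlock(struct rt_mutex *lock)
        {
                unsigned long flags;
      
                /* was: raw_spin_lock_irq(&lock->wait_lock); */
                raw_spin_lock_irqsave(&lock->wait_lock, flags);
                /* ... pick the top waiter and queue its wakeup ... */
                /* was: raw_spin_unlock_irq(&lock->wait_lock); */
                raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
                /* ... perform the deferred wakeup ... */
        }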
      
      Fixes: 02a7c234 ("rcu: Suppress lockdep false-positive ->boost_mtx complaints")
      Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Link: https://lkml.kernel.org/r/20180309065630.8283-1-boqun.feng@gmail.com
      6b0ef92f
    • bpf: comment why dots in filenames under BPF virtual FS are not allowed · 6d8cb045
      Authored by Quentin Monnet
      When pinning a file under the BPF virtual file system (traditionally
      /sys/fs/bpf), using a dot in the name of the location to pin at is not
      allowed. For example, trying to pin at "/sys/fs/bpf/foo.bar" will be
      rejected with -EPERM.
      
      This check was introduced at the same time as the BPF file system
      itself, with commit b2197755 ("bpf: add support for persistent
      maps/progs"). At this time, it was checked in a function called
      "bpf_dname_reserved()", which made clear that using a dot was reserved
      for future extensions.
      
      This function disappeared and the check was moved elsewhere with commit
      0c93b7d8 ("bpf: reject invalid names right in ->lookup()"), and the
      meaning of the dot ban was lost.
      
      The present commit simply adds a comment in the source to explain to the
      reader that the use of dots is reserved for future extensions.
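      
      For reference, a hedged sketch of the check and the added comment in
      bpf_lookup():
      
        static struct dentry *bpf_lookup(struct inode *dir, struct dentry *dentry,
                                         unsigned int flags)
        {
                /* Dots in names (e.g. "/sys/fs/bpf/foo.bar") are reserved for
                 * future extensions of the BPF file system.
                 */
                if (strchr(dentry->d_name.name, '.'))
                        return ERR_PTR(-EPERM);
      
                return simple_lookup(dir, dentry, flags);
        }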
      Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      6d8cb045
    • perf/core: Fix ctx_event_type in ctx_resched() · bd903afe
      Authored by Song Liu
      In ctx_resched(), EVENT_FLEXIBLE should be sched_out when EVENT_PINNED is
      added. However, ctx_resched() calculates ctx_event_type before checking
      this condition. As a result, pinned events will NOT get higher priority
      than flexible events.
      
      The following shows this issue on an Intel CPU (where ref-cycles can
      only use one hardware counter).
      
        1. First start:
             perf stat -C 0 -e ref-cycles  -I 1000
        2. Then, in the second console, run:
             perf stat -C 0 -e ref-cycles:D -I 1000
      
      The second perf invocation uses a pinned event, which is expected to have
      higher priority. However, because of the issue in ctx_resched(), it is
      never run.
      
      This patch fixes this by calculating ctx_event_type after re-evaluating
      event_type.
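      
      A hedged sketch of the reordering; the actual scheduling calls are
      unchanged and elided:
      
        static void ctx_resched(struct perf_cpu_context *cpuctx,
                                struct perf_event_context *task_ctx,
                                enum event_type_t event_type)
        {
                enum event_type_t ctx_event_type;
      
                /* pinned events force flexible events to be scheduled out ... */
                if (event_type & EVENT_PINNED)
                        event_type |= EVENT_FLEXIBLE;
      
                /* ... so derive ctx_event_type only after that check */
                ctx_event_type = event_type & EVENT_ALL;
      
                /* ... ctx_sched_out() / perf_event_sched_in() as before ... */
        }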
      Reported-by: Ephraim Park <ephiepark@fb.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <jolsa@redhat.com>
      Cc: <kernel-team@fb.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: 487f05e1 ("perf/core: Optimize event rescheduling on active contexts")
      Link: http://lkml.kernel.org/r/20180306055504.3283731-1-songliubraving@fb.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      bd903afe
    • module: propagate error in modules_open() · 3f553b30
      Authored by Leon Yu
      Otherwise the kernel can oops later in seq_release() due to dereferencing
      a NULL file->private_data, which is only set if seq_open() succeeds.
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
      IP: seq_release+0xc/0x30
      Call Trace:
       close_pdeo+0x37/0xd0
       proc_reg_release+0x5d/0x60
       __fput+0x9d/0x1d0
       ____fput+0x9/0x10
       task_work_run+0x75/0x90
       do_exit+0x252/0xa00
       do_group_exit+0x36/0xb0
       SyS_exit_group+0xf/0x10
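      
      A hedged sketch of the fix, with the ->private address-exposure handling
      kept as in the Fixes commit:
      
        static int modules_open(struct inode *inode, struct file *file)
        {
                int err = seq_open(file, &modules_op);
      
                if (!err) {
                        struct seq_file *m = file->private_data;
      
                        m->private = kallsyms_show_value() ? NULL : (void *)8ul;
                }
      
                return err;     /* previously: return 0; */
        }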
      
      Fixes: 516fb7f2 ("/proc/module: use the same logic as /proc/kallsyms for address exposure")
      Cc: Jessica Yu <jeyu@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: stable@vger.kernel.org # 4.15+
      Signed-off-by: Leon Yu <chianglungyu@gmail.com>
      Signed-off-by: Jessica Yu <jeyu@kernel.org>
      3f553b30
  9. 08 March 2018, 1 commit
  10. 07 March 2018, 1 commit
  11. 03 March 2018, 2 commits
    • memremap: fix softlockup reports at teardown · 949b9325
      Authored by Dan Williams
      The cond_resched() currently in the setup path needs to be duplicated in
      the teardown path. Rather than require each instance of
      for_each_device_pfn() to open code the same sequence, embed it in the
      helper.
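      
      A hedged sketch of the iterator with the embedded cond_resched(); the
      1024-pfn stride is illustrative:
      
        static unsigned long pfn_next(unsigned long pfn)
        {
                if (pfn % 1024 == 0)
                        cond_resched();
                return pfn + 1;
        }
      
        #define for_each_device_pfn(pfn, pgmap) \
                for (pfn = pfn_first(pgmap); pfn < pfn_end(pgmap); pfn = pfn_next(pfn))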
      
      Link: https://github.com/intel/ixpdimm_sw/issues/11
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: <stable@vger.kernel.org>
      Fixes: 71389703 ("mm, zone_device: Replace {get, put}_zone_device_page()...")
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      949b9325
    • signals: Move put_compat_sigset to compat.h to silence hardened usercopy · fde9fc76
      Authored by Matt Redfearn
      Since commit afcc90f8 ("usercopy: WARN() on slab cache usercopy
      region violations"), MIPS systems booting with a compat root filesystem
      emit a warning when copying compat siginfo to userspace:
      
      WARNING: CPU: 0 PID: 953 at mm/usercopy.c:81 usercopy_warn+0x98/0xe8
      Bad or missing usercopy whitelist? Kernel memory exposure attempt
      detected from SLAB object 'task_struct' (offset 1432, size 16)!
      Modules linked in:
      CPU: 0 PID: 953 Comm: S01logging Not tainted 4.16.0-rc2 #10
      Stack : ffffffff808c0000 0000000000000000 0000000000000001 65ac85163f3bdc4a
      	65ac85163f3bdc4a 0000000000000000 90000000ff667ab8 ffffffff808c0000
      	00000000000003f8 ffffffff808d0000 00000000000000d1 0000000000000000
      	000000000000003c 0000000000000000 ffffffff808c8ca8 ffffffff808d0000
      	ffffffff808d0000 ffffffff80810000 fffffc0000000000 ffffffff80785c30
      	0000000000000009 0000000000000051 90000000ff667eb0 90000000ff667db0
      	000000007fe0d938 0000000000000018 ffffffff80449958 0000000020052798
      	ffffffff808c0000 90000000ff664000 90000000ff667ab0 00000000100c0000
      	ffffffff80698810 0000000000000000 0000000000000000 0000000000000000
      	0000000000000000 0000000000000000 ffffffff8010d02c 65ac85163f3bdc4a
      	...
      Call Trace:
      [<ffffffff8010d02c>] show_stack+0x9c/0x130
      [<ffffffff80698810>] dump_stack+0x90/0xd0
      [<ffffffff80137b78>] __warn+0x100/0x118
      [<ffffffff80137bdc>] warn_slowpath_fmt+0x4c/0x70
      [<ffffffff8021e4a8>] usercopy_warn+0x98/0xe8
      [<ffffffff8021e68c>] __check_object_size+0xfc/0x250
      [<ffffffff801bbfb8>] put_compat_sigset+0x30/0x88
      [<ffffffff8011af24>] setup_rt_frame_n32+0xc4/0x160
      [<ffffffff8010b8b4>] do_signal+0x19c/0x230
      [<ffffffff8010c408>] do_notify_resume+0x60/0x78
      [<ffffffff80106f50>] work_notifysig+0x10/0x18
      ---[ end trace 88fffbf69147f48a ]---
      
      Commit 5905429a ("fork: Provide usercopy whitelisting for
      task_struct") noted that:
      
      "While the blocked and saved_sigmask fields of task_struct are copied to
      userspace (via sigmask_to_save() and setup_rt_frame()), it is always
      copied with a static length (i.e. sizeof(sigset_t))."
      
      However, this is not true in the case of compat signals, whose sigset
      is copied by put_compat_sigset(), which receives the size as an argument.
      
      At most call sites, put_compat_sigset is copying a sigset from the
      current task_struct. This triggers a warning when
      CONFIG_HARDENED_USERCOPY is active. However, by marking this function as
      static inline, the warning can be avoided because in all of these cases
      the size is constant at compile time, which is allowed. The only site
      where this is not the case is the rt_sigpending syscall handler, but
      there the copy is made from a stack-local variable, so it does not
      trigger the warning.
      
      Move put_compat_sigset to compat.h, and mark it static inline. This
      fixes the WARN on MIPS.
      
      Fixes: afcc90f8 ("usercopy: WARN() on slab cache usercopy region violations")
      Signed-off-by: Matt Redfearn <matt.redfearn@mips.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: "Dmitry V . Levin" <ldv@altlinux.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: kernel-hardening@lists.openwall.com
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/18639/
      Signed-off-by: James Hogan <jhogan@kernel.org>
      fde9fc76
  12. 01 March 2018, 1 commit
    • timers: Forward timer base before migrating timers · c52232a4
      Authored by Lingutla Chandrasekhar
      On CPU hotunplug the enqueued timers of the unplugged CPU are migrated to a
      live CPU. This happens from the control thread which initiated the unplug.
      
      If the CPU on which the control thread runs came out from a longer idle
      period then the base clock of that CPU might be stale because the control
      thread runs prior to any event which forwards the clock.
      
      In such a case the timers from the unplugged CPU are queued on the live CPU
      based on the stale clock, which can cause large delays due to the increased
      granularity of the outer timer wheels which are far away from the base clock.
      
      But there is a worse problem than that. The following sequence of events
      illustrates it:
      
       - CPU0 timer1 is queued expires = 59969 and base->clk = 59131.
      
         The timer is queued at wheel level 2, with resulting expiry time = 60032
         (due to level granularity).
      
       - CPU1 enters idle @60007, with next timer expiry @60020.
      
       - CPU0 is hot-unplugged at @60009
      
       - CPU1 exits idle and runs the control thread which migrates the
         timers from CPU0
      
         timer1 is now queued in level 0 for immediate handling in the next
         softirq because the requested expiry time 59969 is before CPU1 base->clk
         60007
      
       - CPU1 runs code which forwards the base clock, which succeeds because
         the next expiring timer, which was collected at idle entry time, is
         still set to 60020.
      
         So it forwards beyond 60007 and therefore fails to expire the migrated
         timer1. That timer gets expired when the wheel wraps around again, which
         takes between 63 and 630ms depending on the HZ setting.
      
      Address both problems by invoking forward_timer_base() for the control
      CPU's timer base. All other places which might run into a similar problem
      (mod_timer()/add_timer_on()) already invoke forward_timer_base() to avoid
      that.
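      
      A hedged sketch of where the call lands in the hotplug migration path;
      locking and the per-bucket migration are elided:
      
        int timers_dead_cpu(unsigned int cpu)
        {
                struct timer_base *old_base, *new_base;
                int b;
      
                for (b = 0; b < NR_BASES; b++) {
                        old_base = per_cpu_ptr(&timer_bases[b], cpu);
                        new_base = get_cpu_ptr(&timer_bases[b]);
      
                        /* ... lock both bases ... */
      
                        /* The control CPU might have been idle for a long time;
                         * its clock may be stale, so forward it before queueing
                         * the unplugged CPU's timers on it.
                         */
                        forward_timer_base(new_base);
      
                        /* ... migrate every wheel bucket from old_base ... */
      
                        /* ... unlock, put_cpu_ptr() ... */
                }
                return 0;
        }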
      
      [ tglx: Massaged comment and changelog ]
      
      Fixes: a683f390 ("timers: Forward the wheel clock whenever possible")
      Co-developed-by: Neeraj Upadhyay <neeraju@codeaurora.org>
      Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
      Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: linux-arm-msm@vger.kernel.org
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20180118115022.6368-1-clingutla@codeaurora.org
      c52232a4
  13. 27 February 2018, 1 commit
    • printk: Wake klogd when passing console_lock owner · c14376de
      Authored by Petr Mladek
      wake_klogd is a local variable in console_unlock(). The information
      is lost when console_lock is handed over to a waiter spinning in the
      busy wait added by commit dbdda842 ("printk: Add console owner and
      waiter logic to load balance console writes"). The following race is
      possible:
      
      CPU0				CPU1
      console_unlock()
      
        for (;;)
           /* calling console for last message */
      
      				printk()
      				  log_store()
      				    log_next_seq++;
      
           /* see new message */
           if (seen_seq != log_next_seq) {
      	wake_klogd = true;
      	seen_seq = log_next_seq;
           }
      
           console_lock_spinning_enable();
      
      				  if (console_trylock_spinning())
      				     /* spinning */
      
           if (console_lock_spinning_disable_and_check()) {
      	printk_safe_exit_irqrestore(flags);
      	return;
      
      				  console_unlock()
      				    if (seen_seq != log_next_seq) {
      				    /* already seen */
      				    /* nothing to do */
      
      Result: Nobody would wakeup klogd.
      
      One solution would be to turn wake_klogd into a global variable.
      But then we would need to manipulate it under a lock or similar.
      
      This patch wakes klogd also when console_lock is passed to the
      spinning waiter. It looks like the right way to go. Also userspace
      should have a chance to see and store any "flood" of messages.
      
      Note that the very late klogd wake up was a historic solution.
      It made sense on single CPU systems or when sys_syslog() operations
      were synchronized using the big kernel lock like in v2.1.113.
      But it is questionable these days.
      
      Fixes: dbdda842 ("printk: Add console owner and waiter logic to load balance console writes")
      Link: http://lkml.kernel.org/r/20180226155734.dzwg3aovqnwtvkoy@pathway.suse.cz
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: linux-kernel@vger.kernel.org
      Cc: Tejun Heo <tj@kernel.org>
      Suggested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      c14376de
  14. 24 February 2018, 1 commit
    • bpf: allow xadd only on aligned memory · ca369602
      Authored by Daniel Borkmann
      The requirements around atomic_add() / atomic64_add() and their
      JIT implementations differ across architectures. E.g. while x86_64
      seems just fine with BPF's xadd on unaligned memory, on arm64 it
      triggers the following crash via both the interpreter and the JIT:
      
        [  830.864985] Unable to handle kernel paging request at virtual address ffff8097d7ed6703
        [...]
        [  830.916161] Internal error: Oops: 96000021 [#1] SMP
        [  830.984755] CPU: 37 PID: 2788 Comm: test_verifier Not tainted 4.16.0-rc2+ #8
        [  830.991790] Hardware name: Huawei TaiShan 2280 /BC11SPCD, BIOS 1.29 07/17/2017
        [  830.998998] pstate: 80400005 (Nzcv daif +PAN -UAO)
        [  831.003793] pc : __ll_sc_atomic_add+0x4/0x18
        [  831.008055] lr : ___bpf_prog_run+0x1198/0x1588
        [  831.012485] sp : ffff00001ccabc20
        [  831.015786] x29: ffff00001ccabc20 x28: ffff8017d56a0f00
        [  831.021087] x27: 0000000000000001 x26: 0000000000000000
        [  831.026387] x25: 000000c168d9db98 x24: 0000000000000000
        [  831.031686] x23: ffff000008203878 x22: ffff000009488000
        [  831.036986] x21: ffff000008b14e28 x20: ffff00001ccabcb0
        [  831.042286] x19: ffff0000097b5080 x18: 0000000000000a03
        [  831.047585] x17: 0000000000000000 x16: 0000000000000000
        [  831.052885] x15: 0000ffffaeca8000 x14: 0000000000000000
        [  831.058184] x13: 0000000000000000 x12: 0000000000000000
        [  831.063484] x11: 0000000000000001 x10: 0000000000000000
        [  831.068783] x9 : 0000000000000000 x8 : 0000000000000000
        [  831.074083] x7 : 0000000000000000 x6 : 000580d428000000
        [  831.079383] x5 : 0000000000000018 x4 : 0000000000000000
        [  831.084682] x3 : ffff00001ccabcb0 x2 : 0000000000000001
        [  831.089982] x1 : ffff8097d7ed6703 x0 : 0000000000000001
        [  831.095282] Process test_verifier (pid: 2788, stack limit = 0x0000000018370044)
        [  831.102577] Call trace:
        [  831.105012]  __ll_sc_atomic_add+0x4/0x18
        [  831.108923]  __bpf_prog_run32+0x4c/0x70
        [  831.112748]  bpf_test_run+0x78/0xf8
        [  831.116224]  bpf_prog_test_run_xdp+0xb4/0x120
        [  831.120567]  SyS_bpf+0x77c/0x1110
        [  831.123873]  el0_svc_naked+0x30/0x34
        [  831.127437] Code: 97fffe97 17ffffec 00000000 f9800031 (885f7c31)
      
      The reason for this is that memory is required to be aligned. In
      case of BPF, we always enforce alignment in terms of stack access,
      but not when accessing map values or packet data when the underlying
      arch (e.g. arm64) has CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS set.
      
      xadd on packet data that is local to us anyway is just wrong, so
      forbid this case entirely. The only place where xadd in fact makes
      sense is map values; xadd on the stack is wrong as well, but it's been
      around for much longer. Specifically enforce strict alignment in case
      of xadd, so that we handle this case generically and avoid such crashes
      in the first place.
      
      Fixes: 17a52670 ("bpf: verifier (add verifier core)")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      ca369602
  15. 23 February 2018, 4 commits
    • genirq/matrix: Handle CPU offlining proper · 651ca2c0
      Authored by Thomas Gleixner
      At CPU hotunplug the corresponding per cpu matrix allocator is shut down and
      the allocated interrupt bits are discarded under the assumption that all
      allocated bits have been either migrated away or shut down through the
      managed interrupts mechanism.
      
      This is not true because interrupts which are not started up might have a
      vector allocated on the outgoing CPU. When the interrupt is started up
      later or completely shutdown and freed then the allocated vector is handed
      back, triggering warnings or causing accounting issues which result in
      suspend failures and other issues.
      
      Change the CPU hotplug mechanism of the matrix allocator so that the
      remaining allocations at unplug time are preserved and global accounting at
      hotplug is correctly readjusted to take the dormant vectors into account.
      
      Fixes: 2f75d9e1 ("genirq: Implement bitmap matrix allocator")
      Reported-by: Yuriy Vostrikov <delamonpansie@gmail.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Yuriy Vostrikov <delamonpansie@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20180222112316.849980972@linutronix.de
      651ca2c0
    • bpf: fix rcu lockdep warning for lpm_trie map_free callback · 6c5f6102
      Authored by Yonghong Song
      Commit 9a3efb6b ("bpf: fix memory leak in lpm_trie map_free callback function")
      fixed a memory leak and removed unnecessary locks in map_free callback function.
      Unfortunately, it introduced a lockdep warning. When lockdep checking is
      turned on, running tools/testing/selftests/bpf/test_lpm_map will produce:
      
        [   98.294321] =============================
        [   98.294807] WARNING: suspicious RCU usage
        [   98.295359] 4.16.0-rc2+ #193 Not tainted
        [   98.295907] -----------------------------
        [   98.296486] /home/yhs/work/bpf/kernel/bpf/lpm_trie.c:572 suspicious rcu_dereference_check() usage!
        [   98.297657]
        [   98.297657] other info that might help us debug this:
        [   98.297657]
        [   98.298663]
        [   98.298663] rcu_scheduler_active = 2, debug_locks = 1
        [   98.299536] 2 locks held by kworker/2:1/54:
        [   98.300152]  #0:  ((wq_completion)"events"){+.+.}, at: [<00000000196bc1f0>] process_one_work+0x157/0x5c0
        [   98.301381]  #1:  ((work_completion)(&map->work)){+.+.}, at: [<00000000196bc1f0>] process_one_work+0x157/0x5c0
      
      Since actual trie tree removal happens only after no other
      accesses to the tree are possible, replacing
        rcu_dereference_protected(*slot, lockdep_is_held(&trie->lock))
      with
        rcu_dereference_protected(*slot, 1)
      fixed the issue.
      
      Fixes: 9a3efb6b ("bpf: fix memory leak in lpm_trie map_free callback function")
      Reported-by: Eric Dumazet <edumazet@google.com>
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      6c5f6102
    • bpf: add schedule points in percpu arrays management · 32fff239
      Authored by Eric Dumazet
      syzbot managed to trigger RCU-detected stalls in
      bpf_array_free_percpu().
      
      It takes time to allocate a huge percpu map, but even more time to free
      it.
      
      Since we run in process context, use cond_resched() to yield cpu if
      needed.
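      
      A hedged sketch of the freeing side (the allocation side gets the same
      treatment):
      
        static void bpf_array_free_percpu(struct bpf_array *array)
        {
                int i;
      
                for (i = 0; i < array->map.max_entries; i++) {
                        free_percpu(array->pptrs[i]);
                        /* process context: give other tasks a chance to run */
                        cond_resched();
                }
        }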
      
      Fixes: a10423b8 ("bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      32fff239
    • efivarfs: Limit the rate for non-root to read files · bef3efbe
      Authored by Tony Luck
      Each read from a file in efivarfs results in two calls to EFI
      (one to get the file size, another to get the actual data).
      
      On X86 these EFI calls result in broadcast system management
      interrupts (SMI) which affect performance of the whole system.
      A malicious user can loop performing reads from efivarfs bringing
      the system to its knees.
      
      Linus suggested a per-user rate limit to solve this.
      
      So we add a ratelimit structure to "user_struct" and initialize
      it with no limit for the root user. When allocating the user_struct
      for other users we set the limit to 100 per second. This could be used
      for other places that want to limit the rate of some detrimental
      user action.
      
      In efivarfs if the limit is exceeded when reading, we take an
      interruptible nap for 50ms and check the rate limit again.
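      
      A hedged sketch of the read-side check; the field and accessor names are
      assumptions based on the description above:
      
        #include <linux/ratelimit.h>
        #include <linux/delay.h>
        #include <linux/sched/user.h>
      
        static ssize_t efivarfs_file_read(struct file *file, char __user *userbuf,
                                          size_t count, loff_t *ppos)
        {
                /* root is initialized with no limit, others with 100/s */
                while (!__ratelimit(&current_user()->ratelimit)) {
                        if (msleep_interruptible(50))
                                return -EINTR;  /* woken early by a signal */
                }
      
                /* ... two EFI calls: get the variable size, then its data ... */
                return 0;
        }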
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bef3efbe
  16. 22 February 2018, 4 commits
  17. 21 February 2018, 3 commits
  18. 17 February 2018, 1 commit
  19. 16 February 2018, 4 commits
    • irqdomain: Re-use DEFINE_SHOW_ATTRIBUTE() macro · 0b24a0bb
      Authored by Andy Shevchenko
      ...instead of open coding the file operations followed by a custom
      ->open() callback for each attribute.
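      
      A hedged sketch of the pattern; the attribute name is illustrative:
      
        static int irq_domain_debug_show(struct seq_file *m, void *p)
        {
                /* ... dump the domain mappings ... */
                return 0;
        }
        /* generates irq_domain_debug_fops with seq_file-backed open/read/release */
        DEFINE_SHOW_ATTRIBUTE(irq_domain_debug);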
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      0b24a0bb
    • kprobes: Propagate error from disarm_kprobe_ftrace() · 297f9233
      Authored by Jessica Yu
      Improve error handling when disarming ftrace-based kprobes. Like with
      arm_kprobe_ftrace(), propagate any errors from disarm_kprobe_ftrace() so
      that we do not disable/unregister kprobes that are still armed. In other
      words, unregister_kprobe() and disable_kprobe() should not report success
      if the kprobe could not be disarmed.
      
      disarm_all_kprobes() keeps its current behavior and attempts to
      disarm all kprobes. It returns the last encountered error and gives a
      warning if not all probes could be disarmed.
      
      This patch is based on Petr Mladek's original patchset (patches 2 and 3)
      back in 2015, which improved kprobes error handling, found here:
      
         https://lkml.org/lkml/2015/2/26/452
      
      However, further work on this had been paused since then and the patches
      were not upstreamed.
      Based-on-patches-by: Petr Mladek <pmladek@suse.com>
      Signed-off-by: Jessica Yu <jeyu@kernel.org>
      Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: David S . Miller <davem@davemloft.net>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Joe Lawrence <joe.lawrence@redhat.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Miroslav Benes <mbenes@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: live-patching@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180109235124.30886-3-jeyu@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      297f9233
    • kprobes: Propagate error from arm_kprobe_ftrace() · 12310e34
      Authored by Jessica Yu
      Improve error handling when arming ftrace-based kprobes. Specifically, if
      we fail to arm a ftrace-based kprobe, register_kprobe()/enable_kprobe()
      should report an error instead of success. Previously, this has led to
      confusing situations where register_kprobe() would return 0 indicating
      success, but the kprobe would not be functional if ftrace registration
      during the kprobe arming process had failed. We should therefore take any
      errors returned by ftrace into account and propagate this error so that we
      do not register/enable kprobes that cannot be armed. This can happen if,
      for example, register_ftrace_function() finds an IPMODIFY conflict (since
      kprobe_ftrace_ops has this flag set) and returns an error. Such a conflict
      is possible since livepatches also set the IPMODIFY flag for their ftrace_ops.
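      
      A hedged sketch of the arming path with the error now propagated; the
      rollback detail is simplified:
      
        static int arm_kprobe_ftrace(struct kprobe *p)
        {
                int ret;
      
                ret = ftrace_set_filter_ip(&kprobe_ftrace_ops,
                                           (unsigned long)p->addr, 0, 0);
                if (ret) {
                        pr_debug("Failed to arm kprobe-ftrace at %p (%d)\n",
                                 p->addr, ret);
                        return ret;
                }
      
                if (kprobe_ftrace_enabled == 0) {
                        ret = register_ftrace_function(&kprobe_ftrace_ops);
                        if (ret) {
                                /* undo the filter so we stay consistent */
                                ftrace_set_filter_ip(&kprobe_ftrace_ops,
                                                     (unsigned long)p->addr, 1, 0);
                                return ret;
                        }
                }
      
                kprobe_ftrace_enabled++;
                return 0;
        }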
      
      arm_all_kprobes() keeps its current behavior and attempts to arm all
      kprobes. It returns the last encountered error and gives a warning if
      not all probes could be armed.
      
      This patch is based on Petr Mladek's original patchset (patches 2 and 3)
      back in 2015, which improved kprobes error handling, found here:
      
         https://lkml.org/lkml/2015/2/26/452
      
      However, further work on this had been paused since then and the patches
      were not upstreamed.
      Based-on-patches-by: Petr Mladek <pmladek@suse.com>
      Signed-off-by: Jessica Yu <jeyu@kernel.org>
      Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: David S . Miller <davem@davemloft.net>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Joe Lawrence <joe.lawrence@redhat.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Miroslav Benes <mbenes@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: live-patching@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180109235124.30886-2-jeyu@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      12310e34
    • bpf: fix mlock precharge on arraymaps · 9c2d63b8
      Authored by Daniel Borkmann
      syzkaller recently triggered OOM during percpu map allocation;
      while there is work in progress by Dennis Zhou to add __GFP_NORETRY
      semantics for the percpu allocator under pressure, there also seems to
      be a missing bpf_map_precharge_memlock() check in array map allocation.
      
      Given that today the actual bpf_map_charge_memlock() happens after
      find_and_alloc_map() in the syscall path, bpf_map_precharge_memlock()
      is there to bail out early, before we go and do the map setup work,
      when we would hit the limits anyway. Therefore add this for the
      array map as well.
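      
      A hedged sketch of the added check in array_map_alloc(); the cost
      computation is shown only in outline:
      
        static struct bpf_map *array_map_alloc(union bpf_attr *attr)
        {
                bool percpu = attr->map_type == BPF_MAP_TYPE_PERCPU_ARRAY;
                u64 cost, array_size;
                int ret;
      
                /* ... sanity checks, compute array_size ... */
      
                cost = array_size;
                if (percpu)
                        cost += (u64)attr->max_entries *
                                round_up(attr->value_size, 8) * num_possible_cpus();
                if (cost >= U32_MAX - PAGE_SIZE)
                        return ERR_PTR(-ENOMEM);
      
                /* bail out early, before any allocation work */
                ret = bpf_map_precharge_memlock(round_up(cost, PAGE_SIZE) >> PAGE_SHIFT);
                if (ret < 0)
                        return ERR_PTR(ret);
      
                /* ... allocate and initialize the map as before ... */
        }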
      
      Fixes: 6c905981 ("bpf: pre-allocate hash map elements")
      Fixes: a10423b8 ("bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map")
      Reported-by: syzbot+adb03f3f0bb57ce3acda@syzkaller.appspotmail.com
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Dennis Zhou <dennisszhou@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      9c2d63b8
  20. 15 February 2018, 2 commits