1. 20 Mar, 2018 2 commits
    • J
      bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data · 4f738adb
      John Fastabend committed
      This implements a BPF ULP layer to allow policy enforcement and
      monitoring at the socket layer. In order to support this a new
      program type BPF_PROG_TYPE_SK_MSG is used to run the policy at
      the sendmsg/sendpage hook. To attach the policy to sockets a
      sockmap is used with a new program attach type BPF_SK_MSG_VERDICT.
      
      Similar to previous sockmap usages when a sock is added to a
      sockmap, via a map update, if the map contains a BPF_SK_MSG_VERDICT
      program type attached then the BPF ULP layer is created on the
      socket and the attached BPF_PROG_TYPE_SK_MSG program is run for
      every msg in sendmsg case and page/offset in sendpage case.
      
      BPF_PROG_TYPE_SK_MSG Semantics/API:
      
      BPF_PROG_TYPE_SK_MSG supports only two return codes SK_PASS and
      SK_DROP. Returning SK_DROP frees the copied data in the sendmsg
      case and in the sendpage case leaves the data untouched. Both cases
      return -EACCES to the user. Returning SK_PASS will allow the msg to
      be sent.
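
      The return-code semantics above can be modelled in user space with a
      small sketch. The function name apply_msg_verdict() is hypothetical
      and is not a kernel symbol; only the SK_DROP/SK_PASS values and the
      -EACCES result are taken from the kernel UAPI and the text above.

      ```c
      #include <errno.h>
      #include <stddef.h>

      /* Verdict values as in the kernel UAPI: SK_DROP is 0, SK_PASS is 1. */
      enum sk_verdict { SK_DROP = 0, SK_PASS = 1 };

      /*
       * Hypothetical model of the send path after the program has run:
       * returns the number of bytes accepted, or a negative errno.
       */
      static long apply_msg_verdict(enum sk_verdict verdict, size_t len)
      {
          if (verdict == SK_PASS)
              return (long)len;   /* message is handed on to TCP */
          return -EACCES;         /* SK_DROP: caller sees -EACCES */
      }
      ```

      Calling apply_msg_verdict(SK_DROP, n) yields -EACCES for any n, which
      is what the sending process would observe from sendmsg()/sendpage().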
      
      In the sendmsg case data is copied into kernel space buffers before
      running the BPF program. The kernel space buffers are stored in a
      scatterlist object where each element is a kernel memory buffer.
      Some effort is made to coalesce data from the sendmsg call here.
      For example a sendmsg call with many one byte iov entries will
      likely be pushed into a single entry. The BPF program is run with
      data pointers (start/end) pointing to the first sg element.
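
      The coalescing of many small iov entries can be illustrated with a
      user-space sketch. This is not the kernel code path (which fills
      scatterlist elements); coalesce_iov() is a hypothetical helper that
      only shows how several one-byte entries end up in one buffer.

      ```c
      #include <stddef.h>
      #include <string.h>
      #include <sys/uio.h>

      /*
       * Copy the iovec entries back to back into one contiguous buffer,
       * stopping when the buffer is full. Returns the bytes copied.
       */
      static size_t coalesce_iov(const struct iovec *iov, int iovcnt,
                                 unsigned char *buf, size_t cap)
      {
          size_t used = 0;

          for (int i = 0; i < iovcnt && used < cap; i++) {
              size_t n = iov[i].iov_len;

              if (n > cap - used)
                  n = cap - used;     /* clamp to the remaining space */
              memcpy(buf + used, iov[i].iov_base, n);
              used += n;
          }
          return used;
      }
      ```

      A program working on the result would see a single start/end range
      covering all copied bytes rather than one range per iov entry.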
      
      In the sendpage case data is not copied. We opt not to copy the
      data by default here, because the BPF infrastructure does not
      know what bytes will be needed nor when they will be needed. So
      copying all bytes may be wasteful. Because of this the initial
      start/end data pointers are (0,0), meaning no data can be read or
      written. This avoids reading data that may be modified by the
      user. A new helper is added later in this series if reading and
      writing the data is needed. The helper call will do a copy by
      default so that the page is exclusively owned by the BPF call.
      
      The verdict from the BPF_PROG_TYPE_SK_MSG applies to the entire msg
      in the sendmsg() case and the entire page/offset in the sendpage case.
      This avoids ambiguity on how to handle mixed return codes in the
      sendmsg case. Again a helper is added later in the series if
      a verdict needs to apply to multiple system calls and/or only
      a subpart of the currently being processed message.
      
      The helper msg_redirect_map() can be used to select the socket to
      send the data on. This is used similarly to existing redirect use
      cases. This allows policy to redirect msgs.
      
      Pseudo code simple example:
      
      The basic logic to attach a program to a socket is as follows,
      
        // load the programs
        bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG,
      		&obj, &msg_prog);
      
        // lookup the sockmap
        bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map");
      
        // get fd for sockmap
        map_fd_msg = bpf_map__fd(bpf_map_msg);
      
        // attach program to sockmap
        bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0);
      
      Adding sockets to the map is done in the normal way,
      
        // Add a socket 'fd' to sockmap at location 'i'
        bpf_map_update_elem(map_fd_msg, &i, &fd, BPF_ANY);
      
      After the above any socket attached to "my_sock_map", in this case
      'fd', will run the BPF msg verdict program (msg_prog) on every
      sendmsg and sendpage system call.
      
      For a complete example see BPF selftests or sockmap samples.
      
      Implementation notes:
      
      It seemed the simplest, to me at least, to use a refcnt to ensure
      psock is not lost across the sendmsg copy into the sg, the bpf program
      running on the data in sg_data, and the final pass to the TCP stack.
      Some performance testing may show a better method to do this and avoid
      the refcnt cost, but for now use the simpler method.
      
      Another item that will come after basic support is in place is
      supporting MSG_MORE flag. At the moment we call sendpages even if
      the MSG_MORE flag is set. An enhancement would be to collect the
      pages into a larger scatterlist and pass down the stack. Notice that
      bpf_tcp_sendmsg() could support this with some additional state saved
      across sendmsg calls. I built the code to support this without having
      to do refactoring work. Other features TBD include ZEROCOPY and the
      TCP_RECV_QUEUE/TCP_NO_QUEUE support. These will follow the initial
      series shortly.
      
      Future work could improve size limits on the scatterlist rings used
      here. Currently, we use MAX_SKB_FRAGS simply because this was being
      used already in the TLS case. Future work could extend the kernel sk
      APIs to tune this depending on workload. This is a trade-off
      between memory usage and throughput performance.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      4f738adb
    • J
      sockmap: convert refcnt to an atomic refcnt · ffa35660
      John Fastabend committed
      The sockmap refcnt up until now has been wrapped in the
      sk_callback_lock(), so it has not needed any locking of its
      own. The counter itself tracks the lifetime of the psock object.
      Sockets in a sockmap have a lifetime that is independent of the
      map they are part of. This is possible because a single socket may
      be in multiple maps. When this happens we can only release the
      psock data associated with the socket when the refcnt reaches
      zero. There are three possible paths that decrement the sock
      reference. First, through the normal sockmap process, the user
      deletes the socket from the map. Second, the map is removed and all
      sockets in the map are removed; the delete path is similar to case 1.
      The third case is an asynchronous socket event such as closing the
      socket. The last case handles removing sockets that are no longer
      available.
      For completeness, although inc does not pose any problems in this
      patch series, the inc case only happens when a psock is added to a
      map.
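
      The lifetime rule above (free the psock only when the last reference
      is dropped) can be sketched with C11 atomics. This is a user-space
      illustration, not the kernel implementation: struct psock here is a
      hypothetical stand-in and psock_alloc/psock_get/psock_put are names
      chosen for the example.

      ```c
      #include <stdatomic.h>
      #include <stdbool.h>
      #include <stdlib.h>

      /* Hypothetical stand-in for the per-socket psock state. */
      struct psock {
          atomic_int refcnt;
      };

      /* Allocate a psock holding one reference (the first map). */
      static struct psock *psock_alloc(void)
      {
          struct psock *p = malloc(sizeof(*p));

          if (p)
              atomic_init(&p->refcnt, 1);
          return p;
      }

      /* Take a reference, e.g. when the socket joins another map. */
      static void psock_get(struct psock *p)
      {
          atomic_fetch_add(&p->refcnt, 1);
      }

      /* Drop a reference; frees the psock and returns true on the last one. */
      static bool psock_put(struct psock *p)
      {
          if (atomic_fetch_sub(&p->refcnt, 1) == 1) {
              free(p);
              return true;
          }
          return false;
      }
      ```

      Each of the three decrement paths would end in a psock_put() call;
      whichever path drops the count to zero releases the object, with no
      lock needed around the counter itself.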
      
      Next we plan to add another socket prog type to handle policy and
      monitoring on the TX path. When we do this however we will need to
      keep a reference count open across the sendmsg/sendpage call and
      holding the sk_callback_lock() here (on every send) seems less than
      ideal, also it may sleep in cases where we hit memory pressure.
      Instead of dealing with these issues in some clever way simply make
      the reference counting a refcount_t type and do proper atomic ops.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      ffa35660
  2. 15 Mar, 2018 1 commit
    • S
      bpf: extend stackmap to save binary_build_id+offset instead of address · 615755a7
      Song Liu committed
      Currently, the bpf stackmap stores an address for each entry in the call trace.
      To map these addresses to user space files, it is necessary to maintain
      the mapping from these virtual addresses to symbols in the binary. Usually,
      the user space profiler (such as perf) has to scan /proc/pid/maps at the
      beginning of profiling, and monitor mmap2() calls afterwards. Given the
      cost of maintaining the address map, this solution is not practical for
      system wide profiling that is always on.
      
      This patch tries to solve this problem with a variation of stackmap. This
      variation is enabled by flag BPF_F_STACK_BUILD_ID. Instead of storing
      addresses, the variation stores ELF file build_id + offset.
      
      Build ID is a 20-byte unique identifier for ELF files. The following
      command shows the Build ID of /bin/bash:
      
        [user@]$ readelf -n /bin/bash
        ...
          Build ID: XXXXXXXXXX
        ...
      
      With BPF_F_STACK_BUILD_ID, bpf_get_stackid() tries to parse Build ID
      for each entry in the call trace, and translate it into the following
      struct:
      
        struct bpf_stack_build_id_offset {
                __s32           status;
                unsigned char   build_id[BPF_BUILD_ID_SIZE];
                union {
                        __u64   offset;
                        __u64   ip;
                };
        };
      
      The search of build_id is limited to the first page of the file, and this
      page should be in page cache. Otherwise, we fall back to storing ip for this
      entry (ip field in struct bpf_stack_build_id_offset). This requires the
      build_id to be stored in the first page. A quick survey of binary and
      dynamic library files in a few different systems shows that almost all
      binary and dynamic library files have build_id in the first page.
      
      Build_id is only meaningful for user stack. If a kernel stack is added to
      a stackmap with BPF_F_STACK_BUILD_ID, it will automatically fall back to
      storing only ip (status == BPF_STACK_BUILD_ID_IP). Similarly, if build_id
      lookup fails for some reason, it will also fall back to storing ip.
      
      User space can access struct bpf_stack_build_id_offset with bpf
      syscall BPF_MAP_LOOKUP_ELEM. It is necessary for user space to
      maintain mapping from build id to binary files. This mostly static
      mapping is much easier to maintain than per process address maps.
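
      How user space might interpret one looked-up entry can be sketched
      as follows. The struct below is a local mirror of the layout quoted
      above, declared here so the example is self-contained; the status
      values 0/1/2 (empty, valid build_id, ip fallback) follow the kernel
      UAPI header, and format_entry() is a hypothetical helper.

      ```c
      #include <stdint.h>
      #include <stdio.h>

      #define BUILD_ID_SIZE 20            /* mirrors BPF_BUILD_ID_SIZE */

      enum {
          STACK_BUILD_ID_EMPTY = 0,       /* unused slot */
          STACK_BUILD_ID_VALID = 1,       /* build_id + offset are set */
          STACK_BUILD_ID_IP    = 2,       /* fallback: raw ip only */
      };

      /* Local mirror of the entry layout described in the text. */
      struct stack_entry {
          int32_t status;
          unsigned char build_id[BUILD_ID_SIZE];
          union {
              uint64_t offset;
              uint64_t ip;
          };
      };

      /*
       * Write a NUL-terminated description of the entry into out
       * (truncated if out is too small). Returns 0 for an empty slot
       * or a zero-length buffer, 1 otherwise.
       */
      static int format_entry(const struct stack_entry *e, char *out, size_t len)
      {
          size_t pos = 0;

          if (len == 0)
              return 0;
          out[0] = '\0';
          if (e->status == STACK_BUILD_ID_IP) {
              snprintf(out, len, "ip 0x%llx", (unsigned long long)e->ip);
              return 1;
          }
          if (e->status != STACK_BUILD_ID_VALID)
              return 0;
          /* Hex-encode the build ID, two characters per byte. */
          for (int i = 0; i < BUILD_ID_SIZE && pos + 2 < len; i++)
              pos += (size_t)snprintf(out + pos, len - pos, "%02x",
                                      (unsigned int)e->build_id[i]);
          snprintf(out + pos, len - pos, "+0x%llx",
                   (unsigned long long)e->offset);
          return 1;
      }
      ```

      A profiler would then resolve the "build_id+offset" form through its
      own build-ID-to-file table and treat the "ip" form as an unresolved
      address.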
      
      Note: Stackmap with build_id only works in non-nmi context at this time.
      This is because we need to take mm->mmap_sem for find_vma(). If this
      changes, we would like to allow build_id lookup in nmi context.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      615755a7
  3. 09 Mar, 2018 1 commit
    • Q
      bpf: comment why dots in filenames under BPF virtual FS are not allowed · 6d8cb045
      Quentin Monnet committed
      When pinning a file under the BPF virtual file system (traditionally
      /sys/fs/bpf), using a dot in the name of the location to pin at is not
      allowed. For example, trying to pin at "/sys/fs/bpf/foo.bar" will be
      rejected with -EPERM.
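
      The effect of the check can be modelled in user space as below. The
      function name bpffs_check_name() is hypothetical; it only mirrors
      the rule that a name containing a dot is refused with -EPERM.

      ```c
      #include <errno.h>
      #include <string.h>

      /*
       * Hypothetical model of the name check applied to the final path
       * component: any '.' in the name is rejected.
       */
      static int bpffs_check_name(const char *name)
      {
          return strchr(name, '.') ? -EPERM : 0;
      }
      ```

      With this model "foo.bar" is refused while "foo_bar" is accepted.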
      
      This check was introduced at the same time as the BPF file system
      itself, with commit b2197755 ("bpf: add support for persistent
      maps/progs"). At this time, it was checked in a function called
      "bpf_dname_reserved()", which made clear that using a dot was reserved
      for future extensions.
      
      This function disappeared and the check was moved elsewhere with commit
      0c93b7d8 ("bpf: reject invalid names right in ->lookup()"), and the
      meaning of the dot ban was lost.
      
      The present commit simply adds a comment in the source to explain to the
      reader that the usage of dots is reserved for future usage.
      Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      6d8cb045
  4. 08 Mar, 2018 1 commit
  5. 03 Mar, 2018 1 commit
  6. 01 Mar, 2018 1 commit
    • L
      timers: Forward timer base before migrating timers · c52232a4
      Lingutla Chandrasekhar committed
      On CPU hotunplug the enqueued timers of the unplugged CPU are migrated to a
      live CPU. This happens from the control thread which initiated the unplug.
      
      If the CPU on which the control thread runs came out from a longer idle
      period then the base clock of that CPU might be stale because the control
      thread runs prior to any event which forwards the clock.
      
      In such a case the timers from the unplugged CPU are queued on the live CPU
      based on the stale clock which can cause large delays due to increased
      granularity of the outer timer wheels which are far away from base->clk.
      
      But there is a worse problem than that. The following sequence of events
      illustrates it:
      
       - On CPU0, timer1 is queued with expires = 59969 and base->clk = 59131.
      
         The timer is queued at wheel level 2, with resulting expiry time = 60032
         (due to level granularity).
      
       - CPU1 enters idle @60007, with next timer expiry @60020.
      
       - CPU0 is hot-unplugged @60009
      
       - CPU1 exits idle and runs the control thread which migrates the
         timers from CPU0
      
         timer1 is now queued in level 0 for immediate handling in the next
         softirq because the requested expiry time 59969 is before CPU1 base->clk
         60007
      
       - CPU1 runs code which forwards the base clock which succeeds because the
         next expiring timer, which was collected at idle entry time, is still set
         to 60020.
      
         So it forwards beyond 60007 and therefore fails to expire the migrated
         timer1. That timer gets expired when the wheel wraps around again, which
         takes between 63 and 630ms depending on the HZ setting.
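
      The numbers in the sequence above can be reproduced with a
      simplified model of the timer wheel's level selection. The constants
      (6 bits per level, clock shift of 3, depth of 8) are assumptions
      based on kernel/time/timer.c of that era, and wheel_level() and
      effective_expiry() are illustrative names, not kernel functions.

      ```c
      #define LVL_CLK_SHIFT 3
      #define LVL_BITS      6
      #define LVL_SIZE      (1UL << LVL_BITS)
      #define LVL_DEPTH     8   /* assumed; differs for low HZ values */
      #define LVL_SHIFT(n)  ((n) * LVL_CLK_SHIFT)
      #define LVL_GRAN(n)   (1UL << LVL_SHIFT(n))
      /* First delta that belongs to level n (valid for n >= 1). */
      #define LVL_START(n)  ((LVL_SIZE - 1) << (((n) - 1) * LVL_CLK_SHIFT))

      /* Pick the wheel level for a timer relative to the base clock. */
      static unsigned int wheel_level(unsigned long expires, unsigned long clk)
      {
          long delta = (long)(expires - clk);
          unsigned int lvl = 0;

          if (delta < 0)
              return 0;   /* already expired: level 0, handled immediately */
          while (lvl < LVL_DEPTH - 1 &&
                 (unsigned long)delta >= LVL_START(lvl + 1))
              lvl++;
          return lvl;
      }

      /* Expiry time after rounding up to the granularity of the level. */
      static unsigned long effective_expiry(unsigned long expires,
                                            unsigned int lvl)
      {
          return ((expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl)) << LVL_SHIFT(lvl);
      }
      ```

      Under these assumptions wheel_level(59969, 59131) gives level 2 and
      effective_expiry(59969, 2) gives 60032, while the same timer seen
      against a base clock of 60007 lands in level 0.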
      
      Address both problems by invoking forward_timer_base() for the control CPU's
      timer base. All other places, which might run into a similar problem
      (mod_timer()/add_timer_on()) already invoke forward_timer_base() to avoid
      that.
      
      [ tglx: Massaged comment and changelog ]
      
      Fixes: a683f390 ("timers: Forward the wheel clock whenever possible")
      Co-developed-by: Neeraj Upadhyay <neeraju@codeaurora.org>
      Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
      Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: linux-arm-msm@vger.kernel.org
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20180118115022.6368-1-clingutla@codeaurora.org
      c52232a4
  7. 27 Feb, 2018 1 commit
    • P
      printk: Wake klogd when passing console_lock owner · c14376de
      Petr Mladek committed
      wake_klogd is a local variable in console_unlock(). The information
      is lost when the console_lock owner passes the lock to a waiter that
      uses the busy wait added by commit dbdda842 ("printk: Add console
      owner and waiter logic to load balance console writes"). The
      following race is possible:
      
      CPU0				CPU1
      console_unlock()
      
        for (;;)
           /* calling console for last message */
      
      				printk()
      				  log_store()
      				    log_next_seq++;
      
           /* see new message */
           if (seen_seq != log_next_seq) {
      	wake_klogd = true;
      	seen_seq = log_next_seq;
           }
      
           console_lock_spinning_enable();
      
      				  if (console_trylock_spinning())
      				     /* spinning */
      
           if (console_lock_spinning_disable_and_check()) {
      	printk_safe_exit_irqrestore(flags);
      	return;
      
      				  console_unlock()
      				    if (seen_seq != log_next_seq) {
      				    /* already seen */
      				    /* nothing to do */
      
      Result: Nobody would wake up klogd.
      
      One solution would be to make a global variable from wake_klogd.
      But then we would need to manipulate it under a lock or so.
      
      This patch wakes klogd also when console_lock is passed to the
      spinning waiter. It looks like the right way to go. Also userspace
      should have a chance to see and store any "flood" of messages.
      
      Note that the very late klogd wake up was a historic solution.
      It made sense on single CPU systems or when sys_syslog() operations
      were synchronized using the big kernel lock like in v2.1.113.
      But it is questionable these days.
      
      Fixes: dbdda842 ("printk: Add console owner and waiter logic to load balance console writes")
      Link: http://lkml.kernel.org/r/20180226155734.dzwg3aovqnwtvkoy@pathway.suse.cz
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: linux-kernel@vger.kernel.org
      Cc: Tejun Heo <tj@kernel.org>
      Suggested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      c14376de
  8. 24 Feb, 2018 1 commit
    • D
      bpf: allow xadd only on aligned memory · ca369602
      Daniel Borkmann committed
      The requirements around atomic_add() / atomic64_add() resp. their
      JIT implementations differ across architectures. E.g. while x86_64
      seems just fine with BPF's xadd on unaligned memory, on arm64 it
      triggers via interpreter but also JIT the following crash:
      
        [  830.864985] Unable to handle kernel paging request at virtual address ffff8097d7ed6703
        [...]
        [  830.916161] Internal error: Oops: 96000021 [#1] SMP
        [  830.984755] CPU: 37 PID: 2788 Comm: test_verifier Not tainted 4.16.0-rc2+ #8
        [  830.991790] Hardware name: Huawei TaiShan 2280 /BC11SPCD, BIOS 1.29 07/17/2017
        [  830.998998] pstate: 80400005 (Nzcv daif +PAN -UAO)
        [  831.003793] pc : __ll_sc_atomic_add+0x4/0x18
        [  831.008055] lr : ___bpf_prog_run+0x1198/0x1588
        [  831.012485] sp : ffff00001ccabc20
        [  831.015786] x29: ffff00001ccabc20 x28: ffff8017d56a0f00
        [  831.021087] x27: 0000000000000001 x26: 0000000000000000
        [  831.026387] x25: 000000c168d9db98 x24: 0000000000000000
        [  831.031686] x23: ffff000008203878 x22: ffff000009488000
        [  831.036986] x21: ffff000008b14e28 x20: ffff00001ccabcb0
        [  831.042286] x19: ffff0000097b5080 x18: 0000000000000a03
        [  831.047585] x17: 0000000000000000 x16: 0000000000000000
        [  831.052885] x15: 0000ffffaeca8000 x14: 0000000000000000
        [  831.058184] x13: 0000000000000000 x12: 0000000000000000
        [  831.063484] x11: 0000000000000001 x10: 0000000000000000
        [  831.068783] x9 : 0000000000000000 x8 : 0000000000000000
        [  831.074083] x7 : 0000000000000000 x6 : 000580d428000000
        [  831.079383] x5 : 0000000000000018 x4 : 0000000000000000
        [  831.084682] x3 : ffff00001ccabcb0 x2 : 0000000000000001
        [  831.089982] x1 : ffff8097d7ed6703 x0 : 0000000000000001
        [  831.095282] Process test_verifier (pid: 2788, stack limit = 0x0000000018370044)
        [  831.102577] Call trace:
        [  831.105012]  __ll_sc_atomic_add+0x4/0x18
        [  831.108923]  __bpf_prog_run32+0x4c/0x70
        [  831.112748]  bpf_test_run+0x78/0xf8
        [  831.116224]  bpf_prog_test_run_xdp+0xb4/0x120
        [  831.120567]  SyS_bpf+0x77c/0x1110
        [  831.123873]  el0_svc_naked+0x30/0x34
        [  831.127437] Code: 97fffe97 17ffffec 00000000 f9800031 (885f7c31)
      
      The reason for this is that memory is required to be aligned. In
      case of BPF, we always enforce alignment in terms of stack access,
      but not when accessing map values or packet data when the underlying
      arch (e.g. arm64) has CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS set.
      
      xadd on packet data that is local to us anyway is just wrong, so
      forbid this case entirely. The only place where xadd makes sense in
      fact are map values; xadd on stack is wrong as well, but it's been
      around for much longer. Specifically enforce strict alignment in case
      of xadd, so that we handle this case generically and avoid such crashes
      in the first place.
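
      The alignment rule being enforced can be written down as a one-line
      predicate. This is only an illustration of the arithmetic, not the
      verifier code; xadd_aligned() is a hypothetical name and size is
      assumed to be the operand width in bytes (4 or 8).

      ```c
      #include <stdbool.h>
      #include <stdint.h>

      /* True if addr is a multiple of size; size must be a power of two. */
      static bool xadd_aligned(uint64_t addr, unsigned int size)
      {
          return (addr & (uint64_t)(size - 1)) == 0;
      }
      ```

      The faulting address from the oops above, 0xffff8097d7ed6703, fails
      this test for a 4-byte operand because its low two bits are set.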
      
      Fixes: 17a52670 ("bpf: verifier (add verifier core)")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      ca369602
  9. 23 Feb, 2018 4 commits
    • T
      genirq/matrix: Handle CPU offlining proper · 651ca2c0
      Thomas Gleixner committed
      At CPU hotunplug the corresponding per cpu matrix allocator is shut down and
      the allocated interrupt bits are discarded under the assumption that all
      allocated bits have been either migrated away or shut down through the
      managed interrupts mechanism.
      
      This is not true because interrupts which are not started up might have a
      vector allocated on the outgoing CPU. When the interrupt is started up
      later or completely shut down and freed, then the allocated vector is handed
      back, triggering warnings or causing accounting issues which result in
      suspend failures and other issues.
      
      Change the CPU hotplug mechanism of the matrix allocator so that the
      remaining allocations at unplug time are preserved and global accounting at
      hotplug is correctly readjusted to take the dormant vectors into account.
      
      Fixes: 2f75d9e1 ("genirq: Implement bitmap matrix allocator")
      Reported-by: Yuriy Vostrikov <delamonpansie@gmail.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Yuriy Vostrikov <delamonpansie@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20180222112316.849980972@linutronix.de
      651ca2c0
    • Y
      bpf: fix rcu lockdep warning for lpm_trie map_free callback · 6c5f6102
      Yonghong Song committed
      Commit 9a3efb6b ("bpf: fix memory leak in lpm_trie map_free callback function")
      fixed a memory leak and removed unnecessary locks in map_free callback function.
      Unfortunately, it introduced a lockdep warning. When lockdep checking is turned on,
      running tools/testing/selftests/bpf/test_lpm_map will have:
      
        [   98.294321] =============================
        [   98.294807] WARNING: suspicious RCU usage
        [   98.295359] 4.16.0-rc2+ #193 Not tainted
        [   98.295907] -----------------------------
        [   98.296486] /home/yhs/work/bpf/kernel/bpf/lpm_trie.c:572 suspicious rcu_dereference_check() usage!
        [   98.297657]
        [   98.297657] other info that might help us debug this:
        [   98.297657]
        [   98.298663]
        [   98.298663] rcu_scheduler_active = 2, debug_locks = 1
        [   98.299536] 2 locks held by kworker/2:1/54:
        [   98.300152]  #0:  ((wq_completion)"events"){+.+.}, at: [<00000000196bc1f0>] process_one_work+0x157/0x5c0
        [   98.301381]  #1:  ((work_completion)(&map->work)){+.+.}, at: [<00000000196bc1f0>] process_one_work+0x157/0x5c0
      
      Since actual trie tree removal happens only after no other
      accesses to the tree are possible, replacing
        rcu_dereference_protected(*slot, lockdep_is_held(&trie->lock))
      with
        rcu_dereference_protected(*slot, 1)
      fixed the issue.
      
      Fixes: 9a3efb6b ("bpf: fix memory leak in lpm_trie map_free callback function")
      Reported-by: Eric Dumazet <edumazet@google.com>
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      6c5f6102
    • E
      bpf: add schedule points in percpu arrays management · 32fff239
      Eric Dumazet committed
      syzbot managed to trigger RCU detected stalls in
      bpf_array_free_percpu().
      
      It takes time to allocate a huge percpu map, but even more time to free
      it.
      
      Since we run in process context, use cond_resched() to yield cpu if
      needed.
      
      Fixes: a10423b8 ("bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      32fff239
    • L
      efivarfs: Limit the rate for non-root to read files · bef3efbe
      Luck, Tony committed
      Each read from a file in efivarfs results in two calls to EFI
      (one to get the file size, another to get the actual data).
      
      On X86 these EFI calls result in broadcast system management
      interrupts (SMI) which affect performance of the whole system.
      A malicious user can loop performing reads from efivarfs bringing
      the system to its knees.
      
      Linus suggested a per-user rate limit to solve this.
      
      So we add a ratelimit structure to "user_struct" and initialize
      it for the root user for no limit. When allocating user_struct for
      other users we set the limit to 100 per second. This could be used
      for other places that want to limit the rate of some detrimental
      user action.
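
      A per-user limit of this kind can be sketched as a fixed-window
      counter. This is a user-space model and not the kernel's ratelimit
      code: struct ratelimit and ratelimit_ok() are hypothetical names,
      time is passed in explicitly in milliseconds, and an interval of 0
      stands for "no limit" as used for root.

      ```c
      #include <stdbool.h>

      struct ratelimit {
          long interval_ms;   /* window length; 0 disables the limit */
          int  burst;         /* events allowed per window */
          long begin_ms;      /* start of the current window */
          int  admitted;      /* events admitted in the current window */
      };

      /* Return true if one more event is allowed at time now_ms. */
      static bool ratelimit_ok(struct ratelimit *rs, long now_ms)
      {
          if (rs->interval_ms == 0)
              return true;
          if (now_ms - rs->begin_ms >= rs->interval_ms) {
              rs->begin_ms = now_ms;   /* open a new window */
              rs->admitted = 0;
          }
          if (rs->admitted < rs->burst) {
              rs->admitted++;
              return true;
          }
          return false;
      }
      ```

      With interval_ms = 1000 and burst = 100 this admits 100 reads per
      second; a refused reader would sleep for a short while and ask again.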
      
      In efivarfs if the limit is exceeded when reading, we take an
      interruptible nap for 50ms and check the rate limit again.
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bef3efbe
  10. 22 Feb, 2018 3 commits
  11. 21 Feb, 2018 3 commits
  12. 17 Feb, 2018 1 commit
  13. 16 Feb, 2018 4 commits
    • A
      irqdomain: Re-use DEFINE_SHOW_ATTRIBUTE() macro · 0b24a0bb
      Andy Shevchenko committed
      ...instead of open coding file operations followed by custom ->open()
      callbacks per each attribute.
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      0b24a0bb
    • J
      kprobes: Propagate error from disarm_kprobe_ftrace() · 297f9233
      Jessica Yu committed
      Improve error handling when disarming ftrace-based kprobes. Like with
      arm_kprobe_ftrace(), propagate any errors from disarm_kprobe_ftrace() so
      that we do not disable/unregister kprobes that are still armed. In other
      words, unregister_kprobe() and disable_kprobe() should not report success
      if the kprobe could not be disarmed.
      
      disarm_all_kprobes() keeps its current behavior and attempts to
      disarm all kprobes. It returns the last encountered error and gives a
      warning if not all probes could be disarmed.
      
      This patch is based on Petr Mladek's original patchset (patches 2 and 3)
      back in 2015, which improved kprobes error handling, found here:
      
         https://lkml.org/lkml/2015/2/26/452
      
      However, further work on this had been paused since then and the patches
      were not upstreamed.
      Based-on-patches-by: Petr Mladek <pmladek@suse.com>
      Signed-off-by: Jessica Yu <jeyu@kernel.org>
      Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: David S . Miller <davem@davemloft.net>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Joe Lawrence <joe.lawrence@redhat.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Miroslav Benes <mbenes@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: live-patching@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180109235124.30886-3-jeyu@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      297f9233
    • J
      kprobes: Propagate error from arm_kprobe_ftrace() · 12310e34
      Jessica Yu committed
      Improve error handling when arming ftrace-based kprobes. Specifically, if
      we fail to arm a ftrace-based kprobe, register_kprobe()/enable_kprobe()
      should report an error instead of success. Previously, this has led to
      confusing situations where register_kprobe() would return 0 indicating
      success, but the kprobe would not be functional if ftrace registration
      during the kprobe arming process had failed. We should therefore take any
      errors returned by ftrace into account and propagate this error so that we
      do not register/enable kprobes that cannot be armed. This can happen if,
      for example, register_ftrace_function() finds an IPMODIFY conflict (since
      kprobe_ftrace_ops has this flag set) and returns an error. Such a conflict
      is possible since livepatches also set the IPMODIFY flag for their ftrace_ops.
      
      arm_all_kprobes() keeps its current behavior and attempts to arm all
      kprobes. It returns the last encountered error and gives a warning if
      not all probes could be armed.
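
      The "try everything, return the last error" behaviour can be shown
      with a small user-space model. None of these names are kernel
      symbols: arm_all() and fake_arm() are hypothetical, and fake_arm()
      simply fails for two chosen probe ids to exercise the logic.

      ```c
      #include <errno.h>
      #include <stddef.h>

      /* Hypothetical arm callback: returns 0 or a negative errno. */
      typedef int (*arm_fn)(int probe_id);

      /* Stand-in arm function for the example: ids 1 and 3 fail. */
      static int fake_arm(int probe_id)
      {
          if (probe_id == 1)
              return -EBUSY;
          if (probe_id == 3)
              return -EINVAL;
          return 0;
      }

      /*
       * Attempt to arm every probe even if some fail. Returns 0 if all
       * succeeded, otherwise the last error seen; *failed counts failures
       * so the caller can emit a warning.
       */
      static int arm_all(const int *probes, size_t n, arm_fn arm, size_t *failed)
      {
          int ret = 0;

          *failed = 0;
          for (size_t i = 0; i < n; i++) {
              int err = arm(probes[i]);

              if (err) {
                  ret = err;
                  (*failed)++;
              }
          }
          return ret;
      }
      ```

      For probe ids 0..3 this model reports -EINVAL with two failures,
      while the probes that could be armed stay armed.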
      
      This patch is based on Petr Mladek's original patchset (patches 2 and 3)
      back in 2015, which improved kprobes error handling, found here:
      
         https://lkml.org/lkml/2015/2/26/452
      
      However, further work on this had been paused since then and the patches
      were not upstreamed.
      Based-on-patches-by: Petr Mladek <pmladek@suse.com>
      Signed-off-by: Jessica Yu <jeyu@kernel.org>
      Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: David S . Miller <davem@davemloft.net>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Joe Lawrence <joe.lawrence@redhat.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Miroslav Benes <mbenes@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: live-patching@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180109235124.30886-2-jeyu@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      12310e34
    • D
      bpf: fix mlock precharge on arraymaps · 9c2d63b8
      Daniel Borkmann committed
      syzkaller recently triggered OOM during percpu map allocation;
      while there is work in progress by Dennis Zhou to add __GFP_NORETRY
      semantics for percpu allocator under pressure, there also seems to be a
      missing bpf_map_precharge_memlock() check in array map allocation.
      
      Given today the actual bpf_map_charge_memlock() happens after the
      find_and_alloc_map() in syscall path, the bpf_map_precharge_memlock()
      is there to bail out early before we go and do the map setup work
      when we find that we hit the limits anyway. Therefore add this for
      array map as well.
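
      The idea of bailing out before doing any setup work can be sketched
      as below. This is a user-space model with hypothetical names
      (precharge_memlock, array_alloc) and an assumed 4096-byte page; the
      kernel compares against the user's RLIMIT_MEMLOCK accounting rather
      than taking the values as arguments.

      ```c
      #include <errno.h>
      #include <stdlib.h>

      #define PAGE_SZ 4096UL  /* assumed page size for the example */

      /* Would charging new_pages on top of cur_pages exceed the limit? */
      static int precharge_memlock(unsigned long cur_pages,
                                   unsigned long new_pages,
                                   unsigned long limit_pages)
      {
          if (cur_pages + new_pages > limit_pages)
              return -EPERM;
          return 0;
      }

      /* Check the limit first, and only then do the allocation. */
      static void *array_alloc(unsigned long cur_pages, unsigned long pages,
                               unsigned long limit_pages, int *err)
      {
          *err = precharge_memlock(cur_pages, pages, limit_pages);
          if (*err)
              return NULL;    /* over the limit: no allocation attempted */
          return calloc(pages, PAGE_SZ);
      }
      ```

      A request that would exceed the limit fails with -EPERM before any
      memory is touched, which is the behaviour this patch adds for array
      maps.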
      
      Fixes: 6c905981 ("bpf: pre-allocate hash map elements")
      Fixes: a10423b8 ("bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map")
      Reported-by: syzbot+adb03f3f0bb57ce3acda@syzkaller.appspotmail.com
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Dennis Zhou <dennisszhou@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      9c2d63b8
  14. 15 Feb, 2018 2 commits
  15. 14 Feb, 2018 3 commits
  16. 13 Feb, 2018 7 commits
  17. 12 Feb, 2018 1 commit
    • L
      vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Linus Torvalds committed
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
      values as the POLL* constants do.  But the keyword here is "almost".
      For various bad reasons they aren't the same, and epoll() doesn't
      actually work quite correctly in some cases due to this on Sparc et al.
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
      Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9a08845
  18. 08 Feb, 2018 2 commits
  19. 07 Feb, 2018 1 commit
    • A
      x86: hibernate: fix swsusp_arch_resume() prototype · 168b6511
      Arnd Bergmann committed
      The declaration for swsusp_arch_resume() marks it as 'asmlinkage',
      but the definition in x86-32 does not, and it fails to include
      the header with the declaration.  This leads to a warning when
      building with link-time-optimizations:
      
      kernel/power/power.h:108:23: error: type of 'swsusp_arch_resume' does not match original declaration [-Werror=lto-type-mismatch]
       extern asmlinkage int swsusp_arch_resume(void);
                             ^
      arch/x86/power/hibernate_32.c:148:0: note: 'swsusp_arch_resume' was previously declared here
       int swsusp_arch_resume(void)
      
      This moves the declaration into a globally visible header file
      and fixes up both x86 definitions to match it.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      168b6511