1. 20 Aug 2017 (1 commit)
    • bpf: Allow selecting numa node during map creation · 96eabe7a
      Authored by Martin KaFai Lau
      The current map creation API does not allow providing a numa-node
      preference.  The memory usually comes from the node where the
      map-creating process is running, so performance is not ideal if the
      bpf_prog is known to always run on a numa node different from that
      of the map-creating process.
      
      One of the use cases is sharding across CPUs into different LRU maps
      (i.e. an array of LRU maps).  Here is the test result of
      map_perf_test on the INNER_LRU_HASH_PREALLOC test if we force the
      LRU map used by CPU0 to be allocated from a remote numa node:
      
      [ The machine has 20 cores. CPU0-9 at node 0. CPU10-19 at node 1 ]
      
      ># taskset -c 10 ./map_perf_test 512 8 1260000 8000000
      5:inner_lru_hash_map_perf pre-alloc 1628380 events per sec
      4:inner_lru_hash_map_perf pre-alloc 1626396 events per sec
      3:inner_lru_hash_map_perf pre-alloc 1626144 events per sec
      6:inner_lru_hash_map_perf pre-alloc 1621657 events per sec
      2:inner_lru_hash_map_perf pre-alloc 1621534 events per sec
      1:inner_lru_hash_map_perf pre-alloc 1620292 events per sec
      7:inner_lru_hash_map_perf pre-alloc 1613305 events per sec
      0:inner_lru_hash_map_perf pre-alloc 1239150 events per sec  #<<<
      
      After specifying numa node:
      ># taskset -c 10 ./map_perf_test 512 8 1260000 8000000
      5:inner_lru_hash_map_perf pre-alloc 1629627 events per sec
      3:inner_lru_hash_map_perf pre-alloc 1628057 events per sec
      1:inner_lru_hash_map_perf pre-alloc 1623054 events per sec
      6:inner_lru_hash_map_perf pre-alloc 1616033 events per sec
      2:inner_lru_hash_map_perf pre-alloc 1614630 events per sec
      4:inner_lru_hash_map_perf pre-alloc 1612651 events per sec
      7:inner_lru_hash_map_perf pre-alloc 1609337 events per sec
      0:inner_lru_hash_map_perf pre-alloc 1619340 events per sec #<<<
      
      This patch adds one field, numa_node, to the bpf_attr.  Since numa node 0
      is a valid node, a new flag BPF_F_NUMA_NODE is also added.  The numa_node
      field is honored if and only if the BPF_F_NUMA_NODE flag is set.
      
      Numa node selection is not supported for percpu map.
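
      For illustration, here is a minimal userspace sketch of creating a
      map on a chosen node through the new attribute; it assumes headers
      from a kernel carrying this patch, and the map type and sizes are
      arbitrary:

          #include <linux/bpf.h>
          #include <string.h>
          #include <sys/syscall.h>
          #include <unistd.h>

          /* Create a hash map whose memory is allocated on NUMA node 'node'. */
          static int create_map_on_node(int node)
          {
              union bpf_attr attr;

              memset(&attr, 0, sizeof(attr));
              attr.map_type    = BPF_MAP_TYPE_HASH;
              attr.key_size    = 4;
              attr.value_size  = 8;
              attr.max_entries = 1024;
              attr.map_flags   = BPF_F_NUMA_NODE; /* numa_node honored iff set */
              attr.numa_node   = node;            /* node 0 is valid, hence the flag */

              return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
          }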
      
      This patch does not change every kmalloc call.  For example,
      'htab = kzalloc()' is left unchanged since the object is small
      enough to stay in the cache.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 19 Aug 2017 (2 commits)
  3. 18 Aug 2017 (2 commits)
  4. 17 Aug 2017 (6 commits)
  5. 16 Aug 2017 (2 commits)
    • bpf: fix bpf_trace_printk on 32 bit archs · 88a5c690
      Authored by Daniel Borkmann
      James reported that on MIPS32 bpf_trace_printk() is currently
      broken while MIPS64 works fine:
      
        bpf_trace_printk() uses conditional operators to attempt to
        pass different types to __trace_printk() depending on the
        format operators. This doesn't work as intended on 32-bit
        architectures where u32 and long are passed differently to
        u64, since the result of C conditional operators follows the
        "usual arithmetic conversions" rules, such that the values
        passed to __trace_printk() will always be u64 [causing issues
        later in the va_list handling for vscnprintf()].
      
        For example the samples/bpf/tracex5 test printed lines like
        below on MIPS32, where the fd and buf have come from the u64
        fd argument, and the size from the buf argument:
      
          [...] 1180.941542: 0x00000001: write(fd=1, buf=  (null), size=6258688)
      
        Instead of this:
      
          [...] 1625.616026: 0x00000001: write(fd=1, buf=009e4000, size=512)
      
      One way to get it working is to expand the various combinations of
      argument types into 8 different combinations for 32-bit and 64-bit
      kernels.  James tested the fix on both MIPS32 and MIPS64 and
      confirmed that it resolves the issue.
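
      The underlying C behavior can be reproduced with a tiny standalone
      demo (illustrative only, not taken from the patch):

          #include <stdio.h>
          #include <stdint.h>

          int main(void)
          {
              uint32_t u32_val = 1;
              uint64_t u64_val = 1;
              int pick_u32 = 1;

              /* Usual arithmetic conversions: the result type of ?: is
               * uint64_t even when the uint32_t arm is selected, so a
               * varargs callee like __trace_printk() always receives 8
               * bytes for this argument. */
              printf("sizeof(?:) = %zu\n",
                     sizeof(pick_u32 ? u32_val : u64_val)); /* prints 8 */
              return 0;
          }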
      
      Fixes: 9c959c86 ("tracing: Allow BPF programs to call bpf_trace_printk()")
      Reported-by: James Hogan <james.hogan@imgtec.com>
      Tested-by: James Hogan <james.hogan@imgtec.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf/verifier: track liveness for pruning · dc503a8a
      Authored by Edward Cree
      The state of a register doesn't matter if it wasn't read on the way
      to an exit; a write screens off all reads downstream of it from all
      explored_states upstream of it.  This allows us to prune many more
      branches.  Here are processed insn counts for some Cilium programs:
      Program                  before  after
      bpf_lb_opt_-DLB_L3.o       6515   3361
      bpf_lb_opt_-DLB_L4.o       8976   5176
      bpf_lb_opt_-DUNKNOWN.o     2960   1137
      bpf_lxc_opt_-DDROP_ALL.o  95412  48537
      bpf_lxc_opt_-DUNKNOWN.o  141706  78718
      bpf_netdev.o              24251  17995
      bpf_overlay.o             10999   9385
      
      The runtime is also improved; here are 'time' results in ms:
      Program                  before  after
      bpf_lb_opt_-DLB_L3.o         24      6
      bpf_lb_opt_-DLB_L4.o         26     11
      bpf_lb_opt_-DUNKNOWN.o       11      2
      bpf_lxc_opt_-DDROP_ALL.o   1288    139
      bpf_lxc_opt_-DUNKNOWN.o    1768    234
      bpf_netdev.o                 62     31
      bpf_overlay.o                15     13
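
      Conceptually, the pruning test only needs to compare registers that
      were proven live; below is a standalone sketch of that idea, far
      simpler than the real verifier and with invented names:

          #include <stdbool.h>
          #include <stdint.h>

          #define NR_REGS 11

          struct reg_state {
              uint64_t min, max;
              bool live_read; /* read on some path to an exit? */
          };

          /* The current branch can be pruned if, for every register the
           * explored state proved was read, the current state is at least
           * as constrained; registers never read (or written first) are
           * skipped, since their values cannot influence the outcome. */
          static bool state_prunable(const struct reg_state *old,
                                     const struct reg_state *cur)
          {
              for (int i = 0; i < NR_REGS; i++) {
                  if (!old[i].live_read)
                      continue; /* dead register: value cannot matter */
                  if (cur[i].min < old[i].min || cur[i].max > old[i].max)
                      return false;
              }
              return true;
          }
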
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 11 Aug 2017 (2 commits)
    • mm: migrate: prevent racy access to tlb_flush_pending · 16af97dc
      Authored by Nadav Amit
      Patch series "fixes of TLB batching races", v6.
      
      It turns out that the Linux TLB batching mechanism suffers from
      various races.  Races caused by batching during reclamation were
      recently handled by Mel, and this patch set deals with the others.
      The more fundamental issue is that concurrent updates of the page
      tables allow TLB flushes to be batched on one core while another
      core changes the page tables.  That other core may assume a PTE
      change does not require a flush based on the updated PTE value,
      while it is unaware that TLB flushes are still pending.
      
      This behavior affects KSM (which may result in memory corruption) and
      MADV_FREE and MADV_DONTNEED (which may result in incorrect behavior).  A
      proof-of-concept can easily produce the wrong behavior of MADV_DONTNEED.
      Memory corruption in KSM is harder to produce in practice, but was
      observed by hacking the kernel and adding a delay before flushing and
      replacing the KSM page.
      
      Finally, there is also one memory barrier missing, which may affect
      architectures with a weak memory model.
      
      This patch (of 7):
      
      Setting and clearing mm->tlb_flush_pending can be performed by multiple
      threads, since mmap_sem may only be acquired for read in
      task_numa_work().  If this happens, tlb_flush_pending might be cleared
      while one of the threads still changes PTEs and batches TLB flushes.
      
      This can lead to the same race between migration and
      change_protection_range() that led to the introduction of
      tlb_flush_pending.  The result of this race was data corruption, which
      means that this patch also addresses a theoretically possible data
      corruption.
      
      An actual data corruption was not observed, yet the race was
      confirmed by adding an assertion to check that tlb_flush_pending is
      not set by two threads, adding artificial latency in
      change_protection_range(), and using sysctl to reduce
      kernel.numa_balancing_scan_delay_ms.
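
      A standalone sketch of the counting approach the fix moves to, with
      illustrative types and names rather than the kernel's own:

          #include <stdatomic.h>
          #include <stdbool.h>

          struct mm_sketch {
              atomic_int tlb_flush_pending; /* a counter, not a bool */
          };

          /* Each thread batching flushes holds its own pending reference,
           * so one thread finishing can no longer hide flushes that
           * another thread still has pending. */
          static void inc_tlb_flush_pending(struct mm_sketch *mm)
          {
              atomic_fetch_add(&mm->tlb_flush_pending, 1);
          }

          static void dec_tlb_flush_pending(struct mm_sketch *mm)
          {
              atomic_fetch_sub(&mm->tlb_flush_pending, 1);
          }

          static bool mm_tlb_flush_pending(struct mm_sketch *mm)
          {
              return atomic_load(&mm->tlb_flush_pending) > 0;
          }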
      
      Link: http://lkml.kernel.org/r/20170802000818.4760-2-namit@vmware.com
      Fixes: 20841405 ("mm: fix TLB flush race between migration, and change_protection_range")
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix global NR_SLAB_.*CLAIMABLE counter reads · d507e2eb
      Authored by Johannes Weiner
      As Tetsuo points out:
       "Commit 385386cf ("mm: vmstat: move slab statistics from zone to
        node counters") broke the "Slab:" field of /proc/meminfo.  It
        shows nearly 0kB"
      
      In addition to /proc/meminfo, this problem also affects the slab
      counters in OOM/allocation-failure info dumps, can cause early
      -ENOMEM from overcommit protection, and miscalculates image size
      requirements during suspend-to-disk.
      
      This is because the patch in question switched the slab counters from
      the zone level to the node level, but forgot to update the global
      accessor functions to read the aggregate node data instead of the
      aggregate zone data.
      
      Use global_node_page_state() to access the global slab counters.
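
      A hedged sketch of the corrected accessor usage in kernel context
      (simplified, not the literal patch hunk):

          #include <linux/mmzone.h>
          #include <linux/vmstat.h>

          /* Report slab usage from the node-level counters, which is
           * where commit 385386cf moved the statistics. */
          static unsigned long total_slab_pages(void)
          {
              return global_node_page_state(NR_SLAB_RECLAIMABLE) +
                     global_node_page_state(NR_SLAB_UNRECLAIMABLE);
          }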
      
      Fixes: 385386cf ("mm: vmstat: move slab statistics from zone to node counters")
      Link: http://lkml.kernel.org/r/20170801134256.5400-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Stefan Agner <stefan@agner.ch>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 10 Aug 2017 (3 commits)
    • bpf: enable BPF_J{LT, LE, SLT, SLE} opcodes in verifier · b4e432f1
      Authored by Daniel Borkmann
      Enable the newly added jump opcodes; the main parts are in two
      different areas, namely direct packet access and dynamic map value
      access.  For direct packet access, we now allow the following two
      new patterns to match in order to trigger markings with
      find_good_pkt_pointers():
      
      Variant 1 (access ok when taking the branch):
      
        0: (61) r2 = *(u32 *)(r1 +76)
        1: (61) r3 = *(u32 *)(r1 +80)
        2: (bf) r0 = r2
        3: (07) r0 += 8
        4: (ad) if r0 < r3 goto pc+2
        R0=pkt(id=0,off=8,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
        R3=pkt_end R10=fp
        5: (b7) r0 = 0
        6: (95) exit
      
        from 4 to 7: R0=pkt(id=0,off=8,r=8) R1=ctx
                     R2=pkt(id=0,off=0,r=8) R3=pkt_end R10=fp
        7: (71) r0 = *(u8 *)(r2 +0)
        8: (05) goto pc-4
        5: (b7) r0 = 0
        6: (95) exit
        processed 11 insns, stack depth 0
      
      Variant 2 (access ok on fall-through):
      
        0: (61) r2 = *(u32 *)(r1 +76)
        1: (61) r3 = *(u32 *)(r1 +80)
        2: (bf) r0 = r2
        3: (07) r0 += 8
        4: (bd) if r3 <= r0 goto pc+1
        R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8)
        R3=pkt_end R10=fp
        5: (71) r0 = *(u8 *)(r2 +0)
        6: (b7) r0 = 1
        7: (95) exit
      
        from 4 to 6: R0=pkt(id=0,off=8,r=0) R1=ctx
                     R2=pkt(id=0,off=0,r=0) R3=pkt_end R10=fp
        6: (b7) r0 = 1
        7: (95) exit
        processed 10 insns, stack depth 0
      
      The above two basically just swap the branches where we need
      to handle an exception and allow packet access compared to the
      two already existing variants for find_good_pkt_pointers().
      
      For the dynamic map value access, we add the new instructions
      to reg_set_min_max() and reg_set_min_max_inv() in order to
      learn bounds. Verifier test cases for both are added in a
      follow-up patch.
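
      For reference, a hedged sketch of restricted C whose bounds check
      can compile into the variant 1 pattern above once the compiler emits
      the new opcodes; the program body is illustrative:

          #include <linux/bpf.h>

          int xdp_sketch(struct xdp_md *ctx)
          {
              void *data     = (void *)(long)ctx->data;
              void *data_end = (void *)(long)ctx->data_end;

              /* May compile to "if r0 < r3" (BPF_JLT) instead of the
               * swapped "if r3 > r0" form; the verifier now marks the
               * packet pointers on the taken branch. */
              if (data + 8 < data_end)
                  return *(unsigned char *)data;
              return 0;
          }
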
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add BPF_J{LT,LE,SLT,SLE} instructions · 92b31a9a
      Authored by Daniel Borkmann
      Currently, eBPF only understands the BPF_JGT (>), BPF_JGE (>=),
      BPF_JSGT (s>), and BPF_JSGE (s>=) instructions.  This means that the
      *JLT/*JLE counterparts involving immediates in particular need to be
      rewritten from e.g. X < [IMM] by swapping arguments into [IMM] > X,
      meaning the immediate must first be loaded into a register
      Y := [IMM] so that we can then compare with Y > X.  Note that the
      destination operand is always required to be a register.
      
      This has the downside of unnecessarily increased register pressure,
      meaning a complex program would need to temporarily spill other
      registers to the stack in order to obtain an unused register for the
      [IMM].  Loading into registers also affects state pruning, since we
      need to account for that register use and potentially for the
      registers that had to be spilled/filled again.  As a consequence,
      slightly more stack space might be used due to spilling, and BPF
      programs end up a bit longer due to the extra code for the register
      load and the potentially required spills/fills.
      
      Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), and BPF_JSLE
      (s<=) counterparts to the eBPF instruction set.  Modifying LLVM to
      remove the NegateCC() workaround in a PoC patch at [1] and allowing
      it to also emit the new instructions resulted in cilium's BPF
      programs that are injected into the fast path having a reduced
      program length in the range of 2-3% (e.g. accumulated main and tail
      call sections from one of the object files reduced from 4864 to 4729
      insns), reduced complexity in the range of 10-30% (e.g. accumulated
      sections reduced in one of the cases from 116432 to 88428 insns),
      and reduced stack usage in the range of 1-5% (e.g. accumulated
      sections from one of the object files reduced from 824 to 784
      bytes).
      
      The modification for LLVM will be incorporated in a backwards
      compatible way.  The plan is for LLVM to have i) a target-specific
      option offering the possibility to explicitly enable the extension
      (as we have today with -m target-specific extensions for various CPU
      insns), and ii) to probe the kernel for the presence of the
      extensions and enable them transparently when the user selects more
      aggressive options such as -march=native in a bpf target context.
      (Other frontends generating BPF byte code, e.g. ply, can probe the
      kernel directly for its code generation.)
      
        [1] https://github.com/borkmann/llvm/tree/bpf-insns
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • futex: Remove unnecessary warning from get_futex_key · 48fb6f4d
      Authored by Mel Gorman
      Commit 65d8fc77 ("futex: Remove requirement for lock_page() in
      get_futex_key()") removed an unnecessary lock_page() with the
      side-effect that page->mapping needed to be treated very carefully.
      
      Two defensive warnings were added in case any assumption had been
      missed; the first assumed a correct application would not alter a
      mapping backing a futex key.  Since merging, it has not triggered
      for any unexpected case, but Mark Rutland reported the following bug
      triggering due to the first warning.
      
        kernel BUG at kernel/futex.c:679!
        Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
        Modules linked in:
        CPU: 0 PID: 3695 Comm: syz-executor1 Not tainted 4.13.0-rc3-00020-g307fec773ba3 #3
        Hardware name: linux,dummy-virt (DT)
        task: ffff80001e271780 task.stack: ffff000010908000
        PC is at get_futex_key+0x6a4/0xcf0 kernel/futex.c:679
        LR is at get_futex_key+0x6a4/0xcf0 kernel/futex.c:679
        pc : [<ffff00000821ac14>] lr : [<ffff00000821ac14>] pstate: 80000145
      
      The fact that it's a bug instead of a warning was due to an unrelated
      arm64 problem, but the warning itself triggered because the underlying
      mapping changed.
      
      This is an application issue, but from a kernel perspective it's a
      recoverable situation and the warning is unnecessary, so this patch
      removes the warning.  The warning may potentially be triggered with
      the following test program from Mark, although it may be necessary
      to adjust NR_FUTEX_THREADS to be a value smaller than the number of
      CPUs in the system.
      
          #include <linux/futex.h>
          #include <pthread.h>
          #include <stdio.h>
          #include <stdlib.h>
          #include <sys/mman.h>
          #include <sys/syscall.h>
          #include <sys/time.h>
          #include <unistd.h>
      
          #define NR_FUTEX_THREADS 16
          pthread_t threads[NR_FUTEX_THREADS];
      
          void *mem;
      
          #define MEM_PROT  (PROT_READ | PROT_WRITE)
          #define MEM_SIZE  65536
      
          static int futex_wrapper(int *uaddr, int op, int val,
                                   const struct timespec *timeout,
                                   int *uaddr2, int val3)
          {
              /* thin wrapper: glibc provides no futex() entry point */
              return syscall(SYS_futex, uaddr, op, val, timeout, uaddr2, val3);
          }
      
          void *poll_futex(void *unused)
          {
              for (;;) {
                  futex_wrapper(mem, FUTEX_CMP_REQUEUE_PI, 1, NULL, mem + 4, 1);
              }
          }
      
          int main(int argc, char *argv[])
          {
              int i;
      
              mem = mmap(NULL, MEM_SIZE, MEM_PROT,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
      
              printf("Mapping @ %p\n", mem);
      
              printf("Creating futex threads...\n");
      
              for (i = 0; i < NR_FUTEX_THREADS; i++)
                  pthread_create(&threads[i], NULL, poll_futex, NULL);
      
              printf("Flipping mapping...\n");
              for (;;) {
                  mmap(mem, MEM_SIZE, MEM_PROT,
                       MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, -1, 0);
              }
      
              return 0;
          }
      Reported-and-tested-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: stable@vger.kernel.org # 4.7+
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 09 Aug 2017 (6 commits)
  9. 08 Aug 2017 (2 commits)
    • bpf: devmap fix mutex in rcu critical section · 4cc7b954
      Authored by John Fastabend
      Originally we used a mutex to protect concurrent devmap update
      and delete operations from racing with netdev unregister notifier
      callbacks.
      
      The notifier hook is needed because we increment the netdev ref
      count when a dev is added to the devmap. This ensures the netdev
      reference is valid in the datapath. However, we don't want to block
      unregister events, hence the initial mutex and notifier handler.
      
      The concern was that in the notifier hook we search the map for dev
      entries that hold a refcnt on the net device being torn down.  But
      in order to do this we require two steps:

        (i) dereference the netdev:  dev = rcu_dereference(map[i])
       (ii) test the ifindex:        dev->ifindex == removing_ifindex
      
      and then finally we can swap in the NULL dev in the map via an
      xchg operation,
      
        xchg(map[i], NULL)
      
      The danger here is that a concurrent update could run a different
      xchg op at the same time, leading us to incorrectly replace the new
      dev with a NULL dev.
      
            CPU 1                        CPU 2
      
         notifier hook                   bpf devmap update
      
         dev = rcu_dereference(map[i])
                                         dev = rcu_dereference(map[i])
                                    xchg(map[i], new_dev);
                                         rcu_call(dev,...)
         xchg(map[i], NULL)
      
      The above flow would create the incorrect state with the dev
      reference in the update path being lost. To resolve this the
      original code used a mutex around the above block. However,
      updates, deletes, and lookups occur inside rcu critical sections
      so we can't use a mutex in this context safely.
      
      Fortunately, by writing slightly better code we can avoid the mutex
      altogether.  If CPU 1 in the above example uses a cmpxchg and _only_
      replaces the dev reference in the map when it is in fact the
      expected dev, the race is removed completely.  Two cases are
      illustrated here; first, the race condition:
      
            CPU 1                          CPU 2
      
         notifier hook                     bpf devmap update
      
         dev = rcu_dereference(map[i])
                                           dev = rcu_dereference(map[i])
                                      xchg(map[i], new_dev);
                                           rcu_call(dev,...)
         odev = cmpxchg(map[i], dev, NULL)
      
      Now we can test the cmpxchg return value, detect odev != dev and
      abort. Or in the good case,
      
            CPU 1                          CPU 2
      
         notifier hook                     bpf devmap update
         dev = rcu_dereference(map[i])
         odev = cmpxchg(map[i], dev, NULL)
                                           [...]
      
      Now 'odev == dev' and we can do proper cleanup.
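
      A standalone analogue of the cmpxchg-based removal, using C11
      atomics in place of the kernel primitive and names invented for the
      sketch:

          #include <stdatomic.h>
          #include <stddef.h>

          struct net_device; /* opaque here */

          /* Clear the slot only if it still holds the device being
           * unregistered; a concurrent update that already installed a
           * new device leaves the slot untouched. */
          static void devmap_clear_slot(_Atomic(struct net_device *) *slot,
                                        struct net_device *dying)
          {
              struct net_device *expected = dying;

              if (atomic_compare_exchange_strong(slot, &expected, NULL)) {
                  /* we removed our reference; free is RCU-deferred in-kernel */
              }
              /* else: odev != dev, abort as described above */
          }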
      
      And voila, the original race we tried to solve with a mutex is
      corrected, and the trace noted by Sasha below is resolved thanks to
      the removal of the mutex.
      
      Note: When walking the devmap and removing dev references as needed
      we depend on the core to fail any calls to dev_get_by_index() using
      the ifindex of the device being removed. This way we do not race with
      the user while searching the devmap.
      
      Additionally, the mutex was also protecting list add/del/read on
      the list of maps in-use. This patch converts this to an RCU list
      and spinlock implementation. This protects the list from concurrent
      alloc/free operations. The notifier hook walks this list so it uses
      RCU read semantics.
      
      BUG: sleeping function called from invalid context at kernel/locking/mutex.c:747
      in_atomic(): 1, irqs_disabled(): 0, pid: 16315, name: syz-executor1
      1 lock held by syz-executor1/16315:
       #0:  (rcu_read_lock){......}, at: [<ffffffff8c363bc2>] map_delete_elem kernel/bpf/syscall.c:577 [inline]
       #0:  (rcu_read_lock){......}, at: [<ffffffff8c363bc2>] SYSC_bpf kernel/bpf/syscall.c:1427 [inline]
       #0:  (rcu_read_lock){......}, at: [<ffffffff8c363bc2>] SyS_bpf+0x1d32/0x4ba0 kernel/bpf/syscall.c:1388
      
      Fixes: 2ddf71e2 ("net: add notifier hooks for devmap bpf map")
      Reported-by: Sasha Levin <alexander.levin@verizon.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add support for sys_enter_* and sys_exit_* tracepoints · cf5f5cea
      Authored by Yonghong Song
      Currently, bpf programs cannot be attached to sys_enter_* and sys_exit_*
      style tracepoints. The iovisor/bcc issue #748
      (https://github.com/iovisor/bcc/issues/748) documents this issue.
      For example, if you try to attach a bpf program to the tracepoint
      syscalls/sys_enter_newfstat, you will get the following error:
         # ./tools/trace.py t:syscalls:sys_enter_newfstat
         Ioctl(PERF_EVENT_IOC_SET_BPF): Invalid argument
         Failed to attach BPF to tracepoint
      
      The main reason is that the syscalls/sys_enter_* and
      syscalls/sys_exit_* tracepoints are treated differently from other
      tracepoints, and there is no bpf hook for them.
      
      This patch adds bpf support for these syscalls tracepoints by
        . permitting bpf attachment in ioctl PERF_EVENT_IOC_SET_BPF
        . calling bpf programs in perf_syscall_enter and perf_syscall_exit
      
      The legality of bpf program ctx access is also checked.  The
      function trace_event_get_offsets() returns the correct max offset
      for each specific syscall tracepoint, which is compared against the
      maximum offset accessed by the bpf program.
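
      As a hedged userspace sketch, the attach path this patch unblocks
      looks roughly like the following; error handling is trimmed and the
      tracepoint id is assumed to be read beforehand from
      /sys/kernel/debug/tracing/events/syscalls/sys_enter_newfstat/id:

          #include <linux/perf_event.h>
          #include <string.h>
          #include <sys/ioctl.h>
          #include <sys/syscall.h>
          #include <unistd.h>

          static int attach_bpf_to_syscall_tp(int tp_id, int bpf_prog_fd)
          {
              struct perf_event_attr attr;
              int efd;

              memset(&attr, 0, sizeof(attr));
              attr.type        = PERF_TYPE_TRACEPOINT;
              attr.size        = sizeof(attr);
              attr.config      = tp_id;
              attr.sample_type = PERF_SAMPLE_RAW;

              efd = syscall(__NR_perf_event_open, &attr,
                            -1 /* pid */, 0 /* cpu */, -1 /* group */, 0);
              if (efd < 0)
                  return -1;

              /* Before this patch, this ioctl failed with EINVAL for
               * the syscall tracepoints. */
              if (ioctl(efd, PERF_EVENT_IOC_SET_BPF, bpf_prog_fd) < 0 ||
                  ioctl(efd, PERF_EVENT_IOC_ENABLE, 0) < 0) {
                  close(efd);
                  return -1;
              }
              return efd;
          }
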
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 07 Aug 2017 (1 commit)
  11. 03 Aug 2017 (2 commits)
    • cpuset: fix a deadlock due to incomplete patching of cpusets_enabled() · 89affbf5
      Authored by Dima Zavin
      In codepaths that use the begin/retry interface for reading
      mems_allowed_seq with irqs disabled, there exists a race condition that
      stalls the patch process after only modifying a subset of the
      static_branch call sites.
      
      This problem manifested itself as a deadlock in the slub allocator,
      inside get_any_partial.  The loop reads the mems_allowed_seq value
      (via read_mems_allowed_begin), performs the defrag operation, and
      then verifies the consistency of mems_allowed via
      read_mems_allowed_retry and the cookie returned by xxx_begin.
      
      The issue here is that both begin and retry first check whether
      cpusets are enabled via the cpusets_enabled() static branch.  This
      branch can be rewritten dynamically (via cpuset_inc) if a new cpuset
      is created.  The x86 jump label code fully synchronizes across all
      CPUs for every entry it rewrites.  If it rewrites only one of the
      call sites (specifically the one in read_mems_allowed_retry) and
      then waits for the smp_call_function(do_sync_core) to complete while
      a CPU is inside the begin/retry section with IRQs off and the
      mems_allowed value is changed, we can hang.
      
      This is because begin() will always return 0 (since it wasn't patched
      yet) while retry() will test the 0 against the actual value of the seq
      counter.
      
      The fix is to use two different static keys: one for begin
      (pre_enable_key) and one for retry (enable_key).  In cpuset_inc(),
      we first bump the pre_enable key to ensure that
      cpuset_mems_allowed_begin() always returns a valid seqcount when we
      are enabling cpusets.  Similarly, when disabling cpusets via
      cpuset_dec(), we first ensure that callers of
      cpuset_mems_allowed_retry() will start ignoring the seqcount value
      before we let cpuset_mems_allowed_begin() return 0.
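
      A standalone sketch of the two-key ordering, with plain atomics
      standing in for static keys; the shapes follow the commit text, not
      the exact kernel code:

          #include <stdatomic.h>
          #include <stdbool.h>

          static atomic_bool pre_enable_key; /* gates the begin() side */
          static atomic_bool enable_key;     /* gates the retry() side */
          static atomic_uint mems_allowed_seq;

          static unsigned read_mems_allowed_begin(void)
          {
              if (!atomic_load(&pre_enable_key))
                  return 0; /* cpusets off: dummy cookie */
              return atomic_load(&mems_allowed_seq);
          }

          static bool read_mems_allowed_retry(unsigned seq)
          {
              if (!atomic_load(&enable_key))
                  return false; /* ignore the cookie entirely */
              return atomic_load(&mems_allowed_seq) != seq;
          }

          static void cpuset_inc(void)
          {
              /* begin() must hand out real cookies before retry() ever
               * starts comparing against them */
              atomic_store(&pre_enable_key, true);
              atomic_store(&enable_key, true);
          }

          static void cpuset_dec(void)
          {
              /* reverse order: retry() stops checking before begin()
               * returns 0 again */
              atomic_store(&enable_key, false);
              atomic_store(&pre_enable_key, false);
          }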
      
      The relevant stack traces of the two stuck threads:
      
        CPU: 1 PID: 1415 Comm: mkdir Tainted: G L  4.9.36-00104-g540c51286237 #4
        Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
        task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
        RIP: smp_call_function_many+0x1f9/0x260
        Call Trace:
          smp_call_function+0x3b/0x70
          on_each_cpu+0x2f/0x90
          text_poke_bp+0x87/0xd0
          arch_jump_label_transform+0x93/0x100
          __jump_label_update+0x77/0x90
          jump_label_update+0xaa/0xc0
          static_key_slow_inc+0x9e/0xb0
          cpuset_css_online+0x70/0x2e0
          online_css+0x2c/0xa0
          cgroup_apply_control_enable+0x27f/0x3d0
          cgroup_mkdir+0x2b7/0x420
          kernfs_iop_mkdir+0x5a/0x80
          vfs_mkdir+0xf6/0x1a0
          SyS_mkdir+0xb7/0xe0
          entry_SYSCALL_64_fastpath+0x18/0xad
      
        ...
      
        CPU: 2 PID: 1 Comm: init Tainted: G L  4.9.36-00104-g540c51286237 #4
        Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
        task: ffff8818087c0000 task.stack: ffffc90000030000
        RIP: int3+0x39/0x70
        Call Trace:
          <#DB> ? ___slab_alloc+0x28b/0x5a0
          <EOE> ? copy_process.part.40+0xf7/0x1de0
          __slab_alloc.isra.80+0x54/0x90
          copy_process.part.40+0xf7/0x1de0
          copy_process.part.40+0xf7/0x1de0
          kmem_cache_alloc_node+0x8a/0x280
          copy_process.part.40+0xf7/0x1de0
          _do_fork+0xe7/0x6c0
          _raw_spin_unlock_irq+0x2d/0x60
          trace_hardirqs_on_caller+0x136/0x1d0
          entry_SYSCALL_64_fastpath+0x5/0xad
          do_syscall_64+0x27/0x350
          SyS_clone+0x19/0x20
          do_syscall_64+0x60/0x350
          entry_SYSCALL64_slow_path+0x25/0x25
      
      Link: http://lkml.kernel.org/r/20170731040113.14197-1-dmitriyz@waymo.com
      Fixes: 46e700ab ("mm, page_alloc: remove unnecessary taking of a seqlock when cpusets are disabled")
      Signed-off-by: Dima Zavin <dmitriyz@waymo.com>
      Reported-by: Cliff Spradlin <cspradlin@waymo.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • pid: kill pidhash_size in pidhash_init() · 27e37d84
      Authored by Kefeng Wang
      After commit 3d375d78 ("mm: update callers to use HASH_ZERO flag"),
      drop unused pidhash_size in pidhash_init().
      
      Link: http://lkml.kernel.org/r/1500389267-49222-1-git-send-email-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Pavel Tatashin <Pasha.Tatashin@Oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 01 Aug 2017 (1 commit)
  13. 30 Jul 2017 (2 commits)
  14. 28 Jul 2017 (1 commit)
    • workqueue: Work around edge cases for calc of pool's cpumask · 1ad0f0a7
      Authored by Michael Bringmann
      There is an underlying assumption/trade-off in many layers of the Linux
      system that CPU <-> node mapping is static.  This is despite the presence
      of features like NUMA and 'hotplug' that support the dynamic addition/
      removal of fundamental system resources like CPUs and memory.  PowerPC
      systems, however, do provide extensive features for the dynamic change
      of resources available to a system.
      
      Currently, there is little or no synchronization protection around the
      updating of the CPU <-> node mapping, and the export/update of this
      information for other layers / modules.  In systems which can change
      this mapping during 'hotplug', like PowerPC, the information is changing
      underneath all layers that might reference it.
      
      This patch attempts to ensure that a valid, usable cpumask attribute
      is used by the workqueue infrastructure when setting up new resource
      pools.  It prevents a crash that has been observed when an 'empty'
      cpumask is passed along to the worker/task scheduling code.  It is
      intended as a temporary workaround until a more fundamental review and
      correction of the issue can be done.
      
      [With additions to the patch provided by Tejun Heo <tj@kernel.org>]
      Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  15. 27 Jul 2017 (1 commit)
    • genirq/cpuhotplug: Revert "Set force affinity flag on hotplug migration" · 83979133
      Authored by Thomas Gleixner
      That commit was part of the changes moving x86 to the generic CPU hotplug
      interrupt migration code. The force flag was required on x86 before the
      hierarchical irqdomain rework, but invoking set_affinity() with force=true
      stayed and had no side effects.
      
      At some point in the past, the force flag got repurposed to support the
      exynos timer interrupt affinity setting to a not yet online CPU, so the
      interrupt controller callback does not verify the supplied affinity mask
      against cpu_online_mask.
      
      Setting the flag in the CPU hotplug code causes the cpu online
      masking to be blocked on these irq controllers and results in
      potentially affining an interrupt to the CPU which is being
      unplugged, i.e. instead of moving it away, it's just reassigned to
      it.
      
      As the force flag is no longer needed on x86, it's safe to revert
      that patch so the ARM irqchips which use the force flag work again.
      
      Add comments to that effect, so this won't happen again.
      
      Note: The online mask handling should be done in the generic code,
      and the force flag and the masking in the irq chips removed
      altogether, but that's not a change possible for 4.13.
      
      Fixes: 77f85e66 ("genirq/cpuhotplug: Set force affinity flag on hotplug migration")
      Reported-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: LAK <linux-arm-kernel@lists.infradead.org>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1707271217590.3109@nanos
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  16. 26 Jul 2017 (1 commit)
    • workqueue: implicit ordered attribute should be overridable · 0a94efb5
      Authored by Tejun Heo
      5c0338c6 ("workqueue: restore WQ_UNBOUND/max_active==1 to be
      ordered") automatically enabled the ordered attribute for unbound
      workqueues with max_active == 1.  Because ordered workqueues reject
      max_active and some attribute changes, this implicit ordered mode
      broke cases where the user creates an unbound workqueue with
      max_active == 1 and later explicitly changes the related attributes.
      
      This patch distinguishes between explicit and implicit ordered
      settings and lets attribute changes override the ordered mode when
      it is implicit.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: 5c0338c6 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")
  17. 25 Jul 2017 (3 commits)
  18. 23 Jul 2017 (1 commit)
  19. 21 Jul 2017 (1 commit)
    • perf/core: Fix locking for children siblings group read · 2aeb1883
      Authored by Jiri Olsa
      We're missing the ctx lock when iterating children's siblings within
      the perf_read path for group reading.  The following race and crash
      can happen:
      
      User space doing read syscall on event group leader:
      
      T1:
        perf_read
          lock event->ctx->mutex
          perf_read_group
            lock leader->child_mutex
            __perf_read_group_add(child)
              list_for_each_entry(sub, &leader->sibling_list, group_entry)
      
      ---->   sub might be invalid at this point, because it could
              get removed via perf_event_exit_task_context in T2
      
      Child exiting and cleaning up its events:
      
      T2:
        perf_event_exit_task_context
          lock ctx->mutex
          list_for_each_entry_safe(child_event, next, &child_ctx->event_list,...
            perf_event_exit_event(child)
              lock ctx->lock
              perf_group_detach(child)
              unlock ctx->lock
      
      ---->   child is removed from sibling_list without any sync
              with T1 path above
      
              ...
              free_event(child)
      
      Before the child is removed from the leader's child_list (and thus
      is omitted from perf_read_group processing), we need to ensure that
      perf_read_group touches the child's siblings under its ctx->lock.
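
      A standalone analogue of the locking rule, with a pthread mutex
      standing in for ctx->lock and structures invented for the sketch:

          #include <pthread.h>

          struct sibling { struct sibling *next; long count; };

          struct group_ctx {
              pthread_mutex_t lock; /* stands in for ctx->lock */
              struct sibling *siblings;
          };

          /* The walk of the sibling list must hold the same lock that the
           * detach path (perf_group_detach in T2 above) takes when
           * unlinking; otherwise the iteration can step onto freed nodes. */
          static long read_group(struct group_ctx *ctx)
          {
              long sum = 0;

              pthread_mutex_lock(&ctx->lock);
              for (struct sibling *s = ctx->siblings; s; s = s->next)
                  sum += s->count;
              pthread_mutex_unlock(&ctx->lock);
              return sum;
          }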
      
      Peter further notes:
      
      | One additional note; this bug got exposed by commit:
      |
      |   ba5213ae ("perf/core: Correct event creation with PERF_FORMAT_GROUP")
      |
      | which made it possible to actually trigger this code-path.
      Tested-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: ba5213ae ("perf/core: Correct event creation with PERF_FORMAT_GROUP")
      Link: http://lkml.kernel.org/r/20170720141455.2106-1-jolsa@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>