1. 09 Sep 2017, 11 commits
  2. 07 Sep 2017, 5 commits
  3. 06 Sep 2017, 1 commit
    • bpf: fix numa_node validation · 96e5ae4e
      By Eric Dumazet
      syzkaller reported crashes in bpf map creation or map update [1]
      
      The problem is that nr_node_ids is a signed integer and NUMA_NO_NODE is
      also an integer, so it is very tempting to declare numa_node as a signed
      integer.

      This means the typical test to validate a user-provided value:
      
              if (numa_node != NUMA_NO_NODE &&
                  (numa_node >= nr_node_ids ||
                   !node_online(numa_node)))
      
      must be written:
      
              if (numa_node != NUMA_NO_NODE &&
                  ((unsigned int)numa_node >= nr_node_ids ||
                   !node_online(numa_node)))
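
      A standalone userspace sketch of the pitfall (hypothetical values, not
      the kernel code): with a signed bound, a negative numa_node slips past
      the range check, while the cast to unsigned rejects it.

              #include <stdio.h>

              int main(void)
              {
                      int nr_node_ids = 4;      /* signed, as in the kernel at the time */
                      int numa_node = -373893;  /* bogus user-supplied value, != NUMA_NO_NODE */

                      if (numa_node >= nr_node_ids)
                              printf("signed check rejects it\n");
                      else
                              printf("signed check lets it through\n");   /* this branch runs */

                      if ((unsigned int)numa_node >= (unsigned int)nr_node_ids)
                              printf("unsigned check rejects it\n");      /* and this one too */
                      return 0;
              }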
      
      [1]
      kernel BUG at mm/slab.c:3256!
      invalid opcode: 0000 [#1] SMP KASAN
      Dumping ftrace buffer:
         (ftrace buffer empty)
      Modules linked in:
      CPU: 0 PID: 2946 Comm: syzkaller916108 Not tainted 4.13.0-rc7+ #35
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      task: ffff8801d2bc60c0 task.stack: ffff8801c0c90000
      RIP: 0010:____cache_alloc_node+0x1d4/0x1e0 mm/slab.c:3292
      RSP: 0018:ffff8801c0c97638 EFLAGS: 00010096
      RAX: ffffffffffff8b7b RBX: 0000000001080220 RCX: 0000000000000000
      RDX: 00000000ffff8b7b RSI: 0000000001080220 RDI: ffff8801dac00040
      RBP: ffff8801c0c976c0 R08: 0000000000000000 R09: 0000000000000000
      R10: ffff8801c0c97620 R11: 0000000000000001 R12: ffff8801dac00040
      R13: ffff8801dac00040 R14: 0000000000000000 R15: 00000000ffff8b7b
      FS:  0000000002119940(0000) GS:ffff8801db200000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020001fec CR3: 00000001d2980000 CR4: 00000000001406f0
      Call Trace:
       __do_kmalloc_node mm/slab.c:3688 [inline]
       __kmalloc_node+0x33/0x70 mm/slab.c:3696
       kmalloc_node include/linux/slab.h:535 [inline]
       alloc_htab_elem+0x2a8/0x480 kernel/bpf/hashtab.c:740
       htab_map_update_elem+0x740/0xb80 kernel/bpf/hashtab.c:820
       map_update_elem kernel/bpf/syscall.c:587 [inline]
       SYSC_bpf kernel/bpf/syscall.c:1468 [inline]
       SyS_bpf+0x20c5/0x4c40 kernel/bpf/syscall.c:1443
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      RIP: 0033:0x440409
      RSP: 002b:00007ffd1f1792b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
      RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 0000000000440409
      RDX: 0000000000000020 RSI: 0000000020006000 RDI: 0000000000000002
      RBP: 0000000000000086 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000401d70
      R13: 0000000000401e00 R14: 0000000000000000 R15: 0000000000000000
      Code: 83 c2 01 89 50 18 4c 03 70 08 e8 38 f4 ff ff 4d 85 f6 0f 85 3e ff ff ff 44 89 fe 4c 89 ef e8 94 fb ff ff 49 89 c6 e9 2b ff ff ff <0f> 0b 0f 0b 0f 0b 66 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41
      RIP: ____cache_alloc_node+0x1d4/0x1e0 mm/slab.c:3292 RSP: ffff8801c0c97638
      ---[ end trace d745f355da2e33ce ]---
      Kernel panic - not syncing: Fatal exception
      
      Fixes: 96eabe7a ("bpf: Allow selecting numa node during map creation")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      96e5ae4e
  4. 05 Sep 2017, 2 commits
    • audit: update the function comments · 196a5085
      By Geliang Tang
      Update the function comments to match the code.
      Signed-off-by: Geliang Tang <geliangtang@gmail.com>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
      196a5085
    • audit: Reduce overhead using a coarse clock · e832bf48
      By Mel Gorman
      Commit 2115bb25 ("audit: Use timespec64 to represent audit timestamps")
      noted that audit timestamps were not y2038 safe and moved to a 64-bit
      timestamp. In itself, this makes sense, but the conversion was from
      CURRENT_TIME to ktime_get_real_ts64(), which is a heavier call that
      records an accurate timestamp; that accuracy is required in some, but
      not all, cases. The impact is that when auditd is running, even without
      any rules, all syscalls have higher overhead. This is visible in the
      sysbench-thread benchmark as an 11.5% performance hit. That benchmark is
      dumb as rocks but it's also visible in redis as an 8-10% hit on all
      operations, which is of greater concern. It is somewhat stupid of audit
      to track syscalls without any rules related to syscalls, but that is how
      it behaves.
      
      The overhead can be directly measured with perf comparing 4.9 with 4.12
      
      4.9
           7.76%  sysbench         [kernel.vmlinux]    [k] __schedule
           7.62%  sysbench         [kernel.vmlinux]    [k] _raw_spin_lock
           7.37%  sysbench         libpthread-2.22.so  [.] __lll_lock_elision
           7.29%  sysbench         [kernel.vmlinux]    [.] syscall_return_via_sysret
           6.59%  sysbench         [kernel.vmlinux]    [k] native_sched_clock
           5.21%  sysbench         libc-2.22.so        [.] __sched_yield
           4.38%  sysbench         [kernel.vmlinux]    [k] entry_SYSCALL_64
           4.28%  sysbench         [kernel.vmlinux]    [k] do_syscall_64
           3.49%  sysbench         libpthread-2.22.so  [.] __lll_unlock_elision
           3.13%  sysbench         [kernel.vmlinux]    [k] __audit_syscall_exit
           2.87%  sysbench         [kernel.vmlinux]    [k] update_curr
           2.73%  sysbench         [kernel.vmlinux]    [k] pick_next_task_fair
           2.31%  sysbench         [kernel.vmlinux]    [k] syscall_trace_enter
           2.20%  sysbench         [kernel.vmlinux]    [k] __audit_syscall_entry
      .....
           0.00%  swapper          [kernel.vmlinux]    [k] read_tsc
      
      4.12
           7.84%  sysbench         [kernel.vmlinux]    [k] __schedule
           7.05%  sysbench         [kernel.vmlinux]    [k] _raw_spin_lock
           6.57%  sysbench         libpthread-2.22.so  [.] __lll_lock_elision
           6.50%  sysbench         [kernel.vmlinux]    [.] syscall_return_via_sysret
           5.95%  sysbench         [kernel.vmlinux]    [k] read_tsc
           5.71%  sysbench         [kernel.vmlinux]    [k] native_sched_clock
           4.78%  sysbench         libc-2.22.so        [.] __sched_yield
           4.30%  sysbench         [kernel.vmlinux]    [k] entry_SYSCALL_64
           3.94%  sysbench         [kernel.vmlinux]    [k] do_syscall_64
           3.37%  sysbench         libpthread-2.22.so  [.] __lll_unlock_elision
           3.32%  sysbench         [kernel.vmlinux]    [k] __audit_syscall_exit
           2.91%  sysbench         [kernel.vmlinux]    [k] __getnstimeofday64
      
      Note the additional overhead from read_tsc, which goes from 0% to 5.95%.
      This is on a single-socket E3-1230, but similar overheads have been
      measured on an older machine and the patch eliminates them there as well.
      
      The patch in question has no explanation as to why a fully-accurate
      timestamp is required and is likely an oversight.  Using a coarser, but
      monotonically increasing, timestamp the overhead can be eliminated.
      While it can be worked around by configuring or disabling audit, it's
      tricky enough to detect that a kernel fix is justified. With this patch,
      we see the following:
      
      sysbenchthread
                                    4.9.0                 4.12.0                 4.12.0
                                  vanilla                vanilla            coarse-v1r1
      Amean     1         1.49 (   0.00%)        1.66 ( -11.42%)        1.51 (  -1.34%)
      Amean     3         1.48 (   0.00%)        1.65 ( -11.45%)        1.50 (  -0.96%)
      Amean     5         1.49 (   0.00%)        1.67 ( -12.31%)        1.51 (  -1.83%)
      Amean     7         1.49 (   0.00%)        1.66 ( -11.72%)        1.50 (  -0.67%)
      Amean     12        1.48 (   0.00%)        1.65 ( -11.57%)        1.52 (  -2.89%)
      Amean     16        1.49 (   0.00%)        1.65 ( -11.13%)        1.51 (  -1.73%)
      
      The benchmark is reporting the time required for different thread counts
      to lock/unlock a private mutex which, while dense, demonstrates the
      syscall overhead. This is showing that 4.12 took an 11-12% hit but the
      overhead is almost eliminated by the patch. While the variance is not
      reported here, it's well within the noise with the patch applied.
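
      A rough userspace analogue of the trade-off (illustration only, not the
      kernel change itself): CLOCK_REALTIME reads the clocksource on every
      call, while CLOCK_REALTIME_COARSE returns the last tick-granular value,
      which is much cheaper.

              #include <stdio.h>
              #include <time.h>

              int main(void)
              {
                      struct timespec fine, coarse;

                      /* accurate: reads the hardware clocksource (e.g. the TSC) */
                      clock_gettime(CLOCK_REALTIME, &fine);
                      /* coarse: last tick value, no clocksource read, lower resolution */
                      clock_gettime(CLOCK_REALTIME_COARSE, &coarse);

                      printf("fine:   %lld.%09ld\n", (long long)fine.tv_sec, fine.tv_nsec);
                      printf("coarse: %lld.%09ld\n", (long long)coarse.tv_sec, coarse.tv_nsec);
                      return 0;
              }
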
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
      e832bf48
  5. 02 Sep 2017, 3 commits
    • bpf: sockmap update/simplify memory accounting scheme · 90a9631c
      By John Fastabend
      Instead of tracking wmem_queued and sk_mem_charge by incrementing in
      the verdict SK_REDIRECT paths and decrementing in the tx work path,
      use the skb_set_owner_w and sock_writeable helpers. This solves a few
      issues with the current code. First, in the SK_REDIRECT path the
      increments of sk_wmem_queued and sk_mem_charge were being done without
      the peer's sock lock being held. Under stress this can result in
      accounting errors when tx work and/or multiple verdict decisions are
      working on the peer psock.
      
      Additionally, this cleans up the code because we can rely on the
      default destructor to decrement memory accounting on kfree_skb. Also,
      this will trigger sk_write_space when space becomes available on
      kfree_skb(), which wasn't happening before, and prevents __sk_free
      from being called until all in-flight packets are completed.
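
      A sketch of the new scheme (illustrative fragment, not the exact
      sockmap code): charge the redirected skb to the peer socket and let the
      default destructor do the bookkeeping when the skb is freed.

              /* charge the skb to the peer socket; destructor becomes sock_wfree() */
              skb_set_owner_w(skb, peer_sk);

              /* ... later, in the tx work path or on an error ... */
              kfree_skb(skb);   /* sock_wfree() uncharges and may wake sk_write_space() */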
      
      Fixes: 174a79ff ("bpf: sockmap with sk redirect support")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      90a9631c
    • bpf: Only set node->ref = 1 if it has not been set · bb9b9f88
      By Martin KaFai Lau
      This patch writes 'node->ref = 1' only if node->ref is 0.
      The number of lookups/s for a ~1M entries LRU map increased by
      ~30% (260097 to 343313).
      
      Other writes of 'node->ref = 0' are not changed.  In those cases, the
      same cache line has to be changed anyway.
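
      The idea, roughly (a sketch of the pattern rather than a quote of the
      patch): only dirty the cache line when the flag actually has to change.

              static inline void bpf_lru_node_set_ref(struct bpf_lru_node *node)
              {
                      /* avoid a store (and cache-line dirtying) on the hot
                       * lookup path when the ref bit is already set */
                      if (!node->ref)
                              node->ref = 1;
              }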
      
      First column: Size of the LRU hash
      Second column: Number of lookups/s
      
      Before:
      > echo "$((2**20+1)): $(./map_perf_test 1024 1 $((2**20+1)) 10000000 | awk '{print $3}')"
      1048577: 260097
      
      After:
      > echo "$((2**20+1)): $(./map_perf_test 1024 1 $((2**20+1)) 10000000 | awk '{print $3}')"
      1048577: 343313
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bb9b9f88
    • bpf: Inline LRU map lookup · cc555421
      By Martin KaFai Lau
      Inline the lru map lookup to save the cost in making calls to
      bpf_map_lookup_elem() and htab_lru_map_lookup_elem().
      
      Different LRU hash sizes are tested.  The benefit diminishes when
      the cache miss starts to dominate in the bigger LRU hash.
      Considering the change is simple, it is still worth optimizing.
      
      First column: Size of the LRU hash
      Second column: Number of lookups/s
      
      Before:
      > for i in $(seq 9 20); do echo "$((2**i+1)): $(./map_perf_test 1024 1 $((2**i+1)) 10000000 | awk '{print $3}')"; done
      513: 1132020
      1025: 1056826
      2049: 1007024
      4097: 853298
      8193: 742723
      16385: 712600
      32769: 688142
      65537: 677028
      131073: 619437
      262145: 498770
      524289: 316695
      1048577: 260038
      
      After:
      > for i in $(seq 9 20); do echo "$((2**i+1)): $(./map_perf_test 1024 1 $((2**i+1)) 10000000 | awk '{print $3}')"; done
      513: 1221851
      1025: 1144695
      2049: 1049902
      4097: 884460
      8193: 773731
      16385: 729673
      32769: 721989
      65537: 715530
      131073: 671665
      262145: 516987
      524289: 321125
      1048577: 260048
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cc555421
  6. 01 Sep 2017, 3 commits
    • mm, uprobes: fix multiple free of ->uprobes_state.xol_area · 355627f5
      By Eric Biggers
      Commit 7c051267 ("mm, fork: make dup_mmap wait for mmap_sem for
      write killable") made it possible to kill a forking task while it is
      waiting to acquire its ->mmap_sem for write, in dup_mmap().
      
      However, it was overlooked that this introduced a new error path before
      the new mm_struct's ->uprobes_state.xol_area has been set to NULL after
      being copied from the old mm_struct by the memcpy in dup_mm().  For a
      task that has previously hit a uprobe tracepoint, this resulted in the
      'struct xol_area' being freed multiple times if the task was killed at
      just the right time while forking.
      
      Fix it by setting ->uprobes_state.xol_area to NULL in mm_init() rather
      than in uprobe_dup_mmap().
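
      A minimal sketch of the shape of the fix (helper name and call site as
      I understand them; see the actual commit for the exact code):

              static void mm_init_uprobes_state(struct mm_struct *mm)
              {
              #ifdef CONFIG_UPROBES
                      mm->uprobes_state.xol_area = NULL;
              #endif
              }

              /* called from mm_init(), i.e. before any error path in dup_mmap()
               * can drop the new mm and free the stale pointer memcpy'd by dup_mm() */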
      
      With CONFIG_UPROBE_EVENTS=y, the bug can be reproduced by the same C
      program given by commit 2b7e8665 ("fork: fix incorrect fput of
      ->exe_file causing use-after-free"), provided that a uprobe tracepoint
      has been set on the fork_thread() function.  For example:
      
          $ gcc reproducer.c -o reproducer -lpthread
          $ nm reproducer | grep fork_thread
          0000000000400719 t fork_thread
          $ echo "p $PWD/reproducer:0x719" > /sys/kernel/debug/tracing/uprobe_events
          $ echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable
          $ ./reproducer
      
      Here is the use-after-free reported by KASAN:
      
          BUG: KASAN: use-after-free in uprobe_clear_state+0x1c4/0x200
          Read of size 8 at addr ffff8800320a8b88 by task reproducer/198
      
          CPU: 1 PID: 198 Comm: reproducer Not tainted 4.13.0-rc7-00015-g36fde05f #255
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
          Call Trace:
           dump_stack+0xdb/0x185
           print_address_description+0x7e/0x290
           kasan_report+0x23b/0x350
           __asan_report_load8_noabort+0x19/0x20
           uprobe_clear_state+0x1c4/0x200
           mmput+0xd6/0x360
           do_exit+0x740/0x1670
           do_group_exit+0x13f/0x380
           get_signal+0x597/0x17d0
           do_signal+0x99/0x1df0
           exit_to_usermode_loop+0x166/0x1e0
           syscall_return_slowpath+0x258/0x2c0
           entry_SYSCALL_64_fastpath+0xbc/0xbe
      
          ...
      
          Allocated by task 199:
           save_stack_trace+0x1b/0x20
           kasan_kmalloc+0xfc/0x180
           kmem_cache_alloc_trace+0xf3/0x330
           __create_xol_area+0x10f/0x780
           uprobe_notify_resume+0x1674/0x2210
           exit_to_usermode_loop+0x150/0x1e0
           prepare_exit_to_usermode+0x14b/0x180
           retint_user+0x8/0x20
      
          Freed by task 199:
           save_stack_trace+0x1b/0x20
           kasan_slab_free+0xa8/0x1a0
           kfree+0xba/0x210
           uprobe_clear_state+0x151/0x200
           mmput+0xd6/0x360
           copy_process.part.8+0x605f/0x65d0
           _do_fork+0x1a5/0xbd0
           SyS_clone+0x19/0x20
           do_syscall_64+0x22f/0x660
           return_from_SYSCALL_64+0x0/0x7a
      
      Note: without KASAN, you may instead see a "Bad page state" message, or
      simply a general protection fault.
      
      Link: http://lkml.kernel.org/r/20170830033303.17927-1-ebiggers3@gmail.com
      Fixes: 7c051267 ("mm, fork: make dup_mmap wait for mmap_sem for write killable")
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>    [4.7+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      355627f5
    • kernel/kthread.c: kthread_worker: don't hog the cpu · 22cf8bc6
      By Shaohua Li
      If the worker thread continues getting work, it will hog the cpu and
      RCU will complain about stalls.  Make it a good citizen.  This is
      triggered in a loop block device test.
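
      The fix boils down to yielding the CPU between work items; a sketch of
      the pattern (grab_next_work() is a hypothetical placeholder for the
      real dequeue logic):

              /* simplified shape of kthread_worker_fn()'s main loop */
              while (!kthread_should_stop()) {
                      struct kthread_work *work = grab_next_work(worker);

                      if (work)
                              work->func(work);

                      cond_resched();   /* the fix: let other tasks and RCU make progress */
              }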
      
      Link: http://lkml.kernel.org/r/5de0a179b3184e1a2183fc503448b0269f24d75b.1503697127.git.shli@fb.com
      Signed-off-by: Shaohua Li <shli@fb.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      22cf8bc6
    • alarmtimer: Ensure RTC module is not unloaded · 51218298
      By Alexandre Belloni
      When registering the rtc device to be used to handle alarm timers,
      get_device is used to ensure the device doesn't go away but the module can
      still be unloaded.
      
      Call try_module_get to ensure the rtc driver will not go away.
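
      A sketch of the idea (error handling abbreviated; rtc->owner is the
      owning module as recorded in struct rtc_device):

              /* when selecting the RTC that backs alarm timers */
              if (!try_module_get(rtc->owner))
                      return -1;          /* driver module is going away, skip it */
              get_device(&rtc->dev);      /* existing refcount on the device itself */
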
      Reported-and-tested-by: Michal Simek <monstr@monstr.eu>
      Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: John Stultz <john.stultz@linaro.org>
      Cc: Stephen Boyd <sboyd@codeaurora.org>
      Link: http://lkml.kernel.org/r/20170820220146.30969-1-alexandre.belloni@free-electrons.com
      51218298
  7. 29 Aug 2017, 13 commits
    • locking/pvqspinlock: Relax cmpxchg's to improve performance on some architectures · 34d54f3d
      By Waiman Long
      All the locking related cmpxchg's in the following functions are
      replaced with the _acquire variants:
      
       - pv_queued_spin_steal_lock()
       - trylock_clear_pending()
      
      This change should help performance on architectures that use LL/SC.
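
      For the trylock/steal paths, the change is essentially (sketch):

              /* before: fully ordered cmpxchg */
              if (cmpxchg(&lock->locked, 0, _Q_LOCKED_VAL) == 0)
                      return 1;   /* lock acquired */

              /* after: acquire semantics are sufficient here, which avoids a
               * full barrier on LL/SC architectures */
              if (cmpxchg_acquire(&lock->locked, 0, _Q_LOCKED_VAL) == 0)
                      return 1;   /* lock acquired */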
      
      The cmpxchg in pv_kick_node() is replaced with a relaxed version plus
      an explicit memory barrier to make sure that the write of next->lock
      and the read of pn->state are fully ordered whether the cmpxchg
      succeeds or fails, without affecting performance on non-LL/SC
      architectures.
      
      On a 2-socket 12-core 96-thread Power8 system with pvqspinlock
      explicitly enabled, the performance of a locking microbenchmark
      with and without this patch on a 4.13-rc4 kernel with Xinhui's PPC
      qspinlock patch was as follows:
      
        # of thread     w/o patch    with patch      % Change
        -----------     ---------    ----------      --------
             8         5054.8 Mop/s  5209.4 Mop/s     +3.1%
            16         3985.0 Mop/s  4015.0 Mop/s     +0.8%
            32         2378.2 Mop/s  2396.0 Mop/s     +0.7%
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Pan Xinhui <xinhui@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Link: http://lkml.kernel.org/r/1502741222-24360-1-git-send-email-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      34d54f3d
    • smp: Avoid using two cache lines for struct call_single_data · 966a9671
      By Ying Huang
      struct call_single_data is used in IPIs to transfer information between
      CPUs.  Its size is bigger than sizeof(unsigned long) and less than
      cache line size.  Currently it is not allocated with any explicit
      alignment requirements.  This makes it possible for an allocated
      call_single_data to cross two cache lines, which doubles the number of
      cache lines that need to be transferred among CPUs.
      
      This can be fixed by requiring call_single_data to be aligned with the
      size of call_single_data. Currently the size of call_single_data is a
      power of 2.  If we add new fields to call_single_data, we may need to
      add padding to make sure the size of the new definition is a power of 2
      as well.
      
      Fortunately, this is enforced by GCC, which will report bad sizes.
      
      To set the alignment requirement of call_single_data to the size of
      call_single_data, a struct definition and a typedef are used.
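
      Roughly, the definition takes this shape (a sketch along the lines of
      the patch):

              struct __call_single_data {
                      struct llist_node llist;
                      smp_call_func_t func;
                      void *info;
                      unsigned int flags;
              };

              /* alignment == size, so one object never straddles two cache lines */
              typedef struct __call_single_data call_single_data_t
                      __aligned(sizeof(struct __call_single_data));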
      
      To test the effect of the patch, I used the vm-scalability multiple
      thread swap test case (swap-w-seq-mt).  The test will create multiple
      threads and each thread will eat memory until all RAM and part of swap
      is used, so that a huge number of IPIs is triggered when unmapping
      memory.  In the test, the throughput of memory writing improves ~5%
      compared with misaligned call_single_data, because of faster IPIs.
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Huang, Ying <ying.huang@intel.com>
      [ Add call_single_data_t and align with size of call_single_data. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/87bmnqd6lz.fsf@yhuang-mobile.sh.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      966a9671
    • locking/lockdep: Untangle xhlock history save/restore from task independence · f52be570
      By Peter Zijlstra
      Where XHLOCK_{SOFT,HARD} are save/restore points in the xhlocks[] to
      ensure the temporal IRQ events don't interact with task state, the
      XHLOCK_PROC is a fundamentally different beast that just happens to
      share the interface.
      
      The purpose of XHLOCK_PROC is to annotate independent execution inside
      one task. For example, in workqueues each work item should appear to
      run in its own 'pristine' 'task'.
      
      Remove XHLOCK_PROC in favour of its own interface to avoid confusion.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: boqun.feng@gmail.com
      Cc: david@fromorbit.com
      Cc: johannes@sipsolutions.net
      Cc: kernel-team@lge.com
      Cc: oleg@redhat.com
      Cc: tj@kernel.org
      Link: http://lkml.kernel.org/r/20170829085939.ggmb6xiohw67micb@hirez.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f52be570
    • perf/core, x86: Add PERF_SAMPLE_PHYS_ADDR · fc7ce9c7
      By Kan Liang
      For understanding how the workload maps to memory channels and hardware
      behavior, it's very important to collect address maps with physical
      addresses. For example, 3D XPoint access can only be found by filtering
      the physical address.
      
      Add a new sample type for physical address.
      
      perf already has a facility to collect data virtual address. This patch
      introduces a function to convert the virtual address to physical address.
      The function is quite generic and can be extended to any architecture as
      long as a virtual address is provided.
      
       - For kernel direct mapping addresses, virt_to_phys is used to convert
         the virtual addresses to physical address.
      
       - For user virtual addresses, __get_user_pages_fast is used to walk
         the page tables and find the physical address.
      
       - This does not work for vmalloc addresses right now. These are not
         resolved, but code to do that could be added.
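
      A sketch of such a conversion helper (close in spirit to the patch;
      details abbreviated):

              static u64 perf_virt_to_phys(u64 virt)
              {
                      u64 phys_addr = 0;
                      struct page *p = NULL;

                      if (virt >= TASK_SIZE) {
                              /* kernel direct-map addresses */
                              if (virt_addr_valid((void *)(uintptr_t)virt))
                                      phys_addr = (u64)virt_to_phys((void *)(uintptr_t)virt);
                      } else {
                              /* user addresses: walk the page tables */
                              if (__get_user_pages_fast(virt, 1, 0, &p) == 1) {
                                      phys_addr = page_to_phys(p) + virt % PAGE_SIZE;
                                      put_page(p);
                              }
                      }

                      return phys_addr;
              }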
      
      The new sample type requires collecting the virtual address. The
      virtual address will not be output unless SAMPLE_ADDR is applied.
      
      For security, the physical address can only be exposed to root or a
      privileged user.
      Tested-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: mpe@ellerman.id.au
      Link: http://lkml.kernel.org/r/1503967969-48278-1-git-send-email-kan.liang@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      fc7ce9c7
    • perf/core, pt, bts: Get rid of itrace_started · 8d4e6c4c
      By Alexander Shishkin
      I just noticed that hw.itrace_started and hw.config are aliased to the
      same location. Now, the PT driver happens to use both, which works out
      fine by sheer luck:
      
       - STORE(hw.itrace_started) is ordered before STORE(hw.config), in
          program order, although there are no compiler barriers to ensure that,

       - to perf_log_itrace_start(), hw.itrace_started looks set at the same
         time as when it is intended to be set because both stores happen in
         the same path,
      
       - hw.config is never reset to zero in the PT driver.
      
      Now, the use of hw.config by the PT driver makes more sense (it being a
      HW PMU) than messing around with itrace_started, which is an awkward API
      to begin with.
      
      This patch replaces hw.itrace_started with an attach_state bit and an
      API call for the PMU drivers to use to communicate the condition.
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/20170330153956.25994-1-alexander.shishkin@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8d4e6c4c
    • perf/ftrace: Fix double traces of perf on ftrace:function · 75e83876
      By Zhou Chengming
      When running perf on the ftrace:function tracepoint, there is a bug
      which can be reproduced by:
      
        perf record -e ftrace:function -a sleep 20 &
        perf record -e ftrace:function ls
        perf script
      
                    ls 10304 [005]   171.853235: ftrace:function:
        perf_output_begin
                    ls 10304 [005]   171.853237: ftrace:function:
        perf_output_begin
                    ls 10304 [005]   171.853239: ftrace:function:
        task_tgid_nr_ns
                    ls 10304 [005]   171.853240: ftrace:function:
        task_tgid_nr_ns
                    ls 10304 [005]   171.853242: ftrace:function:
        __task_pid_nr_ns
                    ls 10304 [005]   171.853244: ftrace:function:
        __task_pid_nr_ns
      
      We can see that all the function traces are doubled.
      
      The problem is caused by the inconsistency of the register
      function perf_ftrace_event_register() with the probe function
      perf_ftrace_function_call(). The former registers one probe
      for every perf_event, while the latter handles all perf_events
      on the current cpu. So when there are two perf_events on the
      current cpu, their traces are doubled.
      
      So this patch adds an extra parameter "event" to perf_tp_event(), and
      sends sample data only to this event when it is not NULL.
      Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com>
      Reviewed-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@kernel.org
      Cc: alexander.shishkin@linux.intel.com
      Cc: huawei.libin@huawei.com
      Link: http://lkml.kernel.org/r/1503668977-12526-1-git-send-email-zhouchengming1@huawei.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      75e83876
    • perf/core: Fix potential double-fetch bug · f12f42ac
      By Meng Xu
      While examining the kernel source code, I found a dangerous operation that
      could turn into a double-fetch situation (a race condition bug) where the
      same userspace memory region is fetched twice into the kernel, with sanity
      checks after the first fetch but no checks after the second fetch.
      
        1. The first fetch happens in line 9573 get_user(size, &uattr->size).
      
        2. Subsequently the 'size' variable undergoes a few sanity checks and
           transformations (line 9577 to 9584).
      
        3. The second fetch happens in line 9610 copy_from_user(attr, uattr, size)
      
        4. Given that 'uattr' can be fully controlled in userspace, an attacker can
           exploit the race condition to override 'uattr->size' with an arbitrary value
           (say, 0xFFFFFFFF) after the first fetch but before the second fetch. The
           changed value will be copied to 'attr->size'.
      
        5. There are no further checks on 'attr->size' until the end of this function,
           and once the function returns, we lose the context to verify that 'attr->size'
           conforms to the sanity checks performed in step 2 (line 9577 to 9584).
      
        6. My manual analysis shows that 'attr->size' is not used elsewhere later,
           so there is no working exploit against it right now. However, this could
           easily turn into an exploitable one if careless developers start to use
           'attr->size' later.
      
      To fix this, overwrite 'attr->size' after the second fetch with the value
      from the first fetch, regardless of what was actually copied in.
      
      In this way, it is assured that 'attr->size' is consistent with the checks
      performed after the first fetch.
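
      The resulting pattern in the syscall's attr-copy path looks roughly like
      this (sketch; surrounding checks elided):

              u32 size;

              if (get_user(size, &uattr->size))
                      return -EFAULT;
              /* ... sanity checks and clamping on 'size' ... */

              if (copy_from_user(attr, uattr, size))
                      return -EFAULT;

              /* attr->size may have been re-written by a racing thread between
               * the two fetches; force it back to the validated value */
              attr->size = size;
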
      Signed-off-by: Meng Xu <mengxu.gatech@gmail.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@kernel.org
      Cc: alexander.shishkin@linux.intel.com
      Cc: meng.xu@gatech.edu
      Cc: sanidhya@gatech.edu
      Cc: taesoo@gatech.edu
      Link: http://lkml.kernel.org/r/1503522470-35531-1-git-send-email-meng.xu@gatech.edu
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f12f42ac
    • bpf: fix oops on allocation failure · f740c34e
      By Dan Carpenter
      "err" is set to zero if bpf_map_area_alloc() fails so it means we return
      ERR_PTR(0) which is NULL.  The caller, find_and_alloc_map(), is not
      expecting NULL returns and will oops.
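
      The fix amounts to setting the error code on that path; sketched
      (label and field names approximate):

              stab->sock_map = bpf_map_area_alloc(...);
              if (!stab->sock_map) {
                      err = -ENOMEM;   /* previously err stayed 0, so the caller
                                        * saw ERR_PTR(0) == NULL and oopsed */
                      goto free_stab;
              }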
      
      Fixes: 174a79ff ("bpf: sockmap with sk redirect support")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f740c34e
    • bpf: sockmap indicate sock events to listeners · 78aeaaef
      By John Fastabend
      After userspace pushes sockets into a sockmap it may not be receiving
      data (assuming stream_{parser|verdict} programs are attached). But, it
      may still want to manage the socks. A common pattern is to poll/select
      for a POLLRDHUP event so we can close the sock.
      
      This patch adds the logic to wake up these listeners.
      
      Also add TCP_SYN_SENT to the list of events to handle. We don't want
      to break the connection just because we happen to be in this state.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      78aeaaef
    • bpf: harden sockmap program attach to ensure correct map type · 81374aaa
      By John Fastabend
      When attaching a program to a sockmap we need to check that the map
      type is correct.
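
      The check itself is a one-liner along these lines (sketch; exact error
      code may differ):

              /* in the sockmap program attach path */
              if (map->map_type != BPF_MAP_TYPE_SOCKMAP)
                      return -EINVAL;   /* refuse to attach to other map types */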
      
      Fixes: 174a79ff ("bpf: sockmap with sk redirect support")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      81374aaa
    • bpf: sockmap add missing rcu_read_(un)lock in smap_data_ready · d26e597d
      By John Fastabend
      References to the psock must be made inside an RCU critical section.
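
      An abbreviated sketch of the fixed callback (locking around
      strp_data_ready() elided):

              static void smap_data_ready(struct sock *sk)
              {
                      struct smap_psock *psock;

                      rcu_read_lock();
                      psock = smap_psock_sk(sk);      /* rcu_dereference() inside */
                      if (likely(psock))
                              strp_data_ready(&psock->strp);
                      rcu_read_unlock();
              }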
      
      Fixes: 174a79ff ("bpf: sockmap with sk redirect support")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d26e597d
    • bpf: sockmap, remove STRPARSER map_flags and add multi-map support · 2f857d04
      By John Fastabend
      The BPF_SOCKMAP_STRPARSER map_flags flag was added to handle a specific
      use case where we want to have the BPF parse program disabled on an
      entry in a sockmap.
      
      However, Alexei found the API a bit cumbersome and I agreed. Let's
      remove the STRPARSER flag and support the use case by allowing socks
      to be in multiple maps. This allows users to create two maps, one with
      programs attached and one without. When socks are added to maps they
      now inherit any programs attached to the map. This is a nice
      generalization and IMO improves the API.
      
      The API rules are less ambiguous and do not need a flag:
      
        - When a sock is added to a sockmap we have two cases,
      
           i. The sock map does not have any attached programs so
              we can add sock to map without inheriting bpf programs.
              The sock may exist in 0 or more other maps.
      
          ii. The sock map has an attached BPF program. To avoid duplicate
              bpf programs we only add the sock entry if it does not have
              an existing strparser/verdict attached, returning -EBUSY if
              a program is already attached. Otherwise attach the program
              and inherit strparser/verdict programs from the sock map.
      
      This allows socks to be in multiple maps for redirects and to inherit
      a BPF program from a single map.
      
      Also this patch simplifies the logic around BPF_{EXIST|NOEXIST|ANY}
      flags. In the original patch I tried to be extra clever and only
      update map entries when necessary. Now I've decided the complexity
      is not worth it. If users constantly update an entry with the same
      sock for no reason (i.e. update an entry without actually changing
      any parameters on map or sock) we still do an alloc/release. Using
      this and allowing multiple entries of a sock to exist in a map the
      logic becomes much simpler.
      
      Note: Now that multiple maps are supported, the "maps" pointer used
      when a socket is closed becomes a list of maps to remove the sock from.
      To keep the map up to date when a sock is added to the sockmap we must
      add the map/elem to the list. Likewise when it is removed we must
      remove it from the list. This results in searching the per-psock list
      on delete operations. On TCP_CLOSE events we walk the list and remove
      the psock from all map/entry locations. I don't see any perf
      implications in this because at most I have a psock in two maps. If
      a psock were to be in many maps it's possible this might be noticeable
      on delete, but I can't think of a reason to dup a psock in many maps.
      The sk_callback_lock is used to protect read/writes to the list. This
      was convenient because in all locations we were taking the lock
      anyways just after working on the list. Also the lock is per sock so
      in normal cases we shouldn't see any contention.
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Fixes: 174a79ff ("bpf: sockmap with sk redirect support")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2f857d04
    • bpf: convert sockmap field attach_bpf_fd2 to type · 464bc0fd
      By John Fastabend
      In the initial sockmap API we provided strparser and verdict programs
      using a single attach command by extending the attach API with the
      attach_bpf_fd2 field.
      
      However, if we add other programs in the future we will be adding a
      field for every new possible type, attach_bpf_fd(3,4,..). This
      seems a bit clumsy for an API. So let's push the programs using two
      new type fields:
      
         BPF_SK_SKB_STREAM_PARSER
         BPF_SK_SKB_STREAM_VERDICT
      
      This has the advantage of having readable names and can easily be
      extended in the future.
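
      From userspace the attach then looks roughly like this (sketch using
      the raw bpf(2) syscall; field names per union bpf_attr of that era):

              #include <linux/bpf.h>
              #include <string.h>
              #include <sys/syscall.h>
              #include <unistd.h>

              static int attach_sk_skb_prog(int map_fd, int prog_fd,
                                            enum bpf_attach_type type)
              {
                      union bpf_attr attr;

                      memset(&attr, 0, sizeof(attr));
                      attr.target_fd     = map_fd;   /* the sockmap */
                      attr.attach_bpf_fd = prog_fd;  /* parser or verdict program */
                      attr.attach_type   = type;     /* BPF_SK_SKB_STREAM_PARSER or
                                                        BPF_SK_SKB_STREAM_VERDICT */
                      return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
              }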
      
      Updates to samples and sockmap included here also generalize the tests
      slightly to support an upcoming patch for multiple map support.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Fixes: 174a79ff ("bpf: sockmap with sk redirect support")
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      464bc0fd
  8. 28 Aug 2017, 1 commit
    • Minor page waitqueue cleanups · 3510ca20
      By Linus Torvalds
      Tim Chen and Kan Liang have been battling a customer load that shows
      extremely long page wakeup lists.  The cause seems to be constant NUMA
      migration of a hot page that is shared across a lot of threads, but the
      actual root cause for the exact behavior has not been found.
      
      Tim has a patch that batches the wait list traversal at wakeup time, so
      that we at least don't get long uninterruptible cases where we traverse
      and wake up thousands of processes and get nasty latency spikes.  That
      is likely 4.14 material, but we're still discussing the page waitqueue
      specific parts of it.
      
      In the meantime, I've tried to look at making the page wait queues less
      expensive, and failing miserably.  If you have thousands of threads
      waiting for the same page, it will be painful.  We'll need to try to
      figure out the NUMA balancing issue some day, in addition to avoiding
      the excessive spinlock hold times.
      
      That said, having tried to rewrite the page wait queues, I can at least
      fix up some of the braindamage in the current situation. In particular:
      
       (a) we don't want to continue walking the page wait list if the bit
           we're waiting for already got set again (which seems to be one of
           the patterns of the bad load).  That makes no progress and just
           causes pointless cache pollution chasing the pointers.
      
       (b) we don't want to put the non-locking waiters always on the front of
           the queue, and the locking waiters always on the back.  Not only is
           that unfair, it means that we wake up thousands of reading threads
           that will just end up being blocked by the writer later anyway.
      
      Also add a comment about the layout of 'struct wait_page_key' - there is
      an external user of it in the cachefiles code that means that it has to
      match the layout of 'struct wait_bit_key' in the first two members.  It
      so happens to match, because 'struct page *' and 'unsigned long *' end
      up having the same values simply because the page flags are the first
      member in struct page.
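
      For reference, the layout being described is roughly (a sketch of the
      two structs as they looked around this time):

              struct wait_bit_key {
                      void            *flags;
                      int             bit_nr;
                      unsigned long   timeout;
              };

              struct wait_page_key {
                      struct page     *page;       /* must line up with ->flags */
                      int             bit_nr;      /* must line up with ->bit_nr */
                      int             page_match;
              };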
      
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3510ca20
  9. 26 Aug 2017, 1 commit
    • time: Fix ktime_get_raw() incorrect base accumulation · 0bcdc098
      By John Stultz
      In commit fc6eead7 ("time: Clean up CLOCK_MONOTONIC_RAW time
      handling"), the following code got mistakenly added to the update of the
      raw timekeeper:
      
       /* Update the monotonic raw base */
       seconds = tk->raw_sec;
       nsec = (u32)(tk->tkr_raw.xtime_nsec >> tk->tkr_raw.shift);
       tk->tkr_raw.base = ns_to_ktime(seconds * NSEC_PER_SEC + nsec);
      
      Which adds the raw_sec value and the shifted down raw xtime_nsec to the
      base value.
      
      But the read function adds the shifted down tk->tkr_raw.xtime_nsec value
      a second time. The result of this is that ktime_get_raw() users (which
      are all internal users) see the raw time move faster than it should (the
      rate can vary with the current size of tkr_raw.xtime_nsec), which has
      resulted in at least problems with graphics rendering performance.
      
      The change tried to match the monotonic base update logic:
      
       seconds = (u64)(tk->xtime_sec + tk->wall_to_monotonic.tv_sec);
       nsec = (u32) tk->wall_to_monotonic.tv_nsec;
       tk->tkr_mono.base = ns_to_ktime(seconds * NSEC_PER_SEC + nsec);
      
      Which adds the wall_to_monotonic.tv_nsec value, but not the
      tk->tkr_mono.xtime_nsec value to the base.
      
      To fix this, simplify the tkr_raw.base accumulation to only accumulate the
      raw_sec portion, and do not include the tkr_raw.xtime_nsec portion, which
      will be added at read time.
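
      The corrected accumulation is then simply (a sketch matching the
      description above):

              /* Update the monotonic raw base; tkr_raw.xtime_nsec is added by
               * the readers, so do not fold it into the base here. */
              tk->tkr_raw.base = ns_to_ktime(tk->raw_sec * NSEC_PER_SEC);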
      
      Fixes: fc6eead7 ("time: Clean up CLOCK_MONOTONIC_RAW time handling")
      Reported-and-tested-by: Chris Wilson <chris@chris-wilson.co.uk>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Kevin Brodsky <kevin.brodsky@arm.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Stephen Boyd <stephen.boyd@linaro.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Miroslav Lichvar <mlichvar@redhat.com>
      Cc: Daniel Mentz <danielmentz@google.com>
      Link: http://lkml.kernel.org/r/1503701824-1645-1-git-send-email-john.stultz@linaro.org
      0bcdc098