1. 15 Sep 2017 (2 commits)
    • sched/wait: Introduce wakeup bookmark in wake_up_page_bit · 11a19c7b
      Committed by Tim Chen
      Now that we have added breaks in the wait queue scan and allow a bookmark
      to record the scan position, put this logic in the wake_up_page_bit()
      function (see the sketch after this entry).
      
      We can have very long page wait lists on large systems where multiple
      pages share the same wait list. We break up the wake-up walk here to give
      other CPUs a chance to access the list, and to avoid keeping interrupts
      disabled while traversing the list for too long.  This reduces the
      interrupt and rescheduling latency, and the excessive page wait queue
      lock hold time.
      
      [ v2: Remove bookmark_wake_function ]
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      11a19c7b
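      A minimal sketch (not the verbatim kernel code) of the resulting wake-up
      loop, assuming the bookmark-aware helper __wake_up_locked_key_bookmark()
      and the WQ_FLAG_BOOKMARK flag added by this pair of patches; the
      page-bit key setup is elided and the function name is illustrative:
      
        #include <linux/sched.h>
        #include <linux/spinlock.h>
        #include <linux/wait.h>
        
        /*
         * Wake the waiters on a page wait queue in bounded chunks.  The
         * bookmark entry records where the previous pass stopped;
         * WQ_FLAG_BOOKMARK stays set while the walk is unfinished, so the
         * queue lock is dropped and re-taken between passes and other CPUs
         * (and interrupts) can get at the list in the gaps.
         */
        static void wake_up_page_bit_sketch(struct wait_queue_head *q, void *key)
        {
                wait_queue_entry_t bookmark;
                unsigned long flags;
        
                bookmark.flags = 0;
                bookmark.private = NULL;
                bookmark.func = NULL;
                INIT_LIST_HEAD(&bookmark.entry);
        
                spin_lock_irqsave(&q->lock, flags);
                __wake_up_locked_key_bookmark(q, TASK_NORMAL, key, &bookmark);
                spin_unlock_irqrestore(&q->lock, flags);
        
                while (bookmark.flags & WQ_FLAG_BOOKMARK) {
                        /* Lock dropped: other CPUs can access the list here. */
                        spin_lock_irqsave(&q->lock, flags);
                        __wake_up_locked_key_bookmark(q, TASK_NORMAL, key,
                                                      &bookmark);
                        spin_unlock_irqrestore(&q->lock, flags);
                }
        }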
    • sched/wait: Break up long wake list walk · 2554db91
      Committed by Tim Chen
      We encountered workloads that have very long wake-up lists on large
      systems. A waker takes a long time to traverse the entire wake list and
      execute all the wake functions.
      
      We saw page wait lists that are up to 3700+ entries long in tests of
      large 4- and 8-socket systems. It took 0.8 sec to traverse such a list
      during wake up. Any other CPU that contends for the list spin lock will
      spin for a long time. This is a result of the NUMA balancing migration
      of hot pages that are shared by many threads.
      
      Multiple waking CPUs are queued up behind the lock, and the last one
      queued has to wait until all the other CPUs have finished their wakeups.
      
      The page wait list is traversed with interrupts disabled, which caused
      various problems. This was the original cause of the NMI watchdog
      timeouts reported in https://patchwork.kernel.org/patch/9800303/ ; the
      only thing that helped there was extending the NMI watchdog timeout.
      
      This patch bookmarks the waker's scan position in the wake list and
      breaks up the wake-up walk, to allow other CPUs to access the list
      before the waker resumes its walk down the rest of the wait list
      (sketched after this entry). It lowers the interrupt and rescheduling
      latency.
      
      This patch also provides a performance boost when combined with the next
      patch to break up page wakeup list walk. We saw 22% improvement in the
      will-it-scale file pread2 test on a Xeon Phi system running 256 threads.
      
      [ v2: Merged in Linus' changes to remove the bookmark_wake_function, and
        simplify access to flags. ]
      Reported-by: Kan Liang <kan.liang@intel.com>
      Tested-by: Kan Liang <kan.liang@intel.com>
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2554db91
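      A simplified, hedged sketch of the walk-breaking idea described above;
      upstream this logic lives in __wake_up_common(), and the WALK_BREAK_CNT
      constant and function name here are illustrative stand-ins:
      
        #include <linux/list.h>
        #include <linux/wait.h>
        
        #define WALK_BREAK_CNT 64  /* illustrative; upstream: WAITQUEUE_WALK_BREAK_CNT */
        
        /*
         * Walk the wait list and call each entry's wake function, but stop
         * after WALK_BREAK_CNT entries.  The scan position is saved by
         * inserting the caller-provided bookmark entry into the list and
         * flagging it, so the caller can drop the queue lock, give other CPUs
         * a chance, and resume the walk later.
         */
        static int wake_walk_sketch(struct wait_queue_head *wq_head,
                                    unsigned int mode, int wake_flags, void *key,
                                    wait_queue_entry_t *bookmark)
        {
                wait_queue_entry_t *curr, *next;
                int cnt = 0;
        
                if (bookmark->flags & WQ_FLAG_BOOKMARK) {
                        /* Resume right after where the previous pass stopped. */
                        curr = list_next_entry(bookmark, entry);
                        list_del(&bookmark->entry);
                        bookmark->flags = 0;
                } else {
                        curr = list_first_entry(&wq_head->head,
                                                wait_queue_entry_t, entry);
                }
        
                list_for_each_entry_safe_from(curr, next, &wq_head->head, entry) {
                        curr->func(curr, mode, wake_flags, key);
        
                        if (++cnt > WALK_BREAK_CNT &&
                            &next->entry != &wq_head->head) {
                                /* Park the bookmark before the next entry. */
                                bookmark->flags = WQ_FLAG_BOOKMARK;
                                list_add_tail(&bookmark->entry, &next->entry);
                                break;
                        }
                }
                return cnt;
        }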
  2. 12 Sep 2017 (4 commits)
  3. 11 Sep 2017 (1 commit)
  4. 09 Sep 2017 (16 commits)
  5. 07 Sep 2017 (9 commits)
    • sched/cpuset/pm: Fix cpuset vs. suspend-resume bugs · 50e76632
      Committed by Peter Zijlstra
      Cpusets vs. suspend-resume is _completely_ broken. And it got noticed
      because it now resulted in non-cpuset usage breaking too.
      
      On suspend cpuset_cpu_inactive() doesn't call into
      cpuset_update_active_cpus() because it doesn't want to move tasks about,
      there is no need, all tasks are frozen and won't run again until after
      we've resumed everything.
      
      But this means that when we finally do call into
      cpuset_update_active_cpus() after resuming the last frozen CPU in
      cpuset_cpu_active(), the top_cpuset will not differ from the
      cpu_active_mask and thus it will not in fact do _anything_.
      
      So the cpuset configuration will not be restored. This was largely
      hidden because we would unconditionally create identity domains and
      mobile users would not in fact use cpusets much. And servers that do use
      cpusets tend not to suspend-resume much.
      
      An additional problem is that we'd not in fact wait for the cpuset work
      to finish before resuming the tasks, allowing spurious migrations
      outside of the specified domains.
      
      Fix the rebuild by introducing cpuset_force_rebuild() and fix the
      ordering with cpuset_wait_for_hotplug() (sketched after this entry).
      Reported-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: deb7aa30 ("cpuset: reorganize CPU / memory hotplug handling")
      Link: http://lkml.kernel.org/r/20170907091338.orwxrqkbfkki3c24@hirez.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      50e76632
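      A hedged sketch of the shape of the fix, based on the description above:
      a force_rebuild flag set by the resume path via cpuset_force_rebuild(),
      and cpuset_wait_for_hotplug() flushing the hotplug work so tasks are
      only thawed after the cpusets have been rebuilt. The surrounding cpuset
      logic is reduced to comments:
      
        #include <linux/workqueue.h>
        
        static bool force_rebuild;
        
        static void cpuset_hotplug_workfn(struct work_struct *work);
        static DECLARE_WORK(cpuset_hotplug_work, cpuset_hotplug_workfn);
        
        /* Called when the last frozen CPU comes back online on resume. */
        void cpuset_force_rebuild(void)
        {
                force_rebuild = true;
        }
        
        /* Called by the PM core before thawing tasks. */
        void cpuset_wait_for_hotplug(void)
        {
                flush_work(&cpuset_hotplug_work);
        }
        
        static void cpuset_hotplug_workfn(struct work_struct *work)
        {
                bool cpus_updated = false;  /* illustrative: derived from cpu_active_mask */
        
                /* ... update top_cpuset and propagate to child cpusets ... */
        
                /*
                 * After a full suspend-resume cycle the active mask looks
                 * unchanged, so also rebuild when explicitly asked to.
                 */
                if (cpus_updated || force_rebuild) {
                        force_rebuild = false;
                        /* rebuild_sched_domains(); */
                }
        }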
    • genirq: Make sparse_irq_lock protect what it should protect · 12ac1d0f
      Committed by Thomas Gleixner
      for_each_active_irq() iterates the sparse irq allocation bitmap. The
      caller must hold sparse_irq_lock. Several code paths expect that an
      active bit in the sparse bitmap also has a valid interrupt descriptor.
      
      Unfortunately that's not true. The (de)allocation is a two step process,
      which holds the sparse_irq_lock only across the queue/remove from the radix
      tree and the set/clear in the allocation bitmap.
      
      If an iteration takes sparse_irq_lock between the two steps, then it
      might see an active bit whose corresponding irq descriptor is NULL. If
      that is dereferenced unconditionally, the kernel oopses. Of course, all
      iterator sites could be audited and fixed, but....
      
      There is no reason why the sparse_irq_lock needs to be dropped between
      the two steps; in fact the code becomes simpler when the mutex is held
      across both, and the semantics become more straightforward, so future
      problems with missing NULL pointer checks in the iteration are avoided
      and all existing sites are fixed in one go.
      
      Expand the lock-held sections so both operations are covered and the
      bitmap and the radix tree stay in sync (see the sketch after this
      entry).
      
      Fixes: a05a900a ("genirq: Make sparse_lock a mutex")
      Reported-and-tested-by: Huang Ying <ying.huang@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      12ac1d0f
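      A hedged sketch of the locking change: hold the mutex across both the
      descriptor (radix-tree) update and the bitmap update, so an iterator can
      never observe an allocated bit without a valid descriptor. The *_stub
      helpers and the MAX_IRQS bound are illustrative, not the real
      kernel/irq/irqdesc.c code:
      
        #include <linux/bitmap.h>
        #include <linux/mutex.h>
        
        #define MAX_IRQS 1024   /* illustrative bound */
        
        static DEFINE_MUTEX(sparse_irq_lock);
        static DECLARE_BITMAP(allocated_irqs, MAX_IRQS);
        
        /* Stand-ins for inserting/removing irq descriptors in the radix tree. */
        static void irq_insert_desc_stub(unsigned int irq) { }
        static void irq_remove_desc_stub(unsigned int irq) { }
        
        static void alloc_irqs_sketch(unsigned int start, unsigned int cnt)
        {
                unsigned int i;
        
                mutex_lock(&sparse_irq_lock);
                /* Descriptor insertion and bitmap update are one locked unit... */
                for (i = 0; i < cnt; i++)
                        irq_insert_desc_stub(start + i);
                bitmap_set(allocated_irqs, start, cnt);
                mutex_unlock(&sparse_irq_lock);
        }
        
        static void free_irqs_sketch(unsigned int start, unsigned int cnt)
        {
                unsigned int i;
        
                mutex_lock(&sparse_irq_lock);
                /* ...and so is removal, so an active bit always has a descriptor. */
                bitmap_clear(allocated_irqs, start, cnt);
                for (i = 0; i < cnt; i++)
                        irq_remove_desc_stub(start + i);
                mutex_unlock(&sparse_irq_lock);
        }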
    • sched/fair: Fix wake_affine_llc() balancing rules · a731ebe6
      Committed by Peter Zijlstra
      Chris Wilson reported that the SMT balance rules got the +1 on the
      wrong side, resulting in a bias towards the current LLC, which the
      load-balancer would then try to undo.
      Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
      Tested-by: Chris Wilson <chris@chris-wilson.co.uk>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Fixes: 90001d67 ("sched/fair: Fix wake_affine() for !NUMA_BALANCING")
      Link: http://lkml.kernel.org/r/20170906105131.gqjmaextmn3u6tj2@hirez.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a731ebe6
    • tracing: Apply trace_clock changes to instance max buffer · 170b3b10
      Committed by Baohong Liu
      Currently trace_clock timestamps are applied to both the regular and the
      max buffer only for the global trace. For instance traces, trace_clock
      timestamps are applied only to the regular buffer. But the regular and
      max buffers can be swapped, for example following a snapshot, so for
      instance traces bad timestamps can be seen following a snapshot. Let's
      apply trace_clock timestamps to the instance max buffer as well
      (sketched after this entry).
      
      Link: http://lkml.kernel.org/r/ebdb168d0be042dcdf51f81e696b17fabe3609c1.1504642143.git.tom.zanussi@linux.intel.com
      
      Cc: stable@vger.kernel.org
      Fixes: 277ba044 ("tracing: Add interface to allow multiple trace buffers")
      Signed-off-by: Baohong Liu <baohong.liu@intel.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      170b3b10
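      A hedged sketch of the gist, assuming the 4.13-era struct trace_array
      layout (tr->trace_buffer and tr->max_buffer, declared in
      kernel/trace/trace.h) and the CONFIG_TRACER_MAX_TRACE guard; the
      function name is illustrative:
      
        #include <linux/ring_buffer.h>
        #include "trace.h"      /* kernel/trace/trace.h: struct trace_array */
        
        static void set_instance_clock_sketch(struct trace_array *tr,
                                              u64 (*clock)(void))
        {
                ring_buffer_set_clock(tr->trace_buffer.buffer, clock);
        
        #ifdef CONFIG_TRACER_MAX_TRACE
                /*
                 * The max (snapshot) buffer can be swapped with the regular
                 * buffer, so it must use the same clock or its timestamps
                 * become meaningless after a swap.
                 */
                if (tr->max_buffer.buffer)
                        ring_buffer_set_clock(tr->max_buffer.buffer, clock);
        #endif
        }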
    • mm,fork: introduce MADV_WIPEONFORK · d2cd9ede
      Committed by Rik van Riel
      Introduce MADV_WIPEONFORK semantics, which result in a VMA being empty
      in the child process after fork.  This differs from MADV_DONTFORK in one
      important way.
      
      If a child process accesses memory that was MADV_WIPEONFORK, it will get
      zeroes.  The address ranges are still valid, they are just empty.
      
      If a child process accesses memory that was MADV_DONTFORK, it will get a
      segmentation fault, since those address ranges are no longer valid in
      the child after fork.
      
      Since MADV_DONTFORK also seems to be used to allow very large programs
      to fork in systems with strict memory overcommit restrictions, changing
      the semantics of MADV_DONTFORK might break existing programs.
      
      MADV_WIPEONFORK only works on private, anonymous VMAs.
      
      The use case is libraries that store or cache information, and want to
      know that they need to regenerate it in the child process after fork.
      
      Examples of this would be:
       - systemd/pulseaudio API checks (fail after fork) (replacing a getpid
         check, which is too slow without a PID cache)
       - PKCS#11 API reinitialization check (mandated by specification)
       - glibc's upcoming PRNG (reseed after fork)
       - OpenSSL PRNG (reseed after fork)
      
      The security benefits of a forking server having a re-initialized PRNG
      in every child process are pretty obvious.  However, due to libraries
      having all kinds of internal state, and programs getting compiled with
      many different versions of each library, it is unreasonable to expect
      calling programs to re-initialize everything manually after fork.
      
      A further complication is the proliferation of clone flags, programs
      bypassing glibc's functions to call clone directly, and programs calling
      unshare, causing the glibc pthread_atfork hook to not get called.
      
      It would be better to have the kernel take care of this automatically.
      
      The patch also adds MADV_KEEPONFORK, to undo the effects of a prior
      MADV_WIPEONFORK.
      
      This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:
      
          https://man.openbsd.org/minherit.2
      
      [akpm@linux-foundation.org: numerically order arch/parisc/include/uapi/asm/mman.h #defines]
      Link: http://lkml.kernel.org/r/20170811212829.29186-3-riel@redhat.com
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Reported-by: Florian Weimer <fweimer@redhat.com>
      Reported-by: Colm MacCártaigh <colm@allcosts.net>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Drewry <wad@chromium.org>
      Cc: <linux-api@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d2cd9ede
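      A small userspace illustration of the semantics described above
      (requires a kernel that accepts MADV_WIPEONFORK, i.e. 4.14+; the
      fallback #define uses the generic architecture value):
      
        #define _GNU_SOURCE
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <sys/types.h>
        #include <sys/wait.h>
        #include <unistd.h>
        
        #ifndef MADV_WIPEONFORK
        #define MADV_WIPEONFORK 18      /* generic value; parisc differs */
        #endif
        
        int main(void)
        {
                size_t len = 4096;
                char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (p == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }
        
                strcpy(p, "secret PRNG state");
        
                /* Only valid on private anonymous VMAs; older kernels: EINVAL. */
                if (madvise(p, len, MADV_WIPEONFORK) != 0) {
                        perror("madvise(MADV_WIPEONFORK)");
                        return 1;
                }
        
                pid_t pid = fork();
                if (pid == 0) {
                        /* Child: range is still mapped, but reads back zeroes. */
                        printf("child sees  \"%s\" (first byte %d)\n", p, p[0]);
                        _exit(0);
                }
                wait(NULL);
                /* Parent: contents are untouched. */
                printf("parent sees \"%s\"\n", p);
                /* MADV_KEEPONFORK would undo the wipe-on-fork marking. */
                return 0;
        }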
    • mm: oom: let oom_reap_task and exit_mmap run concurrently · 21292580
      Committed by Andrea Arcangeli
      This is purely required because exit_aio() may block and exit_mmap() may
      never start, if oom_reap_task() cannot start running on an mm with
      mm_users == 0.
      
      At the same time if the OOM reaper doesn't wait at all for the memory of
      the current OOM candidate to be freed by exit_mmap->unmap_vmas, it would
      generate a spurious OOM kill.
      
      If it weren't for exit_aio() or similar blocking functions in the last
      mmput(), it would be enough to change oom_reap_task(), in the case where
      it finds mm_users == 0, to wait for a timeout or to wait for __mmput()
      to set MMF_OOM_SKIP itself. But exit_mmap() is not the only problem
      here, so making exit_mmap() and oom_reap_task() able to run concurrently
      is warranted.
      
      It's a non-standard runtime: exit_mmap() runs without mmap_sem, and
      oom_reap_task() runs with mmap_sem held for reading as usual (similar to
      MADV_DONTNEED).
      
      The race between the two is solved with a combination of
      tsk_is_oom_victim() (serialized by task_lock) and MMF_OOM_SKIP
      (serialized by a dummy down_write/up_write cycle, along the same lines
      as the ksm_exit() method).
      
      If oom_reap_task() may be running concurrently during exit_mmap(),
      exit_mmap() will wait for it to finish in down_write() (before taking
      down mm structures that would make oom_reap_task() fail with a
      use-after-free); see the sketch after this entry.
      
      If exit_mmap() comes first, oom_reap_task() will skip the mm because
      MMF_OOM_SKIP is already set; in that case all memory is already freed,
      and furthermore the mm data structures may already have been taken down
      by free_pgtables().
      
      [aarcange@redhat.com: incremental one liner]
        Link: http://lkml.kernel.org/r/20170726164319.GC29716@redhat.com
      [rientjes@google.com: remove unused mmput_async]
        Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1708141733130.50317@chino.kir.corp.google.com
      [aarcange@redhat.com: microoptimization]
        Link: http://lkml.kernel.org/r/20170817171240.GB5066@redhat.com
      Link: http://lkml.kernel.org/r/20170726162912.GA29716@redhat.com
      Fixes: 26db62f1 ("oom: keep mm of the killed task available")
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reported-by: David Rientjes <rientjes@google.com>
      Tested-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Michal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      21292580
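      A hedged sketch of the exit_mmap() side of this synchronization, based
      on the description above; the wrapper function and its exact placement
      relative to free_pgtables() are illustrative:
      
        #include <linux/mm.h>
        #include <linux/oom.h>
        #include <linux/sched.h>
        #include <linux/sched/coredump.h>
        
        /*
         * Called from exit_mmap() after unmap_vmas() has freed the memory.
         * Mark the mm as no longer worth reaping, then do a dummy write-lock
         * cycle on mmap_sem so a concurrent oom_reap_task() (which holds
         * mmap_sem for reading) is guaranteed to have finished before the
         * caller tears down the page tables.
         */
        static void exit_mmap_oom_sync_sketch(struct mm_struct *mm)
        {
                if (!tsk_is_oom_victim(current))
                        return;
        
                set_bit(MMF_OOM_SKIP, &mm->flags);
        
                down_write(&mm->mmap_sem);      /* waits out the reaper's read section */
                up_write(&mm->mmap_sem);
        }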
    • mm: replace TIF_MEMDIE checks by tsk_is_oom_victim · da99ecf1
      Committed by Michal Hocko
      TIF_MEMDIE is set only on tasks which were either directly selected by
      the OOM killer or passed through mark_oom_victim() from the allocator
      path.  tsk_is_oom_victim() is more generic and allows identifying all
      tasks (threads) which share the mm with the oom victim.
      
      Please note that the freezer still needs to check TIF_MEMDIE because we
      cannot thaw tasks which do not participate in oom_victims counting,
      otherwise a !TIF_MEMDIE task could interfere after oom_disable() returns.
      
      Link: http://lkml.kernel.org/r/20170810075019.28998-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      da99ecf1
    • mm, devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups · ab1b597e
      Committed by Dan Williams
      devm_memremap_pages() records mapped ranges in pgmap_radix with an entry
      per section's worth of memory (128MB).  The key for each of those
      entries is a section number.
      
      This leads to false positives when devm_memremap_pages() is passed a
      section-unaligned range, as lookups in the misaligned portion fail to
      return NULL.  We can close this hole by using the pfn as the key for
      entries in the tree.  The number of entries required to describe a
      remapped range is reduced by leveraging multi-order entries.
      
      In practice this approach usually yields just one entry in the tree if
      the size and starting address are of the same power-of-2 alignment.
      Previously we always needed nr_entries = mapping_size / 128MB.
      
      Link: https://lists.01.org/pipermail/linux-nvdimm/2016-August/006666.html
      Link: http://lkml.kernel.org/r/150215410565.39310.13767886055248249438.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reported-by: Toshi Kani <toshi.kani@hpe.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ab1b597e
    • cgroup: revert fa06235b ("cgroup: reset css on destruction") · 65f3975f
      Committed by Roman Gushchin
      Commit fa06235b ("cgroup: reset css on destruction") caused the
      css_reset callback to be called from the offlining path.  Although it
      solves the problem mentioned in the commit description ("For instance,
      memory cgroup needs to reset memory.low, otherwise pages charged to a
      dead cgroup might never get reclaimed."), generally speaking, it's not
      correct.
      
      An offline cgroup can still be a resource domain, and we shouldn't grant
      it more resources than it had before deletion.
      
      For instance, if an offline memory cgroup has dirty pages, we should
      still imply i/o limits during writeback.
      
      The css_reset callback is designed to return the cgroup state to its
      original state, that is, to reset all limits and counters.  It's
      something different from offlining, and we shouldn't use it from the
      offlining path.  Instead, we should adjust the necessary settings from
      the per-controller css_offline callbacks (e.g.  reset memory.low).
      
      Link: http://lkml.kernel.org/r/20170727130428.28856-2-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      65f3975f
  6. 06 Sep 2017 (3 commits)
    • genirq/msi: Fix populating multiple interrupts · 596a7a1d
      Committed by John Keeping
      On allocating the interrupts routed via a wire-to-MSI bridge, the allocator
      iterates over the MSI descriptors to build the hierarchy, but fails to use
      the descriptor interrupt number, and instead uses the base number,
      generating the wrong IRQ domain mappings.
      
      The fix is to use the MSI descriptor interrupt number when setting up
      the interrupt instead of the base interrupt for the allocation range.
      
      The only saving grace is that although the MSI descriptors are allocated
      in bulk, the wired interrupts are only allocated one by one (so
      desc->irq == virq) and the bug went unnoticed so far.
      
      Fixes: 2145ac93 ("genirq/msi: Add msi_domain_populate_irqs")
      Signed-off-by: John Keeping <john@metanate.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20170906103540.373864a2.john@metanate.com
      596a7a1d
    • bpf: fix numa_node validation · 96e5ae4e
      Committed by Eric Dumazet
      syzkaller reported crashes in bpf map creation or map update [1].
      
      The problem is that nr_node_ids is a signed integer and NUMA_NO_NODE is
      also an integer, so it is very tempting to declare numa_node as a
      signed integer too.
      
      This means the typical test to validate a user-provided value:
      
              if (numa_node != NUMA_NO_NODE &&
                  (numa_node >= nr_node_ids ||
                   !node_online(numa_node)))
      
      must be written:
      
              if (numa_node != NUMA_NO_NODE &&
                  ((unsigned int)numa_node >= nr_node_ids ||
                   !node_online(numa_node)))
      
      [1]
      kernel BUG at mm/slab.c:3256!
      invalid opcode: 0000 [#1] SMP KASAN
      Dumping ftrace buffer:
         (ftrace buffer empty)
      Modules linked in:
      CPU: 0 PID: 2946 Comm: syzkaller916108 Not tainted 4.13.0-rc7+ #35
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      task: ffff8801d2bc60c0 task.stack: ffff8801c0c90000
      RIP: 0010:____cache_alloc_node+0x1d4/0x1e0 mm/slab.c:3292
      RSP: 0018:ffff8801c0c97638 EFLAGS: 00010096
      RAX: ffffffffffff8b7b RBX: 0000000001080220 RCX: 0000000000000000
      RDX: 00000000ffff8b7b RSI: 0000000001080220 RDI: ffff8801dac00040
      RBP: ffff8801c0c976c0 R08: 0000000000000000 R09: 0000000000000000
      R10: ffff8801c0c97620 R11: 0000000000000001 R12: ffff8801dac00040
      R13: ffff8801dac00040 R14: 0000000000000000 R15: 00000000ffff8b7b
      FS:  0000000002119940(0000) GS:ffff8801db200000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020001fec CR3: 00000001d2980000 CR4: 00000000001406f0
      Call Trace:
       __do_kmalloc_node mm/slab.c:3688 [inline]
       __kmalloc_node+0x33/0x70 mm/slab.c:3696
       kmalloc_node include/linux/slab.h:535 [inline]
       alloc_htab_elem+0x2a8/0x480 kernel/bpf/hashtab.c:740
       htab_map_update_elem+0x740/0xb80 kernel/bpf/hashtab.c:820
       map_update_elem kernel/bpf/syscall.c:587 [inline]
       SYSC_bpf kernel/bpf/syscall.c:1468 [inline]
       SyS_bpf+0x20c5/0x4c40 kernel/bpf/syscall.c:1443
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      RIP: 0033:0x440409
      RSP: 002b:00007ffd1f1792b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
      RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 0000000000440409
      RDX: 0000000000000020 RSI: 0000000020006000 RDI: 0000000000000002
      RBP: 0000000000000086 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000401d70
      R13: 0000000000401e00 R14: 0000000000000000 R15: 0000000000000000
      Code: 83 c2 01 89 50 18 4c 03 70 08 e8 38 f4 ff ff 4d 85 f6 0f 85 3e ff ff ff 44 89 fe 4c 89 ef e8 94 fb ff ff 49 89 c6 e9 2b ff ff ff <0f> 0b 0f 0b 0f 0b 66 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41
      RIP: ____cache_alloc_node+0x1d4/0x1e0 mm/slab.c:3292 RSP: ffff8801c0c97638
      ---[ end trace d745f355da2e33ce ]---
      Kernel panic - not syncing: Fatal exception
      
      Fixes: 96eabe7a ("bpf: Allow selecting numa node during map creation")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      96e5ae4e
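      The cast matters because of C's usual arithmetic conversions. A
      standalone illustration (not kernel code) of why a negative numa_node
      other than NUMA_NO_NODE slips past the unchecked comparison:
      
        #include <stdio.h>
        
        #define NUMA_NO_NODE (-1)
        
        int main(void)
        {
                int nr_node_ids = 2;    /* signed, as in the kernel at the time */
                int numa_node = -2;     /* user-supplied, negative, != NUMA_NO_NODE */
        
                /* Buggy check: -2 < 2, so the bogus node is accepted. */
                if (numa_node != NUMA_NO_NODE && numa_node >= nr_node_ids)
                        printf("buggy check rejects %d\n", numa_node);
                else
                        printf("buggy check accepts %d\n", numa_node);
        
                /* Fixed check: (unsigned int)-2 is huge, so the bound test rejects it. */
                if (numa_node != NUMA_NO_NODE &&
                    (unsigned int)numa_node >= (unsigned int)nr_node_ids)
                        printf("fixed check rejects %d\n", numa_node);
                else
                        printf("fixed check accepts %d\n", numa_node);
        
                return 0;
        }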
    • tracing: Fix clear of RECORDED_TGID flag when disabling trace event · 7685ab6c
      Committed by Chunyu Hu
      When disabling a trace event, the RECORDED_TGID flag in the event file
      is not correctly cleared: the code clears the RECORDED_CMD flag when it
      should clear the RECORDED_TGID flag.
      
      Link: http://lkml.kernel.org/r/1504589806-8425-1-git-send-email-chuhu@redhat.com
      
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: stable@vger.kernel.org
      Fixes: d914ba37 ("tracing: Add support for recording tgid of tasks")
      Signed-off-by: Chunyu Hu <chuhu@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      7685ab6c
  7. 05 Sep 2017 (3 commits)
    • tracing: Add barrier to trace_printk() buffer nesting modification · 3d9622c1
      Committed by Steven Rostedt (VMware)
      trace_printk() uses 4 buffers, one for each context (normal, softirq, irq
      and NMI), such that it does not need to worry about one context preempting
      the other. There's a nesting counter that gets incremented to figure out
      which buffer to use. If the context gets preempted by another context which
      calls trace_printk() it will increment the counter and use the next buffer,
      and restore the counter when it is finished.
      
      The problem is that gcc may optimize the modification of the buffer nesting
      counter and it may not be incremented in memory before the buffer is used.
      If this happens, and the context gets interrupted by another context, it
      could pick the same buffer and corrupt the one that is being used.
      
      Compiler barriers need to be added after the nesting variable is
      incremented and before it is decremented, to prevent usage of the
      context buffers by more than one context at the same time (a simplified
      sketch follows this entry).
      
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: stable@vger.kernel.org
      Fixes: e2ace001 ("tracing: Choose static tp_printk buffer by explicit nesting count")
      Hat-tip-to: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      3d9622c1
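      A simplified, hedged sketch of the barrier placement; the per-CPU
      handling and exact buffer layout of the real get_trace_buf() and
      put_trace_buf() are elided, and the struct and function names here are
      illustrative:
      
        #include <linux/compiler.h>
        
        #define TRACE_CTX_LEVELS 4      /* normal, softirq, irq, NMI */
        #define TRACE_BUF_SIZE   256    /* illustrative size */
        
        struct trace_buf_sketch {
                char buffer[TRACE_CTX_LEVELS][TRACE_BUF_SIZE];
                int  nesting;
        };
        
        static char *get_buf_sketch(struct trace_buf_sketch *b)
        {
                if (b->nesting >= TRACE_CTX_LEVELS)
                        return NULL;
        
                b->nesting++;
                /*
                 * Make the increment visible before the buffer is touched, so
                 * an interrupt that nests a trace_printk() picks the next
                 * slot instead of scribbling over this one.
                 */
                barrier();
                return b->buffer[b->nesting - 1];
        }
        
        static void put_buf_sketch(struct trace_buf_sketch *b)
        {
                /* Don't let the decrement be reordered before the last use. */
                barrier();
                b->nesting--;
        }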
    • audit: update the function comments · 196a5085
      Committed by Geliang Tang
      Update the function comments to match the code.
      Signed-off-by: Geliang Tang <geliangtang@gmail.com>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
      196a5085
    • audit: Reduce overhead using a coarse clock · e832bf48
      Committed by Mel Gorman
      Commit 2115bb25 ("audit: Use timespec64 to represent audit timestamps")
      noted that audit timestamps were not y2038 safe and switched to a 64-bit
      timestamp. In itself this makes sense, but the conversion was from
      CURRENT_TIME to ktime_get_real_ts64(), which is a heavier call used to
      record a fully accurate timestamp; that accuracy is required in some,
      but not all, cases. The impact is that when auditd is running without
      any rules, all syscalls have higher overhead. This is visible in the
      sysbench-thread benchmark as an 11.5% performance hit. That benchmark is
      dumb as rocks, but the hit is also visible in redis as an 8-10% slowdown
      on all operations, which is of greater concern. It is somewhat stupid of
      audit to track syscalls without any rules related to syscalls, but that
      is how it behaves.
      
      The overhead can be directly measured with perf comparing 4.9 with 4.12
      
      4.9
           7.76%  sysbench         [kernel.vmlinux]    [k] __schedule
           7.62%  sysbench         [kernel.vmlinux]    [k] _raw_spin_lock
           7.37%  sysbench         libpthread-2.22.so  [.] __lll_lock_elision
           7.29%  sysbench         [kernel.vmlinux]    [.] syscall_return_via_sysret
           6.59%  sysbench         [kernel.vmlinux]    [k] native_sched_clock
           5.21%  sysbench         libc-2.22.so        [.] __sched_yield
           4.38%  sysbench         [kernel.vmlinux]    [k] entry_SYSCALL_64
           4.28%  sysbench         [kernel.vmlinux]    [k] do_syscall_64
           3.49%  sysbench         libpthread-2.22.so  [.] __lll_unlock_elision
           3.13%  sysbench         [kernel.vmlinux]    [k] __audit_syscall_exit
           2.87%  sysbench         [kernel.vmlinux]    [k] update_curr
           2.73%  sysbench         [kernel.vmlinux]    [k] pick_next_task_fair
           2.31%  sysbench         [kernel.vmlinux]    [k] syscall_trace_enter
           2.20%  sysbench         [kernel.vmlinux]    [k] __audit_syscall_entry
      .....
           0.00%  swapper          [kernel.vmlinux]    [k] read_tsc
      
      4.12
           7.84%  sysbench         [kernel.vmlinux]    [k] __schedule
           7.05%  sysbench         [kernel.vmlinux]    [k] _raw_spin_lock
           6.57%  sysbench         libpthread-2.22.so  [.] __lll_lock_elision
           6.50%  sysbench         [kernel.vmlinux]    [.] syscall_return_via_sysret
           5.95%  sysbench         [kernel.vmlinux]    [k] read_tsc
           5.71%  sysbench         [kernel.vmlinux]    [k] native_sched_clock
           4.78%  sysbench         libc-2.22.so        [.] __sched_yield
           4.30%  sysbench         [kernel.vmlinux]    [k] entry_SYSCALL_64
           3.94%  sysbench         [kernel.vmlinux]    [k] do_syscall_64
           3.37%  sysbench         libpthread-2.22.so  [.] __lll_unlock_elision
           3.32%  sysbench         [kernel.vmlinux]    [k] __audit_syscall_exit
           2.91%  sysbench         [kernel.vmlinux]    [k] __getnstimeofday64
      
      Note the additional overhead from read_tsc, which goes from 0% to 5.95%.
      This is on a single-socket E3-1230, but similar overheads have been
      measured on an older machine, and the patch eliminates those as well.
      
      The patch in question has no explanation as to why a fully accurate
      timestamp is required and is likely an oversight.  Using a coarser, but
      monotonically increasing, timestamp the overhead can be eliminated (see
      the comparison after this entry).  While it can be worked around by
      configuring or disabling audit, it's tricky enough to detect that a
      kernel fix is justified. With this patch, we see the following;
      
      sysbenchthread
                                    4.9.0                 4.12.0                 4.12.0
                                  vanilla                vanilla            coarse-v1r1
      Amean     1         1.49 (   0.00%)        1.66 ( -11.42%)        1.51 (  -1.34%)
      Amean     3         1.48 (   0.00%)        1.65 ( -11.45%)        1.50 (  -0.96%)
      Amean     5         1.49 (   0.00%)        1.67 ( -12.31%)        1.51 (  -1.83%)
      Amean     7         1.49 (   0.00%)        1.66 ( -11.72%)        1.50 (  -0.67%)
      Amean     12        1.48 (   0.00%)        1.65 ( -11.57%)        1.52 (  -2.89%)
      Amean     16        1.49 (   0.00%)        1.65 ( -11.13%)        1.51 (  -1.73%)
      
      The benchmark reports the time required for different thread counts to
      lock/unlock a private mutex which, while dense, demonstrates the syscall
      overhead. This shows that 4.12 took an 11-12% hit and that the overhead
      is almost eliminated by the patch. While the variance is not reported
      here, it's well within the noise with the patch applied.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
      e832bf48
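      The same accurate-vs-coarse tradeoff is exposed to userspace by
      clock_gettime(): a coarse clock returns the cached, tick-granularity
      timestamp instead of reading the clocksource on every call. A standalone
      comparison (not the audit code itself):
      
        #define _GNU_SOURCE
        #include <stdio.h>
        #include <time.h>
        
        int main(void)
        {
                struct timespec accurate, coarse, res;
        
                /* Full-accuracy wall clock: may read the TSC/clocksource each call. */
                clock_gettime(CLOCK_REALTIME, &accurate);
        
                /* Coarse wall clock: the last tick's cached value, much cheaper. */
                clock_gettime(CLOCK_REALTIME_COARSE, &coarse);
                clock_getres(CLOCK_REALTIME_COARSE, &res);
        
                printf("accurate: %ld.%09ld\n",
                       (long)accurate.tv_sec, accurate.tv_nsec);
                printf("coarse:   %ld.%09ld (resolution %ld ns)\n",
                       (long)coarse.tv_sec, coarse.tv_nsec, res.tv_nsec);
                return 0;
        }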
  8. 02 Sep 2017 (2 commits)
    • bpf: sockmap update/simplify memory accounting scheme · 90a9631c
      Committed by John Fastabend
      Instead of tracking sk_wmem_queued and sk_mem_charge by incrementing
      them in the verdict SK_REDIRECT paths and decrementing them in the tx
      work path, use the skb_set_owner_w and sock_writeable helpers. This
      solves a few issues with the current code. First, in SK_REDIRECT the
      increments of sk_wmem_queued and sk_mem_charge were being done without
      the peer's sock lock being held. Under stress this can result in
      accounting errors when tx work and/or multiple verdict decisions are
      working on the peer psock.
      
      Additionally, this cleans up the code because we can rely on the
      default destructor to decrement memory accounting on kfree_skb. This
      also triggers sk_write_space when space becomes available on
      kfree_skb(), which wasn't happening before, and prevents __sk_free
      from being called until all in-flight packets are completed.
      
      Fixes: 174a79ff ("bpf: sockmap with sk redirect support")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      90a9631c
    • ftrace: Fix memleak when unregistering dynamic ops when tracing disabled · edb096e0
      Committed by Steven Rostedt (VMware)
      If function tracing is disabled by the user via the function-trace
      option or the proc sysctl file, and an ftrace_ops that was allocated on
      the heap is unregistered, then the shutdown code exits without doing the
      proper clean up. This was found via kmemleak while running the ftrace
      selftests, as one of the tests unregisters with function tracing
      disabled.
      
       # cat kmemleak
      unreferenced object 0xffffffffa0020000 (size 4096):
        comm "swapper/0", pid 1, jiffies 4294668889 (age 569.209s)
        hex dump (first 32 bytes):
          55 ff 74 24 10 55 48 89 e5 ff 74 24 18 55 48 89  U.t$.UH...t$.UH.
          e5 48 81 ec a8 00 00 00 48 89 44 24 50 48 89 4c  .H......H.D$PH.L
        backtrace:
          [<ffffffff81d64665>] kmemleak_vmalloc+0x85/0xf0
          [<ffffffff81355631>] __vmalloc_node_range+0x281/0x3e0
          [<ffffffff8109697f>] module_alloc+0x4f/0x90
          [<ffffffff81091170>] arch_ftrace_update_trampoline+0x160/0x420
          [<ffffffff81249947>] ftrace_startup+0xe7/0x300
          [<ffffffff81249bd2>] register_ftrace_function+0x72/0x90
          [<ffffffff81263786>] trace_selftest_ops+0x204/0x397
          [<ffffffff82bb8971>] trace_selftest_startup_function+0x394/0x624
          [<ffffffff81263a75>] run_tracer_selftest+0x15c/0x1d7
          [<ffffffff82bb83f1>] init_trace_selftests+0x75/0x192
          [<ffffffff81002230>] do_one_initcall+0x90/0x1e2
          [<ffffffff82b7d620>] kernel_init_freeable+0x350/0x3fe
          [<ffffffff81d61ec3>] kernel_init+0x13/0x122
          [<ffffffff81d72c6a>] ret_from_fork+0x2a/0x40
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      Cc: stable@vger.kernel.org
      Fixes: 12cce594 ("ftrace/x86: Allow !CONFIG_PREEMPT dynamic ops to use allocated trampolines")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      edb096e0