1. 25 Aug, 2017 2 commits
    • P
      sched/topology: Improve comments · a090c4f2
      Peter Zijlstra committed
      Mike provided a better comment for destroy_sched_domain() ...
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a090c4f2
    • S
      sched/topology: Fix memory leak in __sdt_alloc() · 213c5a45
      Shu Wang committed
      Found this issue by kmemleak: the 'sg' and 'sgc' pointers from
      __sdt_alloc() might be leaked, as each domain holds references to many
      groups, but destroy_sched_domain() only dropped the reference on the first group.
      
      Onlining and offlining a CPU can trigger this leak, and cause OOM.
      
      Reproducer for my 6-CPU machine:
      
        while true
        do
            echo 0 > /sys/devices/system/cpu/cpu5/online;
            echo 1 > /sys/devices/system/cpu/cpu5/online;
        done
      
        unreferenced object 0xffff88007d772a80 (size 64):
          comm "cpuhp/5", pid 39, jiffies 4294719962 (age 35.251s)
          hex dump (first 32 bytes):
            c0 22 77 7d 00 88 ff ff 02 00 00 00 01 00 00 00  ."w}............
            40 2a 77 7d 00 88 ff ff 00 00 00 00 00 00 00 00  @*w}............
          backtrace:
            [<ffffffff8176525a>] kmemleak_alloc+0x4a/0xa0
            [<ffffffff8121efe1>] __kmalloc_node+0xf1/0x280
            [<ffffffff810d94a8>] build_sched_domains+0x1e8/0xf20
            [<ffffffff810da674>] partition_sched_domains+0x304/0x360
            [<ffffffff81139557>] cpuset_update_active_cpus+0x17/0x40
            [<ffffffff810bdb2e>] sched_cpu_activate+0xae/0xc0
            [<ffffffff810900e0>] cpuhp_invoke_callback+0x90/0x400
            [<ffffffff81090597>] cpuhp_up_callbacks+0x37/0xb0
            [<ffffffff81090887>] cpuhp_thread_fun+0xd7/0xf0
            [<ffffffff810b37e0>] smpboot_thread_fn+0x110/0x160
            [<ffffffff810af5d9>] kthread+0x109/0x140
            [<ffffffff81770e45>] ret_from_fork+0x25/0x30
            [<ffffffffffffffff>] 0xffffffffffffffff
      
        unreferenced object 0xffff88007d772a40 (size 64):
          comm "cpuhp/5", pid 39, jiffies 4294719962 (age 35.251s)
          hex dump (first 32 bytes):
            03 00 00 00 00 00 00 00 00 04 00 00 00 00 00 00  ................
            00 04 00 00 00 00 00 00 4f 3c fc ff 00 00 00 00  ........O<......
          backtrace:
            [<ffffffff8176525a>] kmemleak_alloc+0x4a/0xa0
            [<ffffffff8121efe1>] __kmalloc_node+0xf1/0x280
            [<ffffffff810da16d>] build_sched_domains+0xead/0xf20
            [<ffffffff810da674>] partition_sched_domains+0x304/0x360
            [<ffffffff81139557>] cpuset_update_active_cpus+0x17/0x40
            [<ffffffff810bdb2e>] sched_cpu_activate+0xae/0xc0
            [<ffffffff810900e0>] cpuhp_invoke_callback+0x90/0x400
            [<ffffffff81090597>] cpuhp_up_callbacks+0x37/0xb0
            [<ffffffff81090887>] cpuhp_thread_fun+0xd7/0xf0
            [<ffffffff810b37e0>] smpboot_thread_fn+0x110/0x160
            [<ffffffff810af5d9>] kthread+0x109/0x140
            [<ffffffff81770e45>] ret_from_fork+0x25/0x30
            [<ffffffffffffffff>] 0xffffffffffffffff
      Reported-by: Chunyu Hu <chuhu@redhat.com>
      Signed-off-by: Shu Wang <shuwang@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Chunyu Hu <chuhu@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: liwang@redhat.com
      Link: http://lkml.kernel.org/r/1502351536-9108-1-git-send-email-shuwang@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      213c5a45
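The leak pattern described in the sched/topology fix above can be sketched in userspace C. This is a minimal illustration, not the kernel code: `struct group`, `struct domain`, `make_ring()`, `put_all_groups()`, and the `live_groups` counter are hypothetical stand-ins. It assumes the groups form a circular list and that the domain holds one reference on each of them, so teardown has to visit the whole ring rather than only the first entry.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for a scheduler group and domain. */
struct group {
    struct group *next; /* circular singly linked list */
    int ref;            /* references held on this group */
};

struct domain {
    struct group *groups; /* first group of the ring */
};

static int live_groups; /* allocation counter, for demonstration only */

/* Build a ring of n groups, each with one reference held by the domain. */
static struct group *make_ring(int n)
{
    struct group *first = NULL, *prev = NULL;

    assert(n > 0);
    for (int i = 0; i < n; i++) {
        struct group *g = calloc(1, sizeof(*g));

        assert(g != NULL);
        g->ref = 1;
        live_groups++;
        if (prev)
            prev->next = g;
        else
            first = g;
        prev = g;
    }
    prev->next = first; /* close the ring */
    return first;
}

/* Drop one reference and free the group when it was the last one. */
static void put_group(struct group *g)
{
    if (--g->ref == 0) {
        free(g);
        live_groups--;
    }
}

/* Drop the domain's reference on every group, not just the first. */
static void put_all_groups(struct domain *sd)
{
    struct group *head = sd->groups;
    struct group *g;

    if (!head)
        return;
    g = head->next;
    head->next = NULL; /* break the ring so the walk ends at head */
    while (g) {
        struct group *next = g->next; /* read before g may be freed */

        put_group(g);
        g = next;
    }
    sd->groups = NULL;
}
```

Dropping only `sd->groups` itself would leave the other entries of the ring allocated, which is the kind of leak the kmemleak report shows.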
  2. 24 Aug, 2017 3 commits
    • S
      tracing: Fix freeing of filter in create_filter() when set_str is false · 8b0db1a5
      Steven Rostedt (VMware) committed
      Performing the following task with kmemleak enabled:
      
       # cd /sys/kernel/tracing/events/irq/irq_handler_entry/
       # echo 'enable_event:kmem:kmalloc:3 if irq >' > trigger
       # echo 'enable_event:kmem:kmalloc:3 if irq > 31' > trigger
       # echo scan > /sys/kernel/debug/kmemleak
       # cat /sys/kernel/debug/kmemleak
      unreferenced object 0xffff8800b9290308 (size 32):
        comm "bash", pid 1114, jiffies 4294848451 (age 141.139s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff81cef5aa>] kmemleak_alloc+0x4a/0xa0
          [<ffffffff81357938>] kmem_cache_alloc_trace+0x158/0x290
          [<ffffffff81261c09>] create_filter_start.constprop.28+0x99/0x940
          [<ffffffff812639c9>] create_filter+0xa9/0x160
          [<ffffffff81263bdc>] create_event_filter+0xc/0x10
          [<ffffffff812655e5>] set_trigger_filter+0xe5/0x210
          [<ffffffff812660c4>] event_enable_trigger_func+0x324/0x490
          [<ffffffff812652e2>] event_trigger_write+0x1a2/0x260
          [<ffffffff8138cf87>] __vfs_write+0xd7/0x380
          [<ffffffff8138f421>] vfs_write+0x101/0x260
          [<ffffffff8139187b>] SyS_write+0xab/0x130
          [<ffffffff81cfd501>] entry_SYSCALL_64_fastpath+0x1f/0xbe
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      The function create_filter() is passed a 'filterp' pointer that gets
      allocated, and if "set_str" is true, it is up to the caller to free it, even
      on error. The problem is that the pointer is not freed by create_filter()
      when set_str is false. This is a bug, and it is not up to the caller to free
      the filter on error if it doesn't care about the string.
      
      Link: http://lkml.kernel.org/r/1502705898-27571-2-git-send-email-chuhu@redhat.com
      
      Cc: stable@vger.kernel.org
      Fixes: 38b78eb8 ("tracing: Factorize filter creation")
      Reported-by: Chunyu Hu <chuhu@redhat.com>
      Tested-by: Chunyu Hu <chuhu@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      8b0db1a5
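The ownership rule described in the create_filter() fix above can be sketched as follows. This is a simplified userspace model under stated assumptions: `struct filter`, `parse_expr()`, and `make_filter()` are hypothetical names, and the "parser" only rejects empty expressions. The point it shows is that when the caller does not ask for the error string (`set_str` false), the function frees the filter itself on error and hands back NULL.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical filter object; the real one carries much more state. */
struct filter {
    char *err_str;
};

/* Hypothetical parser: only rejects a missing or empty expression. */
static int parse_expr(const char *expr)
{
    return (expr && expr[0]) ? 0 : -EINVAL;
}

/*
 * On error with set_str == true the filter is handed back so the caller
 * can inspect the error string, and the caller must free it.
 * On error with set_str == false nothing is handed back, so the filter
 * is freed here instead of being leaked.
 */
static int make_filter(const char *expr, bool set_str, struct filter **filterp)
{
    struct filter *f;
    int err;

    assert(filterp != NULL);
    f = calloc(1, sizeof(*f));
    if (!f)
        return -ENOMEM;

    err = parse_expr(expr);
    if (err && !set_str) {
        free(f);
        f = NULL;
    }
    *filterp = f;
    return err;
}
```

A caller that passes `set_str` as false can then treat any error return as "nothing to free".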
    • C
      tracing: Fix kmemleak in tracing_map_array_free() · 475bb3c6
      Chunyu Hu committed
      kmemleak reported the leak below when I was clearing the hist
      trigger. With this patch, the kmemleak report is gone.
      
      unreferenced object 0xffff94322b63d760 (size 32):
        comm "bash", pid 1522, jiffies 4403687962 (age 2442.311s)
        hex dump (first 32 bytes):
          00 01 00 00 04 00 00 00 08 00 00 00 ff 00 00 00  ................
          10 00 00 00 00 00 00 00 80 a8 7a f2 31 94 ff ff  ..........z.1...
        backtrace:
          [<ffffffff9e96c27a>] kmemleak_alloc+0x4a/0xa0
          [<ffffffff9e424cba>] kmem_cache_alloc_trace+0xca/0x1d0
          [<ffffffff9e377736>] tracing_map_array_alloc+0x26/0x140
          [<ffffffff9e261be0>] kretprobe_trampoline+0x0/0x50
          [<ffffffff9e38b935>] create_hist_data+0x535/0x750
          [<ffffffff9e38bd47>] event_hist_trigger_func+0x1f7/0x420
          [<ffffffff9e38893d>] event_trigger_write+0xfd/0x1a0
          [<ffffffff9e44dfc7>] __vfs_write+0x37/0x170
          [<ffffffff9e44f552>] vfs_write+0xb2/0x1b0
          [<ffffffff9e450b85>] SyS_write+0x55/0xc0
          [<ffffffff9e203857>] do_syscall_64+0x67/0x150
          [<ffffffff9e977ce7>] return_from_SYSCALL_64+0x0/0x6a
          [<ffffffffffffffff>] 0xffffffffffffffff
      unreferenced object 0xffff9431f27aa880 (size 128):
        comm "bash", pid 1522, jiffies 4403687962 (age 2442.311s)
        hex dump (first 32 bytes):
          00 00 8c 2a 32 94 ff ff 00 f0 8b 2a 32 94 ff ff  ...*2......*2...
          00 e0 8b 2a 32 94 ff ff 00 d0 8b 2a 32 94 ff ff  ...*2......*2...
        backtrace:
          [<ffffffff9e96c27a>] kmemleak_alloc+0x4a/0xa0
          [<ffffffff9e425348>] __kmalloc+0xe8/0x220
          [<ffffffff9e3777c1>] tracing_map_array_alloc+0xb1/0x140
          [<ffffffff9e261be0>] kretprobe_trampoline+0x0/0x50
          [<ffffffff9e38b935>] create_hist_data+0x535/0x750
          [<ffffffff9e38bd47>] event_hist_trigger_func+0x1f7/0x420
          [<ffffffff9e38893d>] event_trigger_write+0xfd/0x1a0
          [<ffffffff9e44dfc7>] __vfs_write+0x37/0x170
          [<ffffffff9e44f552>] vfs_write+0xb2/0x1b0
          [<ffffffff9e450b85>] SyS_write+0x55/0xc0
          [<ffffffff9e203857>] do_syscall_64+0x67/0x150
          [<ffffffff9e977ce7>] return_from_SYSCALL_64+0x0/0x6a
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      Link: http://lkml.kernel.org/r/1502705898-27571-1-git-send-email-chuhu@redhat.com
      
      Cc: stable@vger.kernel.org
      Fixes: 08d43a5f ("tracing: Add lock-free tracing_map")
      Signed-off-by: Chunyu Hu <chuhu@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      475bb3c6
    • S
      ftrace: Check for null ret_stack on profile function graph entry function · a8f0f9e4
      Steven Rostedt (VMware) committed
      There's a small race between function graph shutting down and the calling
      of the registered function graph entry callback. The callback must not reference
      the task's ret_stack without first checking that it is not NULL. Note, when
      a ret_stack is allocated for a task, it stays allocated until the task exits.
      The problem here is that function_graph is shut down, and a new task is
      created, which doesn't have its ret_stack allocated. But since some of the
      functions are still being traced, the callbacks can still be called.
      
      The normal function_graph code handles this, but starting with commit
      8861dd30 ("ftrace: Access ret_stack->subtime only in the function
      profiler") the profiler code references the ret_stack on function entry, but
      doesn't check if it is NULL first.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=196611
      
      Cc: stable@vger.kernel.org
      Fixes: 8861dd30 ("ftrace: Access ret_stack->subtime only in the function profiler")
      Reported-by: lilydjwg@gmail.com
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      a8f0f9e4
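The NULL check that the ftrace fix above calls for can be illustrated with a small userspace model. The types and names here (`struct task`, `struct ret_entry`, `RET_DEPTH`, `profile_graph_entry()`) are assumptions made for the sketch and do not mirror the kernel's definitions. It only shows the ordering: test `ret_stack` before indexing into it, and skip the bookkeeping for tasks that never had a stack allocated.

```c
#include <assert.h>
#include <stddef.h>

#define RET_DEPTH 4 /* hypothetical stack depth for the sketch */

struct ret_entry {
    unsigned long long subtime;
};

struct task {
    struct ret_entry *ret_stack; /* NULL for tasks created after shutdown */
    int curr_ret_stack;          /* index of the current entry, -1 if none */
};

/*
 * Entry callback of the profiler: returns 1 when the entry was recorded,
 * 0 when the task has no ret_stack and the entry is skipped.
 */
static int profile_graph_entry(struct task *t)
{
    int index;

    assert(t != NULL);

    /* A task created after tracing shut down has no ret_stack. */
    if (!t->ret_stack)
        return 0;

    index = t->curr_ret_stack;
    if (index >= 0 && index < RET_DEPTH)
        t->ret_stack[index].subtime = 0;

    return 1;
}
```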
  3. 22 Aug, 2017 1 commit
    • O
      pids: make task_tgid_nr_ns() safe · dd1c1f2f
      Oleg Nesterov committed
      This was reported many times, and this was even mentioned in commit
      52ee2dfd ("pids: refactor vnr/nr_ns helpers to make them safe") but
      somehow nobody bothered to fix the obvious problem: task_tgid_nr_ns() is
      not safe because task->group_leader points to nowhere after the exiting
      task passes exit_notify(), and rcu_read_lock() cannot help.
      
      We really need to change __unhash_process() to nullify group_leader,
      parent, and real_parent, but this needs some cleanups.  Until then we
      can turn task_tgid_nr_ns() into another user of __task_pid_nr_ns() and
      fix the problem.
      Reported-by: Troy Kensinger <tkensinger@google.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dd1c1f2f
  4. 20 Aug, 2017 1 commit
  5. 19 Aug, 2017 2 commits
    • J
      signal: don't remove SIGNAL_UNKILLABLE for traced tasks. · eb61b591
      Jamie Iles committed
      When forcing a signal, SIGNAL_UNKILLABLE is removed to prevent recursive
      faults, but this is undesirable when tracing.  For example, debugging an
      init process (whether global or namespace), hitting a breakpoint and
      SIGTRAP will force SIGTRAP and then remove SIGNAL_UNKILLABLE.
      Everything continues fine, but then once debugging has finished, the
      init process is left killable, which is unlikely to be what the user expects,
      resulting in either an accidentally killed init or an init that stops
      reaping zombies.
      
      Link: http://lkml.kernel.org/r/20170815112806.10728-1-jamie.iles@oracle.com
      Signed-off-by: Jamie Iles <jamie.iles@oracle.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eb61b591
    • L
      kmod: fix wait on recursive loop · 2ba293c9
      Luis R. Rodriguez committed
      Recursive loops with module loading were previously handled in kmod by
      restricting the number of modprobe calls to 50 and if that limit was
      breached request_module() would return an error and a user would see the
      following on their kernel dmesg:
      
        request_module: runaway loop modprobe binfmt-464c
        Starting init:/sbin/init exists but couldn't execute it (error -8)
      
      This issue could happen for instance when a 64-bit kernel boots a 32-bit
      userspace on some architectures and has no 32-bit binary format
      handlers.  This is visible, for instance, when a CONFIG_MODULES enabled
      64-bit MIPS kernel boots into an o32 root filesystem and the binfmt
      handler for o32 binaries is not built-in.
      
      After commit 6d7964a7 ("kmod: throttle kmod thread limit") we now
      don't have any visible signs of an error and the kernel just waits for
      the loop to end somehow.
      
      Although this *particular* recursive loop could also be addressed by
      doing a sanity check on search_binary_handler() and disallowing a
      modular binfmt to be required for modprobe, a generic solution for any
      recursive kernel kmod issues is still needed.
      
      This should catch these loops.  We can investigate each loop and address
      each one separately as they come in; this, however, puts a stopgap in
      place for them as before.
      
      Link: http://lkml.kernel.org/r/20170809234635.13443-3-mcgrof@kernel.org
      Fixes: 6d7964a7 ("kmod: throttle kmod thread limit")
      Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
      Reported-by: Matt Redfearn <matt.redfearn@imgtec.com>
      Tested-by: Matt Redfearn <matt.redfearn@imgetc.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Colin Ian King <colin.king@canonical.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Daniel Mentz <danielmentz@google.com>
      Cc: David Binderman <dcb314@hotmail.com>
      Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jessica Yu <jeyu@redhat.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Michal Marek <mmarek@suse.com>
      Cc: Miroslav Benes <mbenes@suse.cz>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2ba293c9
  6. 18 Aug, 2017 2 commits
    • T
      kernel/watchdog: Prevent false positives with turbo modes · 7edaeb68
      Thomas Gleixner committed
      The hardlockup detector on x86 uses a performance counter based on unhalted
      CPU cycles and a periodic hrtimer. The hrtimer period is about 2/5 of the
      performance counter period, so the hrtimer should fire 2-3 times before the
      performance counter NMI fires. The NMI code checks whether the hrtimer
      fired since the last invocation. If not, it assumes a hard lockup.
      
      The calculation of those periods is based on the nominal CPU
      frequency. Turbo modes increase the CPU clock frequency and therefore
      shorten the period of the perf/NMI watchdog. With extreme Turbo-modes (3x
      nominal frequency) the perf/NMI period is shorter than the hrtimer period
      which leads to false positives.
      
      A simple fix would be to shorten the hrtimer period, but that comes with
      the side effect of more frequent hrtimer and softlockup thread wakeups,
      which is not desired.
      
      Implement a low pass filter, which checks the perf/NMI period against
      kernel time. If the perf/NMI fires before 4/5 of the watchdog period has
      elapsed then the event is ignored and postponed to the next perf/NMI.
      
      That solves the problem and avoids the overhead of shorter hrtimer periods
      and more frequent softlockup thread wakeups.
      
      Fixes: 58687acb ("lockup_detector: Combine nmi_watchdog and softlockup detector")
      Reported-and-tested-by: Kan Liang <Kan.liang@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: dzickus@redhat.com
      Cc: prarit@redhat.com
      Cc: ak@linux.intel.com
      Cc: babu.moger@oracle.com
      Cc: peterz@infradead.org
      Cc: eranian@google.com
      Cc: acme@redhat.com
      Cc: stable@vger.kernel.org
      Cc: atomlin@redhat.com
      Cc: akpm@linux-foundation.org
      Cc: torvalds@linux-foundation.org
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1708150931310.1886@nanos
      7edaeb68
    • M
      genirq: Restore trigger settings in irq_modify_status() · e8f24189
      Marc Zyngier committed
      irq_modify_status() starts by clearing the trigger settings from
      irq_data before applying the new settings, but doesn't restore them,
      leaving them at IRQ_TYPE_NONE.
      
      That's pretty confusing to the potential request_irq() that could
      follow. Instead, snapshot the settings before clearing them, and restore
      them if the irq_modify_status() invocation was not changing the trigger.
      
      Fixes: 1e2a7d78 ("irqdomain: Don't set type when mapping an IRQ")
      Reported-and-tested-by: jeffy <jeffy.chen@rock-chips.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jon Hunter <jonathanh@nvidia.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20170818095345.12378-1-marc.zyngier@arm.com
      e8f24189
  7. 17 Aug, 2017 1 commit
  8. 16 Aug, 2017 3 commits
    • D
      bpf: fix bpf_trace_printk on 32 bit archs · 88a5c690
      Daniel Borkmann committed
      James reported that on MIPS32 bpf_trace_printk() is currently
      broken while MIPS64 works fine:
      
        bpf_trace_printk() uses conditional operators to attempt to
        pass different types to __trace_printk() depending on the
        format operators. This doesn't work as intended on 32-bit
        architectures where u32 and long are passed differently to
        u64, since the result of C conditional operators follows the
        "usual arithmetic conversions" rules, such that the values
        passed to __trace_printk() will always be u64 [causing issues
        later in the va_list handling for vscnprintf()].
      
        For example the samples/bpf/tracex5 test printed lines like
        below on MIPS32, where the fd and buf have come from the u64
        fd argument, and the size from the buf argument:
      
          [...] 1180.941542: 0x00000001: write(fd=1, buf=  (null), size=6258688)
      
        Instead of this:
      
          [...] 1625.616026: 0x00000001: write(fd=1, buf=009e4000, size=512)
      
      One way to get it working is to expand various combinations
      of argument types into 8 different combinations for 32 bit
      and 64 bit kernels. James tested the fix on MIPS32 and MIPS64
      and confirmed that it resolves the issue.
      
      Fixes: 9c959c86 ("tracing: Allow BPF programs to call bpf_trace_printk()")
      Reported-by: James Hogan <james.hogan@imgtec.com>
      Tested-by: James Hogan <james.hogan@imgtec.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      88a5c690
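The "usual arithmetic conversions" effect quoted in the bpf_trace_printk() report above can be checked with a few lines of standard C. The helper names `cond_arg_size()` and `split_arg_size()` are made up for this sketch. The first shows that a conditional expression mixing a 32-bit and a 64-bit operand always has the 64-bit type, whichever branch is selected; the second shows that separate code paths keep the two widths distinct, which is the general idea behind expanding the argument combinations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The type of a ?: expression does not depend on which branch is taken. */
static_assert(sizeof(1 ? (uint32_t)0 : (uint64_t)0) == sizeof(uint64_t),
              "mixed conditional is promoted to the wider type");

/* Size of the value a variadic callee would receive via one ?: expression. */
static size_t cond_arg_size(int is64, uint64_t v)
{
    /* sizeof does not evaluate its operand; only the type matters. */
    return sizeof(is64 ? v : (uint32_t)v);
}

/* Size of the value when each width gets its own code path. */
static size_t split_arg_size(int is64, uint64_t v)
{
    if (is64)
        return sizeof(v);
    return sizeof((uint32_t)v);
}
```

On a 32-bit ABI where 32-bit and 64-bit variadic arguments are passed differently, the first form therefore cannot pass a 32-bit argument at all.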
    • J
      audit: Receive unmount event · b5fed474
      Jan Kara committed
      Although audit_watch_handle_event() can handle the FS_UNMOUNT event, it is
      not part of the AUDIT_FS_WATCH mask and thus such an event never gets to
      audit_watch_handle_event(). Thus fsnotify marks are deleted by the fsnotify
      subsystem on unmount without audit being notified, which leads
      to a strange state of existing audit rules with dead fsnotify marks.
      
      Add FS_UNMOUNT to the mask of events to be received so that audit can
      clean up its state accordingly.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
      b5fed474
    • J
      audit: Fix use after free in audit_remove_watch_rule() · d76036ab
      Jan Kara committed
      audit_remove_watch_rule() drops watch's reference to parent but then
      continues to work with it. That is not safe as parent can get freed once
      we drop our reference. The following is a trivial reproducer:
      
      mount -o loop image /mnt
      touch /mnt/file
      auditctl -w /mnt/file -p wax
      umount /mnt
      auditctl -D
      <crash in fsnotify_destroy_mark()>
      
      Grab our own reference in audit_remove_watch_rule() earlier to make sure
      mark does not get freed under us.
      
      CC: stable@vger.kernel.org
      Reported-by: Tony Jones <tonyj@suse.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Tested-by: Tony Jones <tonyj@suse.de>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
      d76036ab
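The "grab our own reference earlier" pattern from the audit use-after-free fix above can be sketched with a plain reference count. Everything here is hypothetical and simplified: `struct parent`, `struct watch`, `parent_get()`, `parent_put()`, `remove_watch()`, and the `parents_freed` counter are illustration-only names. The function takes its own reference before dropping the watch's reference, so the parent stays valid for the work that follows.

```c
#include <assert.h>
#include <stdlib.h>

struct parent {
    int refs;    /* reference count */
    int watches; /* number of watches attached to this parent */
};

struct watch {
    struct parent *parent;
};

static int parents_freed; /* counts frees, for demonstration only */

static void parent_get(struct parent *p)
{
    p->refs++;
}

static void parent_put(struct parent *p)
{
    assert(p->refs > 0);
    if (--p->refs == 0) {
        parents_freed++;
        free(p);
    }
}

static void remove_watch(struct watch *w)
{
    struct parent *p = w->parent;

    parent_get(p);    /* take our own reference first */

    w->parent = NULL;
    parent_put(p);    /* drop the watch's reference; p is still valid */

    p->watches--;     /* safe: we still hold a reference */

    parent_put(p);    /* drop ours; p may be freed here */
}
```

If the first `parent_get()` were missing, the `p->watches--` line would touch memory that the preceding `parent_put()` may already have freed.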
  9. 11 Aug, 2017 2 commits
    • N
      mm: migrate: prevent racy access to tlb_flush_pending · 16af97dc
      Nadav Amit committed
      Patch series "fixes of TLB batching races", v6.
      
      It turns out that the Linux TLB batching mechanism suffers from various
      races.  Races caused by batching during reclamation were
      recently handled by Mel and this patch-set deals with others.  The more
      fundamental issue is that concurrent updates of the page-tables allow
      for TLB flushes to be batched on one core, while another core changes
      the page-tables.  This other core may assume a PTE change does not
      require a flush based on the updated PTE value, while it is unaware that
      TLB flushes are still pending.
      
      This behavior affects KSM (which may result in memory corruption) and
      MADV_FREE and MADV_DONTNEED (which may result in incorrect behavior).  A
      proof-of-concept can easily produce the wrong behavior of MADV_DONTNEED.
      Memory corruption in KSM is harder to produce in practice, but was
      observed by hacking the kernel and adding a delay before flushing and
      replacing the KSM page.
      
      Finally, there is also one memory barrier missing, which may affect
      architectures with a weak memory model.
      
      This patch (of 7):
      
      Setting and clearing mm->tlb_flush_pending can be performed by multiple
      threads, since mmap_sem may only be acquired for read in
      task_numa_work().  If this happens, tlb_flush_pending might be cleared
      while one of the threads still changes PTEs and batches TLB flushes.
      
      This can lead to the same race between migration and
      change_protection_range() that led to the introduction of
      tlb_flush_pending.  The result of this race was data corruption, which
      means that this patch also addresses a theoretically possible data
      corruption.
      
      An actual data corruption was not observed, yet the race was
      confirmed by adding an assertion to check tlb_flush_pending is not set by
      two threads, adding artificial latency in change_protection_range() and
      using sysctl to reduce kernel.numa_balancing_scan_delay_ms.
      
      Link: http://lkml.kernel.org/r/20170802000818.4760-2-namit@vmware.com
      Fixes: 20841405 ("mm: fix TLB flush race between migration, and change_protection_range")
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      16af97dc
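The tlb_flush_pending message above describes a flag that several threads may set and clear, so that one thread's clear can hide another thread's pending flush. One common way to make such state safe, shown here purely as an assumption since the message does not spell out the change, is to count the pending flushers instead of storing a boolean. The names `struct mm`, `inc_pending()`, `dec_pending()`, and `flush_pending()` are hypothetical, and C11 atomics stand in for kernel primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-address-space state. */
struct mm {
    atomic_int tlb_flush_pending; /* number of threads batching flushes */
};

/* Called by a thread before it starts changing PTEs and batching flushes. */
static void inc_pending(struct mm *mm)
{
    atomic_fetch_add(&mm->tlb_flush_pending, 1);
}

/* Called by the same thread after it has flushed the TLB. */
static void dec_pending(struct mm *mm)
{
    int old = atomic_fetch_sub(&mm->tlb_flush_pending, 1);

    assert(old > 0); /* every decrement must pair with an increment */
    (void)old;
}

/* True while at least one thread still has flushes outstanding. */
static bool flush_pending(struct mm *mm)
{
    return atomic_load(&mm->tlb_flush_pending) > 0;
}
```

With a counter, the state only reads as "not pending" once every thread that announced a batch has finished its flush.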
    • J
      mm: fix global NR_SLAB_.*CLAIMABLE counter reads · d507e2eb
      Johannes Weiner committed
      As Tetsuo points out:
       "Commit 385386cf ("mm: vmstat: move slab statistics from zone to
        node counters") broke "Slab:" field of /proc/meminfo . It shows nearly
        0kB"
      
      In addition to /proc/meminfo, this problem also affects the slab
      counters in OOM/allocation failure info dumps, can cause early -ENOMEM from
      overcommit protection, and can miscalculate image size requirements during
      suspend-to-disk.
      
      This is because the patch in question switched the slab counters from
      the zone level to the node level, but forgot to update the global
      accessor functions to read the aggregate node data instead of the
      aggregate zone data.
      
      Use global_node_page_state() to access the global slab counters.
      
      Fixes: 385386cf ("mm: vmstat: move slab statistics from zone to node counters")
      Link: http://lkml.kernel.org/r/20170801134256.5400-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Stefan Agner <stefan@agner.ch>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d507e2eb
  10. 10 Aug, 2017 23 commits