1. 14 Dec, 2016 1 commit
  2. 01 Dec, 2016 1 commit
  3. 30 Nov, 2016 1 commit
    • Re-enable CONFIG_MODVERSIONS in a slightly weaker form · faaae2a5
      Linus Torvalds committed
      This enables CONFIG_MODVERSIONS again, but allows for missing symbol CRC
      information in order to work around the issue that newer binutils
      versions seem to occasionally drop the CRC on the floor.  binutils 2.26
      seems to work fine, while binutils 2.27 seems to break MODVERSIONS of
      symbols that have been defined in assembler files.
      
      [ We've had random missing CRC's before - it may be an old problem that
        just is now reliably triggered with the weak asm symbols and a new
        version of binutils ]
      
      Some day I really do want to remove MODVERSIONS entirely.  Sadly, today
      does not appear to be that day: Debian people apparently do want the
      option to enable MODVERSIONS to make it easier to have external modules
      across kernel versions, and this seems to be a fairly minimal fix for
      the annoying problem.
      
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Acked-by: Michal Marek <mmarek@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 22 Nov, 2016 2 commits
    • sched/autogroup: Do not use autogroup->tg in zombie threads · 8e5bfa8c
      Oleg Nesterov committed
      Exactly because for_each_thread() in autogroup_move_group() can't see it
      and update its ->sched_task_group before _put() and possibly free().
      
      So the exiting task needs another sched_move_task() before exit_notify()
      and we need to re-introduce the PF_EXITING (or similar) check removed by
      the previous change for another reason.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: hartsjc@redhat.com
      Cc: vbendel@redhat.com
      Cc: vlovejoy@redhat.com
      Link: http://lkml.kernel.org/r/20161114184612.GA15968@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/autogroup: Fix autogroup_move_group() to never skip sched_move_task() · 18f649ef
      Oleg Nesterov committed
      The PF_EXITING check in task_wants_autogroup() is no longer needed. Remove
      it, but see the next patch.
      
      However the comment is correct in that autogroup_move_group() must always
      change task_group() for every thread so the sysctl_ check is very wrong;
      we can race with cgroups and even sys_setsid() is not safe because a task
      running with task_group() == ag->tg must participate in refcounting:
      
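      	#include <assert.h>
      	#include <fcntl.h>
      	#include <signal.h>
      	#include <unistd.h>
      	#include <sys/wait.h>
      
      	int main(void)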
      	{
      		int sctl = open("/proc/sys/kernel/sched_autogroup_enabled", O_WRONLY);
      
      		assert(sctl > 0);
      		if (fork()) {
      			wait(NULL); // destroy the child's ag/tg
      			pause();
      		}
      
      		assert(pwrite(sctl, "1\n", 2, 0) == 2);
      		assert(setsid() > 0);
      		if (fork())
      			pause();
      
      		kill(getppid(), SIGKILL);
      		sleep(1);
      
      		// The child has gone, the grandchild runs with kref == 1
      		assert(pwrite(sctl, "0\n", 2, 0) == 2);
      		assert(setsid() > 0);
      
      		// runs with the freed ag/tg
      		for (;;)
      			sleep(1);
      
      		return 0;
      	}
      
      crashes the kernel. It doesn't really need sleep(1), it doesn't matter if
      autogroup_move_group() actually frees the task_group or this happens later.
      Reported-by: Vern Lovejoy <vlovejoy@redhat.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: hartsjc@redhat.com
      Cc: vbendel@redhat.com
      Link: http://lkml.kernel.org/r/20161114184609.GA15965@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  5. 21 Nov, 2016 1 commit
  6. 19 Nov, 2016 1 commit
  7. 17 Nov, 2016 1 commit
    • bpf: fix range arithmetic for bpf map access · f23cc643
      Josef Bacik committed
      I made some invalid assumptions with BPF_AND and BPF_MOD that could result in
      invalid accesses to bpf map entries.  Fix this up by doing a few things:
      
      1) Kill BPF_MOD support.  This doesn't actually get used by the compiler in real
      life and just adds extra complexity.
      
      2) Fix the logic for BPF_AND: don't allow AND of negative numbers, and set the
      minimum value to 0 for positive ANDs.
      
      3) Don't do operations on the ranges if they are set to the limits, as they are
      by definition undefined, and allowing arithmetic operations on those values
      could make them appear valid when they really aren't.
      
      This fixes the testcase provided by Jann as well as a few other theoretical
      problems.
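Points 2 and 3 can be illustrated with a small userspace sketch. This is not the verifier's code: `struct val_range`, `range_and_const()` and the `RANGE_MIN`/`RANGE_MAX` sentinels are hypothetical names chosen for illustration. The idea is that ANDing with a non-negative constant `c` yields a value in [0, c], while a negative constant or a range already sitting at the limits is treated as unknown and left unbounded.

```c
#include <stdbool.h>
#include <stdint.h>

#define RANGE_MIN INT64_MIN	/* sentinel: lower bound unknown (assumed name) */
#define RANGE_MAX INT64_MAX	/* sentinel: upper bound unknown (assumed name) */

struct val_range {
	int64_t min;
	int64_t max;
};

/* A range pinned to either sentinel carries no usable information. */
static bool range_is_unbounded(struct val_range r)
{
	return r.min == RANGE_MIN || r.max == RANGE_MAX;
}

/* Conservative bounds for (x & c), for any x inside r. */
static struct val_range range_and_const(struct val_range r, int64_t c)
{
	struct val_range out = { RANGE_MIN, RANGE_MAX };

	/* Negative mask or unknown input: do no arithmetic, stay unbounded. */
	if (c < 0 || range_is_unbounded(r))
		return out;

	/* A non-negative mask clears the sign bit and caps the value at c. */
	out.min = 0;
	out.max = c;
	return out;
}
```

With these definitions, `range_and_const((struct val_range){0, 100}, 15)` gives the range [0, 15], and any call with `c < 0` returns the two sentinels unchanged.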
      Reported-by: Jann Horn <jannh@google.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 15 Nov, 2016 4 commits
    • perf/core: Do not set cpuctx->cgrp for unscheduled cgroups · 864c2357
      David Carrillo-Cisneros committed
      Commit:
      
        db4a8356 ("perf/core: Set cgroup in CPU contexts for new cgroup events")
      
      failed to verify that event->cgrp is actually the scheduled cgroup
      in a CPU before setting cpuctx->cgrp. This patch fixes that.
      
      Now that there are different paths for scheduled and unscheduled
      cgroups, add a warning to catch when cpuctx->cgrp is still set after
      the last cgroup event has been unscheduled.
      
      To verify the bug:
      
        # Create 2 cgroups.
        mkdir /dev/cgroups/devices/g1
        mkdir /dev/cgroups/devices/g2
      
        # launch a task, bind it to a cpu and move it to g1
        CPU=2
        while :; do : ; done &
        P=$!
      
        taskset -pc $CPU $P
        echo $P > /dev/cgroups/devices/g1/tasks
      
        # monitor g2 (it runs no tasks) and observe output
        perf stat -e cycles -I 1000 -C $CPU -G g2
      
        #           time             counts unit events
           1.000091408          7,579,527      cycles                    g2
           2.000350111      <not counted>      cycles                    g2
           3.000589181      <not counted>      cycles                    g2
           4.000771428      <not counted>      cycles                    g2
      
        # Note that the first line shows a task running in g2, despite
        # g2 having no tasks. This is because cpuctx->cgrp was wrongly
        # set when the context of the new event was installed.
        # After applying the fix we obtain the right output:
      
        perf stat -e cycles -I 1000 -C $CPU -G g2
        #           time             counts unit events
           1.000119615      <not counted>      cycles                    g2
           2.000389430      <not counted>      cycles                    g2
           3.000590962      <not counted>      cycles                    g2
      Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nilay Vaish <nilayvaish@gmail.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vegard Nossum <vegard.nossum@gmail.com>
      Link: http://lkml.kernel.org/r/1478026378-86083-1-git-send-email-davidcc@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • ftrace: Add more checks for FTRACE_FL_DISABLED in processing ip records · 546fece4
      Steven Rostedt (Red Hat) committed
      When a module is first loaded and its function ip records are added to the
      ftrace list of functions to modify, they are set to DISABLED, as their text
      is still in a read-only state. When the module is fully loaded and can be
      updated, the flag is cleared, and if there are any functions that should be
      tracing them, the records are updated at that moment.
      
      But there are several locations that do record accounting and should ignore
      records that are marked as disabled, or they can cause issues.
      
      Alexei already fixed one location, but others need to be addressed.
      
      Cc: stable@vger.kernel.org
      Fixes: b7ffffbb "ftrace: Add infrastructure for delayed enabling of module functions"
      Reported-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • ftrace: Ignore FTRACE_FL_DISABLED while walking dyn_ftrace records · 977c1f9c
      Alexei Starovoitov committed
      ftrace_shutdown() checks for sanity of ftrace records
      and if dyn_ftrace->flags is not zero, it will warn.
      It can happen that 'flags' are set to FTRACE_FL_DISABLED at this point,
      since some module was loaded, but before ftrace_module_enable()
      cleared the flags for this module.
      
      In other words the module.c is doing:
      ftrace_module_init(mod); // calls ftrace_update_code() that sets flags=FTRACE_FL_DISABLED
      ... // here ftrace_shutdown() is called that warns, since
      err = prepare_coming_module(mod); // didn't have a chance to clear FTRACE_FL_DISABLED
      
      Fix it by ignoring disabled records.
      It's similar to what __ftrace_hash_rec_update() is already doing.
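The idea of ignoring disabled records during a sanity walk can be sketched in plain C. This is only an illustration and not the ftrace code: `struct rec`, `REC_FL_DISABLED` (including its bit position) and `count_stale_records()` are hypothetical names. A record whose flags are non-zero would normally be reported, unless it is still marked disabled because its module has not finished loading.

```c
#include <stddef.h>

#define REC_FL_DISABLED (1UL << 29)	/* assumed bit, for illustration only */

struct rec {
	unsigned long ip;	/* address of the traced function */
	unsigned long flags;	/* state bits; 0 means "idle" */
};

/* Count records with unexpected non-zero flags, skipping disabled ones. */
static size_t count_stale_records(const struct rec *recs, size_t n)
{
	size_t stale = 0;

	for (size_t i = 0; i < n; i++) {
		/* Module text not yet writable: not an error, ignore it. */
		if (recs[i].flags & REC_FL_DISABLED)
			continue;
		if (recs[i].flags)
			stale++;
	}
	return stale;
}
```

A caller that previously warned on any non-zero `flags` would, under this sketch, warn only when `count_stale_records()` returns a non-zero count.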
      
      Link: http://lkml.kernel.org/r/1478560460-3818619-1-git-send-email-ast@fb.com
      
      Cc: stable@vger.kernel.org
      Fixes: b7ffffbb "ftrace: Add infrastructure for delayed enabling of module functions"
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • Revert "printk: make reading the kernel log flush pending lines" · f5c9f9c7
      Linus Torvalds committed
      This reverts commit bfd8d3f2.
      
      It turns out that this flushes things much too aggressively, and causes
      lines to break up when the system logger races with new continuation
      lines being printed.
      
      There's a pending patch to make printk() flushing much more
      straightforward, but it's too invasive for 4.9, so in the meantime let's
      just not make the system message logging flush continuation lines.
      They'll be flushed by the final newline anyway.
      Suggested-by: Petr Mladek <pmladek@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 12 Nov, 2016 1 commit
    • Revert "console: don't prefer first registered if DT specifies stdout-path" · c6c7d83b
      Hans de Goede committed
      This reverts commit 05fd007e ("console: don't prefer first
      registered if DT specifies stdout-path").
      
      The reverted commit changes existing behavior on which many ARM boards
      rely.  Many ARM single-board computers, such as the Raspberry Pi, have
      both a video output and a serial console.  Depending on whether the user
      is using the device as a more regular computer or as a headless device,
      we need to have the console on either one or the other.
      
      Many users rely on the kernel behavior of the console being present on
      both outputs. Before the reverted commit, the console setup with no
      console= kernel arguments on an ARM board which sets stdout-path in DT
      would look like this:
      
        [root@localhost ~]# cat /proc/consoles
        ttyS0                -W- (EC p a)    4:64
        tty0                 -WU (E  p  )    4:1
      
      Whereas after the reverted commit, it looks like this:
      
        [root@localhost ~]# cat /proc/consoles
        ttyS0                -W- (EC p a)    4:64
      
      This commit reverts commit 05fd007e ("console: don't prefer first
      registered if DT specifies stdout-path") restoring the original
      behavior.
      
      Fixes: 05fd007e ("console: don't prefer first registered if DT specifies stdout-path")
      Link: http://lkml.kernel.org/r/20161104121135.4780-2-hdegoede@redhat.com
      Signed-off-by: Hans de Goede <hdegoede@redhat.com>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Frank Rowand <frowand.list@gmail.com>
      Cc: Thorsten Leemhuis <regressions@leemhuis.info>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 08 Nov, 2016 3 commits
    • genirq: Use irq type from irqdata instead of irqdesc · 7ee7e87d
      Thomas Gleixner committed
      The type flags in the irq descriptor are there for historical reasons and
      only updated via irq_modify_status() or irq_set_type(). Both functions also
      update the type flags in irqdata. __setup_irq() is the only remaining user
      of the type flags in the irq descriptor.
      
      If __setup_irq() is called with empty irq type flags, then the type flags
      are retrieved from irqdata. If an interrupt is shared, then the type flags
      are compared with the type flags stored in the irq descriptor. 
      
      On x86 the ioapic does not have an irq_set_type() callback because the type
      is defined in the BIOS tables and cannot be changed. The type is stored in
      irqdata at setup time without updating the type data in the irq
      descriptor. As a result the comparison described above fails.
      
      There is no point in updating the irq descriptor flags because the only
      relevant storage is irqdata. Use the type flags from irqdata for both
      retrieval and comparison in __setup_irq() instead.
      
      Aside from that, the printout in case of non-matching type flags has the old
      and new type flags arguments flipped. Fix that as well.
      
      For correctness' sake the flags stored in the irq descriptor should be
      removed, but this is beyond the scope of this bugfix and will be done in a
      later patch.
      
      Fixes: 4b357dae ("genirq: Look-up trigger type if not specified by caller")
      Reported-and-tested-by: Mika Westerberg <mika.westerberg@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Jon Hunter <jonathanh@nvidia.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1611072020360.3501@nanos
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • bpf: fix map not being uncharged during map creation failure · 20b2b24f
      Daniel Borkmann committed
      In map_create(), we first find and create the map, then once that
      succeeded, we charge it to the user's RLIMIT_MEMLOCK, and then fetch
      a new anon fd through anon_inode_getfd(). The problem is that once the
      latter fails, e.g. due to the RLIMIT_NOFILE limit, we only destruct
      the map via map->ops->map_free(), without uncharging the previously
      locked memory first. That means that the user_struct allocation is
      leaked and the accounted RLIMIT_MEMLOCK memory is not released.
      Make the label names in the fix consistent with bpf_prog_load().
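The error-unwinding pattern described above can be sketched in userspace C. This is a simplified model under stated assumptions, not the kernel's map_create(): `struct fake_map`, `locked_pages`, `charge_memlock()`, `uncharge_memlock()` and `create_map()` are hypothetical, the `fd_ok` parameter merely simulates whether the fd allocation succeeds, and the label names are illustrative. The point is that a failure after the charge must jump to a label that uncharges before freeing.

```c
#include <stdlib.h>

struct fake_map {
	size_t pages;
};

static size_t locked_pages;	/* stand-in for the per-user locked-memory counter */

static int charge_memlock(const struct fake_map *m, size_t limit)
{
	if (locked_pages + m->pages > limit)
		return -1;
	locked_pages += m->pages;
	return 0;
}

static void uncharge_memlock(const struct fake_map *m)
{
	locked_pages -= m->pages;
}

/* Returns a pretend fd (3) on success, a negative value on failure. */
static int create_map(size_t pages, size_t limit, int fd_ok, struct fake_map **out)
{
	struct fake_map *m = malloc(sizeof(*m));
	int err;

	if (!m)
		return -1;
	m->pages = pages;

	err = charge_memlock(m, limit);
	if (err)
		goto free_map_nouncharge;	/* nothing was charged yet */

	err = fd_ok ? 3 : -1;			/* simulated fd allocation */
	if (err < 0)
		goto free_map;			/* undo the charge first */

	*out = m;
	return err;

free_map:
	uncharge_memlock(m);
free_map_nouncharge:
	free(m);
	return err;
}
```

Falling through from the first label into the second keeps the cleanup in reverse order of acquisition, so each failure point only has to pick the label matching how far setup got.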
      
      Fixes: aaac3ba9 ("bpf: charge user for creation of BPF maps and programs")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: fix htab map destruction when extra reserve is in use · 483bed2b
      Daniel Borkmann committed
      Commit a6ed3ea6 ("bpf: restore behavior of bpf_map_update_elem")
      added an extra per-cpu reserve to the hash table map to restore old
      behaviour from pre-prealloc times. When non-prealloc is in use for a
      map, the problem is that once a hash table extra element has been
      linked into the hash-table, and the hash table is destroyed due to
      refcount dropping to zero, then htab_map_free() -> delete_all_elements()
      will walk the whole hash table and drop all elements via htab_elem_free().
      The problem is that the element from the extra reserve is first fed
      to the wrong backend allocator and eventually freed twice.
      
      Fixes: a6ed3ea6 ("bpf: restore behavior of bpf_map_update_elem")
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 04 Nov, 2016 1 commit
  12. 03 Nov, 2016 2 commits
  13. 02 Nov, 2016 1 commit
  14. 01 Nov, 2016 1 commit
  15. 28 Oct, 2016 4 commits
    • perf/powerpc: Don't call perf_event_disable() from atomic context · 5aab90ce
      Jiri Olsa committed
      The trinity syscall fuzzer triggered the following WARN() on powerpc:
      
        WARNING: CPU: 9 PID: 2998 at arch/powerpc/kernel/hw_breakpoint.c:278
        ...
        NIP [c00000000093aedc] .hw_breakpoint_handler+0x28c/0x2b0
        LR [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0
        Call Trace:
        [c0000002f7933580] [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0 (unreliable)
        [c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0
        [c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0
        [c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0
        [c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100
        [c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48
      
      Followed by a lockdep warning:
      
        ===============================
        [ INFO: suspicious RCU usage. ]
        4.8.0-rc5+ #7 Tainted: G        W
        -------------------------------
        ./include/linux/rcupdate.h:556 Illegal context switch in RCU read-side critical section!
      
        other info that might help us debug this:
      
        rcu_scheduler_active = 1, debug_locks = 0
        2 locks held by ls/2998:
         #0:  (rcu_read_lock){......}, at: [<c0000000000f6a00>] .__atomic_notifier_call_chain+0x0/0x1c0
         #1:  (rcu_read_lock){......}, at: [<c00000000093ac50>] .hw_breakpoint_handler+0x0/0x2b0
      
        stack backtrace:
        CPU: 9 PID: 2998 Comm: ls Tainted: G        W       4.8.0-rc5+ #7
        Call Trace:
        [c0000002f7933150] [c00000000094b1f8] .dump_stack+0xe0/0x14c (unreliable)
        [c0000002f79331e0] [c00000000013c468] .lockdep_rcu_suspicious+0x138/0x180
        [c0000002f7933270] [c0000000001005d8] .___might_sleep+0x278/0x2e0
        [c0000002f7933300] [c000000000935584] .mutex_lock_nested+0x64/0x5a0
        [c0000002f7933410] [c00000000023084c] .perf_event_ctx_lock_nested+0x16c/0x380
        [c0000002f7933500] [c000000000230a80] .perf_event_disable+0x20/0x60
        [c0000002f7933580] [c00000000093aeec] .hw_breakpoint_handler+0x29c/0x2b0
        [c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0
        [c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0
        [c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0
        [c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100
        [c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48
      
      While it looks like the first WARN() is probably valid, the other one is
      triggered by disabling the event via perf_event_disable() from atomic context.
      
      The event is disabled here in case we were not able to emulate
      the instruction that hit the breakpoint. By disabling the event
      we unschedule the event and make sure it's not scheduled back.
      
      But we can't call perf_event_disable() from atomic context; instead
      we need to use the event's pending_disable irq_work method to disable it.
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Neuling <mikey@neuling.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20161026094824.GA21397@krava
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core: Protect PMU device removal with a 'pmu_bus_running' check, to fix... · 0933840a
      Jiri Olsa committed
      perf/core: Protect PMU device removal with a 'pmu_bus_running' check, to fix CONFIG_DEBUG_TEST_DRIVER_REMOVE=y kernel panic
      
      CAI Qian reported a crash in the PMU uncore device removal code,
      enabled by the CONFIG_DEBUG_TEST_DRIVER_REMOVE=y option:
      
        https://marc.info/?l=linux-kernel&m=147688837328451
      
      The reason for the crash is that perf_pmu_unregister() tries to remove
      a PMU device which is not added at this point. We add PMU devices
      only after pmu_bus is registered, which happens in the
      perf_event_sysfs_init() call and sets the 'pmu_bus_running' flag.
      
      The fix is to get the 'pmu_bus_running' flag state at the point
      the PMU is taken out of the PMU list and remove the device
      later only if it's set.
      Reported-by: CAI Qian <caiqian@redhat.com>
      Tested-by: CAI Qian <caiqian@redhat.com>
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20161020111011.GA13361@krava
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • kcov: properly check if we are in an interrupt · b274c0bb
      Andrey Konovalov committed
      in_interrupt() returns a nonzero value when we are either in an
      interrupt or have bh disabled via local_bh_disable().  Since we are
      interested in only ignoring coverage from actual interrupts, do a proper
      check instead of just calling in_interrupt().
      
      As a result of this change, kcov will start to collect coverage from
      within local_bh_disable()/local_bh_enable() sections.
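The difference between the two checks can be sketched with plain bit masks. The layout below is an assumption modelled on the kernel's preempt-count encoding (8 softirq bits above 8 preempt bits, then hardirq and NMI bits), and the helper names `in_interrupt_like()` and `in_real_interrupt()` are hypothetical. The key detail is that disabling bottom halves adds twice the softirq offset, so testing only the lowest softirq bit distinguishes "serving a softirq" from "bh merely disabled".

```c
#include <stdbool.h>

/* Assumed layout, modelled on the kernel's preempt count encoding. */
#define SOFTIRQ_SHIFT		8
#define HARDIRQ_SHIFT		16
#define NMI_SHIFT		20

#define SOFTIRQ_MASK		(0xffUL << SOFTIRQ_SHIFT)
#define HARDIRQ_MASK		(0xfUL << HARDIRQ_SHIFT)
#define NMI_MASK		(0x1UL << NMI_SHIFT)

#define SOFTIRQ_OFFSET		(1UL << SOFTIRQ_SHIFT)	/* serving a softirq */
#define SOFTIRQ_DISABLE_OFFSET	(2 * SOFTIRQ_OFFSET)	/* one bh-disable level */

/* Broad check: also true while bottom halves are only disabled. */
static bool in_interrupt_like(unsigned long preempt_count)
{
	return (preempt_count & (HARDIRQ_MASK | SOFTIRQ_MASK | NMI_MASK)) != 0;
}

/* Narrow check: true only in hardirq, NMI, or while serving a softirq. */
static bool in_real_interrupt(unsigned long preempt_count)
{
	return (preempt_count & (HARDIRQ_MASK | SOFTIRQ_OFFSET | NMI_MASK)) != 0;
}
```

For a count equal to `SOFTIRQ_DISABLE_OFFSET`, the broad check reports an interrupt while the narrow one does not, which is the case the commit wants coverage collected for.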
      
      Link: http://lkml.kernel.org/r/1476115803-20712-1-git-send-email-andreyknvl@google.com
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Acked-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Nicolai Stange <nicstange@gmail.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: James Morse <james.morse@arm.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove per-zone hashtable of bitlock waitqueues · 9dcb8b68
      Linus Torvalds committed
      The per-zone waitqueues exist because of a scalability issue with the
      page waitqueues on some NUMA machines, but it turns out that they hurt
      normal loads, and now with the vmalloced stacks they also end up
      breaking gfs2 that uses a bit_wait on a stack object:
      
           wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)
      
      where 'gh' can be a reference to the local variable 'mount_gh' on the
      stack of fill_super().
      
      The reason the per-zone hash table breaks for this case is that there is
      no "zone" for virtual allocations, and trying to look up the physical
      page to get at it will fail (with a BUG_ON()).
      
      It turns out that I actually complained to the mm people about the
      per-zone hash table for another reason just a month ago: the zone lookup
      also hurts the regular use of "unlock_page()" a lot, because the zone
      lookup ends up forcing several unnecessary cache misses and generates
      horrible code.
      
      As part of that earlier discussion, we had a much better solution for
      the NUMA scalability issue - by just making the page lock have a
      separate contention bit, the waitqueue doesn't even have to be looked at
      for the normal case.
      
      Peter Zijlstra already has a patch for that, but let's see if anybody
      even notices.  In the meantime, let's fix the actual gfs2 breakage by
      simplifying the bitlock waitqueues and removing the per-zone issue.
      Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
      Tested-by: Bob Peterson <rpeterso@redhat.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 27 Oct, 2016 1 commit
  17. 25 Oct, 2016 4 commits
    • timers: Prevent base clock corruption when forwarding · 6bad6bcc
      Thomas Gleixner committed
      When a timer is enqueued we try to forward the timer base clock. This
      mechanism has two issues:
      
      1) Forwarding a remote base unlocked
      
      The forwarding function is called from get_target_base() with the current
      timer base lock held. But if the new target base is a different base than
      the current base (can happen with NOHZ, sigh!) then the forwarding is done
      on an unlocked base. This can lead to corruption of base->clk.
      
      Solution is simple: Invoke the forwarding after the target base is locked.
      
      2) Possible corruption due to jiffies advancing
      
      This is similar to the issue in get_next_timer_interrupt() which was fixed
      in the previous patch. jiffies can advance between check and assignment,
      thereby advancing base->clk beyond the next expiry value.
      
      So we need to read jiffies into a local variable once and do the checks and
      assignment with the local copy.
      
      Fixes: a683f390 ("timers: Forward the wheel clock whenever possible")
      Reported-by: Ashton Holmes <scoopta@gmail.com>
      Reported-by: Michael Thayer <michael.thayer@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Michal Necasek <michal.necasek@oracle.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: knut.osmundsen@oracle.com
      Cc: stable@vger.kernel.org
      Cc: stern@rowland.harvard.edu
      Cc: rt@linutronix.de
      Link: http://lkml.kernel.org/r/20161022110552.253640125@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • timers: Prevent base clock rewind when forwarding clock · 041ad7bc
      Thomas Gleixner committed
      Ashton and Michael reported that kernel versions 4.8 and later suffer from
      USB timeouts which are caused by the timer wheel rework.
      
      This is caused by a bug in the base clock forwarding mechanism, which leads
      to timers expiring early. The scenario which leads to this is:
      
      run_timers()
        while (jiffies >= base->clk) {
          collect_expired_timers();
          base->clk++;
          expire_timers();
        }          
      
      So base->clk = jiffies + 1. Now the cpu goes idle:
      
      idle()
        get_next_timer_interrupt()
          nextevt = __next_timer_interrupt();
          if (time_after(nextevt, base->clk))
            base->clk = jiffies;
      
      jiffies has not advanced since run_timers(), so this assignment effectively
      decrements base->clk by one.
      
      base->clk is the index into the timer wheel arrays. So let's assume the
      following state after the base->clk increment in run_timers():
      
       jiffies = 0
       base->clk = 1
      
      A timer gets enqueued with an expiry delta of 63 ticks (which is the case
      with the USB timeout and HZ=250) so the resulting bucket index is:
      
        base->clk + delta = 1 + 63 = 64
      
      The timer goes into the first wheel level. The array size is 64 so it ends
      up in bucket 0, which is correct as it takes 63 ticks to advance base->clk
      to index into bucket 0 again.
      
      If the cpu goes idle before jiffies advance, then the bug in the forwarding
      mechanism sets base->clk back to 0, so the next invocation of run_timers()
      at the next tick will index into bucket 0 and therefore expire the timer 62
      ticks too early.
      
      Instead of blindly setting base->clk to jiffies we must make the forwarding
      conditional on jiffies > base->clk, but we cannot use jiffies for this as
      we might run into the following issue:
      
        if (time_after(jiffies, base->clk)) {
          if (time_after(nextevt, base->clk))
             base->clk = jiffies;
      
      jiffies can increment between the check and the assignment far enough to
      advance beyond nextevt. So we need to use a stable value for checking.
      
      get_next_timer_interrupt() has the basej argument which is the jiffies
      value snapshot taken in the calling code. So we can just use that.
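The scenario above can be replayed with a small userspace model. This is a sketch under stated assumptions, not the kernel's timer code: `after()`, `bucket()`, `forward_buggy()` and `forward_fixed()` are hypothetical helpers, `LVL0_SIZE` models the 64-entry first wheel level, and the fixed variant shows one way to forward only when the `basej` snapshot is ahead of the clock.

```c
#include <stdbool.h>

#define LVL0_SIZE 64UL	/* buckets in the first wheel level */

/* Wrap-safe "a is after b", the same idea as the kernel's time_after(). */
static bool after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Bucket index for an expiry value in the first level. */
static unsigned long bucket(unsigned long expires)
{
	return expires % LVL0_SIZE;
}

/* Old behaviour: blindly assigns jiffies, which may move clk backwards. */
static unsigned long forward_buggy(unsigned long clk, unsigned long jiffies,
				   unsigned long nextevt)
{
	if (after(nextevt, clk))
		clk = jiffies;
	return clk;
}

/* Sketch of the fix: only move forward, using the stable snapshot basej. */
static unsigned long forward_fixed(unsigned long clk, unsigned long basej,
				   unsigned long nextevt)
{
	if (after(basej, clk)) {
		if (after(nextevt, basej))
			clk = basej;	/* next event is still in the future */
		else if (after(nextevt, clk))
			clk = nextevt;	/* never skip past a pending expiry */
	}
	return clk;
}
```

With `clk = 1`, `jiffies = 0` and a timer in `bucket(1 + 63)`, the buggy variant returns 0 and thereby points the clock back at the bucket holding the timer, while the fixed variant leaves the clock at 1.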
      
      Thanks to Ashton for bisecting and providing trace data!
      
      Fixes: a683f390 ("timers: Forward the wheel clock whenever possible")
      Reported-by: Ashton Holmes <scoopta@gmail.com>
      Reported-by: Michael Thayer <michael.thayer@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Michal Necasek <michal.necasek@oracle.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: knut.osmundsen@oracle.com
      Cc: stable@vger.kernel.org
      Cc: stern@rowland.harvard.edu
      Cc: rt@linutronix.de
      Link: http://lkml.kernel.org/r/20161022110552.175308322@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • timers: Lock base for same bucket optimization · 4da9152a
      Thomas Gleixner committed
      Linus stumbled over the unlocked modification of the timer expiry value in
      mod_timer() which is an optimization for timers which stay in the same
      bucket - due to the bucket granularity - despite their expiry time getting
      updated.
      
      The optimization itself still makes sense even if we take the lock, because
      in case that the bucket stays the same, we avoid the pointless
      dequeue/enqueue dance.
      
      Make the check and the modification of timer->expires protected by the base
      lock and shuffle the remaining code around so we can keep the lock held
      when we actually have to requeue the timer to a different bucket.
      
      Fixes: f00c0afd ("timers: Implement optimization for same expiry time in mod_timer()")
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1610241711220.4983@nanos
      Cc: stable@vger.kernel.org
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
    • timers: Plug locking race vs. timer migration · b831275a
      Thomas Gleixner committed
      Linus noticed that lock_timer_base() lacks a READ_ONCE() for accessing the
      timer flags. As a consequence the compiler is allowed to reload the flags
      between the initial check for TIMER_MIGRATING and the following timer base
      computation and the spin lock of the base.
      
      While this has not been observed (yet), we need to make sure that it never
      happens.
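The hazard can be sketched in userspace C. Everything here is illustrative: the `READ_ONCE()` macro is a simplified stand-in built on a volatile cast (it relies on the GCC/Clang `__typeof__` extension), the flag values are assumptions, and `struct timer` and `timer_base_index()` are hypothetical. The point is that the flags are loaded exactly once into a local, and both the migration check and the index computation use that same snapshot.

```c
#include <stdint.h>

/* Simplified stand-in: force a single load through a volatile access. */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

#define TIMER_CPUMASK	0x0003FFFFu	/* assumed: low bits hold the CPU index */
#define TIMER_MIGRATING	0x00040000u	/* assumed: timer is changing base */

struct timer {
	uint32_t flags;
};

/* Returns the CPU index encoded in the flags, or -1 while migrating. */
static int timer_base_index(struct timer *t)
{
	/* One load; the compiler may not re-read t->flags below. */
	uint32_t tf = READ_ONCE(t->flags);

	if (tf & TIMER_MIGRATING)
		return -1;

	/* Computed from the same snapshot that passed the check above. */
	return (int)(tf & TIMER_CPUMASK);
}
```

Without the snapshot, a compiler would be free to read `t->flags` once for the check and again for the index, so a concurrent update between the two reads could select a base that the check never validated.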
      
      Fixes: 0eeda71b ("timer: Replace timer base by a cpu index")
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1610241711220.4983@nanos
      Cc: stable@vger.kernel.org
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
  18. 24 Oct, 2016 1 commit
    • PM / suspend: Fix missing KERN_CONT for suspend message · 1adb469b
      Jon Hunter committed
      Commit 4bcc595c (printk: reinstate KERN_CONT for printing
      continuation lines) exposed a missing KERN_CONT from one of the
      messages shown on entering suspend. With v4.9-rc1, the 'done.' shown
      after syncing the filesystems no longer appears as a continuation but
      a new message with its own timestamp.
      
      [    9.259566] PM: Syncing filesystems ... [    9.264119] done.
      
      Fix this by adding the KERN_CONT log level for the 'done.' part of the
      message seen after syncing filesystems. While we are at it, convert
      these suspend printks to pr_info and pr_cont, respectively.
      Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  19. 23 Oct, 2016 1 commit
  20. 22 Oct, 2016 1 commit
  21. 21 Oct, 2016 2 commits
  22. 20 Oct, 2016 1 commit
    • printk: suppress empty continuation lines · 8835ca59
      Linus Torvalds committed
      We have a fairly common pattern where you print several things as
      continuations on one single line in a loop, and then at the end you do
      
      	printk(KERN_CONT "\n");
      
      to flush the buffered output.
      
      But if the output was flushed by something else (concurrent printk
      activity, or just system logging), we don't want that final flushing to
      just print an empty line.
      
      So just suppress empty continuation lines when they couldn't be merged
      into the line they are a continuation of.
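A toy model of that rule, written as ordinary C, may help. It is not printk's implementation: `cont_buf`, `lines_emitted`, `flush_cont()` and `log_cont()` are hypothetical, and the buffer simply counts emitted lines instead of writing to a log. A continuation fragment that is empty apart from its newline is dropped when there is no pending line for it to terminate.

```c
#include <stdbool.h>
#include <string.h>

static char cont_buf[256];	/* pending, not yet terminated line */
static size_t cont_len;
static int lines_emitted;	/* how many lines reached the "log" */

/* Emit whatever is pending; also models a flush by another party. */
static void flush_cont(void)
{
	if (cont_len) {
		lines_emitted++;
		cont_len = 0;
	}
}

/* Append a continuation fragment; returns false if it was suppressed. */
static bool log_cont(const char *text)
{
	size_t len = strlen(text);
	bool newline = len && text[len - 1] == '\n';

	if (newline)
		len--;

	/* Nothing pending and nothing to add: do not emit an empty line. */
	if (cont_len == 0 && len == 0)
		return false;

	if (len > sizeof(cont_buf) - cont_len)
		len = sizeof(cont_buf) - cont_len;	/* truncate on overflow */
	memcpy(cont_buf + cont_len, text, len);
	cont_len += len;

	if (newline)
		flush_cont();
	return true;
}
```

If something else calls `flush_cont()` between the last fragment and the final `log_cont("\n")`, that final call returns false and no blank line is counted.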
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  23. 19 Oct, 2016 3 commits
  24. 17 Oct, 2016 1 commit