1. 15 May 2017: 10 commits
    • sched/topology: Refactor function build_overlap_sched_groups() · 8c033469
      Author: Lauro Ramos Venancio
      Create functions build_group_from_child_sched_domain() and
      init_overlap_sched_group(). No functional change.
      Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1492091769-19879-2-git-send-email-lvenanci@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/clock: Print a warning recommending 'tsc=unstable' · 7708d5f0
      Author: Peter Zijlstra
      With our switch to stable delayed until late_initcall(), the most
      likely cause of hitting mark_tsc_unstable() is the watchdog. The
      watchdog typically only triggers when creative BIOSes fiddle with the
      TSC to hide SMI latency.
      
      Since the watchdog can only detect TSC fiddling after the fact, all TSC
      clocks (including userspace GTOD) can already have reported funny
      values.
      
      The only way to fully avoid this is to manually mark the TSC unstable
      at boot. Suggest that people do this on their broken systems.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/clock: Use late_initcall() instead of sched_init_smp() · 2e44b7dd
      Author: Peter Zijlstra
      Core2 marks its TSC unstable in ACPI Processor Idle, which is probed
      after sched_init_smp(). Luckily it appears that both acpi_processor and
      intel_idle (which has a similar check) must be built in.
      
      This means we can delay switching to stable until after these drivers
      have run (if they were modules, this would be impossible).
      
      Delay the stable switch to late_initcall() to allow these drivers to
      mark TSC unstable and avoid difficult stable->unstable transitions.
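      
      A minimal sketch of that deferral, assuming the usual initcall pattern
      (the helper names below are assumptions for illustration, not a quote
      of the patch):
      
        /*
         * Sketch: late_initcall() runs after all built-in driver inits, so
         * acpi_processor and intel_idle have had their chance to mark the
         * TSC unstable before we commit to the stable clock.
         */
        static int __init sched_clock_init_late(void)
        {
                if (__sched_clock_stable_early)         /* nobody objected */
                        __set_sched_clock_stable();
                return 0;
        }
        late_initcall(sched_clock_init_late);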
      Reported-by: Lofstedt, Marta <marta.lofstedt@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • cpuidle: Fix idle time tracking · f9fccdb9
      Author: Peter Zijlstra
      Ville reported that on his Core2, which has TSC stop in idle, we would
      always report very short idle durations. He tracked this down to
      commit:
      
        e93e59ce ("cpuidle: Replace ktime_get() with local_clock()")
      
      which replaces ktime_get() with local_clock().
      
      Add a sched_clock_idle_wakeup_event() call, which will re-sync the
      clock with ktime_get_ns() when TSC is unstable and no-op otherwise.
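      
      A hedged sketch of where such a call fits in cpuidle_enter_state()
      (enter_idle_state() stands in for the driver's ->enter() hook; exact
      placement in the real patch may differ):
      
        time_start = ns_to_ktime(local_clock());
        enter_idle_state();                      /* TSC may have stopped here */
        sched_clock_idle_wakeup_event();         /* resync local_clock() with
                                                  * ktime_get_ns() if the TSC is
                                                  * unstable; no-op otherwise  */
        time_end = ns_to_ktime(local_clock());
        diff = ktime_sub(time_end, time_start);  /* now a sane idle duration */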
      Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Fixes: e93e59ce ("cpuidle: Replace ktime_get() with local_clock()")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/clock: Remove watchdog touching · 3067a33d
      Author: Peter Zijlstra
      Commit:
      
        2bacec8c ("sched: touch softlockup watchdog after idling")
      
      introduced the touch_softlockup_watchdog_sched() call without
      justification, and I feel sched_clock management is not the right
      place for it; it should only be concerned with producing semi-coherent time.
      
      If this causes watchdog thingies, we can find a better place.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/clock: Remove unused argument to sched_clock_idle_wakeup_event() · ac1e843f
      Author: Peter Zijlstra
      The argument to sched_clock_idle_wakeup_event() has not been used in a
      long time. Remove it.
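      
      In effect the prototype goes from taking an ignored delta to taking
      nothing (a sketch; the old parameter name is recalled from the tree and
      may be imprecise):
      
        /* before: every caller had to supply a delta nobody consumed */
        void sched_clock_idle_wakeup_event(u64 delta_ns);
      
        /* after */
        void sched_clock_idle_wakeup_event(void);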
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/tsc, sched/clock, clocksource: Use clocksource watchdog to provide stable sync points · b421b22b
      Author: Peter Zijlstra
      Currently we keep sched_clock_tick() active for stable TSC in order to
      keep the per-CPU state semi up-to-date. The (obvious) problem is that
      by the time we detect TSC is borked, our per-CPU state is also borked.
      
      So hook into the clocksource watchdog and call a method after we've
      found it to still be stable.
      
      There's the obvious race where the TSC goes wonky between finding it
      stable and us running the callback, but closing that is too much work
      and not really worth it, since we're already detecting TSC wobbles
      after the fact, so we cannot, by definition, fully avoid funny clock
      values.
      
      And since the watchdog runs less often than the tick, this is also an
      optimization.
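      
      A rough sketch of that hook, following the description above (the
      callback and helper names are assumptions for illustration, not a quote
      of the patch):
      
        /* kernel/time/clocksource.c (sketch): after the watchdog has compared
         * a clocksource against the watchdog clock and found it still sane,
         * let the clocksource pass the good news along. */
        if (cs == curr_clocksource && cs->tick_stable)
                cs->tick_stable(cs);
      
        /* arch/x86/kernel/tsc.c (sketch): the TSC clocksource forwards the
         * event so sched_clock can refresh its per-CPU state from a
         * known-good reading instead of relying on every tick. */
        static void tsc_cs_tick_stable(struct clocksource *cs)
        {
                if (tsc_unstable)
                        return;
                sched_clock_tick_stable();
        }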
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/clock: Initialize all per-CPU state before switching (back) to unstable · cf15ca8d
      Author: Peter Zijlstra
      In preparation for not keeping the sched_clock_tick() active for
      stable TSC, we need to explicitly initialize all per-CPU state
      before switching back to unstable.
      
      Note: this patch loses the __gtod_offset calculation; it will be
      restored in the next one.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/cfs: Make util/load_avg more stable · 625ed2bf
      Author: Vincent Guittot
      In the current implementation of load/util_avg, we assume that the
      ongoing time segment has fully elapsed, and util/load_sum is divided
      by LOAD_AVG_MAX, even if part of the time segment still remains to
      run. As a consequence, this remaining part is considered as idle time
      and generates unexpected variations of util_avg of a busy CPU in the
      range [1002..1024[ whereas util_avg should stay at 1023.
      
      In order to keep the metric stable, we should not consider the ongoing
      time segment when computing load/util_avg but only the segments that
      have already fully elapsed. But to not consider the current time
      segment adds unwanted latency in the load/util_avg responsiveness,
      especially when the time is scaled instead of the contribution.
      
      Instead of waiting for the current time segment to have fully elapsed
      before accounting it in load/util_avg, we can already account the
      elapsed part but change the range used to compute load/util_avg
      accordingly.
      
      At the very beginning of a new time segment, the past segments have
      been decayed and the max value is LOAD_AVG_MAX*y. At the very end of
      the current time segment, the max value becomes:
      
        LOAD_AVG_MAX*y + 1024(us)  (== LOAD_AVG_MAX)
      
      In fact, the max value is:
      
        LOAD_AVG_MAX*y + sa->period_contrib
      
      at any time in the time segment.
      
      Taking advantage of the fact that:
      
        LOAD_AVG_MAX*y == LOAD_AVG_MAX-1024
      
      the range becomes [0..LOAD_AVG_MAX-1024+sa->period_contrib].
      
      As the elapsed part is already accounted in load/util_sum, we update
      the max value according to the current position in the time segment
      instead of removing its contribution.
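      
      Concretely, the divider used to turn the running sums into averages
      becomes a function of period_contrib; a simplified sketch (weight
      scaling omitted; LOAD_AVG_MAX is 47742 in this era of the code):
      
        /*
         * At the start of a fresh 1024us segment (period_contrib == 0) the
         * divider is LOAD_AVG_MAX - 1024; it grows back to LOAD_AVG_MAX as
         * the segment fills, so the already-accounted part of the segment no
         * longer shows up as idle time.
         */
        u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
      
        sa->load_avg = div_u64(sa->load_sum, divider);
        sa->util_avg = sa->util_sum / divider;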
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten.Rasmussen@arm.com
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: pjt@google.com
      Cc: yuyang.du@intel.com
      Link: http://lkml.kernel.org/r/1493188076-2767-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Call __schedule() from do_idle() without enabling preemption · 8663effb
      Author: Steven Rostedt (VMware)
      I finally got around to creating trampolines for dynamically allocated
      ftrace_ops using synchronize_rcu_tasks(). For users of the ftrace
      function hook callbacks, like perf, that allocate the ftrace_ops
      descriptor via kmalloc() and friends, ftrace was not able to optimize
      the functions being traced to use a trampoline because they would also
      need to be allocated dynamically. The problem is that they cannot be
      freed when CONFIG_PREEMPT is set, as there's no way to tell if a task
      was preempted on the trampoline. That was before Paul McKenney
      implemented synchronize_rcu_tasks() that would make sure all tasks
      (except idle) have scheduled out or have entered user space.
      
      While testing this, I triggered this bug:
      
       BUG: unable to handle kernel paging request at ffffffffa0230077
       ...
       RIP: 0010:0xffffffffa0230077
       ...
       Call Trace:
        schedule+0x5/0xe0
        schedule_preempt_disabled+0x18/0x30
        do_idle+0x172/0x220
      
      What happened was that the idle task was preempted on the trampoline.
      As synchronize_rcu_tasks() ignores the idle thread, there's nothing
      that lets ftrace know that the idle task was preempted on a trampoline.
      
      The idle task shouldn't need to ever enable preemption. The idle task
      is simply a loop that calls schedule() or places the CPU into idle mode.
      In fact, having preemption enabled is inefficient, because it can
      happen when idle is just about to call schedule anyway, which would
      cause schedule to be called twice. Once for when the interrupt came in
      and was returning back to normal context, and then again in the normal
      path that the idle loop is running in, which would be pointless, as it
      had already scheduled.
      
      The only reason schedule_preempt_disabled() enables preemption is to be
      able to call sched_submit_work(), which requires preemption enabled. As
      this is a nop when the task is in the RUNNING state, and idle is always
      in the running state, there's no reason that idle needs to enable
      preemption. But that means it cannot use schedule_preempt_disabled() as
      other callers of that function require calling sched_submit_work().
      
      Adding a new function, local to kernel/sched/, that allows idle to call
      the scheduler without enabling preemption fixes the
      synchronize_rcu_tasks() issue and also removes the pointless spurious
      schedule calls caused by interrupts happening in the brief window where
      preemption is enabled just before it calls schedule().
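      
      A hedged sketch of such a helper, based only on the description above
      (the name schedule_idle() and the details are assumptions, not a quote
      of the patch):
      
        /*
         * kernel/sched/core.c (sketch): like schedule(), but skips
         * sched_submit_work() and never enables preemption; for use by the
         * idle loop only.
         */
        void __sched schedule_idle(void)
        {
                do {
                        __schedule(false);
                } while (need_resched());
        }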
      
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20170414084809.3dacde2a@gandalf.local.home
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 13 May 2017: 2 commits
  3. 10 May 2017: 1 commit
    • perf/callchain: Force USER_DS when invoking perf_callchain_user() · 88b0193d
      Author: Will Deacon
      Perf can generate and record a user callchain in response to a synchronous
      request, such as a tracepoint firing. If this happens under set_fs(KERNEL_DS),
      then we can end up walking the user stack (and dereferencing/saving whatever we
      find there) without the protections usually afforded by checks such as
      access_ok().
      
      Rather than play whack-a-mole with each architecture's stack unwinding
      implementation, fix the root of the problem by ensuring that we force USER_DS
      when invoking perf_callchain_user from the perf core.
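      
      The core-side fix amounts to bracketing the user unwind with a set_fs()
      pair; a hedged sketch of what that looks like in get_perf_callchain()
      (surrounding checks omitted):
      
        if (user) {
                mm_segment_t fs = get_fs();
      
                /* Walk the user stack with USER_DS so the usual access
                 * limits apply even if the event fired under KERNEL_DS. */
                set_fs(USER_DS);
                perf_callchain_user(&ctx, regs);
                set_fs(fs);
        }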
      Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. 09 May 2017: 15 commits
  5. 06 May 2017: 1 commit
    • ACPI / sleep: Ignore spurious SCI wakeups from suspend-to-idle · eed4d47e
      Author: Rafael J. Wysocki
      The ACPI SCI (System Control Interrupt) is set up as a wakeup IRQ
      during suspend-to-idle transitions and, consequently, any events
      signaled through it wake up the system from that state.  However,
      on some systems some of the events signaled via the ACPI SCI while
      suspended to idle should not cause the system to wake up.  In fact,
      quite often they should just be discarded.
      
      Arguably, systems should not resume entirely on such events, but in
      order to decide which events really should cause the system to resume
      and which are spurious, it is necessary to resume up to the point
      when ACPI SCIs are actually handled and processed, which is after
      executing dpm_resume_noirq() in the system resume path.
      
      For this reason, add a loop around freeze_enter() in which the
      platforms can process events signaled via multiplexed IRQ lines
      like the ACPI SCI and add suspend-to-idle hooks that can be
      used for this purpose to struct platform_freeze_ops.
      
      In the ACPI case, the ->wake hook is used for checking if the SCI
      has triggered while suspended and deferring the interrupt-induced
      system wakeup until the events signaled through it are actually
      processed sufficiently to decide whether or not the system should
      resume.  In turn, the ->sync hook allows all of the relevant event
      queues to be flushed so as to prevent events from being missed due
      to race conditions.
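      
      A hedged sketch of the extended ops structure and its ACPI wiring
      (member order and the acpi_freeze_* names are assumptions):
      
        /* include/linux/suspend.h (sketch) */
        struct platform_freeze_ops {
                int  (*begin)(void);
                int  (*prepare)(void);
                void (*wake)(void);     /* did the SCI signal a real wakeup? */
                void (*sync)(void);     /* flush event queues before looping */
                void (*restore)(void);
                void (*end)(void);
        };
      
        /* drivers/acpi/sleep.c (sketch) */
        static const struct platform_freeze_ops acpi_freeze_ops = {
                .begin   = acpi_freeze_begin,
                .prepare = acpi_freeze_prepare,
                .wake    = acpi_freeze_wake,
                .sync    = acpi_freeze_sync,
                .restore = acpi_freeze_restore,
                .end     = acpi_freeze_end,
        };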
      
      In addition to that, some ACPI code processing wakeup events needs
      to be modified to use the "hard" version of wakeup triggers, so that
      it will cause a system resume to happen on device-induced wakeup
      events even if the "soft" mechanism to prevent the system from
      suspending is not enabled (that also helps to catch device-induced
      wakeup events occurring during suspend transitions in progress).
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  6. 05 May 2017: 1 commit
  7. 04 May 2017: 6 commits
    • ftrace: Simplify ftrace_match_record() even more · 77c0edde
      Author: Steven Rostedt (VMware)
      Dan Carpenter sent a patch to remove a check in ftrace_match_record()
      because the logic of the code made the check redundant. I looked deeper into
      the code, and made the following logic table, with the three variables and
      the result of the original code.
      
      modname        mod_matches     exclude_mod         result
      -------        -----------     -----------         ------
        0                 0               0              return 0
        0                 0               1              func_match
        0                 1               *             < cannot exist >
        1                 0               0              return 0
        1                 0               1              func_match
        1                 1               0              func_match
        1                 1               1              return 0
      
      Notice that when mod_matches == exclude_mod, the result is always to
      return 0, and when mod_matches != exclude_mod, the result is to test
      the function. This means we only need to test whether mod_matches is
      equal to exclude_mod.
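      
      In code, the whole table collapses to a single comparison; a sketch of
      the resulting logic (the '!' normalisation just guards against non-0/1
      truth values):
      
        if (!mod_matches == !exclude_mod)
                return 0;       /* both or neither: reject the record */
        /* otherwise fall through and match on the function name */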
      
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • ftrace: Remove an unneeded condition · 31805c90
      Author: Dan Carpenter
      We know that "mod_matches" is true here so there is no need to check
      again.
      
      Link: http://lkml.kernel.org/r/20170331152130.GA4947@mwanda
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • tracing: Use strlcpy() instead of strcpy() in __trace_find_cmdline() · e09e2867
      Author: Amey Telawane
      strcpy() is inherently not safe, and strlcpy() should be used instead.
      __trace_find_cmdline() uses strcpy() because the comms saved must have a
      terminating nul character, but it doesn't hurt to add the extra protection
      of using strlcpy() instead of strcpy().
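      
      The change itself is a one-liner; a hedged sketch (the buffer accessor
      and size macro are recalled from the tree and may be imprecise):
      
        /* __trace_find_cmdline() (sketch) */
        strlcpy(comm, get_saved_cmdlines(map), TASK_COMM_LEN); /* was strcpy() */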
      
      Link: http://lkml.kernel.org/r/1493806274-13936-1-git-send-email-amit.pundir@linaro.org
      Signed-off-by: Amey Telawane <ameyt@codeaurora.org>
      [AmitP: Cherry-picked this commit from CodeAurora kernel/msm-3.10
      https://source.codeaurora.org/quic/la/kernel/msm-3.10/commit/?id=2161ae9a70b12cf18ac8e5952a20161ffbccb477]
      Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
      [ Updated change log and removed the "- 1" from len parameter ]
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • mm: introduce memalloc_nofs_{save,restore} API · 7dea19f9
      Author: Michal Hocko
      GFP_NOFS context is used for the following 5 reasons currently:
      
       - to prevent from deadlocks when the lock held by the allocation
         context would be needed during the memory reclaim
      
       - to prevent from stack overflows during the reclaim because the
         allocation is performed from a deep context already
      
       - to prevent lockups when the allocation context depends on other
         reclaimers to make a forward progress indirectly
      
       - just in case because this would be safe from the fs POV
      
       - silence lockdep false positives
      
      Unfortunately overuse of this allocation context brings some problems to
      the MM.  Memory reclaim is much weaker (especially during heavy FS
      metadata workloads), and the OOM killer cannot be invoked because the MM layer
      doesn't have enough information about how much memory is freeable by the
      FS layer.
      
      In many cases it is far from clear why the weaker context is even used
      and so it might be used unnecessarily.  We would like to get rid of
      those as much as possible.  One way to do that is to use the flag in
      scopes rather than isolated cases.  Such a scope is declared when really
      necessary, tracked per task and all the allocation requests from within
      the context will simply inherit the GFP_NOFS semantic.
      
      Not only is this easier to understand and maintain, because there are
      far fewer problematic contexts than specific allocation requests, it
      also helps code paths where the FS layer interacts with other layers (e.g.
      crypto, security modules, MM etc...) and there is no easy way to convey
      the allocation context between the layers.
      
      Introduce memalloc_nofs_{save,restore} API to control the scope of
      GFP_NOFS allocation context.  This is basically copying
      memalloc_noio_{save,restore} API we have for other restricted allocation
      context GFP_NOIO.  The PF_MEMALLOC_NOFS flag already exists and it is
      just an alias for PF_FSTRANS, which has been XFS-specific until recently.
      There are no PF_FSTRANS users anymore, so let's just drop it.
      
      PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
      implicitly same as PF_MEMALLOC_NOIO drops __GFP_IO.  memalloc_noio_flags
      is renamed to current_gfp_context because it now cares about both
      PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts.  XFS code paths preserve
      their semantics.  kmem_flags_convert() doesn't need to evaluate the flag
      anymore.
      
      This patch shouldn't introduce any functional changes.
      
      Let's hope that filesystems will drop direct GFP_NOFS (resp.  ~__GFP_FS)
      usage as much as possible and only use properly documented
      memalloc_nofs_{save,restore} checkpoints where they are appropriate.
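      
      A hedged usage sketch of the new scope API (the surrounding function is
      hypothetical):
      
        static void fs_transaction_work(void)           /* hypothetical caller */
        {
                unsigned int nofs_flags = memalloc_nofs_save();
      
                /* GFP_KERNEL here implicitly behaves as GFP_NOFS */
                void *buf = kmalloc(4096, GFP_KERNEL);
      
                /* ... work that must not recurse into the FS ... */
      
                kfree(buf);
                memalloc_nofs_restore(nofs_flags);
        }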
      
      [akpm@linux-foundation.org: fix comment typo, reflow comment]
      Link: http://lkml.kernel.org/r/20170306131408.9828-5-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lockdep: allow to disable reclaim lockup detection · 7e784422
      Author: Michal Hocko
      The current implementation of the reclaim lockup detection can lead to
      false positives; these do happen in practice and usually lead to tweaking
      the code to silence lockdep by using GFP_NOFS, even though the context
      could use __GFP_FS just fine.
      
      See
      
        http://lkml.kernel.org/r/20160512080321.GA18496@dastard
      
      as an example.
      
        =================================
        [ INFO: inconsistent lock state ]
        4.5.0-rc2+ #4 Tainted: G           O
        ---------------------------------
        inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
        kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:
      
        (&xfs_nondir_ilock_class){++++-+}, at: xfs_ilock+0x177/0x200 [xfs]
      
        {RECLAIM_FS-ON-R} state was registered at:
          mark_held_locks+0x79/0xa0
          lockdep_trace_alloc+0xb3/0x100
          kmem_cache_alloc+0x33/0x230
          kmem_zone_alloc+0x81/0x120 [xfs]
          xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
          __xfs_refcount_find_shared+0x75/0x580 [xfs]
          xfs_refcount_find_shared+0x84/0xb0 [xfs]
          xfs_getbmap+0x608/0x8c0 [xfs]
          xfs_vn_fiemap+0xab/0xc0 [xfs]
          do_vfs_ioctl+0x498/0x670
          SyS_ioctl+0x79/0x90
          entry_SYSCALL_64_fastpath+0x12/0x6f
      
               CPU0
               ----
          lock(&xfs_nondir_ilock_class);
          <Interrupt>
            lock(&xfs_nondir_ilock_class);
      
         *** DEADLOCK ***
      
        3 locks held by kswapd0/543:
      
        stack backtrace:
        CPU: 0 PID: 543 Comm: kswapd0 Tainted: G           O    4.5.0-rc2+ #4
        Call Trace:
         lock_acquire+0xd8/0x1e0
         down_write_nested+0x5e/0xc0
         xfs_ilock+0x177/0x200 [xfs]
         xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
         xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
         evict+0xc5/0x190
         dispose_list+0x39/0x60
         prune_icache_sb+0x4b/0x60
         super_cache_scan+0x14f/0x1a0
         shrink_slab.part.63.constprop.79+0x1e9/0x4e0
         shrink_zone+0x15e/0x170
         kswapd+0x4f1/0xa80
         kthread+0xf2/0x110
         ret_from_fork+0x3f/0x70
      
      To quote Dave:
       "Ignoring whether reflink should be doing anything or not, that's a
        "xfs_refcountbt_init_cursor() gets called both outside and inside
        transactions" lockdep false positive case. The problem here is lockdep
        has seen this allocation from within a transaction, hence a GFP_NOFS
        allocation, and now it's seeing it in a GFP_KERNEL context. Also note
        that we have an active reference to this inode.
      
        So, because the reclaim annotations overload the interrupt level
        detections and it's seen the inode ilock been taken in reclaim
        ("interrupt") context, this triggers a reclaim context warning where
        it thinks it is unsafe to do this allocation in GFP_KERNEL context
        holding the inode ilock..."
      
      This sounds like a fundamental problem of the reclaim lock detection.
      It is really impossible to annotate such a special use case IMHO unless
      the reclaim lockup detection is reworked completely.  Until then it is
      much better to provide a way to add an "I know what I am doing" flag and
      mark problematic places.  This would prevent abuse of the GFP_NOFS flag,
      which has a runtime effect even on configurations that have lockdep
      disabled.
      
      Introduce __GFP_NOLOCKDEP flag which tells the lockdep gfp tracking to
      skip the current allocation request.
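      
      A hedged sketch of how a known-safe call site would use the flag (ptr
      and size are placeholders):
      
        /* Opt a single allocation out of the reclaim tracking instead of
         * weakening it to GFP_NOFS just to keep lockdep quiet. */
        ptr = kmalloc(size, GFP_KERNEL | __GFP_NOLOCKDEP);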
      
      While we are at it, also make sure that the radix tree doesn't
      accidentally override tags stored in the upper part of the gfp_mask.
      
      Link: http://lkml.kernel.org/r/20170306131408.9828-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lockdep: teach lockdep about memalloc_noio_save · 6d7225f0
      Author: Nikolay Borisov
      Patch series "scope GFP_NOFS api", v5.
      
      This patch (of 7):
      
      Commit 21caf2fc ("mm: teach mm by current context info to not do I/O
      during memory allocation") added the memalloc_noio_(save|restore)
      functions to enable people to modify the MM behavior by disabling I/O
      during memory allocation.
      
      This was further extended in commit 934f3072 ("mm: clear __GFP_FS
      when PF_MEMALLOC_NOIO is set").
      
      memalloc_noio_* functions prevent allocation paths recursing back into
      the filesystem without explicitly changing the flags for every
      allocation site.
      
      However, lockdep hasn't been keeping up with the changes and it entirely
      misses handling the memalloc_noio adjustments.  Instead, it is left to
      the callers of __lockdep_trace_alloc to call the function after they
      have shaved off the respective GFP flags, which can lead to false positives:
      
        =================================
         [ INFO: inconsistent lock state ]
         4.10.0-nbor #134 Not tainted
         ---------------------------------
         inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
         fsstress/3365 [HC0[0]:SC0[0]:HE1:SE1] takes:
          (&xfs_nondir_ilock_class){++++?.}, at: xfs_ilock+0x141/0x230
         {IN-RECLAIM_FS-W} state was registered at:
           __lock_acquire+0x62a/0x17c0
           lock_acquire+0xc5/0x220
           down_write_nested+0x4f/0x90
           xfs_ilock+0x141/0x230
           xfs_reclaim_inode+0x12a/0x320
           xfs_reclaim_inodes_ag+0x2c8/0x4e0
           xfs_reclaim_inodes_nr+0x33/0x40
           xfs_fs_free_cached_objects+0x19/0x20
           super_cache_scan+0x191/0x1a0
           shrink_slab+0x26f/0x5f0
           shrink_node+0xf9/0x2f0
           kswapd+0x356/0x920
           kthread+0x10c/0x140
           ret_from_fork+0x31/0x40
         irq event stamp: 173777
         hardirqs last  enabled at (173777): __local_bh_enable_ip+0x70/0xc0
         hardirqs last disabled at (173775): __local_bh_enable_ip+0x37/0xc0
         softirqs last  enabled at (173776): _xfs_buf_find+0x67a/0xb70
         softirqs last disabled at (173774): _xfs_buf_find+0x5db/0xb70
      
         other info that might help us debug this:
          Possible unsafe locking scenario:
      
                CPU0
                ----
           lock(&xfs_nondir_ilock_class);
           <Interrupt>
             lock(&xfs_nondir_ilock_class);
      
          *** DEADLOCK ***
      
         4 locks held by fsstress/3365:
          #0:  (sb_writers#10){++++++}, at: mnt_want_write+0x24/0x50
          #1:  (&sb->s_type->i_mutex_key#12){++++++}, at: vfs_setxattr+0x6f/0xb0
          #2:  (sb_internal#2){++++++}, at: xfs_trans_alloc+0xfc/0x140
          #3:  (&xfs_nondir_ilock_class){++++?.}, at: xfs_ilock+0x141/0x230
      
         stack backtrace:
         CPU: 0 PID: 3365 Comm: fsstress Not tainted 4.10.0-nbor #134
         Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
         Call Trace:
          kmem_cache_alloc_node_trace+0x3a/0x2c0
          vm_map_ram+0x2a1/0x510
          _xfs_buf_map_pages+0x77/0x140
          xfs_buf_get_map+0x185/0x2a0
          xfs_attr_rmtval_set+0x233/0x430
          xfs_attr_leaf_addname+0x2d2/0x500
          xfs_attr_set+0x214/0x420
          xfs_xattr_set+0x59/0xb0
          __vfs_setxattr+0x76/0xa0
          __vfs_setxattr_noperm+0x5e/0xf0
          vfs_setxattr+0xae/0xb0
          setxattr+0x15e/0x1a0
          path_setxattr+0x8f/0xc0
          SyS_lsetxattr+0x11/0x20
          entry_SYSCALL_64_fastpath+0x23/0xc6
      
      Let's fix this by making lockdep explicitly do the shaving of the
      respective GFP flags.
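      
      A hedged sketch of that shaving (the exact spot inside lockdep is an
      assumption; the mask matches what memalloc_noio_flags() applies):
      
        static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
        {
                /* Drop the bits the PF_MEMALLOC_NOIO scope would drop anyway,
                 * so an allocation made inside such a scope no longer looks
                 * reclaim-capable to the RECLAIM_FS state tracking. */
                if (current->flags & PF_MEMALLOC_NOIO)
                        gfp_mask &= ~(__GFP_IO | __GFP_FS);
      
                /* ... existing tracking using the shaved gfp_mask ... */
        }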
      
      Fixes: 934f3072 ("mm: clear __GFP_FS when PF_MEMALLOC_NOIO is set")
      Link: http://lkml.kernel.org/r/20170306131408.9828-2-mhocko@kernel.org
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 03 May 2017: 2 commits
  9. 02 May 2017: 2 commits