1. 02 3月, 2017 1 次提交
  2. 28 2月, 2017 2 次提交
  3. 03 2月, 2017 1 次提交
  4. 01 2月, 2017 11 次提交
  5. 28 1月, 2017 1 次提交
  6. 24 1月, 2017 1 次提交
    • N
      inotify: Convert to using per-namespace limits · 1cce1eea
      Nikolay Borisov 提交于
      This patchset converts inotify to using the newly introduced
      per-userns sysctl infrastructure.
      
      Currently the inotify instances/watches are being accounted in the
      user_struct structure. This means that in setups where multiple
      users in unprivileged containers map to the same underlying
      real user (i.e. pointing to the same user_struct) the inotify limits
      are going to be shared as well, allowing one user(or application) to exhaust
      all others limits.
      
      Fix this by switching the inotify sysctls to using the
      per-namespace/per-user limits. This will allow the server admin to
      set sensible global limits, which can further be tuned inside every
      individual user namespace. Additionally, in order to preserve the
      sysctl ABI make the existing inotify instances/watches sysctls
      modify the values of the initial user namespace.
      Signed-off-by: NNikolay Borisov <n.borisov.lkml@gmail.com>
      Acked-by: NJan Kara <jack@suse.cz>
      Acked-by: NSerge Hallyn <serge@hallyn.com>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      1cce1eea
  7. 22 1月, 2017 1 次提交
    • W
      locking/rwsem: Reinit wake_q after use · bcc9a76d
      Waiman Long 提交于
      In __rwsem_down_write_failed_common(), the same wake_q variable name
      is defined twice, with the inner wake_q hiding the one in outer scope.
      We can either use different names for the two wake_q's.
      
      Even better, we can use the same wake_q twice, if necessary.
      
      To enable the latter change, we need to define a new helper function
      wake_q_init() to enable reinitalization of wake_q after use.
      Signed-off-by: NWaiman Long <longman@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1485052415-9611-1-git-send-email-longman@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      bcc9a76d
  8. 20 1月, 2017 1 次提交
    • P
      sched/clock: Fix hotplug crash · acb04058
      Peter Zijlstra 提交于
      Mike reported that he could trigger the WARN_ON_ONCE() in
      set_sched_clock_stable() using hotplug.
      
      This exposed a fundamental problem with the interface, we should never
      mark the TSC stable if we ever find it to be unstable. Therefore
      set_sched_clock_stable() is a broken interface.
      
      The reason it existed is that not having it is a pain, it means all
      relevant architecture code needs to call clear_sched_clock_stable()
      where appropriate.
      
      Of the three architectures that select HAVE_UNSTABLE_SCHED_CLOCK ia64
      and parisc are trivial in that they never called
      set_sched_clock_stable(), so add an unconditional call to
      clear_sched_clock_stable() to them.
      
      For x86 the story is a lot more involved, and what this patch tries to
      do is ensure we preserve the status quo. So even is Cyrix or Transmeta
      have usable TSC they never called set_sched_clock_stable() so they now
      get an explicit mark unstable.
      Reported-by: NMike Galbraith <efault@gmx.de>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 9881b024 ("sched/clock: Delay switching sched_clock to stable")
      Link: http://lkml.kernel.org/r/20170119133633.GB6536@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      acb04058
  9. 14 1月, 2017 3 次提交
    • T
      sched/core: Separate out io_schedule_prepare() and io_schedule_finish() · 10ab5643
      Tejun Heo 提交于
      Now that IO schedule accounting is done inside __schedule(),
      io_schedule() can be split into three steps - prep, schedule, and
      finish - where the schedule part doesn't need any special annotation.
      This allows marking a sleep as iowait by simply wrapping an existing
      blocking function with io_schedule_prepare() and io_schedule_finish().
      
      Because task_struct->in_iowait is single bit, the caller of
      io_schedule_prepare() needs to record and the pass its state to
      io_schedule_finish() to be safe regarding nesting.  While this isn't
      the prettiest, these functions are mostly gonna be used by core
      functions and we don't want to use more space for ->in_iowait.
      
      While at it, as it's simple to do now, reimplement io_schedule()
      without unnecessarily going through io_schedule_timeout().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: adilger.kernel@dilger.ca
      Cc: jack@suse.com
      Cc: kernel-team@fb.com
      Cc: mingbo@fb.com
      Cc: tytso@mit.edu
      Link: http://lkml.kernel.org/r/1477673892-28940-3-git-send-email-tj@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      10ab5643
    • P
      sched/clock: Delay switching sched_clock to stable · 9881b024
      Peter Zijlstra 提交于
      Currently we switch to the stable sched_clock if we guess the TSC is
      usable, and then switch back to the unstable path if it turns out TSC
      isn't stable during SMP bringup after all.
      
      Delay switching to the stable path until after SMP bringup is
      complete. This way we'll avoid switching during the time we detect the
      worst of the TSC offences.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      9881b024
    • D
      sched/core: Remove set_task_state() · 642fa448
      Davidlohr Bueso 提交于
      This is a nasty interface and setting the state of a foreign task must
      not be done. As of the following commit:
      
        be628be0 ("bcache: Make gc wakeup sane, remove set_task_state()")
      
      ... everyone in the kernel calls set_task_state() with current, allowing
      the helper to be removed.
      
      However, as the comment indicates, it is still around for those archs
      where computing current is more expensive than using a pointer, at least
      in theory. An important arch that is affected is arm64, however this has
      been addressed now [1] and performance is up to par making no difference
      with either calls.
      
      Of all the callers, if any, it's the locking bits that would care most
      about this -- ie: we end up passing a tsk pointer to a lot of the lock
      slowpath, and setting ->state on that. The following numbers are based
      on two tests: a custom ad-hoc microbenchmark that just measures
      latencies (for ~65 million calls) between get_task_state() vs
      get_current_state().
      
      Secondly for a higher overview, an unlink microbenchmark was used,
      which pounds on a single file with open, close,unlink combos with
      increasing thread counts (up to 4x ncpus). While the workload is quite
      unrealistic, it does contend a lot on the inode mutex or now rwsem.
      
      [1] https://lkml.kernel.org/r/1483468021-8237-1-git-send-email-mark.rutland@arm.com
      
      == 1. x86-64 ==
      
      Avg runtime set_task_state():    601 msecs
      Avg runtime set_current_state(): 552 msecs
      
                                                  vanilla                 dirty
      Hmean    unlink1-processes-2      36089.26 (  0.00%)    38977.33 (  8.00%)
      Hmean    unlink1-processes-5      28555.01 (  0.00%)    29832.55 (  4.28%)
      Hmean    unlink1-processes-8      37323.75 (  0.00%)    44974.57 ( 20.50%)
      Hmean    unlink1-processes-12     43571.88 (  0.00%)    44283.01 (  1.63%)
      Hmean    unlink1-processes-21     34431.52 (  0.00%)    38284.45 ( 11.19%)
      Hmean    unlink1-processes-30     34813.26 (  0.00%)    37975.17 (  9.08%)
      Hmean    unlink1-processes-48     37048.90 (  0.00%)    39862.78 (  7.59%)
      Hmean    unlink1-processes-79     35630.01 (  0.00%)    36855.30 (  3.44%)
      Hmean    unlink1-processes-110    36115.85 (  0.00%)    39843.91 ( 10.32%)
      Hmean    unlink1-processes-141    32546.96 (  0.00%)    35418.52 (  8.82%)
      Hmean    unlink1-processes-172    34674.79 (  0.00%)    36899.21 (  6.42%)
      Hmean    unlink1-processes-203    37303.11 (  0.00%)    36393.04 ( -2.44%)
      Hmean    unlink1-processes-224    35712.13 (  0.00%)    36685.96 (  2.73%)
      
      == 2. ppc64le ==
      
      Avg runtime set_task_state():  938 msecs
      Avg runtime set_current_state: 940 msecs
      
                                                  vanilla                 dirty
      Hmean    unlink1-processes-2      19269.19 (  0.00%)    30704.50 ( 59.35%)
      Hmean    unlink1-processes-5      20106.15 (  0.00%)    21804.15 (  8.45%)
      Hmean    unlink1-processes-8      17496.97 (  0.00%)    17243.28 ( -1.45%)
      Hmean    unlink1-processes-12     14224.15 (  0.00%)    17240.21 ( 21.20%)
      Hmean    unlink1-processes-21     14155.66 (  0.00%)    15681.23 ( 10.78%)
      Hmean    unlink1-processes-30     14450.70 (  0.00%)    15995.83 ( 10.69%)
      Hmean    unlink1-processes-48     16945.57 (  0.00%)    16370.42 ( -3.39%)
      Hmean    unlink1-processes-79     15788.39 (  0.00%)    14639.27 ( -7.28%)
      Hmean    unlink1-processes-110    14268.48 (  0.00%)    14377.40 (  0.76%)
      Hmean    unlink1-processes-141    14023.65 (  0.00%)    16271.69 ( 16.03%)
      Hmean    unlink1-processes-172    13417.62 (  0.00%)    16067.55 ( 19.75%)
      Hmean    unlink1-processes-203    15293.08 (  0.00%)    15440.40 (  0.96%)
      Hmean    unlink1-processes-234    13719.32 (  0.00%)    16190.74 ( 18.01%)
      Hmean    unlink1-processes-265    16400.97 (  0.00%)    16115.22 ( -1.74%)
      Hmean    unlink1-processes-296    14388.60 (  0.00%)    16216.13 ( 12.70%)
      Hmean    unlink1-processes-320    15771.85 (  0.00%)    15905.96 (  0.85%)
      
      x86-64 (known to be fast for get_current()/this_cpu_read_stable() caching)
      and ppc64 (with paca) show similar improvements in the unlink microbenches.
      The small delta for ppc64 (2ms), does not represent the gains on the unlink
      runs. In the case of x86, there was a decent amount of variation in the
      latency runs, but always within a 20 to 50ms increase), ppc was more constant.
      Signed-off-by: NDavidlohr Bueso <dbueso@suse.de>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dave@stgolabs.net
      Cc: mark.rutland@arm.com
      Link: http://lkml.kernel.org/r/1483479794-14013-5-git-send-email-dave@stgolabs.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      642fa448
  10. 11 1月, 2017 1 次提交
  11. 13 12月, 2016 1 次提交
  12. 29 11月, 2016 1 次提交
    • P
      sched/idle: Add support for tasks that inject idle · c1de45ca
      Peter Zijlstra 提交于
      Idle injection drivers such as Intel powerclamp and ACPI PAD drivers use
      realtime tasks to take control of CPU then inject idle. There are two
      issues with this approach:
      
       1. Low efficiency: injected idle task is treated as busy so sched ticks
          do not stop during injected idle period, the result of these
          unwanted wakeups can be ~20% loss in power savings.
      
       2. Idle accounting: injected idle time is presented to user as busy.
      
      This patch addresses the issues by introducing a new PF_IDLE flag which
      allows any given task to be treated as idle task while the flag is set.
      Therefore, idle injection tasks can run through the normal flow of NOHZ
      idle enter/exit to get the correct accounting as well as tick stop when
      possible.
      
      The implication is that idle task is then no longer limited to PID == 0.
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NJacob Pan <jacob.jun.pan@linux.intel.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      c1de45ca
  13. 24 11月, 2016 1 次提交
  14. 23 11月, 2016 1 次提交
    • E
      ptrace: Capture the ptracer's creds not PT_PTRACE_CAP · 64b875f7
      Eric W. Biederman 提交于
      When the flag PT_PTRACE_CAP was added the PTRACE_TRACEME path was
      overlooked.  This can result in incorrect behavior when an application
      like strace traces an exec of a setuid executable.
      
      Further PT_PTRACE_CAP does not have enough information for making good
      security decisions as it does not report which user namespace the
      capability is in.  This has already allowed one mistake through
      insufficient granulariy.
      
      I found this issue when I was testing another corner case of exec and
      discovered that I could not get strace to set PT_PTRACE_CAP even when
      running strace as root with a full set of caps.
      
      This change fixes the above issue with strace allowing stracing as
      root a setuid executable without disabling setuid.  More fundamentaly
      this change allows what is allowable at all times, by using the correct
      information in it's decision.
      
      Cc: stable@vger.kernel.org
      Fixes: 4214e42f96d4 ("v2.4.9.11 -> v2.4.9.12")
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      64b875f7
  15. 22 11月, 2016 2 次提交
  16. 21 11月, 2016 1 次提交
  17. 17 11月, 2016 1 次提交
  18. 15 11月, 2016 2 次提交
  19. 31 10月, 2016 1 次提交
    • F
      x86/intel_rdt: Add tasks files · e02737d5
      Fenghua Yu 提交于
      The root directory all subdirectories are automatically populated with a
      read/write (mode 0644) file named "tasks". When read it will show all the
      task IDs assigned to the resource group. Tasks can be added (one at a time)
      to a group by writing the task ID to the file.  E.g.
      
      Membership in a resource group is indicated by a new field in the
      task_struct "int closid" which holds the CLOSID for each task. The default
      resource group uses CLOSID=0 which means that all existing tasks when the
      resctrl file system is mounted belong to the default group.
      
      If a group is removed, tasks which are members of that group are moved to
      the default group.
      Signed-off-by: NFenghua Yu <fenghua.yu@intel.com>
      Cc: "Ravi V Shankar" <ravi.v.shankar@intel.com>
      Cc: "Tony Luck" <tony.luck@intel.com>
      Cc: "Shaohua Li" <shli@fb.com>
      Cc: "Sai Prakhya" <sai.praneeth.prakhya@intel.com>
      Cc: "Peter Zijlstra" <peterz@infradead.org>
      Cc: "Stephane Eranian" <eranian@google.com>
      Cc: "Dave Hansen" <dave.hansen@intel.com>
      Cc: "David Carrillo-Cisneros" <davidcc@google.com>
      Cc: "Nilay Vaish" <nilayvaish@gmail.com>
      Cc: "Vikas Shivappa" <vikas.shivappa@linux.intel.com>
      Cc: "Ingo Molnar" <mingo@elte.hu>
      Cc: "Borislav Petkov" <bp@suse.de>
      Cc: "H. Peter Anvin" <h.peter.anvin@intel.com>
      Link: http://lkml.kernel.org/r/1477692289-37412-8-git-send-email-fenghua.yu@intel.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      e02737d5
  20. 25 10月, 2016 1 次提交
  21. 08 10月, 2016 5 次提交
    • A
      thp: reduce usage of huge zero page's atomic counter · 6fcb52a5
      Aaron Lu 提交于
      The global zero page is used to satisfy an anonymous read fault.  If
      THP(Transparent HugePage) is enabled then the global huge zero page is
      used.  The global huge zero page uses an atomic counter for reference
      counting and is allocated/freed dynamically according to its counter
      value.
      
      CPU time spent on that counter will greatly increase if there are a lot
      of processes doing anonymous read faults.  This patch proposes a way to
      reduce the access to the global counter so that the CPU load can be
      reduced accordingly.
      
      To do this, a new flag of the mm_struct is introduced:
      MMF_USED_HUGE_ZERO_PAGE.  With this flag, the process only need to touch
      the global counter in two cases:
      
       1 The first time it uses the global huge zero page;
       2 The time when mm_user of its mm_struct reaches zero.
      
      Note that right now, the huge zero page is eligible to be freed as soon
      as its last use goes away.  With this patch, the page will not be
      eligible to be freed until the exit of the last process from which it
      was ever used.
      
      And with the use of mm_user, the kthread is not eligible to use huge
      zero page either.  Since no kthread is using huge zero page today, there
      is no difference after applying this patch.  But if that is not desired,
      I can change it to when mm_count reaches zero.
      
      Case used for test on Haswell EP:
      
        usemem -n 72 --readonly -j 0x200000 100G
      
      Which spawns 72 processes and each will mmap 100G anonymous space and
      then do read only access to that space sequentially with a step of 2MB.
      
        CPU cycles from perf report for base commit:
            54.03%  usemem   [kernel.kallsyms]   [k] get_huge_zero_page
        CPU cycles from perf report for this commit:
             0.11%  usemem   [kernel.kallsyms]   [k] mm_get_huge_zero_page
      
      Performance(throughput) of the workload for base commit: 1784430792
      Performance(throughput) of the workload for this commit: 4726928591
      164% increase.
      
      Runtime of the workload for base commit: 707592 us
      Runtime of the workload for this commit: 303970 us
      50% drop.
      
      Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.comSigned-off-by: NAaron Lu <aaron.lu@intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6fcb52a5
    • M
      mm: make sure that kthreads will not refault oom reaped memory · 3f70dc38
      Michal Hocko 提交于
      There are only few use_mm() users in the kernel right now.  Most of them
      write to the target memory but vhost driver relies on
      copy_from_user/get_user from a kernel thread context.  This makes it
      impossible to reap the memory of an oom victim which shares the mm with
      the vhost kernel thread because it could see a zero page unexpectedly
      and theoretically make an incorrect decision visible outside of the
      killed task context.
      
      To quote Michael S. Tsirkin:
      : Getting an error from __get_user and friends is handled gracefully.
      : Getting zero instead of a real value will cause userspace
      : memory corruption.
      
      The vhost kernel thread is bound to an open fd of the vhost device which
      is not tight to the mm owner life cycle in general.  The device fd can
      be inherited or passed over to another process which means that we
      really have to be careful about unexpected memory corruption because
      unlike for normal oom victims the result will be visible outside of the
      oom victim context.
      
      Make sure that no kthread context (users of use_mm) can ever see
      corrupted data because of the oom reaper and hook into the page fault
      path by checking MMF_UNSTABLE mm flag.  __oom_reap_task_mm will set the
      flag before it starts unmapping the address space while the flag is
      checked after the page fault has been handled.  If the flag is set then
      SIGBUS is triggered so any g-u-p user will get a error code.
      
      Regular tasks do not need this protection because all which share the mm
      are killed when the mm is reaped and so the corruption will not outlive
      them.
      
      This patch shouldn't have any visible effect at this moment because the
      OOM killer doesn't invoke oom reaper for tasks with mm shared with
      kthreads yet.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-9-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: N"Michael S. Tsirkin" <mst@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3f70dc38
    • M
      mm, oom: get rid of signal_struct::oom_victims · 862e3073
      Michal Hocko 提交于
      After "oom: keep mm of the killed task available" we can safely detect
      an oom victim by checking task->signal->oom_mm so we do not need the
      signal_struct counter anymore so let's get rid of it.
      
      This alone wouldn't be sufficient for nommu archs because
      exit_oom_victim doesn't hide the process from the oom killer anymore.
      We can, however, mark the mm with a MMF flag in __mmput.  We can reuse
      MMF_OOM_REAPED and rename it to a more generic MMF_OOM_SKIP.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-6-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      862e3073
    • M
      kernel, oom: fix potential pgd_lock deadlock from __mmdrop · 7283094e
      Michal Hocko 提交于
      Lockdep complains that __mmdrop is not safe from the softirq context:
      
        =================================
        [ INFO: inconsistent lock state ]
        4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949 Tainted: G        W
        ---------------------------------
        inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
        swapper/1/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
         (pgd_lock){+.?...}, at: pgd_free+0x19/0x6b
        {SOFTIRQ-ON-W} state was registered at:
           __lock_acquire+0xa06/0x196e
           lock_acquire+0x139/0x1e1
           _raw_spin_lock+0x32/0x41
           __change_page_attr_set_clr+0x2a5/0xacd
           change_page_attr_set_clr+0x16f/0x32c
           set_memory_nx+0x37/0x3a
           free_init_pages+0x9e/0xc7
           alternative_instructions+0xa2/0xb3
           check_bugs+0xe/0x2d
           start_kernel+0x3ce/0x3ea
           x86_64_start_reservations+0x2a/0x2c
           x86_64_start_kernel+0x17a/0x18d
        irq event stamp: 105916
        hardirqs last  enabled at (105916): free_hot_cold_page+0x37e/0x390
        hardirqs last disabled at (105915): free_hot_cold_page+0x2c1/0x390
        softirqs last  enabled at (105878): _local_bh_enable+0x42/0x44
        softirqs last disabled at (105879): irq_exit+0x6f/0xd1
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      
               CPU0
               ----
          lock(pgd_lock);
          <Interrupt>
            lock(pgd_lock);
      
         *** DEADLOCK ***
      
        1 lock held by swapper/1/0:
         #0:  (rcu_callback){......}, at: rcu_process_callbacks+0x390/0x800
      
        stack backtrace:
        CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W       4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
        Call Trace:
         <IRQ>
          print_usage_bug.part.25+0x259/0x268
          mark_lock+0x381/0x567
          __lock_acquire+0x993/0x196e
          lock_acquire+0x139/0x1e1
          _raw_spin_lock+0x32/0x41
          pgd_free+0x19/0x6b
          __mmdrop+0x25/0xb9
          __put_task_struct+0x103/0x11e
          delayed_put_task_struct+0x157/0x15e
          rcu_process_callbacks+0x660/0x800
          __do_softirq+0x1ec/0x4d5
          irq_exit+0x6f/0xd1
          smp_apic_timer_interrupt+0x42/0x4d
          apic_timer_interrupt+0x8e/0xa0
         <EOI>
          arch_cpu_idle+0xf/0x11
          default_idle_call+0x32/0x34
          cpu_startup_entry+0x20c/0x399
          start_secondary+0xfe/0x101
      
      More over commit a79e53d8 ("x86/mm: Fix pgd_lock deadlock") was
      explicit about pgd_lock not to be called from the irq context.  This
      means that __mmdrop called from free_signal_struct has to be postponed
      to a user context.  We already have a similar mechanism for mmput_async
      so we can use it here as well.  This is safe because mm_count is pinned
      by mm_users.
      
      This fixes bug introduced by "oom: keep mm of the killed task available"
      
      Link: http://lkml.kernel.org/r/1472119394-11342-5-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7283094e
    • M
      oom: keep mm of the killed task available · 26db62f1
      Michal Hocko 提交于
      oom_reap_task has to call exit_oom_victim in order to make sure that the
      oom vicim will not block the oom killer for ever.  This is, however,
      opening new problems (e.g oom_killer_disable exclusion - see commit
      74070542 ("oom, suspend: fix oom_reaper vs.  oom_killer_disable
      race")).  exit_oom_victim should be only called from the victim's
      context ideally.
      
      One way to achieve this would be to rely on per mm_struct flags.  We
      already have MMF_OOM_REAPED to hide a task from the oom killer since
      "mm, oom: hide mm which is shared with kthread or global init". The
      problem is that the exit path:
      
        do_exit
          exit_mm
            tsk->mm = NULL;
            mmput
              __mmput
            exit_oom_victim
      
      doesn't guarantee that exit_oom_victim will get called in a bounded
      amount of time.  At least exit_aio depends on IO which might get blocked
      due to lack of memory and who knows what else is lurking there.
      
      This patch takes a different approach.  We remember tsk->mm into the
      signal_struct and bind it to the signal struct life time for all oom
      victims.  __oom_reap_task_mm as well as oom_scan_process_thread do not
      have to rely on find_lock_task_mm anymore and they will have a reliable
      reference to the mm struct.  As a result all the oom specific
      communication inside the OOM killer can be done via tsk->signal->oom_mm.
      
      Increasing the signal_struct for something as unlikely as the oom killer
      is far from ideal but this approach will make the code much more
      reasonable and long term we even might want to move task->mm into the
      signal_struct anyway.  In the next step we might want to make the oom
      killer exclusion and access to memory reserves completely independent
      which would be also nice.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      26db62f1