1. October 28, 2014 (12 commits)
    • sched/numa: Calculate node scores in complex NUMA topologies · 6c6b1193
      Authored by Rik van Riel
      In order to do task placement on systems with complex NUMA topologies,
      it is necessary to count the faults on nodes nearby the node that is
      being examined for a potential move.
      
      In case of a system with a backplane interconnect, we are dealing with
      groups of NUMA nodes; each of the nodes within a group is the same number
      of hops away from nodes in other groups in the system. Optimal placement
      on this topology is achieved by counting all nearby nodes equally. When
      comparing nodes A and B at distance N, nearby nodes are those at distances
      smaller than N from nodes A or B.
      
      Placement strategy on a system with a glueless mesh NUMA topology needs
      to be different, because there are no natural groups of nodes determined
      by the hardware. Instead, when dealing with two nodes A and B at distance
      N, N >= 2, there will be intermediate nodes at distance < N from both nodes
      A and B. Good placement can be achieved by right shifting the faults on
      nearby nodes by the number of hops from the node being scored. In this
      context, a nearby node is any node closer than the maximum distance
      in the system. Nodes at the maximum distance are skipped for
      efficiency reasons; there is no real policy reason to do so.
      
      Placement policy on directly connected NUMA systems is not affected.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Tested-by: Chegu Vinod <chegu_vinod@hp.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: mgorman@suse.de
      Cc: chegu_vinod@hp.com
      Link: http://lkml.kernel.org/r/1413530994-9732-5-git-send-email-riel@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6c6b1193
    • sched/numa: Prepare for complex topology placement · 7bd95320
      Authored by Rik van Riel
      Preparatory patch for adding NUMA placement on systems with
      complex NUMA topology. Also fix a potential divide by zero
      in group_weight().
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Tested-by: Chegu Vinod <chegu_vinod@hp.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: mgorman@suse.de
      Cc: chegu_vinod@hp.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1413530994-9732-4-git-send-email-riel@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7bd95320
    • sched/numa: Classify the NUMA topology of a system · e3fe70b1
      Authored by Rik van Riel
      Smaller NUMA systems tend to have all NUMA nodes directly connected
      to each other. This includes the degenerate case of a system with just
      one node, i.e. a non-NUMA system.
      
      Larger systems can have two kinds of NUMA topology, which affects how
      tasks and memory should be placed on the system.
      
      On glueless mesh systems, nodes that are not directly connected to
      each other will bounce traffic through intermediary nodes. Task groups
      can be run closer to each other by moving tasks from a node to an
      intermediary node between it and the task's preferred node.
      
      On NUMA systems with backplane controllers, the intermediary hops
      are incapable of running programs. This creates "islands" of nodes
      that are at an equal distance to anywhere else in the system.
      
      Each kind of topology requires a slightly different placement
      algorithm; this patch provides the mechanism to detect the kind
      of NUMA topology of a system.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Tested-by: Chegu Vinod <chegu_vinod@hp.com>
      [ Changed to use kernel/sched/sched.h ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: mgorman@suse.de
      Cc: chegu_vinod@hp.com
      Link: http://lkml.kernel.org/r/1413530994-9732-3-git-send-email-riel@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e3fe70b1
    • sched/numa: Export info needed for NUMA balancing on complex topologies · 9942f79b
      Authored by Rik van Riel
      Export some information that is necessary to do placement of
      tasks on systems with multi-level NUMA topologies.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: mgorman@suse.de
      Cc: chegu_vinod@hp.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1413530994-9732-2-git-send-email-riel@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9942f79b
    • sched/dl: Fix preemption checks · f3a7e1a9
      Authored by Kirill Tkhai
      1) The switched_to_dl() check is wrong. We reschedule only
         if rq->curr is a deadline task, and we do not reschedule
         if it's a lower-priority task. But we must always
         preempt a task of another class.
      
      2) dl_task_timer():
         Policy does not change in case of priority inheritance.
         rt_mutex_setprio() changes prio, while policy remains old.
      
      So we lose some balancing logic in dl_task_timer() and
      switched_to_dl() when we check policy instead of priority: a boosted
      task may be rq->curr.
      
      (I didn't change switched_from_dl() because no check is necessary
      there at all.)
      
      I've looked at this place (switched_to_dl()) several times and even
      fixed this function before, but only spotted this now... I suppose some
      performance tests may work better after this.
      Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Juri Lelli <juri.lelli@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1413909356.19914.128.camel@tkhai
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f3a7e1a9
    • sched: stop the unbound recursion in preempt_schedule_context() · 009f60e2
      Authored by Oleg Nesterov
      preempt_schedule_context() does preempt_enable_notrace() at the end
      and this can call the same function again; exception_exit() is heavy
      and it is quite possible that need-resched is true again.
      
      1. Change this code to dec preempt_count() and check need_resched()
         by hand.
      
      2. As Linus suggested, we can use the PREEMPT_ACTIVE bit and avoid
         the enable/disable dance around __schedule(). But in this case
         we need to move it into sched/core.c.
      
      3. Cosmetic, but x86 forgets to declare this function. This doesn't
         really matter because it is only called by asm helpers; still, it
         makes sense to add the declaration into asm/preempt.h to match
         preempt_schedule().
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Graf <agraf@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Chuck Ebbert <cebbert.lkml@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Link: http://lkml.kernel.org/r/20141005202322.GB27962@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      009f60e2
    • sched/fair: Fix division by zero sysctl_numa_balancing_scan_size · 64192658
      Authored by Kirill Tkhai
      File /proc/sys/kernel/numa_balancing_scan_size_mb allows writing of zero.
      
      This bash command reproduces the problem:
      
      $ while :; do echo 0 > /proc/sys/kernel/numa_balancing_scan_size_mb; \
      	   echo 256 > /proc/sys/kernel/numa_balancing_scan_size_mb; done
      
      	divide error: 0000 [#1] SMP
      	Modules linked in:
      	CPU: 0 PID: 24112 Comm: bash Not tainted 3.17.0+ #8
      	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      	task: ffff88013c852600 ti: ffff880037a68000 task.ti: ffff880037a68000
      	RIP: 0010:[<ffffffff81074191>]  [<ffffffff81074191>] task_scan_min+0x21/0x50
      	RSP: 0000:ffff880037a6bce0  EFLAGS: 00010246
      	RAX: 0000000000000a00 RBX: 00000000000003e8 RCX: 0000000000000000
      	RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88013c852600
      	RBP: ffff880037a6bcf0 R08: 0000000000000001 R09: 0000000000015c90
      	R10: ffff880239bf6c00 R11: 0000000000000016 R12: 0000000000003fff
      	R13: ffff88013c852600 R14: ffffea0008d1b000 R15: 0000000000000003
      	FS:  00007f12bb048700(0000) GS:ffff88007da00000(0000) knlGS:0000000000000000
      	CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      	CR2: 0000000001505678 CR3: 0000000234770000 CR4: 00000000000006f0
      	Stack:
      	 ffff88013c852600 0000000000003fff ffff880037a6bd18 ffffffff810741d1
      	 ffff88013c852600 0000000000003fff 000000000002bfff ffff880037a6bda8
      	 ffffffff81077ef7 ffffea0008a56d40 0000000000000001 0000000000000001
      	Call Trace:
      	 [<ffffffff810741d1>] task_scan_max+0x11/0x40
      	 [<ffffffff81077ef7>] task_numa_fault+0x1f7/0xae0
      	 [<ffffffff8115a896>] ? migrate_misplaced_page+0x276/0x300
      	 [<ffffffff81134a4d>] handle_mm_fault+0x62d/0xba0
      	 [<ffffffff8103e2f1>] __do_page_fault+0x191/0x510
      	 [<ffffffff81030122>] ? native_smp_send_reschedule+0x42/0x60
      	 [<ffffffff8106dc00>] ? check_preempt_curr+0x80/0xa0
      	 [<ffffffff8107092c>] ? wake_up_new_task+0x11c/0x1a0
      	 [<ffffffff8104887d>] ? do_fork+0x14d/0x340
      	 [<ffffffff811799bb>] ? get_unused_fd_flags+0x2b/0x30
      	 [<ffffffff811799df>] ? __fd_install+0x1f/0x60
      	 [<ffffffff8103e67c>] do_page_fault+0xc/0x10
      	 [<ffffffff8150d322>] page_fault+0x22/0x30
      	RIP  [<ffffffff81074191>] task_scan_min+0x21/0x50
      	RSP <ffff880037a6bce0>
      	---[ end trace 9a826d16936c04de ]---
      
      Also fix a race in task_scan_min() (it depends on compiler behaviour).
      Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dario Faggioli <raistlin@linux.it>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Link: http://lkml.kernel.org/r/1413455977.24793.78.camel@tkhai
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      64192658
    • sched/fair: Care divide error in update_task_scan_period() · 2847c90e
      Authored by Yasuaki Ishimatsu
      While offlining a node by hot-removing memory, the following divide
      error occurs:
      
        divide error: 0000 [#1] SMP
        [...]
        Call Trace:
         [...] handle_mm_fault
         [...] ? try_to_wake_up
         [...] ? wake_up_state
         [...] __do_page_fault
         [...] ? do_futex
         [...] ? put_prev_entity
         [...] ? __switch_to
         [...] do_page_fault
         [...] page_fault
        [...]
        RIP  [<ffffffff810a7081>] task_numa_fault
         RSP <ffff88084eb2bcb0>
      
      The issue occurs as follows:
        1. When a page fault occurs and the page is allocated from node 1,
           task_struct->numa_faults_buffer_memory[] of node 1 is
           incremented and p->numa_faults_locality[] is also incremented
           as follows:
      
           o numa_faults_buffer_memory[]       o numa_faults_locality[]
                    NR_NUMA_HINT_FAULT_TYPES
                   |      0     |     1     |
           ----------------------------------  ----------------------
            node 0 |      0     |     0     |   remote |      0     |
            node 1 |      0     |     1     |   local  |      1     |
           ----------------------------------  ----------------------
      
        2. node 1 is offlined by hot removing memory.
      
        3. When a page fault occurs, fault_types[] is calculated using
           p->numa_faults_buffer_memory[] of all online nodes in
           task_numa_placement(). But node 1 was offlined in step 2, so
           fault_types[] is calculated using only
           p->numa_faults_buffer_memory[] of node 0, and both entries of
           fault_types[] end up 0.
      
        4. The zero values of fault_types[] are passed to update_task_scan_period().
      
        5. numa_faults_locality[1] is set to 1, so the following division
           is performed:
      
              static void update_task_scan_period(struct task_struct *p,
                                      unsigned long shared, unsigned long private){
              ...
                      ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
              }
      
        6. But both private and shared are 0, so the divide error
           occurs here.
      
      The divide error is a rare case because the trigger is node offlining.
      This patch always increments the denominator to avoid the divide error.
      Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/54475703.8000505@jp.fujitsu.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2847c90e
    • sched/numa: Fix unsafe get_task_struct() in task_numa_assign() · 1effd9f1
      Authored by Kirill Tkhai
      Unlocked access to dst_rq->curr in task_numa_compare() is racy.
      If the curr task is exiting, this may lead to a use-after-free:
      
      task_numa_compare()                    do_exit()
          ...                                        current->flags |= PF_EXITING;
          ...                                    release_task()
          ...                                        ~~delayed_put_task_struct()~~
          ...                                    schedule()
          rcu_read_lock()                        ...
          cur = ACCESS_ONCE(dst_rq->curr)        ...
              ...                                rq->curr = next;
              ...                                    context_switch()
              ...                                        finish_task_switch()
              ...                                            put_task_struct()
              ...                                                __put_task_struct()
              ...                                                    free_task_struct()
              task_numa_assign()                                     ...
                  get_task_struct()                                  ...
      
      As noted by Oleg:
      
        <<The lockless get_task_struct(tsk) is only safe if tsk == current
          and didn't pass exit_notify(), or if this tsk was found on a rcu
          protected list (say, for_each_process() or find_task_by_vpid()).
          IOW, it is only safe if release_task() was not called before we
          take rcu_read_lock(), in this case we can rely on the fact that
          delayed_put_pid() can not drop the (potentially) last reference
          until rcu_read_unlock().
      
          And as Kirill pointed out task_numa_compare()->task_numa_assign()
          path does get_task_struct(dst_rq->curr) and this is not safe. The
          task_struct itself can't go away, but rcu_read_lock() can't save
          us from the final put_task_struct() in finish_task_switch(); this
          reference goes away without rcu gp>>
      
      The patch adds a simple check of the PF_EXITING flag. If it's not set,
      this guarantees that call_rcu() of the delayed_put_task_struct()
      callback hasn't happened yet, so we can safely do get_task_struct()
      in task_numa_assign().
      
      The locked dst_rq->lock protects against concurrency with the last
      schedule(). Reuse or unmapping of cur's memory may happen without it.
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1413962231.19914.130.camel@tkhai
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1effd9f1
    • sched/deadline: Fix races between rt_mutex_setprio() and dl_task_timer() · aee38ea9
      Authored by Juri Lelli
      dl_task_timer() is racy against several paths. Daniel noticed that
      the replenishment timer may experience a race condition against an
      enqueue_dl_entity() called from rt_mutex_setprio(). With his own
      words:
      
       rt_mutex_setprio() resets p->dl.dl_throttled. So the pattern is:
       start_dl_timer() throttled = 1, rt_mutex_setprio() throttled = 0,
       sched_switch() -> enqueue_task(), dl_task_timer -> enqueue_task(),
       throttled is 0
      
      => BUG_ON(on_dl_rq(dl_se)) fires as the scheduling entity is already
      enqueued on the -deadline runqueue.
      
      As we do for the other races, we just bail out in the replenishment
      timer code.
      Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: Juri Lelli <juri.lelli@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: vincent@legout.info
      Cc: Dario Faggioli <raistlin@linux.it>
      Cc: Michael Trimarchi <michael@amarulasolutions.com>
      Cc: Fabio Checconi <fchecconi@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1414142198-18552-5-git-send-email-juri.lelli@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      aee38ea9
    • sched/deadline: Don't replenish from a !SCHED_DEADLINE entity · 64be6f1f
      Authored by Juri Lelli
      In the deboost path, right after the dl_boosted flag has been
      reset, we can currently end up replenishing using -deadline
      parameters of a !SCHED_DEADLINE entity. This of course causes
      a bug, as those parameters are empty.
      
      In the case depicted above it is safe to simply bail out, as
      the deboosted task is going to be back to its original scheduling
      class anyway.
      Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: Juri Lelli <juri.lelli@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: vincent@legout.info
      Cc: Dario Faggioli <raistlin@linux.it>
      Cc: Michael Trimarchi <michael@amarulasolutions.com>
      Cc: Fabio Checconi <fchecconi@gmail.com>
      Link: http://lkml.kernel.org/r/1414142198-18552-4-git-send-email-juri.lelli@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      64be6f1f
    • sched: Fix race between task_group and sched_task_group · eeb61e53
      Authored by Kirill Tkhai
      The race may happen when somebody changes the task_group of a forking
      task. The child's cgroup is the same as the parent's after
      dup_task_struct() (it is just a memory copy), and its cfs_rq and rt_rq
      are the same as the parent's too.
      
      But if the parent changes its task_group before cgroup_post_fork() is
      called, this is not reflected in the child: the child's cfs_rq and rt_rq
      remain the old ones, while its task_group changes in cgroup_post_fork().
      
      To fix this we introduce a fork() method which calls sched_move_task()
      directly. This function updates sched_task_group appropriately (its
      logic also has no problem with freshly created tasks, so we don't need
      anything special; we can just use it).
      
      Possibly, this resolves Burke Libbey's problem: https://lkml.org/lkml/2014/10/24/456
      Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1414405105.19914.169.camel@tkhai
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      eeb61e53
  2. October 23, 2014 (1 commit)
  3. October 22, 2014 (4 commits)
    • PM: convert do_each_thread to for_each_process_thread · a28e785a
      Authored by Michal Hocko
      As per 0c740d0a (introduce for_each_thread() to replace the buggy
      while_each_thread()), get rid of the do_each_thread { } while_each_thread()
      construct and replace it with the less error-prone for_each_thread().
      
      This patch doesn't introduce any user visible change.
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      a28e785a
    • OOM, PM: OOM killed task shouldn't escape PM suspend · 5695be14
      Authored by Michal Hocko
      The PM freezer relies on having all tasks frozen by the time devices are
      getting frozen, so that no task will touch them while they are being
      frozen. But the OOM killer is allowed to kill an already-frozen task in
      order to handle an OOM situation. In order to protect against late
      wake-ups, the OOM killer is disabled after all tasks are frozen. This,
      however, still leaves a window open where a killed task didn't manage to
      die by the time freeze_processes() finishes.
      
      Reduce the race window by checking all tasks after the OOM killer has
      been disabled. Unfortunately, this is still not completely race free,
      because oom_killer_disable cannot stop an already ongoing OOM kill, so a
      task might still wake up from the fridge and get killed without
      freeze_processes() noticing. Full synchronization of OOM and the freezer
      is, however, too heavyweight for this highly unlikely case.
      
      Introduce and check an oom_kills counter which gets incremented early
      when the allocator enters the __alloc_pages_may_oom path, and only check
      all the tasks if the counter changes during the freezing attempt. The
      counter is updated this early to reduce the race window, since the
      allocator has already checked oom_killer_disabled, which is set by the
      PM freezing code. A false positive will push the PM freezer onto a slow
      path, but that is not a big deal.
      
      Changes since v1
      - push the re-check loop out of freeze_processes into
        check_frozen_processes and invert the condition to make the code more
        readable as per Rafael
      
      Fixes: f660daac (oom: thaw threads if oom killed thread is frozen before deferring)
      Cc: 3.2+ <stable@vger.kernel.org> # 3.2+
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      5695be14
    • freezer: remove obsolete comments in __thaw_task() · c05eb32f
      Authored by Cong Wang
      __thaw_task() no longer clears the frozen flag since commit a3201227
      (freezer: make freezing() test freeze conditions in effect instead of TIF_FREEZE).
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      c05eb32f
    • freezer: Do not freeze tasks killed by OOM killer · 51fae6da
      Authored by Cong Wang
      Since f660daac (oom: thaw threads if oom killed thread is frozen
      before deferring) the OOM killer relies on being able to thaw a frozen
      task to handle an OOM situation, but a3201227 (freezer: make freezing()
      test freeze conditions in effect instead of TIF_FREEZE) reorganized the
      code and stopped clearing the freeze flag in __thaw_task(). This means
      that the target task only wakes up and goes into the fridge again,
      because the freezing condition hasn't changed for it. This reintroduces
      the bug fixed by f660daac.
      
      Fix the issue by checking for TIF_MEMDIE thread flag in
      freezing_slow_path and exclude the task from freezing completely. If a
      task was already frozen it would get woken by __thaw_task from OOM killer
      and get out of freezer after rechecking freezing().
      
      Changes since v1
      - put TIF_MEMDIE check into freezing_slowpath rather than in __refrigerator
        as per Oleg
      - return __thaw_task into oom_scan_process_thread because
        oom_kill_process will not wake task in the fridge because it is
        sleeping uninterruptible
      
      [mhocko@suse.cz: rewrote the changelog]
      Fixes: a3201227 (freezer: make freezing() test freeze conditions in effect instead of TIF_FREEZE)
      Cc: 3.3+ <stable@vger.kernel.org> # 3.3+
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      51fae6da
  4. October 19, 2014 (1 commit)
    • futex: Ensure get_futex_key_refs() always implies a barrier · 76835b0e
      Authored by Catalin Marinas
      Commit b0c29f79 (futexes: Avoid taking the hb->lock if there's
      nothing to wake up) changes the futex code to avoid taking a lock when
      there are no waiters. This code has been subsequently fixed in commit
      11d4616b (futex: revert back to the explicit waiter counting code).
      Both the original commit and the fix-up rely on get_futex_key_refs() to
      always imply a barrier.
      
      However, for private futexes, none of the cases in the switch statement
      of get_futex_key_refs() would be hit and the function completes without
      a memory barrier as required before checking the "waiters" in
      futex_wake() -> hb_waiters_pending(). The consequence is a race with a
      thread waiting on a futex on another CPU, allowing the waker thread to
      read "waiters == 0" while the waiter thread has read "futex_val ==
      locked" (in the kernel).
      
      Without this fix, the problem (user space deadlocks) can be seen with
      Android bionic's mutex implementation on an arm64 multi-cluster system.
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Reported-by: Matteo Franchin <Matteo.Franchin@arm.com>
      Fixes: b0c29f79 (futexes: Avoid taking the hb->lock if there's nothing to wake up)
      Acked-by: Davidlohr Bueso <dave@stgolabs.net>
      Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: <stable@vger.kernel.org>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      76835b0e
  5. October 15, 2014 (1 commit)
    • modules, lock around setting of MODULE_STATE_UNFORMED · d3051b48
      Authored by Prarit Bhargava
      A panic was seen in the following situation.
      
      There are two threads running on the system. The first thread is a system
      monitoring thread that is reading /proc/modules. The second thread is
      loading and unloading a module (in this example I'm using my simple
      dummy-module.ko).  Note, in the "real world" this occurred with the qlogic
      driver module.
      
      When doing this, the following panic occurred:
      
       ------------[ cut here ]------------
       kernel BUG at kernel/module.c:3739!
       invalid opcode: 0000 [#1] SMP
       Modules linked in: binfmt_misc sg nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw igb gf128mul glue_helper iTCO_wdt iTCO_vendor_support ablk_helper ptp sb_edac cryptd pps_core edac_core shpchp i2c_i801 pcspkr wmi lpc_ich ioatdma mfd_core dca ipmi_si nfsd ipmi_msghandler auth_rpcgss nfs_acl lockd sunrpc xfs libcrc32c sr_mod cdrom sd_mod crc_t10dif crct10dif_common mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ttm isci drm libsas ahci libahci scsi_transport_sas libata i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: dummy_module]
       CPU: 37 PID: 186343 Comm: cat Tainted: GF          O--------------   3.10.0+ #7
       Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.00.29.D696.1311111329 11/11/2013
       task: ffff8807fd2d8000 ti: ffff88080fa7c000 task.ti: ffff88080fa7c000
       RIP: 0010:[<ffffffff810d64c5>]  [<ffffffff810d64c5>] module_flags+0xb5/0xc0
       RSP: 0018:ffff88080fa7fe18  EFLAGS: 00010246
       RAX: 0000000000000003 RBX: ffffffffa03b5200 RCX: 0000000000000000
       RDX: 0000000000001000 RSI: ffff88080fa7fe38 RDI: ffffffffa03b5000
       RBP: ffff88080fa7fe28 R08: 0000000000000010 R09: 0000000000000000
       R10: 0000000000000000 R11: 000000000000000f R12: ffffffffa03b5000
       R13: ffffffffa03b5008 R14: ffffffffa03b5200 R15: ffffffffa03b5000
       FS:  00007f6ae57ef740(0000) GS:ffff88101e7a0000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000404f70 CR3: 0000000ffed48000 CR4: 00000000001407e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
       Stack:
        ffffffffa03b5200 ffff8810101e4800 ffff88080fa7fe70 ffffffff810d666c
        ffff88081e807300 000000002e0f2fbf 0000000000000000 ffff88100f257b00
        ffffffffa03b5008 ffff88080fa7ff48 ffff8810101e4800 ffff88080fa7fee0
       Call Trace:
        [<ffffffff810d666c>] m_show+0x19c/0x1e0
        [<ffffffff811e4d7e>] seq_read+0x16e/0x3b0
        [<ffffffff812281ed>] proc_reg_read+0x3d/0x80
        [<ffffffff811c0f2c>] vfs_read+0x9c/0x170
        [<ffffffff811c1a58>] SyS_read+0x58/0xb0
        [<ffffffff81605829>] system_call_fastpath+0x16/0x1b
       Code: 48 63 c2 83 c2 01 c6 04 03 29 48 63 d2 eb d9 0f 1f 80 00 00 00 00 48 63 d2 c6 04 13 2d 41 8b 0c 24 8d 50 02 83 f9 01 75 b2 eb cb <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41
       RIP  [<ffffffff810d64c5>] module_flags+0xb5/0xc0
        RSP <ffff88080fa7fe18>
      
          Consider the two processes running on the system.
      
          CPU 0 (/proc/modules reader)
          CPU 1 (loading/unloading module)
      
          CPU 0 opens /proc/modules, and starts displaying data for each module by
          traversing the modules list via fs/seq_file.c:seq_open() and
          fs/seq_file.c:seq_read().  For each module in the modules list, seq_read
          does
      
                  op->start()  <-- this is a pointer to m_start()
                  op->show()   <-- this is a pointer to m_show()
                  op->stop()   <-- this is a pointer to m_stop()
      
          The m_start(), m_show(), and m_stop() module functions are defined in
          kernel/module.c. The m_start() and m_stop() functions acquire and release
          the module_mutex respectively.
      
          ie) When reading /proc/modules, the module_mutex is acquired and released
          for each module.
      
          m_show() is called with the module_mutex held.  It accesses the module
          struct data and attempts to write out module data.  It is in this code
          path that the above BUG_ON() warning is encountered, specifically m_show()
          calls
      
          static char *module_flags(struct module *mod, char *buf)
          {
                  int bx = 0;
      
                  BUG_ON(mod->state == MODULE_STATE_UNFORMED);
          ...
      
          The other thread, CPU 1, in unloading the module calls the syscall
          delete_module() defined in kernel/module.c.  The module_mutex is acquired
          for a short time, and then released.  free_module() is called without the
          module_mutex.  free_module() then sets mod->state = MODULE_STATE_UNFORMED,
          also without the module_mutex.  Some additional code is called and then the
          module_mutex is reacquired to remove the module from the modules list:
      
              /* Now we can delete it from the lists */
              mutex_lock(&module_mutex);
              stop_machine(__unlink_module, mod, NULL);
              mutex_unlock(&module_mutex);
      
      This is the sequence of events that leads to the panic.
      
      CPU 1 is removing dummy_module via delete_module().  It acquires the
      module_mutex, and then releases it.  CPU 1 has NOT set dummy_module->state to
      MODULE_STATE_UNFORMED yet.
      
      CPU 0, which is reading the /proc/modules, acquires the module_mutex and
      acquires a pointer to the dummy_module which is still in the modules list.
CPU 0 calls m_show() for dummy_module.  The check in m_show() for
MODULE_STATE_UNFORMED passes for dummy_module even though it is being
torn down.
      
      Meanwhile CPU 1, which has been continuing to remove dummy_module without
      holding the module_mutex, now calls free_module() and sets
      dummy_module->state to MODULE_STATE_UNFORMED.
      
      CPU 0 now calls module_flags() with dummy_module and ...
      
      static char *module_flags(struct module *mod, char *buf)
      {
              int bx = 0;
      
              BUG_ON(mod->state == MODULE_STATE_UNFORMED);
      
      and BOOM.
      
      Acquire and release the module_mutex lock around the setting of
      MODULE_STATE_UNFORMED in the teardown path, which should resolve the
      problem.
      
      Testing: In the unpatched kernel I can panic the system within 1 minute by
      doing
      
while true; do insmod dummy_module.ko; rmmod dummy_module.ko; done
      
      and
      
while true; do cat /proc/modules; done
      
      in separate terminals.
      
In the patched kernel I was able to run for just over one hour without seeing
any issues.  I also verified the output of the panic via sysrq-c, and the
output of /proc/modules looks correct for all three states of the dummy_module.
      
              dummy_module 12661 0 - Unloading 0xffffffffa03a5000 (OE-)
              dummy_module 12661 0 - Live 0xffffffffa03bb000 (OE)
              dummy_module 14015 1 - Loading 0xffffffffa03a5000 (OE+)
      Signed-off-by: NPrarit Bhargava <prarit@redhat.com>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      Cc: stable@kernel.org
      d3051b48
6. 14 Oct 2014, 9 commits
7. 11 Oct 2014, 4 commits
8. 10 Oct 2014, 8 commits
    • S
      kernel/sys.c: compat sysinfo syscall: fix undefined behavior · 0baae41e
Committed by Scotty Bauer
Fix undefined behavior and a compiler warning by replacing a right shift
by 32 with the upper_32_bits() macro.
      Signed-off-by: NScotty Bauer <sbauer@eng.utah.edu>
      Cc: Clemens Ladisch <clemens@ladisch.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0baae41e
    • V
      kernel/sys.c: whitespace fixes · ec94fc3d
Committed by vishnu.ps
Fix minor errors and warning messages in kernel/sys.c.  These errors were
reported by checkpatch while I was working on some modifications to sys.c.
Fixing these first will help me improve my further patches.
      
      ERROR: trailing whitespace - 9
      ERROR: do not use assignment in if condition - 4
      ERROR: spaces required around that '?' (ctx:VxO) - 10
      ERROR: switch and case should be at the same indent - 3
      
In total, 26 errors and 3 warnings were fixed.
      Signed-off-by: Nvishnu.ps <vishnu.ps@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ec94fc3d
    • Y
      acct: eliminate compile warning · 067b722f
Committed by Ying Xue
If ACCT_VERSION is not defined as 3, the warning below appears:
        CC      kernel/acct.o
        kernel/acct.c: In function `do_acct_process':
        kernel/acct.c:475:24: warning: unused variable `ns' [-Wunused-variable]
      
[akpm@linux-foundation.org: retain the local for code size improvements]
      Signed-off-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      067b722f
    • I
      kernel/async.c: switch to pr_foo() · 27fb10ed
Committed by Ionut Alexa
      Signed-off-by: NIonut Alexa <ionut.m.alexa@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      27fb10ed
    • S
      mm: use VM_BUG_ON_MM where possible · 96dad67f
Committed by Sasha Levin
Dump the contents of the relevant struct mm_struct when we hit the bug condition.
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      96dad67f
    • O
      mempolicy: remove the "task" arg of vma_policy_mof() and simplify it · 6b6482bb
Committed by Oleg Nesterov
1. vma_policy_mof(task) is simply not safe unless task == current;
   it can race with do_exit()->mpol_put(). Remove this arg and update
   its single caller.

2. vma cannot be NULL; remove this check and simplify the code.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6b6482bb
    • J
      mm: remove noisy remainder of the scan_unevictable interface · 1f13ae39
Committed by Johannes Weiner
The deprecation warnings for the scan_unevictable interface are triggered by
scripts doing `sysctl -a | grep something else'.  This is annoying and not
helpful.
      
The interface has been defunct since 264e56d8 ("mm: disable user
interface to manually rescue unevictable pages"), which was in 2011, and
there haven't been any reports of use cases for it, only reports that the
deprecation warnings are annoying.  It's unlikely that anybody is using
this interface specifically at this point, so remove it.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1f13ae39
    • C
      prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation · f606b77f
Committed by Cyrill Gorcunov
During development of c/r we've noticed that, in case we need to support
user namespaces, we face a problem with capabilities in the prctl(PR_SET_MM,
...) call; in particular, once a new user namespace is created,
capable(CAP_SYS_RESOURCE) no longer passes.

One approach is to eliminate the CAP_SYS_RESOURCE check but pass all new
values in one bundle, which would allow the kernel to make a more intensive
sanity test of the values and at the same time allow us to support
checkpoint/restore of user namespaces.
      
Thus a new command, PR_SET_MM_MAP, is introduced.  It takes a pointer to a
prctl_mm_map structure which carries all the members to be updated.
      
      	prctl(PR_SET_MM, PR_SET_MM_MAP, struct prctl_mm_map *, size)
      
      	struct prctl_mm_map {
      		__u64	start_code;
      		__u64	end_code;
      		__u64	start_data;
      		__u64	end_data;
      		__u64	start_brk;
      		__u64	brk;
      		__u64	start_stack;
      		__u64	arg_start;
      		__u64	arg_end;
      		__u64	env_start;
      		__u64	env_end;
      		__u64	*auxv;
      		__u32	auxv_size;
      		__u32	exe_fd;
      	};
      
All members except @exe_fd correspond to members of struct mm_struct.  To
clarify which values these members may take, here are their meanings.
      
       - start_code, end_code: represent bounds of executable code area
       - start_data, end_data: represent bounds of data area
       - start_brk, brk: used to calculate bounds for brk() syscall
       - start_stack: used when accounting space needed for command
         line arguments, environment and shmat() syscall
       - arg_start, arg_end, env_start, env_end: represent memory area
         supplied for command line arguments and environment variables
       - auxv, auxv_size: carries auxiliary vector, Elf format specifics
       - exe_fd: file descriptor number for executable link (/proc/self/exe)
      
Thus we apply the following requirements to the values:
      
1) Any member except @auxv, @auxv_size and @exe_fd is an address in user
   space, thus it must lie inside the [mmap_min_addr, mmap_max_addr)
   interval.
      
2) While @[start|end]_code and @[start|end]_data may point to nonexistent
   VMAs (say, a program maps its own new .text and .data segments during
   execution), the rest of the members should belong to a VMA which must exist.
      
3) Addresses must be ordered, i.e. a @start_ member must not be greater
   than or equal to the corresponding @end_ member.
      
4) As in the regular ELF loading procedure, we require that @start_brk and
   @brk be greater than @end_data.
      
5) If the RLIMIT_DATA rlimit is set to non-infinity, the new values should
   not exceed the existing limit.  The same applies to RLIMIT_STACK.
      
6) The auxiliary vector size must not exceed the existing one (which is
   predefined as AT_VECTOR_SIZE and depends on the architecture).
      
7) The file descriptor passed in @exe_fd should point to an executable
   file (because we use the existing prctl_set_mm_exe_file_locked helper,
   it ensures that the file we are going to use as the exe link has all
   required permissions granted).
      
Now, where these members are used inside the kernel code:
      
       - @start_code and @end_code are used in /proc/$pid/[stat|statm] output;
      
 - @start_data and @end_data are used in /proc/$pid/[stat|statm] output;
   they are also consulted to check whether there is enough space for the
   brk() syscall result if RLIMIT_DATA is set;
      
 - @start_brk is shown in /proc/$pid/stat output and accounted in the
   brk() syscall if RLIMIT_DATA is set; this member is also tested to
   find a symbolic name for an mmap event in the perf subsystem (we check
   whether the event is generated for the "heap" area); one more application
   is selinux -- we test whether a process has the PROCESS__EXECHEAP
   permission when it tries to make the heap area executable with the
   mprotect() syscall;
      
 - @brk is the current value for the brk() syscall, which lies inside the
   heap area; it's shown in /proc/$pid/stat.  When the brk() syscall
   successfully provides a new memory area to user space, mm::brk is
   updated upon completion to carry the new value;
      
         Both @start_brk and @brk are actively used in /proc/$pid/maps
         and /proc/$pid/smaps output to find a symbolic name "heap" for
         VMA being scanned;
      
 - @start_stack is printed in /proc/$pid/stat and used to
   find the symbolic name "stack" for tasks and threads in
   /proc/$pid/maps and /proc/$pid/smaps output; as with
   @start_brk, the perf subsystem uses it for event naming.
   The kernel also treats this member as the start address of
   where to map vDSO pages and uses it to check whether there
   is enough space for the shmat() syscall;
      
 - @arg_start, @arg_end, @env_start and @env_end are printed
   in /proc/$pid/stat.  Another way to access the data these members
   represent is to read /proc/$pid/environ or /proc/$pid/cmdline.
   The kernel checks any attempt to read these areas with the
   access_process_vm helper, so a user must have sufficient rights
   for this action;
      
 - @auxv and @auxv_size may be read from /proc/$pid/auxv.  Strictly
   speaking, the kernel doesn't care much about exactly which data sits
   there, because it is solely for userspace;
      
 - @exe_fd is referred to from /proc/$pid/exe and when generating a
   coredump.  We use the prctl_set_mm_exe_file_locked helper to update
   this member, so exe-file link modification remains a one-shot
   action.
      
Note that updating the exe-file link no longer requires the sys-resource
capability; after all, there is not much profit in preventing setup of
one's own file link (there are a number of ways to execute one's own code
-- ptrace, ld-preload -- so the only reliable way to find out exactly which
code is executed is to inspect the running program's memory).  Still, we
require the caller to be at least the user-namespace root user.
      
I believe the old interface should be deprecated and removed in a
couple of kernel releases if no one objects.
      
To test whether the new interface is implemented in the kernel, one can
pass the PR_SET_MM_MAP_SIZE opcode; the kernel returns the size of the
currently supported struct prctl_mm_map.
      
      [akpm@linux-foundation.org: fix 80-col wordwrap in macro definitions]
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tejun Heo <tj@kernel.org>
      Acked-by: NAndrew Vagin <avagin@openvz.org>
      Tested-by: NAndrew Vagin <avagin@openvz.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Julien Tinnes <jln@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f606b77f