1. 13 1月, 2014 3 次提交
    • D
      sched/deadline: Add SCHED_DEADLINE inheritance logic · 2d3d891d
      Dario Faggioli 提交于
      Some method to deal with rt-mutexes and make sched_dl interact with
      the current PI-coded is needed, raising all but trivial issues, that
      needs (according to us) to be solved with some restructuring of
      the pi-code (i.e., going toward a proxy execution-ish implementation).
      
      This is under development, in the meanwhile, as a temporary solution,
      what this commits does is:
      
       - ensure a pi-lock owner with waiters is never throttled down. Instead,
         when it runs out of runtime, it immediately gets replenished and it's
         deadline is postponed;
      
       - the scheduling parameters (relative deadline and default runtime)
         used for that replenishments --during the whole period it holds the
         pi-lock-- are the ones of the waiting task with earliest deadline.
      
      Acting this way, we provide some kind of boosting to the lock-owner,
      still by using the existing (actually, slightly modified by the previous
      commit) pi-architecture.
      
      We would stress the fact that this is only a surely needed, all but
      clean solution to the problem. In the end it's only a way to re-start
      discussion within the community. So, as always, comments, ideas, rants,
      etc.. are welcome! :-)
      Signed-off-by: NDario Faggioli <raistlin@linux.it>
      Signed-off-by: NJuri Lelli <juri.lelli@gmail.com>
      [ Added !RT_MUTEXES build fix. ]
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1383831828-15501-11-git-send-email-juri.lelli@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      2d3d891d
    • P
      rtmutex: Turn the plist into an rb-tree · fb00aca4
      Peter Zijlstra 提交于
      Turn the pi-chains from plist to rb-tree, in the rt_mutex code,
      and provide a proper comparison function for -deadline and
      -priority tasks.
      
      This is done mainly because:
       - classical prio field of the plist is just an int, which might
         not be enough for representing a deadline;
       - manipulating such a list would become O(nr_deadline_tasks),
         which might be to much, as the number of -deadline task increases.
      
      Therefore, an rb-tree is used, and tasks are queued in it according
      to the following logic:
       - among two -priority (i.e., SCHED_BATCH/OTHER/RR/FIFO) tasks, the
         one with the higher (lower, actually!) prio wins;
       - among a -priority and a -deadline task, the latter always wins;
       - among two -deadline tasks, the one with the earliest deadline
         wins.
      
      Queueing and dequeueing functions are changed accordingly, for both
      the list of a task's pi-waiters and the list of tasks blocked on
      a pi-lock.
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NDario Faggioli <raistlin@linux.it>
      Signed-off-by: NJuri Lelli <juri.lelli@gmail.com>
      Signed-off-again-by: NPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1383831828-15501-10-git-send-email-juri.lelli@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      fb00aca4
    • D
      sched/deadline: Add SCHED_DEADLINE structures & implementation · aab03e05
      Dario Faggioli 提交于
      Introduces the data structures, constants and symbols needed for
      SCHED_DEADLINE implementation.
      
      Core data structure of SCHED_DEADLINE are defined, along with their
      initializers. Hooks for checking if a task belong to the new policy
      are also added where they are needed.
      
      Adds a scheduling class, in sched/dl.c and a new policy called
      SCHED_DEADLINE. It is an implementation of the Earliest Deadline
      First (EDF) scheduling algorithm, augmented with a mechanism (called
      Constant Bandwidth Server, CBS) that makes it possible to isolate
      the behaviour of tasks between each other.
      
      The typical -deadline task will be made up of a computation phase
      (instance) which is activated on a periodic or sporadic fashion. The
      expected (maximum) duration of such computation is called the task's
      runtime; the time interval by which each instance need to be completed
      is called the task's relative deadline. The task's absolute deadline
      is dynamically calculated as the time instant a task (better, an
      instance) activates plus the relative deadline.
      
      The EDF algorithms selects the task with the smallest absolute
      deadline as the one to be executed first, while the CBS ensures each
      task to run for at most its runtime every (relative) deadline
      length time interval, avoiding any interference between different
      tasks (bandwidth isolation).
      Thanks to this feature, also tasks that do not strictly comply with
      the computational model sketched above can effectively use the new
      policy.
      
      To summarize, this patch:
       - introduces the data structures, constants and symbols needed;
       - implements the core logic of the scheduling algorithm in the new
         scheduling class file;
       - provides all the glue code between the new scheduling class and
         the core scheduler and refines the interactions between sched/dl
         and the other existing scheduling classes.
      Signed-off-by: NDario Faggioli <raistlin@linux.it>
      Signed-off-by: NMichael Trimarchi <michael@amarulasolutions.com>
      Signed-off-by: NFabio Checconi <fchecconi@gmail.com>
      Signed-off-by: NJuri Lelli <juri.lelli@gmail.com>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1383831828-15501-4-git-send-email-juri.lelli@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      aab03e05
  2. 19 12月, 2013 1 次提交
    • R
      mm: fix TLB flush race between migration, and change_protection_range · 20841405
      Rik van Riel 提交于
      There are a few subtle races, between change_protection_range (used by
      mprotect and change_prot_numa) on one side, and NUMA page migration and
      compaction on the other side.
      
      The basic race is that there is a time window between when the PTE gets
      made non-present (PROT_NONE or NUMA), and the TLB is flushed.
      
      During that time, a CPU may continue writing to the page.
      
      This is fine most of the time, however compaction or the NUMA migration
      code may come in, and migrate the page away.
      
      When that happens, the CPU may continue writing, through the cached
      translation, to what is no longer the current memory location of the
      process.
      
      This only affects x86, which has a somewhat optimistic pte_accessible.
      All other architectures appear to be safe, and will either always flush,
      or flush whenever there is a valid mapping, even with no permissions
      (SPARC).
      
      The basic race looks like this:
      
      CPU A			CPU B			CPU C
      
      						load TLB entry
      make entry PTE/PMD_NUMA
      			fault on entry
      						read/write old page
      			start migrating page
      			change PTE/PMD to new page
      						read/write old page [*]
      flush TLB
      						reload TLB from new entry
      						read/write new page
      						lose data
      
      [*] the old page may belong to a new user at this point!
      
      The obvious fix is to flush remote TLB entries, by making sure that
      pte_accessible aware of the fact that PROT_NONE and PROT_NUMA memory may
      still be accessible if there is a TLB flush pending for the mm.
      
      This should fix both NUMA migration and compaction.
      
      [mgorman@suse.de: fix build]
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Alex Thorlton <athorlton@sgi.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      20841405
  3. 27 11月, 2013 1 次提交
  4. 15 11月, 2013 2 次提交
    • K
      mm: implement split page table lock for PMD level · e009bb30
      Kirill A. Shutemov 提交于
      The basic idea is the same as with PTE level: the lock is embedded into
      struct page of table's page.
      
      We can't use mm->pmd_huge_pte to store pgtables for THP, since we don't
      take mm->page_table_lock anymore.  Let's reuse page->lru of table's page
      for that.
      
      pgtable_pmd_page_ctor() returns true, if initialization is successful
      and false otherwise.  Current implementation never fails, but assumption
      that constructor can fail will help to port it to -rt where spinlock_t
      is rather huge and cannot be embedded into struct page -- dynamic
      allocation is required.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NAlex Thorlton <athorlton@sgi.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Sedat Dilek <sedat.dilek@gmail.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: NSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e009bb30
    • K
      mm: convert mm->nr_ptes to atomic_long_t · e1f56c89
      Kirill A. Shutemov 提交于
      With split page table lock for PMD level we can't hold mm->page_table_lock
      while updating nr_ptes.
      
      Let's convert it to atomic_long_t to avoid races.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NAlex Thorlton <athorlton@sgi.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Sedat Dilek <sedat.dilek@gmail.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e1f56c89
  5. 30 10月, 2013 2 次提交
    • O
      uprobes: Teach uprobe_copy_process() to handle CLONE_VFORK · 3ab67966
      Oleg Nesterov 提交于
      uprobe_copy_process() does nothing if the child shares ->mm with
      the forking process, but there is a special case: CLONE_VFORK.
      In this case it would be more correct to do dup_utask() but avoid
      dup_xol(). This is not that important, the child should not unwind
      its stack too much, this can corrupt the parent's stack, but at
      least we need this to allow to ret-probe __vfork() itself.
      
      Note: in theory, it would be better to check task_pt_regs(p)->sp
      instead of CLONE_VFORK, we need to dup_utask() if and only if the
      child can return from the function called by the parent. But this
      needs the arch-dependant helper, and I think that nobody actually
      does clone(same_stack, CLONE_VM).
      Reported-by: NMartin Cermak <mcermak@redhat.com>
      Reported-by: NDavid Smith <dsmith@redhat.com>
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      3ab67966
    • O
      uprobes: Change the callsite of uprobe_copy_process() · b68e0749
      Oleg Nesterov 提交于
      Preparation for the next patches.
      
      Move the callsite of uprobe_copy_process() in copy_process() down
      to the succesfull return. We do not care if copy_process() fails,
      uprobe_free_utask() won't be called in this case so the wrong
      ->utask != NULL doesn't matter.
      
      OTOH, with this change we know that copy_process() can't fail when
      uprobe_copy_process() is called, the new task should either return
      to user-mode or call do_exit(). This way uprobe_copy_process() can:
      
      	1. setup p->utask != NULL if necessary
      
      	2. setup uprobes_state.xol_area
      
      	3. use task_work_add(p)
      
      Also, move the definition of uprobe_copy_process() down so that it
      can see get_utask().
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      b68e0749
  6. 09 10月, 2013 2 次提交
  7. 12 9月, 2013 4 次提交
  8. 31 8月, 2013 1 次提交
    • E
      pidns: Don't have unshare(CLONE_NEWPID) imply CLONE_THREAD · 6e556ce2
      Eric W. Biederman 提交于
      I goofed when I made unshare(CLONE_NEWPID) only work in a
      single-threaded process.  There is no need for that requirement and in
      fact I analyzied things right for setns.  The hard requirement
      is for tasks that share a VM to all be in the pid namespace and
      we properly prevent that in do_fork.
      
      Just to be certain I took a look through do_wait and
      forget_original_parent and there are no cases that make it any harder
      for children to be in the multiple pid namespaces than it is for
      children to be in the same pid namespace.  I also performed a check to
      see if there were in uses of task->nsproxy_pid_ns I was not familiar
      with, but it is only used when allocating a new pid for a new task,
      and in checks to prevent craziness from happening.
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      6e556ce2
  9. 28 8月, 2013 1 次提交
  10. 14 8月, 2013 1 次提交
  11. 31 7月, 2013 1 次提交
    • B
      aio: convert the ioctx list to table lookup v3 · db446a08
      Benjamin LaHaise 提交于
      On Wed, Jun 12, 2013 at 11:14:40AM -0700, Kent Overstreet wrote:
      > On Mon, Apr 15, 2013 at 02:40:55PM +0300, Octavian Purdila wrote:
      > > When using a large number of threads performing AIO operations the
      > > IOCTX list may get a significant number of entries which will cause
      > > significant overhead. For example, when running this fio script:
      > >
      > > rw=randrw; size=256k ;directory=/mnt/fio; ioengine=libaio; iodepth=1
      > > blocksize=1024; numjobs=512; thread; loops=100
      > >
      > > on an EXT2 filesystem mounted on top of a ramdisk we can observe up to
      > > 30% CPU time spent by lookup_ioctx:
      > >
      > >  32.51%  [guest.kernel]  [g] lookup_ioctx
      > >   9.19%  [guest.kernel]  [g] __lock_acquire.isra.28
      > >   4.40%  [guest.kernel]  [g] lock_release
      > >   4.19%  [guest.kernel]  [g] sched_clock_local
      > >   3.86%  [guest.kernel]  [g] local_clock
      > >   3.68%  [guest.kernel]  [g] native_sched_clock
      > >   3.08%  [guest.kernel]  [g] sched_clock_cpu
      > >   2.64%  [guest.kernel]  [g] lock_release_holdtime.part.11
      > >   2.60%  [guest.kernel]  [g] memcpy
      > >   2.33%  [guest.kernel]  [g] lock_acquired
      > >   2.25%  [guest.kernel]  [g] lock_acquire
      > >   1.84%  [guest.kernel]  [g] do_io_submit
      > >
      > > This patchs converts the ioctx list to a radix tree. For a performance
      > > comparison the above FIO script was run on a 2 sockets 8 core
      > > machine. This are the results (average and %rsd of 10 runs) for the
      > > original list based implementation and for the radix tree based
      > > implementation:
      > >
      > > cores         1         2         4         8         16        32
      > > list       109376 ms  69119 ms  35682 ms  22671 ms  19724 ms  16408 ms
      > > %rsd         0.69%      1.15%     1.17%     1.21%     1.71%     1.43%
      > > radix       73651 ms  41748 ms  23028 ms  16766 ms  15232 ms   13787 ms
      > > %rsd         1.19%      0.98%     0.69%     1.13%    0.72%      0.75%
      > > % of radix
      > > relative    66.12%     65.59%    66.63%    72.31%   77.26%     83.66%
      > > to list
      > >
      > > To consider the impact of the patch on the typical case of having
      > > only one ctx per process the following FIO script was run:
      > >
      > > rw=randrw; size=100m ;directory=/mnt/fio; ioengine=libaio; iodepth=1
      > > blocksize=1024; numjobs=1; thread; loops=100
      > >
      > > on the same system and the results are the following:
      > >
      > > list        58892 ms
      > > %rsd         0.91%
      > > radix       59404 ms
      > > %rsd         0.81%
      > > % of radix
      > > relative    100.87%
      > > to list
      >
      > So, I was just doing some benchmarking/profiling to get ready to send
      > out the aio patches I've got for 3.11 - and it looks like your patch is
      > causing a ~1.5% throughput regression in my testing :/
      ... <snip>
      
      I've got an alternate approach for fixing this wart in lookup_ioctx()...
      Instead of using an rbtree, just use the reserved id in the ring buffer
      header to index an array pointing the ioctx.  It's not finished yet, and
      it needs to be tidied up, but is most of the way there.
      
      		-ben
      --
      "Thought is the essence of where you are now."
      --
      kmo> And, a rework of Ben's code, but this was entirely his idea
      kmo>		-Kent
      
      bcrl> And fix the code to use the right mm_struct in kill_ioctx(), actually
      free memory.
      Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org>
      db446a08
  12. 15 7月, 2013 1 次提交
    • P
      kernel: delete __cpuinit usage from all core kernel files · 0db0628d
      Paul Gortmaker 提交于
      The __cpuinit type of throwaway sections might have made sense
      some time ago when RAM was more constrained, but now the savings
      do not offset the cost and complications.  For example, the fix in
      commit 5e427ec2 ("x86: Fix bit corruption at CPU resume time")
      is a good example of the nasty type of bugs that can be created
      with improper use of the various __init prefixes.
      
      After a discussion on LKML[1] it was decided that cpuinit should go
      the way of devinit and be phased out.  Once all the users are gone,
      we can then finally remove the macros themselves from linux/init.h.
      
      This removes all the uses of the __cpuinit macros from C files in
      the core kernel directories (kernel, init, lib, mm, and include)
      that don't really have a specific maintainer.
      
      [1] https://lkml.org/lkml/2013/5/20/589Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      0db0628d
  13. 11 7月, 2013 1 次提交
  14. 04 7月, 2013 4 次提交
    • O
      kernel/fork.c:copy_process(): consolidate the lockless CLONE_THREAD checks · 18c830df
      Oleg Nesterov 提交于
      copy_process() does a lot of "chaotic" initializations and checks
      CLONE_THREAD twice before it takes tasklist.  In particular it sets
      "p->group_leader = p" and then changes it again under tasklist if
      !thread_group_leader(p).
      
      This looks a bit confusing, lets create a single "if (CLONE_THREAD)" block
      which initializes ->exit_signal, ->group_leader, and ->tgid.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Sergey Dyasly <dserrg@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      18c830df
    • O
      kernel/fork.c:copy_process(): don't add the uninitialized child to thread/task/pid lists · 81907739
      Oleg Nesterov 提交于
      copy_process() adds the new child to thread_group/init_task.tasks list and
      then does attach_pid(child, PIDTYPE_PID).  This means that the lockless
      next_thread() or next_task() can see this thread with the wrong pid.  Say,
      "ls /proc/pid/task" can list the same inode twice.
      
      We could move attach_pid(child, PIDTYPE_PID) up, but in this case
      find_task_by_vpid() can find the new thread before it was fully
      initialized.
      
      And this is already true for PIDTYPE_PGID/PIDTYPE_SID, With this patch
      copy_process() initializes child->pids[*].pid first, then calls
      attach_pid() to insert the task into the pid->tasks list.
      
      attach_pid() no longer need the "struct pid*" argument, it is always
      called after pid_link->pid was already set.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Sergey Dyasly <dserrg@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      81907739
    • O
      kernel/fork.c:copy_process(): unify CLONE_THREAD-or-thread_group_leader code · 80628ca0
      Oleg Nesterov 提交于
      Cleanup and preparation for the next changes.
      
      Move the "if (clone_flags & CLONE_THREAD)" code down under "if
      (likely(p->pid))" and turn it into into the "else" branch.  This makes the
      process/thread initialization more symmetrical and removes one check.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Sergey Dyasly <dserrg@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      80628ca0
    • E
      fork: reorder permissions when violating number of processes limits · b57922b6
      Eric Paris 提交于
      When a task is attempting to violate the RLIMIT_NPROC limit we have a
      check to see if the task is sufficiently priviledged.  The check first
      looks at CAP_SYS_ADMIN, then CAP_SYS_RESOURCE, then if the task is uid=0.
      
      A result is that tasks which are allowed by the uid=0 check are first
      checked against the security subsystem.  This results in the security
      subsystem auditting a denial for sys_admin and sys_resource and then the
      task passing the uid=0 check.
      
      This patch rearranges the code to first check uid=0, since if we pass that
      we shouldn't hit the security system at all.  We then check sys_resource,
      since it is the smallest capability which will solve the problem.  Lastly
      we check the fallback everything cap_sysadmin.  We don't want to give this
      capability many places since it is so powerful.
      
      This will eliminate many of the false positive/needless denial messages we
      get when a root task tries to violate the nproc limit.  (note that
      kthreads count against root, so on a sufficiently large machine we can
      actually get past the default limits before any userspace tasks are
      launched.)
      Signed-off-by: NEric Paris <eparis@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b57922b6
  15. 08 5月, 2013 1 次提交
  16. 24 3月, 2013 1 次提交
  17. 14 3月, 2013 1 次提交
    • E
      userns: Don't allow CLONE_NEWUSER | CLONE_FS · e66eded8
      Eric W. Biederman 提交于
      Don't allowing sharing the root directory with processes in a
      different user namespace.  There doesn't seem to be any point, and to
      allow it would require the overhead of putting a user namespace
      reference in fs_struct (for permission checks) and incrementing that
      reference count on practically every call to fork.
      
      So just perform the inexpensive test of forbidding sharing fs_struct
      acrosss processes in different user namespaces.  We already disallow
      other forms of threading when unsharing a user namespace so this
      should be no real burden in practice.
      
      This updates setns, clone, and unshare to disallow multiple user
      namespaces sharing an fs_struct.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e66eded8
  18. 08 3月, 2013 1 次提交
    • F
      cputime: Dynamically scale cputime for full dynticks accounting · 9fbc42ea
      Frederic Weisbecker 提交于
      The full dynticks cputime accounting is able to account either
      using the tick or the context tracking subsystem. This way
      the housekeeping CPU can keep the low overhead tick based
      solution.
      
      This latter mode has a low jiffies resolution granularity and
      need to be scaled against CFS precise runtime accounting to
      improve its result. We are doing this for CONFIG_TICK_CPU_ACCOUNTING,
      now we also need to expand it to full dynticks accounting dynamic
      off-case as well.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Mats Liljegren <mats.liljegren@enea.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      9fbc42ea
  19. 04 3月, 2013 1 次提交
  20. 28 2月, 2013 1 次提交
  21. 23 2月, 2013 1 次提交
  22. 28 1月, 2013 1 次提交
    • F
      cputime: Safely read cputime of full dynticks CPUs · 6a61671b
      Frederic Weisbecker 提交于
      While remotely reading the cputime of a task running in a
      full dynticks CPU, the values stored in utime/stime fields
      of struct task_struct may be stale. Its values may be those
      of the last kernel <-> user transition time snapshot and
      we need to add the tickless time spent since this snapshot.
      
      To fix this, flush the cputime of the dynticks CPUs on
      kernel <-> user transition and record the time / context
      where we did this. Then on top of this snapshot and the current
      time, perform the fixup on the reader side from task_times()
      accessors.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      [fixed kvm module related build errors]
      Signed-off-by: NSedat Dilek <sedat.dilek@gmail.com>
      6a61671b
  23. 20 1月, 2013 1 次提交
  24. 25 12月, 2012 1 次提交
  25. 20 12月, 2012 1 次提交
  26. 19 12月, 2012 1 次提交
    • G
      fork: protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs · 2ad306b1
      Glauber Costa 提交于
      Because those architectures will draw their stacks directly from the page
      allocator, rather than the slab cache, we can directly pass __GFP_KMEMCG
      flag, and issue the corresponding free_pages.
      
      This code path is taken when the architecture doesn't define
      CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has
      THREAD_SIZE >= PAGE_SIZE.  Luckily, most - if not all - of the remaining
      architectures fall in this category.
      
      This will guarantee that every stack page is accounted to the memcg the
      process currently lives on, and will have the allocations to fail if they
      go over limit.
      
      For the time being, I am defining a new variant of THREADINFO_GFP, not to
      mess with the other path.  Once the slab is also tracked by memcg, we can
      get rid of that flag.
      
      Tested to successfully protect against :(){ :|:& };:
      Signed-off-by: NGlauber Costa <glommer@parallels.com>
      Acked-by: NFrederic Weisbecker <fweisbec@redhat.com>
      Acked-by: NKamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2ad306b1
  27. 11 12月, 2012 1 次提交
    • M
      mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node · 5bca2303
      Mel Gorman 提交于
      Due to the fact that migrations are driven by the CPU a task is running
      on there is no point tracking NUMA faults until one task runs on a new
      node. This patch tracks the first node used by an address space. Until
      it changes, PTE scanning is disabled and no NUMA hinting faults are
      trapped. This should help workloads that are short-lived, do not care
      about NUMA placement or have bound themselves to a single node.
      
      This takes advantage of the logic in "mm: sched: numa: Implement slow
      start for working set sampling" to delay when the checks are made. This
      will take advantage of processes that set their CPU and node bindings
      early in their lifetime. It will also potentially allow any initial load
      balancing to take place.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      5bca2303
  28. 29 11月, 2012 2 次提交