1. 03 May, 2012 - 6 commits
  2. 26 April, 2012 - 2 commits
    • E
      userns: Rework the user_namespace adding uid/gid mapping support · 22d917d8
       Committed by Eric W. Biederman
      - Convert the old uid mapping functions into compatibility wrappers
       - Add a uid/gid mapping layer from user space uids and gids to kernel-
         internal uids and gids that is extent based for simplicity and speed
         (see the sketch after this list).
         * Working in the number space after mapping uids/gids into their
           kernel-internal form adds only mapping complexity over what we have
           today, leaving the kernel code easy to understand and test.
      - Add proc files /proc/self/uid_map /proc/self/gid_map
        These files display the mapping and allow a mapping to be added
        if a mapping does not exist.
       - Allow entering the user namespace without a uid or gid mapping.
         Since we are starting with an existing user, our uids and gids still
         have global mappings, so they are still valid and useful; they just
         don't have local mappings.  The requirement for things to work is a
         global uid and gid, so it is odd but perfectly fine not to have a
         local uid and gid mapping.
         Not requiring uid and gid mappings up front greatly simplifies the
         logic of setting up the uid and gid mappings, by allowing the
         mappings to be set after the namespace is created, which makes the
         slight weirdness worth it.
      - Make the mappings in the initial user namespace to the global
        uid/gid space explicit.  Today it is an identity mapping
        but in the future we may want to twist this for debugging, similar
        to what we do with jiffies.
      - Document the memory ordering requirements of setting the uid and
        gid mappings.  We only allow the mappings to be set once
         and there are no pointers involved, so the requirements are
        trivial but a little atypical.
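
       As a rough illustration of the extent-based mapping layer described in
       the list above, here is a minimal kernel-style sketch; the struct and
       function names are illustrative assumptions, not the identifiers used
       by the patch itself:

       /* Illustrative extent-based id lookup (names are assumptions);
        * u32 is the kernel type from <linux/types.h>. */
       struct id_extent {
               u32 first;        /* first id of the userspace range       */
               u32 lower_first;  /* first id of the kernel-internal range */
               u32 count;        /* number of ids covered by this extent  */
       };

       struct id_map_sketch {
               u32 nr_extents;
               struct id_extent extent[5];
       };

       /* Map a userspace id to its kernel-internal value, (u32)-1 if unmapped. */
       static u32 map_id_sketch(const struct id_map_sketch *map, u32 id)
       {
               u32 i;

               for (i = 0; i < map->nr_extents; i++) {
                       u32 first = map->extent[i].first;

                       if (id >= first && id - first < map->extent[i].count)
                               return map->extent[i].lower_first + (id - first);
               }
               return (u32)-1;
       }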
      
      Performance:
      
       In this scheme the performance of the permission checks is expected to
       stay the same, as the actual machine instructions should remain the same.
      
       The worst case I could think of is ls -l on a large directory, where
       all of the stat results need to be translated from kuids and kgids to
       uids and gids.  So I benchmarked that case on my laptop with a
       dual-core, hyperthreaded Intel i5-2520M CPU with 3M of CPU cache.
      
       My benchmark consisted of going to single user mode where nothing else
       was running.  On an ext4 filesystem I opened 1,000,000 files and looped
       through all of the files 1000 times, calling fstat on the individual
       files.  This was to ensure I was benchmarking stat times where the
       inodes were in the kernel's cache, but the inode values were not in the
       processor's cache.  My results:
      
      v3.4-rc1:         ~= 156ns (unmodified v3.4-rc1 with user namespace support disabled)
      v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and user namespace support disabled)
      v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and user namespace support enabled)
      
       All of the configurations ran in roughly 120ns when I performed tests
       that fit in the CPU cache.
      
      So in summary the performance impact is:
       A 1ns improvement in the worst case with user namespace support compiled out.
       An 8ns (about 5%) slowdown in the worst case with user namespace support compiled in.
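
       For reference, a rough userspace sketch of the kind of fstat loop
       described above; file names and the pre-created test directory are
       illustrative assumptions, and a real run also needs RLIMIT_NOFILE
       raised high enough to keep ~1,000,000 descriptors open:

       /* Open many pre-created files once, then repeatedly fstat() them. */
       #include <stdio.h>
       #include <fcntl.h>
       #include <sys/stat.h>

       #define NR_FILES 1000000
       #define NR_LOOPS 1000

       static int fds[NR_FILES];

       int main(void)
       {
               struct stat st;
               long i, j;

               for (i = 0; i < NR_FILES; i++) {
                       char name[64];

                       snprintf(name, sizeof(name), "f%ld", i);  /* pre-created */
                       fds[i] = open(name, O_RDONLY);
               }

               for (j = 0; j < NR_LOOPS; j++)
                       for (i = 0; i < NR_FILES; i++)
                               fstat(fds[i], &st);

               return 0;
       }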
       Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
       Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      22d917d8
    • E
      userns: Simplify the user_namespace by making userns->creator a kuid. · 783291e6
       Committed by Eric W. Biederman
      - Transform userns->creator from a user_struct reference to a simple
        kuid_t, kgid_t pair.
      
         In cap_capable this allows the check for whether we are the creator
         of a namespace to become the classic suser-style euid permission
         check (see the sketch after this list).

         This allows us to remove the need for a struct cred in the mapping
         functions and still be able to display the user namespace creator's
         uid and gid as 0.
      
      - Remove the now unnecessary delayed_work in free_user_ns.
      
        All that is left for free_user_ns to do is to call kmem_cache_free
        and put_user_ns.  Those functions can be called in any context
        so call them directly from free_user_ns removing the need for delayed work.
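
       A minimal kernel-context sketch of the suser-style euid check this
       enables (referenced in the list above); the field name 'owner' for the
       stored kuid_t is an assumption used only for illustration:

       /* Sketch: is the caller's euid the uid that created the namespace? */
       static bool created_by_euid(const struct user_namespace *ns, kuid_t euid)
       {
               return uid_eq(euid, ns->owner);   /* 'owner' is assumed here */
       }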
       Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
       Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      783291e6
  3. 08 April, 2012 - 9 commits
  4. 31 March, 2012 - 2 commits
  5. 30 March, 2012 - 2 commits
  6. 29 March, 2012 - 15 commits
    • K
      futex: Mark get_robust_list as deprecated · ec0c4274
       Committed by Kees Cook
      Notify get_robust_list users that the syscall is going away.
       Suggested-by: Thomas Gleixner <tglx@linutronix.de>
       Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Randy Dunlap <rdunlap@xenotime.net>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Serge E. Hallyn <serge.hallyn@canonical.com>
      Cc: kernel-hardening@lists.openwall.com
      Cc: spender@grsecurity.net
       Link: http://lkml.kernel.org/r/20120323190855.GA27213@www.outflux.net
       Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      ec0c4274
    • K
      futex: Do not leak robust list to unprivileged process · bdbb776f
       Committed by Kees Cook
      It was possible to extract the robust list head address from a setuid
      process if it had used set_robust_list(), allowing an ASLR info leak. This
      changes the permission checks to be the same as those used for similar
      info that comes out of /proc.
      
      Running a setuid program that uses robust futexes would have had:
        cred->euid != pcred->euid
        cred->euid == pcred->uid
      so the old permissions check would allow it. I'm not aware of any setuid
      programs that use robust futexes, so this is just a preventative measure.
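
       A kernel-context sketch of the stricter check; it assumes the same
       ptrace-style access test that gates comparable per-task data under
       /proc, which may differ in detail from the actual hunk:

       /* Sketch (assumed mechanism): only let the caller read another task's
        * robust list head if it could also ptrace-read that task. */
       static int may_read_robust_list(struct task_struct *target)
       {
               if (!ptrace_may_access(target, PTRACE_MODE_READ))
                       return -EPERM;
               return 0;
       }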
      
      (This patch is based on changes from grsecurity.)
       Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Serge E. Hallyn <serge.hallyn@canonical.com>
      Cc: kernel-hardening@lists.openwall.com
      Cc: spender@grsecurity.net
       Link: http://lkml.kernel.org/r/20120319231253.GA20893@www.outflux.net
       Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      bdbb776f
    • P
      genirq: Respect NUMA node affinity in setup_irq_irq affinity() · 241fc640
       Committed by Prarit Bhargava
      We respect node affinity of devices already in the irq descriptor
      allocation, but we ignore it for the initial interrupt affinity
      setup, so the interrupt might be routed to a different node.
      
      Restrict the default affinity mask to the node on which the irq
      descriptor is allocated.
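
       A kernel-context sketch of the idea; the helper name and the empty-mask
       fallback are illustrative, not the exact hunk from the patch:

       /* Sketch: keep only the CPUs of the descriptor's node in the default
        * affinity mask, unless doing so would leave the mask empty. */
       static void restrict_default_affinity(struct cpumask *mask, int node)
       {
               if (node != NUMA_NO_NODE &&
                   cpumask_intersects(mask, cpumask_of_node(node)))
                       cpumask_and(mask, mask, cpumask_of_node(node));
       }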
      
      [ tglx: Massaged changelog ]
       Signed-off-by: Prarit Bhargava <prarit@redhat.com>
       Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
       Link: http://lkml.kernel.org/r/1332788538-17425-1-git-send-email-prarit@redhat.com
       Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      241fc640
    • A
      genirq: Get rid of unneeded force parameter in irq_finalize_oneshot() · f3f79e38
       Committed by Alexander Gordeev
      The only place irq_finalize_oneshot() is called with force parameter set
      is the threaded handler error exit path. But IRQTF_RUNTHREAD is dropped
      at this point and irq_wake_thread() is not going to set it again,
       since PF_EXITING is already set for this thread.  So irq_finalize_oneshot()
       will drop the thread's bit in threads_oneshot anyway, and hence the force
       parameter is superfluous.
       Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
       Link: http://lkml.kernel.org/r/20120321162234.GP24806@dhcp-26-207.brq.redhat.com
       Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      f3f79e38
    • A
       genirq: Minor readability improvement in irq_wake_thread() · 69592db2
       Committed by Alexander Gordeev
       exit_irq_thread() clears the IRQTF_RUNTHREAD flag and then drops the
       thread's bit in desc->threads_oneshot.  The bit must not be set again in
       between, and it is not, since irq_wake_thread() sees the PF_EXITING flag
       first and returns.

       Because of this, the order of checking the PF_EXITING and IRQTF_RUNTHREAD
       flags in irq_wake_thread() is important.  This change just makes that
       more visible in the source code.
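
       A kernel-context sketch of the check order the changelog is about
       (an illustrative fragment, not the verbatim function body):

       /* Check PF_EXITING before touching IRQTF_RUNTHREAD: an exiting thread
        * must never get its bit set again after exit_irq_thread() cleared it. */
       if (action->thread->flags & PF_EXITING)
               return;
       if (test_and_set_bit(IRQTF_RUNTHREAD, &action->thread_flags))
               return;                 /* already marked as running */
       wake_up_process(action->thread);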
       Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
       Link: http://lkml.kernel.org/r/20120321162212.GO24806@dhcp-26-207.brq.redhat.com
       Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      69592db2
    • S
      sched: Fix __schedule_bug() output when called from an interrupt · 6135fc1e
       Committed by Stephen Boyd
      If schedule is called from an interrupt handler __schedule_bug()
      will call show_regs() with the registers saved during the
      interrupt handling done in do_IRQ(). This means we'll see the
      registers and the backtrace for the process that was interrupted
      and not the full backtrace explaining who called schedule().
      
      This is due to 838225b4 ("sched: use show_regs() to improve
      __schedule_bug() output", 2007-10-24) which improperly assumed
      that get_irq_regs() would return the registers for the current
      stack because it is being called from within an interrupt
       handler. Simply remove the show_regs() code so that we dump a
      backtrace for the interrupt handler that called schedule().
      
      [ I ran across this when I was presented with a scheduling while
        atomic log with a stacktrace pointing at spin_unlock_irqrestore().
        It made no sense and I had to guess what interrupt handler could
        be called and poke around for someone calling schedule() in an
        interrupt handler. A simple test of putting an msleep() in
        an interrupt handler works better with this patch because you
        can actually see the msleep() call in the backtrace. ]
       Also-reported-by: Chris Metcalf <cmetcalf@tilera.com>
       Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
      Cc: Satyam Sharma <satyam@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
       Link: http://lkml.kernel.org/r/1332979847-27102-1-git-send-email-sboyd@codeaurora.org
       Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6135fc1e
    • D
      pidns: add reboot_pid_ns() to handle the reboot syscall · cf3f8921
       Committed by Daniel Lezcano
       In the case of a child pid namespace, rebooting the system does not
       really make sense.  When the pid namespace is used in conjunction with
       the other namespaces in order to create a linux container, the reboot
       syscall leads to some problems.
      
       A container can reboot the host.  That can be fixed by dropping the
       sys_reboot capability, but then we are unable to correctly poweroff/
       halt/reboot a container, and the container stays stuck at shutdown time
       with the container's init process waiting indefinitely.
      
       After several attempts, no solution from userspace was found to reliably
       handle the shutdown from a container.
      
       This patch proposes to make the init process of the child pid namespace
       exit with a signal status set to SIGINT if the child pid namespace
       called "halt/poweroff" and SIGHUP if the child pid namespace called
       "reboot".  When the reboot syscall is called and we are not in the
       initial pid namespace, we kill the pid namespace for "HALT", "POWEROFF",
       "RESTART", and "RESTART2".  Otherwise we return EINVAL.
      
      Returning EINVAL is also an easy way to check if this feature is supported
      by the kernel when invoking another 'reboot' option like CAD.
      
       This way the parent process of the child pid namespace knows whether it
       rebooted or not and can take the right decision.
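
       A kernel-context sketch of the command-to-signal mapping described
       above; the helper name is illustrative, and the real patch additionally
       makes the calling init task exit rather than simply return:

       static int pidns_reboot_sketch(struct pid_namespace *pid_ns, unsigned int cmd)
       {
               int sig;

               switch (cmd) {
               case LINUX_REBOOT_CMD_RESTART:
               case LINUX_REBOOT_CMD_RESTART2:
                       sig = SIGHUP;
                       break;
               case LINUX_REBOOT_CMD_HALT:
               case LINUX_REBOOT_CMD_POWER_OFF:
                       sig = SIGINT;
                       break;
               default:
                       return -EINVAL;    /* e.g. CAD_ON is not handled here */
               }

               /* Signal the namespace's init; the parent then observes a
                * WIFSIGNALED() status and can decide what to do. */
               send_sig(sig, pid_ns->child_reaper, 1);
               return 0;
       }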
      
      Test case:
      ==========
      
       #define _GNU_SOURCE   /* for clone() from <sched.h> */
       #include <alloca.h>
      #include <stdio.h>
      #include <sched.h>
      #include <unistd.h>
      #include <signal.h>
      #include <sys/reboot.h>
      #include <sys/types.h>
      #include <sys/wait.h>
      
      #include <linux/reboot.h>
      
       static int do_reboot(void *arg)
       {
               int *cmd = arg;

               if (reboot(*cmd))
                       printf("failed to reboot(%d): %m\n", *cmd);

               return 0;
       }
      
      int test_reboot(int cmd, int sig)
      {
              long stack_size = 4096;
              void *stack = alloca(stack_size) + stack_size;
              int status;
              pid_t ret;
      
              ret = clone(do_reboot, stack, CLONE_NEWPID | SIGCHLD, &cmd);
              if (ret < 0) {
                      printf("failed to clone: %m\n");
                      return -1;
              }
      
              if (wait(&status) < 0) {
                      printf("unexpected wait error: %m\n");
                      return -1;
              }
      
              if (!WIFSIGNALED(status)) {
                      printf("child process exited but was not signaled\n");
                      return -1;
              }
      
              if (WTERMSIG(status) != sig) {
                      printf("signal termination is not the one expected\n");
                      return -1;
              }
      
              return 0;
      }
      
      int main(int argc, char *argv[])
      {
              int status;
      
              status = test_reboot(LINUX_REBOOT_CMD_RESTART, SIGHUP);
              if (status < 0)
                      return 1;
              printf("reboot(LINUX_REBOOT_CMD_RESTART) succeed\n");
      
              status = test_reboot(LINUX_REBOOT_CMD_RESTART2, SIGHUP);
              if (status < 0)
                      return 1;
              printf("reboot(LINUX_REBOOT_CMD_RESTART2) succeed\n");
      
              status = test_reboot(LINUX_REBOOT_CMD_HALT, SIGINT);
              if (status < 0)
                      return 1;
              printf("reboot(LINUX_REBOOT_CMD_HALT) succeed\n");
      
              status = test_reboot(LINUX_REBOOT_CMD_POWER_OFF, SIGINT);
              if (status < 0)
                      return 1;
               printf("reboot(LINUX_REBOOT_CMD_POWER_OFF) succeed\n");
      
              status = test_reboot(LINUX_REBOOT_CMD_CAD_ON, -1);
              if (status >= 0) {
                      printf("reboot(LINUX_REBOOT_CMD_CAD_ON) should have failed\n");
                      return 1;
              }
              printf("reboot(LINUX_REBOOT_CMD_CAD_ON) has failed as expected\n");
      
              return 0;
      }
      
      [akpm@linux-foundation.org: tweak and add comments]
      [akpm@linux-foundation.org: checkpatch fixes]
       Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
       Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
       Tested-by: Serge Hallyn <serge.hallyn@canonical.com>
       Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Tejun Heo <tj@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cf3f8921
    • A
      sysctl: use bitmap library functions · 5a04cca6
       Committed by Akinobu Mita
      Use bitmap_set() instead of using set_bit() for each bit.  This conversion
      is valid because the bitmap is private in the function call and atomic
      bitops were unnecessary.
      
       This also includes a minor change:
       - Use bitmap_copy() for shorter typing
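
       A small kernel-context sketch of the conversion (variable names are
       illustrative):

       /* Before: one atomic set_bit() per bit of a function-local bitmap. */
       for (i = 0; i < len; i++)
               set_bit(first + i, tmp_bitmap);

       /* After: a single non-atomic range fill; safe because tmp_bitmap is
        * private to the function, so nothing else can touch it concurrently. */
       bitmap_set(tmp_bitmap, first, len);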
       Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5a04cca6
    • Z
      kexec: add further check to crashkernel · eaa3be6a
       Committed by Zhenzhong Duan
       When using crashkernel=2M-256M, the kernel doesn't give any warning.  This
       is sometimes misleading.
       Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
       Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eaa3be6a
    • W
      kexec: crash: don't save swapper_pg_dir for !CONFIG_MMU configurations · d034cfab
       Committed by Will Deacon
      nommu platforms don't have very interesting swapper_pg_dir pointers and
      usually just #define them to NULL, meaning that we can't include them in
      the vmcoreinfo on the kexec crash path.
      
      This patch only saves the swapper_pg_dir if we have an MMU.
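
       The change boils down to guarding the vmcoreinfo export; roughly (a
       sketch of the idea, not a verbatim hunk):

       /* Only export swapper_pg_dir when the architecture has an MMU and the
        * symbol is therefore meaningful. */
       #ifdef CONFIG_MMU
               VMCOREINFO_SYMBOL(swapper_pg_dir);
       #endif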
       Signed-off-by: Will Deacon <will.deacon@arm.com>
       Reviewed-by: Simon Horman <horms@verge.net.au>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d034cfab
    • G
      smp: add func to IPI cpus based on parameter func · b3a7e98e
       Committed by Gilad Ben-Yossef
      Add the on_each_cpu_cond() function that wraps on_each_cpu_mask() and
      calculates the cpumask of cpus to IPI by calling a function supplied as a
      parameter in order to determine whether to IPI each specific cpu.
      
       The function works around allocation failure of the cpumask variable in
       the CONFIG_CPUMASK_OFFSTACK=y case by iterating over the cpus and
       sending one IPI at a time via smp_call_function_single().

       The function is useful since it separates the specific code that
       decides, in each case, whether to IPI a specific cpu for a specific
       request from the common boilerplate code of creating the mask, handling
       failures, etc.
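
       A kernel-context usage sketch, assuming the interface described in this
       series (cond_func, func, info, wait, gfp_flags); the per-cpu flag and
       the function names are illustrative:

       static DEFINE_PER_CPU(int, pending_work);

       /* Predicate: should this CPU receive an IPI? */
       static bool cpu_has_pending(int cpu, void *info)
       {
               return per_cpu(pending_work, cpu) != 0;
       }

       /* Runs on each selected CPU, in IPI context. */
       static void flush_pending_ipi(void *info)
       {
               __this_cpu_write(pending_work, 0);
       }

       static void flush_pending_on_busy_cpus(void)
       {
               on_each_cpu_cond(cpu_has_pending, flush_pending_ipi, NULL,
                                true, GFP_ATOMIC);
       }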
      
      [akpm@linux-foundation.org: s/gfpflags/gfp_flags/]
      [akpm@linux-foundation.org: avoid double-evaluation of `info' (per Michal), parenthesise evaluation of `cond_func']
      [akpm@linux-foundation.org: s/CPU/CPUs, use all 80 cols in comment]
       Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
       Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Sasha Levin <levinsasha928@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Avi Kivity <avi@redhat.com>
       Acked-by: Michal Nazarewicz <mina86@mina86.org>
      Cc: Kosaki Motohiro <kosaki.motohiro@gmail.com>
      Cc: Milton Miller <miltonm@bga.com>
       Reviewed-by: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b3a7e98e
    • G
      smp: introduce a generic on_each_cpu_mask() function · 3fc498f1
       Committed by Gilad Ben-Yossef
       We have lots of infrastructure in place to partition multi-core systems
       such that we have a group of CPUs that are dedicated to a specific task:
       cgroups, scheduler and interrupt affinity, and the cpuisol= boot parameter.
      Still, kernel code will at times interrupt all CPUs in the system via IPIs
      for various needs.  These IPIs are useful and cannot be avoided
      altogether, but in certain cases it is possible to interrupt only specific
      CPUs that have useful work to do and not the entire system.
      
      This patch set, inspired by discussions with Peter Zijlstra and Frederic
      Weisbecker when testing the nohz task patch set, is a first stab at trying
      to explore doing this by locating the places where such global IPI calls
      are being made and turning the global IPI into an IPI for a specific group
      of CPUs.  The purpose of the patch set is to get feedback if this is the
      right way to go for dealing with this issue and indeed, if the issue is
      even worth dealing with at all.  Based on the feedback from this patch set
      I plan to offer further patches that address similar issue in other code
      paths.
      
       This patch creates an on_each_cpu_mask() and on_each_cpu_cond()
       infrastructure API (the former derived from existing arch-specific
       versions in Tile and Arm) and uses them to turn several global IPI
       invocations into per-CPU-group invocations.
      
      Core kernel:
      
      on_each_cpu_mask() calls a function on processors specified by cpumask,
      which may or may not include the local processor.
      
      You must not call this function with disabled interrupts or from a
      hardware interrupt handler or from a bottom half handler.
      
      arch/arm:
      
       Note that the generic version is a little different from the Arm one:
      
      1. It has the mask as first parameter
      2. It calls the function on the calling CPU with interrupts disabled,
         but this should be OK since the function is called on the other CPUs
         with interrupts disabled anyway.
      
      arch/tile:
      
       The API is the same as the tile-private one, but the generic version
       also calls the function on the calling CPU with interrupts disabled in
       the UP case.

       This is OK since the function is called on the other CPUs with
       interrupts disabled anyway.
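
       A kernel-context usage sketch of on_each_cpu_mask(); the callback and
       the choice of mask are illustrative:

       /* Per-cpu work; on remote CPUs this runs with interrupts disabled. */
       static void drain_local_state(void *info)
       {
               /* ... touch only this CPU's data ... */
       }

       /* IPI only the CPUs of one NUMA node instead of the whole machine. */
       static void drain_node(int node)
       {
               on_each_cpu_mask(cpumask_of_node(node), drain_local_state,
                                NULL, true);
       }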
       Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
       Reviewed-by: Christoph Lameter <cl@linux.com>
       Acked-by: Chris Metcalf <cmetcalf@tilera.com>
       Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Sasha Levin <levinsasha928@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Avi Kivity <avi@redhat.com>
       Acked-by: Michal Nazarewicz <mina86@mina86.org>
      Cc: Kosaki Motohiro <kosaki.motohiro@gmail.com>
      Cc: Milton Miller <miltonm@bga.com>
      Cc: Russell King <linux@arm.linux.org.uk>
       Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3fc498f1
    • D
      Remove all #inclusions of asm/system.h · 9ffc93f2
       Committed by David Howells
      Remove all #inclusions of asm/system.h preparatory to splitting and killing
      it.  Performed with the following command:
      
      perl -p -i -e 's!^#\s*include\s*<asm/system[.]h>.*\n!!' `grep -Irl '^#\s*include\s*<asm/system[.]h>' *`
       Signed-off-by: David Howells <dhowells@redhat.com>
      9ffc93f2
    • D
      Add #includes needed to permit the removal of asm/system.h · 96f951ed
       Committed by David Howells
      asm/system.h is a cause of circular dependency problems because it contains
      commonly used primitive stuff like barrier definitions and uncommonly used
      stuff like switch_to() that might require MMU definitions.
      
      asm/system.h has been disintegrated by this point on all arches into the
      following common segments:
      
       (1) asm/barrier.h
      
           Moved memory barrier definitions here.
      
       (2) asm/cmpxchg.h
      
           Moved xchg() and cmpxchg() here.  #included in asm/atomic.h.
      
       (3) asm/bug.h
      
           Moved die() and similar here.
      
       (4) asm/exec.h
      
           Moved arch_align_stack() here.
      
       (5) asm/elf.h
      
           Moved AT_VECTOR_SIZE_ARCH here.
      
       (6) asm/switch_to.h
      
           Moved switch_to() here.
       Signed-off-by: David Howells <dhowells@redhat.com>
      96f951ed
    • D
      Disintegrate asm/system.h for Sparc · d550bbd4
       Committed by David Howells
      Disintegrate asm/system.h for Sparc.
       Signed-off-by: David Howells <dhowells@redhat.com>
      cc: sparclinux@vger.kernel.org
      d550bbd4
  7. 28 March, 2012 - 2 commits
  8. 27 March, 2012 - 2 commits
    • M
      sched/rt: Improve pick_next_highest_task_rt() · 1b028abc
       Committed by Michael J Wang
      Avoid extra work by continuing on to the next rt_rq if the highest
      prio task in current rt_rq is the same priority as our candidate
      task.
      
      More detailed explanation:  if next is not NULL, then we have found a
      candidate task, and its priority is next->prio.  Now we are looking
      for an even higher priority task in the other rt_rq's.  idx is the
      highest priority in the current candidate rt_rq.  In the current 3.3
      code, if idx is equal to next->prio, we would start scanning the tasks
      in that rt_rq and replace the current candidate task with a task from
      that rt_rq.  But the new task would only have a priority that is equal
      to our previous candidate task, so we have not advanced our goal of
      finding a higher prio task.  So we should avoid the extra work by
      continuing on to the next rt_rq if idx is equal to next->prio.
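
       A kernel-context sketch of the early continue described above (the loop
       structure is simplified; remember that lower prio values mean higher
       priority):

       for_each_cpu(cpu, lowest_mask) {
               /* idx: highest queued rt priority on this CPU's rt_rq */
               if (next && next->prio <= idx)
                       continue;   /* nothing strictly better can be found here */
               /* ... otherwise scan this rt_rq and possibly update next ... */
       }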
       Signed-off-by: Michael J Wang <mjwang@broadcom.com>
       Acked-by: Steven Rostedt <rostedt@goodmis.org>
       Reviewed-by: Yong Zhang <yong.zhang0@gmail.com>
       Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
       Link: http://lkml.kernel.org/r/2EF88150C0EF2C43A218742ED384C1BC0FC83D6B@IRVEXCHMB08.corp.ad.broadcom.com
       Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1b028abc
    • P
      sched: Fix select_fallback_rq() vs cpu_active/cpu_online · 2baab4e9
       Committed by Peter Zijlstra
      Commit 5fbd036b ("sched: Cleanup cpu_active madness"), which was
      supposed to finally sort the cpu_active mess, instead uncovered more.
      
       Since CPU_STARTING is run before setting the cpu online, there's a
       (small) window where the cpu is active,!online.
      
      If during this time there's a wakeup of a task that used to reside on
      that cpu select_task_rq() will use select_fallback_rq() to compute an
      alternative cpu to run on since we find !online.
      
      select_fallback_rq() however will compute the new cpu against
      cpu_active, this means that it can return the same cpu it started out
      with, the !online one, since that cpu is in fact marked active.
      
       This results in us trying to schedule a task on an offline cpu and
       triggering a WARN in the IPI code.
      
       The solution proposed by Chuansheng Liu of setting cpu_active in
       set_cpu_online() is buggy: firstly, not all archs actually use
       set_cpu_online(); secondly, not all archs call set_cpu_online() with
       IRQs disabled.  This means we would introduce either the same race or
       the race from fd8a7de1 ("x86: cpu-hotplug: Prevent softirq wakeup on
       wrong CPU") -- albeit much narrower.
      
      [ By setting online first and active later we have a window of
        online,!active, fresh and bound kthreads have task_cpu() of 0 and
        since cpu0 isn't in tsk_cpus_allowed() we end up in
        select_fallback_rq() which excludes !active, resulting in a reset
        of ->cpus_allowed and the thread running all over the place. ]
      
      The solution is to re-work select_fallback_rq() to require active
      _and_ online. This makes the active,!online case work as expected,
      OTOH archs running CPU_STARTING after setting online are now
      vulnerable to the issue from fd8a7de1 -- these are alpha and
      blackfin.
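
       A kernel-context sketch of the reworked selection (simplified; the real
       select_fallback_rq() has further fallback stages):

       /* A CPU is an acceptable fallback only if it is online AND active. */
       static int pick_allowed_cpu_sketch(struct task_struct *p)
       {
               int dest_cpu;

               for_each_cpu(dest_cpu, tsk_cpus_allowed(p)) {
                       if (!cpu_online(dest_cpu) || !cpu_active(dest_cpu))
                               continue;
                       return dest_cpu;
               }
               return -1;      /* caller widens the allowed set and retries */
       }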
       Reported-by: Chuansheng Liu <chuansheng.liu@intel.com>
       Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: linux-alpha@vger.kernel.org
       Link: http://lkml.kernel.org/n/tip-hubqk1i10o4dpvlm06gq7v6j@git.kernel.org
       Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2baab4e9