1. 24 4月, 2012 1 次提交
  2. 12 4月, 2012 1 次提交
  3. 02 4月, 2012 12 次提交
    • T
      cgroup: make css->refcnt clearing on cgroup removal optional · 48ddbe19
      Tejun Heo 提交于
      Currently, cgroup removal tries to drain all css references.  If there
      are active css references, the removal logic waits and retries
      ->pre_detroy() until either all refs drop to zero or removal is
      cancelled.
      
      This semantics is unusual and adds non-trivial complexity to cgroup
      core and IMHO is fundamentally misguided in that it couples internal
      implementation details (references to internal data structure) with
      externally visible operation (rmdir).  To userland, this is a behavior
      peculiarity which is unnecessary and difficult to expect (css refs is
      otherwise invisible from userland), and, to policy implementations,
      this is an unnecessary restriction (e.g. blkcg wants to hold css refs
      for caching purposes but can't as that becomes visible as rmdir hang).
      
      Unfortunately, memcg currently depends on ->pre_destroy() retrials and
      cgroup removal vetoing and can't be immmediately switched to the new
      behavior.  This patch introduces the new behavior of not waiting for
      css refs to drain and maintains the old behavior for subsystems which
      have __DEPRECATED_clear_css_refs set.
      
      Once, memcg is updated, we can drop the code paths for the old
      behavior as proposed in the following patch.  Note that the following
      patch is incorrect in that dput work item is in cgroup and may lose
      some of dputs when multiples css's are released back-to-back, and
      __css_put() triggers check_for_release() when refcnt reaches 0 instead
      of 1; however, it shows what part can be removed.
      
        http://thread.gmane.org/gmane.linux.kernel.containers/22559/focus=75251
      
      Note that, in not-too-distant future, cgroup core will start emitting
      warning messages for subsys which require the old behavior, so please
      get moving.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      48ddbe19
    • T
      cgroup: use negative bias on css->refcnt to block css_tryget() · 28b4c27b
      Tejun Heo 提交于
      When a cgroup is about to be removed, cgroup_clear_css_refs() is
      called to check and ensure that there are no active css references.
      
      This is currently achieved by dropping the refcnt to zero iff it has
      only the base ref.  If all css refs could be dropped to zero, ref
      clearing is successful and CSS_REMOVED is set on all css.  If not, the
      base ref is restored.  While css ref is zero w/o CSS_REMOVED set, any
      css_tryget() attempt on it busy loops so that they are atomic
      w.r.t. the whole css ref clearing.
      
      This does work but dropping and re-instating the base ref is somewhat
      hairy and makes it difficult to add more logic to the put path as
      there are two of them - the regular css_put() and the reversible base
      ref clearing.
      
      This patch updates css ref clearing such that blocking new
      css_tryget() and putting the base ref are separate operations.
      CSS_DEACT_BIAS, defined as INT_MIN, is added to css->refcnt and
      css_tryget() busy loops while refcnt is negative.  After all css refs
      are deactivated, if they were all one, ref clearing succeeded and
      CSS_REMOVED is set and the base ref is put using the regular
      css_put(); otherwise, CSS_DEACT_BIAS is subtracted from the refcnts
      and the original postive values are restored.
      
      css_refcnt() accessor which always returns the unbiased positive
      reference counts is added and used to simplify refcnt usages.  While
      at it, relocate and reformat comments in cgroup_has_css_refs().
      
      This separates css->refcnt deactivation and putting the base ref,
      which enables the next patch to make ref clearing optional.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      28b4c27b
    • T
      cgroup: implement cgroup_rm_cftypes() · 79578621
      Tejun Heo 提交于
      Implement cgroup_rm_cftypes() which removes an array of cftypes from a
      subsystem.  It can be called whether the target subsys is attached or
      not.  cgroup core will remove the specified file from all existing
      cgroups.
      
      This will be used to improve sub-subsys modularity and will be helpful
      for unified hierarchy.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      79578621
    • T
      cgroup: introduce struct cfent · 05ef1d7c
      Tejun Heo 提交于
      This patch adds cfent (cgroup file entry) which is the association
      between a cgroup and a file.  This is in-cgroup representation of
      files under a cgroup directory.  This simplifies walking walking
      cgroup files and thus cgroup_clear_directory(), which is now
      implemented in two parts - cgroup_rm_file() and a loop around it.
      
      cgroup_rm_file() will be used to implement cftype removal and cfent is
      scheduled to serve cgroup specific per-file data (e.g. for sysfs-like
      "sever" semantics).
      
      v2: - cfe was freed from cgroup_rm_file() which led to use-after-free
            if the file had openers at the time of removal.  Moved to
            cgroup_diput().
      
          - cgroup_clear_directory() triggered WARN_ON_ONCE() if d_subdirs
            wasn't empty after removing all files.  This triggered
            spuriously if some files were open during directory clearing.
            Removed.
      
      v3: - In cgroup_diput(), WARN_ONCE(!list_empty(&cfe->node)) could be
            spuriously triggered for root cgroups because they don't go
            through cgroup_clear_directory() on unmount.  Don't trigger WARN
            for root cgroups.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      Cc: Glauber Costa <glommer@parallels.com>
      05ef1d7c
    • T
      cgroup: relocate __d_cgrp() and __d_cft() · f6ea9372
      Tejun Heo 提交于
      Move the two macros upwards as they'll be used earlier in the file.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      f6ea9372
    • T
      cgroup: remove cgroup_add_file[s]() · db0416b6
      Tejun Heo 提交于
      No controller is using cgroup_add_files[s]().  Unexport them, and
      convert cgroup_add_files() to handle NULL entry terminated array
      instead of taking count explicitly and continue creation on failure
      for internal use.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      db0416b6
    • T
      cgroup: convert all non-memcg controllers to the new cftype interface · 4baf6e33
      Tejun Heo 提交于
      Convert debug, freezer, cpuset, cpu_cgroup, cpuacct, net_prio, blkio,
      net_cls and device controllers to use the new cftype based interface.
      Termination entry is added to cftype arrays and populate callbacks are
      replaced with cgroup_subsys->base_cftypes initializations.
      
      This is functionally identical transformation.  There shouldn't be any
      visible behavior change.
      
      memcg is rather special and will be converted separately.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <paul@paulmenage.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      4baf6e33
    • T
      cgroup: merge cft_release_agent cftype array into the base files array · 6e6ff25b
      Tejun Heo 提交于
      Now that cftype can express whether a file should only be on root,
      cft_release_agent can be merged into the base files cftypes array.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      6e6ff25b
    • T
      cgroup: implement cgroup_add_cftypes() and friends · 8e3f6541
      Tejun Heo 提交于
      Currently, cgroup directories are populated by subsys->populate()
      callback explicitly creating files on each cgroup creation.  This
      level of flexibility isn't needed or desirable.  It provides largely
      unused flexibility which call for abuses while severely limiting what
      the core layer can do through the lack of structure and conventions.
      
      Per each cgroup file type, the only distinction that cgroup users is
      making is whether a cgroup is root or not, which can easily be
      expressed with flags.
      
      This patch introduces cgroup_add_cftypes().  These deal with cftypes
      instead of individual files - controllers indicate that certain types
      of files exist for certain subsystem.  Newly added CFTYPE_*_ON_ROOT
      flags indicate whether a cftype should be excluded or created only on
      the root cgroup.
      
      cgroup_add_cftypes() can be called any time whether the target
      subsystem is currently attached or not.  cgroup core will create files
      on the existing cgroups as necessary.
      
      Also, cgroup_subsys->base_cftypes is added to ease registration of the
      base files for the subsystem.  If non-NULL on subsys init, the cftypes
      pointed to by ->base_cftypes are automatically registered on subsys
      init / load.
      
      Further patches will convert the existing users and remove the file
      based interface.  Note that this interface allows dynamic addition of
      files to an active controller.  This will be used for sub-controller
      modularity and unified hierarchy in the longer term.
      
      This patch implements the new mechanism but doesn't apply it to any
      user.
      
      v2: replaced DECLARE_CGROUP_CFTYPES[_COND]() with
          cgroup_subsys->base_cftypes, which works better for cgroup_subsys
          which is loaded as module.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      8e3f6541
    • T
      cgroup: build list of all cgroups under a given cgroupfs_root · b0ca5a84
      Tejun Heo 提交于
      Build a list of all cgroups anchored at cgroupfs_root->allcg_list and
      going through cgroup->allcg_node.  The list is protected by
      cgroup_mutex and will be used to improve cgroup file handling.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      b0ca5a84
    • T
      cgroup: move cgroup_clear_directory() call out of cgroup_populate_dir() · ff4c8d50
      Tejun Heo 提交于
      cgroup_populate_dir() currently clears all files and then repopulate
      the directory; however, the clearing part is only useful when it's
      called from cgroup_remount().  Relocate the invocation to
      cgroup_remount().
      
      This is to prepare for further cgroup file handling updates.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      ff4c8d50
    • T
      cgroup: deprecate remount option changes · 8b5a5a9d
      Tejun Heo 提交于
      This patch marks the following features for deprecation.
      
      * Rebinding subsys by remount: Never reached useful state - only works
        on empty hierarchies.
      
      * release_agent update by remount: release_agent itself will be
        replaced with conventional fsnotify notification.
      
      v2: Lennart pointed out that "name=" is necessary for mounts w/o any
          controller attached.  Drop "name=" deprecation.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      Cc: Lennart Poettering <mzxreary@0pointer.de>
      8b5a5a9d
  4. 31 3月, 2012 2 次提交
  5. 30 3月, 2012 2 次提交
  6. 29 3月, 2012 15 次提交
    • K
      futex: Mark get_robust_list as deprecated · ec0c4274
      Kees Cook 提交于
      Notify get_robust_list users that the syscall is going away.
      Suggested-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Cc: Randy Dunlap <rdunlap@xenotime.net>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Serge E. Hallyn <serge.hallyn@canonical.com>
      Cc: kernel-hardening@lists.openwall.com
      Cc: spender@grsecurity.net
      Link: http://lkml.kernel.org/r/20120323190855.GA27213@www.outflux.netSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      ec0c4274
    • K
      futex: Do not leak robust list to unprivileged process · bdbb776f
      Kees Cook 提交于
      It was possible to extract the robust list head address from a setuid
      process if it had used set_robust_list(), allowing an ASLR info leak. This
      changes the permission checks to be the same as those used for similar
      info that comes out of /proc.
      
      Running a setuid program that uses robust futexes would have had:
        cred->euid != pcred->euid
        cred->euid == pcred->uid
      so the old permissions check would allow it. I'm not aware of any setuid
      programs that use robust futexes, so this is just a preventative measure.
      
      (This patch is based on changes from grsecurity.)
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Serge E. Hallyn <serge.hallyn@canonical.com>
      Cc: kernel-hardening@lists.openwall.com
      Cc: spender@grsecurity.net
      Link: http://lkml.kernel.org/r/20120319231253.GA20893@www.outflux.netSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      bdbb776f
    • P
      genirq: Respect NUMA node affinity in setup_irq_irq affinity() · 241fc640
      Prarit Bhargava 提交于
      We respect node affinity of devices already in the irq descriptor
      allocation, but we ignore it for the initial interrupt affinity
      setup, so the interrupt might be routed to a different node.
      
      Restrict the default affinity mask to the node on which the irq
      descriptor is allocated.
      
      [ tglx: Massaged changelog ]
      Signed-off-by: NPrarit Bhargava <prarit@redhat.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Link: http://lkml.kernel.org/r/1332788538-17425-1-git-send-email-prarit@redhat.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      241fc640
    • A
      genirq: Get rid of unneeded force parameter in irq_finalize_oneshot() · f3f79e38
      Alexander Gordeev 提交于
      The only place irq_finalize_oneshot() is called with force parameter set
      is the threaded handler error exit path. But IRQTF_RUNTHREAD is dropped
      at this point and irq_wake_thread() is not going to set it again,
      since PF_EXITING is set for this thread already. So irq_finalize_oneshot()
      will drop the threads bit in threads_oneshot anyway and hence the force
      parameter is superfluous.
      Signed-off-by: NAlexander Gordeev <agordeev@redhat.com>
      Link: http://lkml.kernel.org/r/20120321162234.GP24806@dhcp-26-207.brq.redhat.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      f3f79e38
    • A
      genirq: Minor readablity improvement in irq_wake_thread() · 69592db2
      Alexander Gordeev 提交于
      exit_irq_thread() clears IRQTF_RUNTHREAD flag and drops the thread's bit in
      desc->threads_oneshot then. The bit must not be set again in between and it
      does not, since irq_wake_thread() sees PF_EXITING flag first and returns.
      
      Due to above the order or checking PF_EXITING and IRQTF_RUNTHREAD flags in
      irq_wake_thread() is important. This change just makes it more visible in the
      source code.
      Signed-off-by: NAlexander Gordeev <agordeev@redhat.com>
      Link: http://lkml.kernel.org/r/20120321162212.GO24806@dhcp-26-207.brq.redhat.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      69592db2
    • S
      sched: Fix __schedule_bug() output when called from an interrupt · 6135fc1e
      Stephen Boyd 提交于
      If schedule is called from an interrupt handler __schedule_bug()
      will call show_regs() with the registers saved during the
      interrupt handling done in do_IRQ(). This means we'll see the
      registers and the backtrace for the process that was interrupted
      and not the full backtrace explaining who called schedule().
      
      This is due to 838225b4 ("sched: use show_regs() to improve
      __schedule_bug() output", 2007-10-24) which improperly assumed
      that get_irq_regs() would return the registers for the current
      stack because it is being called from within an interrupt
      handler. Simply remove the show_reg() code so that we dump a
      backtrace for the interrupt handler that called schedule().
      
      [ I ran across this when I was presented with a scheduling while
        atomic log with a stacktrace pointing at spin_unlock_irqrestore().
        It made no sense and I had to guess what interrupt handler could
        be called and poke around for someone calling schedule() in an
        interrupt handler. A simple test of putting an msleep() in
        an interrupt handler works better with this patch because you
        can actually see the msleep() call in the backtrace. ]
      Also-reported-by: NChris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: NStephen Boyd <sboyd@codeaurora.org>
      Cc: Satyam Sharma <satyam@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1332979847-27102-1-git-send-email-sboyd@codeaurora.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6135fc1e
    • D
      pidns: add reboot_pid_ns() to handle the reboot syscall · cf3f8921
      Daniel Lezcano 提交于
      In the case of a child pid namespace, rebooting the system does not really
      makes sense.  When the pid namespace is used in conjunction with the other
      namespaces in order to create a linux container, the reboot syscall leads
      to some problems.
      
      A container can reboot the host.  That can be fixed by dropping the
      sys_reboot capability but we are unable to correctly to poweroff/
      halt/reboot a container and the container stays stuck at the shutdown time
      with the container's init process waiting indefinitively.
      
      After several attempts, no solution from userspace was found to reliabily
      handle the shutdown from a container.
      
      This patch propose to make the init process of the child pid namespace to
      exit with a signal status set to : SIGINT if the child pid namespace
      called "halt/poweroff" and SIGHUP if the child pid namespace called
      "reboot".  When the reboot syscall is called and we are not in the initial
      pid namespace, we kill the pid namespace for "HALT", "POWEROFF",
      "RESTART", and "RESTART2".  Otherwise we return EINVAL.
      
      Returning EINVAL is also an easy way to check if this feature is supported
      by the kernel when invoking another 'reboot' option like CAD.
      
      By this way the parent process of the child pid namespace knows if it
      rebooted or not and can take the right decision.
      
      Test case:
      ==========
      
      #include <alloca.h>
      #include <stdio.h>
      #include <sched.h>
      #include <unistd.h>
      #include <signal.h>
      #include <sys/reboot.h>
      #include <sys/types.h>
      #include <sys/wait.h>
      
      #include <linux/reboot.h>
      
      static int do_reboot(void *arg)
      {
              int *cmd = arg;
      
              if (reboot(*cmd))
                      printf("failed to reboot(%d): %m\n", *cmd);
      }
      
      int test_reboot(int cmd, int sig)
      {
              long stack_size = 4096;
              void *stack = alloca(stack_size) + stack_size;
              int status;
              pid_t ret;
      
              ret = clone(do_reboot, stack, CLONE_NEWPID | SIGCHLD, &cmd);
              if (ret < 0) {
                      printf("failed to clone: %m\n");
                      return -1;
              }
      
              if (wait(&status) < 0) {
                      printf("unexpected wait error: %m\n");
                      return -1;
              }
      
              if (!WIFSIGNALED(status)) {
                      printf("child process exited but was not signaled\n");
                      return -1;
              }
      
              if (WTERMSIG(status) != sig) {
                      printf("signal termination is not the one expected\n");
                      return -1;
              }
      
              return 0;
      }
      
      int main(int argc, char *argv[])
      {
              int status;
      
              status = test_reboot(LINUX_REBOOT_CMD_RESTART, SIGHUP);
              if (status < 0)
                      return 1;
              printf("reboot(LINUX_REBOOT_CMD_RESTART) succeed\n");
      
              status = test_reboot(LINUX_REBOOT_CMD_RESTART2, SIGHUP);
              if (status < 0)
                      return 1;
              printf("reboot(LINUX_REBOOT_CMD_RESTART2) succeed\n");
      
              status = test_reboot(LINUX_REBOOT_CMD_HALT, SIGINT);
              if (status < 0)
                      return 1;
              printf("reboot(LINUX_REBOOT_CMD_HALT) succeed\n");
      
              status = test_reboot(LINUX_REBOOT_CMD_POWER_OFF, SIGINT);
              if (status < 0)
                      return 1;
              printf("reboot(LINUX_REBOOT_CMD_POWERR_OFF) succeed\n");
      
              status = test_reboot(LINUX_REBOOT_CMD_CAD_ON, -1);
              if (status >= 0) {
                      printf("reboot(LINUX_REBOOT_CMD_CAD_ON) should have failed\n");
                      return 1;
              }
              printf("reboot(LINUX_REBOOT_CMD_CAD_ON) has failed as expected\n");
      
              return 0;
      }
      
      [akpm@linux-foundation.org: tweak and add comments]
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: NDaniel Lezcano <daniel.lezcano@free.fr>
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Tested-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cf3f8921
    • A
      sysctl: use bitmap library functions · 5a04cca6
      Akinobu Mita 提交于
      Use bitmap_set() instead of using set_bit() for each bit.  This conversion
      is valid because the bitmap is private in the function call and atomic
      bitops were unnecessary.
      
      This also includes minor change.
      - Use bitmap_copy() for shorter typing
      Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5a04cca6
    • Z
      kexec: add further check to crashkernel · eaa3be6a
      Zhenzhong Duan 提交于
      When using crashkernel=2M-256M, the kernel doesn't give any warning.  This
      is misleading sometimes.
      Signed-off-by: NZhenzhong Duan <zhenzhong.duan@oracle.com>
      Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eaa3be6a
    • W
      kexec: crash: don't save swapper_pg_dir for !CONFIG_MMU configurations · d034cfab
      Will Deacon 提交于
      nommu platforms don't have very interesting swapper_pg_dir pointers and
      usually just #define them to NULL, meaning that we can't include them in
      the vmcoreinfo on the kexec crash path.
      
      This patch only saves the swapper_pg_dir if we have an MMU.
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Reviewed-by: NSimon Horman <horms@verge.net.au>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d034cfab
    • G
      smp: add func to IPI cpus based on parameter func · b3a7e98e
      Gilad Ben-Yossef 提交于
      Add the on_each_cpu_cond() function that wraps on_each_cpu_mask() and
      calculates the cpumask of cpus to IPI by calling a function supplied as a
      parameter in order to determine whether to IPI each specific cpu.
      
      The function works around allocation failure of cpumask variable in
      CONFIG_CPUMASK_OFFSTACK=y by itereating over cpus sending an IPI a time
      via smp_call_function_single().
      
      The function is useful since it allows to seperate the specific code that
      decided in each case whether to IPI a specific cpu for a specific request
      from the common boilerplate code of handling creating the mask, handling
      failures etc.
      
      [akpm@linux-foundation.org: s/gfpflags/gfp_flags/]
      [akpm@linux-foundation.org: avoid double-evaluation of `info' (per Michal), parenthesise evaluation of `cond_func']
      [akpm@linux-foundation.org: s/CPU/CPUs, use all 80 cols in comment]
      Signed-off-by: NGilad Ben-Yossef <gilad@benyossef.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Sasha Levin <levinsasha928@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Avi Kivity <avi@redhat.com>
      Acked-by: NMichal Nazarewicz <mina86@mina86.org>
      Cc: Kosaki Motohiro <kosaki.motohiro@gmail.com>
      Cc: Milton Miller <miltonm@bga.com>
      Reviewed-by: N"Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b3a7e98e
    • G
      smp: introduce a generic on_each_cpu_mask() function · 3fc498f1
      Gilad Ben-Yossef 提交于
      We have lots of infrastructure in place to partition multi-core systems
      such that we have a group of CPUs that are dedicated to specific task:
      cgroups, scheduler and interrupt affinity, and cpuisol= boot parameter.
      Still, kernel code will at times interrupt all CPUs in the system via IPIs
      for various needs.  These IPIs are useful and cannot be avoided
      altogether, but in certain cases it is possible to interrupt only specific
      CPUs that have useful work to do and not the entire system.
      
      This patch set, inspired by discussions with Peter Zijlstra and Frederic
      Weisbecker when testing the nohz task patch set, is a first stab at trying
      to explore doing this by locating the places where such global IPI calls
      are being made and turning the global IPI into an IPI for a specific group
      of CPUs.  The purpose of the patch set is to get feedback if this is the
      right way to go for dealing with this issue and indeed, if the issue is
      even worth dealing with at all.  Based on the feedback from this patch set
      I plan to offer further patches that address similar issue in other code
      paths.
      
      This patch creates an on_each_cpu_mask() and on_each_cpu_cond()
      infrastructure API (the former derived from existing arch specific
      versions in Tile and Arm) and uses them to turn several global IPI
      invocation to per CPU group invocations.
      
      Core kernel:
      
      on_each_cpu_mask() calls a function on processors specified by cpumask,
      which may or may not include the local processor.
      
      You must not call this function with disabled interrupts or from a
      hardware interrupt handler or from a bottom half handler.
      
      arch/arm:
      
      Note that the generic version is a little different then the Arm one:
      
      1. It has the mask as first parameter
      2. It calls the function on the calling CPU with interrupts disabled,
         but this should be OK since the function is called on the other CPUs
         with interrupts disabled anyway.
      
      arch/tile:
      
      The API is the same as the tile private one, but the generic version
      also calls the function on the with interrupts disabled in UP case
      
      This is OK since the function is called on the other CPUs
      with interrupts disabled.
      Signed-off-by: NGilad Ben-Yossef <gilad@benyossef.com>
      Reviewed-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NChris Metcalf <cmetcalf@tilera.com>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Sasha Levin <levinsasha928@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Avi Kivity <avi@redhat.com>
      Acked-by: NMichal Nazarewicz <mina86@mina86.org>
      Cc: Kosaki Motohiro <kosaki.motohiro@gmail.com>
      Cc: Milton Miller <miltonm@bga.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3fc498f1
    • D
      Remove all #inclusions of asm/system.h · 9ffc93f2
      David Howells 提交于
      Remove all #inclusions of asm/system.h preparatory to splitting and killing
      it.  Performed with the following command:
      
      perl -p -i -e 's!^#\s*include\s*<asm/system[.]h>.*\n!!' `grep -Irl '^#\s*include\s*<asm/system[.]h>' *`
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      9ffc93f2
    • D
      Add #includes needed to permit the removal of asm/system.h · 96f951ed
      David Howells 提交于
      asm/system.h is a cause of circular dependency problems because it contains
      commonly used primitive stuff like barrier definitions and uncommonly used
      stuff like switch_to() that might require MMU definitions.
      
      asm/system.h has been disintegrated by this point on all arches into the
      following common segments:
      
       (1) asm/barrier.h
      
           Moved memory barrier definitions here.
      
       (2) asm/cmpxchg.h
      
           Moved xchg() and cmpxchg() here.  #included in asm/atomic.h.
      
       (3) asm/bug.h
      
           Moved die() and similar here.
      
       (4) asm/exec.h
      
           Moved arch_align_stack() here.
      
       (5) asm/elf.h
      
           Moved AT_VECTOR_SIZE_ARCH here.
      
       (6) asm/switch_to.h
      
           Moved switch_to() here.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      96f951ed
    • D
      Disintegrate asm/system.h for Sparc · d550bbd4
      David Howells 提交于
      Disintegrate asm/system.h for Sparc.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      cc: sparclinux@vger.kernel.org
      d550bbd4
  7. 28 3月, 2012 2 次提交
  8. 27 3月, 2012 2 次提交
    • M
      sched/rt: Improve pick_next_highest_task_rt() · 1b028abc
      Michael J Wang 提交于
      Avoid extra work by continuing on to the next rt_rq if the highest
      prio task in current rt_rq is the same priority as our candidate
      task.
      
      More detailed explanation:  if next is not NULL, then we have found a
      candidate task, and its priority is next->prio.  Now we are looking
      for an even higher priority task in the other rt_rq's.  idx is the
      highest priority in the current candidate rt_rq.  In the current 3.3
      code, if idx is equal to next->prio, we would start scanning the tasks
      in that rt_rq and replace the current candidate task with a task from
      that rt_rq.  But the new task would only have a priority that is equal
      to our previous candidate task, so we have not advanced our goal of
      finding a higher prio task.  So we should avoid the extra work by
      continuing on to the next rt_rq if idx is equal to next->prio.
      Signed-off-by: NMichael J Wang <mjwang@broadcom.com>
      Acked-by: NSteven Rostedt <rostedt@goodmis.org>
      Reviewed-by: NYong Zhang <yong.zhang0@gmail.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/2EF88150C0EF2C43A218742ED384C1BC0FC83D6B@IRVEXCHMB08.corp.ad.broadcom.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      1b028abc
    • P
      sched: Fix select_fallback_rq() vs cpu_active/cpu_online · 2baab4e9
      Peter Zijlstra 提交于
      Commit 5fbd036b ("sched: Cleanup cpu_active madness"), which was
      supposed to finally sort the cpu_active mess, instead uncovered more.
      
      Since CPU_STARTING is ran before setting the cpu online, there's a
      (small) window where the cpu has active,!online.
      
      If during this time there's a wakeup of a task that used to reside on
      that cpu select_task_rq() will use select_fallback_rq() to compute an
      alternative cpu to run on since we find !online.
      
      select_fallback_rq() however will compute the new cpu against
      cpu_active, this means that it can return the same cpu it started out
      with, the !online one, since that cpu is in fact marked active.
      
      This results in us trying to scheduling a task on an offline cpu and
      triggering a WARN in the IPI code.
      
      The solution proposed by Chuansheng Liu of setting cpu_active in
      set_cpu_online() is buggy, firstly not all archs actually use
      set_cpu_online(), secondly, not all archs call set_cpu_online() with
      IRQs disabled, this means we would introduce either the same race or
      the race from fd8a7de1 ("x86: cpu-hotplug: Prevent softirq wakeup on
      wrong CPU") -- albeit much narrower.
      
      [ By setting online first and active later we have a window of
        online,!active, fresh and bound kthreads have task_cpu() of 0 and
        since cpu0 isn't in tsk_cpus_allowed() we end up in
        select_fallback_rq() which excludes !active, resulting in a reset
        of ->cpus_allowed and the thread running all over the place. ]
      
      The solution is to re-work select_fallback_rq() to require active
      _and_ online. This makes the active,!online case work as expected,
      OTOH archs running CPU_STARTING after setting online are now
      vulnerable to the issue from fd8a7de1 -- these are alpha and
      blackfin.
      Reported-by: NChuansheng Liu <chuansheng.liu@intel.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: linux-alpha@vger.kernel.org
      Link: http://lkml.kernel.org/n/tip-hubqk1i10o4dpvlm06gq7v6j@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      2baab4e9
  9. 26 3月, 2012 3 次提交
    • S
      module: Remove module size limit · f946eeb9
      Sasha Levin 提交于
      Module size was limited to 64MB, this was legacy limitation due to vmalloc()
      which was removed a while ago.
      
      Limiting module size to 64MB is both pointless and affects real world use
      cases.
      
      Cc: Tim Abbott <tim.abbott@oracle.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      f946eeb9
    • S
      module: move __module_get and try_module_get() out of line. · d53799be
      Steven Rostedt 提交于
      With the preempt, tracepoint and everything, it's getting a bit
      chubby.  For an Ubuntu-based config:
      
      Before:
      	$ size -t `find * -name '*.ko'` | grep TOTAL
      	56199906        3870760	1606616	61677282	3ad1ee2	(TOTALS)
      	$ size vmlinux
      	   text	   data	    bss	    dec	    hex	filename
      	8509342	 850368	3358720	12718430	 c2115e	vmlinux
      
      After:
      	$ size -t `find * -name '*.ko'` | grep TOTAL
      	56183760	3867892	1606616	61658268	3acd49c	(TOTALS)
      	$ size vmlinux
      	   text	   data	    bss	    dec	    hex	filename
      	8501842	 849088	3358720	12709650	 c1ef12	vmlinux
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (made all out-of-line)
      d53799be
    • P
      params: <level>_initcall-like kernel parameters · 026cee00
      Pawel Moll 提交于
      This patch adds a set of macros that can be used to declare
      kernel parameters to be parsed _before_ initcalls at a chosen
      level are executed.  We rename the now-unused "flags" field of
      struct kernel_param as the level.  It's signed, for when we
      use this for early params as well, in future.
      
      Linker macro collating init calls had to be modified in order
      to add additional symbols between levels that are later used
      by the init code to split the calls into blocks.
      Signed-off-by: NPawel Moll <pawel.moll@arm.com>
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      026cee00