    cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups · 77f88796
    Committed by Tejun Heo
    Creation of a kthread goes through a couple of interlocked stages
    between the kthread itself and its creator.  Once the new kthread
    starts running, it initializes itself and wakes up the creator.  The
    creator can then further configure the kthread and let it start doing
    its job by waking it up.
    
    In this configuration-by-creator stage, the creator is the only one
    that can wake it up but the kthread is visible to userland.  When
    altering the kthread's attributes from userland is allowed, this is
    fine; however, for cases where CPU affinity is critical,
    kthread_bind() is used to first disable affinity changes from userland
    and then set the affinity.  This also prevents the kthread from being
    migrated into non-root cgroups as that can affect the CPU affinity and
    many other things.
    
    Unfortunately, the cgroup side of protection is racy.  While the
    PF_NO_SETAFFINITY flag prevents further migrations, userland can win
    the race before the creator sets the flag with kthread_bind() and put
    the kthread in a non-root cgroup, which can lead to all sorts of
    problems including incorrect CPU affinity and starvation.
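
The ordering problem can be illustrated with a small user-space model.
This is only a sketch, not kernel code: struct model_task and the
model_*() helpers are hypothetical names invented for illustration, and
the flag value is just a placeholder for the model.  Only the names
PF_NO_SETAFFINITY and kthread_bind() come from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Model flag; the name comes from the text above, the value is a placeholder. */
#define PF_NO_SETAFFINITY 0x04000000u

/* Hypothetical stand-in for the few task_struct fields that matter here. */
struct model_task {
    unsigned int flags;
    bool in_root_cgroup;
};

/*
 * Userland asks to move the task to a non-root cgroup.  In this model
 * the request is refused only once PF_NO_SETAFFINITY is already set.
 */
static bool model_migrate(struct model_task *t)
{
    assert(t != NULL);
    if (t->flags & PF_NO_SETAFFINITY)
        return false;
    t->in_root_cgroup = false;
    return true;
}

/* Creator side: the part of kthread_bind() that this model cares about. */
static void model_kthread_bind(struct model_task *t)
{
    assert(t != NULL);
    t->flags |= PF_NO_SETAFFINITY;
}
```

If model_kthread_bind() runs before model_migrate(), the migration is
refused and the task stays in the root cgroup.  If model_migrate() runs
first, it succeeds, and the later model_kthread_bind() sets the flag on
a task that has already left the root cgroup, which is the racy case.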
    
    This bug got triggered by userland which periodically tries to migrate
    all processes in the root cpuset cgroup to a non-root one.  Per-cpu
    workqueue workers got caught while being created and ended up with
    incorrect CPU affinity, breaking concurrency management and sometimes
    stalling workqueue execution.
    
    This patch adds task->no_cgroup_migration, which prevents the task
    from being migrated by userland.  kthreadd starts with the flag set,
    making every child kthread start in the root cgroup with migration
    disallowed.  The flag is cleared after the kthread finishes
    initialization, by which time PF_NO_SETAFFINITY is set if the kthread
    should stay in the root cgroup.
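
A user-space sketch of the scheme described above, under stated
assumptions: struct model_task and the model_*() helpers are
hypothetical names used only for illustration, the flag value is a
placeholder, and inheritance of the bit from kthreadd is modelled by
simply creating the child with the bit set.  It is not the patch itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Model flag; the name comes from the text above, the value is a placeholder. */
#define PF_NO_SETAFFINITY 0x04000000u

/* Hypothetical stand-in for the relevant task_struct fields. */
struct model_task {
    unsigned int flags;
    unsigned int no_cgroup_migration:1;  /* bit field named in the text above */
    bool in_root_cgroup;
};

/* Model of a child of kthreadd: starts in the root cgroup with the bit set. */
static struct model_task model_fork_from_kthreadd(void)
{
    struct model_task child = {
        .flags = 0,
        .no_cgroup_migration = 1,
        .in_root_cgroup = true,
    };
    return child;
}

/* Userland migration request: refused if either guard is active. */
static bool model_migrate(struct model_task *t)
{
    assert(t != NULL);
    if (t->no_cgroup_migration || (t->flags & PF_NO_SETAFFINITY))
        return false;
    t->in_root_cgroup = false;
    return true;
}

/* Creator pins the kthread during the configuration stage. */
static void model_kthread_bind(struct model_task *t)
{
    assert(t != NULL);
    t->flags |= PF_NO_SETAFFINITY;
}

/* Initialization is finished: drop the temporary guard. */
static void model_kthread_ready(struct model_task *t)
{
    assert(t != NULL);
    t->no_cgroup_migration = 0;
}
```

In this model a migration attempted during initialization always fails.
After model_kthread_ready(), a kthread that went through
model_kthread_bind() is still protected by PF_NO_SETAFFINITY, while an
unbound kthread can be migrated as before.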
    
    It'd be better to wait for the initialization instead of failing but I
    couldn't think of a way of implementing that without adding either a
    new PF flag, or sleeping and retrying from the waiting side.  Even if
    userland depends on changing cgroup membership of a kthread, it either
    has to be synchronized with kthread_create() or periodically repeat,
    so it's unlikely that this would break anything.
    
    v2: Switch to a simpler implementation using a new task_struct bit
        field suggested by Oleg.
    Signed-off-by: Tejun Heo <tj@kernel.org>
    Suggested-by: Oleg Nesterov <oleg@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Reported-and-debugged-by: Chris Mason <clm@fb.com>
    Cc: stable@vger.kernel.org # v4.3+ (we can't close the race on < v4.3)
    Signed-off-by: Tejun Heo <tj@kernel.org>