1. 10 8月, 2017 7 次提交
    • P
      cpuset: Make nr_cpusets private · be040bea
      Paolo Bonzini 提交于
      Any use of key->enabled (that is static_key_enabled and static_key_count)
      outside jump_label_lock should handle its own serialization.  In the case
      of cpusets_enabled_key, the key is always incremented/decremented under
      cpuset_mutex, and hence the same rule applies to nr_cpusets.  The rule
      *is* respected currently, but the mutex is static so nr_cpusets should
      be static too.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1501601046-35683-4-git-send-email-pbonzini@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      be040bea
    • P
      jump_label: Fix concurrent static_key_enable/disable() · 1dbb6704
      Paolo Bonzini 提交于
      static_key_enable/disable are trying to cap the static key count to
      0/1.  However, their use of key->enabled is outside jump_label_lock
      so they do not really ensure that.
      
      Rewrite them to do a quick check for an already enabled (respectively,
      already disabled), and then recheck under the jump label lock.  Unlike
      static_key_slow_inc/dec, a failed check under the jump label lock does
      not modify key->enabled.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1501601046-35683-2-git-send-email-pbonzini@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      1dbb6704
    • K
      locking/rwsem-xadd: Add killable versions of rwsem_down_read_failed() · 83ced169
      Kirill Tkhai 提交于
      Rename rwsem_down_read_failed() in __rwsem_down_read_failed_common()
      and teach it to abort waiting in case of pending signals and killable
      state argument passed.
      
      Note, that we shouldn't wake anybody up in EINTR path, as:
      
      We check for (waiter.task) under spinlock before we go to out_nolock
      path. Current task wasn't able to be woken up, so there are
      a writer, owning the sem, or a writer, which is the first waiter.
      In the both cases we shouldn't wake anybody. If there is a writer,
      owning the sem, and we were the only waiter, remove RWSEM_WAITING_BIAS,
      as there are no waiters anymore.
      Signed-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: arnd@arndb.de
      Cc: avagin@virtuozzo.com
      Cc: davem@davemloft.net
      Cc: fenghua.yu@intel.com
      Cc: gorcunov@virtuozzo.com
      Cc: heiko.carstens@de.ibm.com
      Cc: hpa@zytor.com
      Cc: ink@jurassic.park.msu.ru
      Cc: mattst88@gmail.com
      Cc: rth@twiddle.net
      Cc: schwidefsky@de.ibm.com
      Cc: tony.luck@intel.com
      Link: http://lkml.kernel.org/r/149789534632.9059.2901382369609922565.stgit@localhost.localdomainSigned-off-by: NIngo Molnar <mingo@kernel.org>
      83ced169
    • K
      locking/rwsem-spinlock: Add killable versions of __down_read() · 0aa1125f
      Kirill Tkhai 提交于
      Rename __down_read() in __down_read_common() and teach it
      to abort waiting in case of pending signals and killable
      state argument passed.
      
      Note, that we shouldn't wake anybody up in EINTR path, as:
      
      We check for signal_pending_state() after (!waiter.task)
      test and under spinlock. So, current task wasn't able to
      be woken up. It may be in two cases: a writer is owner
      of the sem, or a writer is a first waiter of the sem.
      
      If a writer is owner of the sem, no one else may work
      with it in parallel. It will wake somebody, when it
      call up_write() or downgrade_write().
      
      If a writer is the first waiter, it will be woken up,
      when the last active reader releases the sem, and
      sem->count became 0.
      
      Also note, that set_current_state() may be moved down
      to schedule() (after !waiter.task check), as all
      assignments in this type of semaphore (including wake_up),
      occur under spinlock, so we can't miss anything.
      Signed-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: arnd@arndb.de
      Cc: avagin@virtuozzo.com
      Cc: davem@davemloft.net
      Cc: fenghua.yu@intel.com
      Cc: gorcunov@virtuozzo.com
      Cc: heiko.carstens@de.ibm.com
      Cc: hpa@zytor.com
      Cc: ink@jurassic.park.msu.ru
      Cc: mattst88@gmail.com
      Cc: rth@twiddle.net
      Cc: schwidefsky@de.ibm.com
      Cc: tony.luck@intel.com
      Link: http://lkml.kernel.org/r/149789533283.9059.9829416940494747182.stgit@localhost.localdomainSigned-off-by: NIngo Molnar <mingo@kernel.org>
      0aa1125f
    • P
      locking/osq_lock: Fix osq_lock queue corruption · 50972fe7
      Prateek Sood 提交于
      Fix ordering of link creation between node->prev and prev->next in
      osq_lock(). A case in which the status of optimistic spin queue is
      CPU6->CPU2 in which CPU6 has acquired the lock.
      
              tail
                v
        ,-. <- ,-.
        |6|    |2|
        `-' -> `-'
      
      At this point if CPU0 comes in to acquire osq_lock, it will update the
      tail count.
      
        CPU2			CPU0
        ----------------------------------
      
      				       tail
      				         v
      			  ,-. <- ,-.    ,-.
      			  |6|    |2|    |0|
      			  `-' -> `-'    `-'
      
      After tail count update if CPU2 starts to unqueue itself from
      optimistic spin queue, it will find an updated tail count with CPU0 and
      update CPU2 node->next to NULL in osq_wait_next().
      
        unqueue-A
      
      	       tail
      	         v
        ,-. <- ,-.    ,-.
        |6|    |2|    |0|
        `-'    `-'    `-'
      
        unqueue-B
      
        ->tail != curr && !node->next
      
      If reordering of following stores happen then prev->next where prev
      being CPU2 would be updated to point to CPU0 node:
      
      				       tail
      				         v
      			  ,-. <- ,-.    ,-.
      			  |6|    |2|    |0|
      			  `-'    `-' -> `-'
      
        osq_wait_next()
          node->next <- 0
          xchg(node->next, NULL)
      
      	       tail
      	         v
        ,-. <- ,-.    ,-.
        |6|    |2|    |0|
        `-'    `-'    `-'
      
        unqueue-C
      
      At this point if next instruction
      	WRITE_ONCE(next->prev, prev);
      in CPU2 path is committed before the update of CPU0 node->prev = prev then
      CPU0 node->prev will point to CPU6 node.
      
      	       tail
          v----------. v
        ,-. <- ,-.    ,-.
        |6|    |2|    |0|
        `-'    `-'    `-'
           `----------^
      
      At this point if CPU0 path's node->prev = prev is committed resulting
      in change of CPU0 prev back to CPU2 node. CPU2 node->next is NULL
      currently,
      
      				       tail
      			                 v
      			  ,-. <- ,-. <- ,-.
      			  |6|    |2|    |0|
      			  `-'    `-'    `-'
      			     `----------^
      
      so if CPU0 gets into unqueue path of osq_lock it will keep spinning
      in infinite loop as condition prev->next == node will never be true.
      Signed-off-by: NPrateek Sood <prsood@codeaurora.org>
      [ Added pictures, rewrote comments. ]
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: sramana@codeaurora.org
      Link: http://lkml.kernel.org/r/1500040076-27626-1-git-send-email-prsood@codeaurora.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      50972fe7
    • B
      sched/wait: Remove the lockless swait_active() check in swake_up*() · 35a2897c
      Boqun Feng 提交于
      Steven Rostedt reported a potential race in RCU core because of
      swake_up():
      
              CPU0                            CPU1
              ----                            ----
                                      __call_rcu_core() {
      
                                       spin_lock(rnp_root)
                                       need_wake = __rcu_start_gp() {
                                        rcu_start_gp_advanced() {
                                         gp_flags = FLAG_INIT
                                        }
                                       }
      
       rcu_gp_kthread() {
         swait_event_interruptible(wq,
              gp_flags & FLAG_INIT) {
         spin_lock(q->lock)
      
                                      *fetch wq->task_list here! *
      
         list_add(wq->task_list, q->task_list)
         spin_unlock(q->lock);
      
         *fetch old value of gp_flags here *
      
                                       spin_unlock(rnp_root)
      
                                       rcu_gp_kthread_wake() {
                                        swake_up(wq) {
                                         swait_active(wq) {
                                          list_empty(wq->task_list)
      
                                         } * return false *
      
        if (condition) * false *
          schedule();
      
      In this case, a wakeup is missed, which could cause the rcu_gp_kthread
      waits for a long time.
      
      The reason of this is that we do a lockless swait_active() check in
      swake_up(). To fix this, we can either 1) add a smp_mb() in swake_up()
      before swait_active() to provide the proper order or 2) simply remove
      the swait_active() in swake_up().
      
      The solution 2 not only fixes this problem but also keeps the swait and
      wait API as close as possible, as wake_up() doesn't provide a full
      barrier and doesn't do a lockless check of the wait queue either.
      Moreover, there are users already using swait_active() to do their quick
      checks for the wait queues, so it make less sense that swake_up() and
      swake_up_all() do this on their own.
      
      This patch then removes the lockless swait_active() check in swake_up()
      and swake_up_all().
      Reported-by: NSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NBoqun Feng <boqun.feng@gmail.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Krister Johansen <kjlx@templeofstupid.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20170615041828.zk3a3sfyudm5p6nl@tardisSigned-off-by: NIngo Molnar <mingo@kernel.org>
      35a2897c
    • M
      futex: Remove unnecessary warning from get_futex_key · 48fb6f4d
      Mel Gorman 提交于
      Commit 65d8fc77 ("futex: Remove requirement for lock_page() in
      get_futex_key()") removed an unnecessary lock_page() with the
      side-effect that page->mapping needed to be treated very carefully.
      
      Two defensive warnings were added in case any assumption was missed and
      the first warning assumed a correct application would not alter a
      mapping backing a futex key.  Since merging, it has not triggered for
      any unexpected case but Mark Rutland reported the following bug
      triggering due to the first warning.
      
        kernel BUG at kernel/futex.c:679!
        Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
        Modules linked in:
        CPU: 0 PID: 3695 Comm: syz-executor1 Not tainted 4.13.0-rc3-00020-g307fec773ba3 #3
        Hardware name: linux,dummy-virt (DT)
        task: ffff80001e271780 task.stack: ffff000010908000
        PC is at get_futex_key+0x6a4/0xcf0 kernel/futex.c:679
        LR is at get_futex_key+0x6a4/0xcf0 kernel/futex.c:679
        pc : [<ffff00000821ac14>] lr : [<ffff00000821ac14>] pstate: 80000145
      
      The fact that it's a bug instead of a warning was due to an unrelated
      arm64 problem, but the warning itself triggered because the underlying
      mapping changed.
      
      This is an application issue but from a kernel perspective it's a
      recoverable situation and the warning is unnecessary so this patch
      removes the warning.  The warning may potentially be triggered with the
      following test program from Mark although it may be necessary to adjust
      NR_FUTEX_THREADS to be a value smaller than the number of CPUs in the
      system.
      
          #include <linux/futex.h>
          #include <pthread.h>
          #include <stdio.h>
          #include <stdlib.h>
          #include <sys/mman.h>
          #include <sys/syscall.h>
          #include <sys/time.h>
          #include <unistd.h>
      
          #define NR_FUTEX_THREADS 16
          pthread_t threads[NR_FUTEX_THREADS];
      
          void *mem;
      
          #define MEM_PROT  (PROT_READ | PROT_WRITE)
          #define MEM_SIZE  65536
      
          static int futex_wrapper(int *uaddr, int op, int val,
                                   const struct timespec *timeout,
                                   int *uaddr2, int val3)
          {
              syscall(SYS_futex, uaddr, op, val, timeout, uaddr2, val3);
          }
      
          void *poll_futex(void *unused)
          {
              for (;;) {
                  futex_wrapper(mem, FUTEX_CMP_REQUEUE_PI, 1, NULL, mem + 4, 1);
              }
          }
      
          int main(int argc, char *argv[])
          {
              int i;
      
              mem = mmap(NULL, MEM_SIZE, MEM_PROT,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
      
              printf("Mapping @ %p\n", mem);
      
              printf("Creating futex threads...\n");
      
              for (i = 0; i < NR_FUTEX_THREADS; i++)
                  pthread_create(&threads[i], NULL, poll_futex, NULL);
      
              printf("Flipping mapping...\n");
              for (;;) {
                  mmap(mem, MEM_SIZE, MEM_PROT,
                       MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, -1, 0);
              }
      
              return 0;
          }
      Reported-and-tested-by: NMark Rutland <mark.rutland@arm.com>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: stable@vger.kernel.org # 4.7+
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      48fb6f4d
  2. 07 8月, 2017 1 次提交
  3. 03 8月, 2017 2 次提交
    • D
      cpuset: fix a deadlock due to incomplete patching of cpusets_enabled() · 89affbf5
      Dima Zavin 提交于
      In codepaths that use the begin/retry interface for reading
      mems_allowed_seq with irqs disabled, there exists a race condition that
      stalls the patch process after only modifying a subset of the
      static_branch call sites.
      
      This problem manifested itself as a deadlock in the slub allocator,
      inside get_any_partial.  The loop reads mems_allowed_seq value (via
      read_mems_allowed_begin), performs the defrag operation, and then
      verifies the consistency of mem_allowed via the read_mems_allowed_retry
      and the cookie returned by xxx_begin.
      
      The issue here is that both begin and retry first check if cpusets are
      enabled via cpusets_enabled() static branch.  This branch can be
      rewritted dynamically (via cpuset_inc) if a new cpuset is created.  The
      x86 jump label code fully synchronizes across all CPUs for every entry
      it rewrites.  If it rewrites only one of the callsites (specifically the
      one in read_mems_allowed_retry) and then waits for the
      smp_call_function(do_sync_core) to complete while a CPU is inside the
      begin/retry section with IRQs off and the mems_allowed value is changed,
      we can hang.
      
      This is because begin() will always return 0 (since it wasn't patched
      yet) while retry() will test the 0 against the actual value of the seq
      counter.
      
      The fix is to use two different static keys: one for begin
      (pre_enable_key) and one for retry (enable_key).  In cpuset_inc(), we
      first bump the pre_enable key to ensure that cpuset_mems_allowed_begin()
      always return a valid seqcount if are enabling cpusets.  Similarly, when
      disabling cpusets via cpuset_dec(), we first ensure that callers of
      cpuset_mems_allowed_retry() will start ignoring the seqcount value
      before we let cpuset_mems_allowed_begin() return 0.
      
      The relevant stack traces of the two stuck threads:
      
        CPU: 1 PID: 1415 Comm: mkdir Tainted: G L  4.9.36-00104-g540c51286237 #4
        Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
        task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
        RIP: smp_call_function_many+0x1f9/0x260
        Call Trace:
          smp_call_function+0x3b/0x70
          on_each_cpu+0x2f/0x90
          text_poke_bp+0x87/0xd0
          arch_jump_label_transform+0x93/0x100
          __jump_label_update+0x77/0x90
          jump_label_update+0xaa/0xc0
          static_key_slow_inc+0x9e/0xb0
          cpuset_css_online+0x70/0x2e0
          online_css+0x2c/0xa0
          cgroup_apply_control_enable+0x27f/0x3d0
          cgroup_mkdir+0x2b7/0x420
          kernfs_iop_mkdir+0x5a/0x80
          vfs_mkdir+0xf6/0x1a0
          SyS_mkdir+0xb7/0xe0
          entry_SYSCALL_64_fastpath+0x18/0xad
      
        ...
      
        CPU: 2 PID: 1 Comm: init Tainted: G L  4.9.36-00104-g540c51286237 #4
        Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
        task: ffff8818087c0000 task.stack: ffffc90000030000
        RIP: int3+0x39/0x70
        Call Trace:
          <#DB> ? ___slab_alloc+0x28b/0x5a0
          <EOE> ? copy_process.part.40+0xf7/0x1de0
          __slab_alloc.isra.80+0x54/0x90
          copy_process.part.40+0xf7/0x1de0
          copy_process.part.40+0xf7/0x1de0
          kmem_cache_alloc_node+0x8a/0x280
          copy_process.part.40+0xf7/0x1de0
          _do_fork+0xe7/0x6c0
          _raw_spin_unlock_irq+0x2d/0x60
          trace_hardirqs_on_caller+0x136/0x1d0
          entry_SYSCALL_64_fastpath+0x5/0xad
          do_syscall_64+0x27/0x350
          SyS_clone+0x19/0x20
          do_syscall_64+0x60/0x350
          entry_SYSCALL64_slow_path+0x25/0x25
      
      Link: http://lkml.kernel.org/r/20170731040113.14197-1-dmitriyz@waymo.com
      Fixes: 46e700ab ("mm, page_alloc: remove unnecessary taking of a seqlock when cpusets are disabled")
      Signed-off-by: NDima Zavin <dmitriyz@waymo.com>
      Reported-by: NCliff Spradlin <cspradlin@waymo.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      89affbf5
    • K
      pid: kill pidhash_size in pidhash_init() · 27e37d84
      Kefeng Wang 提交于
      After commit 3d375d78 ("mm: update callers to use HASH_ZERO flag"),
      drop unused pidhash_size in pidhash_init().
      
      Link: http://lkml.kernel.org/r/1500389267-49222-1-git-send-email-wangkefeng.wang@huawei.comSigned-off-by: NKefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: NPavel Tatashin <Pasha.Tatashin@Oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      27e37d84
  4. 01 8月, 2017 2 次提交
  5. 30 7月, 2017 2 次提交
  6. 28 7月, 2017 1 次提交
    • M
      workqueue: Work around edge cases for calc of pool's cpumask · 1ad0f0a7
      Michael Bringmann 提交于
      There is an underlying assumption/trade-off in many layers of the Linux
      system that CPU <-> node mapping is static.  This is despite the presence
      of features like NUMA and 'hotplug' that support the dynamic addition/
      removal of fundamental system resources like CPUs and memory.  PowerPC
      systems, however, do provide extensive features for the dynamic change
      of resources available to a system.
      
      Currently, there is little or no synchronization protection around the
      updating of the CPU <-> node mapping, and the export/update of this
      information for other layers / modules.  In systems which can change
      this mapping during 'hotplug', like PowerPC, the information is changing
      underneath all layers that might reference it.
      
      This patch attempts to ensure that a valid, usable cpumask attribute
      is used by the workqueue infrastructure when setting up new resource
      pools.  It prevents a crash that has been observed when an 'empty'
      cpumask is passed along to the worker/task scheduling code.  It is
      intended as a temporary workaround until a more fundamental review and
      correction of the issue can be done.
      
      [With additions to the patch provided by Tejun Hao <tj@kernel.org>]
      Signed-off-by: NMichael Bringmann <mwb@linux.vnet.ibm.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      1ad0f0a7
  7. 27 7月, 2017 1 次提交
    • T
      genirq/cpuhotplug: Revert "Set force affinity flag on hotplug migration" · 83979133
      Thomas Gleixner 提交于
      That commit was part of the changes moving x86 to the generic CPU hotplug
      interrupt migration code. The force flag was required on x86 before the
      hierarchical irqdomain rework, but invoking set_affinity() with force=true
      stayed and had no side effects.
      
      At some point in the past, the force flag got repurposed to support the
      exynos timer interrupt affinity setting to a not yet online CPU, so the
      interrupt controller callback does not verify the supplied affinity mask
      against cpu_online_mask.
      
      Setting the flag in the CPU hotplug code causes the cpu online masking to
      be blocked on these irq controllers and results in potentially affining an
      interrupt to the CPU which is unplugged, i.e. instead of moving it away,
      it's just reassigned to it.
      
      As the force flags is not longer needed on x86, it's safe to revert that
      patch so the ARM irqchips which use the force flag work again.
      
      Add comments to that effect, so this won't happen again.
      
      Note: The online mask handling should be done in the generic code and the
      force flag and the masking in the irq chips removed all together, but
      that's not a change possible for 4.13. 
      
      Fixes: 77f85e66 ("genirq/cpuhotplug: Set force affinity flag on hotplug migration")
      Reported-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NWill Deacon <will.deacon@arm.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: LAK <linux-arm-kernel@lists.infradead.org>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1707271217590.3109@nanosSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      83979133
  8. 26 7月, 2017 1 次提交
    • T
      workqueue: implicit ordered attribute should be overridable · 0a94efb5
      Tejun Heo 提交于
      5c0338c6 ("workqueue: restore WQ_UNBOUND/max_active==1 to be
      ordered") automatically enabled ordered attribute for unbound
      workqueues w/ max_active == 1.  Because ordered workqueues reject
      max_active and some attribute changes, this implicit ordered mode
      broke cases where the user creates an unbound workqueue w/ max_active
      == 1 and later explicitly changes the related attributes.
      
      This patch distinguishes explicit and implicit ordered setting and
      overrides from attribute changes if implict.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Fixes: 5c0338c6 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")
      0a94efb5
  9. 25 7月, 2017 2 次提交
  10. 23 7月, 2017 1 次提交
  11. 21 7月, 2017 2 次提交
    • J
      perf/core: Fix locking for children siblings group read · 2aeb1883
      Jiri Olsa 提交于
      We're missing ctx lock when iterating children siblings
      within the perf_read path for group reading. Following
      race and crash can happen:
      
      User space doing read syscall on event group leader:
      
      T1:
        perf_read
          lock event->ctx->mutex
          perf_read_group
            lock leader->child_mutex
            __perf_read_group_add(child)
              list_for_each_entry(sub, &leader->sibling_list, group_entry)
      
      ---->   sub might be invalid at this point, because it could
              get removed via perf_event_exit_task_context in T2
      
      Child exiting and cleaning up its events:
      
      T2:
        perf_event_exit_task_context
          lock ctx->mutex
          list_for_each_entry_safe(child_event, next, &child_ctx->event_list,...
            perf_event_exit_event(child)
              lock ctx->lock
              perf_group_detach(child)
              unlock ctx->lock
      
      ---->   child is removed from sibling_list without any sync
              with T1 path above
      
              ...
              free_event(child)
      
      Before the child is removed from the leader's child_list,
      (and thus is omitted from perf_read_group processing), we
      need to ensure that perf_read_group touches child's
      siblings under its ctx->lock.
      
      Peter further notes:
      
      | One additional note; this bug got exposed by commit:
      |
      |   ba5213ae ("perf/core: Correct event creation with PERF_FORMAT_GROUP")
      |
      | which made it possible to actually trigger this code-path.
      Tested-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NJiri Olsa <jolsa@kernel.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: ba5213ae ("perf/core: Correct event creation with PERF_FORMAT_GROUP")
      Link: http://lkml.kernel.org/r/20170720141455.2106-1-jolsa@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      2aeb1883
    • D
      bpf: fix mixed signed/unsigned derived min/max value bounds · 4cabc5b1
      Daniel Borkmann 提交于
      Edward reported that there's an issue in min/max value bounds
      tracking when signed and unsigned compares both provide hints
      on limits when having unknown variables. E.g. a program such
      as the following should have been rejected:
      
         0: (7a) *(u64 *)(r10 -8) = 0
         1: (bf) r2 = r10
         2: (07) r2 += -8
         3: (18) r1 = 0xffff8a94cda93400
         5: (85) call bpf_map_lookup_elem#1
         6: (15) if r0 == 0x0 goto pc+7
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R10=fp
         7: (7a) *(u64 *)(r10 -16) = -8
         8: (79) r1 = *(u64 *)(r10 -16)
         9: (b7) r2 = -1
        10: (2d) if r1 > r2 goto pc+3
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=0
        R2=imm-1,max_value=18446744073709551615,min_align=1 R10=fp
        11: (65) if r1 s> 0x1 goto pc+2
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=0,max_value=1
        R2=imm-1,max_value=18446744073709551615,min_align=1 R10=fp
        12: (0f) r0 += r1
        13: (72) *(u8 *)(r0 +0) = 0
        R0=map_value_adj(ks=8,vs=8,id=0),min_value=0,max_value=1 R1=inv,min_value=0,max_value=1
        R2=imm-1,max_value=18446744073709551615,min_align=1 R10=fp
        14: (b7) r0 = 0
        15: (95) exit
      
      What happens is that in the first part ...
      
         8: (79) r1 = *(u64 *)(r10 -16)
         9: (b7) r2 = -1
        10: (2d) if r1 > r2 goto pc+3
      
      ... r1 carries an unsigned value, and is compared as unsigned
      against a register carrying an immediate. Verifier deduces in
      reg_set_min_max() that since the compare is unsigned and operation
      is greater than (>), that in the fall-through/false case, r1's
      minimum bound must be 0 and maximum bound must be r2. Latter is
      larger than the bound and thus max value is reset back to being
      'invalid' aka BPF_REGISTER_MAX_RANGE. Thus, r1 state is now
      'R1=inv,min_value=0'. The subsequent test ...
      
        11: (65) if r1 s> 0x1 goto pc+2
      
      ... is a signed compare of r1 with immediate value 1. Here,
      verifier deduces in reg_set_min_max() that since the compare
      is signed this time and operation is greater than (>), that
      in the fall-through/false case, we can deduce that r1's maximum
      bound must be 1, meaning with prior test, we result in r1 having
      the following state: R1=inv,min_value=0,max_value=1. Given that
      the actual value this holds is -8, the bounds are wrongly deduced.
      When this is being added to r0 which holds the map_value(_adj)
      type, then subsequent store access in above case will go through
      check_mem_access() which invokes check_map_access_adj(), that
      will then probe whether the map memory is in bounds based
      on the min_value and max_value as well as access size since
      the actual unknown value is min_value <= x <= max_value; commit
      fce366a9 ("bpf, verifier: fix alu ops against map_value{,
      _adj} register types") provides some more explanation on the
      semantics.
      
      It's worth to note in this context that in the current code,
      min_value and max_value tracking are used for two things, i)
      dynamic map value access via check_map_access_adj() and since
      commit 06c1c049 ("bpf: allow helpers access to variable memory")
      ii) also enforced at check_helper_mem_access() when passing a
      memory address (pointer to packet, map value, stack) and length
      pair to a helper and the length in this case is an unknown value
      defining an access range through min_value/max_value in that
      case. The min_value/max_value tracking is /not/ used in the
      direct packet access case to track ranges. However, the issue
      also affects case ii), for example, the following crafted program
      based on the same principle must be rejected as well:
      
         0: (b7) r2 = 0
         1: (bf) r3 = r10
         2: (07) r3 += -512
         3: (7a) *(u64 *)(r10 -16) = -8
         4: (79) r4 = *(u64 *)(r10 -16)
         5: (b7) r6 = -1
         6: (2d) if r4 > r6 goto pc+5
        R1=ctx R2=imm0,min_value=0,max_value=0,min_align=2147483648 R3=fp-512
        R4=inv,min_value=0 R6=imm-1,max_value=18446744073709551615,min_align=1 R10=fp
         7: (65) if r4 s> 0x1 goto pc+4
        R1=ctx R2=imm0,min_value=0,max_value=0,min_align=2147483648 R3=fp-512
        R4=inv,min_value=0,max_value=1 R6=imm-1,max_value=18446744073709551615,min_align=1
        R10=fp
         8: (07) r4 += 1
         9: (b7) r5 = 0
        10: (6a) *(u16 *)(r10 -512) = 0
        11: (85) call bpf_skb_load_bytes#26
        12: (b7) r0 = 0
        13: (95) exit
      
      Meaning, while we initialize the max_value stack slot that the
      verifier thinks we access in the [1,2] range, in reality we
      pass -7 as length which is interpreted as u32 in the helper.
      Thus, this issue is relevant also for the case of helper ranges.
      Resetting both bounds in check_reg_overflow() in case only one
      of them exceeds limits is also not enough as similar test can be
      created that uses values which are within range, thus also here
      learned min value in r1 is incorrect when mixed with later signed
      test to create a range:
      
         0: (7a) *(u64 *)(r10 -8) = 0
         1: (bf) r2 = r10
         2: (07) r2 += -8
         3: (18) r1 = 0xffff880ad081fa00
         5: (85) call bpf_map_lookup_elem#1
         6: (15) if r0 == 0x0 goto pc+7
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R10=fp
         7: (7a) *(u64 *)(r10 -16) = -8
         8: (79) r1 = *(u64 *)(r10 -16)
         9: (b7) r2 = 2
        10: (3d) if r2 >= r1 goto pc+3
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3
        R2=imm2,min_value=2,max_value=2,min_align=2 R10=fp
        11: (65) if r1 s> 0x4 goto pc+2
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0
        R1=inv,min_value=3,max_value=4 R2=imm2,min_value=2,max_value=2,min_align=2 R10=fp
        12: (0f) r0 += r1
        13: (72) *(u8 *)(r0 +0) = 0
        R0=map_value_adj(ks=8,vs=8,id=0),min_value=3,max_value=4
        R1=inv,min_value=3,max_value=4 R2=imm2,min_value=2,max_value=2,min_align=2 R10=fp
        14: (b7) r0 = 0
        15: (95) exit
      
      This leaves us with two options for fixing this: i) to invalidate
      all prior learned information once we switch signed context, ii)
      to track min/max signed and unsigned boundaries separately as
      done in [0]. (Given latter introduces major changes throughout
      the whole verifier, it's rather net-next material, thus this
      patch follows option i), meaning we can derive bounds either
      from only signed tests or only unsigned tests.) There is still the
      case of adjust_reg_min_max_vals(), where we adjust bounds on ALU
      operations, meaning programs like the following where boundaries
      on the reg get mixed in context later on when bounds are merged
      on the dst reg must get rejected, too:
      
         0: (7a) *(u64 *)(r10 -8) = 0
         1: (bf) r2 = r10
         2: (07) r2 += -8
         3: (18) r1 = 0xffff89b2bf87ce00
         5: (85) call bpf_map_lookup_elem#1
         6: (15) if r0 == 0x0 goto pc+6
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R10=fp
         7: (7a) *(u64 *)(r10 -16) = -8
         8: (79) r1 = *(u64 *)(r10 -16)
         9: (b7) r2 = 2
        10: (3d) if r2 >= r1 goto pc+2
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3
        R2=imm2,min_value=2,max_value=2,min_align=2 R10=fp
        11: (b7) r7 = 1
        12: (65) if r7 s> 0x0 goto pc+2
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3
        R2=imm2,min_value=2,max_value=2,min_align=2 R7=imm1,max_value=0 R10=fp
        13: (b7) r0 = 0
        14: (95) exit
      
        from 12 to 15: R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0
        R1=inv,min_value=3 R2=imm2,min_value=2,max_value=2,min_align=2 R7=imm1,min_value=1 R10=fp
        15: (0f) r7 += r1
        16: (65) if r7 s> 0x4 goto pc+2
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3
        R2=imm2,min_value=2,max_value=2,min_align=2 R7=inv,min_value=4,max_value=4 R10=fp
        17: (0f) r0 += r7
        18: (72) *(u8 *)(r0 +0) = 0
        R0=map_value_adj(ks=8,vs=8,id=0),min_value=4,max_value=4 R1=inv,min_value=3
        R2=imm2,min_value=2,max_value=2,min_align=2 R7=inv,min_value=4,max_value=4 R10=fp
        19: (b7) r0 = 0
        20: (95) exit
      
      Meaning, in adjust_reg_min_max_vals() we must also reset range
      values on the dst when src/dst registers have mixed signed/
      unsigned derived min/max value bounds with one unbounded value
      as otherwise they can be added together deducing false boundaries.
      Once both boundaries are established from either ALU ops or
      compare operations w/o mixing signed/unsigned insns, then they
      can safely be added to other regs also having both boundaries
      established. Adding regs with one unbounded side to a map value
      where the bounded side has been learned w/o mixing ops is
      possible, but the resulting map value won't recover from that,
      meaning such op is considered invalid on the time of actual
      access. Invalid bounds are set on the dst reg in case i) src reg,
      or ii) in case dst reg already had them. The only way to recover
      would be to perform i) ALU ops but only 'add' is allowed on map
      value types or ii) comparisons, but these are disallowed on
      pointers in case they span a range. This is fine as only BPF_JEQ
      and BPF_JNE may be performed on PTR_TO_MAP_VALUE_OR_NULL registers
      which potentially turn them into PTR_TO_MAP_VALUE type depending
      on the branch, so only here min/max value cannot be invalidated
      for them.
      
      In terms of state pruning, value_from_signed is considered
      as well in states_equal() when dealing with adjusted map values.
      With regards to breaking existing programs, there is a small
      risk, but use-cases are rather quite narrow where this could
      occur and mixing compares probably unlikely.
      
      Joint work with Josef and Edward.
      
        [0] https://lists.iovisor.org/pipermail/iovisor-dev/2017-June/000822.html
      
      Fixes: 48461135 ("bpf: allow access into map value arrays")
      Reported-by: NEdward Cree <ecree@solarflare.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NEdward Cree <ecree@solarflare.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4cabc5b1
  12. 20 7月, 2017 3 次提交
    • C
      trace: fix the errors caused by incompatible type of RCU variables · f86f4180
      Chunyan Zhang 提交于
      The variables which are processed by RCU functions should be annotated
      as RCU, otherwise sparse will report the errors like below:
      
      "error: incompatible types in comparison expression (different
      address spaces)"
      
      Link: http://lkml.kernel.org/r/1496823171-7758-1-git-send-email-zhang.chunyan@linaro.orgSigned-off-by: NChunyan Zhang <zhang.chunyan@linaro.org>
      [ Updated to not be 100% 80 column strict ]
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      f86f4180
    • C
      tracing: Fix kmemleak in instance_rmdir · db9108e0
      Chunyu Hu 提交于
      Hit the kmemleak when executing instance_rmdir, it forgot releasing
      mem of tracing_cpumask. With this fix, the warn does not appear any
      more.
      
      unreferenced object 0xffff93a8dfaa7c18 (size 8):
        comm "mkdir", pid 1436, jiffies 4294763622 (age 9134.308s)
        hex dump (first 8 bytes):
          ff ff ff ff ff ff ff ff                          ........
        backtrace:
          [<ffffffff88b6567a>] kmemleak_alloc+0x4a/0xa0
          [<ffffffff8861ea41>] __kmalloc_node+0xf1/0x280
          [<ffffffff88b505d3>] alloc_cpumask_var_node+0x23/0x30
          [<ffffffff88b5060e>] alloc_cpumask_var+0xe/0x10
          [<ffffffff88571ab0>] instance_mkdir+0x90/0x240
          [<ffffffff886e5100>] tracefs_syscall_mkdir+0x40/0x70
          [<ffffffff886565c9>] vfs_mkdir+0x109/0x1b0
          [<ffffffff8865b1d0>] SyS_mkdir+0xd0/0x100
          [<ffffffff88403857>] do_syscall_64+0x67/0x150
          [<ffffffff88b710e7>] return_from_SYSCALL_64+0x0/0x6a
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      Link: http://lkml.kernel.org/r/1500546969-12594-1-git-send-email-chuhu@redhat.com
      
      Cc: stable@vger.kernel.org
      Fixes: ccfe9e42 ("tracing: Make tracing_cpumask available for all instances")
      Signed-off-by: NChunyu Hu <chuhu@redhat.com>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      db9108e0
    • A
      perf/core: Fix scheduling regression of pinned groups · 3bda69c1
      Alexander Shishkin 提交于
      Vince Weaver reported:
      
      > I was tracking down some regressions in my perf_event_test testsuite.
      > Some of the tests broke in the 4.11-rc1 timeframe.
      >
      > I've bisected one of them, this report is about
      >	tests/overflow/simul_oneshot_group_overflow
      > This test creates an event group containing two sampling events, set
      > to overflow to a signal handler (which disables and then refreshes the
      > event).
      >
      > On a good kernel you get the following:
      > 	Event perf::instructions with period 1000000
      > 	Event perf::instructions with period 2000000
      > 		fd 3 overflows: 946 (perf::instructions/1000000)
      > 		fd 4 overflows: 473 (perf::instructions/2000000)
      > 	Ending counts:
      > 		Count 0: 946379875
      > 		Count 1: 946365218
      >
      > With the broken kernels you get:
      > 	Event perf::instructions with period 1000000
      > 	Event perf::instructions with period 2000000
      > 		fd 3 overflows: 938 (perf::instructions/1000000)
      > 		fd 4 overflows: 318 (perf::instructions/2000000)
      > 	Ending counts:
      > 		Count 0: 946373080
      > 		Count 1: 653373058
      
      The root cause of the bug is that the following commit:
      
        487f05e1 ("perf/core: Optimize event rescheduling on active contexts")
      
      erronously assumed that event's 'pinned' setting determines whether the
      event belongs to a pinned group or not, but in fact, it's the group
      leader's pinned state that matters.
      
      This was discovered by Vince in the test case described above, where two instruction
      counters are grouped, the group leader is pinned, but the other event is not;
      in the regressed case the counters were off by 33% (the difference between events'
      periods), but should be the same within the error margin.
      
      Fix the problem by looking at the group leader's pinning.
      Reported-by: NVince Weaver <vincent.weaver@maine.edu>
      Tested-by: NVince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Fixes: 487f05e1 ("perf/core: Optimize event rescheduling on active contexts")
      Link: http://lkml.kernel.org/r/87lgnmvw7h.fsf@ashishki-desk.ger.corp.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      3bda69c1
  13. 19 7月, 2017 4 次提交
    • T
      workqueue: restore WQ_UNBOUND/max_active==1 to be ordered · 5c0338c6
      Tejun Heo 提交于
      The combination of WQ_UNBOUND and max_active == 1 used to imply
      ordered execution.  After NUMA affinity 4c16bd32 ("workqueue:
      implement NUMA affinity for unbound workqueues"), this is no longer
      true due to per-node worker pools.
      
      While the right way to create an ordered workqueue is
      alloc_ordered_workqueue(), the documentation has been misleading for a
      long time and people do use WQ_UNBOUND and max_active == 1 for ordered
      workqueues which can lead to subtle bugs which are very difficult to
      trigger.
      
      It's unlikely that we'd see noticeable performance impact by enforcing
      ordering on WQ_UNBOUND / max_active == 1 workqueues.  Let's
      automatically set __WQ_ORDERED for those workqueues.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NChristoph Hellwig <hch@infradead.org>
      Reported-by: NAlexei Potashnik <alexei@purestorage.com>
      Fixes: 4c16bd32 ("workqueue: implement NUMA affinity for unbound workqueues")
      Cc: stable@vger.kernel.org # v3.10+
      5c0338c6
    • S
      audit: fix memleak in auditd_send_unicast_skb. · b0659ae5
      Shu Wang 提交于
      Found this issue by kmemleak report, auditd_send_unicast_skb
      did not free skb if rcu_dereference(auditd_conn) returns null.
      
      unreferenced object 0xffff88082568ce00 (size 256):
      comm "auditd", pid 1119, jiffies 4294708499
      backtrace:
      [<ffffffff8176166a>] kmemleak_alloc+0x4a/0xa0
      [<ffffffff8121820c>] kmem_cache_alloc_node+0xcc/0x210
      [<ffffffff8161b99d>] __alloc_skb+0x5d/0x290
      [<ffffffff8113c614>] audit_make_reply+0x54/0xd0
      [<ffffffff8113dfa7>] audit_receive_msg+0x967/0xd70
      ----------------
      (gdb) list *audit_receive_msg+0x967
      0xffffffff8113dff7 is in audit_receive_msg (kernel/audit.c:1133).
      1132    skb = audit_make_reply(0, AUDIT_REPLACE, 0,
                                      0, &pvnr, sizeof(pvnr));
      ---------------
      [<ffffffff8113e402>] audit_receive+0x52/0xa0
      [<ffffffff8166c561>] netlink_unicast+0x181/0x240
      [<ffffffff8166c8e2>] netlink_sendmsg+0x2c2/0x3b0
      [<ffffffff816112e8>] sock_sendmsg+0x38/0x50
      [<ffffffff816117a2>] SYSC_sendto+0x102/0x190
      [<ffffffff81612f4e>] SyS_sendto+0xe/0x10
      [<ffffffff8176d337>] entry_SYSCALL_64_fastpath+0x1a/0xa5
      [<ffffffffffffffff>] 0xffffffffffffffff
      Signed-off-by: NShu Wang <shuwang@redhat.com>
      Signed-off-by: NPaul Moore <paul@paul-moore.com>
      b0659ae5
    • J
      tracing/ring_buffer: Try harder to allocate · 84861885
      Joel Fernandes 提交于
      ftrace can fail to allocate per-CPU ring buffer on systems with a large
      number of CPUs coupled while large amounts of cache happening in the
      page cache. Currently the ring buffer allocation doesn't retry in the VM
      implementation even if direct-reclaim made some progress but still
      wasn't able to find a free page. On retrying I see that the allocations
      almost always succeed. The retry doesn't happen because __GFP_NORETRY is
      used in the tracer to prevent the case where we might OOM, however if we
      drop __GFP_NORETRY, we risk destabilizing the system if OOM killer is
      triggered. To prevent this situation, use the __GFP_RETRY_MAYFAIL flag
      introduced recently [1].
      
      Tested the following still succeeds without destabilizing a system with
      1GB memory.
      echo 300000 > /sys/kernel/debug/tracing/buffer_size_kb
      
      [1] https://marc.info/?l=linux-mm&m=149820805124906&w=2
      
      Link: http://lkml.kernel.org/r/20170713021416.8897-1-joelaf@google.com
      
      Cc: Tim Murray <timmurray@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@kernel.org>
      Signed-off-by: NJoel Fernandes <joelaf@google.com>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      84861885
    • T
      cgroup: create dfl_root files on subsys registration · 7af608e4
      Tejun Heo 提交于
      On subsystem registration, css_populate_dir() is not called on the new
      root css, so the interface files for the subsystem on cgrp_dfl_root
      aren't created on registration.  This is a residue from the days when
      cgrp_dfl_root was used only as the parking spot for unused subsystems,
      which no longer is true as it's used as the root for cgroup2.
      
      This is often fine as later operations tend to create them as a part
      of mount (cgroup1) or subtree_control operations (cgroup2); however,
      it's not difficult to mount cgroup2 with the controller interface
      files missing as Waiman found out.
      
      Fix it by invoking css_populate_dir() on the root css on subsys
      registration.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-and-tested-by: NWaiman Long <longman@redhat.com>
      Cc: stable@vger.kernel.org # v4.5+
      Signed-off-by: NTejun Heo <tj@kernel.org>
      7af608e4
  14. 18 7月, 2017 1 次提交
  15. 15 7月, 2017 2 次提交
  16. 14 7月, 2017 2 次提交
  17. 13 7月, 2017 6 次提交