  1. 17 Aug, 2017 1 commit
    • membarrier: Provide expedited private command · 22e4ebb9
      Authored by Mathieu Desnoyers
      Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs using cpumask built
      from all runqueues for which current thread's mm is the same as the
      thread calling sys_membarrier. It executes faster than the non-expedited
      variant (no blocking). It also works on NOHZ_FULL configurations.
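From user space, the new command might be issued roughly as in the sketch below. It assumes a libc without a `membarrier()` wrapper, so it goes through `syscall(2)`; the helper names `membarrier_call` and `private_expedited_barrier` are illustrative and not part of the patch. Kernels released after this commit additionally expect a one-time `MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED`, which the sketch issues only when `MEMBARRIER_CMD_QUERY` advertises it.

```c
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper around the raw system call (illustrative helper). */
static int membarrier_call(int cmd, int flags)
{
    return (int)syscall(__NR_membarrier, cmd, flags);
}

/*
 * Issue a private expedited barrier for the calling process.
 * Returns 0 on success, -1 if the command is unavailable or fails.
 */
int private_expedited_barrier(void)
{
    /* MEMBARRIER_CMD_QUERY returns a bitmask of supported commands. */
    int mask = membarrier_call(MEMBARRIER_CMD_QUERY, 0);

    if (mask < 0 || !(mask & MEMBARRIER_CMD_PRIVATE_EXPEDITED))
        return -1;

    /* Later kernels require registering intent once per process. */
    if ((mask & MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED) &&
        membarrier_call(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0) < 0)
        return -1;

    return membarrier_call(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) == 0 ? 0 : -1;
}
```

A caller would typically query once at startup and fall back to a different synchronization scheme when the function returns -1.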
      
      Scheduler-wise, it requires a memory barrier before and after context
      switching between processes (which have different mm). The memory
      barrier before context switch is already present. For the barrier after
      context switch:
      
      * Our TSO archs can do RELEASE without being a full barrier. Look at
        x86 spin_unlock() being a regular STORE for example.  But for those
        archs, all atomics imply smp_mb and all of them have atomic ops in
        switch_mm() for mm_cpumask(), and on x86 the CR3 load acts as a full
        barrier.
      
      * Of all weakly ordered machines, only ARM64 and PPC can do RELEASE;
        the rest do indeed do smp_mb(), so there the spin_unlock() is a full
        barrier and we're good.
      
      * ARM64 has a very heavy barrier in switch_to(), which suffices.
      
      * PPC just removed its barrier from switch_to(), but appears to be
        talking about adding something to switch_mm(). So add a
        smp_mb__after_unlock_lock() for now, until this is settled on the PPC
        side.
      
      Changes since v3:
      - Properly document the memory barriers provided by each architecture.
      
      Changes since v2:
      - Address comments from Peter Zijlstra,
      - Add smp_mb__after_unlock_lock() after finish_lock_switch() in
        finish_task_switch() to add the memory barrier we need after storing
        to rq->curr. This is much simpler than the previous approach relying
        on atomic_dec_and_test() in mmdrop(), which actually added a memory
        barrier in the common case of switching between userspace processes.
      - Return -EINVAL when MEMBARRIER_CMD_SHARED is used on a nohz_full
        kernel, rather than having the whole membarrier system call returning
        -ENOSYS. Indeed, CMD_PRIVATE_EXPEDITED is compatible with nohz_full.
        Adapt the CMD_QUERY mask accordingly.
      
      Changes since v1:
      - move membarrier code under kernel/sched/ because it uses the
        scheduler runqueue,
      - only add the barrier when we switch from a kernel thread. The case
        where we switch from a user-space thread is already handled by
        the atomic_dec_and_test() in mmdrop().
      - add a comment to mmdrop() documenting the requirement on the implicit
        memory barrier.
      
      CC: Peter Zijlstra <peterz@infradead.org>
      CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      CC: Boqun Feng <boqun.feng@gmail.com>
      CC: Andrew Hunter <ahh@google.com>
      CC: Maged Michael <maged.michael@gmail.com>
      CC: gromer@google.com
      CC: Avi Kivity <avi@scylladb.com>
      CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      CC: Paul Mackerras <paulus@samba.org>
      CC: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Dave Watson <davejwatson@fb.com>
  2. 29 Jul, 2017 1 commit
    • sched: Allow migrating kthreads into online but inactive CPUs · 955dbdf4
      Authored by Tejun Heo
      Per-cpu workqueues have been tripping CPU affinity sanity checks while
      a CPU is being offlined.  A per-cpu kworker ends up running on a CPU
      which isn't its target CPU while the CPU is online but inactive.
      
      While the scheduler allows kthreads to wake up on an online but
      inactive CPU, it doesn't allow a running kthread to be migrated to
      such a CPU, which leads to an odd situation where setting affinity on
      a sleeping and running kthread leads to different results.
      
      Each mem-reclaim workqueue has one rescuer which guarantees forward
      progress and the rescuer needs to bind itself to the CPU which needs
      help in making forward progress; however, due to the above issue,
      while set_cpus_allowed_ptr() succeeds, the rescuer doesn't end up on
      the correct CPU if the CPU is in the process of going offline,
      tripping the sanity check and executing the work item on the wrong
      CPU.
      
      This patch updates __migrate_task() so that kthreads can be migrated
      into an inactive but online CPU.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Reported-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  3. 21 Jul, 2017 2 commits
    • perf/core: Fix locking for children siblings group read · 2aeb1883
      Authored by Jiri Olsa
      We're missing the ctx lock when iterating over children's siblings
      within the perf_read path for group reading. The following
      race and crash can happen:
      
      User space doing read syscall on event group leader:
      
      T1:
        perf_read
          lock event->ctx->mutex
          perf_read_group
            lock leader->child_mutex
            __perf_read_group_add(child)
              list_for_each_entry(sub, &leader->sibling_list, group_entry)
      
      ---->   sub might be invalid at this point, because it could
              get removed via perf_event_exit_task_context in T2
      
      Child exiting and cleaning up its events:
      
      T2:
        perf_event_exit_task_context
          lock ctx->mutex
          list_for_each_entry_safe(child_event, next, &child_ctx->event_list,...
            perf_event_exit_event(child)
              lock ctx->lock
              perf_group_detach(child)
              unlock ctx->lock
      
      ---->   child is removed from sibling_list without any sync
              with T1 path above
      
              ...
              free_event(child)
      
      Before the child is removed from the leader's child_list,
      (and thus is omitted from perf_read_group processing), we
      need to ensure that perf_read_group touches child's
      siblings under its ctx->lock.
      
      Peter further notes:
      
      | One additional note; this bug got exposed by commit:
      |
      |   ba5213ae ("perf/core: Correct event creation with PERF_FORMAT_GROUP")
      |
      | which made it possible to actually trigger this code-path.
      Tested-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: ba5213ae ("perf/core: Correct event creation with PERF_FORMAT_GROUP")
      Link: http://lkml.kernel.org/r/20170720141455.2106-1-jolsa@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • bpf: fix mixed signed/unsigned derived min/max value bounds · 4cabc5b1
      Authored by Daniel Borkmann
      Edward reported that there's an issue in min/max value bounds
      tracking when signed and unsigned compares both provide hints
      on limits when having unknown variables. E.g. a program such
      as the following should have been rejected:
      
         0: (7a) *(u64 *)(r10 -8) = 0
         1: (bf) r2 = r10
         2: (07) r2 += -8
         3: (18) r1 = 0xffff8a94cda93400
         5: (85) call bpf_map_lookup_elem#1
         6: (15) if r0 == 0x0 goto pc+7
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R10=fp
         7: (7a) *(u64 *)(r10 -16) = -8
         8: (79) r1 = *(u64 *)(r10 -16)
         9: (b7) r2 = -1
        10: (2d) if r1 > r2 goto pc+3
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=0
        R2=imm-1,max_value=18446744073709551615,min_align=1 R10=fp
        11: (65) if r1 s> 0x1 goto pc+2
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=0,max_value=1
        R2=imm-1,max_value=18446744073709551615,min_align=1 R10=fp
        12: (0f) r0 += r1
        13: (72) *(u8 *)(r0 +0) = 0
        R0=map_value_adj(ks=8,vs=8,id=0),min_value=0,max_value=1 R1=inv,min_value=0,max_value=1
        R2=imm-1,max_value=18446744073709551615,min_align=1 R10=fp
        14: (b7) r0 = 0
        15: (95) exit
      
      What happens is that in the first part ...
      
         8: (79) r1 = *(u64 *)(r10 -16)
         9: (b7) r2 = -1
        10: (2d) if r1 > r2 goto pc+3
      
      ... r1 carries an unsigned value, and is compared as unsigned
      against a register carrying an immediate. Verifier deduces in
      reg_set_min_max() that since the compare is unsigned and operation
      is greater than (>), that in the fall-through/false case, r1's
      minimum bound must be 0 and maximum bound must be r2. Latter is
      larger than the bound and thus max value is reset back to being
      'invalid' aka BPF_REGISTER_MAX_RANGE. Thus, r1 state is now
      'R1=inv,min_value=0'. The subsequent test ...
      
        11: (65) if r1 s> 0x1 goto pc+2
      
      ... is a signed compare of r1 with immediate value 1. Here,
      verifier deduces in reg_set_min_max() that since the compare
      is signed this time and operation is greater than (>), that
      in the fall-through/false case, we can deduce that r1's maximum
      bound must be 1, meaning with prior test, we result in r1 having
      the following state: R1=inv,min_value=0,max_value=1. Given that
      the actual value this holds is -8, the bounds are wrongly deduced.
      When this is being added to r0 which holds the map_value(_adj)
      type, then subsequent store access in above case will go through
      check_mem_access() which invokes check_map_access_adj(), that
      will then probe whether the map memory is in bounds based
      on the min_value and max_value as well as access size since
      the actual unknown value is min_value <= x <= max_value; commit
      fce366a9 ("bpf, verifier: fix alu ops against map_value{,
      _adj} register types") provides some more explanation on the
      semantics.
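The mismatch can be reproduced outside the verifier with plain integer arithmetic. The sketch below only mirrors the two compares of the first program (insns 10 and 11) for a concrete `r1`; the function names are illustrative and nothing here is verifier code. The casts to `int64_t` rely on the usual two's-complement conversion that GCC and Clang provide.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when both branches fall through, i.e. execution reaches insn 12. */
bool reaches_insn_12(uint64_t r1)
{
    uint64_t r2 = (uint64_t)-1;     /* insn 9: r2 = -1, i.e. UINT64_MAX */

    if (r1 > r2)                    /* insn 10: unsigned compare, never true */
        return false;
    if ((int64_t)r1 > 1)            /* insn 11: signed compare */
        return false;
    return true;
}

/* The range the pre-fix verifier derived for r1 at insn 12: [0, 1]. */
bool in_derived_bounds(uint64_t r1)
{
    return (int64_t)r1 >= 0 && (int64_t)r1 <= 1;
}
```

With `r1 = (uint64_t)-8`, `reaches_insn_12()` is true while `in_derived_bounds()` is false: the value passes both tests yet lies outside the bounds learned from them.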
      
      It's worth noting in this context that in the current code,
      min_value and max_value tracking are used for two things, i)
      dynamic map value access via check_map_access_adj() and since
      commit 06c1c049 ("bpf: allow helpers access to variable memory")
      ii) also enforced at check_helper_mem_access() when passing a
      memory address (pointer to packet, map value, stack) and length
      pair to a helper and the length in this case is an unknown value
      defining an access range through min_value/max_value in that
      case. The min_value/max_value tracking is /not/ used in the
      direct packet access case to track ranges. However, the issue
      also affects case ii), for example, the following crafted program
      based on the same principle must be rejected as well:
      
         0: (b7) r2 = 0
         1: (bf) r3 = r10
         2: (07) r3 += -512
         3: (7a) *(u64 *)(r10 -16) = -8
         4: (79) r4 = *(u64 *)(r10 -16)
         5: (b7) r6 = -1
         6: (2d) if r4 > r6 goto pc+5
        R1=ctx R2=imm0,min_value=0,max_value=0,min_align=2147483648 R3=fp-512
        R4=inv,min_value=0 R6=imm-1,max_value=18446744073709551615,min_align=1 R10=fp
         7: (65) if r4 s> 0x1 goto pc+4
        R1=ctx R2=imm0,min_value=0,max_value=0,min_align=2147483648 R3=fp-512
        R4=inv,min_value=0,max_value=1 R6=imm-1,max_value=18446744073709551615,min_align=1
        R10=fp
         8: (07) r4 += 1
         9: (b7) r5 = 0
        10: (6a) *(u16 *)(r10 -512) = 0
        11: (85) call bpf_skb_load_bytes#26
        12: (b7) r0 = 0
        13: (95) exit
      
      Meaning, while we initialize the max_value stack slot that the
      verifier thinks we access in the [1,2] range, in reality we
      pass -7 as length which is interpreted as u32 in the helper.
      Thus, this issue is relevant also for the case of helper ranges.
      Resetting both bounds in check_reg_overflow() in case only one
      of them exceeds limits is also not enough as similar test can be
      created that uses values which are within range, thus also here
      learned min value in r1 is incorrect when mixed with later signed
      test to create a range:
      
         0: (7a) *(u64 *)(r10 -8) = 0
         1: (bf) r2 = r10
         2: (07) r2 += -8
         3: (18) r1 = 0xffff880ad081fa00
         5: (85) call bpf_map_lookup_elem#1
         6: (15) if r0 == 0x0 goto pc+7
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R10=fp
         7: (7a) *(u64 *)(r10 -16) = -8
         8: (79) r1 = *(u64 *)(r10 -16)
         9: (b7) r2 = 2
        10: (3d) if r2 >= r1 goto pc+3
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3
        R2=imm2,min_value=2,max_value=2,min_align=2 R10=fp
        11: (65) if r1 s> 0x4 goto pc+2
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0
        R1=inv,min_value=3,max_value=4 R2=imm2,min_value=2,max_value=2,min_align=2 R10=fp
        12: (0f) r0 += r1
        13: (72) *(u8 *)(r0 +0) = 0
        R0=map_value_adj(ks=8,vs=8,id=0),min_value=3,max_value=4
        R1=inv,min_value=3,max_value=4 R2=imm2,min_value=2,max_value=2,min_align=2 R10=fp
        14: (b7) r0 = 0
        15: (95) exit
      
      This leaves us with two options for fixing this: i) to invalidate
      all prior learned information once we switch signed context, ii)
      to track min/max signed and unsigned boundaries separately as
      done in [0]. (Given latter introduces major changes throughout
      the whole verifier, it's rather net-next material, thus this
      patch follows option i), meaning we can derive bounds either
      from only signed tests or only unsigned tests.) There is still the
      case of adjust_reg_min_max_vals(), where we adjust bounds on ALU
      operations, meaning programs like the following where boundaries
      on the reg get mixed in context later on when bounds are merged
      on the dst reg must get rejected, too:
      
         0: (7a) *(u64 *)(r10 -8) = 0
         1: (bf) r2 = r10
         2: (07) r2 += -8
         3: (18) r1 = 0xffff89b2bf87ce00
         5: (85) call bpf_map_lookup_elem#1
         6: (15) if r0 == 0x0 goto pc+6
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R10=fp
         7: (7a) *(u64 *)(r10 -16) = -8
         8: (79) r1 = *(u64 *)(r10 -16)
         9: (b7) r2 = 2
        10: (3d) if r2 >= r1 goto pc+2
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3
        R2=imm2,min_value=2,max_value=2,min_align=2 R10=fp
        11: (b7) r7 = 1
        12: (65) if r7 s> 0x0 goto pc+2
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3
        R2=imm2,min_value=2,max_value=2,min_align=2 R7=imm1,max_value=0 R10=fp
        13: (b7) r0 = 0
        14: (95) exit
      
        from 12 to 15: R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0
        R1=inv,min_value=3 R2=imm2,min_value=2,max_value=2,min_align=2 R7=imm1,min_value=1 R10=fp
        15: (0f) r7 += r1
        16: (65) if r7 s> 0x4 goto pc+2
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3
        R2=imm2,min_value=2,max_value=2,min_align=2 R7=inv,min_value=4,max_value=4 R10=fp
        17: (0f) r0 += r7
        18: (72) *(u8 *)(r0 +0) = 0
        R0=map_value_adj(ks=8,vs=8,id=0),min_value=4,max_value=4 R1=inv,min_value=3
        R2=imm2,min_value=2,max_value=2,min_align=2 R7=inv,min_value=4,max_value=4 R10=fp
        19: (b7) r0 = 0
        20: (95) exit
      
      Meaning, in adjust_reg_min_max_vals() we must also reset range
      values on the dst when src/dst registers have mixed signed/
      unsigned derived min/max value bounds with one unbounded value
      as otherwise they can be added together deducing false boundaries.
      Once both boundaries are established from either ALU ops or
      compare operations w/o mixing signed/unsigned insns, then they
      can safely be added to other regs also having both boundaries
      established. Adding regs with one unbounded side to a map value
      where the bounded side has been learned w/o mixing ops is
      possible, but the resulting map value won't recover from that,
      meaning such op is considered invalid on the time of actual
      access. Invalid bounds are set on the dst reg in case i) src reg,
      or ii) in case dst reg already had them. The only way to recover
      would be to perform i) ALU ops but only 'add' is allowed on map
      value types or ii) comparisons, but these are disallowed on
      pointers in case they span a range. This is fine as only BPF_JEQ
      and BPF_JNE may be performed on PTR_TO_MAP_VALUE_OR_NULL registers
      which potentially turn them into PTR_TO_MAP_VALUE type depending
      on the branch, so only here min/max value cannot be invalidated
      for them.
      
      In terms of state pruning, value_from_signed is considered
      as well in states_equal() when dealing with adjusted map values.
      With regards to breaking existing programs, there is a small
      risk, but use-cases are rather quite narrow where this could
      occur and mixing compares probably unlikely.
      
      Joint work with Josef and Edward.
      
        [0] https://lists.iovisor.org/pipermail/iovisor-dev/2017-June/000822.html
      
      Fixes: 48461135 ("bpf: allow access into map value arrays")
      Reported-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 20 Jul, 2017 3 commits
    • trace: fix the errors caused by incompatible type of RCU variables · f86f4180
      Authored by Chunyan Zhang
      Variables that are processed by RCU functions should be annotated
      as RCU; otherwise sparse reports errors like the one below:
      
      "error: incompatible types in comparison expression (different
      address spaces)"
      
      Link: http://lkml.kernel.org/r/1496823171-7758-1-git-send-email-zhang.chunyan@linaro.org
      Signed-off-by: Chunyan Zhang <zhang.chunyan@linaro.org>
      [ Updated to not be 100% 80 column strict ]
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • tracing: Fix kmemleak in instance_rmdir · db9108e0
      Authored by Chunyu Hu
      kmemleak reported a leak when executing instance_rmdir(): it forgot
      to release the memory of tracing_cpumask. With this fix, the warning
      no longer appears.
      
      unreferenced object 0xffff93a8dfaa7c18 (size 8):
        comm "mkdir", pid 1436, jiffies 4294763622 (age 9134.308s)
        hex dump (first 8 bytes):
          ff ff ff ff ff ff ff ff                          ........
        backtrace:
          [<ffffffff88b6567a>] kmemleak_alloc+0x4a/0xa0
          [<ffffffff8861ea41>] __kmalloc_node+0xf1/0x280
          [<ffffffff88b505d3>] alloc_cpumask_var_node+0x23/0x30
          [<ffffffff88b5060e>] alloc_cpumask_var+0xe/0x10
          [<ffffffff88571ab0>] instance_mkdir+0x90/0x240
          [<ffffffff886e5100>] tracefs_syscall_mkdir+0x40/0x70
          [<ffffffff886565c9>] vfs_mkdir+0x109/0x1b0
          [<ffffffff8865b1d0>] SyS_mkdir+0xd0/0x100
          [<ffffffff88403857>] do_syscall_64+0x67/0x150
          [<ffffffff88b710e7>] return_from_SYSCALL_64+0x0/0x6a
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      Link: http://lkml.kernel.org/r/1500546969-12594-1-git-send-email-chuhu@redhat.com
      
      Cc: stable@vger.kernel.org
      Fixes: ccfe9e42 ("tracing: Make tracing_cpumask available for all instances")
      Signed-off-by: Chunyu Hu <chuhu@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • perf/core: Fix scheduling regression of pinned groups · 3bda69c1
      Authored by Alexander Shishkin
      Vince Weaver reported:
      
      > I was tracking down some regressions in my perf_event_test testsuite.
      > Some of the tests broke in the 4.11-rc1 timeframe.
      >
      > I've bisected one of them, this report is about
      >	tests/overflow/simul_oneshot_group_overflow
      > This test creates an event group containing two sampling events, set
      > to overflow to a signal handler (which disables and then refreshes the
      > event).
      >
      > On a good kernel you get the following:
      > 	Event perf::instructions with period 1000000
      > 	Event perf::instructions with period 2000000
      > 		fd 3 overflows: 946 (perf::instructions/1000000)
      > 		fd 4 overflows: 473 (perf::instructions/2000000)
      > 	Ending counts:
      > 		Count 0: 946379875
      > 		Count 1: 946365218
      >
      > With the broken kernels you get:
      > 	Event perf::instructions with period 1000000
      > 	Event perf::instructions with period 2000000
      > 		fd 3 overflows: 938 (perf::instructions/1000000)
      > 		fd 4 overflows: 318 (perf::instructions/2000000)
      > 	Ending counts:
      > 		Count 0: 946373080
      > 		Count 1: 653373058
      
      The root cause of the bug is that the following commit:
      
        487f05e1 ("perf/core: Optimize event rescheduling on active contexts")
      
      erroneously assumed that event's 'pinned' setting determines whether the
      event belongs to a pinned group or not, but in fact, it's the group
      leader's pinned state that matters.
      
      This was discovered by Vince in the test case described above, where two instruction
      counters are grouped, the group leader is pinned, but the other event is not;
      in the regressed case the counters were off by 33% (the difference between events'
      periods), but should be the same within the error margin.
      
      Fix the problem by looking at the group leader's pinning.
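The difference between the two checks can be shown with toy structures. This is a minimal sketch, not the kernel's code: `toy_event`, `toy_attr` and both function names are made up for illustration, and the only kernel convention assumed is that a group leader's `group_leader` pointer refers to itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures; only the fields used here. */
struct toy_attr {
    bool pinned;
};

struct toy_event {
    struct toy_attr attr;
    struct toy_event *group_leader; /* a leader points to itself */
};

/* Pre-fix logic: consults the event's own flag. */
bool pinned_by_own_flag(const struct toy_event *e)
{
    return e->attr.pinned;
}

/* Post-fix logic: the leader decides for the whole group. */
bool pinned_by_leader(const struct toy_event *e)
{
    assert(e->group_leader != NULL);
    return e->group_leader->attr.pinned;
}
```

For a pinned leader with an unpinned sibling, as in Vince's test, `pinned_by_own_flag()` classifies the sibling as not pinned while `pinned_by_leader()` correctly places it in the pinned group.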
      Reported-by: Vince Weaver <vincent.weaver@maine.edu>
      Tested-by: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Fixes: 487f05e1 ("perf/core: Optimize event rescheduling on active contexts")
      Link: http://lkml.kernel.org/r/87lgnmvw7h.fsf@ashishki-desk.ger.corp.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  5. 19 Jul, 2017 2 commits
    • audit: fix memleak in auditd_send_unicast_skb. · b0659ae5
      Authored by Shu Wang
      Found by a kmemleak report: auditd_send_unicast_skb() did not free
      the skb if rcu_dereference(auditd_conn) returned NULL.
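The ownership rule behind this kind of fix is that a function which consumes a buffer must release it on every exit path, including the early error return. The sketch below is a user-space toy under that assumption: `toy_conn`, `toy_buf_alloc`, `toy_buf_free`, `toy_send_unicast` and the `live_buffers` counter are all hypothetical names, and the error code is chosen for illustration only.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_conn {
    int fd;                 /* stand-in for the auditd connection */
};

int live_buffers;           /* counts buffers not yet freed */

void *toy_buf_alloc(size_t len)
{
    void *buf = malloc(len);

    if (buf)
        live_buffers++;
    return buf;
}

void toy_buf_free(void *buf)
{
    if (buf) {
        live_buffers--;
        free(buf);
    }
}

/* Takes ownership of buf on every path, including the error path. */
int toy_send_unicast(const struct toy_conn *conn, void *buf)
{
    if (!conn) {
        toy_buf_free(buf);  /* the release that the leaking path skipped */
        return -ECONNREFUSED;
    }
    /* A real sender would hand buf to the transport here. */
    toy_buf_free(buf);
    return 0;
}
```

After a call with a NULL connection, `live_buffers` drops back to zero, which is the property the kmemleak report showed was violated.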
      
      unreferenced object 0xffff88082568ce00 (size 256):
      comm "auditd", pid 1119, jiffies 4294708499
      backtrace:
      [<ffffffff8176166a>] kmemleak_alloc+0x4a/0xa0
      [<ffffffff8121820c>] kmem_cache_alloc_node+0xcc/0x210
      [<ffffffff8161b99d>] __alloc_skb+0x5d/0x290
      [<ffffffff8113c614>] audit_make_reply+0x54/0xd0
      [<ffffffff8113dfa7>] audit_receive_msg+0x967/0xd70
      ----------------
      (gdb) list *audit_receive_msg+0x967
      0xffffffff8113dff7 is in audit_receive_msg (kernel/audit.c:1133).
      1132    skb = audit_make_reply(0, AUDIT_REPLACE, 0,
                                      0, &pvnr, sizeof(pvnr));
      ---------------
      [<ffffffff8113e402>] audit_receive+0x52/0xa0
      [<ffffffff8166c561>] netlink_unicast+0x181/0x240
      [<ffffffff8166c8e2>] netlink_sendmsg+0x2c2/0x3b0
      [<ffffffff816112e8>] sock_sendmsg+0x38/0x50
      [<ffffffff816117a2>] SYSC_sendto+0x102/0x190
      [<ffffffff81612f4e>] SyS_sendto+0xe/0x10
      [<ffffffff8176d337>] entry_SYSCALL_64_fastpath+0x1a/0xa5
      [<ffffffffffffffff>] 0xffffffffffffffff
      Signed-off-by: Shu Wang <shuwang@redhat.com>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
    • tracing/ring_buffer: Try harder to allocate · 84861885
      Authored by Joel Fernandes
      ftrace can fail to allocate a per-CPU ring buffer on systems with a large
      number of CPUs when large amounts of memory are tied up in the
      page cache. Currently the ring buffer allocation doesn't retry in the VM
      implementation even if direct-reclaim made some progress but still
      wasn't able to find a free page. On retrying I see that the allocations
      almost always succeed. The retry doesn't happen because __GFP_NORETRY is
      used in the tracer to prevent the case where we might OOM, however if we
      drop __GFP_NORETRY, we risk destabilizing the system if OOM killer is
      triggered. To prevent this situation, use the __GFP_RETRY_MAYFAIL flag
      introduced recently [1].
      
      Tested the following still succeeds without destabilizing a system with
      1GB memory.
      echo 300000 > /sys/kernel/debug/tracing/buffer_size_kb
      
      [1] https://marc.info/?l=linux-mm&m=149820805124906&w=2
      
      Link: http://lkml.kernel.org/r/20170713021416.8897-1-joelaf@google.com
      
      Cc: Tim Murray <timmurray@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Joel Fernandes <joelaf@google.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
  6. 18 Jul, 2017 1 commit
  7. 15 Jul, 2017 2 commits
  8. 14 Jul, 2017 2 commits
  9. 13 Jul, 2017 18 commits
  10. 12 Jul, 2017 8 commits
    • ftrace: Fix uninitialized variable in match_records() · 2e028c4f
      Authored by Dan Carpenter
      My static checker complains that if "func" is NULL then "clear_filter"
      is uninitialized.  This seems like it could be true, although it's
      possible something subtle is happening that I haven't seen.
      
          kernel/trace/ftrace.c:3844 match_records()
          error: uninitialized symbol 'clear_filter'.
      
      Link: http://lkml.kernel.org/r/20170712073556.h6tkpjcdzjaozozs@mwanda
      
      Cc: stable@vger.kernel.org
      Fixes: f0a3b154 ("ftrace: Clarify code for mod command")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • ftrace: Remove an unneeded NULL check · 44925dff
      Authored by Dan Carpenter
      "func" can't be NULL and it doesn't make sense to check because we've
      already dereferenced it.
      
      Link: http://lkml.kernel.org/r/20170712073340.4enzeojeoupuds5a@mwanda
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • cpufreq: schedutil: Fix sugov_start() versus sugov_update_shared() race · ab2f7cf1
      Authored by Vikram Mulukutla
      With a shared policy in place, when one of the CPUs in the policy is
      hotplugged out and then brought back online, sugov_stop() and
      sugov_start() are called in order.
      
      sugov_stop() removes utilization hooks for each CPU in the policy and
      does nothing else in the for_each_cpu() loop. sugov_start() on the
      other hand iterates through the CPUs in the policy and re-initializes
      the per-cpu structure _and_ adds the utilization hook.  This implies
      that the scheduler is allowed to invoke a CPU's utilization update
      hook when the rest of the per-cpu structures have yet to be
      re-inited.
      
      Apart from some strange values in tracepoints this doesn't cause a
      problem, but if we do end up accessing a pointer from the per-cpu
      sugov_cpu structure somewhere in the sugov_update_shared() path,
      we will likely see crashes since the memset for another CPU in the
      policy is free to race with sugov_update_shared from the CPU that is
      ready to go.  So let's fix this now to first init all per-cpu
      structures, and then add the per-cpu utilization update hooks all at
      once.
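The two-pass ordering described above can be sketched in isolation. This is a simplified user-space model, not the governor's code: `toy_cpu`, `toy_cpus`, `TOY_NR_CPUS` and `toy_policy_start` are hypothetical names, and a plain flag stands in for installing the utilization update hook.

```c
#include <string.h>

#define TOY_NR_CPUS 4

struct toy_cpu {
    int policy_id;          /* per-CPU state the hook will read */
    int hook_installed;     /* nonzero once updates may arrive */
};

struct toy_cpu toy_cpus[TOY_NR_CPUS];

void toy_policy_start(int policy_id)
{
    int cpu;

    /* Pass 1: (re)initialize every CPU's state; no hook is live yet. */
    for (cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
        memset(&toy_cpus[cpu], 0, sizeof(toy_cpus[cpu]));
        toy_cpus[cpu].policy_id = policy_id;
    }

    /* Pass 2: only now expose the CPUs to update callbacks. */
    for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
        toy_cpus[cpu].hook_installed = 1;
}
```

With a single loop that both clears and enables each CPU in turn, a callback on an already enabled CPU could observe a neighbour's structure in the middle of its `memset()`; splitting the loop removes that window.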
      Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • genirq: Keep chip buslock across irq_request/release_resources() · 19d39a38
      Authored by Thomas Gleixner
      Moving the irq_request/release_resources() callbacks out of the spinlocked,
      irq disabled and bus locked region, unearthed an interesting abuse of the
      irq_bus_lock/irq_bus_sync_unlock() callbacks.
      
      The OMAP GPIO driver merrily does power management inside of them. The
      irq_request_resources() callback of this GPIO irqchip calls a function
      which reads a GPIO register. That read aborts now because the clock of the
      GPIO block is not magically enabled via the irq_bus_lock() callback.
      
      Move the callbacks under the bus lock again to prevent this. In the
      free_irq() path this requires dropping the bus_lock before calling
      synchronize_irq() and reacquiring it before calling the
      irq_release_resources() callback.
      
      The bus lock can't be held because:
      
         1) The data which has been changed between bus_lock/un_lock is cached in
            the irq chip driver private data and needs to go out to the irq chip
            via the slow bus (usually SPI or I2C) before calling
            synchronize_irq().
      
            That's the reason why this bus_lock/unlock magic exists in the first
            place, as you cannot do SPI/I2C transactions while holding desc->lock
            with interrupts disabled.
      
         2) synchronize_irq() will actually deadlock, if there is a handler in
            flight. These chips use threaded handlers for obvious reasons, as
            they allow to do SPI/I2C communication. When the threaded handler
            returns then bus_lock needs to be taken in irq_finalize_oneshot() as
            we need to talk to the actual irq chip once more. After that the
            threaded handler is marked done, which makes synchronize_irq() return.
      
            So if we hold bus_lock across the synchronize_irq() call, the
            handler cannot mark itself done because it blocks on the bus
            lock. That in turn makes synchronize_irq() wait forever on the
            threaded handler to complete....
      
      Add the missing unlock of desc->request_mutex in the error path of
      __free_irq() and add a bunch of comments to explain the locking and
      protection rules.
      
      Fixes: 46e48e25 ("genirq: Move irq resource handling out of spinlocked region")
      Reported-and-tested-by: Sebastian Reichel <sebastian.reichel@collabora.co.uk>
      Reported-and-tested-by: Tony Lindgren <tony@atomide.com>
      Reported-by: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Not-longer-ranted-at-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Linus Walleij <linus.walleij@linaro.org>
      Cc: Grygorii Strashko <grygorii.strashko@ti.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      19d39a38
    • A
      ftrace: Hide cached module code for !CONFIG_MODULES · 69449bbd
      Arnd Bergmann committed
      When modules are disabled, we get a harmless build warning:
      
      kernel/trace/ftrace.c:4051:13: error: 'process_cached_mods' defined but not used [-Werror=unused-function]
      
      This adds the same #ifdef around the new code that exists around
      its caller.
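A minimal sketch of the pattern follows. The macro and function names are hypothetical stand-ins, not the actual ftrace code: a static helper whose only caller sits inside an #ifdef needs the same guard, otherwise the compiler sees it as defined but unused when the option is off.

```c
#include <stddef.h>

/*
 * HYPOTHETICAL_CONFIG_MODULES stands in for a Kconfig option such as
 * CONFIG_MODULES. Comment the define out to model the !CONFIG_MODULES build.
 */
#define HYPOTHETICAL_CONFIG_MODULES 1

#ifdef HYPOTHETICAL_CONFIG_MODULES
/*
 * Only referenced from module_load_hook() below, so it lives under the
 * same #ifdef as its caller and never becomes an unused static function.
 */
static int process_cached_entries(const char *name)
{
    return name != NULL ? 1 : 0;
}

int module_load_hook(const char *name)
{
    return process_cached_entries(name);
}
#else
/* Stub for builds without module support: nothing to process. */
int module_load_hook(const char *name)
{
    (void)name;
    return 0;
}
#endif
```

Keeping the helper and its caller under one guard means both configurations build cleanly with -Werror=unused-function.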
      
      Link: http://lkml.kernel.org/r/20170710084413.1820568-1-arnd@arndb.de
      
      Fixes: d7fbf8df ("ftrace: Implement cached modules tracing on module load")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      69449bbd
    • S
      tracing: Do not expose stack_trace_filter without DYNAMIC_FTRACE · bbd1d27d
      Steven Rostedt (VMware) committed
      The "stack_trace_filter" file only makes sense if DYNAMIC_FTRACE is
      configured in. If it is not, then the user cannot filter any functions.
      
      Not only that, the open function causes warnings when DYNAMIC_FTRACE is not
      set.
      
      Link: http://lkml.kernel.org/r/20170710110521.600806-1-arnd@arndb.de
      
      Reported-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      bbd1d27d
    • S
      tracing: Fixup trace file header alignment · b11fb737
      Steven Rostedt (VMware) committed
      The addition of TGID to the tracing header added a check to see if TGID
      should be displayed or not, and updated the header accordingly.
      Unfortunately, it broke the default header.
      
      Also add constant strings to use for spacing. This makes the header
      layout a bit harder to see in the source, but shortens source lines
      that ran well past 80 characters.
      
      Before this change:
      
       # tracer: function
       #
       #                            _-----=> irqs-off
       #                           / _----=> need-resched
       #                          | / _---=> hardirq/softirq
       #                          || / _--=> preempt-depth
       #                          ||| /     delay
       #           TASK-PID   CPU#||||    TIMESTAMP  FUNCTION
       #              | |       | ||||       |         |
              swapper/0-1     [000] ....     0.277830: migration_init <-do_one_initcall
              swapper/0-1     [002] d...    13.861967: Unknown type 1201
              swapper/0-1     [002] d..1    13.861970: Unknown type 1202
      
      After this change:
      
       # tracer: function
       #
       #                              _-----=> irqs-off
       #                             / _----=> need-resched
       #                            | / _---=> hardirq/softirq
       #                            || / _--=> preempt-depth
       #                            ||| /     delay
       #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
       #              | |       |   ||||       |         |
              swapper/0-1     [000] ....     0.278245: migration_init <-do_one_initcall
              swapper/0-1     [003] d...    13.861189: Unknown type 1201
              swapper/0-1     [003] d..1    13.861192: Unknown type 1202
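The idea of using constant strings for spacing can be sketched in userspace as follows. This is an illustrative assumption, not the kernel's implementation: the function name, the pad widths and the single header line shown are hypothetical, and snprintf() stands in for the kernel's seq_file output.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical pad strings: a wider one when a TGID column is shown. */
static const char tgid_pad[] = "          ";
static const char default_pad[] = "  ";

/*
 * Format one header line, choosing the pad once instead of keeping two
 * full-width copies of the line in the source. Returns what snprintf()
 * returns: the number of characters the full line needs.
 */
int format_irqs_off_line(char *buf, size_t len, bool show_tgid)
{
    return snprintf(buf, len, "#%s _-----=> irqs-off\n",
                    show_tgid ? tgid_pad : default_pad);
}
```

Each format string stays short, at the cost that the column layout is no longer visible at a glance in the source.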
      
      Cc: Joel Fernandes <joelaf@google.com>
      Fixes: 441dae8f ("tracing: Add support for display of tgid in trace output")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      b11fb737
    • T
      smp/hotplug: Replace BUG_ON and react usefully · dea1d0f5
      Thomas Gleixner committed
      The move of the unpark functions to the control thread moved the BUG_ON()
      there as well. While it made some sense in the idle thread of the upcoming
      CPU, it's bogus to crash the control thread on the already online CPU,
      especially as the function has a return value and the callsite is prepared
      to handle an error return.
      
      Replace it with a WARN_ON_ONCE() and return a proper error code.
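The change in behaviour can be sketched with a userspace stand-in. Everything here is an assumption for illustration: warn_on_once() only imitates the kernel's WARN_ON_ONCE(), and the function name, the checked condition and the -ECANCELED error code are hypothetical rather than taken from the patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Userspace imitation of WARN_ON_ONCE(): report the first time the
 * condition is true, and always hand the condition back to the caller.
 */
static bool warn_on_once(bool cond, bool *already_warned, const char *what)
{
    if (cond && !*already_warned) {
        *already_warned = true;
        fprintf(stderr, "WARNING: %s\n", what);
    }
    return cond;
}

/*
 * Hypothetical control-thread step. Instead of crashing the caller
 * (the BUG_ON() style), warn once and return an error the callsite
 * can handle.
 */
int bringup_step_model(bool cpu_is_online)
{
    static bool warned;

    if (warn_on_once(!cpu_is_online, &warned, "CPU did not come online"))
        return -ECANCELED;

    /* ... unpark the per-CPU threads here ... */
    return 0;
}
```

A caller that already checks the return value can then unwind cleanly instead of taking the whole system down.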
      
      Fixes: 9cd4f1a4 ("smp/hotplug: Move unparking of percpu threads to the control CPU")
      Rightfully-ranted-at-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      dea1d0f5