1. 15 August 2017, 5 commits
    • seccomp: Filter flag to log all actions except SECCOMP_RET_ALLOW · e66a3997
      Tyler Hicks committed
      Add a new filter flag, SECCOMP_FILTER_FLAG_LOG, that enables logging for
      all actions except for SECCOMP_RET_ALLOW for the given filter.
      
      SECCOMP_RET_KILL actions are always logged when "kill" is in the
      actions_logged sysctl, and SECCOMP_RET_ALLOW actions are never logged,
      regardless of this flag.
      
      This flag can be used to create noisy filters that result in all
      non-allowed actions being logged. A process may have one noisy filter,
      which is loaded with this flag, as well as a quiet filter that's not
      loaded with this flag. This allows the actions in a set of filters
      to be selectively conveyed to the admin.
      
      Since a system could have a large number of allocated seccomp_filter
      structs, struct packing was taken into consideration. On 64 bit x86, the
      new log member takes up one byte of an existing four byte hole in the
      struct. On 32 bit x86, the new log member creates a new four byte hole
      (unavoidable) and consumes one of those bytes.
      
      Unfortunately, the tests added for SECCOMP_FILTER_FLAG_LOG are not
      capable of inspecting the audit log to verify that the actions taken in
      the filter were logged.
      
      With this patch, the logic for deciding if an action will be logged is:
      
      if action == RET_ALLOW:
        do not log
      else if action == RET_KILL && RET_KILL in actions_logged:
        log
      else if filter-requests-logging && action in actions_logged:
        log
      else if audit_enabled && process-is-being-audited:
        log
      else:
        do not log
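
      As a rough userspace illustration of the new flag (not part of this
      patch; it assumes headers new enough to define SECCOMP_FILTER_FLAG_LOG,
      and error handling is trimmed), a noisy filter could be installed like
      this:

        #include <errno.h>
        #include <stddef.h>
        #include <stdio.h>
        #include <linux/filter.h>
        #include <linux/seccomp.h>
        #include <sys/prctl.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        int main(void)
        {
            struct sock_filter insns[] = {
                /* Deny getpid() with EPERM, allow everything else. */
                BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                         offsetof(struct seccomp_data, nr)),
                BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getpid, 0, 1),
                BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
                BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
            };
            struct sock_fprog prog = {
                .len = sizeof(insns) / sizeof(insns[0]),
                .filter = insns,
            };

            /* Required so an unprivileged task may install a filter. */
            prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);

            /* The new flag: this filter's non-ALLOW actions are loggable. */
            if (syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
                        SECCOMP_FILTER_FLAG_LOG, &prog))
                perror("seccomp");

            getpid(); /* hits RET_ERRNO, a candidate for an audit record */
            return 0;
        }
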
      Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • seccomp: Sysctl to configure actions that are allowed to be logged · 0ddec0fc
      Tyler Hicks committed
      Administrators can write to this sysctl to set the seccomp actions that
      are allowed to be logged. Any actions not found in this sysctl will not
      be logged.
      
      For example, all SECCOMP_RET_KILL, SECCOMP_RET_TRAP, and
      SECCOMP_RET_ERRNO actions would be loggable if "kill trap errno" were
      written to the sysctl. SECCOMP_RET_TRACE actions would not be logged
      since their string representation ("trace") wasn't present in the
      sysctl value.
      
      The path to the sysctl is:
      
       /proc/sys/kernel/seccomp/actions_logged
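
      As a minimal sketch (assuming sufficient privileges), the "kill trap
      errno" example above could be applied from C:

        #include <stdio.h>

        int main(void)
        {
            FILE *f = fopen("/proc/sys/kernel/seccomp/actions_logged", "w");

            if (!f)
                return 1;
            /* Any action name missing from this list won't be logged. */
            fputs("kill trap errno", f);
            return fclose(f) ? 1 : 0;
        }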
      
      The actions_avail sysctl can be read to discover the valid action names
      that can be written to the actions_logged sysctl with the exception of
      "allow". SECCOMP_RET_ALLOW actions cannot be configured for logging.
      
      The default setting for the sysctl is to allow all actions to be logged
      except SECCOMP_RET_ALLOW. While only SECCOMP_RET_KILL actions are
      currently logged, an upcoming patch will allow applications to request
      additional actions to be logged.
      
      There's one important exception to this sysctl. If a task is
      specifically being audited, meaning that an audit context has been
      allocated for the task, seccomp will log all actions other than
      SECCOMP_RET_ALLOW despite the value of actions_logged. This exception
      preserves the existing auditing behavior of tasks with an allocated
      audit context.
      
      With this patch, the logic for deciding if an action will be logged is:
      
      if action == RET_ALLOW:
        do not log
      else if action == RET_KILL && RET_KILL in actions_logged:
        log
      else if audit_enabled && task-is-being-audited:
        log
      else:
        do not log
      Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • seccomp: Operation for checking if an action is available · d612b1fd
      Tyler Hicks committed
      Userspace code that needs to check if the kernel supports a given action
      may not be able to use the /proc/sys/kernel/seccomp/actions_avail
      sysctl. The process may be running in a sandbox and, therefore,
      sufficient filesystem access may not be available. This patch adds an
      operation to the seccomp(2) syscall that allows userspace code to ask
      the kernel if a given action is available.
      
      If the action is supported by the kernel, 0 is returned. If the action
      is not supported by the kernel, -1 is returned with errno set to
      EOPNOTSUPP. If this check is attempted on a kernel that doesn't support
      this new operation, -1 is returned with errno set to EINVAL, allowing
      userspace code to differentiate between the two error cases.
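
      A sketch of how userspace might use the operation (assuming headers
      that define __NR_seccomp and SECCOMP_GET_ACTION_AVAIL):

        #include <errno.h>
        #include <stdio.h>
        #include <linux/seccomp.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        /* 1 = supported, 0 = action unknown, -1 = operation unknown */
        static int action_avail(unsigned int action)
        {
            if (!syscall(__NR_seccomp, SECCOMP_GET_ACTION_AVAIL, 0, &action))
                return 1;
            return errno == EINVAL ? -1 : 0;
        }

        int main(void)
        {
            printf("SECCOMP_RET_TRAP: %d\n", action_avail(SECCOMP_RET_TRAP));
            return 0;
        }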
      Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
      Suggested-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • seccomp: Sysctl to display available actions · 8e5f1ad1
      Tyler Hicks committed
      This patch creates a read-only sysctl containing an ordered list of
      seccomp actions that the kernel supports. The ordering, from left to
      right, is the lowest action value (kill) to the highest action value
      (allow). Currently, reading the sysctl file returns "kill trap
      errno trace allow". The contents of this sysctl file can be useful for
      userspace code as well as the system administrator.
      
      The path to the sysctl is:
      
        /proc/sys/kernel/seccomp/actions_avail
      
      libseccomp and other userspace code can easily determine which actions
      the current kernel supports. The set of actions supported by the current
      kernel may differ from the set of action macros found in the kernel
      headers that were installed where the userspace code was built.
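
      For instance, an illustrative userspace probe (the function name is
      hypothetical) could parse the sysctl at runtime:

        #include <stdio.h>
        #include <string.h>

        /* Returns 1 if the running kernel lists the given action name. */
        static int kernel_knows_action(const char *name)
        {
            char buf[128], *tok, *save;
            int found = 0;
            FILE *f = fopen("/proc/sys/kernel/seccomp/actions_avail", "r");

            if (!f)
                return 0;
            if (fgets(buf, sizeof(buf), f))
                for (tok = strtok_r(buf, " \n", &save); tok;
                     tok = strtok_r(NULL, " \n", &save))
                    if (!strcmp(tok, name))
                        found = 1;
            fclose(f);
            return found;
        }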
      
      In addition, this sysctl will allow system administrators to know which
      actions are supported by the kernel and make it easier to configure
      exactly what seccomp logs through the audit subsystem. Support for this
      level of logging configuration will come in a future patch.
      Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • seccomp: Provide matching filter for introspection · deb4de8b
      Kees Cook committed
      Both the upcoming logging improvements and changes to RET_KILL will need
      to know which filter a given seccomp return value originated from. In
      order to delay processing of the result until after the seccomp loop,
      this adds a single pointer assignment on matches. This will allow both
      log and RET_KILL logic to work off the filter rather than doing more
      expensive tests inside the time-critical run_filters loop.
      
      Running tight cycles of getpid() with filters attached shows no measurable
      difference in speed.
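
      A paraphrased sketch of the idea (simplified, not the verbatim diff):
      the run loop remembers which filter produced the winning return value:

        static u32 seccomp_run_filters(const struct seccomp_data *sd,
                                       struct seccomp_filter **match)
        {
            u32 ret = SECCOMP_RET_ALLOW;
            struct seccomp_filter *f = current->seccomp.filter;

            for (; f; f = f->prev) {
                u32 cur_ret = BPF_PROG_RUN(f->prog, sd);

                if ((cur_ret & SECCOMP_RET_ACTION) < (ret & SECCOMP_RET_ACTION)) {
                    ret = cur_ret;
                    *match = f; /* the single pointer assignment on a match */
                }
            }
            return ret;
        }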
      Suggested-by: Tyler Hicks <tyhicks@canonical.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
  2. 21 July 2017, 2 commits
    • perf/core: Fix locking for children siblings group read · 2aeb1883
      Jiri Olsa committed
      We're missing the ctx lock when iterating the children's siblings
      within the perf_read path for group reading. The following
      race and crash can happen:
      
      User space doing read syscall on event group leader:
      
      T1:
        perf_read
          lock event->ctx->mutex
          perf_read_group
            lock leader->child_mutex
            __perf_read_group_add(child)
              list_for_each_entry(sub, &leader->sibling_list, group_entry)
      
      ---->   sub might be invalid at this point, because it could
              get removed via perf_event_exit_task_context in T2
      
      Child exiting and cleaning up its events:
      
      T2:
        perf_event_exit_task_context
          lock ctx->mutex
          list_for_each_entry_safe(child_event, next, &child_ctx->event_list,...
            perf_event_exit_event(child)
              lock ctx->lock
              perf_group_detach(child)
              unlock ctx->lock
      
      ---->   child is removed from sibling_list without any sync
              with T1 path above
      
              ...
              free_event(child)
      
      Before the child is removed from the leader's child_list (and thus
      omitted from perf_read_group processing), we need to ensure that
      perf_read_group touches the child's siblings under its ctx->lock.
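
      Heavily abridged shape of the fix (the count accumulation is elided):

        static int __perf_read_group_add(struct perf_event *leader,
                                         u64 read_format, u64 *values)
        {
            struct perf_event_context *ctx = leader->ctx;
            struct perf_event *sub;
            unsigned long flags;

            /* Hold ctx->lock so perf_group_detach() in the exiting child
             * cannot rip entries out of sibling_list under us. */
            raw_spin_lock_irqsave(&ctx->lock, flags);

            list_for_each_entry(sub, &leader->sibling_list, group_entry) {
                /* ... accumulate sub's count into values[] ... */
            }

            raw_spin_unlock_irqrestore(&ctx->lock, flags);
            return 0;
        }
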
      
      Peter further notes:
      
      | One additional note; this bug got exposed by commit:
      |
      |   ba5213ae ("perf/core: Correct event creation with PERF_FORMAT_GROUP")
      |
      | which made it possible to actually trigger this code-path.
      Tested-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: ba5213ae ("perf/core: Correct event creation with PERF_FORMAT_GROUP")
      Link: http://lkml.kernel.org/r/20170720141455.2106-1-jolsa@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • bpf: fix mixed signed/unsigned derived min/max value bounds · 4cabc5b1
      Daniel Borkmann committed
      Edward reported that there's an issue in min/max value bounds
      tracking when signed and unsigned compares both provide hints
      on limits for unknown variables. E.g. a program such
      as the following should have been rejected:
      
         0: (7a) *(u64 *)(r10 -8) = 0
         1: (bf) r2 = r10
         2: (07) r2 += -8
         3: (18) r1 = 0xffff8a94cda93400
         5: (85) call bpf_map_lookup_elem#1
         6: (15) if r0 == 0x0 goto pc+7
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R10=fp
         7: (7a) *(u64 *)(r10 -16) = -8
         8: (79) r1 = *(u64 *)(r10 -16)
         9: (b7) r2 = -1
        10: (2d) if r1 > r2 goto pc+3
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=0
        R2=imm-1,max_value=18446744073709551615,min_align=1 R10=fp
        11: (65) if r1 s> 0x1 goto pc+2
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=0,max_value=1
        R2=imm-1,max_value=18446744073709551615,min_align=1 R10=fp
        12: (0f) r0 += r1
        13: (72) *(u8 *)(r0 +0) = 0
        R0=map_value_adj(ks=8,vs=8,id=0),min_value=0,max_value=1 R1=inv,min_value=0,max_value=1
        R2=imm-1,max_value=18446744073709551615,min_align=1 R10=fp
        14: (b7) r0 = 0
        15: (95) exit
      
      What happens is that in the first part ...
      
         8: (79) r1 = *(u64 *)(r10 -16)
         9: (b7) r2 = -1
        10: (2d) if r1 > r2 goto pc+3
      
      ... r1 carries an unsigned value, and is compared as unsigned
      against a register carrying an immediate. The verifier deduces in
      reg_set_min_max() that since the compare is unsigned and the operation
      is greater than (>), in the fall-through/false case r1's
      minimum bound must be 0 and its maximum bound must be r2. The latter is
      larger than the allowed bound, and thus the max value is reset back to
      being 'invalid' aka BPF_REGISTER_MAX_RANGE. Thus, r1's state is now
      'R1=inv,min_value=0'. The subsequent test ...
      
        11: (65) if r1 s> 0x1 goto pc+2
      
      ... is a signed compare of r1 with immediate value 1. Here, the
      verifier deduces in reg_set_min_max() that since the compare
      is signed this time and the operation is greater than (>),
      in the fall-through/false case r1's maximum
      bound must be 1, meaning that combined with the prior test, r1 ends
      up in the following state: R1=inv,min_value=0,max_value=1. Given that
      the actual value this holds is -8, the bounds are wrongly deduced.
      When this is being added to r0 which holds the map_value(_adj)
      type, then subsequent store access in above case will go through
      check_mem_access() which invokes check_map_access_adj(), that
      will then probe whether the map memory is in bounds based
      on the min_value and max_value as well as access size since
      the actual unknown value is min_value <= x <= max_value; commit
      fce366a9 ("bpf, verifier: fix alu ops against map_value{,
      _adj} register types") provides some more explanation on the
      semantics.
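
      A standalone userspace demonstration (not kernel code) of why the
      combined deduction is unsound: x = -8 falls through both branches
      above, yet violates the concluded [0, 1] range:

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
            uint64_t x = (uint64_t)-8;

            /* insn 10: unsigned 'r1 > r2' with r2 = -1; always false,
             * from which the verifier kept min_value = 0. */
            if (!(x > (uint64_t)-1))
                printf("unsigned test falls through (min_value := 0)\n");

            /* insn 11: signed 'r1 s> 0x1'; false for -8, from which
             * the verifier derived max_value = 1. */
            if (!((int64_t)x > 1))
                printf("signed test falls through (max_value := 1)\n");

            printf("actual value: %lld\n", (long long)(int64_t)x); /* -8 */
            return 0;
        }
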
      
      It's worth noting in this context that in the current code,
      min_value and max_value tracking are used for two things, i)
      dynamic map value access via check_map_access_adj() and since
      commit 06c1c049 ("bpf: allow helpers access to variable memory")
      ii) also enforced at check_helper_mem_access() when passing a
      memory address (pointer to packet, map value, stack) and length
      pair to a helper and the length in this case is an unknown value
      defining an access range through min_value/max_value in that
      case. The min_value/max_value tracking is /not/ used in the
      direct packet access case to track ranges. However, the issue
      also affects case ii), for example, the following crafted program
      based on the same principle must be rejected as well:
      
         0: (b7) r2 = 0
         1: (bf) r3 = r10
         2: (07) r3 += -512
         3: (7a) *(u64 *)(r10 -16) = -8
         4: (79) r4 = *(u64 *)(r10 -16)
         5: (b7) r6 = -1
         6: (2d) if r4 > r6 goto pc+5
        R1=ctx R2=imm0,min_value=0,max_value=0,min_align=2147483648 R3=fp-512
        R4=inv,min_value=0 R6=imm-1,max_value=18446744073709551615,min_align=1 R10=fp
         7: (65) if r4 s> 0x1 goto pc+4
        R1=ctx R2=imm0,min_value=0,max_value=0,min_align=2147483648 R3=fp-512
        R4=inv,min_value=0,max_value=1 R6=imm-1,max_value=18446744073709551615,min_align=1
        R10=fp
         8: (07) r4 += 1
         9: (b7) r5 = 0
        10: (6a) *(u16 *)(r10 -512) = 0
        11: (85) call bpf_skb_load_bytes#26
        12: (b7) r0 = 0
        13: (95) exit
      
      Meaning, while we initialize the max_value stack slot that the
      verifier thinks we access in the [1,2] range, in reality we
      pass -7 as length which is interpreted as u32 in the helper.
      Thus, this issue is relevant also for the case of helper ranges.
      Resetting both bounds in check_reg_overflow() in case only one
      of them exceeds limits is also not enough, as a similar test can be
      created that uses values which are within range; thus, also here the
      learned min value in r1 is incorrect when mixed with a later signed
      test to create a range:
      
         0: (7a) *(u64 *)(r10 -8) = 0
         1: (bf) r2 = r10
         2: (07) r2 += -8
         3: (18) r1 = 0xffff880ad081fa00
         5: (85) call bpf_map_lookup_elem#1
         6: (15) if r0 == 0x0 goto pc+7
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R10=fp
         7: (7a) *(u64 *)(r10 -16) = -8
         8: (79) r1 = *(u64 *)(r10 -16)
         9: (b7) r2 = 2
        10: (3d) if r2 >= r1 goto pc+3
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3
        R2=imm2,min_value=2,max_value=2,min_align=2 R10=fp
        11: (65) if r1 s> 0x4 goto pc+2
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0
        R1=inv,min_value=3,max_value=4 R2=imm2,min_value=2,max_value=2,min_align=2 R10=fp
        12: (0f) r0 += r1
        13: (72) *(u8 *)(r0 +0) = 0
        R0=map_value_adj(ks=8,vs=8,id=0),min_value=3,max_value=4
        R1=inv,min_value=3,max_value=4 R2=imm2,min_value=2,max_value=2,min_align=2 R10=fp
        14: (b7) r0 = 0
        15: (95) exit
      
      This leaves us with two options for fixing this: i) to invalidate
      all prior learned information once we switch signed context, ii)
      to track min/max signed and unsigned boundaries separately as
      done in [0]. (Given the latter introduces major changes throughout
      the whole verifier, it's rather net-next material, so this
      patch follows option i), meaning we can derive bounds either
      from only signed tests or only unsigned tests.) There is still the
      case of adjust_reg_min_max_vals(), where we adjust bounds on ALU
      operations, meaning programs like the following where boundaries
      on the reg get mixed in context later on when bounds are merged
      on the dst reg must get rejected, too:
      
         0: (7a) *(u64 *)(r10 -8) = 0
         1: (bf) r2 = r10
         2: (07) r2 += -8
         3: (18) r1 = 0xffff89b2bf87ce00
         5: (85) call bpf_map_lookup_elem#1
         6: (15) if r0 == 0x0 goto pc+6
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R10=fp
         7: (7a) *(u64 *)(r10 -16) = -8
         8: (79) r1 = *(u64 *)(r10 -16)
         9: (b7) r2 = 2
        10: (3d) if r2 >= r1 goto pc+2
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3
        R2=imm2,min_value=2,max_value=2,min_align=2 R10=fp
        11: (b7) r7 = 1
        12: (65) if r7 s> 0x0 goto pc+2
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3
        R2=imm2,min_value=2,max_value=2,min_align=2 R7=imm1,max_value=0 R10=fp
        13: (b7) r0 = 0
        14: (95) exit
      
        from 12 to 15: R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0
        R1=inv,min_value=3 R2=imm2,min_value=2,max_value=2,min_align=2 R7=imm1,min_value=1 R10=fp
        15: (0f) r7 += r1
        16: (65) if r7 s> 0x4 goto pc+2
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3
        R2=imm2,min_value=2,max_value=2,min_align=2 R7=inv,min_value=4,max_value=4 R10=fp
        17: (0f) r0 += r7
        18: (72) *(u8 *)(r0 +0) = 0
        R0=map_value_adj(ks=8,vs=8,id=0),min_value=4,max_value=4 R1=inv,min_value=3
        R2=imm2,min_value=2,max_value=2,min_align=2 R7=inv,min_value=4,max_value=4 R10=fp
        19: (b7) r0 = 0
        20: (95) exit
      
      Meaning, in adjust_reg_min_max_vals() we must also reset range
      values on the dst when src/dst registers have mixed signed/
      unsigned derived min/max value bounds with one unbounded value,
      as otherwise they can be added together, deducing false boundaries.
      Once both boundaries are established from either ALU ops or
      compare operations w/o mixing signed/unsigned insns, then they
      can safely be added to other regs also having both boundaries
      established. Adding regs with one unbounded side to a map value
      where the bounded side has been learned w/o mixing ops is
      possible, but the resulting map value won't recover from that,
      meaning such an op is considered invalid at the time of actual
      access. Invalid bounds are set on the dst reg if i) the src reg had
      them, or ii) the dst reg already had them. The only way to recover
      would be to perform i) ALU ops but only 'add' is allowed on map
      value types or ii) comparisons, but these are disallowed on
      pointers in case they span a range. This is fine as only BPF_JEQ
      and BPF_JNE may be performed on PTR_TO_MAP_VALUE_OR_NULL registers
      which potentially turn them into PTR_TO_MAP_VALUE type depending
      on the branch, so only here min/max value cannot be invalidated
      for them.
      
      In terms of state pruning, value_from_signed is considered
      as well in states_equal() when dealing with adjusted map values.
      With regards to breaking existing programs, there is a small
      risk, but the use-cases where this could occur are rather narrow
      and mixing compares is probably unlikely.
      
      Joint work with Josef and Edward.
      
        [0] https://lists.iovisor.org/pipermail/iovisor-dev/2017-June/000822.html
      
      Fixes: 48461135 ("bpf: allow access into map value arrays")
      Reported-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 20 July 2017, 3 commits
    • trace: fix the errors caused by incompatible type of RCU variables · f86f4180
      Chunyan Zhang committed
      Variables that are processed by RCU functions should be annotated
      as __rcu, otherwise sparse will report errors like the one below:
      
      "error: incompatible types in comparison expression (different
      address spaces)"
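
      A minimal sketch of the class of fix (identifiers here are
      illustrative, not from the patch):

        #include <linux/rcupdate.h>

        struct trace_hook {
            int data;
        };

        /* The __rcu annotation tells sparse which address space the
         * pointer lives in; without it, rcu_dereference() triggers the
         * "different address spaces" error above. */
        static struct trace_hook __rcu *active_hook;

        static int read_hook_data(void)
        {
            struct trace_hook *h;
            int ret = 0;

            rcu_read_lock();
            h = rcu_dereference(active_hook);
            if (h)
                ret = h->data;
            rcu_read_unlock();
            return ret;
        }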
      
      Link: http://lkml.kernel.org/r/1496823171-7758-1-git-send-email-zhang.chunyan@linaro.org
      Signed-off-by: Chunyan Zhang <zhang.chunyan@linaro.org>
      [ Updated to not be 100% 80 column strict ]
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • tracing: Fix kmemleak in instance_rmdir · db9108e0
      Chunyu Hu committed
      Hit a kmemleak warning when executing instance_rmdir: it forgot to
      release the memory of tracing_cpumask. With this fix, the warning no
      longer appears.
      
      unreferenced object 0xffff93a8dfaa7c18 (size 8):
        comm "mkdir", pid 1436, jiffies 4294763622 (age 9134.308s)
        hex dump (first 8 bytes):
          ff ff ff ff ff ff ff ff                          ........
        backtrace:
          [<ffffffff88b6567a>] kmemleak_alloc+0x4a/0xa0
          [<ffffffff8861ea41>] __kmalloc_node+0xf1/0x280
          [<ffffffff88b505d3>] alloc_cpumask_var_node+0x23/0x30
          [<ffffffff88b5060e>] alloc_cpumask_var+0xe/0x10
          [<ffffffff88571ab0>] instance_mkdir+0x90/0x240
          [<ffffffff886e5100>] tracefs_syscall_mkdir+0x40/0x70
          [<ffffffff886565c9>] vfs_mkdir+0x109/0x1b0
          [<ffffffff8865b1d0>] SyS_mkdir+0xd0/0x100
          [<ffffffff88403857>] do_syscall_64+0x67/0x150
          [<ffffffff88b710e7>] return_from_SYSCALL_64+0x0/0x6a
          [<ffffffffffffffff>] 0xffffffffffffffff
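
      Abridged shape of the fix (the function name and the surrounding
      teardown steps are illustrative):

        static void trace_instance_destroy(struct trace_array *tr)
        {
            /* ... remove events and per-CPU buffers ... */
            free_cpumask_var(tr->tracing_cpumask); /* the missing release */
            kfree(tr->name);
            kfree(tr);
        }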
      
      Link: http://lkml.kernel.org/r/1500546969-12594-1-git-send-email-chuhu@redhat.com
      
      Cc: stable@vger.kernel.org
      Fixes: ccfe9e42 ("tracing: Make tracing_cpumask available for all instances")
      Signed-off-by: Chunyu Hu <chuhu@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • perf/core: Fix scheduling regression of pinned groups · 3bda69c1
      Alexander Shishkin committed
      Vince Weaver reported:
      
      > I was tracking down some regressions in my perf_event_test testsuite.
      > Some of the tests broke in the 4.11-rc1 timeframe.
      >
      > I've bisected one of them, this report is about
      >	tests/overflow/simul_oneshot_group_overflow
      > This test creates an event group containing two sampling events, set
      > to overflow to a signal handler (which disables and then refreshes the
      > event).
      >
      > On a good kernel you get the following:
      > 	Event perf::instructions with period 1000000
      > 	Event perf::instructions with period 2000000
      > 		fd 3 overflows: 946 (perf::instructions/1000000)
      > 		fd 4 overflows: 473 (perf::instructions/2000000)
      > 	Ending counts:
      > 		Count 0: 946379875
      > 		Count 1: 946365218
      >
      > With the broken kernels you get:
      > 	Event perf::instructions with period 1000000
      > 	Event perf::instructions with period 2000000
      > 		fd 3 overflows: 938 (perf::instructions/1000000)
      > 		fd 4 overflows: 318 (perf::instructions/2000000)
      > 	Ending counts:
      > 		Count 0: 946373080
      > 		Count 1: 653373058
      
      The root cause of the bug is that the following commit:
      
        487f05e1 ("perf/core: Optimize event rescheduling on active contexts")
      
      erroneously assumed that an event's 'pinned' setting determines whether the
      event belongs to a pinned group or not, but in fact, it's the group
      leader's pinned state that matters.
      
      This was discovered by Vince in the test case described above, where two instruction
      counters are grouped, the group leader is pinned, but the other event is not;
      in the regressed case the counters were off by 33% (the difference between the events'
      periods), when they should have been the same within the error margin.
      
      Fix the problem by looking at the group leader's pinning.
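
      A paraphrased sketch of the corrected classification (simplified;
      EVENT_PINNED/EVENT_FLEXIBLE are the scheduler-internal event types):

        static int get_event_type(struct perf_event *event)
        {
            /* If our group leader is pinned, so are we. */
            if (event->group_leader != event)
                event = event->group_leader;

            return event->attr.pinned ? EVENT_PINNED : EVENT_FLEXIBLE;
        }
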
      Reported-by: Vince Weaver <vincent.weaver@maine.edu>
      Tested-by: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Fixes: 487f05e1 ("perf/core: Optimize event rescheduling on active contexts")
      Link: http://lkml.kernel.org/r/87lgnmvw7h.fsf@ashishki-desk.ger.corp.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. 19 July 2017, 2 commits
    • audit: fix memleak in auditd_send_unicast_skb. · b0659ae5
      Shu Wang committed
      Found this issue via a kmemleak report: auditd_send_unicast_skb
      did not free the skb if rcu_dereference(auditd_conn) returned NULL.
      
      unreferenced object 0xffff88082568ce00 (size 256):
      comm "auditd", pid 1119, jiffies 4294708499
      backtrace:
      [<ffffffff8176166a>] kmemleak_alloc+0x4a/0xa0
      [<ffffffff8121820c>] kmem_cache_alloc_node+0xcc/0x210
      [<ffffffff8161b99d>] __alloc_skb+0x5d/0x290
      [<ffffffff8113c614>] audit_make_reply+0x54/0xd0
      [<ffffffff8113dfa7>] audit_receive_msg+0x967/0xd70
      ----------------
      (gdb) list *audit_receive_msg+0x967
      0xffffffff8113dff7 is in audit_receive_msg (kernel/audit.c:1133).
      1132    skb = audit_make_reply(0, AUDIT_REPLACE, 0,
                                      0, &pvnr, sizeof(pvnr));
      ---------------
      [<ffffffff8113e402>] audit_receive+0x52/0xa0
      [<ffffffff8166c561>] netlink_unicast+0x181/0x240
      [<ffffffff8166c8e2>] netlink_sendmsg+0x2c2/0x3b0
      [<ffffffff816112e8>] sock_sendmsg+0x38/0x50
      [<ffffffff816117a2>] SYSC_sendto+0x102/0x190
      [<ffffffff81612f4e>] SyS_sendto+0xe/0x10
      [<ffffffff8176d337>] entry_SYSCALL_64_fastpath+0x1a/0xa5
      [<ffffffffffffffff>] 0xffffffffffffffff
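
      Abridged shape of the fix: the early-return path now consumes the skb
      it was handed:

        static int auditd_send_unicast_skb(struct sk_buff *skb)
        {
            struct auditd_connection *ac;

            rcu_read_lock();
            ac = rcu_dereference(auditd_conn);
            if (!ac) {
                rcu_read_unlock();
                kfree_skb(skb); /* previously leaked */
                return -ECONNREFUSED;
            }
            /* ... hand the skb to netlink_unicast() as before ... */
            rcu_read_unlock();
            return 0;
        }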
      Signed-off-by: Shu Wang <shuwang@redhat.com>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
    • tracing/ring_buffer: Try harder to allocate · 84861885
      Joel Fernandes committed
      ftrace can fail to allocate the per-CPU ring buffer on systems with a
      large number of CPUs coupled with large amounts of memory held in the
      page cache. Currently the ring buffer allocation doesn't retry in the VM
      implementation even if direct-reclaim made some progress but still
      wasn't able to find a free page. On retrying, I see that the allocations
      almost always succeed. The retry doesn't happen because __GFP_NORETRY is
      used in the tracer to prevent the case where we might OOM; however, if we
      drop __GFP_NORETRY, we risk destabilizing the system if the OOM killer is
      triggered. To prevent this situation, use the __GFP_RETRY_MAYFAIL flag
      introduced recently [1].
      
      Tested that the following still succeeds without destabilizing a system
      with 1GB memory:

      echo 300000 > /sys/kernel/debug/tracing/buffer_size_kb
      
      [1] https://marc.info/?l=linux-mm&m=149820805124906&w=2
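
      An illustrative helper (simplified, not the verbatim diff) showing the
      flag swap:

        static struct page *rb_alloc_page(int cpu)
        {
            /* Retry harder than __GFP_NORETRY would, but fail the
             * allocation rather than invoke the OOM killer. */
            return alloc_pages_node(cpu_to_node(cpu),
                                    GFP_KERNEL | __GFP_RETRY_MAYFAIL, 0);
        }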
      
      Link: http://lkml.kernel.org/r/20170713021416.8897-1-joelaf@google.com
      
      Cc: Tim Murray <timmurray@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Joel Fernandes <joelaf@google.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
  5. 18 July 2017, 1 commit
  6. 15 July 2017, 2 commits
  7. 14 July 2017, 2 commits
  8. 13 July 2017, 18 commits
  9. 12 July 2017, 5 commits
    • ftrace: Fix uninitialized variable in match_records() · 2e028c4f
      Dan Carpenter committed
      My static checker complains that if "func" is NULL then "clear_filter"
      is uninitialized.  This seems like it could be true, although it's
      possible something subtle is happening that I haven't seen.
      
          kernel/trace/ftrace.c:3844 match_records()
          error: uninitialized symbol 'clear_filter'.
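
      Abridged shape of the fix (the parsing helper named below is
      hypothetical):

        static int match_records(struct ftrace_hash *hash, char *func, char *mod)
        {
            bool clear_filter = false; /* was uninitialized when !func */

            if (func)
                clear_filter = parse_filter_negation(&func); /* hypothetical */
            /* ... record matching uses clear_filter on every path ... */
            return 0;
        }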
      
      Link: http://lkml.kernel.org/r/20170712073556.h6tkpjcdzjaozozs@mwanda
      
      Cc: stable@vger.kernel.org
      Fixes: f0a3b154 ("ftrace: Clarify code for mod command")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • ftrace: Remove an unneeded NULL check · 44925dff
      Dan Carpenter committed
      "func" can't be NULL and it doesn't make sense to check because we've
      already dereferenced it.
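
      The anti-pattern being removed, in illustrative form:

        static int example(const char *func)
        {
            int len = strlen(func); /* func is dereferenced here already... */

            if (!func)              /* ...so this check can never trigger */
                return -EINVAL;
            return len;
        }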
      
      Link: http://lkml.kernel.org/r/20170712073340.4enzeojeoupuds5a@mwanda
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • cpufreq: schedutil: Fix sugov_start() versus sugov_update_shared() race · ab2f7cf1
      Vikram Mulukutla committed
      With a shared policy in place, when one of the CPUs in the policy is
      hotplugged out and then brought back online, sugov_stop() and
      sugov_start() are called in order.
      
      sugov_stop() removes utilization hooks for each CPU in the policy and
      does nothing else in the for_each_cpu() loop. sugov_start() on the
      other hand iterates through the CPUs in the policy and re-initializes
      the per-cpu structure _and_ adds the utilization hook.  This implies
      that the scheduler is allowed to invoke a CPU's utilization update
      hook when the rest of the per-cpu structures have yet to be
      re-inited.
      
      Apart from some strange values in tracepoints this doesn't cause a
      problem, but if we do end up accessing a pointer from the per-cpu
      sugov_cpu structure somewhere in the sugov_update_shared() path,
      we will likely see crashes since the memset for another CPU in the
      policy is free to race with sugov_update_shared from the CPU that is
      ready to go.  So let's fix this now to first init all per-cpu
      structures, and then add the per-cpu utilization update hooks all at
      once.
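
      Abridged shape of the fix in sugov_start(): two passes, initialization
      first, hooks second:

        for_each_cpu(cpu, policy->cpus) {
            struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);

            memset(sg_cpu, 0, sizeof(*sg_cpu));
            sg_cpu->sg_policy = sg_policy;
        }

        for_each_cpu(cpu, policy->cpus) {
            struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);

            /* Only now may the scheduler invoke this CPU's update hook. */
            cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
                                         policy_is_shared(policy) ?
                                         sugov_update_shared :
                                         sugov_update_single);
        }
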
      Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • genirq: Keep chip buslock across irq_request/release_resources() · 19d39a38
      Thomas Gleixner committed
      Moving the irq_request/release_resources() callbacks out of the spinlocked,
      irq disabled and bus locked region, unearthed an interesting abuse of the
      irq_bus_lock/irq_bus_sync_unlock() callbacks.
      
      The OMAP GPIO driver merrily does power management inside of them. The
      irq_request_resources() callback of this GPIO irqchip calls a function
      which reads a GPIO register. That read now aborts because the clock of
      the GPIO block is not magically enabled via the irq_bus_lock() callback.
      
      Move the callbacks under the bus lock again to prevent this. In the
      free_irq() path this requires dropping the bus_lock before calling
      synchronize_irq() and reacquiring it before calling the
      irq_release_resources() callback.
      
      The bus lock can't be held because:
      
         1) The data which has been changed between bus_lock/un_lock is cached in
            the irq chip driver private data and needs to go out to the irq chip
            via the slow bus (usually SPI or I2C) before calling
            synchronize_irq().
      
            That's the reason why this bus_lock/unlock magic exists in the first
            place, as you cannot do SPI/I2C transactions while holding desc->lock
            with interrupts disabled.
      
         2) synchronize_irq() will actually deadlock if there is a handler in
            flight. These chips use threaded handlers for obvious reasons, as
            they allow SPI/I2C communication. When the threaded handler
            returns, bus_lock needs to be taken in irq_finalize_oneshot() as
            we need to talk to the actual irq chip once more. After that the
            threaded handler is marked done, which makes synchronize_irq() return.

            So if we hold bus_lock across the synchronize_irq() call, the
            handler cannot mark itself done because it blocks on the bus
            lock. That in turn makes synchronize_irq() wait forever on the
            threaded handler to complete....
      
      Add the missing unlock of desc->request_mutex in the error path of
      __free_irq() and add a bunch of comments to explain the locking and
      protection rules.
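
      Abridged shape of the resulting __free_irq() ordering (a simplified
      sketch, not the verbatim diff):

        chip_bus_lock(desc);
        /* ... unhook the action under desc->lock ... */
        chip_bus_sync_unlock(desc);  /* push cached state out to the chip */

        synchronize_irq(irq);        /* may wait on a threaded handler */

        chip_bus_lock(desc);         /* retake for the resource callback */
        irq_release_resources(desc);
        chip_bus_sync_unlock(desc);
        mutex_unlock(&desc->request_mutex); /* also added in the error path */
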
      
      Fixes: 46e48e25 ("genirq: Move irq resource handling out of spinlocked region")
      Reported-and-tested-by: Sebastian Reichel <sebastian.reichel@collabora.co.uk>
      Reported-and-tested-by: Tony Lindgren <tony@atomide.com>
      Reported-by: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Not-longer-ranted-at-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Linus Walleij <linus.walleij@linaro.org>
      Cc: Grygorii Strashko <grygorii.strashko@ti.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
    • ftrace: Hide cached module code for !CONFIG_MODULES · 69449bbd
      Arnd Bergmann committed
      When modules are disabled, we get a harmless build warning:
      
      kernel/trace/ftrace.c:4051:13: error: 'process_cached_mods' defined but not used [-Werror=unused-function]
      
      This adds the same #ifdef around the new code that exists around
      its caller.
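
      The shape of the fix (body elided):

        #ifdef CONFIG_MODULES
        static void process_cached_mods(const char *mod_name)
        {
            /* ... enable cached tracing for the just-loaded module ... */
        }
        #endif /* CONFIG_MODULES */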
      
      Link: http://lkml.kernel.org/r/20170710084413.1820568-1-arnd@arndb.de
      
      Fixes: d7fbf8df ("ftrace: Implement cached modules tracing on module load")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>