1. 06 Nov 2015, 15 commits
  2. 04 Nov 2015, 8 commits
    • audit: make audit_log_common_recv_msg() a void function · 233a6866
      Paul Moore authored
      It always returns zero, and no caller checks the return value.
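
      A minimal sketch of the change's shape (the signature is abbreviated;
      treat the details as illustrative rather than verbatim):

          /* before: the result was always 0 and nobody looked at it */
          static int audit_log_common_recv_msg(struct audit_buffer **ab,
                                               u16 msg_type);

          /* after: drop the meaningless return value */
          static void audit_log_common_recv_msg(struct audit_buffer **ab,
                                                u16 msg_type);
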
      Signed-off-by: Paul Moore <pmoore@redhat.com>
    • audit: removing unused variable · c5ea6efd
      Saurabh Sengar authored
      The variable rc is not required: it is only used, unchanged, for the
      return value, and the function always returns 0.
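
      The shape of the cleanup, as a sketch (the function name here is
      hypothetical):

          static int audit_sketch(void)
          {
              int rc = 0;      /* never modified below */
              /* ... work that cannot fail ... */
              return rc;       /* equivalent to: return 0; */
          }
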
      Signed-off-by: Saurabh Sengar <saurabh.truth@gmail.com>
      [PM: fixed spelling errors in description]
      Signed-off-by: Paul Moore <pmoore@redhat.com>
    • audit: fix comment block whitespace · 725131ef
      Scott Matheina authored
      Signed-off-by: Scott Matheina <scott@matheina.com>
      [PM: fixed subject line]
      Signed-off-by: Paul Moore <pmoore@redhat.com>
    • audit: audit_tree_match can be boolean · 6f1b5d7a
      Yaowei Bai authored
      This patch makes audit_tree_match() return bool to improve
      readability, since the function only ever returns one or zero.
      
      No functional change.
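
      A sketch of the conversion (the body is an assumption about the
      function's actual logic; the same shape applies to the
      audit_string_contains_control() change below):

          /* true iff the chunk has an owner entry for this tree */
          bool audit_tree_match(struct audit_chunk *chunk,
                                struct audit_tree *tree)
          {
              int n;

              for (n = 0; n < chunk->count; n++)
                  if (chunk->owners[n].owner == tree)
                      return true;    /* was: return 1 */
              return false;           /* was: return 0 */
          }
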
      Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
      [PM: tweaked the subject line]
      Signed-off-by: Paul Moore <pmoore@redhat.com>
    • audit: audit_string_contains_control can be boolean · 9fcf836b
      Yaowei Bai authored
      This patch makes audit_string_contains_control() return bool to
      improve readability, since the function only ever returns one or
      zero.
      Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
      [PM: tweaked subject line]
      Signed-off-by: Paul Moore <pmoore@redhat.com>
    • audit: try harder to send to auditd upon netlink failure · 32a1dbae
      Richard Guy Briggs authored
      There are several reports of the kernel losing contact with auditd when
      it is, in fact, still running.  When this happens, kernel syslogs show:
      	"audit: *NO* daemon at audit_pid=<pid>"
      although auditd is still running, and is apparently happy, listening on
      the netlink socket. The pid in the "*NO* daemon" message matches the pid
      of the running auditd process.  Restarting auditd solves this.
      
      The problem appears to happen randomly, and doesn't seem to be strongly
      correlated to the rate of audit events being logged.  It occurs fairly
      regularly (every few days), but has not yet been reproduced on demand.
      
      On production kernels, BUG_ON() is a no-op, so any error will trigger
      this.
      
      Commit 34eab0a7 ("audit: prevent an older auditd shutdown from
      orphaning a newer auditd startup") eliminates one possible cause.  This
      isn't the case here, since the PID in the error message and the PID of
      the running auditd match.
      
      The primary expected cause of error here is -ECONNREFUSED when the audit
      daemon goes away, when netlink_getsockbyportid() can't find the auditd
      portid entry in the netlink audit table (or there is no receive
      function).  If -EPERM is returned, that situation isn't likely to be
      resolved in a timely fashion without administrator intervention.  In
      both cases, reset the audit_pid.  This does not rule out a race
      condition.  SELinux is expected to return zero since this isn't an INET
      or INET6 socket.  Other LSMs may have other return codes.  Log the error
      code for better diagnosis in the future.
      
      In the case of -ENOMEM, the situation could be temporary, based on local
      or general availability of buffers.  -EAGAIN should never happen since
      the netlink audit (kernel) socket is set to MAX_SCHEDULE_TIMEOUT.
      -ERESTARTSYS and -EINTR are not expected since this kernel thread is not
      expected to receive signals.  In these cases (or any other unexpected
      ones for now), report the error and re-schedule the thread, retrying up
      to 5 times.
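
      A sketch of the decision logic described above (names and structure are
      assumptions; skb lifetime handling is deliberately omitted):

          /* returns 1 if the caller should retry the send */
          static int auditd_send_err_sketch(int err, int *attempts)
          {
              if (err == -ECONNREFUSED || err == -EPERM) {
                  /* auditd is gone, or will never accept: reset the pid */
                  pr_err("audit: *NO* daemon at audit_pid=%d, error=%d\n",
                         audit_pid, err);
                  audit_pid = 0;
                  return 0;
              }
              if (err < 0 && ++(*attempts) <= 5) {
                  /* possibly transient (e.g. -ENOMEM): log and retry */
                  pr_warn("audit: re-scheduling kauditd, error=%d\n", err);
                  schedule();
                  return 1;
              }
              return 0;
          }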
      
      v2:
      	Removed BUG_ON().
      	Moved comma in pr_*() statements.
      	Removed audit_strerror() text.
      Reported-by: Vipin Rathor <v.rathor@gmail.com>
      Reported-by: <ctcard@hotmail.com>
      Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
      [PM: applied rgb's fixup patch to correct audit_log_lost() format issues]
      Signed-off-by: Paul Moore <pmoore@redhat.com>
    • atomic: remove all traces of READ_ONCE_CTRL() and atomic*_read_ctrl() · 105ff3cb
      Linus Torvalds authored
      This seems to be a mis-reading of how alpha memory ordering works, and
      is not backed up by the alpha architecture manual.  The helper functions
      don't do anything special on any other architectures, and the arguments
      that support them being safe on other architectures also argue that they
      are safe on alpha.
      
      Basically, the "control dependency" is between a previous read and a
      subsequent write that is dependent on the value read.  Even if the
      subsequent write is actually done speculatively, there is no way that
      such a speculative write could be made visible to other CPUs until it
      has been committed, which requires validating the speculation.
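
      In kernel C, the read-to-write control dependency being discussed looks
      like this (a generic sketch, not code from the patch):

          /* the store cannot become visible to other CPUs before the
           * load resolves: committing a speculative store would mean
           * committing unvalidated speculation -- on alpha too */
          if (READ_ONCE(flag) == 1)
              WRITE_ONCE(data, 42);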
      
      Note that most weakly ordered architectures (very much including alpha)
      do not guarantee any ordering relationship between two loads that depend
      on each other on a control dependency:
      
          read A
          if (val == 1)
              read B
      
      because the conditional may be predicted, and the "read B" may be
      speculatively moved up to before reading the value A.  So we require the
      user to insert a smp_rmb() between the two accesses to be correct:
      
          read A;
          if (A == 1)
              smp_rmb()
              read B
      
      Alpha is further special in that it can break that ordering even if the
      *address* of B depends on the read of A, because the cacheline that is
      read later may be stale unless you have a memory barrier in between the
      pointer read and the read of the value behind a pointer:
      
          read ptr
          read offset(ptr)
      
      whereas all other weakly ordered architectures guarantee that the data
      dependency (as opposed to just a control dependency) will order the two
      accesses.  As a result, alpha needs a "smp_read_barrier_depends()" in
      between those two reads for them to be ordered.
      
      The control dependency that "READ_ONCE_CTRL()" and "atomic_read_ctrl()"
      had was a control dependency to a subsequent *write*, however, and
      nobody can finalize such a subsequent write without having actually done
      the read.  And were you to write such a value to a "stale" cacheline
      (the way the unordered reads came to be), that would seem to lose the
      write entirely.
      
      So the things that make alpha able to re-order reads even more
      aggressively than other weak architectures do not seem to be relevant
      for a subsequent write.  Alpha memory ordering may be strange, but
      there's no real indication that it is *that* strange.
      
      Also, the alpha architecture reference manual very explicitly talks
      about the definition of "Dependence Constraints" in section 5.6.1.7,
      where a preceding read dominates a subsequent write.
      
      Such a dependence constraint admittedly does not impose a BEFORE (alpha
      architecture term for globally visible ordering), but it does guarantee
      that there can be no "causal loop".  I don't see how you could avoid
      such a loop if another cpu could see the stored value and then impact
      the value of the first read.  Put another way: the read and the write
      could not be seen as being out of order wrt other cpus.
      
      So I do not see how these "x_ctrl()" functions can currently be necessary.
      
      I may have to eat my words at some point, but in the absence of clear
      proof that alpha actually needs this, or indeed even an explanation of
      how alpha could _possibly_ need it, I do not believe these functions are
      called for.
      
      And if it turns out that alpha really _does_ need a barrier for this
      case, that barrier still should not be "smp_read_barrier_depends()".
      We'd have to make up some new speciality barrier just for alpha, along
      with the documentation for why it really is necessary.
      
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul E McKenney <paulmck@us.ibm.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • bpf, verifier: annotate verbose printer with __printf · 1d056d9c
      Daniel Borkmann authored
      The verbose() printer dumps the verifier state to user space, so let gcc
      check calls to verbose() for (future) format-string errors. A build with
      W=1 correctly suggests: function might be possible candidate for
      'gnu_printf' format attribute [-Wsuggest-attribute=format].
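
      A minimal sketch of the annotation (in the kernel, __printf(a, b)
      expands to __attribute__((format(printf, a, b)))):

          /* argument 1 is the format string; variadic args start at 2 */
          __printf(1, 2) static void verbose(const char *fmt, ...);
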
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 03 Nov 2015, 5 commits
    • bpf: add support for persistent maps/progs · b2197755
      Daniel Borkmann authored
      This work adds support for "persistent" eBPF maps/programs.
      "Persistent" here means that maps/programs have a facility that lets
      them survive process termination. This is desired by various eBPF
      subsystem users.
      
      Just to name one example: tc classifier/action. Whenever tc parses
      the ELF object, extracts and loads maps/progs into the kernel, these
      file descriptors will be out of reach after the tc instance exits.
      So a subsequent tc invocation won't be able to access/relocate on this
      resource, and therefore maps cannot easily be shared, f.e. between the
      ingress and egress networking data path.
      
      The current workaround is that Unix domain sockets (UDS) need to be
      instrumented in order to pass the created eBPF map/program file
      descriptors to a third party management daemon through UDS' socket
      passing facility. This makes it a bit complicated to deploy shared
      eBPF maps or programs (programs f.e. for tail calls) among various
      processes.
      
      We've been brainstorming on how to tackle this issue, and various
      approaches have been tried out so far; they can be read up on further
      in the reference below.
      
      The architecture we eventually ended up with is a minimal file system
      that can hold map/prog objects. The file system is a per mount namespace
      singleton, and the default mount point is /sys/fs/bpf/. Any subsequent
      mounts within a given namespace will point to the same instance. The
      file system allows for creating a user-defined directory structure.
      The objects for maps/progs are created/fetched through bpf(2) with
      two new commands (BPF_OBJ_PIN/BPF_OBJ_GET). I.e. a bpf file descriptor
      along with a pathname is being passed to bpf(2) that in turn creates
      (we call it eBPF object pinning) the file system nodes. Only the pathname
      is being passed to bpf(2) for getting a new BPF file descriptor to an
      existing node. The user can use that to access maps and progs later on,
      through bpf(2). Removal of file system nodes is being managed through
      normal VFS functions such as unlink(2), etc. The file system code is
      kept to a very minimum and can be further extended later on.
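
      A hedged userspace sketch of pinning and re-fetching an object (error
      handling elided; attr field names follow the bpf(2) UAPI):

          #include <linux/bpf.h>
          #include <stdint.h>
          #include <string.h>
          #include <sys/syscall.h>
          #include <unistd.h>

          static int bpf_obj_pin(int fd, const char *path)
          {
              union bpf_attr attr;

              memset(&attr, 0, sizeof(attr));
              attr.pathname = (uint64_t)(uintptr_t)path;
              attr.bpf_fd = fd;
              return syscall(__NR_bpf, BPF_OBJ_PIN, &attr, sizeof(attr));
          }

          static int bpf_obj_get(const char *path)    /* returns a new fd */
          {
              union bpf_attr attr;

              memset(&attr, 0, sizeof(attr));
              attr.pathname = (uint64_t)(uintptr_t)path;
              return syscall(__NR_bpf, BPF_OBJ_GET, &attr, sizeof(attr));
          }

          /* e.g.: bpf_obj_pin(map_fd, "/sys/fs/bpf/my_map"); */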
      
      The next step I'm working on is to add dump commands for eBPF
      maps/progs to bpf(2), so that a specification can be retrieved from a
      given file descriptor. This can be used by things like CRIU, but
      applications can also inspect the metadata after calling BPF_OBJ_GET.
      
      Big thanks also to Alexei and Hannes, who contributed significantly
      to the design discussion that eventually led us to this architecture.
      
      Reference: https://lkml.org/lkml/2015/10/15/925
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: consolidate bpf_prog_put{, _rcu} dismantle paths · e9d8afa9
      Daniel Borkmann authored
      We currently have duplicated cleanup code in the bpf_prog_put() and
      bpf_prog_put_rcu() dismantle paths. Back then we decided that it was
      not worth making it a common helper called by both, but with the
      recent addition of resource charging, we could have avoided the fix
      in commit ac00737f ("bpf: Need to call bpf_prog_uncharge_memlock
      from bpf_prog_put") if we had had only a single, common path. We can
      simplify it further by assigning aux->prog only once, during
      allocation.
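
      A sketch of the consolidated path (close to, but not necessarily
      verbatim, the patch):

          static void __bpf_prog_put_rcu(struct rcu_head *rcu)
          {
              struct bpf_prog_aux *aux;

              aux = container_of(rcu, struct bpf_prog_aux, rcu);
              free_used_maps(aux);
              bpf_prog_uncharge_memlock(aux->prog);  /* charge always undone */
              bpf_prog_free(aux->prog);              /* aux->prog set at alloc */
          }

          void bpf_prog_put(struct bpf_prog *prog)
          {
              if (atomic_dec_and_test(&prog->aux->refcnt))
                  call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
          }
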
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: align and clean bpf_{map,prog}_get helpers · c2101297
      Daniel Borkmann authored
      Add a bpf_map_get() function that we're going to use later on, and
      align/clean the remaining helpers a bit so that they are more
      consistent:
      
        - __bpf_map_get() and __bpf_prog_get() that both work on the fd
          struct, check whether the descriptor is eBPF and return the
          pointer to the map/prog stored in the private data.
      
          Also, we can return f.file->private_data directly, the function
          signature is enough of a documentation already.
      
        - bpf_map_get() and bpf_prog_get() that both work on u32 user fd,
          call their respective __bpf_map_get()/__bpf_prog_get() variants,
          and take a reference.
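
      A sketch of the first helper's shape under these conventions (assumed,
      not verbatim):

          static struct bpf_map *__bpf_map_get(struct fd f)
          {
              if (!f.file)
                  return ERR_PTR(-EBADF);
              if (f.file->f_op != &bpf_map_fops) {
                  fdput(f);
                  return ERR_PTR(-EINVAL);
              }
              /* the signature is documentation enough */
              return f.file->private_data;
          }
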
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: abstract anon_inode_getfd invocations · aa79781b
      Daniel Borkmann authored
      Since we're going to use anon_inode_getfd() invocations in more than just
      the current places, make a helper function for both, so that we only need
      to pass a map/prog pointer to the helper itself in order to get a fd. The
      new helpers are called bpf_map_new_fd() and bpf_prog_new_fd().
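
      A sketch mirroring that description (the flags are an assumption):

          int bpf_map_new_fd(struct bpf_map *map)
          {
              return anon_inode_getfd("bpf-map", &bpf_map_fops, map,
                                      O_RDWR | O_CLOEXEC);
          }
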
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: convert hashtab lock to raw lock · ac00881f
      Yang Shi authored
      When running bpf samples on an RT kernel, the following warning is
      reported:
      
      BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:917
      in_atomic(): 1, irqs_disabled(): 128, pid: 477, name: ping
      Preemption disabled at:[<ffff80000017db58>] kprobe_perf_func+0x30/0x228
      
      CPU: 3 PID: 477 Comm: ping Not tainted 4.1.10-rt8 #4
      Hardware name: Freescale Layerscape 2085a RDB Board (DT)
      Call trace:
      [<ffff80000008a5b0>] dump_backtrace+0x0/0x128
      [<ffff80000008a6f8>] show_stack+0x20/0x30
      [<ffff8000007da90c>] dump_stack+0x7c/0xa0
      [<ffff8000000e4830>] ___might_sleep+0x188/0x1a0
      [<ffff8000007e2200>] rt_spin_lock+0x28/0x40
      [<ffff80000018bf9c>] htab_map_update_elem+0x124/0x320
      [<ffff80000018c718>] bpf_map_update_elem+0x40/0x58
      [<ffff800000187658>] __bpf_prog_run+0xd48/0x1640
      [<ffff80000017ca6c>] trace_call_bpf+0x8c/0x100
      [<ffff80000017db58>] kprobe_perf_func+0x30/0x228
      [<ffff80000017dd84>] kprobe_dispatcher+0x34/0x58
      [<ffff8000007e399c>] kprobe_handler+0x114/0x250
      [<ffff8000007e3bf4>] kprobe_breakpoint_handler+0x1c/0x30
      [<ffff800000085b80>] brk_handler+0x88/0x98
      [<ffff8000000822f0>] do_debug_exception+0x50/0xb8
      Exception stack(0xffff808349687460 to 0xffff808349687580)
      7460: 4ca2b600 ffff8083 4a3a7000 ffff8083 49687620 ffff8083 0069c5f8 ffff8000
      7480: 00000001 00000000 007e0628 ffff8000 496874b0 ffff8083 007e1de8 ffff8000
      74a0: 496874d0 ffff8083 0008e04c ffff8000 00000001 00000000 4ca2b600 ffff8083
      74c0: 00ba2e80 ffff8000 49687528 ffff8083 49687510 ffff8083 000e5c70 ffff8000
      74e0: 00c22348 ffff8000 00000000 ffff8083 49687510 ffff8083 000e5c74 ffff8000
      7500: 4ca2b600 ffff8083 49401800 ffff8083 00000001 00000000 00000000 00000000
      7520: 496874d0 ffff8083 00000000 00000000 00000000 00000000 00000000 00000000
      7540: 2f2e2d2c 33323130 00000000 00000000 4c944500 ffff8083 00000000 00000000
      7560: 00000000 00000000 008751e0 ffff8000 00000001 00000000 124e2d1d 00107b77
      
      Convert the hashtab lock to a raw lock to avoid this warning.
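
      The shape of the conversion, as a sketch (struct and field names are
      assumptions):

          struct bpf_htab_sketch {
              struct hlist_head *buckets;
              raw_spinlock_t lock;  /* was spinlock_t: on PREEMPT_RT a plain
                                     * spinlock becomes a sleeping rtmutex,
                                     * illegal in kprobe/atomic context */
          };

          static void htab_update_sketch(struct bpf_htab_sketch *htab)
          {
              unsigned long flags;

              raw_spin_lock_irqsave(&htab->lock, flags);
              /* ... update the hash bucket ... */
              raw_spin_unlock_irqrestore(&htab->lock, flags);
          }
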
      Signed-off-by: Yang Shi <yang.shi@linaro.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 30 Oct 2015, 1 commit
    • blktrace: re-write setting q->blk_trace · cdea01b2
      Davidlohr Bueso authored
      This is really about simplifying the double xchg patterns into
      a single cmpxchg, with the same logic. Other than the immediate
      cleanup, there are some subtleties this change deals with:
      
      (i) While the load of the old bt is fully ordered wrt everything,
      ie:
      
              old_bt = xchg(&q->blk_trace, bt);             [barrier]
              if (old_bt)
      	     (void) xchg(&q->blk_trace, old_bt);    [barrier]
      
      blk_trace could still be changed between the xchg and the old_bt
      load. Note that this race is merely theoretical and the window,
      afaict, very small, but doing everything in a single context with
      cmpxchg closes it.
      
      (ii) Ordering guarantees are obviously kept with cmpxchg.
      
      (iii) Gets rid of the hacky-by-nature (void)xchg pattern.
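
      A sketch of the resulting pattern (reusing the variable names from the
      quoted code; the error label is hypothetical):

          /* publish bt only if no trace is installed: one atomic step */
          if (cmpxchg(&q->blk_trace, NULL, bt) != NULL) {
              ret = -EBUSY;   /* someone else installed a trace first */
              goto free_bts;
          }
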
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  5. 29 Oct 2015, 1 commit
    • cgroup: fix race condition around termination check in css_task_iter_next() · d5745675
      Tejun Heo authored
      css_task_iter_next() checked @it->cur_task before grabbing
      css_set_lock and assumed that the result wouldn't change afterwards;
      however, tasks could leave the cgroup being iterated, terminating the
      iterator before css_set_lock is acquired.  If this happens,
      css_task_iter_next() tries to calculate the current task from a NULL
      cg_list pointer, leading to the following oops.
      
       BUG: unable to handle kernel paging request at fffffffffffff7d0
       IP: [<ffffffff810d5f22>] css_task_iter_next+0x42/0x80
       ...
       CPU: 4 PID: 6391 Comm: JobQDisp2 Not tainted 4.0.9-22_fbk4_rc3_81616_ge8d9cb6 #1
       Hardware name: Quanta Freedom/Winterfell, BIOS F03_3B08 03/04/2014
       task: ffff880868e46400 ti: ffff88083404c000 task.ti: ffff88083404c000
       RIP: 0010:[<ffffffff810d5f22>]  [<ffffffff810d5f22>] css_task_iter_next+0x42/0x80
       RSP: 0018:ffff88083404fd28  EFLAGS: 00010246
       RAX: 0000000000000000 RBX: ffff88083404fd68 RCX: ffff8804697fb8b0
       RDX: fffffffffffff7c0 RSI: ffff8803b7dff800 RDI: ffffffff822c0278
       RBP: ffff88083404fd38 R08: 0000000000017160 R09: ffff88046f4070c0
       R10: ffffffff810d61f7 R11: 0000000000000293 R12: ffff880863bf8400
       R13: ffff88046b87fd80 R14: 0000000000000000 R15: ffff88083404fe58
       FS:  00007fa0567e2700(0000) GS:ffff88046f900000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: fffffffffffff7d0 CR3: 0000000469568000 CR4: 00000000001406e0
       Stack:
        0000000000000246 0000000000000000 ffff88083404fde8 ffffffff810d6248
        ffff88083404fd68 0000000000000000 ffff8803b7dff800 000001ef000001ee
        0000000000000000 0000000000000000 ffff880863bf8568 0000000000000000
       Call Trace:
        [<ffffffff810d6248>] cgroup_pidlist_start+0x258/0x550
        [<ffffffff810cf66d>] cgroup_seqfile_start+0x1d/0x20
        [<ffffffff8121f8ef>] kernfs_seq_start+0x5f/0xa0
        [<ffffffff811cab76>] seq_read+0x166/0x380
        [<ffffffff812200fd>] kernfs_fop_read+0x11d/0x180
        [<ffffffff811a7398>] __vfs_read+0x18/0x50
        [<ffffffff811a745d>] vfs_read+0x8d/0x150
        [<ffffffff811a756f>] SyS_read+0x4f/0xb0
        [<ffffffff818d4772>] system_call_fastpath+0x12/0x17
      
      Fix it by moving the termination condition check inside css_set_lock.
      @it->cur_task is now cleared after being put and @it->task_pos is
      tested for termination instead of @it->cset_pos as they indicate the
      same condition and @it->task_pos is what's being dereferenced.
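
      The shape of the fix, as a sketch (lock flavor and helper names are
      assumptions based on the description above):

          struct task_struct *css_task_iter_next(struct css_task_iter *it)
          {
              struct task_struct *res = NULL;

              spin_lock_bh(&css_set_lock);
              /* termination is now tested under the lock, and on the
               * pointer that actually gets dereferenced */
              if (it->task_pos) {
                  res = list_entry(it->task_pos, struct task_struct,
                                   cg_list);
                  css_task_iter_advance(it);
              }
              spin_unlock_bh(&css_set_lock);
              return res;
          }
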
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Calvin Owens <calvinowens@fb.com>
      Fixes: ed27b9f7 ("cgroup: don't hold css_set_rwsem across css task iteration")
      Acked-by: Zefan Li <lizefan@huawei.com>
  6. 28 Oct 2015, 1 commit
    • seccomp, ptrace: add support for dumping seccomp filters · f8e529ed
      Tycho Andersen authored
      This patch adds support for dumping a process' (classic BPF) seccomp
      filters via ptrace.
      
      PTRACE_SECCOMP_GET_FILTER allows the tracer to dump the tracee's classic
      BPF seccomp filters. addr should be an integer which represents the ith
      seccomp filter (0 is the most recently installed filter). data should be
      a struct sock_filter * with enough room for the ith filter, or NULL, in
      which case the filter is not saved. The return value for this command is
      the number of BPF instructions the program represents, or negative in
      the case of errors. Command-specific errors are ENOENT, which indicates
      that there is no ith filter in this seccomp tree, and EMEDIUMTYPE, which
      indicates that the ith filter was not installed as a classic BPF filter.
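
      A hedged userspace usage sketch (error handling elided;
      PTRACE_SECCOMP_GET_FILTER comes from the new kernel UAPI headers):

          #include <stdlib.h>
          #include <sys/types.h>
          #include <sys/ptrace.h>
          #include <linux/filter.h>

          /* dump the tracee's most recently installed filter (index 0) */
          static struct sock_filter *dump_filter(pid_t pid, long *len)
          {
              struct sock_filter *insns;

              /* data == NULL: just ask how many BPF instructions exist */
              *len = ptrace(PTRACE_SECCOMP_GET_FILTER, pid, 0, NULL);
              if (*len < 0)
                  return NULL;    /* e.g. ENOENT or EMEDIUMTYPE */

              insns = calloc(*len, sizeof(*insns));
              ptrace(PTRACE_SECCOMP_GET_FILTER, pid, 0, insns);
              return insns;
          }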
      
      A caveat with this approach is that there is no way to explicitly get
      at the hierarchy of seccomp filters, and users need to memcmp() filters
      to decide which are inherited. This means that a task which installs
      two of the same filter can potentially confuse users of this interface.
      
      v2: * make save_orig const
          * check that the orig_prog exists (not necessary right now, but
             it will be once this grows eBPF support)
          * s/n/filter_off and make it an unsigned long to match ptrace
          * count "down" the tree instead of "up" when passing a filter offset
      
      v3: * don't take the current task's lock for inspecting its seccomp mode
          * use a 0x42** constant for the ptrace command value
      
      v4: * don't copy to userspace while holding spinlocks
      
      v5: * add another condition to WARN_ON
      
      v6: * rebase on net-next
      Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      CC: Will Drewry <wad@chromium.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      CC: Andy Lutomirski <luto@amacapital.net>
      CC: Pavel Emelyanov <xemul@parallels.com>
      CC: Serge E. Hallyn <serge.hallyn@ubuntu.com>
      CC: Alexei Starovoitov <ast@kernel.org>
      CC: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 27 Oct 2015, 3 commits
  8. 26 Oct 2015, 1 commit
  9. 23 Oct 2015, 2 commits
  10. 22 Oct 2015, 3 commits