1. 28 Jul, 2014 1 commit
  2. 26 Jul, 2014 2 commits
  3. 24 Jul, 2014 1 commit
    • CAPABILITIES: remove undefined caps from all processes · 7d8b6c63
      Eric Paris committed
      This is effectively a revert of 7b9a7ec5
      plus fixing it a different way...
      
      We found, when trying to run an application from an application which
      had dropped privs that the kernel does security checks on undefined
      capability bits.  This was ESPECIALLY difficult to debug as those
      undefined bits are hidden from /proc/$PID/status.
      
      Consider a root application which drops all capabilities from ALL 4
      capability sets.  We assume, since the application is going to set
      eff/perm/inh from an array that it will clear not only the defined caps
      less than CAP_LAST_CAP, but also the higher 28ish bits which are
      undefined future capabilities.
      
      The BSET gets cleared differently.  Instead it is cleared one bit at a
      time.  The problem here is that in security/commoncap.c::cap_task_prctl()
      we actually check the validity of a capability being read.  So any task
      which attempts to 'read all things set in bset' followed by 'unset all
      things set in bset' will not even attempt to unset the undefined bits
      higher than CAP_LAST_CAP.
      
      So the 'parent' will look something like:
      CapInh:	0000000000000000
      CapPrm:	0000000000000000
      CapEff:	0000000000000000
      CapBnd:	ffffffc000000000
      
      All of this 'should' be fine.  Given that these are undefined bits that
      aren't supposed to have anything to do with permissions.  But they do...
      
      So let's now consider a task which cleared the eff/perm/inh completely
      and cleared all of the valid caps in the bset (but not the invalid caps
      it couldn't read out of the kernel).  We know that this is exactly what
      the libcap-ng library does and what the go capabilities library does.
      They both leave you in that above situation if you try to clear all of
      your capabilities from all 4 sets.  If that root task calls execve()
      the child task will pick up all caps not blocked by the bset.  The bset
      however does not block bits higher than CAP_LAST_CAP.  So now the child
      task has bits in eff which are not in the parent.  These are
      'meaningless' undefined bits, but still bits which the parent doesn't
      have.
      
      The problem is now in cred_cap_issubset() (or any operation which does a
      subset test) as the child, while a subset for valid cap bits, is not a
      subset for invalid cap bits!  So now we set during commit creds that
      the child is not dumpable.  Given it is 'more priv' than its parent.  It
      also means the parent cannot ptrace the child and other stupidity.
      
      The solution here:
      1) stop hiding capability bits in status
      	This makes debugging easier!
      
      2) stop giving any task undefined capability bits.  It's simple: if you
      don't put those invalid bits in CAP_FULL_SET you won't get them in init
      and you won't get them in any other task either.
      	This fixes the cap_issubset() tests and resulting fallout (which
      	made the init task in a docker container untraceable among other
      	things)
      
      3) mask out undefined bits when sys_capset() is called as it might use
      ~0, ~0 to denote 'all capabilities' for backward/forward compatibility.
      	This lets 'capsh --caps="all=eip" -- -c /bin/bash' run.
      
      4) mask out undefined bits when we read a file capability off of disk as
      again likely all bits are set in the xattr for forward/backward
      compatibility.
      	This lets 'setcap all+pe /bin/bash; /bin/bash' run
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Andrew Vagin <avagin@openvz.org>
      Cc: Andrew G. Morgan <morgan@kernel.org>
      Cc: Serge E. Hallyn <serge.hallyn@canonical.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Steve Grubb <sgrubb@redhat.com>
      Cc: Dan Walsh <dwalsh@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: James Morris <james.l.morris@oracle.com>
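The subset failure described in the commit above can be reproduced with plain bit masks. The following is a minimal sketch, not kernel code: it assumes CAP_LAST_CAP is 37 (consistent with the CapBnd value ffffffc000000000 quoted in the message) and models a root execve() simply as "full set AND bounding set". The names is_subset and exec_as_root are hypothetical.

```python
# Simplified model of the capability subset problem; not kernel code.
CAP_LAST_CAP = 37                            # assumption, inferred from CapBnd above
U64 = (1 << 64) - 1
VALID_CAPS = (1 << (CAP_LAST_CAP + 1)) - 1   # bits 0..CAP_LAST_CAP
UNDEFINED_CAPS = U64 & ~VALID_CAPS           # bits above CAP_LAST_CAP

def is_subset(a, b):
    """True if every bit set in a is also set in b."""
    return (a & ~b & U64) == 0

def exec_as_root(full_set, bset):
    """Hypothetical model: a root exec gains every cap not blocked by bset."""
    return full_set & bset

# The parent dropped everything it could: eff is empty, but the undefined
# bits could not be removed from the bounding set.
parent_eff = 0
parent_bset = UNDEFINED_CAPS

child_eff_old = exec_as_root(U64, parent_bset)         # old full set: all 64 bits
child_eff_new = exec_as_root(VALID_CAPS, parent_bset)  # fixed: defined bits only

print(f"CapBnd: {parent_bset:016x}")
print("old child is subset of parent:", is_subset(child_eff_old, parent_eff))
print("new child is subset of parent:", is_subset(child_eff_new, parent_eff))
```

With the old full set the child ends up holding bits the parent lacks, so the subset test fails; once the full set contains only defined bits, the child's set is empty and the test passes.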
  4. 23 Jul, 2014 4 commits
  5. 19 Jul, 2014 5 commits
    • seccomp: implement SECCOMP_FILTER_FLAG_TSYNC · c2e1f2e3
      Kees Cook committed
      Applying restrictive seccomp filter programs to large or diverse
      codebases often requires handling threads which may be started early in
      the process lifetime (e.g., by code that is linked in). While it is
      possible to apply permissive programs prior to process start up, it is
      difficult to further restrict the kernel ABI to those threads after that
      point.
      
      This change adds a new seccomp syscall flag to SECCOMP_SET_MODE_FILTER for
      synchronizing thread group seccomp filters at filter installation time.
      
      When calling seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_TSYNC,
      filter) an attempt will be made to synchronize all threads in current's
      threadgroup to its new seccomp filter program. This is possible iff all
      threads are using a filter that is an ancestor to the filter current is
      attempting to synchronize to. NULL filters (where the task is running as
      SECCOMP_MODE_NONE) are also treated as ancestors allowing threads to be
      transitioned into SECCOMP_MODE_FILTER. If prctl(PR_SET_NO_NEW_PRIVS,
      ...) has been set on the calling thread, no_new_privs will be set for
      all synchronized threads too. On success, 0 is returned. On failure,
      the pid of one of the failing threads will be returned and no filters
      will have been applied.
      
      The race conditions against another thread are:
      - requesting TSYNC (already handled by sighand lock)
      - performing a clone (already handled by sighand lock)
      - changing its filter (already handled by sighand lock)
      - calling exec (handled by cred_guard_mutex)
      The clone case is assisted by the fact that new threads will have their
      seccomp state duplicated from their parent before appearing on the tasklist.
      
      Holding cred_guard_mutex means that seccomp filters cannot be assigned
      while in the middle of another thread's exec (potentially bypassing
      no_new_privs or similar). The call to de_thread() may kill threads waiting
      for the mutex.
      
      Changes across threads to the filter pointer include a barrier.
      
      Based on patches by Will Drewry.
      Suggested-by: Julien Tinnes <jln@chromium.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Andy Lutomirski <luto@amacapital.net>
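The ancestor rule and the all-or-nothing return value described in the TSYNC commit above can be sketched as follows. This is an illustrative model only, not the kernel implementation: the Filter class, is_ancestor, and tsync are hypothetical names, and locking is left out entirely.

```python
# Simplified model of the TSYNC ancestor check; not kernel code.
class Filter:
    """A seccomp filter program; prev points at the filter it was stacked on."""
    def __init__(self, name, prev=None):
        self.name = name
        self.prev = prev

def is_ancestor(parent, child):
    """True if parent is None (SECCOMP_MODE_NONE) or appears in child's chain."""
    if parent is None:
        return True
    while child is not None:
        if child is parent:
            return True
        child = child.prev
    return False

def tsync(caller_filter, threads):
    """threads maps pid -> current filter (or None).

    Returns 0 and updates every thread on success; otherwise returns the pid
    of one thread that cannot be synchronized and changes nothing.
    """
    for pid, flt in threads.items():
        if not is_ancestor(flt, caller_filter):
            return pid
    for pid in threads:
        threads[pid] = caller_filter
    return 0

base = Filter("base")
strict = Filter("strict", prev=base)   # stacked on top of base
other = Filter("other")                # unrelated filter chain

ok_threads = {101: None, 102: base}
bad_threads = {201: base, 202: other}
print(tsync(strict, ok_threads))
print(tsync(strict, bad_threads))
```

The first call succeeds because every thread is either unfiltered or running an ancestor of the caller's filter; the second reports the pid of the thread running an unrelated filter and leaves all threads untouched.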
    • seccomp: introduce writer locking · dbd95212
      Kees Cook committed
      Normally, task_struct.seccomp.filter is only ever read or modified by
      the task that owns it (current). This property aids in fast access
      during system call filtering as read access is lockless.
      
      Updating the pointer from another task, however, opens up race
      conditions. To allow cross-thread filter pointer updates, writes to the
      seccomp fields are now protected by the sighand spinlock (which is shared
      by all threads in the thread group). Read access remains lockless because
      pointer updates themselves are atomic.  However, writes (or cloning)
      often entail additional checking (like maximum instruction counts)
      which require locking to perform safely.
      
      In the case of cloning threads, the child is invisible to the system
      until it enters the task list. To make sure a child can't be cloned from
      a thread and left in a prior state, seccomp duplication is additionally
      moved under the sighand lock. Then parent and child are certain to have
      the same seccomp state when they exit the lock.
      
      Based on patches by Will Drewry and David Drysdale.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Andy Lutomirski <luto@amacapital.net>
    • sched: move no_new_privs into new atomic flags · 1d4457f9
      Kees Cook committed
      Since seccomp transitions between threads requires updates to the
      no_new_privs flag to be atomic, the flag must be part of an atomic flag
      set. This moves the nnp flag into a separate task field, and introduces
      accessors.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Andy Lutomirski <luto@amacapital.net>
    • seccomp: add "seccomp" syscall · 48dc92b9
      Kees Cook committed
      This adds the new "seccomp" syscall with both an "operation" and "flags"
      parameter for future expansion. The third argument is a pointer value,
      used with the SECCOMP_SET_MODE_FILTER operation. Currently, flags must
      be 0. This is functionally equivalent to prctl(PR_SET_SECCOMP, ...).
      
      In addition to the TSYNC flag later in this patch series, there is a
      non-zero chance that this syscall could be used for configuring a fixed
      argument area for seccomp-tracer-aware processes to pass syscall arguments
      in the future. Hence, the use of "seccomp" not simply "seccomp_add_filter"
      for this syscall. Additionally, this syscall uses operation, flags,
      and user pointer for arguments because strictly passing arguments via
      a user pointer would mean seccomp itself would be unable to trivially
      filter the seccomp syscall itself.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Andy Lutomirski <luto@amacapital.net>
    • KEYS: Provide a generic instantiation function · 6a09d17b
      David Howells committed
      Provide a generic instantiation function for key types that use the preparse
      hook.  This makes it easier to prereserve key quota before keyrings get locked
      to retain the new key.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Steve Dickson <steved@redhat.com>
      Acked-by: Jeff Layton <jlayton@primarydata.com>
      Reviewed-by: Sage Weil <sage@redhat.com>
  6. 18 Jul, 2014 1 commit
    • KEYS: Allow special keys (eg. DNS results) to be invalidated by CAP_SYS_ADMIN · 0c7774ab
      David Howells committed
      Special kernel keys, such as those used to hold DNS results for AFS, CIFS and
      NFS and those used to hold idmapper results for NFS, used to be
      'invalidateable' with key_revoke().  However, since the default permissions for
      keys were reduced:
      
      	Commit: 96b5c8fe
      	KEYS: Reduce initial permissions on keys
      
      it has become impossible to do this.
      
      Add a key flag (KEY_FLAG_ROOT_CAN_INVAL) that will permit a key to be
      invalidated by root.  This should not be used for system keyrings as the
      garbage collector will try and remove any invalidated key.  For system keyrings,
      KEY_FLAG_ROOT_CAN_CLEAR can be used instead.
      
      After this, from userspace, keyctl_invalidate() and "keyctl invalidate" can be
      used by any possessor of CAP_SYS_ADMIN (typically root) to invalidate DNS and
      idmapper keys.  Invalidated keys are immediately garbage collected and will be
      immediately rerequested if needed again.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-by: Steve Dickson <steved@redhat.com>
  7. 17 Jul, 2014 2 commits
  8. 11 Jul, 2014 1 commit
  9. 10 Jul, 2014 1 commit
    • selinux: fix the default socket labeling in sock_graft() · 4da6daf4
      Paul Moore committed
      The sock_graft() hook has special handling for AF_INET, AF_INET6, and
      AF_UNIX sockets as those address families have special hooks which
      label the sock before it is attached to its associated socket.
      Unfortunately, the sock_graft() hook was missing a default approach
      to labeling sockets which meant that any other address family which
      made use of connections or the accept() syscall would find the
      returned socket to be in an "unlabeled" state.  This was recently
      demonstrated by the kcrypto/AF_ALG subsystem and the newly released
      cryptsetup package (cryptsetup v1.6.5 and later).
      
      This patch preserves the special handling in selinux_sock_graft(),
      but adds a default behavior - setting the sock's label equal to the
      associated socket - which resolves the problem with AF_ALG and
      presumably any other address family which makes use of accept().
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Paul Moore <pmoore@redhat.com>
      Tested-by: Milan Broz <gmazyland@gmail.com>
  10. 09 Jul, 2014 3 commits
  11. 08 Jul, 2014 4 commits
  12. 04 Jul, 2014 3 commits
  13. 03 Jul, 2014 1 commit
    • kernfs: kernfs_notify() must be useable from non-sleepable contexts · ecca47ce
      Tejun Heo committed
      d911d987 ("kernfs: make kernfs_notify() trigger inotify events
      too") added fsnotify triggering to kernfs_notify() which requires a
      sleepable context.  There are already existing users of
      kernfs_notify() which invoke it from an atomic context and in general
      it's silly to require a sleepable context for triggering a
      notification.
      
      The following is an invalid context bug triggered by md invoking
      sysfs_notify() from IO completion path.
      
       BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586
       in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/1
       2 locks held by swapper/1/0:
        #0:  (&(&vblk->vq_lock)->rlock){-.-...}, at: [<ffffffffa0039042>] virtblk_done+0x42/0xe0 [virtio_blk]
        #1:  (&(&bitmap->counts.lock)->rlock){-.....}, at: [<ffffffff81633718>] bitmap_endwrite+0x68/0x240
       irq event stamp: 33518
       hardirqs last  enabled at (33515): [<ffffffff8102544f>] default_idle+0x1f/0x230
       hardirqs last disabled at (33516): [<ffffffff818122ed>] common_interrupt+0x6d/0x72
       softirqs last  enabled at (33518): [<ffffffff810a1272>] _local_bh_enable+0x22/0x50
       softirqs last disabled at (33517): [<ffffffff810a29e0>] irq_enter+0x60/0x80
       CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.16.0-0.rc2.git2.1.fc21.x86_64 #1
       Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        0000000000000000 f90db13964f4ee05 ffff88007d403b80 ffffffff81807b4c
        0000000000000000 ffff88007d403ba8 ffffffff810d4f14 0000000000000000
        0000000000441800 ffff880078fa1780 ffff88007d403c38 ffffffff8180caf2
       Call Trace:
        <IRQ>  [<ffffffff81807b4c>] dump_stack+0x4d/0x66
        [<ffffffff810d4f14>] __might_sleep+0x184/0x240
        [<ffffffff8180caf2>] mutex_lock_nested+0x42/0x440
        [<ffffffff812d76a0>] kernfs_notify+0x90/0x150
        [<ffffffff8163377c>] bitmap_endwrite+0xcc/0x240
        [<ffffffffa00de863>] close_write+0x93/0xb0 [raid1]
        [<ffffffffa00df029>] r1_bio_write_done+0x29/0x50 [raid1]
        [<ffffffffa00e0474>] raid1_end_write_request+0xe4/0x260 [raid1]
        [<ffffffff813acb8b>] bio_endio+0x6b/0xa0
        [<ffffffff813b46c4>] blk_update_request+0x94/0x420
        [<ffffffff813bf0ea>] blk_mq_end_io+0x1a/0x70
        [<ffffffffa00392c2>] virtblk_request_done+0x32/0x80 [virtio_blk]
        [<ffffffff813c0648>] __blk_mq_complete_request+0x88/0x120
        [<ffffffff813c070a>] blk_mq_complete_request+0x2a/0x30
        [<ffffffffa0039066>] virtblk_done+0x66/0xe0 [virtio_blk]
        [<ffffffffa002535a>] vring_interrupt+0x3a/0xa0 [virtio_ring]
        [<ffffffff81116177>] handle_irq_event_percpu+0x77/0x340
        [<ffffffff8111647d>] handle_irq_event+0x3d/0x60
        [<ffffffff81119436>] handle_edge_irq+0x66/0x130
        [<ffffffff8101c3e4>] handle_irq+0x84/0x150
        [<ffffffff818146ad>] do_IRQ+0x4d/0xe0
        [<ffffffff818122f2>] common_interrupt+0x72/0x72
        <EOI>  [<ffffffff8105f706>] ? native_safe_halt+0x6/0x10
        [<ffffffff81025454>] default_idle+0x24/0x230
        [<ffffffff81025f9f>] arch_cpu_idle+0xf/0x20
        [<ffffffff810f5adc>] cpu_startup_entry+0x37c/0x7b0
        [<ffffffff8104df1b>] start_secondary+0x25b/0x300
      
      This patch fixes it by punting the notification delivery through a
      work item.  This ends up adding an extra pointer to kernfs_elem_attr
      enlarging kernfs_node by a pointer, which is not ideal but not a very
      big deal either.  If this turns out to be an actual issue, we can move
      kernfs_elem_attr->size to kernfs_node->iattr later.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Josh Boyer <jwboyer@fedoraproject.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
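The general idea behind the fix above, handing the sleepable part of a notification to a worker so the caller never blocks, can be sketched in user space. This is an analogy under stated assumptions rather than kernfs code: the queue, the worker thread, and the attribute names are all hypothetical stand-ins for the kernel's work item mechanism.

```python
# Sketch of deferring sleepable work to a worker; not kernfs code.
import queue
import threading

work_queue = queue.Queue()          # unbounded, so put_nowait() never blocks
delivery_mutex = threading.Lock()   # stands in for the mutex that may sleep
delivered = []

def notify(node):
    """Safe from a context that must not block: only queues a work item."""
    work_queue.put_nowait(node)

def notify_workfn():
    """Worker: performs the part that may sleep (here, taking a mutex)."""
    while True:
        node = work_queue.get()
        if node is None:            # sentinel: stop the worker
            break
        with delivery_mutex:
            delivered.append(node)

worker = threading.Thread(target=notify_workfn)
worker.start()
notify("md/sync_action")            # hypothetical attribute names
notify("md/degraded")
work_queue.put(None)
worker.join()
print(delivered)
```

The caller's cost is a single non-blocking enqueue; the price, as the commit notes for the real change, is some extra state per notifiable object.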
  14. 02 Jul, 2014 1 commit
    • core: fix typo in percpu read_mostly section · 330d2822
      Zhengyu He committed
      This fixes a typo that named the read_mostly section of percpu as
      readmostly. It works fine with SMP because the linker script specifies
      .data..percpu..readmostly. However, UP kernel builds don't have percpu
      sections defined and the non-percpu version of the section is called
      data..read_mostly, so .data..readmostly will float around and may break
      things unexpectedly.
      
      Looking at the original change that introduced data..percpu..readmostly
      (commit c957ef2c), it looks like this
      was the original intention.
      
      Tested: Built UP kernel and confirmed the sections got merged.
      
      - Before the patch:
      $ objdump -h vmlinux.o  | grep '\.data\.\.read.*mostly'
      38 .data..read_mostly 00004418  0000000000000000  0000000000000000  00431ac0  2**6
      50 .data..readmostly 00000014  0000000000000000  0000000000000000  00444000  2**3
      
      - After the patch:
      $ objdump -h vmlinux.o  | grep '\.data\.\.read.*mostly'
      38 .data..read_mostly 00004438  0000000000000000  0000000000000000  00431ac0  2**6
      Signed-off-by: Zhengyu He <hzy@google.com>
      Signed-off-by: Filipe Brandenburger <filbranden@google.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  15. 01 Jul, 2014 1 commit
  16. 30 Jun, 2014 2 commits
    • kernfs: introduce kernfs_pin_sb() · 4e26445f
      Li Zefan committed
      kernfs_pin_sb() tries to get a refcnt of the superblock.
      
      This will be used by cgroupfs.
      
      v2:
      - make kernfs_pin_sb() return the superblock.
      - drop kernfs_drop_sb().
      
      tj: Updated the comment a bit.
      
      [ This is a prerequisite for a bugfix. ]
      Cc: <stable@vger.kernel.org> # 3.15
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • clk: exynos5420: Remove aclk66_peric from the clock tree description · 44ff0254
      Doug Anderson committed
      The "aclk66_peric" clock is a gate clock with a whole bunch of gates
      underneath it.  This big gate isn't very useful to include in our
      clock tree.  If any of the children need to be turned on then the big
      gate will need to be on anyway.  ...and there are plenty of other "big
      gates" that aren't described in our clock tree, some of which shut off
      collections of clocks that have no relationship in the hierarchy so
      are hard to model.
      
      "aclk66_peric" is causing earlyprintk problems since it gets disabled
      as part of the boot process, so let's just remove it.
      
      Strangely (and for no good reason) this clock is exported as part of
      the common clock bindings.  Remove it since there are no in-kernel
      device trees using it and no reason anyone out of tree should refer to
      it either.
      Signed-off-by: Doug Anderson <dianders@chromium.org>
      Signed-off-by: Tomasz Figa <t.figa@samsung.com>
  17. 29 Jun, 2014 1 commit
  18. 28 Jun, 2014 1 commit
  19. 27 Jun, 2014 3 commits
  20. 26 Jun, 2014 1 commit
    • ipv4: fix dst race in sk_dst_get() · f8864972
      Eric Dumazet committed
      When the IP route cache was removed in linux-3.6, we broke the assumption
      that dst entries were all freed after an RCU grace period. DST_NOCACHE
      dst were supposed to be freed from dst_release(). But it appears
      we want to keep such dst around, either in UDP sockets or tunnels.
      
      In sk_dst_get() we need to make sure dst refcount is not 0
      before incrementing it, or else we might end up freeing a dst
      twice.
      
      DST_NOCACHE set on a dst does not mean this dst can not be attached
      to a socket or a tunnel.
      
      Then, before actual freeing, we need to observe a rcu grace period
      to make sure all other cpus can catch the fact the dst is no longer
      usable.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Dormando <dormando@rydia.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
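The rule stated in the commit above, "make sure dst refcount is not 0 before incrementing it", is the familiar increment-unless-zero pattern. Below is a minimal sketch of that pattern, not the kernel code: the Dst class and hold_unless_zero are hypothetical, and a lock stands in for what would be a single atomic operation in the kernel.

```python
# Sketch of "take a reference only if the count is not already zero".
import threading

class Dst:
    def __init__(self):
        self.refcnt = 1
        self._lock = threading.Lock()   # stands in for an atomic operation

    def hold_unless_zero(self):
        """Increment refcnt and return True, unless it has already hit 0."""
        with self._lock:
            if self.refcnt == 0:
                return False
            self.refcnt += 1
            return True

    def release(self):
        """Drop one reference and return the remaining count."""
        with self._lock:
            self.refcnt -= 1
            return self.refcnt

def sk_dst_get(sk_dst):
    """Return the dst with a reference held, or None if it is being freed."""
    if sk_dst is not None and sk_dst.hold_unless_zero():
        return sk_dst
    return None

live = Dst()
dying = Dst()
dying.release()                 # refcnt drops to 0; freeing is pending
print(sk_dst_get(live) is live, live.refcnt)
print(sk_dst_get(dying))
```

A reader that finds the count already at zero gets nothing back rather than resurrecting an object that is about to be freed, which is what prevents the double free the commit describes.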
  21. 25 Jun, 2014 1 commit
    • block: add support for limiting gaps in SG lists · 66cb45aa
      Jens Axboe committed
      Another restriction inherited for NVMe - those devices don't support
      SG lists that have "gaps" in them. Gaps refer to cases where the
      previous SG entry doesn't end on a page boundary. For NVMe, all SG
      entries must start at offset 0 (except the first) and end on a page
      boundary (except the last).
      Signed-off-by: Jens Axboe <axboe@fb.com>
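The gap rule in the commit above (every entry except the first starts at offset 0, every entry except the last ends on a page boundary) can be written as a small check. This is a sketch under assumptions, not the block layer code: a 4096-byte page is assumed, each entry is modeled as an (offset, length) pair with the offset taken relative to the start of its page, and sg_has_gap is a hypothetical name.

```python
# Sketch of the "no gaps" rule for a scatter-gather list; not kernel code.
PAGE_SIZE = 4096  # assumption

def sg_has_gap(entries, page_size=PAGE_SIZE):
    """entries: list of (offset, length) pairs, offset relative to page start."""
    last = len(entries) - 1
    for i, (offset, length) in enumerate(entries):
        # Every entry except the first must start at offset 0.
        if i != 0 and offset % page_size != 0:
            return True
        # Every entry except the last must end on a page boundary.
        if i != last and (offset + length) % page_size != 0:
            return True
    return False

# First entry may start mid-page, last entry may end mid-page.
print(sg_has_gap([(512, 3584), (0, 4096), (0, 1024)]))
# First entry stops short of the page boundary, leaving a gap.
print(sg_has_gap([(0, 2048), (0, 4096)]))
```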