1. 07 1月, 2011 5 次提交
    • N
      fs: rcu-walk for path lookup · 31e6b01f
      Nick Piggin 提交于
      Perform common cases of path lookups without any stores or locking in the
      ancestor dentry elements. This is called rcu-walk, as opposed to the current
      algorithm which is a refcount based walk, or ref-walk.
      
      This results in far fewer atomic operations on every path element,
      significantly improving path lookup performance. It also avoids cacheline
      bouncing on common dentries, significantly improving scalability.
      
      The overall design is like this:
      * LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
      * Take the RCU lock for the entire path walk, starting with the acquiring
        of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
        not required for dentry persistence.
      * synchronize_rcu is called when unregistering a filesystem, so we can
        access d_ops and i_ops during rcu-walk.
      * Similarly take the vfsmount lock for the entire path walk. So now mnt
        refcounts are not required for persistence. Also we are free to perform mount
        lookups, and to assume dentry mount points and mount roots are stable up and
        down the path.
      * Have a per-dentry seqlock to protect the dentry name, parent, and inode,
        so we can load this tuple atomically, and also check whether any of its
        members have changed.
      * Dentry lookups (based on parent, candidate string tuple) recheck the parent
        sequence after the child is found in case anything changed in the parent
        during the path walk.
      * inode is also RCU protected so we can load d_inode and use the inode for
        limited things.
      * i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
      * i_op can be loaded.
      
      When we reach the destination dentry, we lock it, recheck lookup sequence,
      and increment its refcount and mountpoint refcount. RCU and vfsmount locks
      are dropped. This is termed "dropping rcu-walk". If the dentry refcount does
      not match, we can not drop rcu-walk gracefully at the current point in the
      lokup, so instead return -ECHILD (for want of a better errno). This signals the
      path walking code to re-do the entire lookup with a ref-walk.
      
      Aside from the final dentry, there are other situations that may be encounted
      where we cannot continue rcu-walk. In that case, we drop rcu-walk (ie. take
      a reference on the last good dentry) and continue with a ref-walk. Again, if
      we can drop rcu-walk gracefully, we return -ECHILD and do the whole lookup
      using ref-walk. But it is very important that we can continue with ref-walk
      for most cases, particularly to avoid the overhead of double lookups, and to
      gain the scalability advantages on common path elements (like cwd and root).
      
      The cases where rcu-walk cannot continue are:
      * NULL dentry (ie. any uncached path element)
      * parent with d_inode->i_op->permission or ACLs
      * dentries with d_revalidate
      * Following links
      
      In future patches, permission checks and d_revalidate become rcu-walk aware. It
      may be possible eventually to make following links rcu-walk aware.
      
      Uncached path elements will always require dropping to ref-walk mode, at the
      very least because i_mutex needs to be grabbed, and objects allocated.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      31e6b01f
    • N
      fs: dcache rationalise dget variants · dc0474be
      Nick Piggin 提交于
      dget_locked was a shortcut to avoid the lazy lru manipulation when we already
      held dcache_lock (lru manipulation was relatively cheap at that point).
      However, how that the lru lock is an innermost one, we never hold it at any
      caller, so the lock cost can now be avoided. We already have well working lazy
      dcache LRU, so it should be fine to defer LRU manipulations to scan time.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      dc0474be
    • N
      fs: dcache remove dcache_lock · b5c84bf6
      Nick Piggin 提交于
      dcache_lock no longer protects anything. remove it.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      b5c84bf6
    • N
      fs: dcache scale subdirs · 2fd6b7f5
      Nick Piggin 提交于
      Protect d_subdirs and d_child with d_lock, except in filesystems that aren't
      using dcache_lock for these anyway (eg. using i_mutex).
      
      Note: if we change the locking rule in future so that ->d_child protection is
      provided only with ->d_parent->d_lock, it may allow us to reduce some locking.
      But it would be an exception to an otherwise regular locking scheme, so we'd
      have to see some good results. Probably not worthwhile.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      2fd6b7f5
    • N
      fs: dcache scale d_unhashed · da502956
      Nick Piggin 提交于
      Protect d_unhashed(dentry) condition with d_lock. This means keeping
      DCACHE_UNHASHED bit in synch with hash manipulations.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      da502956
  2. 06 1月, 2011 1 次提交
  3. 04 1月, 2011 1 次提交
    • M
      ima: fix add LSM rule bug · 867c2026
      Mimi Zohar 提交于
      If security_filter_rule_init() doesn't return a rule, then not everything
      is as fine as the return code implies.
      
      This bug only occurs when the LSM (eg. SELinux) is disabled at runtime.
      
      Adding an empty LSM rule causes ima_match_rules() to always succeed,
      ignoring any remaining rules.
      
       default IMA TCB policy:
        # PROC_SUPER_MAGIC
        dont_measure fsmagic=0x9fa0
        # SYSFS_MAGIC
        dont_measure fsmagic=0x62656572
        # DEBUGFS_MAGIC
        dont_measure fsmagic=0x64626720
        # TMPFS_MAGIC
        dont_measure fsmagic=0x01021994
        # SECURITYFS_MAGIC
        dont_measure fsmagic=0x73636673
      
        < LSM specific rule >
        dont_measure obj_type=var_log_t
      
        measure func=BPRM_CHECK
        measure func=FILE_MMAP mask=MAY_EXEC
        measure func=FILE_CHECK mask=MAY_READ uid=0
      
      Thus without the patch, with the boot parameters 'tcb selinux=0', adding
      the above 'dont_measure obj_type=var_log_t' rule to the default IMA TCB
      measurement policy, would result in nothing being measured.  The patch
      prevents the default TCB policy from being replaced.
      Signed-off-by: NMimi Zohar <zohar@us.ibm.com>
      Cc: James Morris <jmorris@namei.org>
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Cc: David Safford <safford@watson.ibm.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      867c2026
  4. 24 12月, 2010 1 次提交
    • D
      KEYS: Don't call up_write() if __key_link_begin() returns an error · 3fc5e98d
      David Howells 提交于
      In construct_alloc_key(), up_write() is called in the error path if
      __key_link_begin() fails, but this is incorrect as __key_link_begin() only
      returns with the nominated keyring locked if it returns successfully.
      
      Without this patch, you might see the following in dmesg:
      
      	=====================================
      	[ BUG: bad unlock balance detected! ]
      	-------------------------------------
      	mount.cifs/5769 is trying to release lock (&key->sem) at:
      	[<ffffffff81201159>] request_key_and_link+0x263/0x3fc
      	but there are no more locks to release!
      
      	other info that might help us debug this:
      	3 locks held by mount.cifs/5769:
      	 #0:  (&type->s_umount_key#41/1){+.+.+.}, at: [<ffffffff81131321>] sget+0x278/0x3e7
      	 #1:  (&ret_buf->session_mutex){+.+.+.}, at: [<ffffffffa0258e59>] cifs_get_smb_ses+0x35a/0x443 [cifs]
      	 #2:  (root_key_user.cons_lock){+.+.+.}, at: [<ffffffff81201000>] request_key_and_link+0x10a/0x3fc
      
      	stack backtrace:
      	Pid: 5769, comm: mount.cifs Not tainted 2.6.37-rc6+ #1
      	Call Trace:
      	 [<ffffffff81201159>] ? request_key_and_link+0x263/0x3fc
      	 [<ffffffff81081601>] print_unlock_inbalance_bug+0xca/0xd5
      	 [<ffffffff81083248>] lock_release_non_nested+0xc1/0x263
      	 [<ffffffff81201159>] ? request_key_and_link+0x263/0x3fc
      	 [<ffffffff81201159>] ? request_key_and_link+0x263/0x3fc
      	 [<ffffffff81083567>] lock_release+0x17d/0x1a4
      	 [<ffffffff81073f45>] up_write+0x23/0x3b
      	 [<ffffffff81201159>] request_key_and_link+0x263/0x3fc
      	 [<ffffffffa026fe9e>] ? cifs_get_spnego_key+0x61/0x21f [cifs]
      	 [<ffffffff812013c5>] request_key+0x41/0x74
      	 [<ffffffffa027003d>] cifs_get_spnego_key+0x200/0x21f [cifs]
      	 [<ffffffffa026e296>] CIFS_SessSetup+0x55d/0x1273 [cifs]
      	 [<ffffffffa02589e1>] cifs_setup_session+0x90/0x1ae [cifs]
      	 [<ffffffffa0258e7e>] cifs_get_smb_ses+0x37f/0x443 [cifs]
      	 [<ffffffffa025a9e3>] cifs_mount+0x1aa1/0x23f3 [cifs]
      	 [<ffffffff8111fd94>] ? alloc_debug_processing+0xdb/0x120
      	 [<ffffffffa027002c>] ? cifs_get_spnego_key+0x1ef/0x21f [cifs]
      	 [<ffffffffa024cc71>] cifs_do_mount+0x165/0x2b3 [cifs]
      	 [<ffffffff81130e72>] vfs_kern_mount+0xaf/0x1dc
      	 [<ffffffff81131007>] do_kern_mount+0x4d/0xef
      	 [<ffffffff811483b9>] do_mount+0x6f4/0x733
      	 [<ffffffff8114861f>] sys_mount+0x88/0xc2
      	 [<ffffffff8100ac42>] system_call_fastpath+0x16/0x1b
      Reported-by: NJeff Layton <jlayton@redhat.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Reviewed-and-Tested-by: NJeff Layton <jlayton@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3fc5e98d
  5. 17 12月, 2010 1 次提交
    • E
      SELinux: define permissions for DCB netlink messages · 350e4f31
      Eric Paris 提交于
      Commit 2f90b865 added two new netlink message types to the netlink route
      socket.  SELinux has hooks to define if netlink messages are allowed to
      be sent or received, but it did not know about these two new message
      types.  By default we allow such actions so noone likely noticed.  This
      patch adds the proper definitions and thus proper permissions
      enforcement.
      Signed-off-by: NEric Paris <eparis@redhat.com>
      350e4f31
  6. 15 12月, 2010 4 次提交
  7. 08 12月, 2010 2 次提交
    • J
      Smack: Transmute labels on specified directories · 5c6d1125
      Jarkko Sakkinen 提交于
      In a situation where Smack access rules allow processes
      with multiple labels to write to a directory it is easy
      to get into a situation where the directory gets cluttered
      with files that the owner can't deal with because while
      they could be written to the directory a process at the
      label of the directory can't write them. This is generally
      the desired behavior, but when it isn't it is a real
      issue.
      
      This patch introduces a new attribute SMACK64TRANSMUTE that
      instructs Smack to create the file with the label of the directory
      under certain circumstances.
      
      A new access mode, "t" for transmute, is made available to
      Smack access rules, which are expanded from "rwxa" to "rwxat".
      If a file is created in a directory marked as transmutable
      and if access was granted to perform the operation by a rule
      that included the transmute mode, then the file gets the
      Smack label of the directory instead of the Smack label of the
      creating process.
      
      Note that this is equivalent to creating an empty file at the
      label of the directory and then having the other process write
      to it. The transmute scheme requires that both the access rule
      allows transmutation and that the directory be explicitly marked.
      Signed-off-by: NJarkko Sakkinen <ext-jarkko.2.sakkinen@nokia.com>
      Signed-off-by: NCasey Schaufler <casey@schaufler-ca.com>
      5c6d1125
    • E
      selinux: cache sidtab_context_to_sid results · 73ff5fc0
      Eric Paris 提交于
      sidtab_context_to_sid takes up a large share of time when creating large
      numbers of new inodes (~30-40% in oprofile runs).  This patch implements a
      cache of 3 entries which is checked before we do a full context_to_sid lookup.
      On one system this showed over a x3 improvement in the number of inodes that
      could be created per second and around a 20% improvement on another system.
      
      Any time we look up the same context string sucessivly (imagine ls -lZ) we
      should hit this cache hot.  A cache miss should have a relatively minor affect
      on performance next to doing the full table search.
      
      All operations on the cache are done COMPLETELY lockless.  We know that all
      struct sidtab_node objects created will never be deleted until a new policy is
      loaded thus we never have to worry about a pointer being dereferenced.  Since
      we also know that pointer assignment is atomic we know that the cache will
      always have valid pointers.  Given this information we implement a FIFO cache
      in an array of 3 pointers.  Every result (whether a cache hit or table lookup)
      will be places in the 0 spot of the cache and the rest of the entries moved
      down one spot.  The 3rd entry will be lost.
      
      Races are possible and are even likely to happen.  Lets assume that 4 tasks
      are hitting sidtab_context_to_sid.  The first task checks against the first
      entry in the cache and it is a miss.  Now lets assume a second task updates
      the cache with a new entry.  This will push the first entry back to the second
      spot.  Now the first task might check against the second entry (which it
      already checked) and will miss again.  Now say some third task updates the
      cache and push the second entry to the third spot.  The first task my check
      the third entry (for the third time!) and again have a miss.  At which point
      it will just do a full table lookup.  No big deal!
      Signed-off-by: NEric Paris <eparis@redhat.com>
      73ff5fc0
  8. 03 12月, 2010 1 次提交
    • E
      SELinux: do not compute transition labels on mountpoint labeled filesystems · 415103f9
      Eric Paris 提交于
      selinux_inode_init_security computes transitions sids even for filesystems
      that use mount point labeling.  It shouldn't do that.  It should just use
      the mount point label always and no matter what.
      
      This causes 2 problems.  1) it makes file creation slower than it needs to be
      since we calculate the transition sid and 2) it allows files to be created
      with a different label than the mount point!
      
      # id -Z
      staff_u:sysadm_r:sysadm_t:s0-s0:c0.c1023
      # sesearch --type --class file --source sysadm_t --target tmp_t
      Found 1 semantic te rules:
         type_transition sysadm_t tmp_t : file user_tmp_t;
      
      # mount -o loop,context="system_u:object_r:tmp_t:s0"  /tmp/fs /mnt/tmp
      
      # ls -lZ /mnt/tmp
      drwx------. root root system_u:object_r:tmp_t:s0       lost+found
      # touch /mnt/tmp/file1
      # ls -lZ /mnt/tmp
      -rw-r--r--. root root staff_u:object_r:user_tmp_t:s0   file1
      drwx------. root root system_u:object_r:tmp_t:s0       lost+found
      
      Whoops, we have a mount point labeled filesystem tmp_t with a user_tmp_t
      labeled file!
      Signed-off-by: NEric Paris <eparis@redhat.com>
      Reviewed-by: NReviewed-by: James Morris <jmorris@namei.org>
      415103f9
  9. 02 12月, 2010 1 次提交
    • C
      This patch adds a new security attribute to Smack called · 676dac4b
      Casey Schaufler 提交于
      SMACK64EXEC. It defines label that is used while task is
      running.
      
      Exception: in smack_task_wait() child task is checked
      for write access to parent task using label inherited
      from the task that forked it.
      
      Fixed issues from previous submit:
      - SMACK64EXEC was not read when SMACK64 was not set.
      - inode security blob was not updated after setting
        SMACK64EXEC
      - inode security blob was not updated when removing
        SMACK64EXEC
      676dac4b
  10. 01 12月, 2010 8 次提交
  11. 30 11月, 2010 1 次提交
  12. 29 11月, 2010 4 次提交
    • C
      Smack: UDS revision · b4e0d5f0
      Casey Schaufler 提交于
      This patch addresses a number of long standing issues
          with the way Smack treats UNIX domain sockets.
      
          All access control was being done based on the label of
          the file system object. This is inconsistant with the
          internet domain, in which access is done based on the
          IPIN and IPOUT attributes of the socket. As a result
          of the inode label policy it was not possible to use
          a UDS socket for label cognizant services, including
          dbus and the X11 server.
      
          Support for SCM_PEERSEC on UDS sockets is also provided.
      Signed-off-by: NCasey Schaufler <casey@schaufler-ca.com>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      b4e0d5f0
    • M
      keys: add new key-type encrypted · 7e70cb49
      Mimi Zohar 提交于
      Define a new kernel key-type called 'encrypted'. Encrypted keys are kernel
      generated random numbers, which are encrypted/decrypted with a 'trusted'
      symmetric key. Encrypted keys are created/encrypted/decrypted in the kernel.
      Userspace only ever sees/stores encrypted blobs.
      
      Changelog:
      - bug fix: replaced master-key rcu based locking with semaphore
        (reported by David Howells)
      - Removed memset of crypto_shash_digest() digest output
      - Replaced verification of 'key-type:key-desc' using strcspn(), with
        one based on string constants.
      - Moved documentation to Documentation/keys-trusted-encrypted.txt
      - Replace hash with shash (based on comments by David Howells)
      - Make lengths/counts size_t where possible (based on comments by David Howells)
        Could not convert most lengths, as crypto expects 'unsigned int'
        (size_t: on 32 bit is defined as unsigned int, but on 64 bit is unsigned long)
      - Add 'const' where possible (based on comments by David Howells)
      - allocate derived_buf dynamically to support arbitrary length master key
        (fixed by Roberto Sassu)
      - wait until late_initcall for crypto libraries to be registered
      - cleanup security/Kconfig
      - Add missing 'update' keyword (reported/fixed by Roberto Sassu)
      - Free epayload on failure to create key (reported/fixed by Roberto Sassu)
      - Increase the data size limit (requested by Roberto Sassu)
      - Crypto return codes are always 0 on success and negative on failure,
        remove unnecessary tests.
      - Replaced kzalloc() with kmalloc()
      Signed-off-by: NMimi Zohar <zohar@us.ibm.com>
      Signed-off-by: NDavid Safford <safford@watson.ibm.com>
      Reviewed-by: NRoberto Sassu <roberto.sassu@polito.it>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      7e70cb49
    • M
      keys: add new trusted key-type · d00a1c72
      Mimi Zohar 提交于
      Define a new kernel key-type called 'trusted'.  Trusted keys are random
      number symmetric keys, generated and RSA-sealed by the TPM.  The TPM
      only unseals the keys, if the boot PCRs and other criteria match.
      Userspace can only ever see encrypted blobs.
      
      Based on suggestions by Jason Gunthorpe, several new options have been
      added to support additional usages.
      
      The new options are:
      migratable=  designates that the key may/may not ever be updated
                   (resealed under a new key, new pcrinfo or new auth.)
      
      pcrlock=n    extends the designated PCR 'n' with a random value,
                   so that a key sealed to that PCR may not be unsealed
                   again until after a reboot.
      
      keyhandle=   specifies the sealing/unsealing key handle.
      
      keyauth=     specifies the sealing/unsealing key auth.
      
      blobauth=    specifies the sealed data auth.
      
      Implementation of a kernel reserved locality for trusted keys will be
      investigated for a possible future extension.
      
      Changelog:
      - Updated and added examples to Documentation/keys-trusted-encrypted.txt
      - Moved generic TPM constants to include/linux/tpm_command.h
        (David Howell's suggestion.)
      - trusted_defined.c: replaced kzalloc with kmalloc, added pcrlock failure
        error handling, added const qualifiers where appropriate.
      - moved to late_initcall
      - updated from hash to shash (suggestion by David Howells)
      - reduced worst stack usage (tpm_seal) from 530 to 312 bytes
      - moved documentation to Documentation directory (suggestion by David Howells)
      - all the other code cleanups suggested by David Howells
      - Add pcrlock CAP_SYS_ADMIN dependency (based on comment by Jason Gunthorpe)
      - New options: migratable, pcrlock, keyhandle, keyauth, blobauth (based on
        discussions with Jason Gunthorpe)
      - Free payload on failure to create key(reported/fixed by Roberto Sassu)
      - Updated Kconfig and other descriptions (based on Serge Hallyn's suggestion)
      - Replaced kzalloc() with kmalloc() (reported by Serge Hallyn)
      Signed-off-by: NDavid Safford <safford@watson.ibm.com>
      Signed-off-by: NMimi Zohar <zohar@us.ibm.com>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      d00a1c72
    • S
      security: Define CAP_SYSLOG · ce6ada35
      Serge E. Hallyn 提交于
      Privileged syslog operations currently require CAP_SYS_ADMIN.  Split
      this off into a new CAP_SYSLOG privilege which we can sanely take away
      from a container through the capability bounding set.
      
      With this patch, an lxc container can be prevented from messing with
      the host's syslog (i.e. dmesg -c).
      
      Changelog: mar 12 2010: add selinux capability2:cap_syslog perm
      Changelog: nov 22 2010:
      	. port to new kernel
      	. add a WARN_ONCE if userspace isn't using CAP_SYSLOG
      Signed-off-by: NSerge Hallyn <serge.hallyn@ubuntu.com>
      Acked-by: NAndrew G. Morgan <morgan@kernel.org>
      Acked-By: NKees Cook <kees.cook@canonical.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: "Christopher J. PeBenito" <cpebenito@tresys.com>
      Cc: Eric Paris <eparis@parisplace.org>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      ce6ada35
  13. 24 11月, 2010 2 次提交
  14. 18 11月, 2010 1 次提交
  15. 16 11月, 2010 1 次提交
  16. 12 11月, 2010 1 次提交
  17. 11 11月, 2010 2 次提交
  18. 29 10月, 2010 2 次提交
  19. 27 10月, 2010 1 次提交
    • E
      IMA: fix the ToMToU logic · bade72d6
      Eric Paris 提交于
      Current logic looks like this:
      
              rc = ima_must_measure(NULL, inode, MAY_READ, FILE_CHECK);
              if (rc < 0)
                      goto out;
      
              if (mode & FMODE_WRITE) {
                      if (inode->i_readcount)
                              send_tomtou = true;
                      goto out;
              }
      
              if (atomic_read(&inode->i_writecount) > 0)
                      send_writers = true;
      
      Lets assume we have a policy which states that all files opened for read
      by root must be measured.
      
      Lets assume the file has permissions 777.
      
      Lets assume that root has the given file open for read.
      
      Lets assume that a non-root process opens the file write.
      
      The non-root process will get to ima_counts_get() and will check the
      ima_must_measure().  Since it is not supposed to measure it will goto
      out.
      
      We should check the i_readcount no matter what since we might be causing
      a ToMToU voilation!
      
      This is close to correct, but still not quite perfect.  The situation
      could have been that root, which was interested in the mesurement opened
      and closed the file and another process which is not interested in the
      measurement is the one holding the i_readcount ATM.  This is just overly
      strict on ToMToU violations, which is better than not strict enough...
      Signed-off-by: NEric Paris <eparis@redhat.com>
      Acked-by: NMimi Zohar <zohar@linux.vnet.ibm.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bade72d6