1. 07 Jan, 2011 5 commits
    • fs: rcu-walk for path lookup · 31e6b01f
      Nick Piggin committed
      Perform common cases of path lookups without any stores or locking in the
      ancestor dentry elements. This is called rcu-walk, as opposed to the current
      algorithm which is a refcount based walk, or ref-walk.
      
      This results in far fewer atomic operations on every path element,
      significantly improving path lookup performance. It also avoids cacheline
      bouncing on common dentries, significantly improving scalability.
      
      The overall design is like this:
      * LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
      * Take the RCU lock for the entire path walk, starting with the acquiring
        of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
        not required for dentry persistence.
      * synchronize_rcu is called when unregistering a filesystem, so we can
        access d_ops and i_ops during rcu-walk.
      * Similarly take the vfsmount lock for the entire path walk. So now mnt
        refcounts are not required for persistence. Also we are free to perform mount
        lookups, and to assume dentry mount points and mount roots are stable up and
        down the path.
      * Have a per-dentry seqlock to protect the dentry name, parent, and inode,
        so we can load this tuple atomically, and also check whether any of its
        members have changed.
      * Dentry lookups (based on parent, candidate string tuple) recheck the parent
        sequence after the child is found in case anything changed in the parent
        during the path walk.
      * inode is also RCU protected so we can load d_inode and use the inode for
        limited things.
      * i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
      * i_op can be loaded.
      
      When we reach the destination dentry, we lock it, recheck lookup sequence,
      and increment its refcount and mountpoint refcount. RCU and vfsmount locks
      are dropped. This is termed "dropping rcu-walk". If the dentry refcount does
      not match, we can not drop rcu-walk gracefully at the current point in the
      lookup, so instead return -ECHILD (for want of a better errno). This signals the
      path walking code to re-do the entire lookup with a ref-walk.
      
      Aside from the final dentry, there are other situations that may be encountered
      where we cannot continue rcu-walk. In that case, we drop rcu-walk (ie. take
      a reference on the last good dentry) and continue with a ref-walk. Again, if
      we cannot drop rcu-walk gracefully, we return -ECHILD and do the whole lookup
      using ref-walk. But it is very important that we can continue with ref-walk
      for most cases, particularly to avoid the overhead of double lookups, and to
      gain the scalability advantages on common path elements (like cwd and root).
      
      The cases where rcu-walk cannot continue are:
      * NULL dentry (ie. any uncached path element)
      * parent with d_inode->i_op->permission or ACLs
      * dentries with d_revalidate
      * Following links
      
      In future patches, permission checks and d_revalidate become rcu-walk aware. It
      may be possible eventually to make following links rcu-walk aware.
      
      Uncached path elements will always require dropping to ref-walk mode, at the
      very least because i_mutex needs to be grabbed, and objects allocated.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
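      The per-dentry sequence check and the -ECHILD fallback described in this
      commit can be illustrated with a small userspace sketch. It is not the
      kernel implementation: struct demo_dentry and every demo_* function are
      hypothetical names, C11 atomics stand in for the kernel's seqcount
      primitives, and writers are assumed to be serialised by the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a dentry: a sequence counter guarding a tuple. */
struct demo_dentry {
    atomic_uint seq;        /* even: stable, odd: update in progress */
    atomic_int parent_id;   /* stands in for d_parent */
    atomic_int inode_id;    /* stands in for d_inode */
};

/* Writer: make the counter odd, update the tuple, make it even again.
 * Assumes writers are already serialised against each other. */
static void demo_update(struct demo_dentry *d, int parent_id, int inode_id)
{
    atomic_fetch_add(&d->seq, 1u);
    atomic_store(&d->parent_id, parent_id);
    atomic_store(&d->inode_id, inode_id);
    atomic_fetch_add(&d->seq, 1u);
}

/* Reader: wait for an even (stable) sequence value and remember it. */
static unsigned int demo_read_begin(struct demo_dentry *d)
{
    unsigned int s;

    while ((s = atomic_load(&d->seq)) & 1u)
        ;   /* an update is in progress, try again */
    return s;
}

/* Reader: true if the tuple may have changed since demo_read_begin(). */
static bool demo_read_retry(struct demo_dentry *d, unsigned int s)
{
    return atomic_load(&d->seq) != s;
}

/* Lockless snapshot of the tuple. Returns 0 on success, or -ECHILD when
 * the sequence changed, telling the caller to redo the lookup the slow way. */
static int demo_snapshot(struct demo_dentry *d, int *parent_id, int *inode_id)
{
    unsigned int s = demo_read_begin(d);

    *parent_id = atomic_load(&d->parent_id);
    *inode_id = atomic_load(&d->inode_id);
    return demo_read_retry(d, s) ? -ECHILD : 0;
}
```

      A caller in this model would try demo_snapshot() first and, on -ECHILD,
      repeat the operation under a lock, mirroring the retry with ref-walk.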
    • fs: dcache rationalise dget variants · dc0474be
      Nick Piggin committed
      dget_locked was a shortcut to avoid the lazy lru manipulation when we already
      held dcache_lock (lru manipulation was relatively cheap at that point).
      However, now that the lru lock is an innermost one, we never hold it at any
      caller, so the lock cost can now be avoided. We already have well working lazy
      dcache LRU, so it should be fine to defer LRU manipulations to scan time.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
    • fs: dcache remove dcache_lock · b5c84bf6
      Nick Piggin committed
      dcache_lock no longer protects anything. Remove it.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
    • fs: dcache scale subdirs · 2fd6b7f5
      Nick Piggin committed
      Protect d_subdirs and d_child with d_lock, except in filesystems that aren't
      using dcache_lock for these anyway (eg. using i_mutex).
      
      Note: if we change the locking rule in future so that ->d_child protection is
      provided only with ->d_parent->d_lock, it may allow us to reduce some locking.
      But it would be an exception to an otherwise regular locking scheme, so we'd
      have to see some good results. Probably not worthwhile.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
    • fs: dcache scale d_unhashed · da502956
      Nick Piggin committed
      Protect d_unhashed(dentry) condition with d_lock. This means keeping
      DCACHE_UNHASHED bit in synch with hash manipulations.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
  2. 06 Jan, 2011 1 commit
  3. 04 Jan, 2011 1 commit
    • ima: fix add LSM rule bug · 867c2026
      Mimi Zohar committed
      If security_filter_rule_init() doesn't return a rule, then not everything
      is as fine as the return code implies.
      
      This bug only occurs when the LSM (eg. SELinux) is disabled at runtime.
      
      Adding an empty LSM rule causes ima_match_rules() to always succeed,
      ignoring any remaining rules.
      
       default IMA TCB policy:
        # PROC_SUPER_MAGIC
        dont_measure fsmagic=0x9fa0
        # SYSFS_MAGIC
        dont_measure fsmagic=0x62656572
        # DEBUGFS_MAGIC
        dont_measure fsmagic=0x64626720
        # TMPFS_MAGIC
        dont_measure fsmagic=0x01021994
        # SECURITYFS_MAGIC
        dont_measure fsmagic=0x73636673
      
        < LSM specific rule >
        dont_measure obj_type=var_log_t
      
        measure func=BPRM_CHECK
        measure func=FILE_MMAP mask=MAY_EXEC
        measure func=FILE_CHECK mask=MAY_READ uid=0
      
      Thus, without the patch and with the boot parameters 'tcb selinux=0', adding
      the above 'dont_measure obj_type=var_log_t' rule to the default IMA TCB
      measurement policy would result in nothing being measured.  The patch
      prevents the default TCB policy from being replaced.
      Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
      Cc: James Morris <jmorris@namei.org>
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: David Safford <safford@watson.ibm.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
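      Why an empty LSM rule swallows the rest of the policy can be shown with a
      toy first-match rule list. This is only a model of the behaviour the
      commit describes, not the IMA code: struct demo_rule and the demo_*
      functions are hypothetical, and the LSM criterion is reduced to a string
      compare on an object type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical policy rule: each criterion is optional. */
struct demo_rule {
    bool measure;           /* true: measure, false: dont_measure */
    bool has_fsmagic;
    unsigned long fsmagic;
    const char *obj_type;   /* NULL when the LSM produced no rule */
};

/* A rule matches when none of its present criteria disagree, so a rule
 * with no criteria at all matches every file. */
static bool demo_rule_matches(const struct demo_rule *r,
                              unsigned long fsmagic, const char *obj_type)
{
    if (r->has_fsmagic && r->fsmagic != fsmagic)
        return false;
    if (r->obj_type && strcmp(r->obj_type, obj_type) != 0)
        return false;
    return true;
}

/* First matching rule decides; with no match nothing is measured. */
static bool demo_must_measure(const struct demo_rule *rules, size_t n,
                              unsigned long fsmagic, const char *obj_type)
{
    for (size_t i = 0; i < n; i++)
        if (demo_rule_matches(&rules[i], fsmagic, obj_type))
            return rules[i].measure;
    return false;
}

/* Guard in the spirit of the fix: reject a rule whose requested LSM part
 * is missing instead of silently accepting it. */
static int demo_check_lsm_rule(const struct demo_rule *r, bool lsm_requested)
{
    return (lsm_requested && r->obj_type == NULL) ? -1 : 0;
}
```

      In this model a 'dont_measure' rule whose obj_type is NULL sits ahead of
      the 'measure' rules and matches first for every file, which is the
      "nothing being measured" outcome described above.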
  4. 24 Dec, 2010 1 commit
    • KEYS: Don't call up_write() if __key_link_begin() returns an error · 3fc5e98d
      David Howells committed
      In construct_alloc_key(), up_write() is called in the error path if
      __key_link_begin() fails, but this is incorrect as __key_link_begin() only
      returns with the nominated keyring locked if it returns successfully.
      
      Without this patch, you might see the following in dmesg:
      
      	=====================================
      	[ BUG: bad unlock balance detected! ]
      	-------------------------------------
      	mount.cifs/5769 is trying to release lock (&key->sem) at:
      	[<ffffffff81201159>] request_key_and_link+0x263/0x3fc
      	but there are no more locks to release!
      
      	other info that might help us debug this:
      	3 locks held by mount.cifs/5769:
      	 #0:  (&type->s_umount_key#41/1){+.+.+.}, at: [<ffffffff81131321>] sget+0x278/0x3e7
      	 #1:  (&ret_buf->session_mutex){+.+.+.}, at: [<ffffffffa0258e59>] cifs_get_smb_ses+0x35a/0x443 [cifs]
      	 #2:  (root_key_user.cons_lock){+.+.+.}, at: [<ffffffff81201000>] request_key_and_link+0x10a/0x3fc
      
      	stack backtrace:
      	Pid: 5769, comm: mount.cifs Not tainted 2.6.37-rc6+ #1
      	Call Trace:
      	 [<ffffffff81201159>] ? request_key_and_link+0x263/0x3fc
      	 [<ffffffff81081601>] print_unlock_inbalance_bug+0xca/0xd5
      	 [<ffffffff81083248>] lock_release_non_nested+0xc1/0x263
      	 [<ffffffff81201159>] ? request_key_and_link+0x263/0x3fc
      	 [<ffffffff81201159>] ? request_key_and_link+0x263/0x3fc
      	 [<ffffffff81083567>] lock_release+0x17d/0x1a4
      	 [<ffffffff81073f45>] up_write+0x23/0x3b
      	 [<ffffffff81201159>] request_key_and_link+0x263/0x3fc
      	 [<ffffffffa026fe9e>] ? cifs_get_spnego_key+0x61/0x21f [cifs]
      	 [<ffffffff812013c5>] request_key+0x41/0x74
      	 [<ffffffffa027003d>] cifs_get_spnego_key+0x200/0x21f [cifs]
      	 [<ffffffffa026e296>] CIFS_SessSetup+0x55d/0x1273 [cifs]
      	 [<ffffffffa02589e1>] cifs_setup_session+0x90/0x1ae [cifs]
      	 [<ffffffffa0258e7e>] cifs_get_smb_ses+0x37f/0x443 [cifs]
      	 [<ffffffffa025a9e3>] cifs_mount+0x1aa1/0x23f3 [cifs]
      	 [<ffffffff8111fd94>] ? alloc_debug_processing+0xdb/0x120
      	 [<ffffffffa027002c>] ? cifs_get_spnego_key+0x1ef/0x21f [cifs]
      	 [<ffffffffa024cc71>] cifs_do_mount+0x165/0x2b3 [cifs]
      	 [<ffffffff81130e72>] vfs_kern_mount+0xaf/0x1dc
      	 [<ffffffff81131007>] do_kern_mount+0x4d/0xef
      	 [<ffffffff811483b9>] do_mount+0x6f4/0x733
      	 [<ffffffff8114861f>] sys_mount+0x88/0xc2
      	 [<ffffffff8100ac42>] system_call_fastpath+0x16/0x1b
      Reported-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reviewed-and-Tested-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
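      The locking rule behind this fix is that a "begin" helper which only
      holds the lock on success must not be paired with an unlock on its error
      path. A minimal sketch of that rule, using a plain counter in place of
      the keyring semaphore; all demo_* names are hypothetical and this is not
      the code from construct_alloc_key():

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical keyring with a counter standing in for its write lock. */
struct demo_keyring {
    int write_locked;
};

/* Returns 0 with the lock held, or a negative errno with the lock NOT held. */
static int demo_link_begin(struct demo_keyring *k, bool simulate_failure)
{
    if (simulate_failure)
        return -ENOMEM;
    k->write_locked++;
    return 0;
}

/* Releasing a lock that is not held is the "bad unlock balance" case. */
static void demo_up_write(struct demo_keyring *k)
{
    assert(k->write_locked > 0);
    k->write_locked--;
}

static int demo_construct_key(struct demo_keyring *k, bool simulate_failure)
{
    int ret = demo_link_begin(k, simulate_failure);

    if (ret < 0)
        return ret;     /* no unlock here: begin did not take the lock */

    /* ... the key would be linked into the keyring here ... */
    demo_up_write(k);
    return 0;
}
```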
  5. 24 Nov, 2010 2 commits
  6. 18 Nov, 2010 1 commit
  7. 16 Nov, 2010 1 commit
  8. 12 Nov, 2010 1 commit
  9. 11 Nov, 2010 2 commits
  10. 29 Oct, 2010 2 commits
  11. 27 Oct, 2010 11 commits
    • IMA: fix the ToMToU logic · bade72d6
      Eric Paris committed
      Current logic looks like this:
      
              rc = ima_must_measure(NULL, inode, MAY_READ, FILE_CHECK);
              if (rc < 0)
                      goto out;
      
              if (mode & FMODE_WRITE) {
                      if (inode->i_readcount)
                              send_tomtou = true;
                      goto out;
              }
      
              if (atomic_read(&inode->i_writecount) > 0)
                      send_writers = true;
      
      Let's assume we have a policy which states that all files opened for read
      by root must be measured.
      
      Let's assume the file has permissions 777.
      
      Let's assume that root has the given file open for read.
      
      Let's assume that a non-root process opens the file for write.
      
      The non-root process will get to ima_counts_get() and will check the
      ima_must_measure().  Since it is not supposed to measure it will goto
      out.
      
      We should check the i_readcount no matter what since we might be causing
      a ToMToU violation!
      
      This is close to correct, but still not quite perfect.  The situation
      could have been that root, which was interested in the measurement, opened
      and closed the file and another process which is not interested in the
      measurement is the one holding the i_readcount ATM.  This is just overly
      strict on ToMToU violations, which is better than not strict enough...
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
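      One ordering consistent with the message above is to look at the reader
      count for every writer before consulting the opener's own measurement
      policy. The sketch below models that ordering in userspace. The types
      and demo_* names are hypothetical, the must_measure argument stands in
      for the result of ima_must_measure(), and the actual patch may differ in
      detail.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two counters kept on the inode. */
struct demo_inode {
    unsigned int readcount;   /* open readers */
    int writecount;           /* open writers */
};

struct demo_violations {
    bool tomtou;         /* a writer appeared while readers exist */
    bool open_writers;   /* a measured reader appeared while writers exist */
};

static struct demo_violations demo_counts_get(const struct demo_inode *inode,
                                              bool opening_for_write,
                                              bool must_measure)
{
    struct demo_violations v = { false, false };

    if (opening_for_write) {
        /* Checked no matter what the opener's own policy says. */
        if (inode->readcount)
            v.tomtou = true;
        return v;
    }

    if (!must_measure)
        return v;

    if (inode->writecount > 0)
        v.open_writers = true;
    return v;
}
```

      In the scenario from the message (root holds the file open for read, a
      non-root process that is not measured opens it for write) this model
      reports a ToMToU violation.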
    • IMA: explicit IMA i_flag to remove global lock on inode_delete · 196f5181
      Eric Paris committed
      Currently for every removed inode IMA must take a global lock and search
      the IMA rbtree looking for an associated integrity structure.  Instead
      we explicitly mark an inode when we add an integrity structure so we
      only have to take the global lock and do the removal if it exists.
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
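      The idea of marking the inode so that the common delete path skips the
      global lock can be sketched as follows. DEMO_S_IMA, struct demo_inode
      and the demo_* functions are hypothetical names, and a counter stands in
      for taking the global lock and searching the tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_S_IMA 0x1u   /* hypothetical flag bit, not the kernel's value */

struct demo_inode {
    unsigned int flags;
};

/* Counts how often the expensive global-lock path was taken. */
static unsigned int demo_global_lock_acquisitions;

/* Called when an integrity structure is attached to the inode. */
static void demo_attach_iint(struct demo_inode *inode)
{
    assert(inode != NULL);
    inode->flags |= DEMO_S_IMA;
}

/* Called for every removed inode. Returns true if the slow path ran. */
static bool demo_inode_delete(struct demo_inode *inode)
{
    assert(inode != NULL);
    if (!(inode->flags & DEMO_S_IMA))
        return false;   /* nothing attached: no lock, no tree search */

    demo_global_lock_acquisitions++;   /* lock, remove from tree, unlock */
    inode->flags &= ~DEMO_S_IMA;
    return true;
}
```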
    • IMA: drop refcnt from ima_iint_cache since it isn't needed · 64c62f06
      Eric Paris committed
      Since finding a struct ima_iint_cache requires a valid struct inode, and
      the struct ima_iint_cache is supposed to have the same lifetime as a
      struct inode (technically they die together but don't need to be created
      at the same time) we don't have to worry about the ima_iint_cache
      outliving or dying before the inode.  So the refcnt isn't useful.  Just
      get rid of it and free the structure when the inode is freed.
      Signed-off-by: Eric Paris <eapris@redhat.com>
      Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • IMA: only allocate iint when needed · bc7d2a3e
      Eric Paris committed
      IMA always allocates an integrity structure to hold information about
      every inode, but only needed this structure to track the number of
      readers and writers currently accessing a given inode.  Since that
      information was moved into struct inode instead of the integrity struct
      this patch stops allocating the integrity structure until it is needed,
      greatly reducing memory usage.
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • IMA: move read counter into struct inode · a178d202
      Eric Paris committed
      IMA currently allocates an inode integrity structure for every inode in
      core.  This structure is about 120 bytes long.  Most files however
      (especially on a system which doesn't make use of IMA) will never need
      any of this space.  The problem is that if IMA is enabled we need to
      know information about the number of readers and the number of writers
      for every inode on the box.  At the moment we collect that information
      in the per inode iint structure and waste the rest of the space.  This
      patch moves those counters into the struct inode so we can eventually
      stop allocating an IMA integrity structure except when absolutely
      needed.
      
      This patch does the minimum needed to move the location of the data.
      Further cleanups, especially the location of counter updates, may still
      be possible.
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • IMA: use i_writecount rather than a private counter · b9593d30
      Eric Paris committed
      IMA tracks the number of struct files which are holding a given inode
      readonly and the number which are holding the inode write or r/w.  It
      needs this information so when a new reader or writer comes in it can
      tell if this new file will be able to invalidate results it already made
      about existing files.
      
      That is, if a task is holding a struct file open RO, IMA measured the file
      and recorded those measurements, and then a task opens the file RW, IMA
      needs to note in the logs that the old measurement may not be correct.
      It's called a "Time of Measure Time of Use" (ToMToU) issue.  The same is
      true if a RO file is opened to an inode which has an open writer.  We
      cannot, with any validity, measure the file in question since it could
      be changing.
      
      This patch attempts to use the i_writecount field to track writers.  The
      i_writecount field actually embeds more information in its value than
      IMA needs but it should work for our purposes and allow us to shrink the
      struct inode even more.
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • IMA: use inode->i_lock to protect read and write counters · ad16ad00
      Eric Paris committed
      Currently IMA uses the iint->mutex to protect the i_readcount and
      i_writecount.  This patch uses the inode->i_lock since we are going to
      start keeping these counters in inode objects and that is the most appropriate lock.
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • IMA: convert internal flags from long to char · 15aac676
      Eric Paris committed
      The IMA flags field is an unsigned long but there is only 1 flag defined.
      Let's save a little space and make it a char.  This packs nicely next to
      the array of u8's.
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • IMA: use unsigned int instead of long for counters · 497f3233
      Eric Paris committed
      Currently IMA uses 2 longs in struct inode.  To save space (and as it
      seems impossible to overflow 32 bits) we switch these to unsigned int.
      The switch to unsigned does require slightly different checks for
      underflow, but it isn't complex.
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
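      The "slightly different checks for underflow" follow from unsigned
      arithmetic: a signed counter can be decremented and then tested for a
      negative value, while an unsigned one wraps around, so the test has to
      happen before the decrement. A minimal sketch with hypothetical demo_*
      helpers (not the IMA code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Signed style: decrement first, then detect the negative result. */
static bool demo_dec_signed(long *count)
{
    assert(count != NULL);
    (*count)--;
    if (*count < 0) {
        *count = 0;     /* repair the counter and report the imbalance */
        return false;
    }
    return true;
}

/* Unsigned style: 0 - 1 would wrap to UINT_MAX, so test before decrementing. */
static bool demo_dec_unsigned(unsigned int *count)
{
    assert(count != NULL);
    if (*count == 0)
        return false;   /* imbalance detected, counter left untouched */
    (*count)--;
    return true;
}
```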
    • IMA: drop the inode opencount since it isn't needed for operation · b575156d
      Eric Paris committed
      The opencount was used to help debugging to make sure that everything
      which created a struct file also correctly made the IMA calls.  Since we
      moved all of that into the VFS this isn't as necessary.  We should be
      able to get the same amount of debugging out of just the reader and
      write count.
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • IMA: use rbtree instead of radix tree for inode information cache · 85491641
      Eric Paris committed
      The IMA code needs to store the number of tasks which have an open fd
      granting permission to write a file even when IMA is not in use.  It
      needs this information in order to be enabled at a later point in time
      without losing its integrity guarantees.
      
      At the moment that means we store a little bit of data about every inode
      in a cache.  We use a radix tree key'd on the inode's memory address.
      Dave Chinner pointed out that a radix tree is a terrible data structure
      for such a sparse key space.  This patch switches to using an rbtree
      which should be more efficient.
      
      Bug report from Dave:
      
       "I just noticed that slabtop was reporting an awfully high usage of
        radix tree nodes:
      
           OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
        4200331 2778082  66%    0.55K 144839       29   2317424K radix_tree_node
        2321500 2060290  88%    1.00K  72581       32   2322592K xfs_inode
        2235648 2069791  92%    0.12K  69864       32    279456K iint_cache
      
        That is, 2.7M radix tree nodes are allocated, and the cache itself is
        consuming 2.3GB of RAM.  I know that the XFS inode caches are indexed
        by radix tree node, but for 2 million cached inodes that would mean a
        density of 1 inode per radix tree node, which for a system with 16M
        inodes in the filesystems is an impossibly low density.  The worst I've
        seen in a production system like kernel.org is about 20-25% density,
        which would mean about 150-200k radix tree nodes for that many inodes.
        So it's not the inode cache.
      
        So I looked up what the iint_cache was.  It appears to be used for
        storing per-inode IMA information, and uses a radix tree for indexing.
        It uses the *address* of the struct inode as the indexing key.  That
        means the key space is extremely sparse - for XFS the struct inode
        addresses are approximately 1000 bytes apart, which means the closest
        the radix tree index keys get is ~1000.  Which means that there is a
        single entry per radix tree leaf node, so the radix tree is using
        roughly 550 bytes for every 120byte structure being cached.  For the
        above example, it's probably wasting close to 1GB of RAM...."
      Reported-by: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
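      The waste estimate in the bug report can be reproduced with simple
      arithmetic, assuming (as the report does) one radix tree node of roughly
      550 bytes per cached 120-byte structure. The demo_* helper names are
      hypothetical, and the object count is the iint_cache OBJS figure from
      the slabtop output quoted above.

```c
#include <assert.h>

/* Bytes spent on radix tree nodes when every cached object ends up alone
 * in its own leaf node. */
static unsigned long long demo_radix_overhead_bytes(unsigned long long objects,
                                                    unsigned long long node_bytes)
{
    assert(node_bytes > 0);
    return objects * node_bytes;
}

/* Bytes spent on the cached structures themselves. */
static unsigned long long demo_payload_bytes(unsigned long long objects,
                                             unsigned long long object_bytes)
{
    assert(object_bytes > 0);
    return objects * object_bytes;
}
```

      For 2235648 objects this gives 1229606400 bytes of index nodes against
      268277760 bytes of payload, which is in line with the "close to 1GB"
      estimate in the report.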
  12. 26 Oct, 2010 2 commits
  13. 21 Oct, 2010 10 commits