1. 12 May 2011, 3 commits
    • tmpfs: fix spurious ENOSPC when racing with unswap · 59a16ead
      Committed by Hugh Dickins
      Testing the shmem_swaplist replacements for igrab() revealed another bug:
      writes to /dev/loop0 on a tmpfs file which fills its filesystem were
      sometimes failing with "Buffer I/O error"s.
      
      These came from ENOSPC failures of shmem_getpage(), when racing with
      swapoff: the same could happen when racing with another shmem_getpage(),
      pulling the page in from swap in between our find_lock_page() and our
      taking the info->lock (though not in the single-threaded loop case).
      
      This is unacceptable, and it is surprising that I have not noticed it
      before: it dates back many years, but was (presumably) made much easier
      to reproduce in 2.6.36, which placed a page preallocation in the race
      window.
      
      Fix it by rechecking the page cache before settling on an ENOSPC error
      (sketched below).
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
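      
      A minimal sketch of that recheck, in the style of shmem_getpage() (the
      local names and surrounding control flow here are illustrative, not
      the literal upstream diff):
      
        if (shmem_acct_block(info->flags)) {
                /* Looks like ENOSPC: but recheck the page cache, in case a
                 * racing shmem_getpage() or swapoff brought the page in. */
                page = find_get_page(mapping, index);
                if (page) {
                        page_cache_release(page);
                        goto repeat;            /* page exists after all */
                }
                error = -ENOSPC;
                goto failed;
        }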
    • tmpfs: fix race between umount and swapoff · 778dd893
      Committed by Hugh Dickins
      The use of igrab() in swapoff's shmem_unuse_inode() is just as vulnerable
      to umount as that in shmem_writepage().
      
      Fix this instance by extending the protection of shmem_swaplist_mutex
      right across shmem_unuse_inode(): while it's on the list, the inode cannot
      be evicted (and the filesystem cannot be unmounted) without
      shmem_evict_inode() taking that mutex to remove it from the list.
      
      But since shmem_writepage() might take that mutex, we should avoid making
      memory allocations or memcg charges while holding it: prepare them at the
      outer level in shmem_unuse().  When mem_cgroup_cache_charge() was
      originally placed, we didn't know until that point that the page from swap
      was actually a shmem page; but nowadays it's noted in the swap_map, so
      we're safe to charge upfront.  For the radix_tree, do as is done in
      shmem_getpage(): preload upfront, but don't pin to the cpu; so we make a
      habit of refreshing the node pool, but might dip into GFP_NOWAIT reserves
      on occasion if subsequently preempted.
      
      With the allocation and charge moved out from shmem_unuse_inode(), we
      can also keep the index map and info->lock held over from finding the
      entry (see the preload sketch below).
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
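      
      A hedged sketch of the "preload upfront, but don't pin to the cpu"
      idea, using the stock radix_tree API of the time:
      
        error = radix_tree_preload(GFP_KERNEL);
        if (error)
                goto out;
        radix_tree_preload_end();   /* undo preload's preempt_disable() */
        /* The per-cpu node pool is now topped up; a later insertion may
         * still fall back to GFP_NOWAIT if we get preempted meanwhile. */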
    • tmpfs: fix race between umount and writepage · b1dea800
      Committed by Hugh Dickins
      Konstantin Khlebnikov reports that a dangerous race between umount and
      shmem_writepage can be reproduced by this script:
      
        for i in {1..300} ; do
      	mkdir $i
      	while true ; do
      		mount -t tmpfs none $i
      		dd if=/dev/zero of=$i/test bs=1M count=$(($RANDOM % 100))
      		umount $i
      	done &
        done
      
      on a 6-CPU node with 8 GB RAM: the kernel becomes very unstable after this. =)
      
      Kernel log:
      
        VFS: Busy inodes after unmount of tmpfs.
                       Self-destruct in 5 seconds.  Have a nice day...
      
        WARNING: at lib/list_debug.c:53 __list_del_entry+0x8d/0x98()
        list_del corruption. prev->next should be ffff880222fdaac8, but was (null)
        Pid: 11222, comm: mount.tmpfs Not tainted 2.6.39-rc2+ #4
        Call Trace:
         warn_slowpath_common+0x80/0x98
         warn_slowpath_fmt+0x41/0x43
         __list_del_entry+0x8d/0x98
         evict+0x50/0x113
         iput+0x138/0x141
        ...
        BUG: unable to handle kernel paging request at ffffffffffffffff
        IP: shmem_free_blocks+0x18/0x4c
        Pid: 10422, comm: dd Tainted: G        W   2.6.39-rc2+ #4
        Call Trace:
         shmem_recalc_inode+0x61/0x66
         shmem_writepage+0xba/0x1dc
         pageout+0x13c/0x24c
         shrink_page_list+0x28e/0x4be
         shrink_inactive_list+0x21f/0x382
        ...
      
      shmem_writepage() calls igrab() on the inode for the page which came
      from page reclaim, to add it later into shmem_swaplist for the swapoff
      operation.
      
      This igrab() can race with the super-block deactivating process:
      
        shrink_inactive_list()          deactivate_super()
        pageout()                       tmpfs_fs_type->kill_sb()
        shmem_writepage()               kill_litter_super()
                                        generic_shutdown_super()
                                         evict_inodes()
         igrab()
                                          atomic_read(&inode->i_count)
                                           skip-inode
         iput()
                                         if (!list_empty(&sb->s_inodes))
                                                printk("VFS: Busy inodes after...
      
      This igrab()-iput() pair was added in commit 1b1b32f2 "tmpfs: fix
      shmem_swaplist races" based on incorrect assumptions: igrab() protects the
      inode from concurrent eviction by deletion, but it does nothing to protect
      it from concurrent unmounting, which goes ahead despite the raised
      i_count.
      
      So this use of igrab() was wrong all along, but the race was made much
      worse in 2.6.37, when commit 63997e98 "split invalidate_inodes()"
      replaced two attempts at invalidate_inodes() with a single
      evict_inodes().
      
      Konstantin posted a plausible patch, raising sb->s_active too: I'm unsure
      whether it was correct or not; but burnt once by igrab(), I am sure that
      we don't want to rely more deeply upon externals here.
      
      Fix it by adding the inode to shmem_swaplist earlier, while the page
      lock on the page in the page cache still secures the inode against
      eviction, without artificially raising i_count.  It was originally
      added later because shmem_unuse_inode() is liable to remove an inode
      from the list while it is unswapped; but we can guard against that by
      taking the spinlock before dropping the mutex (see the sketch below).
      Reported-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Tested-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
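      
      The ordering described above, sketched (not the literal diff) for
      shmem_writepage():
      
        mutex_lock(&shmem_swaplist_mutex);
        if (list_empty(&info->swaplist))
                list_add_tail(&info->swaplist, &shmem_swaplist);
        /* take the spinlock before dropping the mutex, so a concurrent
         * shmem_unuse_inode() cannot unlink the inode underneath us */
        spin_lock(&info->lock);
        mutex_unlock(&shmem_swaplist_mutex);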
  2. 15 Apr 2011, 1 commit
  3. 23 Mar 2011, 2 commits
  4. 14 Mar 2011, 1 commit
  5. 10 Mar 2011, 1 commit
  6. 01 Mar 2011, 1 commit
  7. 02 Feb 2011, 1 commit
    • fs/vfs/security: pass last path component to LSM on inode creation · 2a7dba39
      Committed by Eric Paris
      SELinux would like to implement a new labeling behavior of newly created
      inodes.  We currently label new inodes based on the parent and the creating
      process.  This new behavior would also take into account the name of the
      new object when deciding the new label.  This is not the (supposed) full path,
      just the last component of the path.
      
      This is very useful because creating /etc/shadow is different from
      creating /etc/passwd, but the kernel hooks are unable to differentiate
      these operations.  We currently require that userspace realize it is
      doing some such difficult operation, and then userspace jumps through
      SELinux hoops to get things set up correctly.  This patch does not
      implement the new behavior, which is obviously contained in a separate
      SELinux patch, but it does pass the needed name down to the correct
      LSM hook (see the sketch below).  If no such name exists it is fine to
      pass NULL.
      Signed-off-by: Eric Paris <eparis@redhat.com>
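      
      The shape of the changed hook, sketched from the description above
      (the argument types follow the 2.6.39-era interface and were reworked
      again in later kernels):
      
        int security_inode_init_security(struct inode *inode, struct inode *dir,
                                         const struct qstr *qstr,
                                         char **name, void **value, size_t *len);
      
        /* typical call from a filesystem's create path, passing the last
         * path component; NULL is acceptable when no name exists: */
        error = security_inode_init_security(inode, dir, &dentry->d_name,
                                             &name, &value, &len);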
  8. 07 Jan 2011, 1 commit
    • fs: icache RCU free inodes · fa0d7e3d
      Committed by Nick Piggin
      RCU free the struct inode. This will allow:
      
      - Subsequent store-free path walking patch. The inode must be consulted for
        permissions when walking, so an RCU inode reference is a must.
      - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
        to take i_lock no longer need to take sb_inode_list_lock to walk the list in
        the first place. This will simplify and optimize locking.
      - Could remove some nested trylock loops in dcache code
      - Could potentially simplify things a bit in VM land. Do not need to take the
        page lock to follow page->mapping.
      
      The downside of this is the performance cost of using RCU.  In a simple
      creat/unlink microbenchmark, performance drops by about 10% due to the
      inability to reuse cache-hot slab objects.  As iterations increase and
      RCU freeing starts kicking over, this increases to about 20%.
      
      In cases where inode lifetimes are longer (i.e. many inodes may be allocated
      during the average life span of a single inode), a lot of this cache reuse is
      not applicable, so the regression caused by this patch is smaller.
      
      The cache-hot regression could largely be avoided by using
      SLAB_DESTROY_BY_RCU; however, this adds some complexity to list walking
      and store-free path walking, so I prefer to implement it at a later
      date, if it is shown to be a win in real situations.  I haven't found a
      regression in any non-micro benchmark, so I doubt it will be a problem.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
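      
      The RCU-freeing pattern this introduces, sketched for shmem (the patch
      converts every filesystem's ->destroy_inode along these lines):
      
        static void shmem_i_callback(struct rcu_head *head)
        {
                struct inode *inode = container_of(head, struct inode, i_rcu);
                kmem_cache_free(shmem_inode_cachep, SHMEM_I(inode));
        }
      
        static void shmem_destroy_inode(struct inode *inode)
        {
                /* defer the slab free past a grace period, so store-free
                 * path walkers under rcu_read_lock() stay safe */
                call_rcu(&inode->i_rcu, shmem_i_callback);
        }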
  9. 29 Oct 2010, 1 commit
  10. 26 Oct 2010, 3 commits
  11. 18 Aug 2010, 1 commit
  12. 10 Aug 2010, 6 commits
    • shmem: reduce pagefault lock contention · ff36b801
      Committed by Shaohua Li
      I'm running a shmem pagefault test case (see attached file) on a 64-CPU
      system.  Profiling shows shmem_inode_info->lock is heavily contended,
      with 100% of CPU time spent trying to get the lock.  In the pagefault
      (no swap) case, shmem_getpage gets the lock twice; the second
      acquisition is avoidable if we preallocate a page, saving one round of
      locking.  This is what the patch below does.
      
      The result of the test case:
      2.6.35-rc3: ~20s
      2.6.35-rc3 + patch: ~12s
      so this is a 40% improvement.
      
      One might argue that we could have better locking for shmem.  But even
      if shmem were lockless, the pagefault path would soon have the
      pagecache lock heavily contended, because shmem must add each new page
      to the pagecache.  So until we have better locking for the pagecache,
      improving shmem locking doesn't gain much more.  I did a similar
      pagefault test against a ramfs file; the test result was ~10.5s.
      
      [akpm@linux-foundation.org: fix comment, clean up code layout, eliminate code duplication]
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Zhang, Yanmin" <yanmin.zhang@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
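      
      A simplified sketch of the preallocation (prealloc_page is a
      hypothetical local; the real shmem_getpage() is more involved):
      
        prealloc_page = shmem_alloc_page(gfp, info, idx);   /* may sleep */
        spin_lock(&info->lock);
        /* ... fault path: if no page and no swap entry, adopt
         * prealloc_page here instead of unlocking to allocate ... */
        spin_unlock(&info->lock);
        if (prealloc_page)
                page_cache_release(prealloc_page);  /* unused leftover */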
    • tmpfs: make tmpfs scalable with percpu_counter for used blocks · 7e496299
      Committed by Tim Chen
      The current implementation of tmpfs is not scalable.  We found that
      stat_lock is contended by multiple threads when we need to get a new page,
      leading to useless spinning inside this spin lock.
      
      This patch makes use of the percpu_counter library to maintain a local
      count of used blocks, speeding up the getting and returning of pages.
      The acquisition of stat_lock thus becomes unnecessary for getting and
      returning blocks, improving the performance of tmpfs on systems with a
      large number of CPUs.  On a 4 socket 32 core NHM-EX system, we saw an
      improvement of 270%.
      
      The implementation below has a slight chance of a race between threads
      causing a slight overshoot of the maximum configured blocks.  However,
      any overshoot is small, and is bounded by the number of CPUs.  This
      happens when the number of used blocks is slightly below the maximum
      configured blocks when a thread checks the used block count, and
      another thread allocates the last block before the current thread
      does.  This should not be a problem for tmpfs, as the overshoot is
      most likely to be a few blocks and bounded.  If a strict limit is
      really desired, then configure the max blocks to be the limit less the
      number of CPUs in the system.
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
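      
      A sketch of the percpu_counter accounting described above (hedged: the
      2.6.35-era percpu_counter_init() took no gfp argument):
      
        struct percpu_counter used_blocks;
      
        percpu_counter_init(&used_blocks, 0);
        /* allocating a block: no stat_lock, just a cheap per-cpu add;
         * the limit check may overshoot by roughly the number of CPUs */
        if (percpu_counter_compare(&used_blocks, max_blocks) >= 0)
                return -ENOSPC;
        percpu_counter_inc(&used_blocks);
        /* ... and on freeing a block: */
        percpu_counter_dec(&used_blocks);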
    • switch shmem.c to ->evict_inode() · 1f895f75
      Committed by Al Viro
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • check ATTR_SIZE constraints in inode_change_ok · 2c27c65e
      Committed by Christoph Hellwig
      Make sure we check the truncate constraints early on in ->setattr by adding
      those checks to inode_change_ok.  Also clean up and document inode_change_ok
      to make this obvious.
      
      As a fallout we don't have to call inode_newsize_ok from simple_setsize
      and can simplify it down to a truncate_setsize which doesn't return an
      error.  This simplifies a lot of setattr implementations and means we
      use truncate_setsize almost everywhere.  Get rid of fat_setsize now
      that it's trivial and mark ext2_setsize static to make the calling
      convention obvious.
      
      Keep the inode_newsize_ok in vmtruncate for now as all callers need an
      audit for its removal anyway.
      
      Note: setattr code in ecryptfs doesn't call inode_change_ok at all and
      needs a deeper audit, but that is left for later.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
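      
      The post-patch shape of a simple ->setattr, sketched (foo_setattr is a
      hypothetical filesystem method; the prototype is the 2010 one, taking
      a dentry):
      
        static int foo_setattr(struct dentry *dentry, struct iattr *attr)
        {
                struct inode *inode = dentry->d_inode;
                int error;
      
                /* now also validates ATTR_SIZE constraints */
                error = inode_change_ok(inode, attr);
                if (error)
                        return error;
      
                if (attr->ia_valid & ATTR_SIZE)
                        truncate_setsize(inode, attr->ia_size); /* no error */
      
                setattr_copy(inode, attr);
                mark_inode_dirty(inode);
                return 0;
        }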
    • always call inode_change_ok early in ->setattr · db78b877
      Committed by Christoph Hellwig
      Make sure we call inode_change_ok before doing any changes in ->setattr,
      and make sure to call it even if our fs wants to ignore normal UNIX
      permissions, but use ATTR_FORCE to skip those checks.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
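      
      For a filesystem that deliberately ignores normal UNIX permissions,
      the rule sketched: still run inode_change_ok()'s validation, but set
      ATTR_FORCE so its permission checks are skipped:
      
        attr->ia_valid |= ATTR_FORCE;
        error = inode_change_ok(inode, attr);
        if (error)
                return error;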
    • rename generic_setattr · 6a1a90ad
      Committed by Christoph Hellwig
      Despite its name it's not a generic implementation of ->setattr, but
      rather a helper to copy attributes from a struct iattr to the inode.
      Rename it to setattr_copy to reflect this fact.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  13. 05 Jun 2010, 1 commit
    • fix truncate inode time modification breakage · af5a30d8
      Committed by Nick Piggin
      mtime and ctime should be changed only if the file size has actually
      changed.  The patches changing ext2 and tmpfs from vmtruncate to the
      new truncate sequence have caused regressions where they always update
      the timestamps.
      
      There are some strange cases in POSIX where truncate(2) must not update
      times unless the size has actually changed; see 6e656be8.
      
      This area is all still rather buggy in different ways in a lot of
      filesystems and needs a cleanup and audit (ideally the vfs will provide
      a simple attribute or call to direct all filesystems exactly which
      attributes to change). But coming up with the best solution will take a
      while and is not appropriate for rc anyway.
      
      So fix recent regression for now.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
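      
      The rule being restored, sketched (CURRENT_TIME was the idiom of the
      day):
      
        if ((attr->ia_valid & ATTR_SIZE) && attr->ia_size != inode->i_size) {
                /* only a real size change updates the timestamps */
                inode->i_mtime = inode->i_ctime = CURRENT_TIME;
        }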
  14. 28 May 2010, 3 commits
  15. 25 May 2010, 1 commit
  16. 22 May 2010, 2 commits
  17. 17 Dec 2009, 6 commits
  18. 16 Dec 2009, 1 commit
  19. 28 Sep 2009, 1 commit
  20. 26 Sep 2009, 1 commit
  21. 22 Sep 2009, 2 commits
    • shmem: initialize struct shmem_sb_info to zero · 425fbf04
      Committed by Pekka Enberg
      Fixes the following kmemcheck false positive (the compiler is using
      a 32-bit mov to load the 16-bit sbinfo->mode in shmem_fill_super):
      
      [    0.337000] Total of 1 processors activated (3088.38 BogoMIPS).
      [    0.352000] CPU0 attaching NULL sched-domain.
      [    0.360000] WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (9f8020fc)
      [    0.361000] a44240820000000041f6998100000000000000000000000000000000ff030000
      [    0.368000]  i i i i i i i i i i i i i i i i u u u u i i i i i i i i i i u u
      [    0.375000]                                                          ^
      [    0.376000]
      [    0.377000] Pid: 9, comm: khelper Not tainted (2.6.31-tip #206) P4DC6
      [    0.378000] EIP: 0060:[<810a3a95>] EFLAGS: 00010246 CPU: 0
      [    0.379000] EIP is at shmem_fill_super+0xb5/0x120
      [    0.380000] EAX: 00000000 EBX: 9f845400 ECX: 824042a4 EDX: 8199f641
      [    0.381000] ESI: 9f8020c0 EDI: 9f845400 EBP: 9f81af68 ESP: 81cd6eec
      [    0.382000]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
      [    0.383000] CR0: 8005003b CR2: 9f806200 CR3: 01ccd000 CR4: 000006d0
      [    0.384000] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
      [    0.385000] DR6: ffff4ff0 DR7: 00000400
      [    0.386000]  [<810c25fc>] get_sb_nodev+0x3c/0x80
      [    0.388000]  [<810a3514>] shmem_get_sb+0x14/0x20
      [    0.390000]  [<810c207f>] vfs_kern_mount+0x4f/0x120
      [    0.392000]  [<81b2849e>] init_tmpfs+0x7e/0xb0
      [    0.394000]  [<81b11597>] do_basic_setup+0x17/0x30
      [    0.396000]  [<81b11907>] kernel_init+0x57/0xa0
      [    0.398000]  [<810039b7>] kernel_thread_helper+0x7/0x10
      [    0.400000]  [<ffffffff>] 0xffffffff
      [    0.402000] khelper used greatest stack depth: 2820 bytes left
      [    0.407000] calling  init_mmap_min_addr+0x0/0x10 @ 1
      [    0.408000] initcall init_mmap_min_addr+0x0/0x10 returned 0 after 0 usecs
      Reported-by: Ingo Molnar <mingo@elte.hu>
      Analysed-by: Vegard Nossum <vegard.nossum@gmail.com>
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
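      
      The fix pattern, sketched: allocate the shmem_sb_info zeroed, so no
      field (nor the padding around the 16-bit mode) reads as uninitialized:
      
        sbinfo = kzalloc(sizeof(struct shmem_sb_info), GFP_KERNEL);
        if (!sbinfo)
                return -ENOMEM;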
    • tmpfs: depend on shmem · 3f96b79a
      Committed by Hugh Dickins
      CONFIG_SHMEM off gives you (ramfs masquerading as) tmpfs, even when
      CONFIG_TMPFS is off: that's a little anomalous, and I'd intended to make
      more sense of it by removing CONFIG_TMPFS altogether, always enabling its
      code when CONFIG_SHMEM; but so many defconfigs have CONFIG_SHMEM on and
      CONFIG_TMPFS off that we'd better leave that as is.
      
      But there is no point in asking for CONFIG_TMPFS if CONFIG_SHMEM is off:
      make TMPFS depend on SHMEM, which also prevents TMPFS_POSIX_ACL's
      shmem_acl.o from being pointlessly built into the kernel when SHMEM is
      off.
      
      And a selfish change, to prevent the world from being rebuilt when I
      switch between CONFIG_SHMEM on and off: the only CONFIG_SHMEM in the
      header files is mm.h's shmem_lock() - give that a shmem.c stub instead
      (sketched below).
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Matt Mackall <mpm@selenic.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
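      
      A hedged sketch of the shmem_lock() stub move: mm.h keeps one
      unconditional declaration, and the tiny !CONFIG_SHMEM build of
      mm/shmem.c supplies the trivial implementation:
      
        #ifndef CONFIG_SHMEM
        int shmem_lock(struct file *file, int lock, struct user_struct *user)
        {
                return 0;
        }
        #endif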