1. 01 Mar, 2011 1 commit
  2. 07 Jan, 2011 1 commit
    • fs: icache RCU free inodes · fa0d7e3d
      Nick Piggin committed
      RCU free the struct inode. This will allow:
      
      - Subsequent store-free path walking patch. The inode must be consulted for
        permissions when walking, so an RCU inode reference is a must.
      - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
        to take i_lock no longer need to take sb_inode_list_lock to walk the list in
        the first place. This will simplify and optimize locking.
      - Could remove some nested trylock loops in dcache code
      - Could potentially simplify things a bit in VM land. Do not need to take the
        page lock to follow page->mapping.
      
      The downside of this is the performance cost of using RCU. In a simple
      creat/unlink microbenchmark, performance drops by about 10% due to inability to
      reuse cache-hot slab objects. As iterations increase and RCU freeing starts
      kicking over, this increases to about 20%.
      
      In cases where inode lifetimes are longer (ie. many inodes may be allocated
      during the average life span of a single inode), a lot of this cache reuse is
      not applicable, so the regression caused by this patch is smaller.
      
      The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
      however this adds some complexity to list walking and store-free path walking,
      so I prefer to implement this at a later date, if it is shown to be a win in
      real situations. I haven't found a regression in any non-micro benchmark so I
      doubt it will be a problem.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
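The cache-hot reuse cost described in this commit message can be illustrated with a toy userspace model. This is only a sketch under stated assumptions: `toy_pool`, `toy_alloc`, `toy_free_now`, `toy_free_deferred` and `toy_grace_period` are hypothetical names, not kernel APIs, and a real RCU grace period ends when readers have quiesced rather than on an explicit call.

```c
#include <stddef.h>

#define POOL_SIZE 8

struct toy_obj {
    struct toy_obj *next;
    int payload;
};

struct toy_pool {
    struct toy_obj slots[POOL_SIZE];
    struct toy_obj *free_list;  /* LIFO: most recently freed object first */
    struct toy_obj *pending;    /* freed, but waiting for a grace period */
};

static void toy_pool_init(struct toy_pool *p)
{
    p->free_list = NULL;
    p->pending = NULL;
    for (int i = POOL_SIZE - 1; i >= 0; i--) {
        p->slots[i].next = p->free_list;
        p->free_list = &p->slots[i];
    }
}

/* Pop the head of the free list (the most recently freed object). */
static struct toy_obj *toy_alloc(struct toy_pool *p)
{
    struct toy_obj *o = p->free_list;
    if (o)
        p->free_list = o->next;
    return o;
}

/* Immediate free: the object is reusable (and likely cache-hot) at once. */
static void toy_free_now(struct toy_pool *p, struct toy_obj *o)
{
    o->next = p->free_list;
    p->free_list = o;
}

/* Deferred free: the object is parked until the grace period has passed. */
static void toy_free_deferred(struct toy_pool *p, struct toy_obj *o)
{
    o->next = p->pending;
    p->pending = o;
}

/* Stand-in for the end of a grace period: release all parked objects. */
static void toy_grace_period(struct toy_pool *p)
{
    while (p->pending) {
        struct toy_obj *o = p->pending;
        p->pending = o->next;
        toy_free_now(p, o);
    }
}
```

With `toy_free_now`, the next `toy_alloc` hands back the object that was just freed. With `toy_free_deferred`, the allocator has to pick a different, colder object until `toy_grace_period` runs, which is the effect the creat/unlink microbenchmark measures.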
  3. 29 Oct, 2010 1 commit
  4. 26 Oct, 2010 3 commits
  5. 18 Aug, 2010 1 commit
  6. 10 Aug, 2010 6 commits
    • shmem: reduce pagefault lock contention · ff36b801
      Shaohua Li committed
      I'm running a shmem pagefault test case (see attached file) on a 64 CPU
      system.  The profile shows that shmem_inode_info->lock is heavily contended
      and 100% of CPU time is spent trying to get the lock.  In the pagefault (no
      swap) case, shmem_getpage takes the lock twice; the second acquisition is
      avoidable if we preallocate a page, which saves one round of locking.  This
      is what the patch below does.
      
      The result of the test case:
      2.6.35-rc3: ~20s
      2.6.35-rc3 + patch: ~12s
      so this is a 40% improvement.
      
      One might argue that we could have better locking for shmem.  But even if
      shmem were lockless, the pagefault path would soon find the pagecache lock
      heavily contended, because shmem must add the new page to the pagecache.  So
      until we have better locking for the pagecache, improving shmem locking
      does not gain much.  I ran a similar pagefault test against a ramfs file;
      the result was ~10.5s.
      
      [akpm@linux-foundation.org: fix comment, clean up code layout, eliminate code duplication]
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Zhang, Yanmin" <yanmin.zhang@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: make tmpfs scalable with percpu_counter for used blocks · 7e496299
      Tim Chen committed
      The current implementation of tmpfs is not scalable.  We found that
      stat_lock is contended by multiple threads when we need to get a new page,
      leading to useless spinning inside this spin lock.
      
      This patch makes use of the percpu_counter library to maintain local count
      of used blocks to speed up getting and returning of pages.  So the
      acquisition of stat_lock is unnecessary for getting and returning blocks,
      improving the performance of tmpfs on systems with a large number of CPUs.
      On a 4 socket 32 core NHM-EX system, we saw an improvement of 270%.
      
      The implementation below has a slight chance of race between threads
      causing a slight overshoot of the maximum configured blocks.  However, any
      overshoot is small, and is bounded by the number of cpus.  This happens
      when the number of used blocks is slightly below the maximum configured
      blocks when a thread checks the used block count, and another thread
      allocates the last block before the current thread does.  This should not
      be a problem for tmpfs, as the overshoot is most likely to be a few blocks
      and bounded.  If a strict limit is really desired, then configure the max
      blocks to be the limit less the number of CPUs in the system.
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
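The check-then-account race described in this commit message can be modelled in a few lines of userspace C. This is a single-threaded sketch under stated assumptions: `toy_counter`, `toy_may_alloc` and `toy_account_block` are hypothetical names, the "CPUs" are just array slots, and the kernel's percpu_counter library is not used here. The interleaving is driven by hand to show why the overshoot stays bounded by the number of CPUs.

```c
#include <stdbool.h>

#define TOY_NR_CPUS 4

struct toy_counter {
    long global;               /* total that has been folded in */
    long local[TOY_NR_CPUS];   /* per-CPU deltas not yet folded in */
    long batch;                /* fold a local delta once it reaches this */
};

/* Add n on behalf of one CPU; only the fold step touches shared state. */
static void toy_counter_add(struct toy_counter *c, int cpu, long n)
{
    c->local[cpu] += n;
    if (c->local[cpu] >= c->batch || c->local[cpu] <= -c->batch) {
        c->global += c->local[cpu];
        c->local[cpu] = 0;
    }
}

/* Exact value: folded total plus every outstanding per-CPU delta. */
static long toy_counter_sum(const struct toy_counter *c)
{
    long sum = c->global;
    for (int i = 0; i < TOY_NR_CPUS; i++)
        sum += c->local[i];
    return sum;
}

/* Step 1 of an allocation: check the configured limit. */
static bool toy_may_alloc(const struct toy_counter *c, long max_blocks)
{
    return toy_counter_sum(c) < max_blocks;
}

/* Step 2: account the block. It is not atomic with step 1, so several
 * CPUs can pass the check before any of them has accounted its block. */
static void toy_account_block(struct toy_counter *c, int cpu)
{
    toy_counter_add(c, cpu, 1);
}
```

If the counter sits one block below the limit and every CPU runs `toy_may_alloc` before any of them runs `toy_account_block`, all of them succeed, and the final count exceeds the limit by at most one block per CPU. That is why the message suggests configuring the maximum as the desired limit less the number of CPUs when a strict limit matters.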
    • switch shmem.c to ->evict_inode() · 1f895f75
      Al Viro committed
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • check ATTR_SIZE constraints in inode_change_ok · 2c27c65e
      Christoph Hellwig committed
      Make sure we check the truncate constraints early on in ->setattr by adding
      those checks to inode_change_ok.  Also clean up and document inode_change_ok
      to make this obvious.
      
      As a fallout we don't have to call inode_newsize_ok from simple_setsize and
      simplify it down to a truncate_setsize which doesn't return an error.  This
      simplifies a lot of setattr implementations and means we use truncate_setsize
      almost everywhere.  Get rid of fat_setsize now that it's trivial and mark
      ext2_setsize static to make the calling convention obvious.
      
      Keep the inode_newsize_ok in vmtruncate for now as all callers need an
      audit for its removal anyway.
      
      Note: setattr code in ecryptfs doesn't call inode_change_ok at all and
      needs a deeper audit, but that is left for later.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • always call inode_change_ok early in ->setattr · db78b877
      Christoph Hellwig committed
      Make sure we call inode_change_ok before doing any changes in ->setattr,
      and make sure to call it even if our fs wants to ignore normal UNIX
      permissions, but use the ATTR_FORCE to skip those.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
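The ordering described in the two setattr commits above (validate everything first, including size constraints, then apply changes that cannot fail) can be sketched with a toy model. Everything here is hypothetical and for illustration only: `toy_inode`, `toy_iattr`, the `TOY_ATTR_*` flags, `toy_change_ok` and `toy_setattr` are stand-ins, not the kernel's structures or functions, and the permission rule is reduced to an owner comparison.

```c
#include <errno.h>

#define TOY_ATTR_MODE  0x1u
#define TOY_ATTR_SIZE  0x2u
#define TOY_ATTR_FORCE 0x4u   /* skip permission checks only */

struct toy_inode {
    unsigned mode;
    long size;
    long max_size;
    unsigned owner;
};

struct toy_iattr {
    unsigned valid;    /* which fields below are requested */
    unsigned mode;
    long size;
    unsigned caller;
};

/* All checks live here and run before anything is modified. */
static int toy_change_ok(const struct toy_inode *inode,
                         const struct toy_iattr *attr)
{
    /* Size constraints are checked first and apply even when forced. */
    if (attr->valid & TOY_ATTR_SIZE) {
        if (attr->size < 0)
            return -EINVAL;
        if (attr->size > inode->max_size)
            return -EFBIG;
    }
    /* A forced change skips the permission checks, nothing else. */
    if (attr->valid & TOY_ATTR_FORCE)
        return 0;
    if ((attr->valid & TOY_ATTR_MODE) && attr->caller != inode->owner)
        return -EPERM;
    return 0;
}

static int toy_setattr(struct toy_inode *inode, const struct toy_iattr *attr)
{
    int err = toy_change_ok(inode, attr);
    if (err)
        return err;                 /* inode is still untouched */
    if (attr->valid & TOY_ATTR_SIZE)
        inode->size = attr->size;   /* cannot fail after the check */
    if (attr->valid & TOY_ATTR_MODE)
        inode->mode = attr->mode;   /* plain attribute copy */
    return 0;
}
```

Because every failure is reported before the first assignment, a rejected request never leaves the inode half-updated, and the size-setting step needs no error return of its own.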
    • rename generic_setattr · 6a1a90ad
      Christoph Hellwig committed
      Despite its name it's not a generic implementation of ->setattr, but
      rather a helper to copy attributes from a struct iattr to the inode.
      Rename it to setattr_copy to reflect this fact.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  7. 05 Jun, 2010 1 commit
    • fix truncate inode time modification breakage · af5a30d8
      Nick Piggin committed
      mtime and ctime should be changed only if the file size has actually
      changed. Patches changing ext2 and tmpfs from vmtruncate to the new truncate
      sequence have caused regressions where they always update timestamps.
      
      There are some strange cases in POSIX where truncate(2) must not update
      times unless the size has actually changed, see 6e656be8.
      
      This area is all still rather buggy in different ways in a lot of
      filesystems and needs a cleanup and audit (ideally the vfs will provide
      a simple attribute or call to direct all filesystems exactly which
      attributes to change). But coming up with the best solution will take a
      while and is not appropriate for rc anyway.
      
      So fix recent regression for now.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
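The rule stated at the top of this commit message, that mtime and ctime change only when the size actually changes, amounts to one comparison before the update. The following is a minimal sketch under stated assumptions: `toy_file` and `toy_truncate` are hypothetical names, and timestamps are plain integers rather than kernel time values.

```c
struct toy_file {
    long size;
    long mtime;
    long ctime;
};

/* Set a new size; touch the timestamps only if the size really changes. */
static void toy_truncate(struct toy_file *f, long new_size, long now)
{
    if (new_size == f->size)
        return;             /* no-op truncate: leave mtime/ctime alone */
    f->size = new_size;
    f->mtime = now;
    f->ctime = now;
}
```

The regression described above corresponds to dropping the early return, so that the timestamps are updated on every call.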
  8. 28 May, 2010 3 commits
  9. 25 May, 2010 1 commit
  10. 22 May, 2010 2 commits
  11. 17 Dec, 2009 6 commits
  12. 16 Dec, 2009 1 commit
  13. 28 Sep, 2009 1 commit
  14. 26 Sep, 2009 1 commit
  15. 22 Sep, 2009 4 commits
    • shmem: initialize struct shmem_sb_info to zero · 425fbf04
      Pekka Enberg committed
      Fixes the following kmemcheck false positive (the compiler is using
      a 32-bit mov to load the 16-bit sbinfo->mode in shmem_fill_super):
      
      [    0.337000] Total of 1 processors activated (3088.38 BogoMIPS).
      [    0.352000] CPU0 attaching NULL sched-domain.
      [    0.360000] WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (9f8020fc)
      [    0.361000] a44240820000000041f6998100000000000000000000000000000000ff030000
      [    0.368000]  i i i i i i i i i i i i i i i i u u u u i i i i i i i i i i u u
      [    0.375000]                                                          ^
      [    0.376000]
      [    0.377000] Pid: 9, comm: khelper Not tainted (2.6.31-tip #206) P4DC6
      [    0.378000] EIP: 0060:[<810a3a95>] EFLAGS: 00010246 CPU: 0
      [    0.379000] EIP is at shmem_fill_super+0xb5/0x120
      [    0.380000] EAX: 00000000 EBX: 9f845400 ECX: 824042a4 EDX: 8199f641
      [    0.381000] ESI: 9f8020c0 EDI: 9f845400 EBP: 9f81af68 ESP: 81cd6eec
      [    0.382000]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
      [    0.383000] CR0: 8005003b CR2: 9f806200 CR3: 01ccd000 CR4: 000006d0
      [    0.384000] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
      [    0.385000] DR6: ffff4ff0 DR7: 00000400
      [    0.386000]  [<810c25fc>] get_sb_nodev+0x3c/0x80
      [    0.388000]  [<810a3514>] shmem_get_sb+0x14/0x20
      [    0.390000]  [<810c207f>] vfs_kern_mount+0x4f/0x120
      [    0.392000]  [<81b2849e>] init_tmpfs+0x7e/0xb0
      [    0.394000]  [<81b11597>] do_basic_setup+0x17/0x30
      [    0.396000]  [<81b11907>] kernel_init+0x57/0xa0
      [    0.398000]  [<810039b7>] kernel_thread_helper+0x7/0x10
      [    0.400000]  [<ffffffff>] 0xffffffff
      [    0.402000] khelper used greatest stack depth: 2820 bytes left
      [    0.407000] calling  init_mmap_min_addr+0x0/0x10 @ 1
      [    0.408000] initcall init_mmap_min_addr+0x0/0x10 returned 0 after 0 usecs
      Reported-by: Ingo Molnar <mingo@elte.hu>
      Analysed-by: Vegard Nossum <vegard.nossum@gmail.com>
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
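The situation in this report, a 16-bit field loaded with a 32-bit move that also reads neighbouring bytes nobody wrote, goes away when the whole object is zeroed at allocation time. The sketch below is a userspace analogue under stated assumptions: `toy_sb_info` is a hypothetical struct (not the real shmem_sb_info layout), and `calloc` stands in for a zeroing kernel allocator such as `kzalloc`.

```c
#include <stdlib.h>
#include <stddef.h>

struct toy_sb_info {
    unsigned long max_blocks;
    unsigned short mode;   /* 16-bit field; padding bytes may follow it */
    unsigned int uid;
};

/* Zeroing allocation: every byte, padding included, starts out as 0, so a
 * wider-than-declared load of `mode` never reads uninitialized memory. */
static struct toy_sb_info *toy_sb_info_alloc(void)
{
    return calloc(1, sizeof(struct toy_sb_info));
}

/* Helper for checking: returns 1 if all n bytes at p are zero. */
static int toy_all_zero(const void *p, size_t n)
{
    const unsigned char *b = p;
    for (size_t i = 0; i < n; i++)
        if (b[i] != 0)
            return 0;
    return 1;
}
```

Assigning each named field individually would not help here, because padding between `mode` and `uid` belongs to no field and would stay uninitialized.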
    • tmpfs: depend on shmem · 3f96b79a
      Hugh Dickins committed
      CONFIG_SHMEM off gives you (ramfs masquerading as) tmpfs, even when
      CONFIG_TMPFS is off: that's a little anomalous, and I'd intended to make
      more sense of it by removing CONFIG_TMPFS altogether, always enabling its
      code when CONFIG_SHMEM; but so many defconfigs have CONFIG_SHMEM on
      CONFIG_TMPFS off that we'd better leave that as is.
      
      But there is no point in asking for CONFIG_TMPFS if CONFIG_SHMEM is off:
      make TMPFS depend on SHMEM, which also prevents TMPFS_POSIX_ACL
      shmem_acl.o being pointlessly built into the kernel when SHMEM is off.
      
      And a selfish change, to prevent the world from being rebuilt when I
      switch between CONFIG_SHMEM on and off: the only CONFIG_SHMEM in the
      header files is mm.h shmem_lock() - give that a shmem.c stub instead.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Matt Mackall <mpm@selenic.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: includecheck fix for mm/shmem.c · cff397e6
      Jaswinder Singh Rajput committed
      Fix the following 'make includecheck' warning:
      
        mm/shmem.c: linux/vfs.h is included more than once.
      Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: add_to_swap_cache() does not return -EEXIST · 2ca4532a
      Daisuke Nishimura committed
      After commit 355cfa73 ("mm: modify swap_map and add SWAP_HAS_CACHE flag"),
      only a context that has set the SWAP_HAS_CACHE flag via swapcache_prepare()
      or get_swap_page() calls add_to_swap_cache().  So add_to_swap_cache()
      no longer returns -EEXIST.
      
      Even though it doesn't return -EEXIST, it's not good behavior conceptually
      to call swapcache_prepare() in the -EEXIST case, because it means clearing
      SWAP_HAS_CACHE flag while the entry is on swap cache.
      
      This patch removes redundant code and comments from its callers, and adds
      a VM_BUG_ON() in the error path of add_to_swap_cache() along with some comments.
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 16 Sep, 2009 3 commits
    • HWPOISON: Enable .remove_error_page for migration aware file systems · aa261f54
      Andi Kleen committed
      Enable removing of corrupted pages through truncation
      for a bunch of file systems: ext*, xfs, gfs2, ocfs2, ntfs
      These should cover most server needs.
      
      I chose the set of migration aware file systems for this
      for now, assuming they have been especially audited.
      But in general it should be safe for all file systems
      on the data area that support read/write and truncate.
      
      Caveat: the hardware error handler does not take i_mutex
      for now before calling the truncate function. Is that ok?
      
      Cc: tytso@mit.edu
      Cc: hch@infradead.org
      Cc: mfasheh@suse.com
      Cc: aia21@cantab.net
      Cc: hugh.dickins@tiscali.co.uk
      Cc: swhiteho@redhat.com
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
    • HWPOISON: shmem: call set_page_dirty() with locked page · 6746aff7
      Wu Fengguang committed
      The dirtying of page and set_page_dirty() can be moved into the page lock.
      
      - In shmem_write_end(), the page was dirtied while the page lock was held,
        but it's being marked dirty just after dropping the page lock.
      - In shmem_symlink(), both dirtying and marking can be moved into page lock.
      
      It's valuable for the hwpoison code to know whether one bad page can be dropped
      without losing data. It mainly judges by testing the PG_dirty bit after taking
      the page lock. So it becomes important that the dirtying of page and the
      marking of dirtiness are both done inside the page lock. Which is a common
      practice, but sadly not a rule.
      
      The noticeable exceptions are
      - mapped pages
      - pages with buffer_heads
      The above pages could go dirty at any time. Fortunately the hwpoison will
      unmap the page and release the buffer_heads beforehand anyway.
      
      Many other types of pages (eg. metadata pages) can also be dirtied at will by
      their owners, the hwpoison code cannot do meaningful things to them anyway.
      Only the dirtiness of pagecache pages owned by regular files is of interest.
      
      v2: AK: Add comment about set_page_dirty rules (suggested by Peter Zijlstra)
      Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
    • Driver Core: devtmpfs - kernel-maintained tmpfs-based /dev · 2b2af54a
      Kay Sievers committed
      Devtmpfs lets the kernel create a tmpfs instance called devtmpfs
      very early at kernel initialization, before any driver-core device
      is registered. Every device with a major/minor will provide a
      device node in devtmpfs.
      
      Devtmpfs can be changed and altered by userspace at any time,
      and in any way needed - just like today's udev-mounted tmpfs.
      Unmodified udev versions will run just fine on top of it, and will
      recognize an already existing kernel-created device node and use it.
      The default node permissions are root:root 0600. Proper permissions
      and user/group ownership, meaningful symlinks, all other policy still
      needs to be applied by userspace.
      
      If a node is created by devtmpfs, devtmpfs will remove the device node
      when the device goes away. If the device node was created by
      userspace, or the devtmpfs created node was replaced by userspace, it
      will no longer be removed by devtmpfs.
      
      If it is requested to auto-mount it, it makes init=/bin/sh work
      without any further userspace support. /dev will be fully populated
      and dynamic, and always reflect the current device state of the kernel.
      With the commonly used dynamic device numbers, it solves the problem
      where static device nodes may point to the wrong devices.
      
      It is intended to make the initial bootup logic simpler and more robust,
      by de-coupling the creation of the initial environment, to reliably run
      userspace processes, from a complex userspace bootstrap logic to provide
      a working /dev.
      Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: Jan Blunck <jblunck@suse.de>
      Tested-By: Harald Hoyer <harald@redhat.com>
      Tested-By: Scott James Remnant <scott@ubuntu.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  17. 09 Sep, 2009 1 commit
  18. 25 Jun, 2009 1 commit
  19. 24 Jun, 2009 1 commit
  20. 17 Jun, 2009 1 commit