1. 23 Jul 2011, 1 commit
  2. 21 Jul 2011, 7 commits
    • fs: update the NOTE of the file_operations structure · 295cc522
      Wanlong Gao authored
      The Big Kernel Lock has been removed, and setlease now uses lock_flocks()
      to take the special spinlock file_lock_lock (by Matthew).
      So just remove the out-of-date NOTE.
      Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      295cc522
    • fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers · 02c24a82
      Josef Bacik authored
      Btrfs needs to be able to control how filemap_write_and_wait_range() is called
      in fsync to make it less of a painful operation, so push the taking of i_mutex
      and the call to filemap_write_and_wait() down into the ->fsync() handlers.  Some
      file systems can apparently drop i_mutex altogether, such as ext3 and ocfs2.
      For correctness' sake everything is pushed down unchanged in all cases, so that
      current behavior stays the same for everybody; each individual fs maintainer can
      then decide what to do from there.
      Thanks,
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      02c24a82
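      A minimal sketch (not the actual patch) of what a simple ->fsync() handler
      looks like after this change, assuming the new (start, end, datasync)
      prototype from this commit; example_commit_metadata() is a hypothetical
      placeholder:

          static int example_fsync(struct file *file, loff_t start, loff_t end,
                                   int datasync)
          {
                  struct inode *inode = file->f_mapping->host;
                  int ret;

                  /* previously done by the VFS before calling ->fsync() */
                  ret = filemap_write_and_wait_range(file->f_mapping, start, end);
                  if (ret)
                          return ret;

                  mutex_lock(&inode->i_mutex);
                  ret = example_commit_metadata(inode, datasync); /* hypothetical */
                  mutex_unlock(&inode->i_mutex);

                  return ret;
          }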
    • fs: add SEEK_HOLE and SEEK_DATA flags · 982d8165
      Josef Bacik authored
      This just gets us ready to support the SEEK_HOLE and SEEK_DATA flags.  It turns
      out that using fiemap in tools like cp causes more problems than it solves, so
      let's try to give userspace an interface that doesn't suck.  We need to match
      Solaris here, and the definitions are:
      
        o  If whence is SEEK_HOLE, the offset of the start of the
           next hole greater than or equal to the supplied offset
           is returned. The definition of a hole is provided near
           the end of the DESCRIPTION.

        o  If whence is SEEK_DATA, the file pointer is set to the
           start of the next non-hole file region greater than or
           equal to the supplied offset.
      
      So in the generic case the entire file is data and there is a virtual hole at
      the end: we return i_size for SEEK_HOLE and the supplied offset unchanged for
      SEEK_DATA.  This is how Solaris does it, so we have to do it the same way.
      
      Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      982d8165
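      A minimal sketch, under the assumptions above, of the generic default
      behaviour (whole file is data, virtual hole at EOF); illustrative only,
      not the exact generic llseek change:

          static loff_t example_llseek_hole_data(struct file *file, loff_t offset,
                                                 int whence)
          {
                  loff_t size = i_size_read(file->f_mapping->host);

                  if (offset >= size)
                          return -ENXIO;          /* past EOF: no data, no hole */

                  switch (whence) {
                  case SEEK_HOLE:
                          return size;            /* the virtual hole at EOF */
                  case SEEK_DATA:
                          return offset;          /* everything before EOF is data */
                  default:
                          return -EINVAL;
                  }
          }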
    • fs: simplify the blockdev_direct_IO prototype · aacfc19c
      Christoph Hellwig authored
      Simple filesystems always pass inode->i_sb->s_bdev as the block device
      argument, and never need an end_io handler.  Let's simplify things for
      them and for my grepping activity by dropping these arguments.  The
      only thing not falling into that scheme is ext4, which passes an
      end_io handler without needing special flags (yet), but given how
      messy the direct I/O code there is, using __blockdev_direct_IO
      in one instead of two out of three cases isn't going to make a large
      difference anyway.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      aacfc19c
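      A minimal sketch of how a simple filesystem's ->direct_IO ends up looking
      with the trimmed helper (no block device argument, no end_io callback),
      assuming the 2011-era direct_IO prototype; example_get_block is a
      hypothetical placeholder:

          static ssize_t example_direct_IO(int rw, struct kiocb *iocb,
                                           const struct iovec *iov, loff_t offset,
                                           unsigned long nr_segs)
          {
                  struct inode *inode = iocb->ki_filp->f_mapping->host;

                  /* the helper now derives the bdev from inode->i_sb->s_bdev */
                  return blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
                                            example_get_block /* hypothetical */);
          }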
    • fs: kill i_alloc_sem · bd5fe6c5
      Christoph Hellwig authored
      i_alloc_sem is a rather special rw_semaphore.  It's the last one that may
      be released by a non-owner, and its write side is always mirrored by
      real exclusion.  Its intended use is to wait for all pending direct I/O
      requests to finish before starting a truncate.
      
      Replace it with a hand-grown construct:
      
       - exclusion for truncates is already guaranteed by i_mutex, so it can
         simply fall away
       - the reader side is replaced by an i_dio_count member in struct inode
         that counts the number of pending direct I/O requests.  Truncate can't
         proceed as long as it's non-zero
       - when i_dio_count reaches zero we wake up a pending truncate using
         wake_up_bit on a new bit in i_flags
       - new references to i_dio_count can't appear while we are waiting for
         it to reach zero because the direct I/O count always needs i_mutex
         (or an equivalent like XFS's i_iolock) for starting a new operation.
      
      This scheme is much simpler, and saves the space of a spinlock_t and a
      struct list_head in struct inode (typically 160 bits on a non-debug 64-bit
      system).
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      bd5fe6c5
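      A minimal sketch of the i_dio_count scheme described above; the wake-up bit
      name, its home in i_state, and the sleep helper are assumptions, not
      necessarily what the patch actually merged:

          static int example_dio_sleep(void *word)
          {
                  io_schedule();
                  return 0;
          }

          static void example_dio_begin(struct inode *inode)
          {
                  atomic_inc(&inode->i_dio_count);        /* one more pending DIO */
          }

          static void example_dio_end(struct inode *inode)
          {
                  if (atomic_dec_and_test(&inode->i_dio_count))
                          wake_up_bit(&inode->i_state, __I_DIO_WAKEUP); /* assumed bit */
          }

          /* truncate path: runs under i_mutex, so no new DIO can start */
          static void example_dio_wait(struct inode *inode)
          {
                  while (atomic_read(&inode->i_dio_count))
                          wait_on_bit(&inode->i_state, __I_DIO_WAKEUP,
                                      example_dio_sleep, TASK_UNINTERRUPTIBLE);
          }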
    • superblock: add filesystem shrinker operations · 0e1fdafd
      Dave Chinner authored
      Now that we have a per-superblock shrinker implementation, we can add a
      filesystem-specific callout to it to allow filesystem-internal
      caches to be shrunk by the superblock shrinker.

      Rather than perpetuate the multipurpose shrinker callback API (i.e.
      nr_to_scan == 0 meaning "tell me how many objects are freeable in the
      cache"), two operations will be added. The first will return the
      number of objects that are freeable, the second is the actual
      shrinker call.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      0e1fdafd
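      A minimal sketch of the two-operation split in super_operations; the exact
      hook names and the counting/pruning helpers below are assumptions for
      illustration:

          static int example_nr_cached_objects(struct super_block *sb)
          {
                  /* report how many fs-internal objects could be freed */
                  return example_fs_cache_count(sb);      /* hypothetical */
          }

          static void example_free_cached_objects(struct super_block *sb, int nr)
          {
                  /* actually reclaim up to nr objects from the fs caches */
                  example_fs_cache_prune(sb, nr);         /* hypothetical */
          }

          static const struct super_operations example_super_ops = {
                  .nr_cached_objects      = example_nr_cached_objects,
                  .free_cached_objects    = example_free_cached_objects,
          };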
    • superblock: introduce per-sb cache shrinker infrastructure · b0d40c92
      Dave Chinner authored
      With context-based shrinkers, we can implement a per-superblock
      shrinker that shrinks the caches attached to the superblock. We
      currently have global shrinkers for the inode and dentry caches that
      split up into per-superblock operations via a coarse proportioning
      method that does not batch very well.  The global shrinkers also
      have a dependency - dentries pin inodes - so we have to be very
      careful about how we register the global shrinkers so that the
      implicit call order is always correct.

      With a per-sb shrinker callout, we can encode this dependency
      directly into the per-sb shrinker, hence avoiding the need for
      strictly ordering shrinker registrations. We also have no need for
      any proportioning code, because the shrinker subsystem already provides
      this functionality across all shrinkers. Allowing the shrinker to
      operate on a single superblock at a time means that we do fewer
      superblock list traversals and less locking, and reclaim should batch
      more effectively. This should result in less CPU overhead for reclaim
      and potentially faster reclaim of items from each filesystem.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      b0d40c92
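      A minimal sketch of a per-sb shrinker callback that encodes the
      "dentries pin inodes" ordering; the s_shrink field, the per-sb pruning
      helpers, and the counting helper are assumptions about the post-series
      interfaces, not a quote of the patch:

          static int example_prune_super(struct shrinker *shrink,
                                         struct shrink_control *sc)
          {
                  struct super_block *sb =
                          container_of(shrink, struct super_block, s_shrink);

                  if (!sc->nr_to_scan)
                          return example_count_sb_objects(sb);    /* hypothetical */

                  /* prune dentries first so the inodes they pin become freeable */
                  prune_dcache_sb(sb, sc->nr_to_scan / 2);
                  prune_icache_sb(sb, sc->nr_to_scan / 2);
                  return 0;
          }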
  3. 20 Jul 2011, 12 commits
  4. 28 Jun 2011, 1 commit
    • mm: fix assertion mapping->nrpages == 0 in end_writeback() · 08142579
      Jan Kara authored
      Under heavy memory and filesystem load, users observe the assertion
      mapping->nrpages == 0 in end_writeback() trigger.  This can be caused by
      page reclaim reclaiming the last page from a mapping in the following
      race:
      
      	CPU0				CPU1
        ...
        shrink_page_list()
          __remove_mapping()
            __delete_from_page_cache()
              radix_tree_delete()
      					evict_inode()
      					  truncate_inode_pages()
      					    truncate_inode_pages_range()
      					      pagevec_lookup() - finds nothing
      					  end_writeback()
      					    mapping->nrpages != 0 -> BUG
              page->mapping = NULL
              mapping->nrpages--
      
      Fix the problem by doing a reliable check of mapping->nrpages under
      mapping->tree_lock in end_writeback().
      
      Analyzed by Jay <jinshan.xiong@whamcloud.com>, lost in LKML, and dug out
      by Miklos Szeredi <mszeredi@suse.de>.
      
      Cc: Jay <jinshan.xiong@whamcloud.com>
      Cc: Miklos Szeredi <mszeredi@suse.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      08142579
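      A minimal sketch of the fix described above, as a hypothetical helper used
      from end_writeback(); the point is that the nrpages check happens under
      mapping->tree_lock, the same lock __delete_from_page_cache() holds:

          static void example_check_mapping_empty(struct address_space *mapping)
          {
                  /* serializes against a racing __delete_from_page_cache() */
                  spin_lock_irq(&mapping->tree_lock);
                  BUG_ON(mapping->nrpages);
                  spin_unlock_irq(&mapping->tree_lock);
          }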
  5. 21 Jun 2011, 1 commit
    • vfs: i_state needs to be 'unsigned long' for now · 79568f5b
      Linus Torvalds authored
      Commit 13e12d14 ("vfs: reorganize 'struct inode' layout a bit")
      moved things around a bit and changed i_state to be unsigned int instead of
      unsigned long.  That was to help structure layout for the 64-bit case,
      and shrink 'struct inode' a bit (admittedly that only happened when
      spinlock debugging was on and i_flags didn't pack with i_lock).
      
      However, Meelis Roos reports that this results in unaligned exceptions
      on sparc, and it turns out that the bit-locking primitives that we use
      for the I_NEW bit want to use the bitops, which want 'unsigned long',
      not 'unsigned int'.
      
      We really should fix the bit locking code to not have that kind of
      requirement, but that's a much bigger change.  So for now, revert that
      field back to 'unsigned long' (but keep the other re-ordering changes
      from the commit that caused this).
      
      Andi points out that we have played games with this in 'struct page', so
      it's solvable with other hacks too, but since right now the struct inode
      size advantage only happens with some rare config options, it's not
      worth fighting.
      
      It _would_ be worth fixing the bitlocking code, though.  Especially
      since there is no type safety in the bitlocking code (this never caused
      any warnings, and worked fine on x86-64, because the bitlocks take a
      'void *' and x86-64 doesn't care that deeply about alignment).  So it's
      currently a very easy problem to trigger by mistake and never notice.
      Reported-by: Meelis Roos <mroos@linux.ee>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      79568f5b
  6. 09 Jun 2011, 1 commit
    • vfs: reorganize 'struct inode' layout a bit · 13e12d14
      Linus Torvalds authored
      This tries to make the 'struct inode' accesses denser in the data cache
      by moving a commonly accessed field (i_security) closer to other fields
      that are accessed often.
      
      It also makes 'i_state' just an 'unsigned int' rather than 'unsigned
      long', since we only use a few bits of that field, and moves it next to
      the existing 'i_flags' so that we potentially get better structure
      layout (although depending on config options, i_flags may already have
      packed in the same word as i_lock, so this improves packing only for the
      case of spinlock debugging).
      
      Our 'struct inode' is still way too big, and we should probably move
      some other fields around too (the acl fields in particular) for better
      data cache access density.  Other fields (like the inode hash) are
      likely to be entirely irrelevant under most loads.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      13e12d14
  7. 04 Jun 2011, 1 commit
    • more conservative S_NOSEC handling · 9e1f1de0
      Al Viro authored
      Caching "we have already removed suid/caps" was overenthusiastic as merged.
      On network filesystems we might have had suid/caps set on another client,
      silently picked up by this client on revalidate, all of that *without* clearing
      the S_NOSEC flag.
      
      AFAICS, the only reasonably sane way to deal with that is
      	* new superblock flag; unless set, S_NOSEC is not going to be set.
      	* local block filesystems set it in their ->mount() (more accurately,
      mount_bdev() does, so does btrfs ->mount(), users of mount_bdev() other than
      local block ones clear it)
      	* if any network filesystem (or a cluster one) wants to use S_NOSEC,
      it'll need to set MS_NOSEC in sb->s_flags *AND* take care to clear S_NOSEC when
      inode attribute changes are picked from other clients.
      
      It's not an earth-shattering hole (anybody that can set suid on another client
      will almost certainly be able to write to the file before doing that anyway),
      but it's a bug that needs fixing.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      9e1f1de0
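      A minimal sketch of the opt-in gate described above; the helper name is
      illustrative, while MS_NOSEC and S_NOSEC are the flags the commit talks
      about:

          static void example_mark_nosec(struct inode *inode)
          {
                  /* only cache the decision on superblocks that opted in */
                  if (inode->i_sb->s_flags & MS_NOSEC)
                          inode->i_flags |= S_NOSEC;
          }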
  8. 29 May 2011, 1 commit
    • Cache xattr security drop check for write v2 · 69b45732
      Andi Kleen authored
      Some recent benchmarking on btrfs showed that a major scaling bottleneck
      on large systems on btrfs is currently the xattr lookup on every write.

      Why xattr lookup on every write, I hear you ask?

      write wants to drop suid and security-related xattrs that could set
      capabilities for executables.  To do that it currently looks up
      security.capability on EVERY write (even for non-executables) to decide
      whether to drop it or not.

      In btrfs this causes an additional tree walk, hitting some per-filesystem
      locks and quite bad scalability. In a simple read workload on an 8S
      system I saw over 90% of CPU time in spinlocks related to that.
      
      Chris Mason tells me this is also a problem in ext4, where it hits
      the global mbcache lock.
      
      This patch adds a simple per-inode flag to avoid this problem.  We only
      do the lookup once per file, and if there is no such xattr we cache
      that decision. All xattr changes clear the flag.
      
      I also used the same flag to avoid the suid check, although
      that one is pretty cheap.
      
      A file system can also set this flag when it creates the inode,
      if it has a cheap way to do so.  This is done for some common file systems
      in followon patches.
      
      With this patch a major part of the lock contention disappears
      for btrfs. Some testing on smaller systems didn't show significant
      performance changes, but at least it helps the larger systems
      and is generally more efficient.
      
      v2: Rename is_sgid; add a file system helper.
      Cc: chris.mason@oracle.com
      Cc: josef@redhat.com
      Cc: viro@zeniv.linux.org.uk
      Cc: agruen@linbit.com
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      69b45732
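      A minimal sketch of the cached check on the write path: if the flag is set
      there is nothing to strip, so the xattr lookup is skipped entirely. The
      slow-path helper is hypothetical; IS_NOSEC()/S_NOSEC follow the naming of
      the surrounding commits:

          static int example_remove_suid(struct dentry *dentry)
          {
                  struct inode *inode = dentry->d_inode;

                  if (IS_NOSEC(inode))
                          return 0;       /* cached: no suid, no security.capability */

                  /* slow path: check mode bits and the xattr, then cache the result */
                  return example_do_remove_suid(dentry);  /* hypothetical */
          }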
  9. 27 May 2011, 2 commits
    • fs: pass exact type of data dirties to ->dirty_inode · aa385729
      Christoph Hellwig authored
      Tell the filesystem if we just updated the timestamp (I_DIRTY_SYNC) or
      anything else, so that the filesystem can track internally whether it
      needs to push out a transaction for fdatasync or not.
      
      This is just the prototype change with no user for it yet.  I plan
      to push large XFS changes for the next merge window, and getting
      this trivial infrastructure in this window would help a lot to avoid
      tree interdependencies.
      
      Also remove the incorrect comments saying that ->dirty_inode can't block.
      That was changed a long time ago, and many implementations rely on being
      able to block.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      aa385729
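      A minimal sketch of the extended hook, assuming the new (inode, flags)
      prototype this commit describes; the logging helpers are hypothetical:

          static void example_dirty_inode(struct inode *inode, int flags)
          {
                  if (flags == I_DIRTY_SYNC) {
                          /* only timestamps changed; may be skippable for fdatasync */
                          example_log_timestamp_update(inode);    /* hypothetical */
                          return;
                  }
                  example_log_inode_update(inode);                /* hypothetical */
          }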
    • fs: add field to superblock to support cleancache · 9fdfdcf1
      Dan Magenheimer authored
      This second patch of eight in this cleancache series adds a field to
      the generic superblock to squirrel away a pool identifier that is
      dynamically provided by cleancache-enabled filesystems at mount time
      to uniquely identify files and pages belonging to this mounted filesystem.
      
      Details and a FAQ can be found in Documentation/vm/cleancache.txt
      
      [v8: trivial merge conflict update]
      Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
      Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik Van Riel <riel@redhat.com>
      Cc: Jan Beulich <JBeulich@novell.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Andreas Dilger <adilger@sun.com>
      Cc: Ted Ts'o <tytso@mit.edu>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      9fdfdcf1
  10. 25 May 2011, 3 commits
    • ulimit: raise default hard ulimit on number of files to 4096 · 0ac1ee0b
      Tim Gardner authored
      Apps are increasingly using more than 1024 file descriptors.  See
      discussion in several distro bug trackers, e.g.  BugLink:
      http://bugs.launchpad.net/bugs/663090
      https://issues.rpath.com/browse/RPL-2054
      
      You don't want to raise the default soft limit, since that might break
      apps that use select(), but it's safe to raise the default hard limit;
      that way, apps that know they need lots of file descriptors can raise
      their soft limit without needing root, and without user intervention.
      
      Ubuntu is doing this with a kernel change because they have a policy of
      not changing kernel defaults in userland.
      
      While 4096 might not be enough for *all* apps, it seems to be plenty for
      the apps I've seen lately that are unhappy with 1024.
      Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
      Cc: Dan Kegel <dank@kegel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0ac1ee0b
    • mm: Convert i_mmap_lock to a mutex · 3d48ae45
      Peter Zijlstra authored
      Straightforward conversion of i_mmap_lock to a mutex.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3d48ae45
    • mm: Remove i_mmap_lock lockbreak · 97a89413
      Peter Zijlstra authored
      Hugh says:
       "The only significant loser, I think, would be page reclaim (when
        concurrent with truncation): could spin for a long time waiting for
        the i_mmap_mutex it expects would soon be dropped?"
      
      Counterpoints:
       - cpu contention makes the spin stop (need_resched())
       - zap pages should be freeing pages at a higher rate than reclaim
         ever can
      
      I think the simplification of the truncate code is definitely worth it.
      
      Effectively reverts: 2aa15890 ("mm: prevent concurrent
      unmap_mapping_range() on the same inode") and takes out the code that
      caused its problem.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      97a89413
  11. 15 May 2011, 1 commit
    • fs: remove FS_COW_FL · e1e8fb6a
      Li Zefan authored
      FS_COW_FL and FS_NOCOW_FL were newly introduced to control per-file
      COW in btrfs, but FS_NOCOW_FL is sufficient.

      The fact is we don't have a corresponding BTRFS_INODE_COW flag.

      COW is the default, and FS_NOCOW_FL can be used to switch off COW for
      a single file.
      
      If we mount btrfs with nodatacow, a newly created file will be set with
      the FS_NOCOW_FL flag. So to turn on COW for it, we can just clear the
      FS_NOCOW_FL flag.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      e1e8fb6a
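      A minimal user-space sketch of the workflow described above: turning COW
      back on for a file created under a nodatacow mount by clearing
      FS_NOCOW_FL through FS_IOC_SETFLAGS (illustrative; error handling kept
      minimal):

          #include <sys/ioctl.h>
          #include <linux/fs.h>

          int example_enable_cow(int fd)
          {
                  int flags;

                  if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0)
                          return -1;
                  flags &= ~FS_NOCOW_FL;          /* clear the no-COW bit */
                  return ioctl(fd, FS_IOC_SETFLAGS, &flags);
          }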
  12. 13 Apr 2011, 2 commits
    • vfs: Re-introduce s_uuid in the superblock · 0bba0169
      Linus Torvalds authored
      Gaah.  When commit be85bcca reverted the export of file system uuid
      via /proc/<pid>/mountinfo, it also unintentionally removed the s_uuid
      field in struct super_block.
      
      I didn't mean to do that, since filesystems have been taught to fill it
      in (and we want to keep it for future re-introduction in the mountinfo
      file).
      
      Stupid of me. This adds it back in.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0bba0169
    • Revert "vfs: Export file system uuid via /proc/<pid>/mountinfo" · be85bcca
      Linus Torvalds authored
      This reverts commit 93f1c20b.
      
      It turns out that libmount misparses it because the uuid string contains
      '-' characters, which libmount then confuses with the separator string
      (" - ") at the end of all the optional arguments.
      
      Upstream libmount (in the util-linux tree) has been fixed, but until
      that fix actually percolates up to users, we'd better not expose this
      change in the kernel.
      
      Let's revisit this later (possibly by exposing the UUID without any '-'
      characters in it, avoiding the user-space bug).
      Reported-by: Dave Jones <davej@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Karel Zak <kzak@redhat.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Miklos Szeredi <mszeredi@suse.cz>
      Cc: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      be85bcca
  13. 06 Apr 2011, 1 commit
    • fs: export empty_aops · 7dcda1c9
      Jens Axboe authored
      With the ->sync_page() hook gone, we have a few users that
      add their own static address_space_operations without any
      functions defined.
      
      fs/inode.c already has an empty_aops that it uses for init
      purposes. Let's export that and use it in the places where
      an otherwise empty aops was defined.
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
      7dcda1c9
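      A minimal sketch of the intended usage: with empty_aops exported, a
      filesystem that previously defined its own all-NULL
      address_space_operations can simply point at the shared one (the setup
      helper below is hypothetical):

          extern const struct address_space_operations empty_aops;

          static void example_setup_private_mapping(struct address_space *mapping)
          {
                  mapping->a_ops = &empty_aops;   /* no-op aops from fs/inode.c */
          }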
  14. 31 Mar 2011, 1 commit
  15. 28 Mar 2011, 1 commit
  16. 25 Mar 2011, 1 commit
    • fs: protect inode->i_state with inode->i_lock · 250df6ed
      Dave Chinner authored
      Protect inode state transitions and validity checks with the
      inode->i_lock. This enables us to make inode state transitions
      independently of the inode_lock and is the first step to peeling
      away the inode_lock from the code.
      
      This requires that __iget() is done atomically with i_state checks
      during list traversals so that we don't race with another thread
      marking the inode I_FREEING between the state check and grabbing the
      reference.
      
      Also remove the unlock_new_inode() memory barrier optimisation
      required to avoid taking the inode_lock when clearing I_NEW.
      Simplify the code by simply taking the inode->i_lock around the
      state change and wakeup. Because the wakeup is no longer tricky,
      remove the wake_up_inode() function and open code the wakeup where
      necessary.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      250df6ed
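      A minimal sketch of the traversal pattern described above: check i_state
      and take the reference under the same inode->i_lock, so I_FREEING can't
      sneak in between the check and the __iget():

          static struct inode *example_grab_inode(struct inode *inode)
          {
                  spin_lock(&inode->i_lock);
                  if (inode->i_state & (I_FREEING | I_WILL_FREE)) {
                          spin_unlock(&inode->i_lock);
                          return NULL;            /* being torn down; skip it */
                  }
                  __iget(inode);                  /* state checked under the same lock */
                  spin_unlock(&inode->i_lock);
                  return inode;
          }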
  17. 24 Mar 2011, 2 commits
  18. 23 Mar 2011, 1 commit