1. 12 7月, 2011 1 次提交
    • J
      fixlet: Remove fs_excl from struct task. · 4aede84b
      Justin TerAvest 提交于
      fs_excl is a poor man's priority inheritance for filesystems to hint to
      the block layer that an operation is important. It was never clearly
      specified, not widely adopted, and will not prevent starvation in many
      cases (like across cgroups).
      
      fs_excl was introduced with the time sliced CFQ IO scheduler, to
      indicate when a process held FS exclusive resources and thus needed
      a boost.
      
      It doesn't cover all file systems, and it was never fully complete.
      Lets kill it.
      Signed-off-by: NJustin TerAvest <teravest@google.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      4aede84b
  2. 21 6月, 2011 1 次提交
    • L
      vfs: i_state needs to be 'unsigned long' for now · 79568f5b
      Linus Torvalds 提交于
      Commit 13e12d14 ("vfs: reorganize 'struct inode' layout a bit")
      moved things around a bit changed i_state to be unsigned int instead of
      unsigned long.  That was to help structure layout for the 64-bit case,
      and shrink 'struct inode' a bit (admittedly that only happened when
      spinlock debugging was on and i_flags didn't pack with i_lock).
      
      However, Meelis Roos reports that this results in unaligned exceptions
      on sprc, and it turns out that the bit-locking primitives that we use
      for the I_NEW bit want to use the bitops.  Which want 'unsigned long',
      not 'unsigned int'.
      
      We really should fix the bit locking code to not have that kind of
      requirement, but that's a much bigger change.  So for now, revert that
      field back to 'unsigned long' (but keep the other re-ordering changes
      from the commit that caused this).
      
      Andi points out that we have played games with this in 'struct page', so
      it's solvable with other hacks too, but since right now the struct inode
      size advantage only happens with some rare config options, it's not
      worth fighting.
      
      It _would_ be worth fixing the bitlocking code, though.  Especially
      since there is no type safety in the bitlocking code (this never caused
      any warnings, and worked fine on x86-64, because the bitlocks take a
      'void *' and x86-64 doesn't care that deeply about alignment).  So it's
      currently a very easy problem to trigger by mistake and never notice.
      Reported-by: NMeelis Roos <mroos@linux.ee>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      79568f5b
  3. 09 6月, 2011 1 次提交
    • L
      vfs: reorganize 'struct inode' layout a bit · 13e12d14
      Linus Torvalds 提交于
      This tries to make the 'struct inode' accesses denser in the data cache
      by moving a commonly accessed field (i_security) closer to other fields
      that are accessed often.
      
      It also makes 'i_state' just an 'unsigned int' rather than 'unsigned
      long', since we only use a few bits of that field, and moves it next to
      the existing 'i_flags' so that we potentially get better structure
      layout (although depending on config options, i_flags may already have
      packed in the same word as i_lock, so this improves packing only for the
      case of spinlock debugging)
      
      Out 'struct inode' is still way too big, and we should probably move
      some other fields around too (the acl fields in particular) for better
      data cache access density.  Other fields (like the inode hash) are
      likely to be entirely irrelevant under most loads.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      13e12d14
  4. 04 6月, 2011 1 次提交
    • A
      more conservative S_NOSEC handling · 9e1f1de0
      Al Viro 提交于
      Caching "we have already removed suid/caps" was overenthusiastic as merged.
      On network filesystems we might have had suid/caps set on another client,
      silently picked by this client on revalidate, all of that *without* clearing
      the S_NOSEC flag.
      
      AFAICS, the only reasonably sane way to deal with that is
      	* new superblock flag; unless set, S_NOSEC is not going to be set.
      	* local block filesystems set it in their ->mount() (more accurately,
      mount_bdev() does, so does btrfs ->mount(), users of mount_bdev() other than
      local block ones clear it)
      	* if any network filesystem (or a cluster one) wants to use S_NOSEC,
      it'll need to set MS_NOSEC in sb->s_flags *AND* take care to clear S_NOSEC when
      inode attribute changes are picked from other clients.
      
      It's not an earth-shattering hole (anybody that can set suid on another client
      will almost certainly be able to write to the file before doing that anyway),
      but it's a bug that needs fixing.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      9e1f1de0
  5. 29 5月, 2011 1 次提交
    • A
      Cache xattr security drop check for write v2 · 69b45732
      Andi Kleen 提交于
      Some recent benchmarking on btrfs showed that a major scaling bottleneck
      on large systems on btrfs is currently the xattr lookup on every write.
      
      Why xattr lookup on every write I hear you ask?
      
      write wants to drop suid and security related xattrs that could set o
      capabilities for executables.  To do that it currently looks up
      security.capability on EVERY write (even for non executables) to decide
      whether to drop it or not.
      
      In btrfs this causes an additional tree walk, hitting some per file system
      locks and quite bad scalability. In a simple read workload on a 8S
      system I saw over 90% CPU time in spinlocks related to that.
      
      Chris Mason tells me this is also a problem in ext4, where it hits
      the global mbcache lock.
      
      This patch adds a simple per inode to avoid this problem.  We only
      do the lookup once per file and then if there is no xattr cache
      the decision. All xattr changes clear the flag.
      
      I also used the same flag to avoid the suid check, although
      that one is pretty cheap.
      
      A file system can also set this flag when it creates the inode,
      if it has a cheap way to do so.  This is done for some common file systems
      in followon patches.
      
      With this patch a major part of the lock contention disappears
      for btrfs. Some testing on smaller systems didn't show significant
      performance changes, but at least it helps the larger systems
      and is generally more efficient.
      
      v2: Rename is_sgid. add file system helper.
      Cc: chris.mason@oracle.com
      Cc: josef@redhat.com
      Cc: viro@zeniv.linux.org.uk
      Cc: agruen@linbit.com
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      69b45732
  6. 27 5月, 2011 2 次提交
    • C
      fs: pass exact type of data dirties to ->dirty_inode · aa385729
      Christoph Hellwig 提交于
      Tell the filesystem if we just updated timestamp (I_DIRTY_SYNC) or
      anything else, so that the filesystem can track internally if it
      needs to push out a transaction for fdatasync or not.
      
      This is just the prototype change with no user for it yet.  I plan
      to push large XFS changes for the next merge window, and getting
      this trivial infrastructure in this window would help a lot to avoid
      tree interdependencies.
      
      Also remove incorrect comments that ->dirty_inode can't block.  That
      has been changed a long time ago, and many implementations rely on it.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      aa385729
    • D
      fs: add field to superblock to support cleancache · 9fdfdcf1
      Dan Magenheimer 提交于
      This second patch of eight in this cleancache series adds a field to
      the generic superblock to squirrel away a pool identifier that is
      dynamically provided by cleancache-enabled filesystems at mount time
      to uniquely identify files and pages belonging to this mounted filesystem.
      
      Details and a FAQ can be found in Documentation/vm/cleancache.txt
      
      [v8: trivial merge conflict update]
      Signed-off-by: NDan Magenheimer <dan.magenheimer@oracle.com>
      Reviewed-by: NJeremy Fitzhardinge <jeremy@goop.org>
      Reviewed-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik Van Riel <riel@redhat.com>
      Cc: Jan Beulich <JBeulich@novell.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Andreas Dilger <adilger@sun.com>
      Cc: Ted Ts'o <tytso@mit.edu>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      9fdfdcf1
  7. 25 5月, 2011 3 次提交
    • T
      ulimit: raise default hard ulimit on number of files to 4096 · 0ac1ee0b
      Tim Gardner 提交于
      Apps are increasingly using more than 1024 file descriptors.  See
      discussion in several distro bug trackers, e.g.  BugLink:
      http://bugs.launchpad.net/bugs/663090
      https://issues.rpath.com/browse/RPL-2054
      
      You don't want to raise the default soft limit, since that might break
      apps that use select(), but it's safe to raise the default hard limit;
      that way, apps that know they need lots of file descriptors can raise
      their soft limit without needing root, and without user intervention.
      
      Ubuntu is doing this with a kernel change because they have a policy of
      not changing kernel defaults in userland.
      
      While 4096 might not be enough for *all* apps, it seems to be plenty for
      the apps I've seen lately that are unhappy with 1024.
      Signed-off-by: NTim Gardner <tim.gardner@canonical.com>
      Cc: Dan Kegel <dank@kegel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0ac1ee0b
    • P
      mm: Convert i_mmap_lock to a mutex · 3d48ae45
      Peter Zijlstra 提交于
      Straightforward conversion of i_mmap_lock to a mutex.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3d48ae45
    • P
      mm: Remove i_mmap_lock lockbreak · 97a89413
      Peter Zijlstra 提交于
      Hugh says:
       "The only significant loser, I think, would be page reclaim (when
        concurrent with truncation): could spin for a long time waiting for
        the i_mmap_mutex it expects would soon be dropped? "
      
      Counter points:
       - cpu contention makes the spin stop (need_resched())
       - zap pages should be freeing pages at a higher rate than reclaim
         ever can
      
      I think the simplification of the truncate code is definitely worth it.
      
      Effectively reverts: 2aa15890 ("mm: prevent concurrent
      unmap_mapping_range() on the same inode") and takes out the code that
      caused its problem.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      97a89413
  8. 15 5月, 2011 1 次提交
    • L
      fs: remove FS_COW_FL · e1e8fb6a
      Li Zefan 提交于
      FS_COW_FL and FS_NOCOW_FL were newly introduced to control per file
      COW in btrfs, but FS_NOCOW_FL is sufficient.
      
      The fact is we don't have corresponding BTRFS_INODE_COW flag.
      
      COW is default, and FS_NOCOW_FL can be used to switch off COW for
      a single file.
      
      If we mount btrfs with nodatacow, a newly created file will be set with
      the FS_NOCOW_FL flag. So to turn on COW for it, we can just clear the
      FS_NOCOW_FL flag.
      Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      e1e8fb6a
  9. 13 4月, 2011 2 次提交
    • L
      vfs: Re-introduce s_uuid in the superblock · 0bba0169
      Linus Torvalds 提交于
      Gaah.  When commit be85bcca reverted the export of file system uuid
      via /proc/<pid>/mountinfo, it also unintentionally removed the s_uuid
      field in struct super_block.
      
      I didn't mean to do that, since filesystems have been taught to fill it
      in (and we want to keep it for future re-introduction in the mountinfo
      file).
      
      Stupid of me. This adds it back in.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0bba0169
    • L
      Revert "vfs: Export file system uuid via /proc/<pid>/mountinfo" · be85bcca
      Linus Torvalds 提交于
      This reverts commit 93f1c20b.
      
      It turns out that libmount misparses it because it adds a '-' character
      in the uuid string, which libmount then incorrectly confuses with the
      separator string (" - ") at the end of all the optional arguments.
      
      Upstream libmount (in the util-linux tree) has been fixed, but until
      that fix actually percolates up to users, we'd better not expose this
      change in the kernel.
      
      Let's revisit this later (possibly by exposing the UUID without any '-'
      characters in it, avoiding the user-space bug).
      Reported-by: NDave Jones <davej@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Karel Zak <kzak@redhat.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Miklos Szeredi <mszeredi@suse.cz>
      Cc: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      be85bcca
  10. 06 4月, 2011 1 次提交
    • J
      fs: export empty_aops · 7dcda1c9
      Jens Axboe 提交于
      With the ->sync_page() hook gone, we have a few users that
      add their own static address_space_operations without any
      functions defined.
      
      fs/inode.c already has an empty_aops that it uses for init
      purposes. Lets export that and use it in the places where
      an otherwise empty aops was defined.
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      7dcda1c9
  11. 31 3月, 2011 1 次提交
  12. 28 3月, 2011 1 次提交
  13. 25 3月, 2011 1 次提交
    • D
      fs: protect inode->i_state with inode->i_lock · 250df6ed
      Dave Chinner 提交于
      Protect inode state transitions and validity checks with the
      inode->i_lock. This enables us to make inode state transitions
      independently of the inode_lock and is the first step to peeling
      away the inode_lock from the code.
      
      This requires that __iget() is done atomically with i_state checks
      during list traversals so that we don't race with another thread
      marking the inode I_FREEING between the state check and grabbing the
      reference.
      
      Also remove the unlock_new_inode() memory barrier optimisation
      required to avoid taking the inode_lock when clearing I_NEW.
      Simplify the code by simply taking the inode->i_lock around the
      state change and wakeup. Because the wakeup is no longer tricky,
      remove the wake_up_inode() function and open code the wakeup where
      necessary.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      250df6ed
  14. 24 3月, 2011 2 次提交
  15. 23 3月, 2011 1 次提交
  16. 18 3月, 2011 1 次提交
  17. 17 3月, 2011 2 次提交
  18. 15 3月, 2011 3 次提交
    • A
      New kind of open files - "location only". · 1abf0c71
      Al Viro 提交于
      New flag for open(2) - O_PATH.  Semantics:
      	* pathname is resolved, but the file itself is _NOT_ opened
      as far as filesystem is concerned.
      	* almost all operations on the resulting descriptors shall
      fail with -EBADF.  Exceptions are:
      	1) operations on descriptors themselves (i.e.
      		close(), dup(), dup2(), dup3(), fcntl(fd, F_DUPFD),
      		fcntl(fd, F_DUPFD_CLOEXEC, ...), fcntl(fd, F_GETFD),
      		fcntl(fd, F_SETFD, ...))
      	2) fcntl(fd, F_GETFL), for a common non-destructive way to
      		check if descriptor is open
      	3) "dfd" arguments of ...at(2) syscalls, i.e. the starting
      		points of pathname resolution
      	* closing such descriptor does *NOT* affect dnotify or
      posix locks.
      	* permissions are checked as usual along the way to file;
      no permission checks are applied to the file itself.  Of course,
      giving such thing to syscall will result in permission checks (at
      the moment it means checking that starting point of ....at() is
      a directory and caller has exec permissions on it).
      
      fget() and fget_light() return NULL on such descriptors; use of
      fget_raw() and fget_raw_light() is needed to get them.  That protects
      existing code from dealing with those things.
      
      There are two things still missing (they come in the next commits):
      one is handling of symlinks (right now we refuse to open them that
      way; see the next commit for semantics related to those) and another
      is descriptor passing via SCM_RIGHTS datagrams.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      1abf0c71
    • A
      vfs: Export file system uuid via /proc/<pid>/mountinfo · 93f1c20b
      Aneesh Kumar K.V 提交于
      We add a per superblock uuid field. File systems should
      update the uuid in the fill_super callback
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      93f1c20b
    • A
      vfs: Add name to file handle conversion support · 990d6c2d
      Aneesh Kumar K.V 提交于
      The syscall also return mount id which can be used
      to lookup file system specific information such as uuid
      in /proc/<pid>/mountinfo
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      990d6c2d
  19. 14 3月, 2011 3 次提交
    • A
      clean statfs-like syscalls up · c8b91acc
      Al Viro 提交于
      New helpers: user_statfs() and fd_statfs(), taking userland pathname and
      descriptor resp. and filling struct kstatfs.  Syscalls of statfs family
      (native, compat and foreign - osf and hpux on alpha and parisc resp.)
      switched to those.  Removes some boilerplate code, simplifies cleanup
      on errors...
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      c8b91acc
    • A
      open-style analog of vfs_path_lookup() · 73d049a4
      Al Viro 提交于
      new function: file_open_root(dentry, mnt, name, flags) opens the file
      vfs_path_lookup would arrive to.
      
      Note that name can be empty; in that case the usual requirement that
      dentry should be a directory is lifted.
      
      open-coded equivalents switched to it, may_open() got down exactly
      one caller and became static.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      73d049a4
    • A
      switch do_filp_open() to struct open_flags · 47c805dc
      Al Viro 提交于
      take calculation of open_flags by open(2) arguments into new helper
      in fs/open.c, move filp_open() over there, have it and do_sys_open()
      use that helper, switch exec.c callers of do_filp_open() to explicit
      (and constant) struct open_flags.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      47c805dc
  20. 10 3月, 2011 2 次提交
    • J
      block: kill off REQ_UNPLUG · 721a9602
      Jens Axboe 提交于
      With the plugging now being explicitly controlled by the
      submitter, callers need not pass down unplugging hints
      to the block layer. If they want to unplug, it's because they
      manually plugged on their own - in which case, they should just
      unplug at will.
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      721a9602
    • J
      block: remove per-queue plugging · 7eaceacc
      Jens Axboe 提交于
      Code has been converted over to the new explicit on-stack plugging,
      and delay users have been converted to use the new API for that.
      So lets kill off the old plugging along with aops->sync_page().
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      7eaceacc
  21. 24 2月, 2011 2 次提交
    • N
      Fix over-zealous flush_disk when changing device size. · 93b270f7
      NeilBrown 提交于
      There are two cases when we call flush_disk.
      In one, the device has disappeared (check_disk_change) so any
      data will hold becomes irrelevant.
      In the oter, the device has changed size (check_disk_size_change)
      so data we hold may be irrelevant.
      
      In both cases it makes sense to discard any 'clean' buffers,
      so they will be read back from the device if needed.
      
      In the former case it makes sense to discard 'dirty' buffers
      as there will never be anywhere safe to write the data.  In the
      second case it *does*not* make sense to discard dirty buffers
      as that will lead to file system corruption when you simply enlarge
      the containing devices.
      
      flush_disk calls __invalidate_devices.
      __invalidate_device calls both invalidate_inodes and invalidate_bdev.
      
      invalidate_inodes *does* discard I_DIRTY inodes and this does lead
      to fs corruption.
      
      invalidate_bev *does*not* discard dirty pages, but I don't really care
      about that at present.
      
      So this patch adds a flag to __invalidate_device (calling it
      __invalidate_device2) to indicate whether dirty buffers should be
      killed, and this is passed to invalidate_inodes which can choose to
      skip dirty inodes.
      
      flusk_disk then passes true from check_disk_change and false from
      check_disk_size_change.
      
      dm avoids tripping over this problem by calling i_size_write directly
      rathher than using check_disk_size_change.
      
      md does use check_disk_size_change and so is affected.
      
      This regression was introduced by commit 608aeef1 which causes
      check_disk_size_change to call flush_disk, so it is suitable for any
      kernel since 2.6.27.
      
      Cc: stable@kernel.org
      Acked-by: NJeff Moyer <jmoyer@redhat.com>
      Cc: Andrew Patterson <andrew.patterson@hp.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      93b270f7
    • M
      mm: prevent concurrent unmap_mapping_range() on the same inode · 2aa15890
      Miklos Szeredi 提交于
      Michael Leun reported that running parallel opens on a fuse filesystem
      can trigger a "kernel BUG at mm/truncate.c:475"
      
      Gurudas Pai reported the same bug on NFS.
      
      The reason is, unmap_mapping_range() is not prepared for more than
      one concurrent invocation per inode.  For example:
      
        thread1: going through a big range, stops in the middle of a vma and
           stores the restart address in vm_truncate_count.
      
        thread2: comes in with a small (e.g. single page) unmap request on
           the same vma, somewhere before restart_address, finds that the
           vma was already unmapped up to the restart address and happily
           returns without doing anything.
      
      Another scenario would be two big unmap requests, both having to
      restart the unmapping and each one setting vm_truncate_count to its
      own value.  This could go on forever without any of them being able to
      finish.
      
      Truncate and hole punching already serialize with i_mutex.  Other
      callers of unmap_mapping_range() do not, and it's difficult to get
      i_mutex protection for all callers.  In particular ->d_revalidate(),
      which calls invalidate_inode_pages2_range() in fuse, may be called
      with or without i_mutex.
      
      This patch adds a new mutex to 'struct address_space' to prevent
      running multiple concurrent unmap_mapping_range() on the same mapping.
      
      [ We'll hopefully get rid of all this with the upcoming mm
        preemptibility series by Peter Zijlstra, the "mm: Remove i_mmap_mutex
        lockbreak" patch in particular.  But that is for 2.6.39 ]
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      Reported-by: NMichael Leun <lkml20101129@newton.leun.net>
      Reported-by: NGurudas Pai <gurudas.pai@oracle.com>
      Tested-by: NGurudas Pai <gurudas.pai@oracle.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2aa15890
  22. 10 2月, 2011 2 次提交
  23. 03 2月, 2011 2 次提交
  24. 17 1月, 2011 2 次提交
    • N
      fs: fix address space warnings in ioctl_fiemap() · ecf5632d
      Namhyung Kim 提交于
      The fi_extents_start field of struct fiemap_extent_info is a
      user pointer but was not marked as __user. This makes sparse
      emit following warnings:
      
        CHECK   fs/ioctl.c
      fs/ioctl.c:114:26: warning: incorrect type in argument 1 (different address spaces)
      fs/ioctl.c:114:26:    expected void [noderef] <asn:1>*dst
      fs/ioctl.c:114:26:    got struct fiemap_extent *[assigned] dest
      fs/ioctl.c:202:14: warning: incorrect type in argument 1 (different address spaces)
      fs/ioctl.c:202:14:    expected void const volatile [noderef] <asn:1>*<noident>
      fs/ioctl.c:202:14:    got struct fiemap_extent *[assigned] fi_extents_start
      fs/ioctl.c:212:27: warning: incorrect type in argument 1 (different address spaces)
      fs/ioctl.c:212:27:    expected void [noderef] <asn:1>*dst
      fs/ioctl.c:212:27:    got char *<noident>
      
      Also add 'ufiemap' variable to eliminate unnecessary casts.
      Signed-off-by: NNamhyung Kim <namhyung@gmail.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      ecf5632d
    • C
      fallocate should be a file operation · 2fe17c10
      Christoph Hellwig 提交于
      Currently all filesystems except XFS implement fallocate asynchronously,
      while XFS forced a commit.  Both of these are suboptimal - in case of O_SYNC
      I/O we really want our allocation on disk, especially for the !KEEP_SIZE
      case where we actually grow the file with user-visible zeroes.  On the
      other hand always commiting the transaction is a bad idea for fast-path
      uses of fallocate like for example in recent Samba versions.   Given
      that block allocation is a data plane operation anyway change it from
      an inode operation to a file operation so that we have the file structure
      available that lets us check for O_SYNC.
      
      This also includes moving the code around for a few of the filesystems,
      and remove the already unnedded S_ISDIR checks given that we only wire
      up fallocate for regular files.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2fe17c10
  25. 16 1月, 2011 1 次提交
    • D
      Add a dentry op to handle automounting rather than abusing follow_link() · 9875cf80
      David Howells 提交于
      Add a dentry op (d_automount) to handle automounting directories rather than
      abusing the follow_link() inode operation.  The operation is keyed off a new
      dentry flag (DCACHE_NEED_AUTOMOUNT).
      
      This also makes it easier to add an AT_ flag to suppress terminal segment
      automount during pathwalk and removes the need for the kludge code in the
      pathwalk algorithm to handle directories with follow_link() semantics.
      
      The ->d_automount() dentry operation:
      
      	struct vfsmount *(*d_automount)(struct path *mountpoint);
      
      takes a pointer to the directory to be mounted upon, which is expected to
      provide sufficient data to determine what should be mounted.  If successful, it
      should return the vfsmount struct it creates (which it should also have added
      to the namespace using do_add_mount() or similar).  If there's a collision with
      another automount attempt, NULL should be returned.  If the directory specified
      by the parameter should be used directly rather than being mounted upon,
      -EISDIR should be returned.  In any other case, an error code should be
      returned.
      
      The ->d_automount() operation is called with no locks held and may sleep.  At
      this point the pathwalk algorithm will be in ref-walk mode.
      
      Within fs/namei.c itself, a new pathwalk subroutine (follow_automount()) is
      added to handle mountpoints.  It will return -EREMOTE if the automount flag was
      set, but no d_automount() op was supplied, -ELOOP if we've encountered too many
      symlinks or mountpoints, -EISDIR if the walk point should be used without
      mounting and 0 if successful.  The path will be updated to point to the mounted
      filesystem if a successful automount took place.
      
      __follow_mount() is replaced by follow_managed() which is more generic
      (especially with the patch that adds ->d_manage()).  This handles transits from
      directories during pathwalk, including automounting and skipping over
      mountpoints (and holding processes with the next patch).
      
      __follow_mount_rcu() will jump out of RCU-walk mode if it encounters an
      automount point with nothing mounted on it.
      
      follow_dotdot*() does not handle automounts as you don't want to trigger them
      whilst following "..".
      
      I've also extracted the mount/don't-mount logic from autofs4 and included it
      here.  It makes the mount go ahead anyway if someone calls open() or creat(),
      tries to traverse the directory, tries to chdir/chroot/etc. into the directory,
      or sticks a '/' on the end of the pathname.  If they do a stat(), however,
      they'll only trigger the automount if they didn't also say O_NOFOLLOW.
      
      I've also added an inode flag (S_AUTOMOUNT) so that filesystems can mark their
      inodes as automount points.  This flag is automatically propagated to the
      dentry as DCACHE_NEED_AUTOMOUNT by __d_instantiate().  This saves NFS and could
      save AFS a private flag bit apiece, but is not strictly necessary.  It would be
      preferable to do the propagation in d_set_d_op(), but that doesn't normally
      have access to the inode.
      
      [AV: fixed breakage in case if __follow_mount_rcu() fails and nameidata_drop_rcu()
      succeeds in RCU case of do_lookup(); we need to fall through to non-RCU case after
      that, rather than just returning with ungrabbed *path]
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Was-Acked-by: NIan Kent <raven@themaw.net>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      9875cf80