1. 09 9月, 2013 1 次提交
  2. 06 9月, 2013 1 次提交
    • M
      vfs: check unlinked ancestors before mount · eed81007
      Miklos Szeredi 提交于
      We check submounts before doing d_drop() on a non-empty directory dentry in
      NFS (have_submounts()), but we do not exclude a racing mount.  Nor do we
      prevent mounts to be added to the disconnected subtree using relative paths
      after the d_drop().
      
      This patch fixes these issues by checking for unlinked (unhashed, non-root)
      ancestors before proceeding with the mount.  This is done with rename
      seqlock taken for write and with ->d_lock grabbed on each ancestor in turn,
      including our dentry itself.  This ensures that the only one of
      check_submounts_and_drop() or has_unlinked_ancestor() can succeed.
      Signed-off-by: NMiklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      eed81007
  3. 29 6月, 2013 2 次提交
  4. 20 6月, 2013 1 次提交
  5. 10 4月, 2013 1 次提交
  6. 22 3月, 2013 1 次提交
  7. 02 3月, 2013 1 次提交
  8. 27 11月, 2012 1 次提交
    • J
      writeback: put unused inodes to LRU after writeback completion · 4eff96dd
      Jan Kara 提交于
      Commit 169ebd90 ("writeback: Avoid iput() from flusher thread")
      removed iget-iput pair from inode writeback.  As a side effect, inodes
      that are dirty during iput_final() call won't be ever added to inode LRU
      (iput_final() doesn't add dirty inodes to LRU and later when the inode
      is cleaned there's noone to add the inode there).  Thus inodes are
      effectively unreclaimable until someone looks them up again.
      
      The practical effect of this bug is limited by the fact that inodes are
      pinned by a dentry for long enough that the inode gets cleaned.  But
      still the bug can have nasty consequences leading up to OOM conditions
      under certain circumstances.  Following can easily reproduce the
      problem:
      
        for (( i = 0; i < 1000; i++ )); do
          mkdir $i
          for (( j = 0; j < 1000; j++ )); do
            touch $i/$j
            echo 2 > /proc/sys/vm/drop_caches
          done
        done
      
      then one needs to run 'sync; ls -lR' to make inodes reclaimable again.
      
      We fix the issue by inserting unused clean inodes into the LRU after
      writeback finishes in inode_sync_complete().
      Signed-off-by: NJan Kara <jack@suse.cz>
      Reported-by: NOGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: <stable@vger.kernel.org>		[3.5+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4eff96dd
  9. 13 10月, 2012 1 次提交
  10. 31 7月, 2012 1 次提交
  11. 14 7月, 2012 6 次提交
    • D
      VFS: Split inode_permission() · 0bdaea90
      David Howells 提交于
      Split inode_permission() into inode- and superblock-dependent parts.
      
      This is aimed at unionmounts where the superblock from the upper layer has to
      be checked rather than the superblock from the lower layer as the upper layer
      may be writable, thus allowing an unwritable file from the lower layer to be
      copied up and modified.
      
      Original-author: Valerie Aurora <vaurora@redhat.com>
      Signed-off-by: David Howells <dhowells@redhat.com> (Further development)
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      0bdaea90
    • A
      kill struct opendata · 30d90494
      Al Viro 提交于
      Just pass struct file *.  Methods are happier that way...
      There's no need to return struct file * from finish_open() now,
      so let it return int.  Next: saner prototypes for parts in
      namei.c
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      30d90494
    • A
      kill opendata->{mnt,dentry} · a4a3bdd7
      Al Viro 提交于
      ->filp->f_path is there for purpose...
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      a4a3bdd7
    • M
      vfs: remove open intents from nameidata · 015c3bbc
      Miklos Szeredi 提交于
      All users of open intents have been converted to use ->atomic_{open,create}.
      
      This patch gets rid of nd->intent.open and related infrastructure.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      015c3bbc
    • M
      vfs: add i_op->atomic_open() · d18e9008
      Miklos Szeredi 提交于
      Add a new inode operation which is called on the last component of an open.
      Using this the filesystem can look up, possibly create and open the file in one
      atomic operation.  If it cannot perform this (e.g. the file type turned out to
      be wrong) it may signal this by returning NULL instead of an open struct file
      pointer.
      
      i_op->atomic_open() is only called if the last component is negative or needs
      lookup.  Handling cached positive dentries here doesn't add much value: these
      can be opened using f_op->open().  If the cached file turns out to be invalid,
      the open can be retried, this time using ->atomic_open() with a fresh dentry.
      
      For now leave the old way of using open intents in lookup and revalidate in
      place.  This will be removed once all the users are converted.
      
      David Howells noticed that if ->atomic_open() opens the file but does not create
      it, handle_truncate() will be called on it even if it is not a regular file.
      Fix this by checking the file type in this case too.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      d18e9008
    • A
      get rid of ->mnt_longterm · f7a99c5b
      Al Viro 提交于
      it's enough to set ->mnt_ns of internal vfsmounts to something
      distinct from all struct mnt_namespace out there; then we can
      just use the check for ->mnt_ns != NULL in the fast path of
      mntput_no_expire()
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      f7a99c5b
  12. 02 6月, 2012 1 次提交
  13. 30 5月, 2012 1 次提交
    • A
      brlocks/lglocks: turn into functions · eea62f83
      Andi Kleen 提交于
      lglocks and brlocks are currently generated with some complicated macros
      in lglock.h.  But there's no reason to not just use common utility
      functions and put all the data into a common data structure.
      
      Since there are at least two users it makes sense to share this code in a
      library.  This is also easier maintainable than a macro forest.
      
      This will also make it later possible to dynamically allocate lglocks and
      also use them in modules (this would both still need some additional, but
      now straightforward, code)
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      eea62f83
  14. 07 1月, 2012 1 次提交
    • M
      vfs: protect remounting superblock read-only · 4ed5e82f
      Miklos Szeredi 提交于
      Currently remouting superblock read-only is racy in a major way.
      
      With the per mount read-only infrastructure it is now possible to
      prevent most races, which this patch attempts.
      
      Before starting the remount read-only, iterate through all mounts
      belonging to the superblock and if none of them have any pending
      writes, set sb->s_readonly_remount.  This indicates that remount is in
      progress and no further write requests are allowed.  If the remount
      succeeds set MS_RDONLY and reset s_readonly_remount.
      
      If the remounting is unsuccessful just reset s_readonly_remount.
      This can result in transient EROFS errors, despite the fact the
      remount failed.  Unfortunately hodling off writes is difficult as
      remount itself may touch the filesystem (e.g. through load_nls())
      which would deadlock.
      
      A later patch deals with delayed writes due to nlink going to zero.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      Tested-by: NToshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      4ed5e82f
  15. 04 1月, 2012 4 次提交
  16. 20 7月, 2011 2 次提交
    • D
      superblock: move pin_sb_for_writeback() to fs/super.c · 12ad3ab6
      Dave Chinner 提交于
      The per-sb shrinker has the same requirement as the writeback
      threads of ensuring that the superblock is usable and pinned for the
      time it takes to run the work. Both need to take a passive reference
      to the sb, take a read lock on the s_umount lock and then only
      continue if an unmount is not in progress.
      
      pin_sb_for_writeback() does this exactly, so move it to fs/super.c
      and rename it to grab_super_passive() and exporting it via
      fs/internal.h for all the VFS code to be able to use.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      12ad3ab6
    • A
      Make ->d_sb assign-once and always non-NULL · a4464dbc
      Al Viro 提交于
      New helper (non-exported, fs/internal.h-only): __d_alloc(sb, name).
      Allocates dentry, sets its ->d_sb to given superblock and sets
      ->d_op accordingly.  Old d_alloc(NULL, name) callers are converted
      to that (all of them know what superblock they want).  d_alloc()
      itself is left only for parent != NULl case; uses __d_alloc(),
      inserts result into the list of parent's children.
      
      Note that now ->d_sb is assign-once and never NULL *and*
      ->d_parent is never NULL either.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      a4464dbc
  17. 25 3月, 2011 2 次提交
  18. 22 3月, 2011 1 次提交
  19. 18 3月, 2011 1 次提交
    • A
      vfs: split off vfsmount-related parts of vfs_kern_mount() · 9d412a43
      Al Viro 提交于
      new function: mount_fs().  Does all work done by vfs_kern_mount()
      except the allocation and filling of vfsmount; returns root dentry
      or ERR_PTR().
      
      vfs_kern_mount() switched to using it and taken to fs/namespace.c,
      along with its wrappers.
      
      alloc_vfsmnt()/free_vfsmnt() made static.
      
      functions in namespace.c slightly reordered.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      9d412a43
  20. 15 3月, 2011 1 次提交
  21. 14 3月, 2011 2 次提交
    • A
      open-style analog of vfs_path_lookup() · 73d049a4
      Al Viro 提交于
      new function: file_open_root(dentry, mnt, name, flags) opens the file
      vfs_path_lookup would arrive to.
      
      Note that name can be empty; in that case the usual requirement that
      dentry should be a directory is lifted.
      
      open-coded equivalents switched to it, may_open() got down exactly
      one caller and became static.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      73d049a4
    • A
      switch do_filp_open() to struct open_flags · 47c805dc
      Al Viro 提交于
      take calculation of open_flags by open(2) arguments into new helper
      in fs/open.c, move filp_open() over there, have it and do_sys_open()
      use that helper, switch exec.c callers of do_filp_open() to explicit
      (and constant) struct open_flags.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      47c805dc
  22. 24 2月, 2011 1 次提交
    • N
      Fix over-zealous flush_disk when changing device size. · 93b270f7
      NeilBrown 提交于
      There are two cases when we call flush_disk.
      In one, the device has disappeared (check_disk_change) so any
      data will hold becomes irrelevant.
      In the oter, the device has changed size (check_disk_size_change)
      so data we hold may be irrelevant.
      
      In both cases it makes sense to discard any 'clean' buffers,
      so they will be read back from the device if needed.
      
      In the former case it makes sense to discard 'dirty' buffers
      as there will never be anywhere safe to write the data.  In the
      second case it *does*not* make sense to discard dirty buffers
      as that will lead to file system corruption when you simply enlarge
      the containing devices.
      
      flush_disk calls __invalidate_devices.
      __invalidate_device calls both invalidate_inodes and invalidate_bdev.
      
      invalidate_inodes *does* discard I_DIRTY inodes and this does lead
      to fs corruption.
      
      invalidate_bev *does*not* discard dirty pages, but I don't really care
      about that at present.
      
      So this patch adds a flag to __invalidate_device (calling it
      __invalidate_device2) to indicate whether dirty buffers should be
      killed, and this is passed to invalidate_inodes which can choose to
      skip dirty inodes.
      
      flusk_disk then passes true from check_disk_change and false from
      check_disk_size_change.
      
      dm avoids tripping over this problem by calling i_size_write directly
      rathher than using check_disk_size_change.
      
      md does use check_disk_size_change and so is affected.
      
      This regression was introduced by commit 608aeef1 which causes
      check_disk_size_change to call flush_disk, so it is suitable for any
      kernel since 2.6.27.
      
      Cc: stable@kernel.org
      Acked-by: NJeff Moyer <jmoyer@redhat.com>
      Cc: Andrew Patterson <andrew.patterson@hp.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      93b270f7
  23. 17 1月, 2011 3 次提交
    • A
      tidy up around finish_automount() · b1e75df4
      Al Viro 提交于
      do_add_mount() and mnt_clear_expiry() are not needed outside of
      namespace.c anymore, now that namei has finish_automount() to
      use.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b1e75df4
    • A
      Take the completion of automount into new helper · 19a167af
      Al Viro 提交于
      ... and shift it from namei.c to namespace.c
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      19a167af
    • A
      sanitize vfsmount refcounting changes · f03c6599
      Al Viro 提交于
      Instead of splitting refcount between (per-cpu) mnt_count
      and (SMP-only) mnt_longrefs, make all references contribute
      to mnt_count again and keep track of how many are longterm
      ones.
      
      Accounting rules for longterm count:
      	* 1 for each fs_struct.root.mnt
      	* 1 for each fs_struct.pwd.mnt
      	* 1 for having non-NULL ->mnt_ns
      	* decrement to 0 happens only under vfsmount lock exclusive
      
      That allows nice common case for mntput() - since we can't drop the
      final reference until after mnt_longterm has reached 0 due to the rules
      above, mntput() can grab vfsmount lock shared and check mnt_longterm.
      If it turns out to be non-zero (which is the common case), we know
      that this is not the final mntput() and can just blindly decrement
      percpu mnt_count.  Otherwise we grab vfsmount lock exclusive and
      do usual decrement-and-check of percpu mnt_count.
      
      For fs_struct.c we have mnt_make_longterm() and mnt_make_shortterm();
      namespace.c uses the latter in places where we don't already hold
      vfsmount lock exclusive and opencodes a few remaining spots where
      we need to manipulate mnt_longterm.
      
      Note that we mostly revert the code outside of fs/namespace.c back
      to what we used to have; in particular, normal code doesn't need
      to care about two kinds of references, etc.  And we get to keep
      the optimization Nick's variant had bought us...
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      f03c6599
  24. 16 1月, 2011 1 次提交
    • D
      Unexport do_add_mount() and add in follow_automount(), not ->d_automount() · ea5b778a
      David Howells 提交于
      Unexport do_add_mount() and make ->d_automount() return the vfsmount to be
      added rather than calling do_add_mount() itself.  follow_automount() will then
      do the addition.
      
      This slightly complicates things as ->d_automount() normally wants to add the
      new vfsmount to an expiration list and start an expiration timer.  The problem
      with that is that the vfsmount will be deleted if it has a refcount of 1 and
      the timer will not repeat if the expiration list is empty.
      
      To this end, we require the vfsmount to be returned from d_automount() with a
      refcount of (at least) 2.  One of these refs will be dropped unconditionally.
      In addition, follow_automount() must get a 3rd ref around the call to
      do_add_mount() lest it eat a ref and return an error, leaving the mount we
      have open to being expired as we would otherwise have only 1 ref on it.
      
      d_automount() should also add the the vfsmount to the expiration list (by
      calling mnt_set_expiry()) and start the expiration timer before returning, if
      this mechanism is to be used.  The vfsmount will be unlinked from the
      expiration list by follow_automount() if do_add_mount() fails.
      
      This patch also fixes the call to do_add_mount() for AFS to propagate the mount
      flags from the parent vfsmount.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      ea5b778a
  25. 07 1月, 2011 1 次提交
    • N
      fs: scale mntget/mntput · b3e19d92
      Nick Piggin 提交于
      The problem that this patch aims to fix is vfsmount refcounting scalability.
      We need to take a reference on the vfsmount for every successful path lookup,
      which often go to the same mount point.
      
      The fundamental difficulty is that a "simple" reference count can never be made
      scalable, because any time a reference is dropped, we must check whether that
      was the last reference. To do that requires communication with all other CPUs
      that may have taken a reference count.
      
      We can make refcounts more scalable in a couple of ways, involving keeping
      distributed counters, and checking for the global-zero condition less
      frequently.
      
      - check the global sum once every interval (this will delay zero detection
        for some interval, so it's probably a showstopper for vfsmounts).
      
      - keep a local count and only taking the global sum when local reaches 0 (this
        is difficult for vfsmounts, because we can't hold preempt off for the life of
        a reference, so a counter would need to be per-thread or tied strongly to a
        particular CPU which requires more locking).
      
      - keep a local difference of increments and decrements, which allows us to sum
        the total difference and hence find the refcount when summing all CPUs. Then,
        keep a single integer "long" refcount for slow and long lasting references,
        and only take the global sum of local counters when the long refcount is 0.
      
      This last scheme is what I implemented here. Attached mounts and process root
      and working directory references are "long" references, and everything else is
      a short reference.
      
      This allows scalable vfsmount references during path walking over mounted
      subtrees and unattached (lazy umounted) mounts with processes still running
      in them.
      
      This results in one fewer atomic op in the fastpath: mntget is now just a
      per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
      and non-atomic decrement in the common case. However code is otherwise bigger
      and heavier, so single threaded performance is basically a wash.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      b3e19d92
  26. 29 10月, 2010 1 次提交