1. 28 May 2010 (5 commits)
    • fs: introduce new truncate sequence · 7bb46a67
      npiggin@suse.de authored
      Introduce a new truncate calling sequence into fs/mm subsystems. Rather than
      setattr > vmtruncate > truncate, have filesystems call their truncate sequence
      from ->setattr if filesystem specific operations are required. vmtruncate is
      deprecated, and truncate_pagecache and inode_newsize_ok helpers introduced
      previously should be used.
      
      simple_setattr is introduced for simple in-ram filesystems to implement
      the new truncate sequence. Eventually all filesystems should be converted
      to implement a setattr, and the default code in notify_change should go
      away.
      
      simple_setsize is also introduced to perform just the ATTR_SIZE portion
      of simple_setattr (i.e. changing i_size and trimming pagecache).
      
      To implement the new truncate sequence:
      - filesystem specific manipulations (eg freeing blocks) must be done in
        the setattr method rather than ->truncate.
      - vmtruncate can not be used by core code to trim blocks past i_size in
        the event of write failure after allocation, so this must be performed
        in the fs code.
      - convert usage of helpers block_write_begin, nobh_write_begin,
        cont_write_begin, and *blockdev_direct_IO* to use _newtrunc postfixed
        variants. These avoid calling vmtruncate to trim blocks (see previous).
      - inode_setattr should not be used. generic_setattr is a new function
        to be used to copy simple attributes into the generic inode.
      - make use of the better opportunity to handle errors with the new sequence.
      
      Big problem with the previous calling sequence: the filesystem is not called
      until i_size has already changed.  This means it is not allowed to fail the
      call, and also it does not know what the previous i_size was. Also, generic
      code calling vmtruncate to truncate allocated blocks in case of error had
      no good way to return a meaningful error (or, for example, atomically handle
      block deallocation).
      
      Cc: Christoph Hellwig <hch@lst.de>
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
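
      For illustration, a minimal sketch of a filesystem ->setattr following the
      new sequence.  "foofs" and foofs_truncate_blocks() are hypothetical names,
      and the exact helper signatures are assumed from the description above
      rather than quoted from the patch:

      #include <linux/fs.h>

      /* hypothetical fs-specific work, e.g. freeing blocks past the new size */
      static void foofs_truncate_blocks(struct inode *inode, loff_t newsize);

      static int foofs_setattr(struct dentry *dentry, struct iattr *attr)
      {
              struct inode *inode = dentry->d_inode;
              int error;

              error = inode_change_ok(inode, attr);
              if (error)
                      return error;

              if ((attr->ia_valid & ATTR_SIZE) &&
                  attr->ia_size != i_size_read(inode)) {
                      /* change i_size and trim the pagecache first ... */
                      error = simple_setsize(inode, attr->ia_size);
                      if (error)
                              return error;
                      /* ... then do the filesystem-specific part */
                      foofs_truncate_blocks(inode, attr->ia_size);
              }

              generic_setattr(inode, attr);   /* copy the remaining attributes */
              mark_inode_dirty(inode);
              return 0;
      }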
    • rename the generic fsync implementations · 1b061d92
      Christoph Hellwig authored
      We don't name our generic fsync implementations very well currently.
      The no-op implementation for in-memory filesystems is called
      simple_sync_file, which doesn't make too much sense to start with,
      and the generic one for simple filesystems is called simple_fsync,
      which can lead to some confusion.
      
      This patch renames the generic file fsync method to generic_file_fsync
      to match the other generic_file_* routines it is supposed to be used
      with, and the no-op implementation to noop_fsync to make it obvious
      what to expect.  In addition add some documentation for both methods.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
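
      A hedged illustration of where the renamed helpers plug in; the two
      file_operations tables below are made-up examples, not taken from the patch:

      #include <linux/fs.h>

      /* in-memory filesystem: nothing to write back, so fsync is a no-op */
      static const struct file_operations ram_file_ops = {
              /* .read/.write/.mmap etc. elided */
              .fsync  = noop_fsync,           /* was simple_sync_file */
      };

      /* simple disk-backed filesystem: sync the inode via the generic helper */
      static const struct file_operations disk_file_ops = {
              /* .read/.write etc. elided */
              .fsync  = generic_file_fsync,   /* was simple_fsync */
      };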
    • drop unused dentry argument to ->fsync · 7ea80859
      Christoph Hellwig authored
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • get rid of the magic around f_count in aio · d7065da0
      Al Viro authored
      __aio_put_req() plays sick games with file refcount.  What
      it wants is fput() from atomic context; it's almost always
      done with f_count > 1, so aio only has to deal with delayed
      work in the rare case when its reference happens to be the
      last one.  The current code decrements f_count and, if it hasn't
      hit 0, everything is fine.  Otherwise it keeps a pointer
      to struct file (with zero f_count!) around and has delayed
      work do __fput() on it.
      
      Better way to do it: use atomic_long_add_unless( , -1, 1)
      instead of !atomic_long_dec_and_test().  IOW, decrement it
      only if it's not the last reference, leave refcount alone
      if it was.  And use normal fput() in delayed work.
      
      I've made that atomic_long_add_unless call a new helper -
      fput_atomic().  It drops a reference to the file if it's safe to
      do so in atomic context (i.e. if that's not the last one) and tells
      you whether it was able to do that.  aio.c is converted to it, __fput()
      use is gone.  req->ki_file *always* contributes to refcount
      now.  And __fput() became static.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
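
      The helper boils down to the atomic_long_add_unless() call described above;
      a sketch (the real definition lives in fs/file_table.c):

      #include <linux/fs.h>
      #include <linux/file.h>

      int fput_atomic(struct file *file)
      {
              /* drop a reference unless f_count is currently 1 (the last one) */
              return atomic_long_add_unless(&file->f_count, -1, 1);
      }

      If it returns 0 the caller held the last reference and has to defer a normal
      fput() to process context, which is exactly what aio's delayed work now does.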
    • vfs: introduce noop_llseek() · ae6afc3f
      Jan Blunck authored
      This is an implementation of ->llseek usable for the rare special case
      when userspace expects the seek to succeed but the (device) file is
      actually not able to perform the seek.  In this case you use noop_llseek()
      instead of falling back to the default implementation of ->llseek.
      Signed-off-by: Jan Blunck <jblunck@suse.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
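
      A sketch of what noop_llseek() amounts to and how a driver that cannot seek,
      but must not fail llseek, would wire it up; "mydrv_fops" is a hypothetical
      example, not from the patch:

      #include <linux/fs.h>
      #include <linux/module.h>

      /* report success but leave the file position untouched */
      loff_t noop_llseek(struct file *file, loff_t offset, int origin)
      {
              return file->f_pos;
      }

      static const struct file_operations mydrv_fops = {
              .owner  = THIS_MODULE,
              .llseek = noop_llseek,  /* userspace sees the seek "succeed" */
              /* other methods elided */
      };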
  2. 25 May 2010 (1 commit)
    • direct-io: add a hook for the fs to provide its own submit_bio function · facd07b0
      Josef Bacik authored
      Because BTRFS can do RAID and such, we need our own submit hook so we can set up
      the bios in the correct fashion and handle checksum errors properly.  So there
      are a few changes here:
      
      1) The submit_io hook.  This is straightforward, just call this instead of
      submit_bio.
      
      2) Allow the fs to return -ENOTBLK for reads.  Usually this has only worked for
      writes, since writes can fall back to buffered IO.  But BTRFS needs the option
      of falling back to buffered IO if it encounters a compressed extent, since we
      need to read the entire extent in and decompress it.  So if we get -ENOTBLK back
      from get_block we'll return and fall back to buffered IO, just like in the write
      case.
      
      I've tested these changes with fsx and everything seems to work.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
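
      A hedged sketch of the shape of such a submit hook; "myfs_submit_dio" is a
      hypothetical stand-in (not the real btrfs code) and the hook's parameter
      list is assumed from the description above:

      #include <linux/fs.h>
      #include <linux/bio.h>

      /*
       * Called by the direct-io code in place of submit_bio(), so the fs can set
       * the bio up itself (checksums, picking a RAID mirror, ...) before it goes
       * to the block layer.  Passed to __blockdev_direct_IO() as submit_io.
       */
      static void myfs_submit_dio(int rw, struct bio *bio, struct inode *inode,
                                  loff_t file_offset)
      {
              /* filesystem-specific setup would go here */
              submit_bio(rw, bio);
      }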
  3. 22 May 2010 (10 commits)
    • vfs: Add inode uid,gid,mode init helper · a1bd120d
      Dmitry Monakhov authored
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • vfs: add lockdep annotation to s_vfs_rename_key for ecryptfs · 51ee049e
      Roland Dreier authored
       >  =============================================
       >  [ INFO: possible recursive locking detected ]
       >  2.6.31-2-generic #14~rbd3
       >  ---------------------------------------------
       >  firefox-3.5/4162 is trying to acquire lock:
       >   (&s->s_vfs_rename_mutex){+.+.+.}, at: [<ffffffff81139d31>] lock_rename+0x41/0xf0
       >
       >  but task is already holding lock:
       >   (&s->s_vfs_rename_mutex){+.+.+.}, at: [<ffffffff81139d31>] lock_rename+0x41/0xf0
       >
       >  other info that might help us debug this:
       >  3 locks held by firefox-3.5/4162:
       >   #0:  (&s->s_vfs_rename_mutex){+.+.+.}, at: [<ffffffff81139d31>] lock_rename+0x41/0xf0
       >   #1:  (&sb->s_type->i_mutex_key#11/1){+.+.+.}, at: [<ffffffff81139d5a>] lock_rename+0x6a/0xf0
       >   #2:  (&sb->s_type->i_mutex_key#11/2){+.+.+.}, at: [<ffffffff81139d6f>] lock_rename+0x7f/0xf0
       >
       >  stack backtrace:
       >  Pid: 4162, comm: firefox-3.5 Tainted: G         C 2.6.31-2-generic #14~rbd3
       >  Call Trace:
       >   [<ffffffff8108ae74>] print_deadlock_bug+0xf4/0x100
       >   [<ffffffff8108ce26>] validate_chain+0x4c6/0x750
       >   [<ffffffff8108d2e7>] __lock_acquire+0x237/0x430
       >   [<ffffffff8108d585>] lock_acquire+0xa5/0x150
       >   [<ffffffff81139d31>] ? lock_rename+0x41/0xf0
       >   [<ffffffff815526ad>] __mutex_lock_common+0x4d/0x3d0
       >   [<ffffffff81139d31>] ? lock_rename+0x41/0xf0
       >   [<ffffffff81139d31>] ? lock_rename+0x41/0xf0
       >   [<ffffffff8120eaf9>] ? ecryptfs_rename+0x99/0x170
       >   [<ffffffff81552b36>] mutex_lock_nested+0x46/0x60
       >   [<ffffffff81139d31>] lock_rename+0x41/0xf0
       >   [<ffffffff8120eb2a>] ecryptfs_rename+0xca/0x170
       >   [<ffffffff81139a9e>] vfs_rename_dir+0x13e/0x160
       >   [<ffffffff8113ac7e>] vfs_rename+0xee/0x290
       >   [<ffffffff8113c212>] ? __lookup_hash+0x102/0x160
       >   [<ffffffff8113d512>] sys_renameat+0x252/0x280
       >   [<ffffffff81133eb4>] ? cp_new_stat+0xe4/0x100
       >   [<ffffffff8101316a>] ? sysret_check+0x2e/0x69
       >   [<ffffffff8108c34d>] ? trace_hardirqs_on_caller+0x14d/0x190
       >   [<ffffffff8113d55b>] sys_rename+0x1b/0x20
       >   [<ffffffff81013132>] system_call_fastpath+0x16/0x1b
      
      The trace above is totally reproducible by doing a cross-directory
      rename on an ecryptfs directory.
      
      The issue seems to be that sys_renameat() does lock_rename() then calls
      into the filesystem; if the filesystem is ecryptfs, then
      ecryptfs_rename() again does lock_rename() on the lower filesystem, and
      lockdep can't tell that the two s_vfs_rename_mutexes are different.  It
      seems an annotation like the following is sufficient to fix this (it
      does get rid of the lockdep trace in my simple tests); however I would
      like to make sure I'm not misunderstanding the locking, hence the CC
      list...
      Signed-off-by: Roland Dreier <rdreier@cisco.com>
      Cc: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
      Cc: Dustin Kirkland <kirkland@canonical.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
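
      The patch itself is not reproduced in this log; judging from the commit
      title, an annotation along these lines (a per-filesystem-type lockdep key
      for s_vfs_rename_mutex) is what is meant.  Sketch only, not necessarily the
      exact patch:

      #include <linux/fs.h>
      #include <linux/lockdep.h>

      /*
       * Give each filesystem type its own lockdep class for s_vfs_rename_mutex,
       * so taking ecryptfs's rename mutex while holding the lower filesystem's
       * is no longer reported as recursion.  Assumes a per-type key named
       * s_vfs_rename_key, per the commit title.
       */
      static void set_rename_lock_class(struct super_block *s,
                                        struct file_system_type *type)
      {
              lockdep_set_class(&s->s_vfs_rename_mutex, &type->s_vfs_rename_key);
      }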
    • sanitize vfs_fsync calling conventions · 8018ab05
      Christoph Hellwig authored
      Now that the last user passing a NULL file pointer is gone we can remove
      the redundant dentry argument and the associated hacks inside vfs_fsync_range.
      
      The next step will be removing the dentry argument from ->fsync, but given
      the luck with the last round of method prototype changes I'd rather
      defer this until after the main merge window.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
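
      With the dentry argument gone, a caller only needs the struct file.  A
      minimal sketch of the new convention (signatures as described above, not
      quoted from the patch):

      #include <linux/fs.h>

      /* sync a file's data and metadata; pass datasync=1 to skip non-essential
       * metadata.  No dentry, and no NULL-file special case, involved anymore. */
      static int flush_file(struct file *file)
      {
              return vfs_fsync(file, 0);
      }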
    • fs: xattr_handler table should be const · bb435453
      Stephen Hemminger authored
      The entries in the xattr handler table should be immutable (i.e. const),
      like other operation tables.

      Later patches convert the common filesystems.  Unconverted filesystems
      will still work, but will generate a compiler warning.
      Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
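
      A hedged sketch of a converted filesystem; "foofs" and its handlers are
      hypothetical:

      #include <linux/xattr.h>

      extern const struct xattr_handler foofs_xattr_user_handler;     /* hypothetical */
      extern const struct xattr_handler foofs_xattr_trusted_handler;  /* hypothetical */

      static const struct xattr_handler *foofs_xattr_handlers[] = {
              &foofs_xattr_user_handler,
              &foofs_xattr_trusted_handler,
              NULL,   /* terminator */
      };

      /* at mount time: sb->s_xattr = foofs_xattr_handlers; */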
    • Introduce freeze_super and thaw_super for the fsfreeze ioctl · 18e9e510
      Josef Bacik authored
      Currently the way we do freezing is by passing sb->s_bdev to freeze_bdev and then
      letting it do all the work.  But freezing is more of an fs thing, and doesn't
      really have much to do with the bdev at all; all the work gets done with the
      super.  In btrfs we do not populate s_bdev, since we can have multiple bdev's
      for one fs and setting s_bdev makes removing devices from a pool kind of tricky.
      This means that freezing a btrfs filesystem fails, which can cause corruption
      with things like tux-on-ice which use the fsfreeze mechanism.  So instead of
      populating sb->s_bdev with a random bdev in our pool, I've broken the actual fs
      freezing stuff into freeze_super and thaw_super.  These just take the
      super_block that we're freezing and do the appropriate work.  It's basically
      just copy and pasted from freeze_bdev.  I've then converted freeze_bdev over to
      use the new super helpers.  I've tested this with ext4 and btrfs and verified
      everything continues to work the same as before.
      
      The only new gotcha is that multiple calls to the fsfreeze ioctl will return EBUSY if
      the fs is already frozen.  I thought this was a better solution than adding a
      freeze counter to the super_block, but if everybody hates this idea I'm open to
      suggestions.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
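
      A hedged sketch of how the fsfreeze ioctl path looks once the work is done
      on the superblock; the function names and error handling are simplified,
      not the exact upstream code:

      #include <linux/fs.h>
      #include <linux/capability.h>

      static int fsfreeze_ioctl_freeze(struct file *filp)
      {
              struct super_block *sb = filp->f_path.dentry->d_inode->i_sb;

              if (!capable(CAP_SYS_ADMIN))
                      return -EPERM;
              /* freeze the superblock directly; no s_bdev required */
              return freeze_super(sb);
      }

      static int fsfreeze_ioctl_thaw(struct file *filp)
      {
              struct super_block *sb = filp->f_path.dentry->d_inode->i_sb;

              if (!capable(CAP_SYS_ADMIN))
                      return -EPERM;
              return thaw_super(sb);
      }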
    • new helper: iterate_supers() · 01a05b33
      Al Viro authored
      ... and switch the simple "loop over superblocks and do something"
      loops to it.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
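
      A sketch of the conversion pattern, assuming the helper takes a callback
      plus an opaque argument; "sync_one_sb" is a hypothetical callback:

      #include <linux/fs.h>

      static void sync_one_sb(struct super_block *sb, void *unused)
      {
              if (sb->s_root)
                      sync_filesystem(sb);    /* illustrative body only */
      }

      /* instead of open-coding a loop over super_blocks with s_count games: */
      static void sync_all_supers(void)
      {
              iterate_supers(sync_one_sb, NULL);
      }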
    • Bury __put_super_and_need_restart() · 35cf7ba0
      Al Viro authored
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • get rid of restarts in sync_filesystems() · 8edd64bd
      Al Viro authored
      At the same time we can kill s_need_restart and the local mutex in there.
      __put_super() made public for a while; will be gone later.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • get rid of S_BIAS · b20bd1a5
      Al Viro authored
      Use atomic_inc_not_zero(&sb->s_active) instead of playing games with
      checking ->s_count > S_BIAS.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
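
      A sketch of the pattern described above: take an active reference only if
      the superblock is not already being torn down.

      #include <linux/fs.h>

      static int try_grab_active_ref(struct super_block *sb)
      {
              /* succeeds only while s_active is non-zero, i.e. the sb is alive */
              return atomic_inc_not_zero(&sb->s_active);
      }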
  4. 11 May 2010 (1 commit)
  5. 27 Apr 2010 (1 commit)
    • block: implement bd_claiming and claiming block · 6b4517a7
      Tejun Heo authored
      Currently, device claiming for exclusive open is done after low level
      open - disk->fops->open() - has completed successfully.  This means
      that exclusive open attempts while a device is already exclusively
      open will fail only after disk->fops->open() is called.
      
      The cdrom driver issues commands during open(), which means that an O_EXCL
      open attempt can unintentionally inject commands into an in-progress
      command stream for burning, thus disturbing the burning process.  In most
      cases, this doesn't cause problems because the first command to be
      issued is TUR which most devices can process in the middle of burning.
      However, depending on how a device replies to TUR during burning,
      cdrom driver may end up issuing further commands.
      
      This can't be resolved trivially by moving bd_claim() before doing the
      actual open(), because that means an open attempt which will end up
      failing could interfere with other legitimate O_EXCL open attempts,
      i.e. unconfirmed open attempts can fail others.
      
      This patch resolves the problem by introducing claiming block which is
      started by bd_start_claiming() and terminated either by bd_claim() or
      bd_abort_claiming().  bd_claim() from inside a claiming block is
      guaranteed to succeed and once a claiming block is started, other
      bd_start_claiming() or bd_claim() attempts block till the current
      claiming block is terminated.
      
      bd_claim() can still be used standalone although now it always
      synchronizes against claiming blocks, so the existing users will keep
      working without any change.
      
      blkdev_open() and open_bdev_exclusive() are converted to use claiming
      blocks so that exclusive open attempts from these functions don't
      interfere with the existing exclusive open.
      
      This problem was discovered while investigating bko#15403.
      
        https://bugzilla.kernel.org/show_bug.cgi?id=15403
      
      The burning problem itself can be resolved by updating userspace
      probing tools to always open w/ O_EXCL.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Matthias-Christian Ott <ott@mirix.org>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
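
      A hedged sketch of the claiming-block pattern, roughly as a caller inside
      fs/block_dev.c would use it; "open_block_device_excl" is a hypothetical
      wrapper and the error handling is simplified:

      #include <linux/fs.h>
      #include <linux/blkdev.h>
      #include <linux/err.h>

      static int open_block_device_excl(struct block_device *bdev, fmode_t mode,
                                        void *holder)
      {
              struct block_device *whole;
              int err;

              /* open a claiming block: other claimers now wait on this bdev */
              whole = bd_start_claiming(bdev, holder);
              if (IS_ERR(whole))
                      return PTR_ERR(whole);

              err = blkdev_get(bdev, mode);           /* low-level open */
              if (err) {
                      /* open failed: give up the claim so others can proceed */
                      bd_abort_claiming(whole, holder);
                      return err;
              }

              /* inside a claiming block, bd_claim() is guaranteed to succeed */
              err = bd_claim(bdev, holder);
              WARN_ON(err);
              return err;
      }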
  6. 24 Apr 2010 (1 commit)
  7. 22 Apr 2010 (1 commit)
    • fasync: RCU and fine grained locking · 989a2979
      Eric Dumazet authored
      kill_fasync() uses a central rwlock, a candidate for RCU conversion, to
      avoid cache line ping-pongs on SMP.
      
      fasync_remove_entry() and fasync_add_entry() can disable IRQs over a short
      section instead of during the whole list scan.
      
      Use a spinlock per fasync_struct to synchronize kill_fasync_rcu() and
      fasync_{remove|add}_entry(). This spinlock is IRQ safe, so sock_fasync()
      doesn't need its own implementation and can use fasync_helper(), to
      reduce code size and complexity.
      
      We can remove __kill_fasync() direct use in net/socket.c, and rename it
      to kill_fasync_rcu().
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
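
      For context, a sketch of the (unchanged) driver-side API this work sits
      under: fasync_helper() manages the list and kill_fasync() signals it, with
      the list walk now done under RCU.  "mydev" is a hypothetical device:

      #include <linux/fs.h>
      #include <linux/poll.h>

      static struct fasync_struct *mydev_async_queue;

      static int mydev_fasync(int fd, struct file *filp, int on)
      {
              return fasync_helper(fd, filp, on, &mydev_async_queue);
      }

      /* called when new data arrives; processes that requested FASYNC get SIGIO */
      static void mydev_notify_readers(void)
      {
              kill_fasync(&mydev_async_queue, SIGIO, POLL_IN);
      }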
  8. 07 Apr 2010 (2 commits)
  9. 07 Mar 2010 (2 commits)
    • include/linux/fs.h: convert FMODE_* constants to hex · 19adf9c5
      Andrew Morton authored
      It was tolerable until Eric went and added 8388608.
      
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: introduce FMODE_RANDOM for POSIX_FADV_RANDOM · 0141450f
      Wu Fengguang authored
      This fixes inefficient page-by-page reads on POSIX_FADV_RANDOM.
      
      POSIX_FADV_RANDOM used to set ra_pages=0, which leads to poor performance:
      a 16K read will be carried out in 4 _sync_ 1-page reads.
      
      In other places, ra_pages==0 means
      - it's ramfs/tmpfs/hugetlbfs/sysfs/configfs
      - some IO error happened
      where multi-page read IO won't help or should be avoided.
      
      POSIX_FADV_RANDOM actually wants different semantics: disable the
      *heuristic* readahead algorithm, and use a dumb one which faithfully
      submits read IO for whatever the application requests.
      
      So introduce a flag FMODE_RANDOM for POSIX_FADV_RANDOM.
      
      Note that the random hint is not likely to help random read performance
      noticeably.  And it may be too permissive on huge request sizes (its IO
      size is not limited by read_ahead_kb).
      
      In Quentin's report (http://lkml.org/lkml/2009/12/24/145), the overall
      (NFS read) performance of the application increased by 313%!
      Tested-by: Quentin Barnes <qbarnes+nfs@yahoo-inc.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: <stable@kernel.org>			[2.6.33.x]
      Cc: <qbarnes+nfs@yahoo-inc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
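
      A hedged sketch of the mechanism, simplified from what mm/fadvise.c and
      mm/readahead.c end up doing; this is not the verbatim patch:

      #include <linux/fs.h>

      /* fadvise side: remember the hint on the struct file */
      static void fadvise_mark_random(struct file *file)
      {
              spin_lock(&file->f_lock);
              file->f_mode |= FMODE_RANDOM;
              spin_unlock(&file->f_lock);
      }

      /* readahead side: if the hint is set, bypass the heuristics and just
       * submit the pages that were actually asked for */
      static int readahead_should_be_dumb(struct file *filp)
      {
              return filp && (filp->f_mode & FMODE_RANDOM);
      }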
  10. 06 Mar 2010 (1 commit)
  11. 04 Mar 2010 (6 commits)
  12. 19 Feb 2010 (1 commit)
  13. 14 Jan 2010 (1 commit)
    • Fix ACC_MODE() for real · 6d125529
      Al Viro authored
      commit 5300990c had stepped on a rather
      nasty mess: definitions of ACC_MODE used to be different.  Fixed the
      resulting breakage, converting them to a variant that takes an O_... value;
      all callers have that and it actually simplifies life (see tomoyo part
      of changes).
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
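
      A hedged sketch of the ACC_MODE() variant described above: it takes an
      O_... open-flags value and maps the access-mode bits to MAY_READ /
      MAY_WRITE (the exact upstream definition may differ in detail):

      #include <linux/fs.h>   /* O_ACCMODE, MAY_READ (4), MAY_WRITE (2) */

      /* index 0 = O_RDONLY -> 04, 1 = O_WRONLY -> 02, 2/3 = O_RDWR -> 06 */
      #define ACC_MODE(x) ("\004\002\006\006"[(x) & O_ACCMODE])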
  14. 23 Dec 2009 (3 commits)
  15. 18 Dec 2009 (2 commits)
  16. 17 Dec 2009 (2 commits)
    • xfs: improve metadata I/O merging in the elevator · 2ee1abad
      Dave Chinner authored
      Change all async metadata buffers to use [READ|WRITE]_META I/O types
      so that the I/O doesn't get issued immediately. This allows merging of
      adjacent metadata requests but still prioritises them over bulk data.
      This shows a 10-15% improvement in sequential create speed of small
      files.
      
      Don't include the log buffers in this classification - leave them as
      sync types so they are issued immediately.
      Signed-off-by: Dave Chinner <dgc@sgi.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • cleanup blockdev_direct_IO locking · 1e431f5c
      Christoph Hellwig authored
      Currently the locking in blockdev_direct_IO is a mess: we have three different
      locking types and very confusing checks for some of them.  The most
      complicated one is DIO_OWN_LOCKING for reads, which happens to not actually be
      used.
      
      This patch gets rid of DIO_OWN_LOCKING - as mentioned above, the read case
      is unused anyway, and the write side is almost identical to DIO_NO_LOCKING.
      The difference is that DIO_NO_LOCKING always sets the create argument for
      the get_blocks callback to zero, but we can easily move that into the actual
      get_blocks callbacks.  There are four users of the DIO_NO_LOCKING mode:
      gfs already ignores the create argument and thus is fine with the new
      version; ocfs2 only errors out if create were ever set, and we can remove
      that dead code now; the block device code only ever uses create for an
      error message if we are fully beyond the device, which can never happen;
      and last but not least XFS will need the new behaviour for writes.
      
      Now we can replace the lock_type variable with a flags one, where no flag
      means the DIO_NO_LOCKING behaviour and DIO_LOCKING is kept as the first
      flag.  Separate out the check for not allowing holes to be filled into its own
      flag, although for now both flags always get set at the same time.
      
      Also revamp the documentation of the locking scheme to actually make sense.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
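
      A hedged sketch of what a caller looks like after the cleanup: behaviour is
      selected with flags instead of a lock_type value.  The flag names
      (DIO_LOCKING, DIO_SKIP_HOLES), the __blockdev_direct_IO() parameter list
      and "myfs_get_block" are assumptions, not quoted from the patch:

      #include <linux/fs.h>
      #include <linux/buffer_head.h>

      static int myfs_get_block(struct inode *inode, sector_t iblock,
                                struct buffer_head *bh, int create);

      static ssize_t myfs_direct_IO(int rw, struct kiocb *iocb,
                                    const struct iovec *iov, loff_t offset,
                                    unsigned long nr_segs)
      {
              struct inode *inode = iocb->ki_filp->f_mapping->host;

              /* "locking" variant: i_mutex handling plus no hole instantiation */
              return __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev,
                                          iov, offset, nr_segs, myfs_get_block,
                                          NULL /* end_io */,
                                          DIO_LOCKING | DIO_SKIP_HOLES);
      }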