1. 13 Jan, 2011 1 commit
  2. 07 Jan, 2011 1 commit
    • fs: icache RCU free inodes · fa0d7e3d
      Committed by Nick Piggin
      RCU free the struct inode. This will allow:
      
      - Subsequent store-free path walking patch. The inode must be consulted for
        permissions when walking, so an RCU inode reference is a must.
      - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
        to take i_lock no longer need to take sb_inode_list_lock to walk the list in
        the first place. This will simplify and optimize locking.
      - Could remove some nested trylock loops in dcache code
      - Could potentially simplify things a bit in VM land. Do not need to take the
        page lock to follow page->mapping.
      
      The downside of this is the performance cost of using RCU. In a simple
      creat/unlink microbenchmark, performance drops by about 10% due to inability to
      reuse cache-hot slab objects. As iterations increase and RCU freeing starts
      kicking over, this increases to about 20%.
      
      In cases where inode lifetimes are longer (ie. many inodes may be allocated
      during the average life span of a single inode), a lot of this cache reuse is
      not applicable, so the regression caused by this patch is smaller.
      
      The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
      however this adds some complexity to list walking and store-free path walking,
      so I prefer to implement this at a later date, if it is shown to be a win in
      real situations. I haven't found a regression in any non-micro benchmark so I
      doubt it will be a problem.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
  3. 18 Nov, 2010 1 commit
  4. 29 Oct, 2010 1 commit
  5. 26 Oct, 2010 3 commits
  6. 17 Sep, 2010 1 commit
    • block: remove BLKDEV_IFL_WAIT · dd3932ed
      Committed by Christoph Hellwig
      All the blkdev_issue_* helpers can only sanely be used by synchronous
      callers.  To issue cache flushes or barriers asynchronously the caller needs
      to set up a bio by itself with a completion callback to move the asynchronous
      state machine ahead.  So drop the BLKDEV_IFL_WAIT flag that is always
      specified when calling blkdev_issue_* and also remove the now unused flags
      argument to blkdev_issue_flush and blkdev_issue_zeroout.  For
      blkdev_issue_discard we need to keep it for the secure discard flag, which
      gains a more descriptive name and loses the bitops vs flag confusion.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
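The "bitops vs flag confusion" mentioned in the message above is a general class of bug: a value defined as a bit index is passed where a bit mask is expected, or the reverse. The following is a minimal userspace sketch of that class of mistake. All names here (the `toy_` and `TOY_` identifiers) are hypothetical and are not taken from the kernel headers.

```c
#include <stdbool.h>

/* Hypothetical names, for illustration only. */
enum toy_flag_bit { TOY_BIT_WAIT = 0, TOY_BIT_SECURE = 1 };  /* bit indices */
#define TOY_FLAG_WAIT   (1UL << TOY_BIT_WAIT)    /* mask 0x1 */
#define TOY_FLAG_SECURE (1UL << TOY_BIT_SECURE)  /* mask 0x2 */

/* Bitops-style helper: the first argument is a bit INDEX. */
static bool toy_test_bit(unsigned int nr, unsigned long word)
{
	return (word >> nr) & 1UL;
}

/* Flag-style helper: the first argument is a bit MASK. */
static bool toy_test_flag(unsigned long mask, unsigned long word)
{
	return (word & mask) != 0;
}
```

Passing the mask `TOY_FLAG_SECURE` (0x2) to `toy_test_bit()` tests bit 2 rather than bit 1, so the check silently fails. Defining a flag directly as a mask and testing it with `&` avoids having two representations of the same flag.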
  7. 11 Aug, 2010 1 commit
  8. 10 Aug, 2010 3 commits
  9. 08 Aug, 2010 1 commit
    • block: push down BKL into .open and .release · 6e9624b8
      Committed by Arnd Bergmann
      The open and release block_device_operations are currently
      called with the BKL held. In order to change that, we must
      first make sure that all drivers that currently rely
      on this have no regressions.
      
      This blindly pushes the BKL into all .open and .release
      operations for all block drivers to prepare for the
      next step. The drivers can subsequently replace the BKL
      with their own locks or remove it completely when it can
      be shown that it is not needed.
      
      The functions blkdev_get and blkdev_put are the only
      remaining users of the big kernel lock in the block
      layer, besides a few uses in the ioctl code, none
      of which need to serialize with blkdev_{get,put}.
      
      Most of these two functions is also under the protection
      of bdev->bd_mutex, including the actual calls to
      ->open and ->release, and the common code does not
      access any global data structures that need the BKL.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  10. 05 Aug, 2010 1 commit
  11. 11 Jun, 2010 3 commits
  12. 28 May, 2010 2 commits
  13. 22 May, 2010 2 commits
    • Introduce freeze_super and thaw_super for the fsfreeze ioctl · 18e9e510
      Committed by Josef Bacik
      Currently the way we do freezing is by passing sb->s_bdev to freeze_bdev and then
      letting it do all the work.  But freezing is more of an fs thing, and doesn't
      really have much to do with the bdev at all, all the work gets done with the
      super.  In btrfs we do not populate s_bdev, since we can have multiple bdev's
      for one fs and setting s_bdev makes removing devices from a pool kind of tricky.
      This means that freezing a btrfs filesystem fails, which causes us to corrupt
      with things like tux-on-ice which use the fsfreeze mechanism.  So instead of
      populating sb->s_bdev with a random bdev in our pool, I've broken the actual fs
      freezing stuff into freeze_super and thaw_super.  These just take the
      super_block that we're freezing and do the appropriate work.  It's basically
      just copy and pasted from freeze_bdev.  I've then converted freeze_bdev over to
      use the new super helpers.  I've tested this with ext4 and btrfs and verified
      everything continues to work the same as before.
      
      The only new gotcha is multiple calls to the fsfreeze ioctl will return EBUSY if
      the fs is already frozen.  I thought this was a better solution than adding a
      freeze counter to the super_block, but if everybody hates this idea I'm open to
      suggestions.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • Move grabbing s_umount to callers of grab_super() · d3f21473
      Committed by Al Viro
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
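The freeze_super/thaw_super commit above describes the fsfreeze ioctl returning EBUSY when the filesystem is already frozen. From userspace this can be exercised roughly as below. This is a sketch under assumptions: the helper name `toy_fsfreeze` is hypothetical, the `FIFREEZE` and `FITHAW` request codes come from `<linux/fs.h>` and are not named in the commit message, and the caller needs sufficient privilege (typically CAP_SYS_ADMIN).

```c
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>   /* FIFREEZE, FITHAW */
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Freeze (freeze != 0) or thaw (freeze == 0) the filesystem mounted at
 * `mountpoint`. Returns 0 on success or a negative errno value.
 * Hypothetical helper, for illustration only.
 */
static int toy_fsfreeze(const char *mountpoint, int freeze)
{
	int fd = open(mountpoint, O_RDONLY);
	int ret = 0;

	if (fd < 0)
		return -errno;
	if (ioctl(fd, freeze ? FIFREEZE : FITHAW, 0) < 0)
		ret = -errno;   /* saved before close() can change errno */
	close(fd);
	return ret;
}
```

According to the commit message, a second freeze request on an already frozen filesystem would be expected to come back as `-EBUSY` from this helper. Freezing blocks writers until the filesystem is thawed, so it should only be tried on a scratch filesystem.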
  14. 29 Apr, 2010 1 commit
  15. 27 Apr, 2010 2 commits
    • block: implement bd_claiming and claiming block · 6b4517a7
      Committed by Tejun Heo
      Currently, device claiming for exclusive open is done after low level
      open - disk->fops->open() - has completed successfully.  This means
      that exclusive open attempts while a device is already exclusively
      open will fail only after disk->fops->open() is called.
      
      cdrom driver issues commands during open() which means that O_EXCL
      open attempt can unintentionally inject commands to in-progress
      command stream for burning thus disturbing burning process.  In most
      cases, this doesn't cause problems because the first command to be
      issued is TUR which most devices can process in the middle of burning.
      However, depending on how a device replies to TUR during burning,
      cdrom driver may end up issuing further commands.
      
      This can't be resolved trivially by moving bd_claim() before doing
      actual open() because that means an open attempt which will end up
      failing could interfere with other legit O_EXCL open attempts.
      ie. unconfirmed open attempts can fail others.
      
      This patch resolves the problem by introducing claiming block which is
      started by bd_start_claiming() and terminated either by bd_claim() or
      bd_abort_claiming().  bd_claim() from inside a claiming block is
      guaranteed to succeed and once a claiming block is started, other
      bd_start_claiming() or bd_claim() attempts block till the current
      claiming block is terminated.
      
      bd_claim() can still be used standalone although now it always
      synchronizes against claiming blocks, so the existing users will keep
      working without any change.
      
      blkdev_open() and open_bdev_exclusive() are converted to use claiming
      blocks so that exclusive open attempts from these functions don't
      interfere with the existing exclusive open.
      
      This problem was discovered while investigating bko#15403.
      
        https://bugzilla.kernel.org/show_bug.cgi?id=15403
      
      The burning problem itself can be resolved by updating userspace
      probing tools to always open w/ O_EXCL.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Matthias-Christian Ott <ott@mirix.org>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: factor out bd_may_claim() · 1a3cbbc5
      Committed by Tejun Heo
      Factor out bd_may_claim() from bd_claim(), add comments and apply a
      couple of cosmetic edits.  This is to prepare for further updates to
      claim path.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
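The claiming-block protocol described in the bd_claiming commit above (start a claim, then either complete it or abort it) can be modelled in a few lines of userspace C. This is a toy state machine, not the kernel implementation: the struct, its fields and the `toy_` functions are hypothetical, and where the kernel would sleep until the current claiming block ends, this model returns `-EAGAIN` instead.

```c
#include <errno.h>
#include <stddef.h>

/* Toy model only: field and function names are hypothetical. */
struct toy_bdev {
	const void *holder;    /* current exclusive holder, or NULL */
	const void *claiming;  /* owner of an in-progress claiming block, or NULL */
};

/*
 * Begin a claiming block. Returns -EBUSY if the device is held by someone
 * else, and -EAGAIN where the kernel would instead block until the current
 * claiming block terminates.
 */
static int toy_start_claiming(struct toy_bdev *b, const void *who)
{
	if (b->holder && b->holder != who)
		return -EBUSY;
	if (b->claiming && b->claiming != who)
		return -EAGAIN;
	b->claiming = who;
	return 0;
}

/* Complete a claim. From inside one's own claiming block this cannot fail. */
static int toy_claim(struct toy_bdev *b, const void *who)
{
	if (b->claiming && b->claiming != who)
		return -EAGAIN;
	if (b->holder && b->holder != who)
		return -EBUSY;
	b->holder = who;
	b->claiming = NULL;
	return 0;
}

/* Abandon a claiming block, e.g. after the low-level open failed. */
static void toy_abort_claiming(struct toy_bdev *b, const void *who)
{
	if (b->claiming == who)
		b->claiming = NULL;
}
```

The point of the model is the ordering: a caller would start the claiming block before the low-level open, then call `toy_claim()` on success or `toy_abort_claiming()` on failure, so an open attempt that ends up failing never becomes the holder and never reaches the driver while another exclusive opener is active.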
  16. 25 Apr, 2010 1 commit
    • fs/block_dev.c: fix performance regression in O_DIRECT|O_SYNC writes to block devices · b8af67e2
      Committed by Anton Blanchard
      We are seeing a large regression in database performance on recent
      kernels.  The database opens a block device with O_DIRECT|O_SYNC and a
      number of threads write to different regions of the file at the same time.
      
      A simple test case is below.  I haven't defined DEVICE since getting it
      wrong will destroy your data :) On a 3 disk LVM with a 64k chunk size we
      see about 17MB/sec and only a few threads in IO wait:
      
      procs  -----io---- -system-- -----cpu------
       r  b     bi    bo   in   cs us sy id wa st
       0  3      0 16170  656 2259  0  0 86 14  0
       0  2      0 16704  695 2408  0  0 92  8  0
       0  2      0 17308  744 2653  0  0 86 14  0
       0  2      0 17933  759 2777  0  0 89 10  0
      
      Most threads are blocking in vfs_fsync_range, which has:
      
              mutex_lock(&mapping->host->i_mutex);
              err = fop->fsync(file, dentry, datasync);
              if (!ret)
                      ret = err;
              mutex_unlock(&mapping->host->i_mutex);
      
      commit 148f948b (vfs: Introduce new
      helpers for syncing after writing to O_SYNC file or IS_SYNC inode) offers
      some explanation of what is going on:
      
          Use these new helpers for syncing from generic VFS functions. This makes
          O_SYNC writes to block devices acquire i_mutex for syncing. If we really
          care about this, we can make block_fsync() drop the i_mutex and reacquire
          it before it returns.
      
      Thanks Jan for such a good commit message!  As well as dropping i_mutex,
      Christoph suggests we should remove the call to sync_blockdev():
      
      > sync_blockdev is an overcomplicated alias for filemap_write_and_wait on
      > the block device inode, which is exactly what we did just before calling
      > into ->fsync
      
      The patch below incorporates both suggestions. With it the testcase improves
      from 17MB/sec to 68MB/sec:
      
      procs  -----io---- -system-- -----cpu------
       r  b     bi    bo   in   cs us sy id wa st
       0  7      0 65536 1000 3878  0  0 70 30  0
       0 34      0 69632 1016 3921  0  1 46 53  0
       0 57      0 69632 1000 3921  0  0 55 45  0
       0 53      0 69640  754 4111  0  0 81 19  0
      
      Testcase:
      
      #define _GNU_SOURCE
      #include <stdio.h>
      #include <pthread.h>
      #include <unistd.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/types.h>
      #include <sys/stat.h>
      #include <fcntl.h>
      
      #define NR_THREADS 64
      #define BUFSIZE (64 * 1024)
      
      #define DEVICE "/dev/mapper/XXXXXX"
      
      #define ALIGN(VAL, SIZE) (((VAL)+(SIZE)-1) & ~((SIZE)-1))
      
      static int fd;
      
      static void *doit(void *arg)
      {
      	unsigned long offset = (long)arg;
      	char *b, *buf;
      
      	b = malloc(BUFSIZE + 1024);
      	buf = (char *)ALIGN((unsigned long)b, 1024);
      	memset(buf, 0, BUFSIZE);
      
      	while (1)
      		pwrite(fd, buf, BUFSIZE, offset);
      }
      
      int main(int argc, char *argv[])
      {
      	int flags = O_RDWR|O_DIRECT;
      	int i;
      	unsigned long offset = 0;
      
      	if (argc > 1 && !strcmp(argv[1], "O_SYNC"))
      		flags |= O_SYNC;
      
      	fd = open(DEVICE, flags);
      	if (fd == -1) {
      		perror("open");
      		exit(1);
      	}
      
      	for (i = 0; i < NR_THREADS-1; i++) {
      		pthread_t tid;
      		pthread_create(&tid, NULL, doit, (void *)offset);
      		offset += BUFSIZE;
      	}
      	doit((void *)offset);
      
      	return 0;
      }
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Acked-by: Jan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 07 Apr, 2010 2 commits
  18. 07 Feb, 2010 1 commit
  19. 29 Oct, 2009 1 commit
    • blkdev: flush disk cache on ->fsync · ab0a9735
      Committed by Christoph Hellwig
      Currently there is no barrier support in the block device code.  That
      means we cannot guarantee any sort of data integrity when using the
      block device node with disk write caches enabled.  Using the raw block
      device node is a typical use case for virtualization (and I assume
      databases, too).  This patch changes block_fsync to issue a cache flush
      and thus make fsync on block device nodes actually useful.
      
      Note that in mainline we would also need to add such code to the
      ->aio_write method for O_SYNC handling, but assuming that Jan's patch
      series for the O_SYNC rewrite goes in it will also call into ->fsync
      for 2.6.32.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
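Seen from userspace, the effect of the commit above is that `fsync()` on a block device node becomes a meaningful durability point, because it now also flushes the disk write cache. A caller that relies on this might look like the sketch below. The helper name `toy_write_durable` is hypothetical, and the sketch works on any writable path. Pointing it at a real block device overwrites data there.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write `len` bytes from `buf` at offset `off`, then fsync() the file.
 * Returns 0 on success or a negative errno value.
 * Hypothetical helper, for illustration only.
 */
static int toy_write_durable(const char *path, const void *buf,
			     size_t len, off_t off)
{
	const char *p = buf;
	int fd = open(path, O_WRONLY);
	int ret = 0;

	if (fd < 0)
		return -errno;

	while (len > 0) {
		ssize_t n = pwrite(fd, p, len, off);

		if (n < 0) {
			if (errno == EINTR)
				continue;       /* interrupted, retry */
			ret = -errno;
			break;
		}
		if (n == 0) {                   /* no progress, give up */
			ret = -EIO;
			break;
		}
		p += n;
		len -= (size_t)n;
		off += n;
	}

	/* The data is only considered durable once fsync() has succeeded. */
	if (ret == 0 && fsync(fd) < 0)
		ret = -errno;
	if (close(fd) < 0 && ret == 0)
		ret = -errno;
	return ret;
}
```

Before this change, a successful `fsync()` on a block device node did not guarantee that the data had left the drive's volatile write cache, which is the gap the commit message describes.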
  20. 26 Oct, 2009 1 commit
  21. 24 Sep, 2009 2 commits
    • freeze_bdev: grab active reference to frozen superblocks · 4504230a
      Committed by Christoph Hellwig
      Currently we hold s_umount while a filesystem is frozen, even though we
      might return to userspace and unlock it from a different process.  Instead
      grab an active reference to keep the file system busy and add an explicit
      check for frozen filesystems in remount and reject the remount instead
      of blocking on s_umount.
      
      Add a new get_active_super helper to super.c for use by freeze_bdev that
      grabs an active reference to a superblock from a given block device.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • freeze_bdev: kill bd_mount_sem · 4fadd7bb
      Committed by Christoph Hellwig
      Now that we have the freeze count there is not much reason for bd_mount_sem
      anymore.  The actual freeze/thaw operations are serialized using the
      bd_fsfreeze_mutex, and the only other place we take bd_mount_sem is
      get_sb_bdev which tries to prevent mounting a filesystem while the block
      device is frozen.  Instead, add a check for bd_fsfreeze_count and
      return -EBUSY if a filesystem is frozen.  While that is a change in user
      visible behaviour a failing mount is much better for this case rather
      than having the mount process stuck uninterruptible for a long time.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  22. 22 Sep, 2009 1 commit
  23. 16 Sep, 2009 1 commit
  24. 14 Sep, 2009 1 commit
  25. 30 Jul, 2009 1 commit
  26. 12 Jun, 2009 4 commits