1. 22 5月, 2010 1 次提交
  2. 25 4月, 2010 1 次提交
    • A
      fs/block_dev.c: fix performance regression in O_DIRECT|O_SYNC writes to block devices · b8af67e2
      Anton Blanchard 提交于
      We are seeing a large regression in database performance on recent
      kernels.  The database opens a block device with O_DIRECT|O_SYNC and a
      number of threads write to different regions of the file at the same time.
      
      A simple test case is below.  I haven't defined DEVICE since getting it
      wrong will destroy your data :) On an 3 disk LVM with a 64k chunk size we
      see about 17MB/sec and only a few threads in IO wait:
      
      procs  -----io---- -system-- -----cpu------
       r  b     bi    bo   in   cs us sy id wa st
       0  3      0 16170  656 2259  0  0 86 14  0
       0  2      0 16704  695 2408  0  0 92  8  0
       0  2      0 17308  744 2653  0  0 86 14  0
       0  2      0 17933  759 2777  0  0 89 10  0
      
      Most threads are blocking in vfs_fsync_range, which has:
      
              mutex_lock(&mapping->host->i_mutex);
              err = fop->fsync(file, dentry, datasync);
              if (!ret)
                      ret = err;
              mutex_unlock(&mapping->host->i_mutex);
      
      commit 148f948b (vfs: Introduce new
      helpers for syncing after writing to O_SYNC file or IS_SYNC inode) offers
      some explanation of what is going on:
      
          Use these new helpers for syncing from generic VFS functions. This makes
          O_SYNC writes to block devices acquire i_mutex for syncing. If we really
          care about this, we can make block_fsync() drop the i_mutex and reacquire
          it before it returns.
      
      Thanks Jan for such a good commit message!  As well as dropping i_mutex,
      Christoph suggests we should remove the call to sync_blockdev():
      
      > sync_blockdev is an overcomplicated alias for filemap_write_and_wait on
      > the block device inode, which is exactly what we did just before calling
      > into ->fsync
      
      The patch below incorporates both suggestions. With it the testcase improves
      from 17MB/s to 68M/sec:
      
      procs  -----io---- -system-- -----cpu------
       r  b     bi    bo   in   cs us sy id wa st
       0  7      0 65536 1000 3878  0  0 70 30  0
       0 34      0 69632 1016 3921  0  1 46 53  0
       0 57      0 69632 1000 3921  0  0 55 45  0
       0 53      0 69640  754 4111  0  0 81 19  0
      
      Testcase:
      
      #define _GNU_SOURCE
      #include <stdio.h>
      #include <pthread.h>
      #include <unistd.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/types.h>
      #include <sys/stat.h>
      #include <fcntl.h>
      
      #define NR_THREADS 64
      #define BUFSIZE (64 * 1024)
      
      #define DEVICE "/dev/mapper/XXXXXX"
      
      #define ALIGN(VAL, SIZE) (((VAL)+(SIZE)-1) & ~((SIZE)-1))
      
      static int fd;
      
      static void *doit(void *arg)
      {
      	unsigned long offset = (long)arg;
      	char *b, *buf;
      
      	b = malloc(BUFSIZE + 1024);
      	buf = (char *)ALIGN((unsigned long)b, 1024);
      	memset(buf, 0, BUFSIZE);
      
      	while (1)
      		pwrite(fd, buf, BUFSIZE, offset);
      }
      
      int main(int argc, char *argv[])
      {
      	int flags = O_RDWR|O_DIRECT;
      	int i;
      	unsigned long offset = 0;
      
      	if (argc > 1 && !strcmp(argv[1], "O_SYNC"))
      		flags |= O_SYNC;
      
      	fd = open(DEVICE, flags);
      	if (fd == -1) {
      		perror("open");
      		exit(1);
      	}
      
      	for (i = 0; i < NR_THREADS-1; i++) {
      		pthread_t tid;
      		pthread_create(&tid, NULL, doit, (void *)offset);
      		offset += BUFSIZE;
      	}
      	doit((void *)offset);
      
      	return 0;
      }
      Signed-off-by: NAnton Blanchard <anton@samba.org>
      Acked-by: NJan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b8af67e2
  3. 07 4月, 2010 2 次提交
  4. 07 2月, 2010 1 次提交
  5. 29 10月, 2009 1 次提交
    • C
      blkdev: flush disk cache on ->fsync · ab0a9735
      Christoph Hellwig 提交于
      Currently there is no barrier support in the block device code.  That
      means we cannot guarantee any sort of data integerity when using the
      block device node with dis kwrite caches enabled.  Using the raw block
      device node is a typical use case for virtualization (and I assume
      databases, too).  This patch changes block_fsync to issue a cache flush
      and thus make fsync on block device nodes actually useful.
      
      Note that in mainline we would also need to add such code to the
      ->aio_write method for O_SYNC handling, but assuming that Jan's patch
      series for the O_SYNC rewrite goes in it will also call into ->fsync
      for 2.6.32.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      ab0a9735
  6. 26 10月, 2009 1 次提交
  7. 24 9月, 2009 2 次提交
    • C
      freeze_bdev: grab active reference to frozen superblocks · 4504230a
      Christoph Hellwig 提交于
      Currently we held s_umount while a filesystem is frozen, despite that we
      might return to userspace and unlock it from a different process.  Instead
      grab an active reference to keep the file system busy and add an explicit
      check for frozen filesystems in remount and reject the remount instead
      of blocking on s_umount.
      
      Add a new get_active_super helper to super.c for use by freeze_bdev that
      grabs an active reference to a superblock from a given block device.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      4504230a
    • C
      freeze_bdev: kill bd_mount_sem · 4fadd7bb
      Christoph Hellwig 提交于
      Now that we have the freeze count there is not much reason for bd_mount_sem
      anymore.  The actual freeze/thaw operations are serialized using the
      bd_fsfreeze_mutex, and the only other place we take bd_mount_sem is
      get_sb_bdev which tries to prevent mounting a filesystem while the block
      device is frozen.  Instead of add a check for bd_fsfreeze_count and
      return -EBUSY if a filesystem is frozen.  While that is a change in user
      visible behaviour a failing mount is much better for this case rather
      than having the mount process stuck uninterruptible for a long time.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      4fadd7bb
  8. 22 9月, 2009 1 次提交
  9. 16 9月, 2009 1 次提交
  10. 14 9月, 2009 1 次提交
  11. 30 7月, 2009 1 次提交
  12. 12 6月, 2009 4 次提交
  13. 05 6月, 2009 1 次提交
  14. 23 5月, 2009 1 次提交
  15. 28 4月, 2009 1 次提交
  16. 01 4月, 2009 1 次提交
  17. 28 3月, 2009 1 次提交
  18. 10 1月, 2009 1 次提交
    • T
      filesystem freeze: implement generic freeze feature · fcccf502
      Takashi Sato 提交于
      The ioctls for the generic freeze feature are below.
      o Freeze the filesystem
        int ioctl(int fd, int FIFREEZE, arg)
          fd: The file descriptor of the mountpoint
          FIFREEZE: request code for the freeze
          arg: Ignored
          Return value: 0 if the operation succeeds. Otherwise, -1
      
      o Unfreeze the filesystem
        int ioctl(int fd, int FITHAW, arg)
          fd: The file descriptor of the mountpoint
          FITHAW: request code for unfreeze
          arg: Ignored
          Return value: 0 if the operation succeeds. Otherwise, -1
          Error number: If the filesystem has already been unfrozen,
                        errno is set to EINVAL.
      
      [akpm@linux-foundation.org: fix CONFIG_BLOCK=n]
      Signed-off-by: NTakashi Sato <t-sato@yk.jp.nec.com>
      Signed-off-by: NMasayuki Hamaguchi <m-hamaguchi@ys.jp.nec.com>
      Cc: <xfs-masters@oss.sgi.com>
      Cc: <linux-ext4@vger.kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Kleikamp <shaggy@austin.ibm.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Alasdair G Kergon <agk@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fcccf502
  19. 09 1月, 2009 1 次提交
    • N
      md: make devices disappear when they are no longer needed. · d3374825
      NeilBrown 提交于
      Currently md devices, once created, never disappear until the module
      is unloaded.  This is essentially because the gendisk holds a
      reference to the mddev, and the mddev holds a reference to the
      gendisk, this a circular reference.
      
      If we drop the reference from mddev to gendisk, then we need to ensure
      that the mddev is destroyed when the gendisk is destroyed.  However it
      is not possible to hook into the gendisk destruction process to enable
      this.
      
      So we drop the reference from the gendisk to the mddev and destroy the
      gendisk when the mddev gets destroyed.  However this has a
      complication.
      Between the call
         __blkdev_get->get_gendisk->kobj_lookup->md_probe
      and the call
         __blkdev_get->md_open
      
      there is no obvious way to hold a reference on the mddev any more, so
      unless something is done, it will disappear and gendisk will be
      destroyed prematurely.
      
      Also, once we decide to destroy the mddev, there will be an unlockable
      moment before the gendisk is unlinked (blk_unregister_region) during
      which a new reference to the gendisk can be created.  We need to
      ensure that this reference can not be used.  i.e. the ->open must
      fail.
      
      So:
       1/  in md_probe we set a flag in the mddev (hold_active) which
           indicates that the array should be treated as active, even
           though there are no references, and no appearance of activity.
           This is cleared by md_release when the device is closed if it
           is no longer needed.
           This ensures that the gendisk will survive between md_probe and
           md_open.
      
       2/  In md_open we check if the mddev we expect to open matches
           the gendisk that we did open.
           If there is a mismatch we return -ERESTARTSYS and modify
           __blkdev_get to retry from the top in that case.
           In the -ERESTARTSYS sys case we make sure to wait until
           the old gendisk (that we succeeded in opening) is really gone so
           we loop at most once.
      
      Some udev configurations will always open an md device when it first
      appears.   If we allow an md device that was just created by an open
      to disappear on an immediate close, then this can race with such udev
      configurations and result in an infinite loop the device being opened
      and closed, then re-open due to the 'ADD' even from the first open,
      and then close and so on.
      So we make sure an md device, once created by an open, remains active
      at least until some md 'ioctl' has been made on it.  This means that
      all normal usage of md devices will allow them to disappear promptly
      when not needed, but the worst that an incorrect usage will do it
      cause an inactive md device to be left in existence (it can easily be
      removed).
      
      As an array can be stopped by writing to a sysfs attribute
        echo clear > /sys/block/mdXXX/md/array_state
      we need to use scheduled work for deleting the gendisk and other
      kobjects.  This allows us to wait for any pending gendisk deletion to
      complete by simply calling flush_scheduled_work().
      Signed-off-by: NNeilBrown <neilb@suse.de>
      d3374825
  20. 07 1月, 2009 1 次提交
  21. 03 1月, 2009 1 次提交
  22. 01 1月, 2009 1 次提交
  23. 04 12月, 2008 2 次提交
  24. 06 11月, 2008 1 次提交
  25. 23 10月, 2008 1 次提交
  26. 21 10月, 2008 8 次提交
  27. 17 10月, 2008 1 次提交
    • R
      block: fix current kernel-doc warnings · 496aa8a9
      Randy Dunlap 提交于
      Fix block kernel-doc warnings:
      
      Warning(linux-2.6.27-git4//fs/block_dev.c:1272): No description found for parameter 'path'
      Warning(linux-2.6.27-git4//block/blk-core.c:1021): No description found for parameter 'cpu'
      Warning(linux-2.6.27-git4//block/blk-core.c:1021): No description found for parameter 'part'
      Warning(/var/linsrc/linux-2.6.27-git4//block/genhd.c:544): No description found for parameter 'partno'
      Signed-off-by: NRandy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      496aa8a9