1. 17 12月, 2010 1 次提交
    • T
      implement in-kernel gendisk events handling · 77ea887e
      Tejun Heo 提交于
      Currently, media presence polling for removeable block devices is done
      from userland.  There are several issues with this.
      
      * Polling is done by periodically opening the device.  For SCSI
        devices, the command sequence generated by such action involves a
        few different commands including TEST_UNIT_READY.  This behavior,
        while perfectly legal, is different from Windows which only issues
        single command, GET_EVENT_STATUS_NOTIFICATION.  Unfortunately, some
        ATAPI devices lock up after being periodically queried such command
        sequences.
      
      * There is no reliable and unintrusive way for a userland program to
        tell whether the target device is safe for media presence polling.
        For example, polling for media presence during an on-going burning
        session can make it fail.  The polling program can avoid this by
        opening the device with O_EXCL but then it risks making a valid
        exclusive user of the device fail w/ -EBUSY.
      
      * Userland polling is unnecessarily heavy and in-kernel implementation
        is lighter and better coordinated (workqueue, timer slack).
      
      This patch implements framework for in-kernel disk event handling,
      which includes media presence polling.
      
      * bdops->check_events() is added, which supercedes ->media_changed().
        It should check whether there's any pending event and return if so.
        Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and
        DISK_EVENT_EJECT_REQUEST.  ->check_events() is guaranteed not to be
        called parallelly.
      
      * gendisk->events and ->async_events are added.  These should be
        initialized by block driver before passing the device to add_disk().
        The former contains the mask of all supported events and the latter
        the mask of all events which the device can report without polling.
        /sys/block/*/events[_async] export these to userland.
      
      * Kernel parameter block.events_dfl_poll_msecs controls the system
        polling interval (default is 0 which means disable) and
        /sys/block/*/events_poll_msecs control polling intervals for
        individual devices (default is -1 meaning use system setting).  Note
        that if a device can report all supported events asynchronously and
        its polling interval isn't explicitly set, the device won't be
        polled regardless of the system polling interval.
      
      * If a device is opened exclusively with write access, event checking
        is automatically disabled until all write exclusive accesses are
        released.
      
      * There are event 'clearing' events.  For example, both of currently
        defined events are cleared after the device has been successfully
        opened.  This information is passed to ->check_events() callback
        using @clearing argument as a hint.
      
      * Event checking is always performed from system_nrt_wq and timer
        slack is set to 25% for polling.
      
      * Nothing changes for drivers which implement ->media_changed() but
        not ->check_events().  Going forward, all drivers will be converted
        to ->check_events() and ->media_change() will be dropped.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      77ea887e
  2. 13 11月, 2010 5 次提交
    • T
      block: clean up blkdev_get() wrappers and their users · d4d77629
      Tejun Heo 提交于
      After recent blkdev_get() modifications, open_by_devnum() and
      open_bdev_exclusive() are simple wrappers around blkdev_get().
      Replace them with blkdev_get_by_dev() and blkdev_get_by_path().
      
      blkdev_get_by_dev() is identical to open_by_devnum().
      blkdev_get_by_path() is slightly different in that it doesn't
      automatically add %FMODE_EXCL to @mode.
      
      All users are converted.  Most conversions are mechanical and don't
      introduce any behavior difference.  There are several exceptions.
      
      * btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no
        reason to OR it explicitly on blkdev_put().
      
      * gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in
        sb->s_mode.
      
      * With the above changes, sb->s_mode now always should contain
        FMODE_EXCL.  WARN_ON_ONCE() added to kill_block_super() to detect
        errors.
      
      The new blkdev_get_*() functions are with proper docbook comments.
      While at it, add function description to blkdev_get() too.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Joern Engel <joern@lazybastard.org>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Cc: reiserfs-devel@vger.kernel.org
      Cc: xfs-masters@oss.sgi.com
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      d4d77629
    • T
      block: check bdev_read_only() from blkdev_get() · 75f1dc0d
      Tejun Heo 提交于
      bdev read-only status can be queried using bdev_read_only() and may
      change while the device is being opened.  Enforce it by checking it
      from blkdev_get() after open succeeds.
      
      This makes bdev_read_only() check in open_bdev_exclusive() and
      fsg_lun_open() unnecessary.  Drop them.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: David Brownell <dbrownell@users.sourceforge.net>
      Cc: linux-usb@vger.kernel.org
      75f1dc0d
    • T
      block: reorganize claim/release implementation · 6a027eff
      Tejun Heo 提交于
      With claim/release rolled into blkdev_get/put(), there's no reason to
      keep bd_abort/finish_claim(), __bd_claim() and bd_release() as
      separate functions.  It only makes the code difficult to follow.
      Collapse them into blkdev_get/put().  This will ease future changes
      around claim/release.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      6a027eff
    • T
      block: make blkdev_get/put() handle exclusive access · e525fd89
      Tejun Heo 提交于
      Over time, block layer has accumulated a set of APIs dealing with bdev
      open, close, claim and release.
      
      * blkdev_get/put() are the primary open and close functions.
      
      * bd_claim/release() deal with exclusive open.
      
      * open/close_bdev_exclusive() are combination of open and claim and
        the other way around, respectively.
      
      * bd_link/unlink_disk_holder() to create and remove holder/slave
        symlinks.
      
      * open_by_devnum() wraps bdget() + blkdev_get().
      
      The interface is a bit confusing and the decoupling of open and claim
      makes it impossible to properly guarantee exclusive access as
      in-kernel open + claim sequence can disturb the existing exclusive
      open even before the block layer knows the current open if for another
      exclusive access.  Reorganize the interface such that,
      
      * blkdev_get() is extended to include exclusive access management.
        @holder argument is added and, if is @FMODE_EXCL specified, it will
        gain exclusive access atomically w.r.t. other exclusive accesses.
      
      * blkdev_put() is similarly extended.  It now takes @mode argument and
        if @FMODE_EXCL is set, it releases an exclusive access.  Also, when
        the last exclusive claim is released, the holder/slave symlinks are
        removed automatically.
      
      * bd_claim/release() and close_bdev_exclusive() are no longer
        necessary and either made static or removed.
      
      * bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
        is no longer necessary and removed.
      
      * open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
        and blkdev_get().  It also has an unexpected extra bdev_read_only()
        test which probably should be moved into blkdev_get().
      
      * open_by_devnum() is modified to take @holder argument and pass it to
        blkdev_get().
      
      Most of bdev open/close operations are unified into blkdev_get/put()
      and most exclusive accesses are tested atomically at the open time (as
      it should).  This cleans up code and removes some, both valid and
      invalid, but unnecessary all the same, corner cases.
      
      open_bdev_exclusive() and open_by_devnum() can use further cleanup -
      rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
      special features.  Well, let's leave them for another day.
      
      Most conversions are straight-forward.  drbd conversion is a bit more
      involved as there was some reordering, but the logic should stay the
      same.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NNeil Brown <neilb@suse.de>
      Acked-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Acked-by: NMike Snitzer <snitzer@redhat.com>
      Acked-by: NPhilipp Reisner <philipp.reisner@linbit.com>
      Cc: Peter Osterlund <petero2@telia.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: dm-devel@redhat.com
      Cc: drbd-dev@lists.linbit.com
      Cc: Leo Chen <leochen@broadcom.com>
      Cc: Scott Branden <sbranden@broadcom.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
      Cc: Joern Engel <joern@logfs.org>
      Cc: reiserfs-devel@vger.kernel.org
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      e525fd89
    • T
      block: simplify holder symlink handling · e09b457b
      Tejun Heo 提交于
      Code to manage symlinks in /sys/block/*/{holders|slaves} are overly
      complex with multiple holder considerations, redundant extra
      references to all involved kobjects, unused generic kobject holder
      support and unnecessary mixup with bd_claim/release functionalities.
      
      Strip it down to what's necessary (single gendisk holder) and make it
      use a separate interface.  This is a step for cleaning up
      bd_claim/release.  This patch makes dm-table slightly more complex but
      it will be simplified again with further changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NNeil Brown <neilb@suse.de>
      Acked-by: NMike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      e09b457b
  3. 29 10月, 2010 1 次提交
  4. 26 10月, 2010 3 次提交
  5. 17 9月, 2010 1 次提交
    • C
      block: remove BLKDEV_IFL_WAIT · dd3932ed
      Christoph Hellwig 提交于
      All the blkdev_issue_* helpers can only sanely be used for synchronous
      caller.  To issue cache flushes or barriers asynchronously the caller needs
      to set up a bio by itself with a completion callback to move the asynchronous
      state machine ahead.  So drop the BLKDEV_IFL_WAIT flag that is always
      specified when calling blkdev_issue_* and also remove the now unused flags
      argument to blkdev_issue_flush and blkdev_issue_zeroout.  For
      blkdev_issue_discard we need to keep it for the secure discard flag, which
      gains a more descriptive name and loses the bitops vs flag confusion.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      dd3932ed
  6. 11 8月, 2010 1 次提交
  7. 10 8月, 2010 3 次提交
  8. 08 8月, 2010 1 次提交
    • A
      block: push down BKL into .open and .release · 6e9624b8
      Arnd Bergmann 提交于
      The open and release block_device_operations are currently
      called with the BKL held. In order to change that, we must
      first make sure that all drivers that currently rely
      on this have no regressions.
      
      This blindly pushes the BKL into all .open and .release
      operations for all block drivers to prepare for the
      next step. The drivers can subsequently replace the BKL
      with their own locks or remove it completely when it can
      be shown that it is not needed.
      
      The functions blkdev_get and blkdev_put are the only
      remaining users of the big kernel lock in the block
      layer, besides a few uses in the ioctl code, none
      of which need to serialize with blkdev_{get,put}.
      
      Most of these two functions is also under the protection
      of bdev->bd_mutex, including the actual calls to
      ->open and ->release, and the common code does not
      access any global data structures that need the BKL.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NChristoph Hellwig <hch@infradead.org>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      6e9624b8
  9. 05 8月, 2010 1 次提交
  10. 11 6月, 2010 3 次提交
  11. 28 5月, 2010 2 次提交
  12. 22 5月, 2010 2 次提交
    • J
      Introduce freeze_super and thaw_super for the fsfreeze ioctl · 18e9e510
      Josef Bacik 提交于
      Currently the way we do freezing is by passing sb>s_bdev to freeze_bdev and then
      letting it do all the work.  But freezing is more of an fs thing, and doesn't
      really have much to do with the bdev at all, all the work gets done with the
      super.  In btrfs we do not populate s_bdev, since we can have multiple bdev's
      for one fs and setting s_bdev makes removing devices from a pool kind of tricky.
      This means that freezing a btrfs filesystem fails, which causes us to corrupt
      with things like tux-on-ice which use the fsfreeze mechanism.  So instead of
      populating sb->s_bdev with a random bdev in our pool, I've broken the actual fs
      freezing stuff into freeze_super and thaw_super.  These just take the
      super_block that we're freezing and does the appropriate work.  It's basically
      just copy and pasted from freeze_bdev.  I've then converted freeze_bdev over to
      use the new super helpers.  I've tested this with ext4 and btrfs and verified
      everything continues to work the same as before.
      
      The only new gotcha is multiple calls to the fsfreeze ioctl will return EBUSY if
      the fs is already frozen.  I thought this was a better solution than adding a
      freeze counter to the super_block, but if everybody hates this idea I'm open to
      suggestions.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      18e9e510
    • A
      Move grabbing s_umount to callers of grab_super() · d3f21473
      Al Viro 提交于
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      d3f21473
  13. 29 4月, 2010 1 次提交
  14. 27 4月, 2010 2 次提交
    • T
      block: implement bd_claiming and claiming block · 6b4517a7
      Tejun Heo 提交于
      Currently, device claiming for exclusive open is done after low level
      open - disk->fops->open() - has completed successfully.  This means
      that exclusive open attempts while a device is already exclusively
      open will fail only after disk->fops->open() is called.
      
      cdrom driver issues commands during open() which means that O_EXCL
      open attempt can unintentionally inject commands to in-progress
      command stream for burning thus disturbing burning process.  In most
      cases, this doesn't cause problems because the first command to be
      issued is TUR which most devices can process in the middle of burning.
      However, depending on how a device replies to TUR during burning,
      cdrom driver may end up issuing further commands.
      
      This can't be resolved trivially by moving bd_claim() before doing
      actual open() because that means an open attempt which will end up
      failing could interfere other legit O_EXCL open attempts.
      ie. unconfirmed open attempts can fail others.
      
      This patch resolves the problem by introducing claiming block which is
      started by bd_start_claiming() and terminated either by bd_claim() or
      bd_abort_claiming().  bd_claim() from inside a claiming block is
      guaranteed to succeed and once a claiming block is started, other
      bd_start_claiming() or bd_claim() attempts block till the current
      claiming block is terminated.
      
      bd_claim() can still be used standalone although now it always
      synchronizes against claiming blocks, so the existing users will keep
      working without any change.
      
      blkdev_open() and open_bdev_exclusive() are converted to use claiming
      blocks so that exclusive open attempts from these functions don't
      interfere with the existing exclusive open.
      
      This problem was discovered while investigating bko#15403.
      
        https://bugzilla.kernel.org/show_bug.cgi?id=15403
      
      The burning problem itself can be resolved by updating userspace
      probing tools to always open w/ O_EXCL.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NMatthias-Christian Ott <ott@mirix.org>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      6b4517a7
    • T
      block: factor out bd_may_claim() · 1a3cbbc5
      Tejun Heo 提交于
      Factor out bd_may_claim() from bd_claim(), add comments and apply a
      couple of cosmetic edits.  This is to prepare for further updates to
      claim path.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      1a3cbbc5
  15. 25 4月, 2010 1 次提交
    • A
      fs/block_dev.c: fix performance regression in O_DIRECT|O_SYNC writes to block devices · b8af67e2
      Anton Blanchard 提交于
      We are seeing a large regression in database performance on recent
      kernels.  The database opens a block device with O_DIRECT|O_SYNC and a
      number of threads write to different regions of the file at the same time.
      
      A simple test case is below.  I haven't defined DEVICE since getting it
      wrong will destroy your data :) On an 3 disk LVM with a 64k chunk size we
      see about 17MB/sec and only a few threads in IO wait:
      
      procs  -----io---- -system-- -----cpu------
       r  b     bi    bo   in   cs us sy id wa st
       0  3      0 16170  656 2259  0  0 86 14  0
       0  2      0 16704  695 2408  0  0 92  8  0
       0  2      0 17308  744 2653  0  0 86 14  0
       0  2      0 17933  759 2777  0  0 89 10  0
      
      Most threads are blocking in vfs_fsync_range, which has:
      
              mutex_lock(&mapping->host->i_mutex);
              err = fop->fsync(file, dentry, datasync);
              if (!ret)
                      ret = err;
              mutex_unlock(&mapping->host->i_mutex);
      
      commit 148f948b (vfs: Introduce new
      helpers for syncing after writing to O_SYNC file or IS_SYNC inode) offers
      some explanation of what is going on:
      
          Use these new helpers for syncing from generic VFS functions. This makes
          O_SYNC writes to block devices acquire i_mutex for syncing. If we really
          care about this, we can make block_fsync() drop the i_mutex and reacquire
          it before it returns.
      
      Thanks Jan for such a good commit message!  As well as dropping i_mutex,
      Christoph suggests we should remove the call to sync_blockdev():
      
      > sync_blockdev is an overcomplicated alias for filemap_write_and_wait on
      > the block device inode, which is exactly what we did just before calling
      > into ->fsync
      
      The patch below incorporates both suggestions. With it the testcase improves
      from 17MB/s to 68M/sec:
      
      procs  -----io---- -system-- -----cpu------
       r  b     bi    bo   in   cs us sy id wa st
       0  7      0 65536 1000 3878  0  0 70 30  0
       0 34      0 69632 1016 3921  0  1 46 53  0
       0 57      0 69632 1000 3921  0  0 55 45  0
       0 53      0 69640  754 4111  0  0 81 19  0
      
      Testcase:
      
      #define _GNU_SOURCE
      #include <stdio.h>
      #include <pthread.h>
      #include <unistd.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/types.h>
      #include <sys/stat.h>
      #include <fcntl.h>
      
      #define NR_THREADS 64
      #define BUFSIZE (64 * 1024)
      
      #define DEVICE "/dev/mapper/XXXXXX"
      
      #define ALIGN(VAL, SIZE) (((VAL)+(SIZE)-1) & ~((SIZE)-1))
      
      static int fd;
      
      static void *doit(void *arg)
      {
      	unsigned long offset = (long)arg;
      	char *b, *buf;
      
      	b = malloc(BUFSIZE + 1024);
      	buf = (char *)ALIGN((unsigned long)b, 1024);
      	memset(buf, 0, BUFSIZE);
      
      	while (1)
      		pwrite(fd, buf, BUFSIZE, offset);
      }
      
      int main(int argc, char *argv[])
      {
      	int flags = O_RDWR|O_DIRECT;
      	int i;
      	unsigned long offset = 0;
      
      	if (argc > 1 && !strcmp(argv[1], "O_SYNC"))
      		flags |= O_SYNC;
      
      	fd = open(DEVICE, flags);
      	if (fd == -1) {
      		perror("open");
      		exit(1);
      	}
      
      	for (i = 0; i < NR_THREADS-1; i++) {
      		pthread_t tid;
      		pthread_create(&tid, NULL, doit, (void *)offset);
      		offset += BUFSIZE;
      	}
      	doit((void *)offset);
      
      	return 0;
      }
      Signed-off-by: NAnton Blanchard <anton@samba.org>
      Acked-by: NJan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b8af67e2
  16. 07 4月, 2010 2 次提交
  17. 07 2月, 2010 1 次提交
  18. 29 10月, 2009 1 次提交
    • C
      blkdev: flush disk cache on ->fsync · ab0a9735
      Christoph Hellwig 提交于
      Currently there is no barrier support in the block device code.  That
      means we cannot guarantee any sort of data integerity when using the
      block device node with dis kwrite caches enabled.  Using the raw block
      device node is a typical use case for virtualization (and I assume
      databases, too).  This patch changes block_fsync to issue a cache flush
      and thus make fsync on block device nodes actually useful.
      
      Note that in mainline we would also need to add such code to the
      ->aio_write method for O_SYNC handling, but assuming that Jan's patch
      series for the O_SYNC rewrite goes in it will also call into ->fsync
      for 2.6.32.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      ab0a9735
  19. 26 10月, 2009 1 次提交
  20. 24 9月, 2009 2 次提交
    • C
      freeze_bdev: grab active reference to frozen superblocks · 4504230a
      Christoph Hellwig 提交于
      Currently we held s_umount while a filesystem is frozen, despite that we
      might return to userspace and unlock it from a different process.  Instead
      grab an active reference to keep the file system busy and add an explicit
      check for frozen filesystems in remount and reject the remount instead
      of blocking on s_umount.
      
      Add a new get_active_super helper to super.c for use by freeze_bdev that
      grabs an active reference to a superblock from a given block device.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      4504230a
    • C
      freeze_bdev: kill bd_mount_sem · 4fadd7bb
      Christoph Hellwig 提交于
      Now that we have the freeze count there is not much reason for bd_mount_sem
      anymore.  The actual freeze/thaw operations are serialized using the
      bd_fsfreeze_mutex, and the only other place we take bd_mount_sem is
      get_sb_bdev which tries to prevent mounting a filesystem while the block
      device is frozen.  Instead of add a check for bd_fsfreeze_count and
      return -EBUSY if a filesystem is frozen.  While that is a change in user
      visible behaviour a failing mount is much better for this case rather
      than having the mount process stuck uninterruptible for a long time.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      4fadd7bb
  21. 22 9月, 2009 1 次提交
  22. 16 9月, 2009 1 次提交
  23. 14 9月, 2009 1 次提交
  24. 30 7月, 2009 1 次提交
  25. 12 6月, 2009 1 次提交