提交 · 77ea887e433ad8389d416826936c110fa7910f80 · gsplhtlxg / clone-Linux

17 12月, 2010 1 次提交

implement in-kernel gendisk events handling · 77ea887e

由 Tejun Heo 提交于 12月 08, 2010

Currently, media presence polling for removeable block devices is done
from userland.  There are several issues with this.

* Polling is done by periodically opening the device.  For SCSI
  devices, the command sequence generated by such action involves a
  few different commands including TEST_UNIT_READY.  This behavior,
  while perfectly legal, is different from Windows which only issues
  single command, GET_EVENT_STATUS_NOTIFICATION.  Unfortunately, some
  ATAPI devices lock up after being periodically queried such command
  sequences.

* There is no reliable and unintrusive way for a userland program to
  tell whether the target device is safe for media presence polling.
  For example, polling for media presence during an on-going burning
  session can make it fail.  The polling program can avoid this by
  opening the device with O_EXCL but then it risks making a valid
  exclusive user of the device fail w/ -EBUSY.

* Userland polling is unnecessarily heavy and in-kernel implementation
  is lighter and better coordinated (workqueue, timer slack).

This patch implements framework for in-kernel disk event handling,
which includes media presence polling.

* bdops->check_events() is added, which supercedes ->media_changed().
  It should check whether there's any pending event and return if so.
  Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and
  DISK_EVENT_EJECT_REQUEST.  ->check_events() is guaranteed not to be
  called parallelly.

* gendisk->events and ->async_events are added.  These should be
  initialized by block driver before passing the device to add_disk().
  The former contains the mask of all supported events and the latter
  the mask of all events which the device can report without polling.
  /sys/block/*/events[_async] export these to userland.

* Kernel parameter block.events_dfl_poll_msecs controls the system
  polling interval (default is 0 which means disable) and
  /sys/block/*/events_poll_msecs control polling intervals for
  individual devices (default is -1 meaning use system setting).  Note
  that if a device can report all supported events asynchronously and
  its polling interval isn't explicitly set, the device won't be
  polled regardless of the system polling interval.

* If a device is opened exclusively with write access, event checking
  is automatically disabled until all write exclusive accesses are
  released.

* There are event 'clearing' events.  For example, both of currently
  defined events are cleared after the device has been successfully
  opened.  This information is passed to ->check_events() callback
  using @clearing argument as a hint.

* Event checking is always performed from system_nrt_wq and timer
  slack is set to 25% for polling.

* Nothing changes for drivers which implement ->media_changed() but
  not ->check_events().  Going forward, all drivers will be converted
  to ->check_events() and ->media_change() will be dropped.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

77ea887e

13 11月, 2010 5 次提交

block: clean up blkdev_get() wrappers and their users · d4d77629

由 Tejun Heo 提交于 11月 13, 2010

After recent blkdev_get() modifications, open_by_devnum() and
open_bdev_exclusive() are simple wrappers around blkdev_get().
Replace them with blkdev_get_by_dev() and blkdev_get_by_path().

blkdev_get_by_dev() is identical to open_by_devnum().
blkdev_get_by_path() is slightly different in that it doesn't
automatically add %FMODE_EXCL to @mode.

All users are converted.  Most conversions are mechanical and don't
introduce any behavior difference.  There are several exceptions.

* btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no
  reason to OR it explicitly on blkdev_put().

* gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in
  sb->s_mode.

* With the above changes, sb->s_mode now always should contain
  FMODE_EXCL.  WARN_ON_ONCE() added to kill_block_super() to detect
  errors.

The new blkdev_get_*() functions are with proper docbook comments.
While at it, add function description to blkdev_get() too.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Joern Engel <joern@lazybastard.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: reiserfs-devel@vger.kernel.org
Cc: xfs-masters@oss.sgi.com
Cc: Alexander Viro <viro@zeniv.linux.org.uk>

d4d77629

block: check bdev_read_only() from blkdev_get() · 75f1dc0d

由 Tejun Heo 提交于 11月 13, 2010

bdev read-only status can be queried using bdev_read_only() and may
change while the device is being opened.  Enforce it by checking it
from blkdev_get() after open succeeds.

This makes bdev_read_only() check in open_bdev_exclusive() and
fsg_lun_open() unnecessary.  Drop them.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: David Brownell <dbrownell@users.sourceforge.net>
Cc: linux-usb@vger.kernel.org

75f1dc0d

block: reorganize claim/release implementation · 6a027eff

由 Tejun Heo 提交于 11月 13, 2010

With claim/release rolled into blkdev_get/put(), there's no reason to
keep bd_abort/finish_claim(), __bd_claim() and bd_release() as
separate functions.  It only makes the code difficult to follow.
Collapse them into blkdev_get/put().  This will ease future changes
around claim/release.
Signed-off-by: NTejun Heo <tj@kernel.org>

6a027eff

block: make blkdev_get/put() handle exclusive access · e525fd89

由 Tejun Heo 提交于 11月 13, 2010

Over time, block layer has accumulated a set of APIs dealing with bdev
open, close, claim and release.

* blkdev_get/put() are the primary open and close functions.

* bd_claim/release() deal with exclusive open.

* open/close_bdev_exclusive() are combination of open and claim and
  the other way around, respectively.

* bd_link/unlink_disk_holder() to create and remove holder/slave
  symlinks.

* open_by_devnum() wraps bdget() + blkdev_get().

The interface is a bit confusing and the decoupling of open and claim
makes it impossible to properly guarantee exclusive access as
in-kernel open + claim sequence can disturb the existing exclusive
open even before the block layer knows the current open if for another
exclusive access.  Reorganize the interface such that,

* blkdev_get() is extended to include exclusive access management.
  @holder argument is added and, if is @FMODE_EXCL specified, it will
  gain exclusive access atomically w.r.t. other exclusive accesses.

* blkdev_put() is similarly extended.  It now takes @mode argument and
  if @FMODE_EXCL is set, it releases an exclusive access.  Also, when
  the last exclusive claim is released, the holder/slave symlinks are
  removed automatically.

* bd_claim/release() and close_bdev_exclusive() are no longer
  necessary and either made static or removed.

* bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
  is no longer necessary and removed.

* open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
  and blkdev_get().  It also has an unexpected extra bdev_read_only()
  test which probably should be moved into blkdev_get().

* open_by_devnum() is modified to take @holder argument and pass it to
  blkdev_get().

Most of bdev open/close operations are unified into blkdev_get/put()
and most exclusive accesses are tested atomically at the open time (as
it should).  This cleans up code and removes some, both valid and
invalid, but unnecessary all the same, corner cases.

open_bdev_exclusive() and open_by_devnum() can use further cleanup -
rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
special features.  Well, let's leave them for another day.

Most conversions are straight-forward.  drbd conversion is a bit more
involved as there was some reordering, but the logic should stay the
same.
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NNeil Brown <neilb@suse.de>
Acked-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: NMike Snitzer <snitzer@redhat.com>
Acked-by: NPhilipp Reisner <philipp.reisner@linbit.com>
Cc: Peter Osterlund <petero2@telia.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Alex Elder <aelder@sgi.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: dm-devel@redhat.com
Cc: drbd-dev@lists.linbit.com
Cc: Leo Chen <leochen@broadcom.com>
Cc: Scott Branden <sbranden@broadcom.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Joern Engel <joern@logfs.org>
Cc: reiserfs-devel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>

e525fd89

block: simplify holder symlink handling · e09b457b

由 Tejun Heo 提交于 11月 13, 2010

Code to manage symlinks in /sys/block/*/{holders|slaves} are overly
complex with multiple holder considerations, redundant extra
references to all involved kobjects, unused generic kobject holder
support and unnecessary mixup with bd_claim/release functionalities.

Strip it down to what's necessary (single gendisk holder) and make it
use a separate interface.  This is a step for cleaning up
bd_claim/release.  This patch makes dm-table slightly more complex but
it will be simplified again with further changes.
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NNeil Brown <neilb@suse.de>
Acked-by: NMike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com

e09b457b

29 10月, 2010 1 次提交
- A
  convert get_sb_pseudo() users · 51139ada
  由 Al Viro 提交于 7月 25, 2010
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  51139ada
26 10月, 2010 3 次提交

fs: inode split IO and LRU lists · 7ccf19a8

由 Nick Piggin 提交于 10月 21, 2010

The use of the same inode list structure (inode->i_list) for two
different list constructs with different lifecycles and purposes
makes it impossible to separate the locking of the different
operations. Therefore, to enable the separation of the locking of
the writeback and reclaim lists, split the inode->i_list into two
separate lists dedicated to their specific tracking functions.
Signed-off-by: NNick Piggin <npiggin@suse.de>
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

7ccf19a8

fs: switch bdev inode bdi's correctly · a5491e0c

由 Dave Chinner 提交于 10月 21, 2010

bdev inodes can remain dirty even after their last close. Hence the
BDI associated with the bdev->inode gets modified duringthe last
close to point to the default BDI. However, the bdev inode still
needs to be moved to the dirty lists of the new BDI, otherwise it
will corrupt the writeback list is was left on.

Add a new function bdev_inode_switch_bdi() to move all the bdi state
from the old bdi to the new one safely. This is only a temporary
measure until the bdev inode<->bdi lifecycle problems are sorted
out.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

a5491e0c

new helper: ihold() · 7de9c6ee

由 Al Viro 提交于 10月 23, 2010

Clones an existing reference to inode; caller must already hold one.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

7de9c6ee

17 9月, 2010 1 次提交

block: remove BLKDEV_IFL_WAIT · dd3932ed

由 Christoph Hellwig 提交于 9月 16, 2010

All the blkdev_issue_* helpers can only sanely be used for synchronous
caller. To issue cache flushes or barriers asynchronously the caller needs
to set up a bio by itself with a completion callback to move the asynchronous
state machine ahead. So drop the BLKDEV_IFL_WAIT flag that is always
specified when calling blkdev_issue_* and also remove the now unused flags
argument to blkdev_issue_flush and blkdev_issue_zeroout. For
blkdev_issue_discard we need to keep it for the secure discard flag, which
gains a more descriptive name and loses the bitops vs flag confusion.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

dd3932ed

11 8月, 2010 1 次提交

blkdev: cgroup whitelist permission fix · b7300b78

由 Chris Wright 提交于 8月 10, 2010

The cgroup device whitelist code gets confused when trying to grant
permission to a disk partition that is not currently open.  Part of
blkdev_open() includes __blkdev_get() on the whole disk.

Basically, the only ways to reliably allow a cgroup access to a partition
on a block device when using the whitelist are to 1) also give it access
to the whole block device or 2) make sure the partition is already open in
a different context.

The patch avoids the cgroup check for the whole disk case when opening a
partition.

Addresses https://bugzilla.redhat.com/show_bug.cgi?id=589662Signed-off-by: NChris Wright <chrisw@sous-sol.org>
Acked-by: NSerge E. Hallyn <serue@us.ibm.com>
Tested-by: NSerge E. Hallyn <serue@us.ibm.com>
Reported-by: NVivek Goyal <vgoyal@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: "Daniel P. Berrange" <berrange@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b7300b78

10 8月, 2010 3 次提交

A
convert remaining ->clear_inode() to ->evict_inode() · b57922d9
由 Al Viro 提交于 6月 07, 2010
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
b57922d9

get rid of block_write_begin_newtrunc · 155130a4

由 Christoph Hellwig 提交于 6月 04, 2010

Move the call to vmtruncate to get rid of accessive blocks to the callers
in preparation of the new truncate sequence and rename the non-truncating
version to block_write_begin.

While we're at it also remove several unused arguments to block_write_begin.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

155130a4

sort out blockdev_direct_IO variants · eafdc7d1

由 Christoph Hellwig 提交于 6月 04, 2010

Move the call to vmtruncate to get rid of accessive blocks to the callers
in prepearation of the new truncate calling sequence. This was only done
for DIO_LOCKING filesystems, so the __blockdev_direct_IO_newtrunc variant
was not needed anyway. Get rid of blockdev_direct_IO_no_locking and
its _newtrunc variant while at it as just opencoding the two additional
paramters is shorted than the name suffix.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

eafdc7d1

08 8月, 2010 1 次提交

block: push down BKL into .open and .release · 6e9624b8

由 Arnd Bergmann 提交于 8月 07, 2010

The open and release block_device_operations are currently
called with the BKL held. In order to change that, we must
first make sure that all drivers that currently rely
on this have no regressions.

This blindly pushes the BKL into all .open and .release
operations for all block drivers to prepare for the
next step. The drivers can subsequently replace the BKL
with their own locks or remove it completely when it can
be shown that it is not needed.

The functions blkdev_get and blkdev_put are the only
remaining users of the big kernel lock in the block
layer, besides a few uses in the ioctl code, none
of which need to serialize with blkdev_{get,put}.

Most of these two functions is also under the protection
of bdev->bd_mutex, including the actual calls to
->open and ->release, and the common code does not
access any global data structures that need the BKL.
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Acked-by: NChristoph Hellwig <hch@infradead.org>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

6e9624b8

05 8月, 2010 1 次提交

block_dev: always serialize exclusive open attempts · e75aa858

由 Tejun Heo 提交于 8月 04, 2010

bd_prepare_to_claim() incorrectly allowed multiple attempts for
exclusive open to progress in parallel if the attempting holders are
identical.  This triggered BUG_ON() as reported in the following bug.

  https://bugzilla.kernel.org/show_bug.cgi?id=16393

__bd_abort_claiming() is used to finish claiming blocks and doesn't
work if multiple openers are inside a claiming block.  Allowing
multiple parallel open attempts to continue doesn't gain anything as
those are serialized down in the call chain anyway.  Fix it by always
allowing only single open attempt in a claiming block.

This problem can easily be reproduced by adding a delay after
bd_prepare_to_claim() and attempting to mount two partitions of a
disk.

stable: only applicable to v2.6.35
Signed-off-by: NTejun Heo <tj@kernel.org>
Reported-by: NMarkus Trippelsdorf <markus@trippelsdorf.de>
Cc: stable@kernel.org
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e75aa858

11 6月, 2010 3 次提交

block: remove duplicate BUG_ON() in bd_finish_claiming() · 3e6c0505

由 Jens Axboe 提交于 6月 07, 2010

We do the same BUG_ON() just a line later when calling into
__bd_abort_claiming().
Reported-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

3e6c0505

block: bd_start_claiming cleanup · b0018361

由 Nick Piggin 提交于 5月 26, 2010

I don't like the subtle multi-context code in bd_claim (ie.  detects where it
has been called based on bd_claiming). It seems clearer to just require a new
function to finish a 2-part claim.

Also improve commentary in bd_start_claiming as to how it should
be used.
Signed-off-by: NNick Piggin <npiggin@suse.de>
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

b0018361

block: bd_start_claiming fix module refcount · cf342570

由 Nick Piggin 提交于 5月 26, 2010

bd_start_claiming has an unbalanced module_put introduced in 6b4517a7.
Signed-off-by: NNick Piggin <npiggin@suse.de>
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

cf342570

28 5月, 2010 2 次提交

fs: convert simple fs to new truncate · 3322e79a

由 Nick Piggin 提交于 5月 27, 2010

Convert simple filesystems: ramfs, configfs, sysfs, block_dev to new truncate
sequence.

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: NNick Piggin <npiggin@suse.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

3322e79a

drop unused dentry argument to ->fsync · 7ea80859

由 Christoph Hellwig 提交于 5月 26, 2010

Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

7ea80859

22 5月, 2010 2 次提交

Introduce freeze_super and thaw_super for the fsfreeze ioctl · 18e9e510

由 Josef Bacik 提交于 3月 23, 2010

Currently the way we do freezing is by passing sb>s_bdev to freeze_bdev and then
letting it do all the work. But freezing is more of an fs thing, and doesn't
really have much to do with the bdev at all, all the work gets done with the
super. In btrfs we do not populate s_bdev, since we can have multiple bdev's
for one fs and setting s_bdev makes removing devices from a pool kind of tricky.
This means that freezing a btrfs filesystem fails, which causes us to corrupt
with things like tux-on-ice which use the fsfreeze mechanism. So instead of
populating sb->s_bdev with a random bdev in our pool, I've broken the actual fs
freezing stuff into freeze_super and thaw_super. These just take the
super_block that we're freezing and does the appropriate work. It's basically
just copy and pasted from freeze_bdev. I've then converted freeze_bdev over to
use the new super helpers. I've tested this with ext4 and btrfs and verified
everything continues to work the same as before.

The only new gotcha is multiple calls to the fsfreeze ioctl will return EBUSY if
the fs is already frozen. I thought this was a better solution than adding a
freeze counter to the super_block, but if everybody hates this idea I'm open to
suggestions. Thanks,
Signed-off-by: NJosef Bacik <josef@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

18e9e510

A
Move grabbing s_umount to callers of grab_super() · d3f21473
由 Al Viro 提交于 3月 23, 2010
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
d3f21473

29 4月, 2010 1 次提交

blkdev: generalize flags for blkdev_issue_fn functions · fbd9b09a

由 Dmitry Monakhov 提交于 4月 28, 2010

The patch just convert all blkdev_issue_xxx function to common
set of flags. Wait/allocation semantics preserved.
Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

fbd9b09a

27 4月, 2010 2 次提交

block: implement bd_claiming and claiming block · 6b4517a7

由 Tejun Heo 提交于 4月 07, 2010

Currently, device claiming for exclusive open is done after low level
open - disk->fops->open() - has completed successfully.  This means
that exclusive open attempts while a device is already exclusively
open will fail only after disk->fops->open() is called.

cdrom driver issues commands during open() which means that O_EXCL
open attempt can unintentionally inject commands to in-progress
command stream for burning thus disturbing burning process.  In most
cases, this doesn't cause problems because the first command to be
issued is TUR which most devices can process in the middle of burning.
However, depending on how a device replies to TUR during burning,
cdrom driver may end up issuing further commands.

This can't be resolved trivially by moving bd_claim() before doing
actual open() because that means an open attempt which will end up
failing could interfere other legit O_EXCL open attempts.
ie. unconfirmed open attempts can fail others.

This patch resolves the problem by introducing claiming block which is
started by bd_start_claiming() and terminated either by bd_claim() or
bd_abort_claiming().  bd_claim() from inside a claiming block is
guaranteed to succeed and once a claiming block is started, other
bd_start_claiming() or bd_claim() attempts block till the current
claiming block is terminated.

bd_claim() can still be used standalone although now it always
synchronizes against claiming blocks, so the existing users will keep
working without any change.

blkdev_open() and open_bdev_exclusive() are converted to use claiming
blocks so that exclusive open attempts from these functions don't
interfere with the existing exclusive open.

This problem was discovered while investigating bko#15403.

  https://bugzilla.kernel.org/show_bug.cgi?id=15403

The burning problem itself can be resolved by updating userspace
probing tools to always open w/ O_EXCL.
Signed-off-by: NTejun Heo <tj@kernel.org>
Reported-by: NMatthias-Christian Ott <ott@mirix.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

6b4517a7

block: factor out bd_may_claim() · 1a3cbbc5

由 Tejun Heo 提交于 4月 07, 2010

Factor out bd_may_claim() from bd_claim(), add comments and apply a
couple of cosmetic edits.  This is to prepare for further updates to
claim path.
Signed-off-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

1a3cbbc5

25 4月, 2010 1 次提交

fs/block_dev.c: fix performance regression in O_DIRECT|O_SYNC writes to block devices · b8af67e2

由 Anton Blanchard 提交于 4月 23, 2010

We are seeing a large regression in database performance on recent
kernels.  The database opens a block device with O_DIRECT|O_SYNC and a
number of threads write to different regions of the file at the same time.

A simple test case is below.  I haven't defined DEVICE since getting it
wrong will destroy your data :) On an 3 disk LVM with a 64k chunk size we
see about 17MB/sec and only a few threads in IO wait:

procs  -----io---- -system-- -----cpu------
 r  b     bi    bo   in   cs us sy id wa st
 0  3      0 16170  656 2259  0  0 86 14  0
 0  2      0 16704  695 2408  0  0 92  8  0
 0  2      0 17308  744 2653  0  0 86 14  0
 0  2      0 17933  759 2777  0  0 89 10  0

Most threads are blocking in vfs_fsync_range, which has:

        mutex_lock(&mapping->host->i_mutex);
        err = fop->fsync(file, dentry, datasync);
        if (!ret)
                ret = err;
        mutex_unlock(&mapping->host->i_mutex);

commit 148f948b (vfs: Introduce new
helpers for syncing after writing to O_SYNC file or IS_SYNC inode) offers
some explanation of what is going on:

    Use these new helpers for syncing from generic VFS functions. This makes
    O_SYNC writes to block devices acquire i_mutex for syncing. If we really
    care about this, we can make block_fsync() drop the i_mutex and reacquire
    it before it returns.

Thanks Jan for such a good commit message!  As well as dropping i_mutex,
Christoph suggests we should remove the call to sync_blockdev():

> sync_blockdev is an overcomplicated alias for filemap_write_and_wait on
> the block device inode, which is exactly what we did just before calling
> into ->fsync

The patch below incorporates both suggestions. With it the testcase improves
from 17MB/s to 68M/sec:

procs  -----io---- -system-- -----cpu------
 r  b     bi    bo   in   cs us sy id wa st
 0  7      0 65536 1000 3878  0  0 70 30  0
 0 34      0 69632 1016 3921  0  1 46 53  0
 0 57      0 69632 1000 3921  0  0 55 45  0
 0 53      0 69640  754 4111  0  0 81 19  0

Testcase:

#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

#define NR_THREADS 64
#define BUFSIZE (64 * 1024)

#define DEVICE "/dev/mapper/XXXXXX"

#define ALIGN(VAL, SIZE) (((VAL)+(SIZE)-1) & ~((SIZE)-1))

static int fd;

static void *doit(void *arg)
{
	unsigned long offset = (long)arg;
	char *b, *buf;

	b = malloc(BUFSIZE + 1024);
	buf = (char *)ALIGN((unsigned long)b, 1024);
	memset(buf, 0, BUFSIZE);

	while (1)
		pwrite(fd, buf, BUFSIZE, offset);
}

int main(int argc, char *argv[])
{
	int flags = O_RDWR|O_DIRECT;
	int i;
	unsigned long offset = 0;

	if (argc > 1 && !strcmp(argv[1], "O_SYNC"))
		flags |= O_SYNC;

	fd = open(DEVICE, flags);
	if (fd == -1) {
		perror("open");
		exit(1);
	}

	for (i = 0; i < NR_THREADS-1; i++) {
		pthread_t tid;
		pthread_create(&tid, NULL, doit, (void *)offset);
		offset += BUFSIZE;
	}
	doit((void *)offset);

	return 0;
}
Signed-off-by: NAnton Blanchard <anton@samba.org>
Acked-by: NJan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b8af67e2

07 4月, 2010 2 次提交

vfs: rename block_fsync() to blkdev_fsync() · b1dd3b28

由 Andrew Morton 提交于 4月 06, 2010

Requested by hch, for consistency now it is exported.

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Anton Blanchard <anton@samba.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b1dd3b28

raw: fsync method is now required · 55ab3a1f

由 Anton Blanchard 提交于 4月 06, 2010

Commit 148f948b (vfs: Introduce new
helpers for syncing after writing to O_SYNC file or IS_SYNC inode) broke
the raw driver.

We now call through generic_file_aio_write -> generic_write_sync ->
vfs_fsync_range.  vfs_fsync_range has:

        if (!fop || !fop->fsync) {
                ret = -EINVAL;
                goto out;
        }

But drivers/char/raw.c doesn't set an fsync method.

We have two options: fix it or remove the raw driver completely.  I'm
happy to do either, the fact this has been broken for so long suggests it
is rarely used.

The patch below adds an fsync method to the raw driver.  My knowledge of
the block layer is pretty sketchy so this could do with a once over.

If we instead decide to remove the raw driver, this patch might still be
useful as a backport to 2.6.33 and 2.6.32.
Signed-off-by: NAnton Blanchard <anton@samba.org>
Reviewed-by: NJan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <jens.axboe@oracle.com>
Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
Tested-by: NJeff Moyer <jmoyer@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

55ab3a1f

07 2月, 2010 1 次提交

freeze_bdev: don't deactivate successfully frozen MS_RDONLY sb · 4b06e5b9

由 Jun'ichi Nomura 提交于 1月 29, 2010

Thanks Thomas and Christoph for testing and review.
I removed 'smp_wmb()' before up_write from the previous patch,
since up_write() should have necessary ordering constraints.
(I.e. the change of s_frozen is visible to others after up_write)
I'm quite sure the change is harmless but if you are uncomfortable
with Tested-by/Reviewed-by on the modified patch, please remove them.

If MS_RDONLY, freeze_bdev should just up_write(s_umount) instead of
deactivate_locked_super().
Also, keep sb->s_frozen consistent so that remount can check the frozen state.

Otherwise a crash reported here can happen:
http://lkml.org/lkml/2010/1/16/37
http://lkml.org/lkml/2010/1/28/53

This patch should be applied for 2.6.32 stable series, too.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Tested-by: NThomas Backlund <tmb@mandriva.org>
Signed-off-by: NJun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: stable@kernel.org
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

4b06e5b9

29 10月, 2009 1 次提交

blkdev: flush disk cache on ->fsync · ab0a9735

由 Christoph Hellwig 提交于 10月 29, 2009

Currently there is no barrier support in the block device code.  That
means we cannot guarantee any sort of data integerity when using the
block device node with dis kwrite caches enabled.  Using the raw block
device node is a typical use case for virtualization (and I assume
databases, too).  This patch changes block_fsync to issue a cache flush
and thus make fsync on block device nodes actually useful.

Note that in mainline we would also need to add such code to the
->aio_write method for O_SYNC handling, but assuming that Jan's patch
series for the O_SYNC rewrite goes in it will also call into ->fsync
for 2.6.32.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

ab0a9735

26 10月, 2009 1 次提交

block: use after free bug in __blkdev_get · 960cc0f4

由 Neil Brown 提交于 10月 26, 2009

commit 0762b8bd
(from 14 months ago) introduced a use-after-free bug which has just
recently started manifesting in my md testing.
I tried git bisect to find out what caused the bug to start
manifesting, and it could have been the recent change to
blk_unregister_queue (48c0d4d4) but the results were inconclusive.

This patch certainly fixes my symptoms and looks correct as the two
calls are now in the same order as elsewhere in that function.
Signed-off-by: NNeilBrown <neilb@suse.de>
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

960cc0f4

24 9月, 2009 2 次提交

freeze_bdev: grab active reference to frozen superblocks · 4504230a

由 Christoph Hellwig 提交于 8月 03, 2009

Currently we held s_umount while a filesystem is frozen, despite that we
might return to userspace and unlock it from a different process.  Instead
grab an active reference to keep the file system busy and add an explicit
check for frozen filesystems in remount and reject the remount instead
of blocking on s_umount.

Add a new get_active_super helper to super.c for use by freeze_bdev that
grabs an active reference to a superblock from a given block device.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

4504230a

freeze_bdev: kill bd_mount_sem · 4fadd7bb

由 Christoph Hellwig 提交于 8月 03, 2009

Now that we have the freeze count there is not much reason for bd_mount_sem
anymore. The actual freeze/thaw operations are serialized using the
bd_fsfreeze_mutex, and the only other place we take bd_mount_sem is
get_sb_bdev which tries to prevent mounting a filesystem while the block
device is frozen. Instead of add a check for bd_fsfreeze_count and
return -EBUSY if a filesystem is frozen. While that is a change in user
visible behaviour a failing mount is much better for this case rather
than having the mount process stuck uninterruptible for a long time.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

4fadd7bb

22 9月, 2009 1 次提交

const: make block_device_operations const · 83d5cde4

由 Alexey Dobriyan 提交于 9月 21, 2009

Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

83d5cde4

16 9月, 2009 1 次提交

fs: remove bdev->bd_inode_backing_dev_info · 2c96ce9f

由 Jens Axboe 提交于 9月 15, 2009

It has been unused since it was introduced in:

commit 520808bf20e90fdbdb320264ba7dd5cf9d47dcac
Author: Andrew Morton <akpm@osdl.org>
Date:   Fri May 21 00:46:17 2004 -0700

    [PATCH] block device layer: separate backing_dev_info infrastructure

So lets just kill it.
Acked-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

2c96ce9f

14 9月, 2009 1 次提交

vfs: Rename generic_file_aio_write_nolock · eef99380

由 Christoph Hellwig 提交于 8月 20, 2009

generic_file_aio_write_nolock() is now used only by block devices and raw
character device. Filesystems should use __generic_file_aio_write() in case
generic_file_aio_write() doesn't suit them. So rename the function to
blkdev_aio_write() and move it to fs/blockdev.c.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJan Kara <jack@suse.cz>

eef99380

30 7月, 2009 1 次提交

PM / Hibernate: Replace bdget call with simple atomic_inc of i_count · dddac6a7

由 Alan Jenkins 提交于 7月 29, 2009

Create bdgrab(). This function copies an existing reference to a
block_device. It is safe to call from any context.

Hibernation code wishes to copy a reference to the active swap device.
Right now it calls bdget() under a spinlock, but this is wrong because
bdget() can sleep. It doesn't need a full bdget() because we already
hold a reference to active swap devices (and the spinlock protects
against swapoff).

Fixes http://bugzilla.kernel.org/show_bug.cgi?id=13827Signed-off-by: NAlan Jenkins <alan-jenkins@tuffmail.co.uk>
Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>

dddac6a7

12 6月, 2009 1 次提交

vfs: Rename fsync_super() to sync_filesystem() (version 4) · 60b0680f

由 Jan Kara 提交于 4月 27, 2009

Rename the function so that it better describe what it really does. Also
remove the unnecessary include of buffer_head.h.
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

60b0680f