- 31 10月, 2014 1 次提交
-
-
由 David Jeffery 提交于
Author: David Jeffery <djeffery@redhat.com> Changes to the basic direct I/O code have broken the raw driver when reading to the end of a raw device. Instead of returning a short read for a read that extends partially beyond the device's end or 0 when at the end of the device, these reads now return EIO. The raw driver needs the same end of device handling as was added for normal block devices. Using blkdev_read_iter, which has the needed size checks, prevents the EIO conditions at the end of the device. Signed-off-by: NDavid Jeffery <djeffery@redhat.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 10 10月, 2014 1 次提交
-
-
由 Akinobu Mita 提交于
Sequential read from a block device is expected to be equal or faster than from the file on a filesystem. But it is not correct due to the lack of effective readpages() in the address space operations for block device. This implements readpages() operation for block device by using mpage_readpages() which can create multipage BIOs instead of BIOs for each page and reduce system CPU time consumption. Install 1GB of RAM disk storage: # modprobe scsi_debug dev_size_mb=1024 delay=0 Sequential read from file on a filesystem: # mkfs.ext4 /dev/$DEV # mount /dev/$DEV /mnt # fio --name=t --size=512m --rw=read --filename=/mnt/file ... read : io=524288KB, bw=2133.4MB/s, iops=546133, runt= 240msec Sequential read from a block device: # fio --name=t --size=512m --rw=read --filename=/dev/$DEV ... (Without this commit) read : io=524288KB, bw=1700.2MB/s, iops=435455, runt= 301msec (With this commit) read : io=524288KB, bw=2160.4MB/s, iops=553046, runt= 237msec Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 09 9月, 2014 2 次提交
-
-
由 Tejun Heo 提交于
A block_device may be attached to different gendisks and thus different bdis over time. bdev_inode_switch_bdi() is used to switch the associated bdi. The function assumes that the inode could be dirty and transfers it between bdis if so. This is a bit nasty in that it reaches into bdi internals. This patch reimplements the function so that it writes out the inode if dirty. This is a lot simpler and can be implemented without exposing bdi internals. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Tejun Heo 提交于
bdev_get_queue() returns the request_queue associated with the specified block_device. blk_get_backing_dev_info() makes use of bdev_get_queue() to determine the associated bdi given a block_device. All the callers of bdev_get_queue() including blk_get_backing_dev_info() assume that bdev_get_queue() may return NULL and implement NULL handling; however, bdev_get_queue() requires the passed in block_device is opened and attached to its gendisk. Because an active gendisk always has a valid request_queue associated with it, bdev_get_queue() can never return NULL and neither can blk_get_backing_dev_info(). Make it clear that neither of the two functions can return NULL and remove NULL handling from all the callers. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 12 6月, 2014 1 次提交
-
-
由 Al Viro 提交于
iter_file_splice_write() - a ->splice_write() instance that gathers the pipe buffers, builds a bio_vec-based iov_iter covering those and feeds it to ->write_iter(). A bunch of simple cases coverted to that... [AV: fixed the braino spotted by Cyrill] Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 05 6月, 2014 1 次提交
-
-
由 Matthew Wilcox 提交于
A block device driver may choose to provide a rw_page operation. These will be called when the filesystem is attempting to do page sized I/O to page cache pages (ie not for direct I/O). This does preclude I/Os that are larger than page size, so this may only be a performance gain for some devices. Signed-off-by: NMatthew Wilcox <matthew.r.wilcox@intel.com> Tested-by: NDheeraj Reddy <dheeraj.reddy@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 07 5月, 2014 5 次提交
-
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
unmodified, for now Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 04 4月, 2014 2 次提交
-
-
由 Johannes Weiner 提交于
Reclaim will be leaving shadow entries in the page cache radix tree upon evicting the real page. As those pages are found from the LRU, an iput() can lead to the inode being freed concurrently. At this point, reclaim must no longer install shadow pages because the inode freeing code needs to ensure the page tree is really empty. Add an address_space flag, AS_EXITING, that the inode freeing code sets under the tree lock before doing the final truncate. Reclaim will check for this flag before installing shadow pages. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NRik van Riel <riel@redhat.com> Reviewed-by: NMinchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dan Carpenter 提交于
We know that "ret > 0" is true here. These tests were left over from commit 02afc27f ('direct-io: Handle O_(D)SYNC AIO') and aren't needed any more. Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 02 4月, 2014 1 次提交
-
-
由 Al Viro 提交于
It's always equal to &iocb->ki_pos, where iocb is the value of the 1st argument. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 04 9月, 2013 1 次提交
-
-
由 Christoph Hellwig 提交于
Call generic_write_sync() from the deferred I/O completion handler if O_DSYNC is set for a write request. Also make sure various callers don't call generic_write_sync if the direct I/O code returns -EIOCBQUEUED. Based on an earlier patch from Jan Kara <jack@suse.cz> with updates from Jeff Moyer <jmoyer@redhat.com> and Darrick J. Wong <darrick.wong@oracle.com>. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 30 7月, 2013 1 次提交
-
-
由 Kent Overstreet 提交于
This code doesn't serve any purpose anymore, since the aio retry infrastructure has been removed. This change should be safe because aio_read/write are also used for synchronous IO, and called from do_sync_read()/do_sync_write() - and there's no looping done in the sync case (the read and write syscalls). Signed-off-by: NKent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org>
-
- 09 7月, 2013 1 次提交
-
-
由 Jan Kara 提交于
It is very likely that block device inode will be part of BDI dirty list as well. However it doesn't make sence to sort inodes on the b_io list just because of this inode (as it contains buffers all over the device anyway). So save some CPU cycles which is valuable since we hold relatively contented wb->list_lock. Signed-off-by: NJan Kara <jack@suse.cz>
-
- 04 7月, 2013 1 次提交
-
-
由 Mel Gorman 提交于
Page reclaim keeps track of dirty and under writeback pages and uses it to determine if wait_iff_congested() should stall or if kswapd should begin writing back pages. This fails to account for buffer pages that can be under writeback but not PageWriteback which is the case for filesystems like ext3 ordered mode. Furthermore, PageDirty buffer pages can have all the buffers clean and writepage does no IO so it should not be accounted as congested. This patch adds an address_space operation that filesystems may optionally use to check if a page is really dirty or really under writeback. An implementation is provided for for buffer_heads is added and used for block operations and ext3 in ordered mode. By default the page flags are obeyed. Credit goes to Jan Kara for identifying that the page flags alone are not sufficient for ext3 and sanity checking a number of ideas on how the problem could be addressed. Signed-off-by: NMel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Zlatko Calusic <zcalusic@bitsync.net> Cc: dormando <dormando@rydia.net> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 29 6月, 2013 1 次提交
-
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 28 6月, 2013 1 次提交
-
-
由 Jan Kara 提交于
Code in blkdev.c moves a device inode to default_backing_dev_info when the last reference to the device is put and moves the device inode back to its bdi when the first reference is acquired. This includes moving to wb.b_dirty list if the device inode is dirty. The code however doesn't setup timer to wake corresponding flusher thread and while wb.b_dirty list is non-empty __mark_inode_dirty() will not set it up either. Thus periodic writeback is effectively disabled until a sync(2) call which can lead to unexpected data loss in case of crash or power failure. Fix the problem by setting up a timer for periodic writeback in case we add the first dirty inode to wb.b_dirty list in bdev_inode_switch_bdi(). Reported-by: NBert De Jonghe <Bert.DeJonghe@amplidata.com> CC: stable@vger.kernel.org # >= 3.0 Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 08 5月, 2013 1 次提交
-
-
由 Kent Overstreet 提交于
Faster kernel compiles by way of fewer unnecessary includes. [akpm@linux-foundation.org: fix fallout] [akpm@linux-foundation.org: fix build] Signed-off-by: NKent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: N"Theodore Ts'o" <tytso@mit.edu> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 07 5月, 2013 2 次提交
-
-
由 Al Viro 提交于
same story as with the previous patches - note that return value of blkdev_close() is lost, since there's nowhere the caller (__fput()) could return it to. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
The value passed is 0 in all but "it can never happen" cases (and those only in a couple of drivers) *and* it would've been lost on the way out anyway, even if something tried to pass something meaningful. Just don't bother. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 01 5月, 2013 1 次提交
-
-
由 Yan Hong 提交于
Its only caller evict() has promised a non-NULL inode->i_bdev. Signed-off-by: NYan Hong <clouds.yan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 30 4月, 2013 1 次提交
-
-
由 Gu Zheng 提交于
blkdev_aio_read() test 'size' to see if it is equal or greater than the target count we request(iocb->ki_left). If so there is no need to call iov_shorten() to reduce number of segments and the iovec's length. So the judgement should be changed to 'if (size < iocb->ki_left)' instead. Signed-off-by: NJianpeng Ma <majianpeng@gmail.com> Signed-off-by: NGu Zheng <guz.fnst@cn.fujitsu.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 02 4月, 2013 1 次提交
-
-
由 Anatol Pomozov 提交于
struct block_device lifecycle is defined by its inode (see fs/block_dev.c) - block_device allocated first time we access /dev/loopXX and deallocated on bdev_destroy_inode. When we create the device "losetup /dev/loopXX afile" we want that block_device stay alive until we destroy the loop device with "losetup -d". But because we do not hold /dev/loopXX inode its counter goes 0, and inode/bdev can be destroyed at any moment. Usually it happens at memory pressure or when user drops inode cache (like in the test below). When later in loop_clr_fd() we want to use bdev we have use-after-free error with following stack: BUG: unable to handle kernel NULL pointer dereference at 0000000000000280 bd_set_size+0x10/0xa0 loop_clr_fd+0x1f8/0x420 [loop] lo_ioctl+0x200/0x7e0 [loop] lo_compat_ioctl+0x47/0xe0 [loop] compat_blkdev_ioctl+0x341/0x1290 do_filp_open+0x42/0xa0 compat_sys_ioctl+0xc1/0xf20 do_sys_open+0x16e/0x1d0 sysenter_dispatch+0x7/0x1a To prevent use-after-free we need to grab the device in loop_set_fd() and put it later in loop_clr_fd(). The issue is reprodusible on current Linus head and v3.3. Here is the test: dd if=/dev/zero of=loop.file bs=1M count=1 while [ true ]; do losetup /dev/loop0 loop.file echo 2 > /proc/sys/vm/drop_caches losetup -d /dev/loop0 done [ Doing bdgrab/bput in loop_set_fd/loop_clr_fd is safe, because every time we call loop_set_fd() we check that loop_device->lo_state is Lo_unbound and set it to Lo_bound If somebody will try to set_fd again it will get EBUSY. And if we try to loop_clr_fd() on unbound loop device we'll get ENXIO. loop_set_fd/loop_clr_fd (and any other loop ioctl) is called under loop_device->lo_ctl_mutex. ] Signed-off-by: NAnatol Pomozov <anatol.pomozov@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 23 2月, 2013 1 次提交
-
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 22 2月, 2013 3 次提交
-
-
由 Guo Chao 提交于
bd_openers is stable under bd_mutex, no need to check it twice. Signed-off-by: NGuo Chao <yan@linux.vnet.ibm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: M. Hindess <hindessm@uk.ibm.com> Cc: Nikanth Karthikesan <knikanth@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Guo Chao 提交于
blkdev_ioctl(GETBLKSIZE) uses i_size_read() to read size of block device. If we update block size directly, reader may see intermediate result in some machines and configurations. Use i_size_write() instead. Signed-off-by: NGuo Chao <yan@linux.vnet.ibm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: M. Hindess <hindessm@uk.ibm.com> Cc: Nikanth Karthikesan <knikanth@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 MITSUNARI Shigeo 提交于
We found that bdev->bd_invalidated was left set once revalidate_disk() is called, which results in page cache flush every time that device is open. Specifically, we found this problem in MD block device. Once we resize a MD device, mdadm --monitor periodically flush all page cache for that device every 60 or 1000 seconds when it opens the device. This bug lies since at least 3.2.0 till the latest kernel(3.6.2). Patch is attached. The following steps will reproduce the problem. 1. prepair a block device (eg /dev/sdb). 2. create two partitions: sudo parted /dev/sdb mklabel gpt mkpart primary 0% 50% mkpart primary 50% 100% 3. create a md device. sudo mdadm -C /dev/md/hoge -l 1 -n 2 -e 1.2 --assume-clean --auto=md --symlink=no /dev/sdb1 /dev/sdb2 4. create file system and mount it sudo mkfs.ext3 /dev/md/hoge sudo mkdir /mnt/test sudo mount /dev/md/hoge /mnt/test 5. try to resize the device sudo mdadm -G /dev/md/hoge --size=max 6. create a file to fill file cache. sudo dd if=/dev/urandom of=/mnt/test/data bs=1M count=10 and verify the current status of file by free command. 7. mdadm monitor will open the md device every 1000 seconds and you will find all file cache on the device are cleared. The timing can be reduced by the following steps. a) kill mdadm and restart it with --delay option /sbin/mdadm --monitor --delay=30 --pid-file /var/run/mdadm/monitor.pid --daemonise --scan --syslog or open the md device directly. sudo dd if=/dev/md/hoge of=/dev/null bs=4096 count=1 Signed-off-by: NMITSUNARI Shigeo <herumi@nifty.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 18 12月, 2012 1 次提交
-
-
由 Andrew Morton 提交于
But the kernel decided to call it "origin" instead. Fix most of the sites. Acked-by: NHugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 09 12月, 2012 1 次提交
-
-
由 Linus Torvalds 提交于
The direct-IO write path already had the i_size checks in mm/filemap.c, but it turns out the read path did not, and removing the block size checks in fs/block_dev.c (commit bbec0270: "blkdev_max_block: make private to fs/buffer.c") removed the magic "shrink IO to past the end of the device" code there. Fix it by truncating the IO to the size of the block device, like the write path already does. NOTE! I suspect the write path would be *much* better off doing it this way in fs/block_dev.c, rather than hidden deep in mm/filemap.c. The mm/filemap.c code is extremely hard to follow, and has various conditionals on the target being a block device (ie the flag passed in to 'generic_write_checks()', along with a conditional update of the inode timestamp etc). It is also quite possible that we should treat this whole block device size as a "s_maxbytes" issue, and try to make the logic even more generic. However, in the meantime this is the fairly minimal targeted fix. Noted by Milan Broz thanks to a regression test for the cryptsetup reencrypt tool. Reported-and-tested-by: NMilan Broz <mbroz@redhat.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 30 11月, 2012 2 次提交
-
-
由 Linus Torvalds 提交于
We really don't want to look at the block size for the raw block device accesses in fs/block-dev.c, because it may be changing from under us. So get rid of the max_block logic entirely, since the caller should already have done it anyway. That leaves the only user of this function in fs/buffer.c, so move the whole function there and make it static. Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Linus Torvalds 提交于
This reverts the block-device direct access code to the previous unlocked code, now that fs/buffer.c no longer needs external locking. With this, fs/block_dev.c is back to the original version, apart from a whitespace cleanup that I didn't want to revert. Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 29 10月, 2012 1 次提交
-
-
由 Mikulas Patocka 提交于
Functions generic_file_splice_read and generic_file_splice_write access the pagecache directly. For block devices these functions must be locked so that block size is not changed while they are in progress. This patch is an additional fix for commit b87570f5 ("Fix a crash when block device is read and block size is changed at the same time") that locked aio_read, aio_write and mmap against block size change. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 26 9月, 2012 3 次提交
-
-
由 Fengguang Wu 提交于
blkdev_mmap() isn't used outside of fs/block_dev.c, mark it as static. Reported-by: NFengguang Wu <fengguang.wu@intel.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Mikulas Patocka 提交于
This avoids cache line bouncing when many processes lock the semaphore for read. New percpu lock implementation The lock consists of an array of percpu unsigned integers, a boolean variable and a mutex. When we take the lock for read, we enter rcu read section, check for a "locked" variable. If it is false, we increase a percpu counter on the current cpu and exit the rcu section. If "locked" is true, we exit the rcu section, take the mutex and drop it (this waits until a writer finished) and retry. Unlocking for read just decreases percpu variable. Note that we can unlock on a difference cpu than where we locked, in this case the counter underflows. The sum of all percpu counters represents the number of processes that hold the lock for read. When we need to lock for write, we take the mutex, set "locked" variable to true and synchronize rcu. Since RCU has been synchronized, no processes can create new read locks. We wait until the sum of percpu counters is zero - when it is, there are no readers in the critical section. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Mikulas Patocka 提交于
The kernel may crash when block size is changed and I/O is issued simultaneously. Because some subsystems (udev or lvm) may read any block device anytime, the bug actually puts any code that changes a block device size in jeopardy. The crash can be reproduced if you place "msleep(1000)" to blkdev_get_blocks just before "bh->b_size = max_blocks << inode->i_blkbits;". Then, run "dd if=/dev/ram0 of=/dev/null bs=4k count=1 iflag=direct" While it is waiting in msleep, run "blockdev --setbsz 2048 /dev/ram0" You get a BUG. The direct and non-direct I/O is written with the assumption that block size does not change. It doesn't seem practical to fix these crashes one-by-one there may be many crash possibilities when block size changes at a certain place and it is impossible to find them all and verify the code. This patch introduces a new rw-lock bd_block_size_semaphore. The lock is taken for read during I/O. It is taken for write when changing block size. Consequently, block size can't be changed while I/O is being submitted. For asynchronous I/O, the patch only prevents block size change while the I/O is being submitted. The block size can change when the I/O is in progress or when the I/O is being finished. This is acceptable because there are no accesses to block size when asynchronous I/O is being finished. The patch prevents block size changing while the device is mapped with mmap. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 02 8月, 2012 1 次提交
-
-
由 Jianpeng Ma 提交于
For regular file, write operaion used blk_plug function.But for block file,write operation did not use blk_plug. This patch is also for write-cache mode for block-device. Signed-off-by: NJianpeng Ma <majianpeng@gmail.com> Reviewed-by: NNeilBrown <neilb@suse.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 23 7月, 2012 1 次提交
-
-
由 Jan Kara 提交于
Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-