- 16 9月, 2009 24 次提交
-
-
由 Jan Kara 提交于
In case we fsync() a file and inode is not dirty, we don't force a transaction to disk and hence don't flush disk caches. Thus file data could be just in disk caches and not on persistent storage. Fix the problem by flushing disk caches if we didn't force a transaction commit. Signed-off-by: NJan Kara <jack@suse.cz>
-
由 Chris Mason 提交于
I've been struggling with this off and on while I've been testing the data=guarded work. The symptom is corrupted orphan lists and inodes with the wrong i_size stored on disk. I was convinced the data=guarded code was just missing a call to ext3_mark_inode_dirty, but tracing showed the i_disksize I was sending to ext3_mark_inode_dirty wasn't actually making it to the drive. ext3_mark_inode_dirty can be called without locks held (atime updates and a few others), so the data=guarded code uses locks while updating the in-memory inode, and then calls ext3_mark_inode_dirty without any locks held. But, ext3_mark_inode_dirty has no internal locking to make sure that only one CPU is updating the buffer head at a time. Generally this works out ok because everyone that changes the inode then calls ext3_mark_inode_dirty themselves. Even though it races, eventually someone updates the buffer heads and things move on. But there is still a risk of the wrong values getting in, and the data=guarded code seems to hit the race very often. Since everyone that changes the inode also logs it, it should be possible to fix this with some memory barriers. I'll leave that as an exercise to the reader and lock the buffer head instead. It it probably a good idea to have a different patch series for lockless bit flipping on the ext3 i_state field. ext3_do_update_inode &= clears EXT3_STATE_NEW without any locks held. Signed-off-by: NChris Mason <chris.mason@oracle.com> Signed-off-by: NJan Kara <jack@suse.cz>
-
由 Jan Kara 提交于
During truncate we are sometimes forced to start a new transaction as the amount of blocks to be journaled is both quite large and hard to predict. So far we restarted a transaction while holding truncate_mutex and that violates lock ordering because truncate_mutex ranks below transaction start (and it can lead to a real deadlock with ext3_get_blocks() allocating new blocks from ext3_writepage()). Luckily, the problem is easy to fix: We just drop the truncate_mutex before restarting the transaction and acquire it afterwards. We are safe to do this as by the time ext3_truncate() is called, all the page cache for the truncated part of the file is dropped and so writepage() cannot come and allocate new blocks in the part of the file we are truncating. The rest of writers is stopped by us holding i_mutex. Signed-off-by: NJan Kara <jack@suse.cz>
-
由 Jan Kara 提交于
lockdep annotation for a transaction start has been at the end of journal_start(). But a transaction is also started from journal_restart(). Move the lockdep annotation to start_this_handle() which covers both cases. Signed-off-by: NJan Kara <jack@suse.cz>
-
由 Jan Kara 提交于
It does not make sense to store block number for journal as unsigned long since they can be only 32-bit (because of on-disk format limitation). So change in-memory structures and variables to use unsigned int instead. Signed-off-by: NJan Kara <jack@suse.cz>
-
由 Andreas Dilger 提交于
Fix jiffie rounding in jbd commit timer setup code. Rounding down could cause the timer to be fired before the corresponding transaction has expired. That transaction can stay not committed forever if no new transaction is created or explicit sync/umount happens. Signed-off-by: NAndreas Dilger <adilger@sun.com> Signed-off-by: NJan Kara <jack@suse.cz>
-
由 Nick Piggin 提交于
wb_clear_pending AFAIKS should not be called after the item has been put on the list, except by the worker threads. It could lead to the situation where the refcount is decremented below 0 and cause lots of problems. Presumably the !wb_has_dirty_io case is not a common one, so it can be discovered when the thread wakes up to check? Also add a comment in bdi_work_clear. Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-
由 Nick Piggin 提交于
By the time bdi_work_on_stack gets evaluated again in bdi_work_free, it can already have been deallocated and used for something else in the !on stack case, giving a false positive in this test and causing corruption. Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-
由 Nick Piggin 提交于
If you're going to do an atomic RMW on each list entry, there's not much point in all the RCU complexities of the list walking. This is only going to help the multi-thread case I guess, but it doesn't hurt to do now. Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-
由 Nick Piggin 提交于
list_add_tail_rcu contains required barriers. Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-
由 Jens Axboe 提交于
Gets rid of a manual set_current_state(). Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-
由 Jens Axboe 提交于
And document its retriever, get_next_work_item(). Acked-by: NJan Kara <jack@suse.cz> Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-
由 Jens Axboe 提交于
bdi_start_writeback() is currently split into two paths, one for WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback() for WB_SYNC_ALL writeback and let bdi_start_writeback() handle only WB_SYNC_NONE. Push down the writeback_control allocation and only accept the parameters that make sense for each function. This cleans up the API considerably. Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-
由 Jens Axboe 提交于
This gets rid of work == NULL in bdi_queue_work() and puts the OOM handling where it belongs. Acked-by: NJan Kara <jack@suse.cz> Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-
由 Jens Axboe 提交于
Now that bdi_writeback_all() no longer handles integrity writeback, it doesn't have to block anymore. This means that we can switch bdi_list reader side protection to RCU. Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-
由 Jens Axboe 提交于
Data integrity writeback must use bdi_start_writeback() and ensure that wbc->sb and wbc->bdi are set. Acked-by: NJan Kara <jack@suse.cz> Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-
由 Jens Axboe 提交于
We do this automatically in get_sb_bdev() from the set_bdev_super() callback. Filesystems that have their own private backing_dev_info must assign that in ->fill_super(). Note that ->s_bdi assignment is required for proper writeback! Acked-by: NChristoph Hellwig <hch@infradead.org> Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-
由 Jens Axboe 提交于
We need to be able to pass in range_cyclic as well, so instead of growing yet another argument, split the arguments into a struct wb_writeback_args structure that we can use internally. Also makes it easier to just copy all members to an on-stack struct, since we can't access work after clearing the pending bit. Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-
由 Christoph Hellwig 提交于
Since it's an opportunistic writeback and not a data integrity action, don't punt to blocking writeback. Just wakeup the thread and it will flush old data. Acked-by: NJan Kara <jack@suse.cz> Signed-off-by: NChristoph Hellwig <hch@infradead.org> Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-
由 Jens Axboe 提交于
It's only set, it's never checked. Kill it. Acked-by: NJan Kara <jack@suse.cz> Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-
由 Jens Axboe 提交于
It has been unused since it was introduced in: commit 520808bf20e90fdbdb320264ba7dd5cf9d47dcac Author: Andrew Morton <akpm@osdl.org> Date: Fri May 21 00:46:17 2004 -0700 [PATCH] block device layer: separate backing_dev_info infrastructure So lets just kill it. Acked-by: NJan Kara <jack@suse.cz> Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-
由 Jaswinder Singh Rajput 提交于
fix the following 'make includecheck' warning: fs/xfs/linux-2.6/xfs_iops.c: xfs_acl.h is included more than once. Signed-off-by: NJaswinder Singh Rajput <jaswinderrajput@gmail.com> Reviewed-by: NAlex Elder <aelder@sgi.com> Signed-off-by: NAlex Elder <aelder@sgi.com>
-
由 Alexey Dobriyan 提交于
create_proc_read_entry() is getting deprecated. Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com> Reviewed-by: NAlex Elder <aelder@sgi.com> Signed-off-by: NAlex Elder <aelder@sgi.com>
-
由 David Brownell 提交于
Let attribute group vectors be declared "const". We'd like to let most attribute metadata live in read-only sections... this is a start. Signed-off-by: NDavid Brownell <dbrownell@users.sourceforge.net> Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>
-
- 15 9月, 2009 4 次提交
-
-
由 Jan Kara 提交于
When we close a file, we remove preallocated blocks from it. But this truncation was not protected by i_mutex and thus it could have raced with a write through a different fd and cause crashes or even filesystem corruption. Signed-off-by: NJan Kara <jack@suse.cz>
-
由 Jan Kara 提交于
So far we preallocated blocks also for directories but that brings a problem, when to get rid of preallocated blocks we don't need. So far we removed them in udf_clear_inode() which has a disadvantage that 1) blocks are unavailable long after writing to a directory finished and thus one can get out of space unnecessarily early 2) releasing blocks from udf_clear_inode is problematic because VFS does not expect us to redirty inode there and it also slows down memory reclaim. So preallocate blocks only for regular files where we can drop preallocation in udf_release_file. Signed-off-by: NJan Kara <jack@suse.cz>
-
由 Jan Kara 提交于
Recomputation of the pointer was wrong (it should have been just increment). Luckily, we never use the computed value. Remove it. Signed-off-by: NJan Kara <jack@suse.cz>
-
由 Jan Kara 提交于
Remove code that gets never used. Signed-off-by: NJan Kara <jack@suse.cz>
-
- 14 9月, 2009 12 次提交
-
-
由 Christoph Hellwig 提交于
Currenly vfs_fsync(_range) first calls filemap_fdatawrite to write out the data, the calls into ->fsync to write out the metadata and then finally calls filemap_fdatawait to wait for the data I/O to complete. What sounds like a clever micro-optimization actually is nast trap for many filesystems. For many modern filesystems i_size or other inode information is only updated on I/O completion and we need to wait for I/O to finish before we can write out the metadata. For old fashionen filesystems that instanciate blocks during the actual write and also update the metadata at that point it opens up a large window were we could expose uninitialized blocks after a crash. While a few filesystems that need it already wait for the I/O to finish inside their ->fsync methods it is rather suboptimal as it is done under the i_mutex and also always for the whole file instead of just a part as we could do for O_SYNC handling. Here is a small audit of all fsync instances in the tree: - spufs_mfc_fsync: - ps3flash_fsync: - vol_cdev_fsync: - printer_fsync: - fb_deferred_io_fsync: - bad_file_fsync: - simple_sync_file: don't care - filesystems/drivers do't use the page cache or are purely in-memory. - simple_fsync: - file_fsync: - affs_file_fsync: - fat_file_fsync: - jfs_fsync: - ubifs_fsync: - reiserfs_dir_fsync: - reiserfs_sync_file: never touch pagecache themselves. We need to wait before if we do not want to expose stale data after an allocation. - afs_fsync: - fuse_fsync_common: do the waiting writeback itself in awkward ways, would benefit from proper semantics - block_fsync: Does a filemap_write_and_wait on the block device inode. Because we now have f_mapping that is the same inode we call it on in vfs_fsync. So just removing it and letting the VFS do the work in one go would be an improvement. - btrfs_sync_file: - cifs_fsync: - xfs_file_fsync: need the wait first and currently do it themselves. would benefit from doing it outside i_mutex. - coda_fsync: - ecryptfs_fsync: - exofs_file_fsync: - shm_fsync: only passes the fsync through to the lower layer - ext3_sync_file: doesn't seem to care, comments are confusing. - ext4_sync_file: would need the wait to work correctly for delalloc mode with late i_size updates. Otherwise the ext3 comment applies. currently implemens it's own writeback and wait in an odd way, could benefit from doing it properly. - gfs2_fsync: not needed for journaled data mode, but probably harmless there. Currently writes back data asynchronously itself. Needs some major audit. - hostfs_fsync: just calls fsync/datasync on the host FD. Without the wait before data might not even be inflight yet if we're unlucky. - hpfs_file_fsync: - ncp_fsync: no-ops. Dangerous before and after. - jffs2_fsync: just calls jffs2_flush_wbuf_gc, not sure how this relates to data. - nfs_fsync_dir: just increments stats, claims all directory operations are synchronous - nfs_file_fsync: only writes out data??? Looks very odd. - nilfs_sync_file: looks like it expects all data done, but not sure from the code - ntfs_dir_fsync: - ntfs_file_fsync: appear to do their own data writeback. Very convoluted code. - ocfs2_sync_file: does it's own data writeback, but no wait. probably needs the wait. - smb_fsync: according to a comment expects all pages written already, probably needs the wait before. This patch only changes vfs_fsync_range, removal of the wait in the methods that have it is left to the filesystem maintainers. Note that most filesystems really do need an audit for their fsync methods given the gems found in this very brief audit. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJan Kara <jack@suse.cz>
-
由 Jan Kara 提交于
Remove these three functions since nobody uses them anymore. Signed-off-by: NJan Kara <jack@suse.cz>
-
由 Jan Kara 提交于
fat_cont_expand() is the only user of sync_page_range_nolock(). It's also the only user of generic_osync_inode() which does not have a file open. So opencode needed actions for FAT so that we can convert generic_osync_inode() to a standard syncing path. Update a comment about generic_osync_inode(). CC: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: NJan Kara <jack@suse.cz>
-
由 Jan Kara 提交于
Christoph Hellwig says that it is enough for XFS to call filemap_write_and_wait_range() instead of sync_page_range() because we do all the metadata syncing when forcing the log. CC: Felix Blyakher <felixb@sgi.com> CC: xfs@oss.sgi.com CC: Christoph Hellwig <hch@lst.de> Signed-off-by: NJan Kara <jack@suse.cz>
-
由 Jan Kara 提交于
Update ocfs2 specific splicing code to use generic syncing helper. The sync now does not happen under rw_lock because generic_write_sync() acquires i_mutex which ranks above rw_lock. That should not matter because standard fsync path does not hold it either. Acked-by: NJoel Becker <Joel.Becker@oracle.com> Acked-by: NMark Fasheh <mfasheh@suse.com> CC: ocfs2-devel@oss.oracle.com Signed-off-by: NJan Kara <jack@suse.cz>
-
由 Jan Kara 提交于
Use new syncing helpers in .write and .aio_write functions. Also remove superfluous syncing in ntfs_file_buffered_write() and update comments about generic_osync_inode(). CC: Anton Altaparmakov <aia21@cantab.net> CC: linux-ntfs-dev@lists.sourceforge.net Signed-off-by: NJan Kara <jack@suse.cz>
-
由 Jan Kara 提交于
The syncing is now properly handled by generic_file_aio_write() so no special ext4 code is needed. CC: linux-ext4@vger.kernel.org CC: tytso@mit.edu Signed-off-by: NJan Kara <jack@suse.cz>
-
由 Jan Kara 提交于
Syncing is now properly done by generic_file_aio_write() so no special logic is needed in ext3. CC: linux-ext4@vger.kernel.org Signed-off-by: NJan Kara <jack@suse.cz>
-
由 Jan Kara 提交于
We rely on generic_write_sync() now. CC: linux-ext4@vger.kernel.org Signed-off-by: NJan Kara <jack@suse.cz>
-
由 Jan Kara 提交于
Introduce new function for generic inode syncing (vfs_fsync_range) and use it from fsync() path. Introduce also new helper for syncing after a sync write (generic_write_sync) using the generic function. Use these new helpers for syncing from generic VFS functions. This makes O_SYNC writes to block devices acquire i_mutex for syncing. If we really care about this, we can make block_fsync() drop the i_mutex and reacquire it before it returns. CC: Evgeniy Polyakov <zbr@ioremap.net> CC: ocfs2-devel@oss.oracle.com CC: Joel Becker <joel.becker@oracle.com> CC: Felix Blyakher <felixb@sgi.com> CC: xfs@oss.sgi.com CC: Anton Altaparmakov <aia21@cantab.net> CC: linux-ntfs-dev@lists.sourceforge.net CC: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> CC: linux-ext4@vger.kernel.org CC: tytso@mit.edu Acked-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJan Kara <jack@suse.cz>
-
由 Christoph Hellwig 提交于
generic_file_aio_write_nolock() is now used only by block devices and raw character device. Filesystems should use __generic_file_aio_write() in case generic_file_aio_write() doesn't suit them. So rename the function to blkdev_aio_write() and move it to fs/blockdev.c. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJan Kara <jack@suse.cz>
-
由 Jan Kara 提交于
Use the new helper. We have to submit data pages ourselves in case of O_SYNC write because __generic_file_aio_write does not do it for us. OCFS2 developpers might think about moving the sync out of i_mutex which seems to be easily possible but that's out of scope of this patch. CC: ocfs2-devel@oss.oracle.com Acked-by: NJoel Becker <joel.becker@oracle.com> Signed-off-by: NJan Kara <jack@suse.cz>
-