- 13 8月, 2013 1 次提交
-
-
由 Dave Chinner 提交于
There is a bunch of code in xfs_bmap.c that is kernel specific and not shared with userspace. To minimise the difference between the kernel and userspace code, shift this unshared code to xfs_bmap_util.c, and the declarations to xfs_bmap_util.h. The biggest issue here is xfs_bmap_finish() - userspace has it's own definition of this function, and so we need to move it out of xfs_bmap.[ch]. This means several other files need to include xfs_bmap_util.h as well. It also introduces and interesting dance for the stack switching code in xfs_bmapi_allocate(). The stack switching/workqueue code is actually moved to xfs_bmap_util.c, so that userspace can simply use a #define in a header file to connect the dots without needing to know about the stack switch code at all. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NMark Tinguely <tinguely@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 23 7月, 2013 1 次提交
-
-
由 Jie Liu 提交于
In xfs_vm_write_failed(), we evaluate the block_offset of pos with PAGE_MASK which is an unsigned long. That is fine on 64-bit platforms regardless of whether the request pos is 32-bit or 64-bit. However, on 32-bit platforms the value is 0xfffff000 and so the high 32 bits in it will be masked off with (pos & PAGE_MASK) for a 64-bit pos. As a result, the evaluated block_offset is incorrect which will cause this failure ASSERT(block_offset + from == pos); and potentially pass the wrong block to xfs_vm_kill_delalloc_range(). In this case, we can get a kernel panic if CONFIG_XFS_DEBUG is enabled: XFS: Assertion failed: block_offset + from == pos, file: fs/xfs/xfs_aops.c, line: 1504 ------------[ cut here ]------------ kernel BUG at fs/xfs/xfs_message.c:100! invalid opcode: 0000 [#1] SMP ........ Pid: 4057, comm: mkfs.xfs Tainted: G O 3.9.0-rc2 #1 EIP: 0060:[<f94a7e8b>] EFLAGS: 00010282 CPU: 0 EIP is at assfail+0x2b/0x30 [xfs] EAX: 00000056 EBX: f6ef28a0 ECX: 00000007 EDX: f57d22a4 ESI: 1c2fb000 EDI: 00000000 EBP: ea6b5d30 ESP: ea6b5d1c DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 CR0: 8005003b CR2: 094f3ff4 CR3: 2bcb4000 CR4: 000006f0 DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 DR6: ffff0ff0 DR7: 00000400 Process mkfs.xfs (pid: 4057, ti=ea6b4000 task=ea5799e0 task.ti=ea6b4000) Stack: 00000000 f9525c48 f951fa80 f951f96b 000005e4 ea6b5d7c f9494b34 c19b0ea2 00000066 f3d6c620 c19b0ea2 00000000 e9a91458 00001000 00000000 00000000 00000000 c15c7e89 00000000 1c2fb000 00000000 00000000 1c2fb000 00000080 Call Trace: [<f9494b34>] xfs_vm_write_failed+0x74/0x1b0 [xfs] [<c15c7e89>] ? printk+0x4d/0x4f [<f9494d7d>] xfs_vm_write_begin+0x10d/0x170 [xfs] [<c110a34c>] generic_file_buffered_write+0xdc/0x210 [<f949b669>] xfs_file_buffered_aio_write+0xf9/0x190 [xfs] [<f949b7f3>] xfs_file_aio_write+0xf3/0x160 [xfs] [<c115e504>] do_sync_write+0x94/0xd0 [<c115ed1f>] vfs_write+0x8f/0x160 [<c115e470>] ? wait_on_retry_sync_kiocb+0x50/0x50 [<c115f017>] sys_write+0x47/0x80 [<c15d860d>] sysenter_do_call+0x12/0x28 ............. EIP: [<f94a7e8b>] assfail+0x2b/0x30 [xfs] SS:ESP 0068:ea6b5d1c ---[ end trace cdd9af4f4ecab42f ]--- Kernel panic - not syncing: Fatal exception In order to avoid this, we can evaluate the block_offset of the start of the page by using shifts rather than masks the mismatch problem. Thanks Dave Chinner for help finding and fixing this bug. Reported-by: NMichael L. Semon <mlsemon35@gmail.com> Reviewed-by: NDave Chinner <david@fromorbit.com> Reviewed-by: NMark Tinguely <tinguely@sgi.com> Signed-off-by: NJie Liu <jeff.liu@oracle.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 25 5月, 2013 1 次提交
-
-
由 Dave Chinner 提交于
FSX on 512 byte block size filesystems has been failing for some time with corrupted data. The fault dates back to the change in the writeback data integrity algorithm that uses a mark-and-sweep approach to avoid data writeback livelocks. Unfortunately, a side effect of this mark-and-sweep approach is that each page will only be written once for a data integrity sync, and there is a condition in writeback in XFS where a page may require two writeback attempts to be fully written. As a result of the high level change, we now only get a partial page writeback during the integrity sync because the first pass through writeback clears the mark left on the page index to tell writeback that the page needs writeback.... The cause is writing a partial page in the clustering code. This can happen when a mapping boundary falls in the middle of a page - we end up writing back the first part of the page that the mapping covers, but then never revisit the page to have the remainder mapped and written. The fix is simple - if the mapping boundary falls inside a page, then simple abort clustering without touching the page. This means that the next ->writepage entry that write_cache_pages() will make is the page we aborted on, and xfs_vm_writepage() will map all sections of the page correctly. This behaviour is also optimal for non-data integrity writes, as it results in contiguous sequential writeback of the file rather than missing small holes and having to write them a "random" writes in a future pass. With this fix, all the fsx tests in xfstests now pass on a 512 byte block size filesystem on a 4k page machine. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> Signed-off-by: NBen Myers <bpm@sgi.com> (cherry picked from commit 49b137cb)
-
- 22 5月, 2013 2 次提交
-
-
由 Lukas Czerner 提交于
->invalidatepage() aop now accepts range to invalidate so we can make use of it in xfs_vm_invalidatepage() Signed-off-by: NLukas Czerner <lczerner@redhat.com> Acked-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NBen Myers <bpm@sgi.com> Cc: xfs@oss.sgi.com
-
由 Lukas Czerner 提交于
Currently there is no way to truncate partial page where the end truncate point is not at the end of the page. This is because it was not needed and the functionality was enough for file system truncate operation to work properly. However more file systems now support punch hole feature and it can benefit from mm supporting truncating page just up to the certain point. Specifically, with this functionality truncate_inode_pages_range() can be changed so it supports truncating partial page at the end of the range (currently it will BUG_ON() if 'end' is not at the end of the page). This commit changes the invalidatepage() address space operation prototype to accept range to be invalidated and update all the instances for it. We also change the block_invalidatepage() in the same way and actually make a use of the new length argument implementing range invalidation. Actual file system implementations will follow except the file systems where the changes are really simple and should not change the behaviour in any way .Implementation for truncate_page_range() which will be able to accept page unaligned ranges will follow as well. Signed-off-by: NLukas Czerner <lczerner@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com>
-
- 21 5月, 2013 1 次提交
-
-
由 Dave Chinner 提交于
FSX on 512 byte block size filesystems has been failing for some time with corrupted data. The fault dates back to the change in the writeback data integrity algorithm that uses a mark-and-sweep approach to avoid data writeback livelocks. Unfortunately, a side effect of this mark-and-sweep approach is that each page will only be written once for a data integrity sync, and there is a condition in writeback in XFS where a page may require two writeback attempts to be fully written. As a result of the high level change, we now only get a partial page writeback during the integrity sync because the first pass through writeback clears the mark left on the page index to tell writeback that the page needs writeback.... The cause is writing a partial page in the clustering code. This can happen when a mapping boundary falls in the middle of a page - we end up writing back the first part of the page that the mapping covers, but then never revisit the page to have the remainder mapped and written. The fix is simple - if the mapping boundary falls inside a page, then simple abort clustering without touching the page. This means that the next ->writepage entry that write_cache_pages() will make is the page we aborted on, and xfs_vm_writepage() will map all sections of the page correctly. This behaviour is also optimal for non-data integrity writes, as it results in contiguous sequential writeback of the file rather than missing small holes and having to write them a "random" writes in a future pass. With this fix, all the fsx tests in xfstests now pass on a 512 byte block size filesystem on a 4k page machine. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 08 5月, 2013 1 次提交
-
-
由 Kent Overstreet 提交于
Faster kernel compiles by way of fewer unnecessary includes. [akpm@linux-foundation.org: fix fallout] [akpm@linux-foundation.org: fix build] Signed-off-by: NKent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: N"Theodore Ts'o" <tytso@mit.edu> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 23 3月, 2013 1 次提交
-
-
由 Jan Kara 提交于
When a dirty page is truncated from a file but reclaim gets to it before truncate_inode_pages(), we hit WARN_ON(delalloc) in xfs_vm_releasepage(). This is because reclaim tries to write the page, xfs_vm_writepage() just bails out (leaving page clean) and thus reclaim thinks it can continue and calls xfs_vm_releasepage() on page with dirty buffers. Fix the issue by redirtying the page in xfs_vm_writepage(). This makes reclaim stop reclaiming the page and also logically it keeps page in a more consistent state where page with dirty buffers has PageDirty set. Signed-off-by: NJan Kara <jack@suse.cz> Reviewed-by: NCarlos Maiolino <cmaiolino@redhat.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 29 1月, 2013 1 次提交
-
-
由 Jan Kara 提交于
Running AIO is pinning inode in memory using file reference. Once AIO is completed using aio_complete(), file reference is put and inode can be freed from memory. So we have to be sure that calling aio_complete() is the last thing we do with the inode. CC: xfs@oss.sgi.com CC: Ben Myers <bpm@sgi.com> CC: stable@vger.kernel.org Signed-off-by: NJan Kara <jack@suse.cz> Reviewed-by: NBen Myers <bpm@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 26 1月, 2013 1 次提交
-
-
由 Jan Kara 提交于
Running AIO is pinning inode in memory using file reference. Once AIO is completed using aio_complete(), file reference is put and inode can be freed from memory. So we have to be sure that calling aio_complete() is the last thing we do with the inode. CC: xfs@oss.sgi.com CC: Ben Myers <bpm@sgi.com> CC: stable@vger.kernel.org Signed-off-by: NJan Kara <jack@suse.cz> Reviewed-by: NBen Myers <bpm@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 30 11月, 2012 1 次提交
-
-
由 Dave Chinner 提交于
The direct IO path can do a nested transaction reservation when writing past the EOF. The first transaction is the append transaction for setting the filesize at IO completion, but we can also need a transaction for allocation of blocks. If the log is low on space due to reservations and small log, the append transaction can be granted after wating for space as the only active transaction in the system. This then attempts a reservation for an allocation, which there isn't space in the log for, and the reservation sleeps. The result is that there is nothing left in the system to wake up all the processes waiting for log space to come free. The stack trace that shows this deadlock is relatively innocuous: xlog_grant_head_wait xlog_grant_head_check xfs_log_reserve xfs_trans_reserve xfs_iomap_write_direct __xfs_get_blocks xfs_get_blocks_direct do_blockdev_direct_IO __blockdev_direct_IO xfs_vm_direct_IO generic_file_direct_write xfs_file_dio_aio_writ xfs_file_aio_write do_sync_write vfs_write This was discovered on a filesystem with a log of only 10MB, and a log stripe unit of 256k whih increased the base reservations by 512k. Hence a allocation transaction requires 1.2MB of log space to be available instead of only 260k, and so greatly increased the chance that there wouldn't be enough log space available for the nested transaction to succeed. The key to reproducing it is this mkfs command: mkfs.xfs -f -d agcount=16,su=256k,sw=12 -l su=256k,size=2560b $SCRATCH_DEV The test case was a 1000 fsstress processes running with random freeze and unfreezes every few seconds. Thanks to Eryu Guan (eguan@redhat.com) for writing the test that found this on a system with a somewhat unique default configuration.... cc: <stable@vger.kernel.org> Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NAndrew Dahl <adahl@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 17 11月, 2012 1 次提交
-
-
由 Dave Chinner 提交于
When we shut down the filesystem, it might first be detected in writeback when we are allocating a inode size transaction. This happens after we have moved all the pages into the writeback state and unlocked them. Unfortunately, if we fail to set up the transaction we then abort writeback and try to invalidate the current page. This then triggers are BUG() in block_invalidatepage() because we are trying to invalidate an unlocked page. Fixing this is a bit of a chicken and egg problem - we can't allocate the transaction until we've clustered all the pages into the IO and we know the size of it (i.e. whether the last block of the IO is beyond the current EOF or not). However, we don't want to hold pages locked for long periods of time, especially while we lock other pages to cluster them into the write. To fix this, we need to make a clear delineation in writeback where errors can only be handled by IO completion processing. That is, once we have marked a page for writeback and unlocked it, we have to report errors via IO completion because we've already started the IO. We may not have submitted any IO, but we've changed the page state to indicate that it is under IO so we must now use the IO completion path to report errors. To do this, add an error field to xfs_submit_ioend() to pass it the error that occurred during the building on the ioend chain. When this is non-zero, mark each ioend with the error and call xfs_finish_ioend() directly rather than building bios. This will immediately push the ioends through completion processing with the error that has occurred. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NMark Tinguely <tinguely@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 15 11月, 2012 1 次提交
-
-
由 Dave Chinner 提交于
It is a complex wrapper around VFS functions, but there are VFS functions that provide exactly the same functionality. Call the VFS functions directly and remove the unnecessary indirection and complexity. We don't need to care about clearing the XFS_ITRUNCATED flag, as that is done during .writepages. Hence is cleared by the VFS writeback path if there is anything to write back during the flush. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NAndrew Dahl <adahl@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 14 11月, 2012 1 次提交
-
-
由 Dave Chinner 提交于
When we shut down the filesystem, it might first be detected in writeback when we are allocating a inode size transaction. This happens after we have moved all the pages into the writeback state and unlocked them. Unfortunately, if we fail to set up the transaction we then abort writeback and try to invalidate the current page. This then triggers are BUG() in block_invalidatepage() because we are trying to invalidate an unlocked page. Fixing this is a bit of a chicken and egg problem - we can't allocate the transaction until we've clustered all the pages into the IO and we know the size of it (i.e. whether the last block of the IO is beyond the current EOF or not). However, we don't want to hold pages locked for long periods of time, especially while we lock other pages to cluster them into the write. To fix this, we need to make a clear delineation in writeback where errors can only be handled by IO completion processing. That is, once we have marked a page for writeback and unlocked it, we have to report errors via IO completion because we've already started the IO. We may not have submitted any IO, but we've changed the page state to indicate that it is under IO so we must now use the IO completion path to report errors. To do this, add an error field to xfs_submit_ioend() to pass it the error that occurred during the building on the ioend chain. When this is non-zero, mark each ioend with the error and call xfs_finish_ioend() directly rather than building bios. This will immediately push the ioends through completion processing with the error that has occurred. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NMark Tinguely <tinguely@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 31 7月, 2012 1 次提交
-
-
由 Jan Kara 提交于
Generic code now blocks all writers from standard write paths. So we add blocking of all writers coming from ioctl (we get a protection of ioctl against racing remount read-only as a bonus) and convert xfs_file_aio_write() to a non-racy freeze protection. We also keep freeze protection on transaction start to block internal filesystem writes such as removal of preallocated blocks. CC: Ben Myers <bpm@sgi.com> CC: Alex Elder <elder@kernel.org> CC: xfs@oss.sgi.com Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 23 7月, 2012 1 次提交
-
-
由 Alain Renaud 提交于
Add a XFS_ prefix to IO_DIRECT,XFS_IO_DELALLOC, XFS_IO_UNWRITTEN and XFS_IO_OVERWRITE. This to avoid namespace conflict with other modules. Signed-off-by: NAlain Renaud <arenaud@sgi.com> Reviewed-by: NRich Johnston <rjohnston@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 22 7月, 2012 1 次提交
-
-
由 Christoph Hellwig 提交于
We need to zero out part of a page which beyond EOF before setting uptodate, otherwise, mapread or write will see non-zero data beyond EOF. Based on the code in fs/buffer.c and the following ext4 commit: ext4: handle EOF correctly in ext4_bio_write_page() And yes, I wish we had a good test case for it. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NMark Tinguely <tinguely@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 21 6月, 2012 1 次提交
-
-
由 Alain Renaud 提交于
On filesytems with a block size smaller than PAGE_SIZE we currently have a problem with unwritten extents. If a we have multi-block page for which an unwritten extent has been allocated, and only some of the buffers have been written to, and they are not contiguous, we can expose stale data from disk in the blocks between the writes after extent conversion. Example of a page with unwritten and real data. buffer content 0 empty b_state = 0 1 DATA b_state = 0x1023 Uptodate,Dirty,Mapped,Unwritten 2 DATA b_state = 0x1023 Uptodate,Dirty,Mapped,Unwritten 3 empty b_state = 0 4 empty b_state = 0 5 DATA b_state = 0x1023 Uptodate,Dirty,Mapped,Unwritten 6 DATA b_state = 0x1023 Uptodate,Dirty,Mapped,Unwritten 7 empty b_state = 0 Buffers 1, 2, 5, and 6 have been written to, leaving 0, 3, 4, and 7 empty. Currently buffers 1, 2, 5, and 6 are added to a single ioend, and when IO has completed, extent conversion creates a real extent from block 1 through block 6, leaving 0 and 7 unwritten. However buffers 3 and 4 were not written to disk, so stale data is exposed from those blocks on a subsequent read. Fix this by setting iomap_valid = 0 when we find a buffer that is not Uptodate. This ensures that buffers 5 and 6 are not added to the same ioend as buffers 1 and 2. Later these blocks will be converted into two separate real extents, leaving the blocks in between unwritten. Signed-off-by: NAlain Renaud <arenaud@sgi.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 15 6月, 2012 2 次提交
-
-
由 Dave Chinner 提交于
The m_maxioffset field in the struct xfs_mount contains the same value as the superblock s_maxbytes field. There is no need to carry two copies of this limit around, so use the VFS superblock version. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NEric Sandeen <sandeen@redhat.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
由 Alain Renaud 提交于
On filesytems with a block size smaller than PAGE_SIZE we currently have a problem with unwritten extents. If a we have multi-block page for which an unwritten extent has been allocated, and only some of the buffers have been written to, and they are not contiguous, we can expose stale data from disk in the blocks between the writes after extent conversion. Example of a page with unwritten and real data. buffer content 0 empty b_state = 0 1 DATA b_state = 0x1023 Uptodate,Dirty,Mapped,Unwritten 2 DATA b_state = 0x1023 Uptodate,Dirty,Mapped,Unwritten 3 empty b_state = 0 4 empty b_state = 0 5 DATA b_state = 0x1023 Uptodate,Dirty,Mapped,Unwritten 6 DATA b_state = 0x1023 Uptodate,Dirty,Mapped,Unwritten 7 empty b_state = 0 Buffers 1, 2, 5, and 6 have been written to, leaving 0, 3, 4, and 7 empty. Currently buffers 1, 2, 5, and 6 are added to a single ioend, and when IO has completed, extent conversion creates a real extent from block 1 through block 6, leaving 0 and 7 unwritten. However buffers 3 and 4 were not written to disk, so stale data is exposed from those blocks on a subsequent read. Fix this by setting iomap_valid = 0 when we find a buffer that is not Uptodate. This ensures that buffers 5 and 6 are not added to the same ioend as buffers 1 and 2. Later these blocks will be converted into two separate real extents, leaving the blocks in between unwritten. Signed-off-by: NAlain Renaud <arenaud@sgi.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 15 5月, 2012 8 次提交
-
-
由 Dave Chinner 提交于
With the removal of xfs_rw.h and other changes over time, xfs_bit.h is being included in many files that don't actually need it. Clean up the includes as necessary. Also move the only-used-once xfs_ialloc_find_free() static inline function out of a header file that is widely included to reduce the number of needless dependencies on xfs_bit.h. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NMark Tinguely <tinguely@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
由 Dave Chinner 提交于
The only thing left in xfs_rw.h is a function prototype for an inode function. Move that to xfs_inode.h, and kill xfs_rw.h. Also move the function implementing the prototype from xfs_rw.c to xfs_inode.c so we only have one function left in xfs_rw.c Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NMark Tinguely <tinguely@sgi.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NBen Myers <bpm@sgi.com>
-
由 Dave Chinner 提交于
Untangle the header file includes a bit by moving the definition of xfs_agino_t to xfs_types.h. This removes the dependency that xfs_ag.h has on xfs_inum.h, meaning we don't need to include xfs_inum.h everywhere we include xfs_ag.h. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NMark Tinguely <tinguely@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
由 Dave Chinner 提交于
xfstest 229 exposes a problem with buffered IO, delayed allocation and extent size hints. That is when we do delayed allocation during buffered IO, we reserve space for the extent size hint alignment and allocate the physical space to align the extent, but we do not zero the regions of the extent that aren't written by the write(2) syscall. The result is that we expose stale data in unwritten regions of the extent size hints. There are two ways to fix this. The first is to detect that we are doing unaligned writes, check if there is already a mapping or data over the extent size hint range, and if not zero the page cache first before then doing the real write. This can be very expensive for large extent size hints, especially if the subsequent writes fill then entire extent size before the data is written to disk. The second, and simpler way, is simply to turn off delayed allocation when the extent size hint is set and use preallocation instead. This results in unwritten extents being laid down on disk and so only the written portions will be converted. This matches the behaviour for direct IO, and will also work for the real time device. The disadvantage of this approach is that for small extent size hints we can get file fragmentation, but in general extent size hints are fairly large (e.g. stripe width sized) so this isn't a big deal. Implement the second approach as it is simple and effective. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NMark Tinguely <tinguely@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
由 Dave Chinner 提交于
When a partial write inside EOF fails, it can leave delayed allocation blocks lying around because they don't get punched back out. This leads to assert failures like: XFS: Assertion failed: XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_delayed_blks == 0, file: fs/xfs/xfs_super.c, line: 847 when evicting inodes from the cache. This can be trivially triggered by xfstests 083, which takes between 5 and 15 executions on a 512 byte block size filesystem to trip over this. Debugging shows a failed write due to ENOSPC calling xfs_vm_write_failed such as: [ 5012.329024] ino 0xa0026: vwf to 0x17000, sze 0x1c85ae and no action is taken on it. This leaves behind a delayed allocation extent that has no page covering it and no data in it: [ 5015.867162] ino 0xa0026: blks: 0x83 delay blocks 0x1, size 0x2538c0 [ 5015.868293] ext 0: off 0x4a, fsb 0x50306, len 0x1 [ 5015.869095] ext 1: off 0x4b, fsb 0x7899, len 0x6b [ 5015.869900] ext 2: off 0xb6, fsb 0xffffffffe0008, len 0x1 ^^^^^^^^^^^^^^^ [ 5015.871027] ext 3: off 0x36e, fsb 0x7a27, len 0xd [ 5015.872206] ext 4: off 0x4cf, fsb 0x7a1d, len 0xa So the delayed allocation extent is one block long at offset 0x16c00. Tracing shows that a bigger write: xfs_file_buffered_write: size 0x1c85ae offset 0x959d count 0x1ca3f ioflags allocates the block, and then fails with ENOSPC trying to allocate the last block on the page, leading to a failed write with stale delalloc blocks on it. Because we've had an ENOSPC when trying to allocate 0x16e00, it means that we are never goinge to call ->write_end on the page and so the allocated new buffer will not get marked dirty or have the buffer_new state cleared. In other works, what the above write is supposed to end up with is this mapping for the page: +------+------+------+------+------+------+------+------+ UMA UMA UMA UMA UMA UMA UND FAIL where: U = uptodate M = mapped N = new A = allocated D = delalloc FAIL = block we ENOSPC'd on. and the key point being the buffer_new() state for the newly allocated delayed allocation block. Except it doesn't - we're not marking buffers new correctly. That buffer_new() problem goes back to the xfs_iomap removal days, where xfs_iomap() used to return a "new" status for any map with newly allocated blocks, so that __xfs_get_blocks() could call set_buffer_new() on it. We still have the "new" variable and the check for it in the set_buffer_new() logic - except we never set it now! Hence that newly allocated delalloc block doesn't have the new flag set on it, so when the write fails we cannot tell which blocks we are supposed to punch out. WHy do we need the buffer_new flag? Well, that's because we can have this case: +------+------+------+------+------+------+------+------+ UMD UMD UMD UMD UMD UMD UND FAIL where all the UMD buffers contain valid data from a previously successful write() system call. We only want to punch the UND buffer because that's the only one that we added in this write and it was only this write that failed. That implies that even the old buffer_new() logic was wrong - because it would result in all those UMD buffers on the page having set_buffer_new() called on them even though they aren't new. Hence we shoul donly be calling set_buffer_new() for delalloc buffers that were allocated (i.e. were a hole before xfs_iomap_write_delay() was called). So, fix this set_buffer_new logic according to how we need it to work for handling failed writes correctly. Also, restore the new buffer logic handling for blocks allocated via xfs_iomap_write_direct(), because it should still set the buffer_new flag appropriately for newly allocated blocks, too. SO, now we have the buffer_new() being set appropriately in __xfs_get_blocks(), we can detect the exact delalloc ranges that we allocated in a failed write, and hence can now do a walk of the buffers on a page to find them. Except, it's not that easy. When block_write_begin() fails, it unlocks and releases the page that we just had an error on, so we can't use that page to handle errors anymore. We have to get access to the page while it is still locked to walk the buffers. Hence we have to open code block_write_begin() in xfs_vm_write_begin() to be able to insert xfs_vm_write_failed() is the right place. With that, we can pass the page and write range to xfs_vm_write_failed() and walk the buffers on the page, looking for delalloc buffers that are either new or beyond EOF and punch them out. Handling buffers beyond EOF ensures we still handle the existing case that xfs_vm_write_failed() handles. Of special note is the truncate_pagecache() handling - that only should be done for pages outside EOF - pages within EOF can still contain valid, dirty data so we must not punch them out of the cache. That just leaves the xfs_vm_write_end() failure handling. The only failure case here is that we didn't copy the entire range, and generic_write_end() handles that by zeroing the region of the page that wasn't copied, we don't have to punch out blocks within the file because they are guaranteed to contain zeros. Hence we only have to handle the existing "beyond EOF" case and don't need access to the buffers on the page. Hence it remains largely unchanged. Note that xfs_getbmap() can still trip over delalloc blocks beyond EOF that are left there by speculative delayed allocation. Hence this bug fix does not solve all known issues with bmap vs delalloc, but it does fix all the the known accidental occurances of the problem. Signed-off-by: NDave Chinner <david@fromorbit.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
由 Dave Chinner 提交于
xfs_is_delayed_page() checks to see if a page has buffers matching the given IO type passed in. It does so by walking the buffer heads on the page and checking if the state flags match the IO type. However, the "acceptable" variable that is calculated is overwritten every time a new buffer is checked. Hence if the first buffer on the page is of the right type, this state is lost if the second buffer is not of the correct type. This means that xfs_aops_discard_page() may not discard delalloc regions when it is supposed to, and xfs_convert_page() may not cluster IO as efficiently as possible. This problem only occurs on filesystems with a block size smaller than page size. Also, rename xfs_is_delayed_page() to xfs_check_page_type() to better describe what it is doing - it is not delalloc specific anymore. The problem was first noticed by Peter Watkins. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NMark Tinguely <tinguely@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
由 Dave Chinner 提交于
I've been seeing regular ASSERT failures in xfstests when running fsstress based tests over the past month. xfs_getbmap() has been failing this test: XFS: Assertion failed: ((iflags & BMV_IF_DELALLOC) != 0) || (map[i].br_startblock != DELAYSTARTBLOCK), file: fs/xfs/xfs_bmap.c, line: 5650 where it is encountering a delayed allocation extent after writing all the dirty data to disk and then walking the extent map atomically by holding the XFS_IOLOCK_SHARED to prevent new delayed allocation extents from being created. Test 083 on a 512 byte block size filesystem was used to reproduce the problem, because it only had a 5s run timeand would usually fail every 3-4 runs. This test is exercising ENOSPC behaviour by running fsstress on a nearly full filesystem. The following trace extract shows the final few events on the inode that tripped the assert: xfs_ilock: flags ILOCK_EXCL caller xfs_setfilesize xfs_setfilesize: isize 0x180000 disize 0x12d400 offset 0x17e200 count 7680 file size updated to 0x180000 by IO completion xfs_ilock: flags ILOCK_EXCL caller xfs_iomap_write_delay xfs_iext_insert: state idx 3 offset 3072 block 4503599627239432 count 1 flag 0 caller xfs_bmap_add_extent_hole_delay xfs_get_blocks_alloc: size 0x180000 offset 0x180000 count 512 type startoff 0xc00 startblock -1 blockcount 0x1 xfs_ilock: flags ILOCK_EXCL caller __xfs_get_blocks delalloc write, adding a single block at offset 0x180000 xfs_delalloc_enospc: isize 0x180000 disize 0x180000 offset 0x180200 count 512 ENOSPC trying to allocate a dellalloc block at offset 0x180200 xfs_ilock: flags ILOCK_EXCL caller xfs_iomap_write_delay xfs_get_blocks_alloc: size 0x180000 offset 0x180200 count 512 type startoff 0xc00 startblock -1 blockcount 0x2 And succeeding on retry after flushing dirty inodes. xfs_ilock: flags ILOCK_EXCL caller __xfs_get_blocks xfs_delalloc_enospc: isize 0x180000 disize 0x180000 offset 0x180400 count 512 ENOSPC trying to allocate a dellalloc block at offset 0x180400 xfs_ilock: flags ILOCK_EXCL caller xfs_iomap_write_delay xfs_delalloc_enospc: isize 0x180000 disize 0x180000 offset 0x180400 count 512 And failing the retry, giving a real ENOSPC error. xfs_ilock: flags ILOCK_EXCL caller xfs_vm_write_failed ^^^^^^^^^^^^^^^^^^^ The smoking gun - the write being failed and cleaning up delalloc blocks beyond EOF allocated by the failed write. xfs_getattr: xfs_ilock: flags IOLOCK_SHARED caller xfs_getbmap xfs_ilock: flags ILOCK_SHARED caller xfs_ilock_map_shared And that's where we died almost immediately afterwards. xfs_bmapi_read() found delalloc extent beyond current file in memory file size. Some debug I added to xfs_getbmap() showed the state just before the assert failure: ino 0x80e48: off 0xc00, fsb 0xffffffffffffffff, len 0x1, size 0x180000 start_fsb 0x106, end_fsb 0x638 ino flags 0x2 nex 0xd bmvcnt 0x555, len 0x3c58a6f23c0bf1, start 0xc00 ext 0: off 0x1fc, fsb 0x24782, len 0x254 ext 1: off 0x450, fsb 0x40851, len 0x30 ext 2: off 0x480, fsb 0xd99, len 0x1b8 ext 3: off 0x92f, fsb 0x4099a, len 0x3b ext 4: off 0x96d, fsb 0x41844, len 0x98 ext 5: off 0xbf1, fsb 0x408ab, len 0xf which shows that we found a single delalloc block beyond EOF (first line of output) when we were returning the map for a length somewhere around 10^16 bytes long (second line), and the on-disk extents showed they didn't go past EOF (last lines). Further debug added to xfs_vm_write_failed() showed this happened when punching out delalloc blocks beyond the end of the file after the failed write: [ 132.606693] ino 0x80e48: vwf to 0x181000, sze 0x180000 [ 132.609573] start_fsb 0xc01, end_fsb 0xc08 It punched the range 0xc01 -> 0xc08, but the range we really need to punch is 0xc00 -> 0xc07 (8 blocks from 0xc00) as this testing was run on a 512 byte block size filesystem (8 blocks per page). the punch from is 0xc00. So end_fsb is correct, but start_fsb is wrong as we punch from start_fsb for (end_fsb - start_fsb) blocks. Hence we are not punching the delalloc block beyond EOF in the case. The fix is simple - it's a silly off-by-one mistake in calculating the range. It's especially silly because the macro used to calculate the start_fsb already takes into account the case where the inode size is an exact multiple of the filesystem block size... Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NEric Sandeen <sandeen@redhat.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
由 Dave Chinner 提交于
For the direct IO write path, we only really need the ilock to be taken in exclusive mode during IO submission if we need to do extent allocation instead of all the time. Change the block mapping code to take the ilock in shared mode for the initial block mapping, and only retake it exclusively when we actually have to perform extent allocations. We were already dropping the ilock for the transaction allocation, so this doesn't introduce new race windows. Based on an earlier patch from Dave Chinner. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NMark Tinguely <tinguely@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 14 3月, 2012 1 次提交
-
-
由 Christoph Hellwig 提交于
Do not use unlogged metadata updates and the VFS dirty bit for updating the file size after writeback. In addition to causing various problems with updates getting delayed for far too long this also drags in the unscalable VFS dirty tracking, and is one of the few remaining unlogged metadata updates. Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NMark Tinguely <tinguely@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 06 3月, 2012 3 次提交
-
-
由 Christoph Hellwig 提交于
If we convert and unwritten extent past the current i_size log the size update as part of the extent manipulation transactions instead of doing an unlogged metadata update later. Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NMark Tinguely <tinguely@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
由 Christoph Hellwig 提交于
Replace xfs_ioend_new_eof with a new inline xfs_new_eof helper that doesn't require and ioend, and is available also outside of xfs_aops.c. Also make the code a bit more clear by using a normal if statement instead of a slightly misleading MIN(). Reviewed-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NMark Tinguely <tinguely@sgi.com> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NBen Myers <bpm@sgi.com>
-
由 Christoph Hellwig 提交于
The new concurrency managed workqueues are cheap enough that we can create per-filesystem instead of global workqueues. This allows us to remove the trylock or defer scheme on the ilock, which is not helpful once we have outstanding log reservations until finishing a size update. Also allow the default concurrency on this workqueues so that I/O completions blocking on the ilock for one inode do not block process for another inode. Reviewed-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NMark Tinguely <tinguely@sgi.com> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 18 1月, 2012 2 次提交
-
-
由 Christoph Hellwig 提交于
Now that we use the VFS i_size field throughout XFS there is no need for the i_new_size field any more given that the VFS i_size field gets updated in ->write_end before unlocking the page, and thus is always uptodate when writeback could see a page. Removing i_new_size also has the advantage that we will never have to trim back di_size during a failed buffered write, given that it never gets updated past i_size. Note that currently the generic direct I/O code only updates i_size after calling our end_io handler, which requires a small workaround to make sure di_size actually makes it to disk. I hope to fix this properly in the generic code. A downside is that we lose the support for parallel non-overlapping O_DIRECT appending writes that recently was added. I don't think keeping the complex and fragile i_new_size infrastructure for this is a good tradeoff - if we really care about parallel appending writers we should investigate turning the iolock into a range lock, which would also allow for parallel non-overlapping buffered writers. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
由 Christoph Hellwig 提交于
There is no fundamental need to keep an in-memory inode size copy in the XFS inode. We already have the on-disk value in the dinode, and the separate in-memory copy that we need for regular files only in the XFS inode. Remove the xfs_inode i_size field and change the XFS_ISIZE macro to use the VFS inode i_size field for regular files. Switch code that was directly accessing the i_size field in the xfs_inode to XFS_ISIZE, or in cases where we are limited to regular files direct access of the VFS inode i_size field. This also allows dropping some fairly complicated code in the write path which dealt with keeping the xfs_inode i_size uptodate with the VFS i_size that is getting updated inside ->write_end. Note that we do not bother resetting the VFS i_size when truncating a file that gets freed to zero as there is no point in doing so because the VFS inode is no longer in use at this point. Just relax the assert in xfs_ifree to only check the on-disk size instead. Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 09 11月, 2011 1 次提交
-
-
由 Christoph Hellwig 提交于
Ensure ioend->io_error gets propagated back to e.g. AIO completions. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NAlex Elder <aelder@sgi.com>
-
- 01 11月, 2011 1 次提交
-
-
由 Mel Gorman 提交于
Direct reclaim should never writeback pages. For now, handle the situation and warn about it. Ultimately, this will be a BUG_ON. Signed-off-by: NMel Gorman <mgorman@suse.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Alex Elder <aelder@sgi.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 12 10月, 2011 4 次提交
-
-
由 Dave Chinner 提交于
xfs_bmapi() currently handles both extent map reading and allocation. As a result, the code is littered with "if (wr)" branches to conditionally do allocation operations if required. This makes the code much harder to follow and causes significant indent issues with the code. Given that read mapping is much simpler than allocation, we can split out read mapping from xfs_bmapi() and reuse the logic that we have already factored out do do all the hard work of handling the extent map manipulations. The results in a much simpler function for the common extent read operations, and will allow the allocation code to be simplified in another commit. Once xfs_bmapi_read() is implemented, convert all the callers of xfs_bmapi() that are only reading extents to use the new function. Signed-off-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NAlex Elder <aelder@sgi.com>
-
由 Christoph Hellwig 提交于
Return unwritten extent conversion errors to aio_complete. Skip both unwritten extent conversion and size updates if we had an I/O error or the filesystem has been shut down. Return -EIO to the aio/buffer completion handlers in case of a forced shutdown. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NAlex Elder <aelder@sgi.com>
-
由 Christoph Hellwig 提交于
We now have an i_dio_count filed and surrounding infrastructure to wait for direct I/O completion instead of i_icount, and we have never needed to iocount waits for buffered I/O given that we only set the page uptodate after finishing all required work. Thus remove i_iocount, and replace the actually needed waits with calls to inode_dio_wait. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NAlex Elder <aelder@sgi.com>
-
由 Christoph Hellwig 提交于
There is no reason to queue up ioends for processing in user context unless we actually need it. Just complete ioends that do not convert unwritten extents or need a size update from the end_io context. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NAlex Elder <aelder@sgi.com>
-