- 07 3月, 2014 1 次提交
-
-
由 Dave Chinner 提交于
When we map pages in the buffer cache, we can do so in GFP_NOFS contexts. However, the vmap interfaces do not provide any method of communicating this information to memory reclaim, and hence we get lockdep complaining about it regularly and occassionally see hangs that may be vmap related reclaim deadlocks. We can also see these same problems from anywhere where we use vmalloc for a large buffer (e.g. attribute code) inside a transaction context. A typical lockdep report shows up as a reclaim state warning like so: [14046.101458] ================================= [14046.102850] [ INFO: inconsistent lock state ] [14046.102850] 3.14.0-rc4+ #2 Not tainted [14046.102850] --------------------------------- [14046.102850] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. [14046.102850] kswapd0/14 [HC0[0]:SC0[0]:HE1:SE1] takes: [14046.102850] (&xfs_dir_ilock_class){++++?+}, at: [<791a04bb>] xfs_ilock+0xff/0x16a [14046.102850] {RECLAIM_FS-ON-W} state was registered at: [14046.102850] [<7904cdb1>] mark_held_locks+0x81/0xe7 [14046.102850] [<7904d390>] lockdep_trace_alloc+0x5c/0xb4 [14046.102850] [<790c2c28>] kmem_cache_alloc_trace+0x2b/0x11e [14046.102850] [<790ba7f4>] vm_map_ram+0x119/0x3e6 [14046.102850] [<7914e124>] _xfs_buf_map_pages+0x5b/0xcf [14046.102850] [<7914ed74>] xfs_buf_get_map+0x67/0x13f [14046.102850] [<7917506f>] xfs_attr_rmtval_set+0x396/0x4d5 [14046.102850] [<7916e8bb>] xfs_attr_leaf_addname+0x18f/0x37d [14046.102850] [<7916ed9e>] xfs_attr_set_int+0x2f5/0x3e8 [14046.102850] [<7916eefc>] xfs_attr_set+0x6b/0x74 [14046.102850] [<79168355>] xfs_xattr_set+0x61/0x81 [14046.102850] [<790e5b10>] generic_setxattr+0x59/0x68 [14046.102850] [<790e4c06>] __vfs_setxattr_noperm+0x58/0xce [14046.102850] [<790e4d0a>] vfs_setxattr+0x8e/0x92 [14046.102850] [<790e4ddd>] setxattr+0xcf/0x159 [14046.102850] [<790e5423>] SyS_lsetxattr+0x88/0xbb [14046.102850] [<79268438>] sysenter_do_call+0x12/0x36 Now, we can't completely remove these traces - mainly because vm_map_ram() will do GFP_KERNEL allocation and that generates the above warning before we get into the reclaim code, but we can turn them all into false positive warnings. To do that, use the method that DM and other IO context code uses to avoid this problem: there is a process flag to tell memory reclaim not to do IO that we can set appropriately. That prevents GFP_KERNEL context reclaim being done from deep inside the vmalloc code in places we can't directly pass a GFP_NOFS context to. That interface has a pair of wrapper functions: memalloc_noio_save() and memalloc_noio_restore(). Adding them around vm_map_ram and the vzalloc call in kmem_alloc_large() will prevent deadlocks and most lockdep reports for this issue. Also, convert the vzalloc() call in kmem_alloc_large() to use __vmalloc() so that we can pass the correct gfp context to the data page allocation routine inside __vmalloc() so that it is clear that GFP_NOFS context is important to this vmalloc call. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
- 25 1月, 2014 3 次提交
-
-
由 Eric Sandeen 提交于
Some time ago, mkfs.xfs started picking the storage physical sector size as the default filesystem "sector size" in order to avoid RMW costs incurred by doing IOs at logical sector size alignments. However, this means that for a filesystem made with i.e. a 4k sector size on an "advanced format" 4k/512 disk, 512-byte direct IOs are no longer allowed. This means that XFS has essentially turned this AF drive into a hard 4K device, from the filesystem on up. XFS's mkfs-specified "sector size" is really just controlling the minimum size & alignment of filesystem metadata. There is no real need to tightly couple XFS's minimal metadata size to the minimum allowed direct IO size; XFS can continue doing metadata in optimal sizes, but still allow smaller DIOs for apps which issue them, for whatever reason. This patch adds a new field to the xfs_buftarg, so that we now track 2 sizes: 1) The metadata sector size, which is the minimum unit and alignment of IO which will be performed by metadata operations. 2) The device logical sector size The first is used internally by the file system for metadata alignment and IOs. The second is used for the minimum allowed direct IO alignment. This has passed xfstests on filesystems made with 4k sectors, including when run under the patch I sent to ignore XFS_IOC_DIOINFO, and issue 512 DIOs anyway. I also directly tested end of block behavior on preallocated, sparse, and existing files when we do a 512 IO into a 4k file on a 4k-sector filesystem, to be sure there were no unexpected behaviors. Signed-off-by: NEric Sandeen <sandeen@redhat.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
由 Eric Sandeen 提交于
In preparation for adding new members to the structure, give these old ones more descriptive names: bt_ssize -> bt_meta_sectorsize bt_smask -> bt_meta_sectormask Signed-off-by: NEric Sandeen <sandeen@redhat.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
由 Eric Sandeen 提交于
Clean up the xfs_buftarg structure a bit: - remove bt_bsize which is never used - replace bt_sshift with bt_ssize; we only ever shift it back Signed-off-by: NEric Sandeen <sandeen@redhat.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 17 12月, 2013 2 次提交
-
-
由 Dave Chinner 提交于
If we are doing aysnc writeback of metadata, we can get write errors but have nobody to report them to. At the moment, we simply attempt to reissue the write from io completion in the hope that it's a transient error. When it's not a transient error, the buffer is stuck forever in this loop, and we cannot break out of it. Eventually, unmount will hang because the AIL cannot be emptied and everything goes downhill from them. To solve this problem, only retry the write IO once before aborting it. We don't throw the buffer away because some transient errors can last minutes (e.g. FC path failover) or even hours (thin provisioned devices that have run out of backing space) before they go away. Hence we really want to keep trying until we can't try any more. Because the buffer was not cleaned, however, it does not get removed from the AIL and hence the next pass across the AIL will start IO on it again. As such, we still get the "retry forever" semantics that we currently have, but we allow other access to the buffer in the mean time. Meanwhile the filesystem can continue to modify the buffer and relog it, so the IO errors won't hang the log or the filesystem. Now when we are pushing the AIL, we can see all these "permanent IO error" buffers and we can issue a warning about failures before we retry the IO. We can also catch these buffers when unmounting an issue a corruption warning, too. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NBen Myers <bpm@sgi.com>
-
由 Christoph Hellwig 提交于
The xfsbdstrat helper is a small but useless wrapper for xfs_buf_iorequest that handles the case of a shut down filesystem. Most of the users have private, uncached buffers that can just be freed in this case, but the complex error handling in xfs_bioerror_relse messes up the case when it's called without a locked buffer. Remove xfsbdstrat and opencode the error handling in the callers. All but one can simply return an error and don't need to deal with buffer state, and the one caller that cares about the buffer state could do with a major cleanup as well, but we'll defer that to later. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NBen Myers <bpm@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 05 12月, 2013 1 次提交
-
-
由 Eric Sandeen 提交于
The "verbose" argument to xfs_setsize_buftarg_flags() has been unused since: ffe37436 xfs: stop using the page cache to back the buffer cache Remove it, and fold the function into xfs_setsize_buftarg() now that there's no need for different types of callers. Fix inconsistent comment spacing while we're at it. Signed-off-by: NEric Sandeen <sandeen@redhat.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 24 11月, 2013 1 次提交
-
-
由 Kent Overstreet 提交于
Immutable biovecs are going to require an explicit iterator. To implement immutable bvecs, a later patch is going to add a bi_bvec_done member to this struct; for now, this patch effectively just renames things. Signed-off-by: NKent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Ed L. Cashin" <ecashin@coraid.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@inktank.com> Cc: ceph-devel@vger.kernel.org Cc: Joshua Morris <josh.h.morris@us.ibm.com> Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Neil Brown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux390@de.ibm.com Cc: Boaz Harrosh <bharrosh@panasas.com> Cc: Benny Halevy <bhalevy@tonian.com> Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <chris.mason@fusionio.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: Joern Engel <joern@logfs.org> Cc: Prasad Joshi <prasadjoshi.linux@gmail.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Ben Myers <bpm@sgi.com> Cc: xfs@oss.sgi.com Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Len Brown <len.brown@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Cc: "Roger Pau Monné" <roger.pau@citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Ian Campbell <Ian.Campbell@citrix.com> Cc: Sebastian Ott <sebott@linux.vnet.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchand@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Peng Tao <tao.peng@emc.com> Cc: Andy Adamson <andros@netapp.com> Cc: fanchaoting <fanchaoting@cn.fujitsu.com> Cc: Jie Liu <jeff.liu@oracle.com> Cc: Sunil Mushran <sunil.mushran@gmail.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Namjae Jeon <namjae.jeon@samsung.com> Cc: Pankaj Kumar <pankaj.km@samsung.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Mel Gorman <mgorman@suse.de>6
-
- 24 10月, 2013 1 次提交
-
-
由 Dave Chinner 提交于
xfs_trans.h has a dependency on xfs_log.h for a couple of structures. Most code that does transactions doesn't need to know anything about the log, but this dependency means that they have to include xfs_log.h. Decouple the xfs_trans.h and xfs_log.h header files and clean up the includes to be in dependency order. In doing this, remove the direct include of xfs_trans_reserve.h from xfs_trans.h so that we remove the dependency between xfs_trans.h and xfs_mount.h. Hence the xfs_trans.h include can be moved to the indicate the actual dependencies other header files have on it. Note that these are kernel only header files, so this does not translate to any userspace changes at all. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NBen Myers <bpm@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 18 10月, 2013 1 次提交
-
-
由 Eric Sandeen 提交于
__xfs_printk adds its own "\n". Having it in the original string leads to unintentional blank lines from these messages. Most format strings have no newline, but a few do, leading to i.e.: [ 7347.119911] XFS (sdb2): Access to block zero in inode 132 start_block: 0 start_off: 0 blkcnt: 0 extent-state: 0 lastx: 1a05 [ 7347.119911] [ 7347.119919] XFS (sdb2): Access to block zero in inode 132 start_block: 0 start_off: 0 blkcnt: 0 extent-state: 0 lastx: 1a05 [ 7347.119919] Fix them all. Signed-off-by: NEric Sandeen <sandeen@redhat.com> Reviewed-by: NMark Tinguely <tinguely@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 11 9月, 2013 5 次提交
-
-
由 Glauber Costa 提交于
This patch adds the missing call to list_lru_destroy (spotted by Li Zhong) and moves the deletion to after the shrinker is unregistered, as correctly spotted by Dave Signed-off-by: NGlauber Costa <glommer@openvz.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Dave Chinner <dchinner@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Glauber Costa 提交于
We currently use a compile-time constant to size the node array for the list_lru structure. Due to this, we don't need to allocate any memory at initialization time. But as a consequence, the structures that contain embedded list_lru lists can become way too big (the superblock for instance contains two of them). This patch aims at ameliorating this situation by dynamically allocating the node arrays with the firmware provided nr_node_ids. Signed-off-by: NGlauber Costa <glommer@openvz.org> Cc: Dave Chinner <dchinner@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: Arve Hjønnevåg <arve@android.com> Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Rientjes <rientjes@google.com> Cc: Gleb Natapov <gleb@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: J. Bruce Fields <bfields@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Kent Overstreet <koverstreet@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Thomas Hellstrom <thellstrom@vmware.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Dave Chinner 提交于
In converting the buffer lru lists to use the generic code, the locking for marking the buffers as on the dispose list was lost. This results in confusion in LRU buffer tracking and acocunting, resulting in reference counts being mucked up and filesystem beig unmountable. To fix this, introduce an internal buffer spinlock to protect the state field that holds the dispose list information. Because there is now locking needed around xfs_buf_lru_add/del, and they are used in exactly one place each two lines apart, get rid of the wrappers and code the logic directly in place. Further, the LRU emptying code used on unmount is less than optimal. Convert it to use a dispose list as per a normal shrinker walk, and repeat the walk that fills the dispose list until the LRU is empty. Thi avoids needing to drop and regain the LRU lock for every item being freed, and allows the same logic as the shrinker isolate call to be used. Simpler, easier to understand. Signed-off-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NGlauber Costa <glommer@openvz.org> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: Arve Hjønnevåg <arve@android.com> Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Rientjes <rientjes@google.com> Cc: Gleb Natapov <gleb@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: J. Bruce Fields <bfields@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Kent Overstreet <koverstreet@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Thomas Hellstrom <thellstrom@vmware.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Andrew Morton 提交于
fix warnings Cc: Dave Chinner <dchinner@redhat.com> Cc: Glauber Costa <glommer@openvz.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Dave Chinner 提交于
Convert the buftarg LRU to use the new generic LRU list and take advantage of the functionality it supplies to make the buffer cache shrinker node aware. Signed-off-by: NGlauber Costa <glommer@openvz.org> Signed-off-by: NDave Chinner <dchinner@redhat.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: Arve Hjønnevåg <arve@android.com> Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Rientjes <rientjes@google.com> Cc: Gleb Natapov <gleb@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: J. Bruce Fields <bfields@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Kent Overstreet <koverstreet@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Thomas Hellstrom <thellstrom@vmware.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 21 8月, 2013 2 次提交
-
-
由 Zhi Yong Wu 提交于
Signed-off-by: NZhi Yong Wu <wuzhy@linux.vnet.ibm.com> Reviewed-by: NMark Tinguely <tinguely@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
由 Zhi Yong Wu 提交于
Signed-off-by: NZhi Yong Wu <wuzhy@linux.vnet.ibm.com> Reviewed-by: NMark Tinguely <tinguely@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 13 8月, 2013 1 次提交
-
-
由 Dave Chinner 提交于
The transaction reservation size calculations is used by both kernel and userspace, but most of the transaction code in xfs_trans.c is kernel specific. Split all the transaction reservation code out into it's own files to make sharing with userspace simpler. This just leaves kernel-only definitions in xfs_trans.h, so it doesn't need to be shared with userspace anymore, either. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NMark Tinguely <tinguely@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 31 5月, 2013 1 次提交
-
-
由 Dave Chinner 提交于
Note: this changes the on-disk remote attribute format. I assert that this is OK to do as CRCs are marked experimental and the first kernel it is included in has not yet reached release yet. Further, the userspace utilities are still evolving and so anyone using this stuff right now is a developer or tester using volatile filesystems for testing this feature. Hence changing the format right now to save longer term pain is the right thing to do. The fundamental change is to move from a header per extent in the attribute to a header per filesytem block in the attribute. This means there are more header blocks and the parsing of the attribute data is slightly more complex, but it has the advantage that we always know the size of the attribute on disk based on the length of the data it contains. This is where the header-per-extent method has problems. We don't know the size of the attribute on disk without first knowing how many extents are used to hold it. And we can't tell from a mapping lookup, either, because remote attributes can be allocated contiguously with other attribute blocks and so there is no obvious way of determining the actual size of the atribute on disk short of walking and mapping buffers. The problem with this approach is that if we map a buffer incorrectly (e.g. we make the last buffer for the attribute data too long), we then get buffer cache lookup failure when we map it correctly. i.e. we get a size mismatch on lookup. This is not necessarily fatal, but it's a cache coherency problem that can lead to returning the wrong data to userspace or writing the wrong data to disk. And debug kernels will assert fail if this occurs. I found lots of niggly little problems trying to fix this issue on a 4k block size filesystem, finally getting it to pass with lots of fixes. The thing is, 1024 byte filesystems still failed, and it was getting really complex handling all the corner cases that were showing up. And there were clearly more that I hadn't found yet. It is complex, fragile code, and if we don't fix it now, it will be complex, fragile code forever more. Hence the simple fix is to add a header to each filesystem block. This gives us the same relationship between the attribute data length and the number of blocks on disk as we have without CRCs - it's a linear mapping and doesn't require us to guess anything. It is simple to implement, too - the remote block count calculated at lookup time can be used by the remote attribute set/get/remove code without modification for both CRC and non-CRC filesystems. The world becomes sane again. Because the copy-in and copy-out now need to iterate over each filesystem block, I moved them into helper functions so we separate the block mapping and buffer manupulations from the attribute data and CRC header manipulations. The code becomes much clearer as a result, and it is a lot easier to understand and debug. It also appears to be much more robust - once it worked on 4k block size filesystems, it has worked without failure on 1k block size filesystems, too. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NBen Myers <bpm@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com> (cherry picked from commit ad1858d7)
-
- 25 5月, 2013 1 次提交
-
-
由 Dave Chinner 提交于
There are several places where we use KM_SLEEP allocation contexts and use the fact that they are called from transaction context to add KM_NOFS where appropriate. Unfortunately, there are several places where the code makes this assumption but can be called from outside transaction context but with filesystem locks held. These places need explicit KM_NOFS annotations to avoid lockdep complaining about reclaim contexts. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NBen Myers <bpm@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com> (cherry picked from commit ac14876c)
-
- 24 5月, 2013 1 次提交
-
-
由 Dave Chinner 提交于
Note: this changes the on-disk remote attribute format. I assert that this is OK to do as CRCs are marked experimental and the first kernel it is included in has not yet reached release yet. Further, the userspace utilities are still evolving and so anyone using this stuff right now is a developer or tester using volatile filesystems for testing this feature. Hence changing the format right now to save longer term pain is the right thing to do. The fundamental change is to move from a header per extent in the attribute to a header per filesytem block in the attribute. This means there are more header blocks and the parsing of the attribute data is slightly more complex, but it has the advantage that we always know the size of the attribute on disk based on the length of the data it contains. This is where the header-per-extent method has problems. We don't know the size of the attribute on disk without first knowing how many extents are used to hold it. And we can't tell from a mapping lookup, either, because remote attributes can be allocated contiguously with other attribute blocks and so there is no obvious way of determining the actual size of the atribute on disk short of walking and mapping buffers. The problem with this approach is that if we map a buffer incorrectly (e.g. we make the last buffer for the attribute data too long), we then get buffer cache lookup failure when we map it correctly. i.e. we get a size mismatch on lookup. This is not necessarily fatal, but it's a cache coherency problem that can lead to returning the wrong data to userspace or writing the wrong data to disk. And debug kernels will assert fail if this occurs. I found lots of niggly little problems trying to fix this issue on a 4k block size filesystem, finally getting it to pass with lots of fixes. The thing is, 1024 byte filesystems still failed, and it was getting really complex handling all the corner cases that were showing up. And there were clearly more that I hadn't found yet. It is complex, fragile code, and if we don't fix it now, it will be complex, fragile code forever more. Hence the simple fix is to add a header to each filesystem block. This gives us the same relationship between the attribute data length and the number of blocks on disk as we have without CRCs - it's a linear mapping and doesn't require us to guess anything. It is simple to implement, too - the remote block count calculated at lookup time can be used by the remote attribute set/get/remove code without modification for both CRC and non-CRC filesystems. The world becomes sane again. Because the copy-in and copy-out now need to iterate over each filesystem block, I moved them into helper functions so we separate the block mapping and buffer manupulations from the attribute data and CRC header manipulations. The code becomes much clearer as a result, and it is a lot easier to understand and debug. It also appears to be much more robust - once it worked on 4k block size filesystems, it has worked without failure on 1k block size filesystems, too. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NBen Myers <bpm@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 21 5月, 2013 1 次提交
-
-
由 Dave Chinner 提交于
There are several places where we use KM_SLEEP allocation contexts and use the fact that they are called from transaction context to add KM_NOFS where appropriate. Unfortunately, there are several places where the code makes this assumption but can be called from outside transaction context but with filesystem locks held. These places need explicit KM_NOFS annotations to avoid lockdep complaining about reclaim contexts. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NBen Myers <bpm@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 19 3月, 2013 1 次提交
-
-
由 Dave Chinner 提交于
Failed buffer readahead can leave the buffer in the cache marked with an error. Most callers that then issue a subsequent read on the buffer do not zero the b_error field out, and so we may incorectly detect an error during IO completion due to the stale error value left on the buffer. Avoid this problem by zeroing the error before IO submission. This ensures that the only IO errors that are detected those captured from are those captured from bio submission or completion. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NMark Tinguely <tinguely@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com> (cherry picked from commit c163f9a1)
-
- 15 3月, 2013 1 次提交
-
-
由 Dave Chinner 提交于
Failed buffer readahead can leave the buffer in the cache marked with an error. Most callers that then issue a subsequent read on the buffer do not zero the b_error field out, and so we may incorectly detect an error during IO completion due to the stale error value left on the buffer. Avoid this problem by zeroing the error before IO submission. This ensures that the only IO errors that are detected those captured from are those captured from bio submission or completion. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NMark Tinguely <tinguely@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 08 3月, 2013 1 次提交
-
-
由 Dave Chinner 提交于
When we read a buffer, we might get an error from the underlying block device and not the real data. Hence if we get an IO error, we shouldn't run the verifier but instead just pass the IO error straight through. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NMark Tinguely <tinguely@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 15 2月, 2013 1 次提交
-
-
由 Brian Foster 提交于
The trylock log force invoked via xfs_buf_item_push() can attempt to acquire xa_lock, thus leading to a recursion bug when called with xa_lock held. This log force was originally added to xfs_buf_trylock() to address xfsaild stalls due to pinned and stale buffers. Since the addition of this behavior, the log item pushing code had been reworked to detect and track pinned items to inform xfsaild to issue a log force itself when necessary. As such, the log force on trylock failure is redundant and safe to remove. Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 29 1月, 2013 2 次提交
-
-
由 Dave Chinner 提交于
When the new inode verify in xfs_iread() fails, the create transaction is aborted and a shutdown occurs. The subsequent unmount then hangs in xfs_wait_buftarg() on a buffer that has an elevated hold count. Debug showed that it was an AGI buffer getting stuck: [ 22.576147] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck [ 22.976213] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck [ 23.376206] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck [ 23.776325] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck The trace of this buffer leading up to the shutdown (trimmed for brevity) looks like: xfs_buf_init: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_get_map xfs_buf_get: bno 0x2 len 0x200 hold 1 caller xfs_buf_read_map xfs_buf_read: bno 0x2 len 0x200 hold 1 caller xfs_trans_read_buf_map xfs_buf_iorequest: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read xfs_buf_hold: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_iorequest xfs_buf_rele: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_iorequest xfs_buf_iowait: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read xfs_buf_ioerror: bno 0x2 len 0x200 hold 1 caller xfs_buf_bio_end_io xfs_buf_iodone: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_ioend xfs_buf_iowait_done: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read xfs_buf_hold: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_item_init xfs_trans_read_buf: bno 0x2 len 0x200 hold 2 recur 0 refcount 1 xfs_trans_brelse: bno 0x2 len 0x200 hold 2 recur 0 refcount 1 xfs_buf_item_relse: bno 0x2 nblks 0x1 hold 2 caller xfs_trans_brelse xfs_buf_rele: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_relse xfs_buf_unlock: bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse xfs_buf_rele: bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse xfs_buf_trylock: bno 0x2 nblks 0x1 hold 2 caller _xfs_buf_find xfs_buf_find: bno 0x2 len 0x200 hold 2 caller xfs_buf_get_map xfs_buf_get: bno 0x2 len 0x200 hold 2 caller xfs_buf_read_map xfs_buf_read: bno 0x2 len 0x200 hold 2 caller xfs_trans_read_buf_map xfs_buf_hold: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_init xfs_trans_read_buf: bno 0x2 len 0x200 hold 3 recur 0 refcount 1 xfs_trans_log_buf: bno 0x2 len 0x200 hold 3 recur 0 refcount 1 xfs_buf_item_unlock: bno 0x2 len 0x200 hold 3 flags DIRTY liflags ABORTED xfs_buf_unlock: bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock xfs_buf_rele: bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock And that is the AGI buffer from cold cache read into memory to transaction abort. You can see at transaction abort the bli is dirty and only has a single reference. The item is not pinned, and it's not in the AIL. Hence the only reference to it is this transaction. The problem is that the xfs_buf_item_unlock() call is dropping the last reference to the xfs_buf_log_item attached to the buffer (which holds a reference to the buffer), but it is not freeing the xfs_buf_log_item. Hence nothing will ever release the buffer, and the unmount hangs waiting for this reference to go away. The fix is simple - xfs_buf_item_unlock needs to detect the last reference going away in this case and free the xfs_buf_log_item to release the reference it holds on the buffer. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NBen Myers <bpm@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
由 Dave Chinner 提交于
When _xfs_buf_find is passed an out of range address, it will fail to find a relevant struct xfs_perag and oops with a null dereference. This can happen when trying to walk a filesystem with a metadata inode that has a partially corrupted extent map (i.e. the block number returned is corrupt, but is otherwise intact) and we try to read from the corrupted block address. In this case, just fail the lookup. If it is readahead being issued, it will simply not be done, but if it is real read that fails we will get an error being reported. Ideally this case should result in an EFSCORRUPTED error being reported, but we cannot return an error through xfs_buf_read() or xfs_buf_get() so this lookup failure may result in ENOMEM or EIO errors being reported instead. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NBen Myers <bpm@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 26 1月, 2013 1 次提交
-
-
由 Dave Chinner 提交于
When the new inode verify in xfs_iread() fails, the create transaction is aborted and a shutdown occurs. The subsequent unmount then hangs in xfs_wait_buftarg() on a buffer that has an elevated hold count. Debug showed that it was an AGI buffer getting stuck: [ 22.576147] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck [ 22.976213] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck [ 23.376206] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck [ 23.776325] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck The trace of this buffer leading up to the shutdown (trimmed for brevity) looks like: xfs_buf_init: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_get_map xfs_buf_get: bno 0x2 len 0x200 hold 1 caller xfs_buf_read_map xfs_buf_read: bno 0x2 len 0x200 hold 1 caller xfs_trans_read_buf_map xfs_buf_iorequest: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read xfs_buf_hold: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_iorequest xfs_buf_rele: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_iorequest xfs_buf_iowait: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read xfs_buf_ioerror: bno 0x2 len 0x200 hold 1 caller xfs_buf_bio_end_io xfs_buf_iodone: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_ioend xfs_buf_iowait_done: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read xfs_buf_hold: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_item_init xfs_trans_read_buf: bno 0x2 len 0x200 hold 2 recur 0 refcount 1 xfs_trans_brelse: bno 0x2 len 0x200 hold 2 recur 0 refcount 1 xfs_buf_item_relse: bno 0x2 nblks 0x1 hold 2 caller xfs_trans_brelse xfs_buf_rele: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_relse xfs_buf_unlock: bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse xfs_buf_rele: bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse xfs_buf_trylock: bno 0x2 nblks 0x1 hold 2 caller _xfs_buf_find xfs_buf_find: bno 0x2 len 0x200 hold 2 caller xfs_buf_get_map xfs_buf_get: bno 0x2 len 0x200 hold 2 caller xfs_buf_read_map xfs_buf_read: bno 0x2 len 0x200 hold 2 caller xfs_trans_read_buf_map xfs_buf_hold: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_init xfs_trans_read_buf: bno 0x2 len 0x200 hold 3 recur 0 refcount 1 xfs_trans_log_buf: bno 0x2 len 0x200 hold 3 recur 0 refcount 1 xfs_buf_item_unlock: bno 0x2 len 0x200 hold 3 flags DIRTY liflags ABORTED xfs_buf_unlock: bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock xfs_buf_rele: bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock And that is the AGI buffer from cold cache read into memory to transaction abort. You can see at transaction abort the bli is dirty and only has a single reference. The item is not pinned, and it's not in the AIL. Hence the only reference to it is this transaction. The problem is that the xfs_buf_item_unlock() call is dropping the last reference to the xfs_buf_log_item attached to the buffer (which holds a reference to the buffer), but it is not freeing the xfs_buf_log_item. Hence nothing will ever release the buffer, and the unmount hangs waiting for this reference to go away. The fix is simple - xfs_buf_item_unlock needs to detect the last reference going away in this case and free the xfs_buf_log_item to release the reference it holds on the buffer. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NBen Myers <bpm@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 25 1月, 2013 1 次提交
-
-
由 Dave Chinner 提交于
When _xfs_buf_find is passed an out of range address, it will fail to find a relevant struct xfs_perag and oops with a null dereference. This can happen when trying to walk a filesystem with a metadata inode that has a partially corrupted extent map (i.e. the block number returned is corrupt, but is otherwise intact) and we try to read from the corrupted block address. In this case, just fail the lookup. If it is readahead being issued, it will simply not be done, but if it is real read that fails we will get an error being reported. Ideally this case should result in an EFSCORRUPTED error being reported, but we cannot return an error through xfs_buf_read() or xfs_buf_get() so this lookup failure may result in ENOMEM or EIO errors being reported instead. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NBen Myers <bpm@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 17 1月, 2013 1 次提交
-
-
由 Mark Tinguely 提交于
Commits starting at 77c1a08f introduced a multiple segment support to xfs_buf. xfs_trans_buf_item_match() could not find a multi-segment buffer in the transaction because it was looking at the single segment block number rather than the multi-segment b_maps[0].bm.bn. This results on a recursive buffer lock that can never be satisfied. This patch: 1) Changed the remaining b_map accesses to be b_maps[0] accesses. 2) Renames the single segment b_map structure to __b_map to avoid future confusion. Signed-off-by: NMark Tinguely <tinguely@sgi.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 18 12月, 2012 1 次提交
-
-
由 Mark Tinguely 提交于
Commits starting at 77c1a08f introduced a multiple segment support to xfs_buf. xfs_trans_buf_item_match() could not find a multi-segment buffer in the transaction because it was looking at the single segment block number rather than the multi-segment b_maps[0].bm.bn. This results on a recursive buffer lock that can never be satisfied. This patch: 1) Changed the remaining b_map accesses to be b_maps[0] accesses. 2) Renames the single segment b_map structure to __b_map to avoid future confusion. Signed-off-by: NMark Tinguely <tinguely@sgi.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 17 11月, 2012 1 次提交
-
-
由 Dave Chinner 提交于
Error handling in xfs_buf_ioapply_map() does not handle IO reference counts correctly. We increment the b_io_remaining count before building the bio, but then fail to decrement it in the failure case. This leads to the buffer never running IO completion and releasing the reference that the IO holds, so at unmount we can leak the buffer. This leak is captured by this assert failure during unmount: XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/xfs_mount.c, line: 273 This is not a new bug - the b_io_remaining accounting has had this problem for a long, long time - it's just very hard to get a zero length bio being built by this code... Further, the buffer IO error can be overwritten on a multi-segment buffer by subsequent bio completions for partial sections of the buffer. Hence we should only set the buffer error status if the buffer is not already carrying an error status. This ensures that a partial IO error on a multi-segment buffer will not be lost. This part of the problem is a regression, however. cc: <stable@vger.kernel.org> Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NMark Tinguely <tinguely@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 16 11月, 2012 4 次提交
-
-
由 Dave Chinner 提交于
To separate the verifiers from iodone functions and associate read and write verifiers at the same time, introduce a buffer verifier operations structure to the xfs_buf. This avoids the need for assigning the write verifier, clearing the iodone function and re-running ioend processing in the read verifier, and gets rid of the nasty "b_pre_io" name for the write verifier function pointer. If we ever need to, it will also be easier to add further content specific callbacks to a buffer with an ops structure in place. We also avoid needing to export verifier functions, instead we can simply export the ops structures for those that are needed outside the function they are defined in. This patch also fixes a directory block readahead verifier issue it exposed. This patch also adds ops callbacks to the inode/alloc btree blocks initialised by growfs. These will need more work before they will work with CRCs. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NPhil White <pwhite@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
由 Dave Chinner 提交于
Add a callback to the buffer write path to enable verification of the buffer and CRC calculation prior to issuing the write to the underlying storage. If the callback function detects some kind of failure or error condition, it must mark the buffer with an error so that the caller can take appropriate action. In the case of xfs_buf_ioapply(), a corrupt metadta buffer willt rigger a shutdown of the filesystem, because something is clearly wrong and we can't allow corrupt metadata to be written to disk. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NPhil White <pwhite@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
由 Dave Chinner 提交于
With verification being done as an IO completion callback, different errors can be returned from a read. Uncached reads only return a buffer or NULL on failure, which means the verification error cannot be returned to the caller. Split the error handling for these reads into two - a failure to get a buffer will still return NULL, but a read error will return a referenced buffer with b_error set rather than NULL. The caller is responsible for checking the error state of the buffer returned. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NPhil White <pwhite@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
由 Dave Chinner 提交于
Add a verifier function callback capability to the buffer read interfaces. This will be used by the callers to supply a function that verifies the contents of the buffer when it is read from disk. This patch does not provide callback functions, but simply modifies the interfaces to allow them to be called. The reason for adding this to the read interfaces is that it is very difficult to tell fom the outside is a buffer was just read from disk or whether we just pulled it out of cache. Supplying a callbck allows the buffer cache to use it's internal knowledge of the buffer to execute it only when the buffer is read from disk. It is intended that the verifier functions will mark the buffer with an EFSCORRUPTED error when verification fails. This allows the reading context to distinguish a verification error from an IO error, and potentially take further actions on the buffer (e.g. attempt repair) based on the error reported. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NPhil White <pwhite@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 14 11月, 2012 1 次提交
-
-
由 Dave Chinner 提交于
Error handling in xfs_buf_ioapply_map() does not handle IO reference counts correctly. We increment the b_io_remaining count before building the bio, but then fail to decrement it in the failure case. This leads to the buffer never running IO completion and releasing the reference that the IO holds, so at unmount we can leak the buffer. This leak is captured by this assert failure during unmount: XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/xfs_mount.c, line: 273 This is not a new bug - the b_io_remaining accounting has had this problem for a long, long time - it's just very hard to get a zero length bio being built by this code... Further, the buffer IO error can be overwritten on a multi-segment buffer by subsequent bio completions for partial sections of the buffer. Hence we should only set the buffer error status if the buffer is not already carrying an error status. This ensures that a partial IO error on a multi-segment buffer will not be lost. This part of the problem is a regression, however. cc: <stable@vger.kernel.org> Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NMark Tinguely <tinguely@sgi.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 30 8月, 2012 1 次提交
-
-
由 Carlos Maiolino 提交于
While xfs_buftarg_shrink() is freeing buffers from the dispose list (filled with buffers from lru list), there is a possibility to have xfs_buf_stale() racing with it, and removing buffers from dispose list before xfs_buftarg_shrink() does it. This happens because xfs_buftarg_shrink() handle the dispose list without locking and the test condition in xfs_buf_stale() checks for the buffer being in *any* list: if (!list_empty(&bp->b_lru)) If the buffer happens to be on dispose list, this causes the buffer counter of lru list (btp->bt_lru_nr) to be decremented twice (once in xfs_buftarg_shrink() and another in xfs_buf_stale()) causing a wrong account usage of the lru list. This may cause xfs_buftarg_shrink() to return a wrong value to the memory shrinker shrink_slab(), and such account error may also cause an underflowed value to be returned; since the counter is lower than the current number of items in the lru list, a decrement may happen when the counter is 0, causing an underflow on the counter. The fix uses a new flag field (and a new buffer flag) to serialize buffer handling during the shrink process. The new flag field has been designed to use btp->bt_lru_lock/unlock instead of xfs_buf_lock/unlock mechanism. dchinner, sandeen, aquini and aris also deserve credits for this. Signed-off-by: NCarlos Maiolino <cmaiolino@redhat.com> Reviewed-by: NBen Myers <bpm@sgi.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-
- 25 8月, 2012 1 次提交
-
-
由 Carlos Maiolino 提交于
While xfs_buftarg_shrink() is freeing buffers from the dispose list (filled with buffers from lru list), there is a possibility to have xfs_buf_stale() racing with it, and removing buffers from dispose list before xfs_buftarg_shrink() does it. This happens because xfs_buftarg_shrink() handle the dispose list without locking and the test condition in xfs_buf_stale() checks for the buffer being in *any* list: if (!list_empty(&bp->b_lru)) If the buffer happens to be on dispose list, this causes the buffer counter of lru list (btp->bt_lru_nr) to be decremented twice (once in xfs_buftarg_shrink() and another in xfs_buf_stale()) causing a wrong account usage of the lru list. This may cause xfs_buftarg_shrink() to return a wrong value to the memory shrinker shrink_slab(), and such account error may also cause an underflowed value to be returned; since the counter is lower than the current number of items in the lru list, a decrement may happen when the counter is 0, causing an underflow on the counter. The fix uses a new flag field (and a new buffer flag) to serialize buffer handling during the shrink process. The new flag field has been designed to use btp->bt_lru_lock/unlock instead of xfs_buf_lock/unlock mechanism. dchinner, sandeen, aquini and aris also deserve credits for this. Signed-off-by: NCarlos Maiolino <cmaiolino@redhat.com> Reviewed-by: NBen Myers <bpm@sgi.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NBen Myers <bpm@sgi.com>
-