- 21 10月, 2011 15 次提交
-
-
由 Steven Whitehouse 提交于
Each block which is deallocated, requires a call to gfs2_rlist_add() and each of those calls was calling gfs2_blk2rgrpd() in order to figure out which rgrp the block belonged in. This can be speeded up by making use of the rgrp cached in the inode. We also reset this cached rgrp in case the block has changed rgrp. This should provide a big reduction in gfs2_blk2rgrpd() calls during deallocation. Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>
-
由 Steven Whitehouse 提交于
The recursive_scan() function only ever takes a single "bc" argument, so we might as well just call do_strip() directly from resource_scan() rather than pass it in as an argument. Also the "data" argument is always a struct strip_mine, so we can pass that in, rather than using a void pointer. This also moves do_strip() ahead of recursive_scan() so that we don't need to add a prototype. Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>
-
由 Steven Whitehouse 提交于
Given that a resource group has been locked, there is no reason why we should not be able to allocate as many blocks as are free. The al_requested parameter should really be considered as a minimum number of blocks to be available. Should this limit be overshot, there are other mechanisms which will prevent over allocation. Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>
-
由 Steven Whitehouse 提交于
This means that after the initial allocation for any inode, the last used resource group is cached in the inode for future use. This drastically reduces the number of lookups of resource groups in the common case, and this the contention on that data structure. The allocation algorithm is the same as previously, except that we always check to see if the goal block is within the cached rgrp first before going to the rbtree to look one up. Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>
-
由 Steven Whitehouse 提交于
Since we have ruled out supporting online filesystem shrink, it is possible to make the resource group list append only during the life of a super block. This gives several benefits: Firstly, we only need to read new rindex elements as they are added rather than needing to reread the whole rindex file each time one element is added. Secondly, the rindex glock can be held for much shorter periods of time, and is completely removed from the fast path for allocations. The lock is taken in shared mode only when updating the resource groups when the first allocation occurs, and after a grow has taken place. Thirdly, this results in a reduction in code size, and everything gets a lot simpler to understand in this area. Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>
-
由 Bob Peterson 提交于
Here is an update of Bob's original rbtree patch which, in addition, also resolves the rather strange ref counting that was being done relating to the bitmap blocks. Originally we had a dual system for journaling resource groups. The metadata blocks were journaled and also the rgrp itself was added to a list. The reason for adding the rgrp to the list in the journal was so that the "repolish clones" code could be run to update the free space, and potentially send any discard requests when the log was flushed. This was done by comparing the "cloned" bitmap with what had been written back on disk during the transaction commit. Due to this, there was a requirement to hang on to the rgrps' bitmap buffers until the journal had been flushed. For that reason, there was a rather complicated set up in the ->go_lock ->go_unlock functions for rgrps involving both a mutex and a spinlock (the ->sd_rindex_spin) to maintain a reference count on the buffers. However, the journal maintains a reference count on the buffers anyway, since they are being journaled as metadata buffers. So by moving the code which deals with the post-journal accounting for bitmap blocks to the metadata journaling code, we can entirely dispense with the rather strange buffer ref counting scheme and also the requirement to journal the rgrps. The net result of all this is that the ->sd_rindex_spin is left to do exactly one job, and that is to look after the rbtree or rgrps. This patch is designed to be a stepping stone towards using RCU for the rbtree of resource groups, however the reduction in the number of uses of the ->sd_rindex_spin is likely to have benefits for multi-threaded workloads, anyway. The patch retains ->go_lock and ->go_unlock for rgrps, however these maybe also be removed in future in favour of calling the functions directly where required in the code. That will allow locking of resource groups without needing to actually read them in - something that could be useful in speeding up statfs. In the mean time though it is valid to dereference ->bi_bh only when the rgrp is locked. This is basically the same rule as before, modulo the references not being valid until the following journal flush. Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com> Signed-off-by: NBob Peterson <rpeterso@redhat.com> Cc: Benjamin Marzinski <bmarzins@redhat.com>
-
由 Steven Whitehouse 提交于
We need to take the inode's glock whenever the inode's size is referenced, otherwise it might not be uptodate. Even though generic_file_llseek_unlocked() doesn't implement SEEK_DATA, SEEK_HOLE directly, it does reference the inode's size in those cases, so we need to add them to the list of origins which need the glock. Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com> Cc: Andi Kleen <ak@linux.intel.com>
-
由 Steven Whitehouse 提交于
If we pass through knowledge of whether the creation is intended to be exclusive or not, then we can deal with that in gfs2_create_inode and remove one set of locking. Also this removes the loop in gfs2_create and simplifies the code a bit. Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>
-
由 Steven Whitehouse 提交于
The aim of this patch is to use the newly enhanced ->dirty_inode() super block operation to deal with atime updates, rather than piggy backing that code into ->write_inode() as is currently done. The net result is a simplification of the code in various places and a reduction of the number of gfs2_dinode_out() calls since this is now implied by ->dirty_inode(). Some of the mark_inode_dirty() calls have been moved under glocks in order to take advantage of then being able to avoid locking in ->dirty_inode() when we already have suitable locks. One consequence is that generic_write_end() now correctly deals with file size updates, so that we do not need a separate check for that afterwards. This also, indirectly, means that fdatasync should work correctly on GFS2 - the current code always syncs the metadata whether it needs to or not. Has survived testing with postmark (with and without atime) and also fsx. Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>
-
由 Steven Whitehouse 提交于
Journaled data requires that a complete flush of all dirty data for the file is done, in order that the ail flush which comes after will succeed. Also the recently enhanced bug trap can trigger falsely in case an ail flush from fsync races with a page read. This updates the bug trap such that it will ignore buffers which are locked and only trigger on dirty and/or pinned buffers when the ail flush is run from fsync. The original bug trap is retained when ail flush is run from ->go_sync() Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>
-
由 Steven Whitehouse 提交于
If we have got far enough through the inode allocation code path that an inode has already been allocated, then we must call iput to dispose of it, if an error occurs during a later part of the process. This will always be the final iput since there will be no other references to the inode. Unlike when the inode has been unlinked, its block state will be GFS2_BLKST_INODE rather than GFS2_BLKST_UNLINKED so we need to skip the test in ->evict_inode() for this one case in order to ensure that it will be deallocated correctly. This patch adds a new flag in order to ensure that this will happen correctly. Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>
-
由 Steven Whitehouse 提交于
We do not need to start a transaction unless the atime check has proved positive. Also if we are going to flush the complete ail list anyway, we might as well skip the writeback for this specific inode's metadata, since that will be done as part of the ail writeback process in an order offering potentially more efficient I/O. Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>
-
由 Steven Whitehouse 提交于
The assert was being tested under the wrong lock, a legacy of the original code. Also, if it does trigger, the resulting information was not always a lot of help. This moves the patch under the correct lock and also prints out more useful information in tacking down the source of the problem. Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>
-
由 Steven Whitehouse 提交于
Now that the data writing is part of fsync proper, we can split the waiting part out and do it later on. This reduces the number of waits that we do during fsync on average. There is also no need to take the i_mutex unless we are flushing metadata to disk, so we can move that to within the metadata flushing code. Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>
-
由 Steven Whitehouse 提交于
Since there is now only a single caller to gfs2_dir_read_data() and it has a number of constant arguments, we can factor those out. Also some tests relating to the inode size were being done twice. Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>
-
- 23 8月, 2011 2 次提交
-
-
由 Christoph Hellwig 提交于
Add a new REQ_PRIO to let requests preempt others in the cfq I/O schedule, and lave REQ_META purely for marking requests as metadata in blktrace. All existing callers of REQ_META except for XFS are updated to also set REQ_PRIO for now. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NNamhyung Kim <namhyung@gmail.com> Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
-
由 Christoph Hellwig 提交于
Replace all occurnanced of the undocumented READ_META with READ | REQ_META and remove the unused WRITE_META define. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
-
- 01 8月, 2011 2 次提交
-
-
由 Al Viro 提交于
... so that &inode->i_mode could be passed to it Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
so we can pass &inode->i_mode to it Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 27 7月, 2011 1 次提交
-
-
由 Arun Sharma 提交于
This allows us to move duplicated code in <asm/atomic.h> (atomic_inc_not_zero() for now) to <linux/atomic.h> Signed-off-by: NArun Sharma <asharma@fb.com> Reviewed-by: NEric Dumazet <eric.dumazet@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: NMike Frysinger <vapier@gentoo.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 26 7月, 2011 5 次提交
-
-
由 Steven Whitehouse 提交于
Depending upon the order of userspace/kernel during the mount process, this can result in a hang without the _all version of the completion. Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>
-
由 Christoph Hellwig 提交于
Replace the ->check_acl method with a ->get_acl method that simply reads an ACL from disk after having a cache miss. This means we can replace the ACL checking boilerplate code with a single implementation in namei.c. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
new helper: posix_acl_create(&acl, gfp, mode_p). Replaces acl with modified clone, on failure releases acl and replaces with NULL. Returns 0 or -ve on error. All callers of posix_acl_create_masq() switched. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
new helper: posix_acl_chmod(&acl, gfp, mode). Replaces acl with modified clone or with NULL if that has failed; returns 0 or -ve on error. All callers of posix_acl_chmod_masq() switched to that - they'd been doing exactly the same thing. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Linus Torvalds 提交于
This moves logic for checking the cached ACL values from low-level filesystems into generic code. The end result is a streamlined ACL check that doesn't need to load the inode->i_op->check_acl pointer at all for the common cached case. The filesystems also don't need to check for a non-blocking RCU walk case in their acl_check() functions, because that is all handled at a VFS layer. Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 21 7月, 2011 3 次提交
-
-
由 Al Viro 提交于
d_splice_alias() will DTRT when given NULL or ERR_PTR Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Josef Bacik 提交于
Btrfs needs to be able to control how filemap_write_and_wait_range() is called in fsync to make it less of a painful operation, so push down taking i_mutex and the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some file systems can drop taking the i_mutex altogether it seems, like ext3 and ocfs2. For correctness sake I just pushed everything down in all cases to make sure that we keep the current behavior the same for everybody, and then each individual fs maintainer can make up their mind about what to do from there. Thanks, Acked-by: NJan Kara <jack@suse.cz> Signed-off-by: NJosef Bacik <josef@redhat.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Christoph Hellwig 提交于
Let filesystems handle waiting for direct I/O requests themselves instead of doing it beforehand. This means filesystem-specific locks to prevent new dio referenes from appearing can be held. This is important to allow generalizing i_dio_count to non-DIO_LOCKING filesystems. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 20 7月, 2011 5 次提交
-
-
由 Al Viro 提交于
not used by the instances anymore. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
redundant; all callers get it duplicated in mask & MAY_NOT_BLOCK and none of them removes that bit. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
not used in the instances anymore. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
its value depends only on inode and does not change; we might as well store it in ->i_op->check_acl and be done with that. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 15 7月, 2011 4 次提交
-
-
由 Eric Sandeen 提交于
__gfs2_free_data and __gfs2_free_meta are almost identical, and can be trivially combined. [This is as per Eric's original patch minus gfs2_free_data() which had no callers left and plus the conversion of the bmap.c calls to these functions. All in all, a nice clean up] Signed-off-by: NEric Sandeen <sandeen@redhat.com> Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>
-
由 Steven Whitehouse 提交于
This adds S_NOSEC support to GFS2. We set/reset the flag either when a user calls setattr or when we have just regained the glock from another node. The flag is only set if there are no xattrs on the inode and there is no suid bit set. Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com> Reviewed-by: NAndi Kleen <ak@linux.intel.com> Cc: Al Viro <viro@ZenIV.linux.org.uk>
-
由 Bob Peterson 提交于
This patch is a performance improvement for GFS2 in a clustered environment. It makes the glock hold time self-adjusting. Signed-off-by: NBob Peterson <rpeterso@redhat.com> Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>
-
由 Steven Whitehouse 提交于
This patch adds a cache for the hash table to the directory code in order to help simplify the way in which the hash table is accessed. This is intended to be a first step towards introducing some performance improvements in the directory code. There are two follow ups that I'm hoping to see fairly shortly. One is to simplify the hash table reading code now that we always read the complete hash table, whether we want one entry or all of them. The other is to introduce readahead on the heads of the hash chains which are referred to from the table. The hash table is a maximum of 128k in size, so it is not worth trying to read it in small chunks. Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>
-
- 14 7月, 2011 1 次提交
-
-
由 Steven Whitehouse 提交于
This patch contains a few misc fixes which resolve a recently reported issue. This patch has been a real team effort and has received a lot of testing. The first issue is that the ail lock needs to be held over a few more operations. The lock thats added into gfs2_releasepage() may possibly be a candidate for replacing with RCU at some future point, but at this stage we've gone for the obvious fix. The second issue is that gfs2_write_inode() can end up calling a glock recursively when called from gfs2_evict_inode() via the syncing code, so it needs a guard added. The third issue is that we either need to not truncate the metadata pages of inodes which have zero link count, but which we cannot deallocate due to them still being in use by other nodes, or we need to ensure that those pages have all made it through the journal and ail lists first. This patch takes the former approach, but the latter has also been tested and there is nothing to choose between them performance-wise. So again, we could revise that decision in the future. Also, the inode eviction process is now better documented. Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com> Tested-by: NBob Peterson <rpeterso@redhat.com> Tested-by: NAbhijith Das <adas@redhat.com> Reported-by: NBarry J. Marson <bmarson@redhat.com> Reported-by: NDavid Teigland <teigland@redhat.com>
-
- 12 7月, 2011 2 次提交
-
-
由 Steven Whitehouse 提交于
There is a potential race during filesystem mounting which has recently been reported. It occurs when the userland gfs_controld is able to process requests fast enough that it tries to use the sysfs interface before the lock module is properly initialised. This is a pretty unusual case as normally the lock module initialisation is very quick compared with gfs_controld. This patch adds an interruptible completion which is used to ensure that userland will wait for the initialisation of the lock module to complete. There are other potential solutions to this problem, but this is the quickest at this stage and has been tested both with and without mount.gfs2 present in the system. Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com> Reported-by: NDavid Booher <dbooher@adams.net>
-
由 Benjamin Marzinski 提交于
Right now, there is nothing that forces the log to get flushed when a node drops its rindex glock so that another node can grow the filesystem. If the log doesn't get flushed, GFS2 can corrupt the sd_log_le_rg list in the following way. A node puts an rgd on the list in rg_lo_add(), and then the rindex glock is dropped so the other node can grow the filesystem. When the node reacquires the rindex glock, that rgd gets deleted in clear_rgrpdi() before ever being removed from the list by gfs2_log_flush(). This code simply forces a log flush when the rindex glock is invalidated, solving the problem. Signed-off-by: NBenjamin Marzinski <bmarzins@redhat.com> Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>
-