提交 · 534029e2fd06ac9a5a1b33b2735e3ac3242adb29 · openanolis / cloud-kernel

21 10月, 2011 13 次提交

由 Steven Whitehouse 提交于 9月 01, 2011

Given that a resource group has been locked, there is no reason why
we should not be able to allocate as many blocks as are free. The
al_requested parameter should really be considered as a minimum
number of blocks to be available. Should this limit be overshot,
there are other mechanisms which will prevent over allocation.
Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>

534029e2

GFS2: Cache the most recently used resource group in the inode · 54335b1f

由 Steven Whitehouse 提交于 9月 01, 2011

This means that after the initial allocation for any inode, the
last used resource group is cached in the inode for future use.
This drastically reduces the number of lookups of resource
groups in the common case, and this the contention on that
data structure.

The allocation algorithm is the same as previously, except that we
always check to see if the goal block is within the cached rgrp
first before going to the rbtree to look one up.
Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>

54335b1f

GFS2: Make resource groups "append only" during life of fs · 8339ee54

由 Steven Whitehouse 提交于 8月 31, 2011

Since we have ruled out supporting online filesystem shrink,
it is possible to make the resource group list append only
during the life of a super block. This gives several benefits:

Firstly, we only need to read new rindex elements as they are added
rather than needing to reread the whole rindex file each time one
element is added.

Secondly, the rindex glock can be held for much shorter periods of
time, and is completely removed from the fast path for allocations.
The lock is taken in shared mode only when updating the resource
groups when the first allocation occurs, and after a grow has
taken place.

Thirdly, this results in a reduction in code size, and everything
gets a lot simpler to understand in this area.
Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>

8339ee54

GFS2: Use rbtree for resource groups and clean up bitmap buffer ref count scheme · 7c9ca621

由 Bob Peterson 提交于 8月 31, 2011

Here is an update of Bob's original rbtree patch which, in addition, also
resolves the rather strange ref counting that was being done relating to
the bitmap blocks.

Originally we had a dual system for journaling resource groups. The metadata
blocks were journaled and also the rgrp itself was added to a list. The reason
for adding the rgrp to the list in the journal was so that the "repolish
clones" code could be run to update the free space, and potentially send any
discard requests when the log was flushed. This was done by comparing the
"cloned" bitmap with what had been written back on disk during the transaction
commit.

Due to this, there was a requirement to hang on to the rgrps' bitmap buffers
until the journal had been flushed. For that reason, there was a rather
complicated set up in the ->go_lock ->go_unlock functions for rgrps involving
both a mutex and a spinlock (the ->sd_rindex_spin) to maintain a reference
count on the buffers.

However, the journal maintains a reference count on the buffers anyway, since
they are being journaled as metadata buffers. So by moving the code which deals
with the post-journal accounting for bitmap blocks to the metadata journaling
code, we can entirely dispense with the rather strange buffer ref counting
scheme and also the requirement to journal the rgrps.

The net result of all this is that the ->sd_rindex_spin is left to do exactly
one job, and that is to look after the rbtree or rgrps.

This patch is designed to be a stepping stone towards using RCU for the rbtree
of resource groups, however the reduction in the number of uses of the
->sd_rindex_spin is likely to have benefits for multi-threaded workloads,
anyway.

The patch retains ->go_lock and ->go_unlock for rgrps, however these maybe also
be removed in future in favour of calling the functions directly where required
in the code. That will allow locking of resource groups without needing to
actually read them in - something that could be useful in speeding up statfs.

In the mean time though it is valid to dereference ->bi_bh only when the rgrp
is locked. This is basically the same rule as before, modulo the references not
being valid until the following journal flush.
Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>
Signed-off-by: NBob Peterson <rpeterso@redhat.com>
Cc: Benjamin Marzinski <bmarzins@redhat.com>

7c9ca621

GFS2: Fix lseek after SEEK_DATA, SEEK_HOLE have been added · 9453615a

由 Steven Whitehouse 提交于 8月 23, 2011

We need to take the inode's glock whenever the inode's size
is referenced, otherwise it might not be uptodate. Even
though generic_file_llseek_unlocked() doesn't implement
SEEK_DATA, SEEK_HOLE directly, it does reference the inode's
size in those cases, so we need to add them to the list
of origins which need the glock.
Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>

9453615a

GFS2: Clean up gfs2_create · 9a63edd1

由 Steven Whitehouse 提交于 8月 18, 2011

If we pass through knowledge of whether the creation is intended to be
exclusive or not, then we can deal with that in gfs2_create_inode
and remove one set of locking. Also this removes the loop in
gfs2_create and simplifies the code a bit.
Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>

9a63edd1

GFS2: Use ->dirty_inode() · ab9bbda0

由 Steven Whitehouse 提交于 8月 15, 2011

The aim of this patch is to use the newly enhanced ->dirty_inode()
super block operation to deal with atime updates, rather than
piggy backing that code into ->write_inode() as is currently
done.

The net result is a simplification of the code in various places
and a reduction of the number of gfs2_dinode_out() calls since
this is now implied by ->dirty_inode().

Some of the mark_inode_dirty() calls have been moved under glocks
in order to take advantage of then being able to avoid locking in
->dirty_inode() when we already have suitable locks.

One consequence is that generic_write_end() now correctly deals
with file size updates, so that we do not need a separate check
for that afterwards. This also, indirectly, means that fdatasync
should work correctly on GFS2 - the current code always syncs the
metadata whether it needs to or not.

Has survived testing with postmark (with and without atime) and
also fsx.
Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>

ab9bbda0

GFS2: Fix bug trap and journaled data fsync · f1818529

由 Steven Whitehouse 提交于 8月 05, 2011

Journaled data requires that a complete flush of all dirty data for
the file is done, in order that the ail flush which comes after
will succeed.

Also the recently enhanced bug trap can trigger falsely in case
an ail flush from fsync races with a page read. This updates the
bug trap such that it will ignore buffers which are locked and
only trigger on dirty and/or pinned buffers when the ail flush
is run from fsync. The original bug trap is retained when ail
flush is run from ->go_sync()
Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>

f1818529

GFS2: Fix inode allocation error path · 40ac218f

由 Steven Whitehouse 提交于 8月 02, 2011

If we have got far enough through the inode allocation code
path that an inode has already been allocated, then we must
call iput to dispose of it, if an error occurs during a
later part of the process. This will always be the final iput
since there will be no other references to the inode.

Unlike when the inode has been unlinked, its block state will
be GFS2_BLKST_INODE rather than GFS2_BLKST_UNLINKED so we need
to skip the test in ->evict_inode() for this one case in order
to ensure that it will be deallocated correctly. This patch adds
a new flag in order to ensure that this will happen correctly.
Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>

40ac218f

GFS2: Make atime checks more efficient · 1d4ec642

由 Steven Whitehouse 提交于 8月 02, 2011

We do not need to start a transaction unless the atime
check has proved positive. Also if we are going to flush
the complete ail list anyway, we might as well skip the
writeback for this specific inode's metadata, since that
will be done as part of the ail writeback process in an
order offering potentially more efficient I/O.
Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>

1d4ec642

GFS2: Fix bug-trap in ail flush code · 75549186

由 Steven Whitehouse 提交于 8月 02, 2011

The assert was being tested under the wrong lock, a
legacy of the original code. Also, if it does trigger,
the resulting information was not always a lot of help.

This moves the patch under the correct lock and also
prints out more useful information in tacking down the
source of the problem.
Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>

75549186

GFS2: Split data write & wait in fsync · 2f0264d5

由 Steven Whitehouse 提交于 7月 27, 2011

Now that the data writing is part of fsync proper, we can split
the waiting part out and do it later on. This reduces the
number of waits that we do during fsync on average.

There is also no need to take the i_mutex unless we are flushing
metadata to disk, so we can move that to within the metadata
flushing code.
Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>

2f0264d5

GFS2: Clean up dir hash table reading · 4c28d338

由 Steven Whitehouse 提交于 7月 26, 2011

Since there is now only a single caller to gfs2_dir_read_data()
and it has a number of constant arguments, we can factor
those out. Also some tests relating to the inode size were
being done twice.
Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>

4c28d338

23 8月, 2011 2 次提交

block: separate priority boosting from REQ_META · 65299a3b

由 Christoph Hellwig 提交于 8月 23, 2011

Add a new REQ_PRIO to let requests preempt others in the cfq I/O schedule,
and lave REQ_META purely for marking requests as metadata in blktrace.

All existing callers of REQ_META except for XFS are updated to also
set REQ_PRIO for now.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NNamhyung Kim <namhyung@gmail.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

65299a3b

block: remove READ_META and WRITE_META · 5dc06c5a

由 Christoph Hellwig 提交于 8月 23, 2011

Replace all occurnanced of the undocumented READ_META with READ | REQ_META
and remove the unused WRITE_META define.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

5dc06c5a

01 8月, 2011 2 次提交

switch posix_acl_equiv_mode() to umode_t * · d6952123

由 Al Viro 提交于 7月 23, 2011

... so that &inode->i_mode could be passed to it
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

d6952123

A
switch posix_acl_create() to umode_t * · d3fb6120
由 Al Viro 提交于 7月 23, 2011
```
so we can pass &inode->i_mode to it
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
d3fb6120

27 7月, 2011 1 次提交

atomic: use <linux/atomic.h> · 60063497

由 Arun Sharma 提交于 7月 26, 2011

This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>
Signed-off-by: NArun Sharma <asharma@fb.com>
Reviewed-by: NEric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: NMike Frysinger <vapier@gentoo.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

60063497

26 7月, 2011 5 次提交

GFS2: Fix mount hang caused by certain access pattern to sysfs files · 19237039

由 Steven Whitehouse 提交于 7月 26, 2011

Depending upon the order of userspace/kernel during the
mount process, this can result in a hang without the
_all version of the completion.
Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>

19237039

fs: take the ACL checks to common code · 4e34e719

由 Christoph Hellwig 提交于 7月 23, 2011

Replace the ->check_acl method with a ->get_acl method that simply reads an
ACL from disk after having a cache miss. This means we can replace the ACL
checking boilerplate code with a single implementation in namei.c.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

4e34e719

kill boilerplates around posix_acl_create_masq() · 826cae2f

由 Al Viro 提交于 7月 23, 2011

new helper: posix_acl_create(&acl, gfp, mode_p).  Replaces acl with
modified clone, on failure releases acl and replaces with NULL.
Returns 0 or -ve on error.  All callers of posix_acl_create_masq()
switched.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

826cae2f

kill boilerplate around posix_acl_chmod_masq() · bc26ab5f

由 Al Viro 提交于 7月 23, 2011

new helper: posix_acl_chmod(&acl, gfp, mode).  Replaces acl with modified
clone or with NULL if that has failed; returns 0 or -ve on error.  All
callers of posix_acl_chmod_masq() switched to that - they'd been doing
exactly the same thing.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

bc26ab5f

vfs: move ACL cache lookup into generic code · e77819e5

由 Linus Torvalds 提交于 7月 22, 2011

This moves logic for checking the cached ACL values from low-level
filesystems into generic code.  The end result is a streamlined ACL
check that doesn't need to load the inode->i_op->check_acl pointer at
all for the common cached case.

The filesystems also don't need to check for a non-blocking RCU walk
case in their acl_check() functions, because that is all handled at a
VFS layer.
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

e77819e5

21 7月, 2011 3 次提交

simplify gfs2_lookup() · 6c673ab3

由 Al Viro 提交于 7月 17, 2011

d_splice_alias() will DTRT when given NULL or ERR_PTR
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

6c673ab3

fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers · 02c24a82

由 Josef Bacik 提交于 7月 16, 2011

Btrfs needs to be able to control how filemap_write_and_wait_range() is called
in fsync to make it less of a painful operation, so push down taking i_mutex and
the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some
file systems can drop taking the i_mutex altogether it seems, like ext3 and
ocfs2. For correctness sake I just pushed everything down in all cases to make
sure that we keep the current behavior the same for everybody, and then each
individual fs maintainer can make up their mind about what to do from there.
Thanks,
Acked-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJosef Bacik <josef@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

02c24a82

fs: move inode_dio_wait calls into ->setattr · 562c72aa

由 Christoph Hellwig 提交于 6月 24, 2011

Let filesystems handle waiting for direct I/O requests themselves instead
of doing it beforehand. This means filesystem-specific locks to prevent
new dio referenes from appearing can be held. This is important to allow
generalizing i_dio_count to non-DIO_LOCKING filesystems.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

562c72aa

20 7月, 2011 5 次提交
- A
  ->permission() sanitizing: don't pass flags to ->permission() · 10556cb2
  由 Al Viro 提交于 6月 20, 2011
```
not used by the instances anymore.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  10556cb2
- A
  ->permission() sanitizing: don't pass flags to generic_permission() · 2830ba7f
  由 Al Viro 提交于 6月 20, 2011
```
redundant; all callers get it duplicated in mask & MAY_NOT_BLOCK and none of
them removes that bit.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  2830ba7f
- A
  ->permission() sanitizing: don't pass flags to ->check_acl() · 7e40145e
  由 Al Viro 提交于 6月 20, 2011
```
not used in the instances anymore.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  7e40145e
- A
  ->permission() sanitizing: pass MAY_NOT_BLOCK to ->check_acl() · 9c2c7039
  由 Al Viro 提交于 6月 20, 2011
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  9c2c7039
- A
  kill check_acl callback of generic_permission() · 178ea735
  由 Al Viro 提交于 6月 20, 2011
```
its value depends only on inode and does not change; we might as
well store it in ->i_op->check_acl and be done with that.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  178ea735
15 7月, 2011 4 次提交

GFS2: combine duplicated block freeing routines · 46fcb2ed

由 Eric Sandeen 提交于 6月 23, 2011

__gfs2_free_data and __gfs2_free_meta are almost identical, and
can be trivially combined.

[This is as per Eric's original patch minus gfs2_free_data() which had
 no callers left and plus the conversion of the bmap.c calls to these
 functions. All in all, a nice clean up]
Signed-off-by: NEric Sandeen <sandeen@redhat.com>
Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>

46fcb2ed

GFS2: Add S_NOSEC support · 9964afbb

由 Steven Whitehouse 提交于 6月 16, 2011

This adds S_NOSEC support to GFS2. We set/reset the flag either when
a user calls setattr or when we have just regained the glock
from another node. The flag is only set if there are no xattrs
on the inode and there is no suid bit set.
Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>
Reviewed-by: NAndi Kleen <ak@linux.intel.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>

9964afbb

GFS2: Automatically adjust glock min hold time · 7cf8dcd3

由 Bob Peterson 提交于 6月 15, 2011

This patch is a performance improvement for GFS2 in a clustered
environment. It makes the glock hold time self-adjusting.
Signed-off-by: NBob Peterson <rpeterso@redhat.com>
Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>

7cf8dcd3

GFS2: Cache dir hash table in a contiguous buffer · 17d539f0

由 Steven Whitehouse 提交于 6月 15, 2011

This patch adds a cache for the hash table to the directory code
in order to help simplify the way in which the hash table is
accessed. This is intended to be a first step towards introducing
some performance improvements in the directory code.

There are two follow ups that I'm hoping to see fairly shortly. One
is to simplify the hash table reading code now that we always read the
complete hash table, whether we want one entry or all of them. The
other is to introduce readahead on the heads of the hash chains
which are referred to from the table.

The hash table is a maximum of 128k in size, so it is not worth trying
to read it in small chunks.
Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>

17d539f0

14 7月, 2011 1 次提交

GFS2: Resolve inode eviction and ail list interaction bug · 380f7c65

由 Steven Whitehouse 提交于 7月 14, 2011

This patch contains a few misc fixes which resolve a recently
reported issue. This patch has been a real team effort and has
received a lot of testing.

The first issue is that the ail lock needs to be held over a few
more operations. The lock thats added into gfs2_releasepage() may
possibly be a candidate for replacing with RCU at some future
point, but at this stage we've gone for the obvious fix.

The second issue is that gfs2_write_inode() can end up calling
a glock recursively when called from gfs2_evict_inode() via the
syncing code, so it needs a guard added.

The third issue is that we either need to not truncate the metadata
pages of inodes which have zero link count, but which we cannot
deallocate due to them still being in use by other nodes, or we need
to ensure that those pages have all made it through the journal and
ail lists first. This patch takes the former approach, but the
latter has also been tested and there is nothing to choose between
them performance-wise. So again, we could revise that decision
in the future.

Also, the inode eviction process is now better documented.
Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>
Tested-by: NBob Peterson <rpeterso@redhat.com>
Tested-by: NAbhijith Das <adas@redhat.com>
Reported-by: NBarry J. Marson <bmarson@redhat.com>
Reported-by: NDavid Teigland <teigland@redhat.com>

380f7c65

12 7月, 2011 2 次提交

GFS2: Fix race during filesystem mount · 3942ae53

由 Steven Whitehouse 提交于 7月 11, 2011

There is a potential race during filesystem mounting which has recently
been reported. It occurs when the userland gfs_controld is able to
process requests fast enough that it tries to use the sysfs interface
before the lock module is properly initialised. This is a pretty
unusual case as normally the lock module initialisation is very quick
compared with gfs_controld.

This patch adds an interruptible completion which is used to ensure that
userland will wait for the initialisation of the lock module to
complete.

There are other potential solutions to this problem, but this is the
quickest at this stage and has been tested both with and without
mount.gfs2 present in the system.
Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>
Reported-by: NDavid Booher <dbooher@adams.net>

3942ae53

GFS2: force a log flush when invalidating the rindex glock · 1ce53368

由 Benjamin Marzinski 提交于 6月 13, 2011

Right now, there is nothing that forces the log to get flushed when a node
drops its rindex glock so that another node can grow the filesystem. If the
log doesn't get flushed, GFS2 can corrupt the sd_log_le_rg list in the
following way.

A node puts an rgd on the list in rg_lo_add(), and then the rindex glock is
dropped so the other node can grow the filesystem. When the node reacquires the
rindex glock, that rgd gets deleted in clear_rgrpdi() before ever being
removed from the list by gfs2_log_flush().

This code simply forces a log flush when the rindex glock is invalidated,
solving the problem.
Signed-off-by: NBenjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>

1ce53368

26 5月, 2011 1 次提交

gfs2: Drop __TIME__ usage · 8d2c50e3

由 Michal Marek 提交于 4月 01, 2011

The kernel already prints its build timestamp during boot, no need to
repeat it in random drivers and produce different object files each
time.

Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: cluster-devel@redhat.com
Signed-off-by: NMichal Marek <mmarek@suse.cz>

8d2c50e3

25 5月, 2011 1 次提交

vmscan: change shrinker API by passing shrink_control struct · 1495f230

由 Ying Han 提交于 5月 24, 2011

Change each shrinker's API by consolidating the existing parameters into
shrink_control struct.  This will simplify any further features added w/o
touching each file of shrinker.

[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: fix warning]
[kosaki.motohiro@jp.fujitsu.com: fix up new shrinker API]
[akpm@linux-foundation.org: fix xfs warning]
[akpm@linux-foundation.org: update gfs2]
Signed-off-by: NYing Han <yinghan@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Acked-by: NPavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Acked-by: NRik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

1495f230

openanolis / cloud-kernel 大约 1 年 前同步成功

openanolis / cloud-kernel
大约 1 年前同步成功