- 11 September 2010, 1 commit
-
-
By Goldwyn Rodrigues
Track negative dentries by recording the generation number of the parent directory in d_fsdata. The generation number for the parent directory is recorded in its inode_info and is incremented every time the lock on the directory is dropped. If the generation numbers of the parent directory and the negative dentry match, there is no need to perform the revalidate; otherwise a revalidate is forced. This improves performance in situations where nodes look up the same non-existent file multiple times. Thanks to Mark for explaining the DLM sequence. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de> Signed-off-by: Joel Becker <joel.becker@oracle.com>
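A minimal sketch of the resulting revalidate shortcut, assuming the generation lives in d_fsdata and in the ocfs2 inode info (the helper name and field names here are illustrative, not necessarily what the patch uses):

```c
/*
 * Sketch: skip the expensive cluster revalidate for a negative dentry
 * when the parent directory's lock generation hasn't changed since the
 * dentry was created. OCFS2_I() and ip_dir_lock_gen are assumptions.
 */
static int ocfs2_negative_dentry_still_valid(struct dentry *dentry,
					     struct inode *dir)
{
	unsigned long gen = (unsigned long)dentry->d_fsdata;

	/* Unchanged generation: the directory lock was never dropped,
	 * so no other node can have created this name meanwhile. */
	return gen == OCFS2_I(dir)->ip_dir_lock_gen;
}
```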
-
- 10 September 2010, 13 commits
-
-
By Tao Ma
During orphan scan, if we are slot 0 and are replaying orphan_dir:0001, the general process for every file in this dir is: 1. iget orphan_dir:0001; since there is no inode for it, we have to create an inode and read it from disk. 2. Do the normal work, such as delete_inode, and remove it from the dir if that is allowed. 3. Call iput on orphan_dir:0001 when we are done. In this case, since we have no dcache for this inode, i_count reaches 0 and the VFS has to call clear_inode; in ocfs2_clear_inode we checkpoint the inode, which lets ocfs2_cmt and journald begin to work. 4. Loop back to 1 for the next file. So actually, for every deleted file, we have to read the orphan dir from disk and checkpoint the journal. This is very time consuming and causes a lot of journal checkpoint I/O. A better solution is to hold another reference to these inodes in ocfs2_super. Then, if there is no other race among nodes (which would make dlmglue checkpoint the inode), clear_inode won't be called in step 3, and in step 1 we only need to read the inode the first time. This is a big win for us. So this patch caches the system inodes of the other slots, giving us one more reference to these inodes and avoiding the extra inode reads and journal checkpoints. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
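A sketch of the caching idea, under the stated assumption that ocfs2_super gains an array of long-lived inode references (the array name and layout are hypothetical):

```c
/*
 * Sketch: hold one long-lived reference per (type, slot) system inode in
 * ocfs2_super so orphan replay doesn't drop i_count to zero - and thus
 * doesn't force clear_inode + a journal checkpoint - after every file.
 * osb->cached_system_inodes is a hypothetical field.
 */
static struct inode *ocfs2_get_cached_system_inode(struct ocfs2_super *osb,
						   int type, u32 slot)
{
	struct inode **cached = &osb->cached_system_inodes[type][slot];

	if (!*cached)	/* first use: read it from disk once */
		*cached = ocfs2_get_system_file_inode(osb, type, slot);

	/* Hand out a normal reference; the cached one drops at unmount. */
	return *cached ? igrab(*cached) : NULL;
}
```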
-
By Joel Becker
generic_check_addressable() erroneously shifts pages down by a block factor when it should be shifting up. To prevent overflow, we instead shift blocks down to pages. Signed-off-by: Joel Becker <joel.becker@oracle.com>
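A sketch of the corrected helper, reconstructed from the description above (the in-tree version may differ in minor details):

```c
/*
 * Sketch: shifting blocks *down* to pages can only shrink the value, so
 * it cannot overflow the intermediate u64 the way shifting pages up by
 * the block factor could.
 */
int generic_check_addressable(unsigned blocksize_bits, u64 num_blocks)
{
	u64 last_fs_block = num_blocks - 1;
	u64 last_fs_page = last_fs_block >> (PAGE_CACHE_SHIFT - blocksize_bits);

	if (unlikely(num_blocks == 0))
		return 0;
	if (blocksize_bits < 9 || blocksize_bits > PAGE_CACHE_SHIFT)
		return -EINVAL;
	/* Both sector_t and pgoff_t must be able to address the end. */
	if (last_fs_block > (sector_t)(~0ULL) >> (blocksize_bits - 9) ||
	    last_fs_page > (pgoff_t)(~0ULL))
		return -EFBIG;
	return 0;
}
```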
-
By Patrick J. LoPresti
The OCFS2 developers have already done all of the hard work to allow volumes larger than 16 TiB. But there is still a "sanity check" in fs/ocfs2/super.c that prevents the mounting of such volumes, even when the cluster size and journal options would allow it. This patch replaces that sanity check with a more sophisticated one that mounts a huge volume provided that (a) it is addressable by the raw word/address size of the system (borrowing a test from ext4); (b) the volume is using JBD2; and (c) the JBD2_FEATURE_INCOMPAT_64BIT flag is set on the journal. I factored the sanity check out into its own function. I also moved it from ocfs2_initialize_super() down to ocfs2_check_volume(); any earlier, and the journal would not have been initialized yet. This patch is one of a pair, and it depends on the other ("JBD2: Allow feature checks before journal recovery"). I have tested this patch on small volumes, huge volumes, and huge volumes without 64-bit block support in the journal. All of them appear to work or to fail gracefully, as appropriate. Signed-off-by: Patrick LoPresti <lopresti@gmail.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
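A hedged sketch of what the new check amounts to; ocfs2_check_volume_size() is an illustrative name, not the function the patch actually adds:

```c
static int ocfs2_check_volume_size(struct super_block *sb,
				   journal_t *journal, u64 blocks)
{
	int rc;

	/* (a) the whole volume must be addressable on this system */
	rc = generic_check_addressable(sb->s_blocksize_bits, blocks);
	if (rc)
		return rc;

	/* (b)+(c) past 2^32 blocks we also need 64-bit JBD2 journaling */
	if (blocks > 0xffffffffULL &&
	    !jbd2_journal_check_used_features(journal, 0, 0,
					      JBD2_FEATURE_INCOMPAT_64BIT))
		return -EFBIG;

	return 0;
}
```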
-
By Patrick J. LoPresti
Before we start accessing a huge (> 16 TiB) OCFS2 volume, we need to confirm that its journal supports 64-bit offsets. In particular, we need to check the journal's feature bits before recovering the journal. This is not possible with JBD2 at present, because the journal superblock (where the feature bits reside) is not loaded from disk until the journal is recovered. This patch loads the journal superblock in jbd2_journal_check_used_features() if it has not already been loaded, allowing us to check the feature bits before journal recovery. Signed-off-by: Patrick LoPresti <lopresti@gmail.com> Cc: linux-ext4@vger.kernel.org Acked-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Joel Becker <joel.becker@oracle.com>
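A sketch of the change inside jbd2_journal_check_used_features(), assuming the existing jbd2-internal loader journal_get_superblock(); the tail comparison follows the usual feature-bit pattern:

```c
int jbd2_journal_check_used_features(journal_t *journal, unsigned long compat,
				     unsigned long ro, unsigned long incompat)
{
	journal_superblock_t *sb;

	if (!compat && !ro && !incompat)
		return 1;
	/* New: read the superblock on demand, so feature bits are usable
	 * before journal recovery has run. */
	if (journal->j_format_version == 0 &&
	    journal_get_superblock(journal) != 0)
		return 0;
	if (journal->j_format_version == 1)	/* v1 has no feature bits */
		return 0;

	sb = journal->j_superblock;
	if ((compat & le32_to_cpu(sb->s_feature_compat)) == compat &&
	    (ro & le32_to_cpu(sb->s_feature_ro_compat)) == ro &&
	    (incompat & le32_to_cpu(sb->s_feature_incompat)) == incompat)
		return 1;
	return 0;
}
```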
-
By Patrick J. LoPresti
As part of adding support for OCFS2 to mount huge volumes, we need to check that the sector_t and page cache of the system are capable of addressing the entire volume. An identical check already appears in ext3 and ext4. This patch moves the addressability check into its own function in fs/libfs.c and modifies ext3 and ext4 to invoke it. [Edited to return -EINVAL instead of BUG_ON() for bad blocksize_bits -- Joel] Signed-off-by: Patrick LoPresti <lopresti@gmail.com> Cc: linux-ext4@vger.kernel.org Acked-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Joel Becker <joel.becker@oracle.com>
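Roughly how a filesystem then invokes the shared helper at mount time (an illustrative fragment, not a verbatim hunk from ext3/ext4):

```c
/* In fill_super, once the on-disk block count and block size are known: */
err = generic_check_addressable(sb->s_blocksize_bits,
				le32_to_cpu(es->s_blocks_count));
if (err) {
	printk(KERN_ERR "filesystem too large to mount safely on this system\n");
	goto failed_mount;
}
```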
-
By Tao Ma
Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
By Tao Ma
Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
By Jan Kara
ocfs2_sync_inode() is used only from ocfs2_sync_file(). But all data has already been written before calling ocfs2_sync_file(), and ocfs2 doesn't use the inode's private_list for tracking metadata buffers, so sync_mapping_buffers() is superfluous as well. Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
By Goldwyn Rodrigues
Thanks for the comments. I have incorporated them all. With CONFIG_OCFS2_FS_STATS enabled and CONFIG_DEBUG_LOCK_ALLOC disabled, the structure sizes now look like this (old - new = bytes saved):

ocfs2_write_ctxt:    2144 - 2136 = 8
ocfs2_inode_info:    1960 - 1848 = 112
ocfs2_journal:        168 - 160  = 8
ocfs2_lock_res:       336 - 304  = 32
ocfs2_refcount_tree:  512 - 472  = 40

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de> Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
By Tao Ma
In ocfs2 we don't actually allow any direct write past i_size; see ocfs2_prepare_inode_for_write. So we don't need the bogus simple_setsize call. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
By Tao Ma
The orphan scan worker currently has no trace logging, so it is very hard to tell whether it has finished or is blocked. Add two mlog trace lines so that we can tell whether the current orphan scan worker is blocked or not. This helped when I analyzed an orphan scan bug. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
By Tristan Ye
The reason we need this ioctl is to offer non-privileged end-users a way to gather filesystem information. We use OCFS2_IOC_INFO to drive the new ioctl: userspace passes the kernel a structure containing an array of request pointers and a request count. For example, from userspace:

struct ocfs2_info_blocksize oib = {
	.ib_req = {
		.ir_magic = OCFS2_INFO_MAGIC,
		.ir_code = OCFS2_INFO_BLOCKSIZE,
		...
	}
	...
};
struct ocfs2_info_clustersize oic = {
	...
};
uint64_t reqs[2] = {(unsigned long)&oib,
		    (unsigned long)&oic};
struct ocfs2_info info = {
	.oi_requests = reqs,
	.oi_count = 2,
};

ret = ioctl(fd, OCFS2_IOC_INFO, &info);

In the kernel, we get the request pointers from *info*, then handle each request one by one. The idea is to make each separated request small enough to guarantee better backward and forward compatibility, since a small request is less likely to break if the on-disk filesystem changes. Currently the following seven requests are supported, per the requirements of the userspace tool o2info, and I believe the list will grow over time :-)

OCFS2_INFO_CLUSTERSIZE
OCFS2_INFO_BLOCKSIZE
OCFS2_INFO_MAXSLOTS
OCFS2_INFO_LABEL
OCFS2_INFO_UUID
OCFS2_INFO_FS_FEATURES
OCFS2_INFO_JOURNAL_SIZE

This ioctl is specific to OCFS2. Signed-off-by: Tristan Ye <tristan.ye@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
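A sketch of the kernel-side dispatch loop implied above (illustrative; the helper name ocfs2_info_handle_request() and the exact copy-in are assumptions):

```c
/*
 * Sketch: fetch each userspace request pointer from the oi_requests
 * array and dispatch it; the handler would switch on ir_code and fill
 * in the reply for known codes.
 */
static int ocfs2_info_handle(struct inode *inode, struct ocfs2_info *info)
{
	int i, status = 0;
	u64 req_addr;
	u64 __user *reqs = (u64 __user *)(unsigned long)info->oi_requests;

	for (i = 0; i < info->oi_count; i++) {
		if (copy_from_user(&req_addr, reqs + i, sizeof(req_addr)))
			return -EFAULT;
		status = ocfs2_info_handle_request(inode,
			(struct ocfs2_info_request __user *)(unsigned long)req_addr);
		if (status)
			break;
	}
	return status;
}
```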
-
By Stefan Bader
So it can be used by everything that needs to check for that. Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 08 September 2010, 14 commits
-
-
By Mark Fasheh
ocfs2_create_inode_in_orphan() is used by reflink to create the newly reflinked inode simultaneously in the orphan dir. This allows us to easily handle partially-reflinked files during recovery cleanup. We have a problem though - the orphan dir stringifies the inode number to determine a unique name under which the orphan entry dirent can be created. Since ocfs2_create_inode_in_orphan() needs the space allocated in the orphan dir before it can allocate the inode, we currently call into the orphan code:

/*
 * We give the orphan dir the root blkno to fake an orphan name,
 * and allocate enough space for our insertion.
 */
status = ocfs2_prepare_orphan_dir(osb, &orphan_dir,
				  osb->root_blkno,
				  orphan_name, &orphan_insert);

Using osb->root_blkno might work fine on unindexed directories, but the orphan dir can have an index. When it has that index, the above code fails to allocate the proper index entry. Later, when we try to remove the file from the orphan dir (using the actual inode number), the reflink operation will fail. To fix this, I created a function ocfs2_alloc_orphaned_file() which uses the newly split out orphan and inode alloc code to figure out what the inode block number will be (once allocated) and then prepares the orphan dir from that data. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Tao Ma <tao.ma@oracle.com>
-
By Mark Fasheh
We do this because ocfs2_create_inode_in_orphan() wants to order the locking of the orphan dir with respect to the locking of the inode allocator *before* making any changes to the directory. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Tao Ma <tao.ma@oracle.com>
-
By Mark Fasheh
This allows code which needs to know the eventual block number of an inode, but which can't allocate it yet due to transaction or lock ordering, to obtain that number early. For example, ocfs2_create_inode_in_orphan() currently gives a junk blkno for the preparation of the orphan dir because it can't yet know where the actual inode will be placed - that code is actually in ocfs2_mknod_locked. This is a problem when the orphan dirs are indexed, as the junk inode number creates an index entry which goes unused (and fails the later removal from the orphan dir). With these interfaces, ocfs2_create_inode_in_orphan() can run the block group search (and get back the inode block number) *before* any actual allocation occurs. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Tao Ma <tao.ma@oracle.com>
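A hedged fragment of the resulting two-phase flow in ocfs2_create_inode_in_orphan(); the function names follow the commit descriptions but should be treated as illustrative:

```c
/* Phase 1: run only the block group search - learn the future blkno. */
status = ocfs2_find_new_inode_loc(dir, parent_di_bh, inode_ac, &di_blkno);
if (status < 0)
	goto leave;

/* Prepare the orphan dir (and its index) with the *real* inode number. */
status = ocfs2_prepare_orphan_dir(osb, &orphan_dir, di_blkno,
				  orphan_name, &orphan_insert);
if (status < 0)
	goto leave;

/* Phase 2: actually claim the inode at the location found above
 * (via the __ocfs2_mknod_locked() split out in the next patch). */
```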
-
By Mark Fasheh
ocfs2_search_chain() makes the same updates to the alloc inode as ocfs2_alloc_dinode_update_counts(). Instead of open coding the bitmap update, use our helper function. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Tao Ma <tao.ma@oracle.com>
-
By Mark Fasheh
Do this by splitting the bulk of the function away from the inode allocation code at the very bottom of ocfs2_mknod_locked(). Existing callers don't need to change and won't see any difference. The newly created function, __ocfs2_mknod_locked(), will be used shortly. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Tao Ma <tao.ma@oracle.com>
-
By Tristan Ye
This patch fixes a regression introduced by commit 6b933c8e... ('ocfs2: Avoid direct write if we fall back to buffered I/O'): http://oss.oracle.com/bugzilla/show_bug.cgi?id=1285 Commit 6b933c8e changed __generic_file_aio_write to generic_file_buffered_write, which doesn't call filemap_{write,wait}_range to flush the page cache when we fall back from O_DIRECT writes to buffered ones. This hurt O_DIRECT semantics for extending O_DIRECT writes. This patch guarantees that O_DIRECT writes which fall back to buffered I/O are correctly flushed. Signed-off-by: Tristan Ye <tristan.ye@oracle.com> Signed-off-by: Tao Ma <tao.ma@oracle.com>
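A sketch of the shape of the fix, under the assumption that the write path tracks how much was written through the buffered fallback (variable names are illustrative):

```c
/*
 * Sketch: after an O_DIRECT write falls back to buffered I/O, flush and
 * wait on exactly the range written, so O_DIRECT semantics still hold.
 */
if ((file->f_flags & O_DIRECT) && written_buffered > 0) {
	loff_t end = pos + written_buffered - 1;

	ret = filemap_write_and_wait_range(file->f_mapping, pos, end);
	if (ret == 0)
		/* drop the now-clean pages so later O_DIRECT I/O hits disk */
		invalidate_mapping_pages(file->f_mapping,
					 pos >> PAGE_CACHE_SHIFT,
					 end >> PAGE_CACHE_SHIFT);
}
```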
-
By Jan Kara
We cannot call grab_cache_page() when holding filesystem locks or with a transaction started, as grab_cache_page() performs page allocation with the GFP_KERNEL flag, so page reclaim can recurse back into the filesystem, causing deadlocks or various assertion failures. We have to use find_or_create_page() instead and pass it GFP_NOFS, as we do with other allocations. Acked-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Tao Ma <tao.ma@oracle.com>
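The substitution itself is mechanical; a minimal before/after sketch:

```c
/* Before - grab_cache_page() allocates with GFP_KERNEL, so reclaim can
 * re-enter the filesystem while we hold cluster locks or a transaction: */
page = grab_cache_page(mapping, index);

/* After - GFP_NOFS forbids reclaim from recursing into filesystem code: */
page = find_or_create_page(mapping, index, GFP_NOFS);
if (!page) {
	ret = -ENOMEM;
	goto out;
}
```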
-
By Mark Fasheh
We were setting ac->ac_last_group in ocfs2_claim_suballoc_bits from res->sr_bg_blkno. Unfortunately, res->sr_bg_blkno is going to be zero under normal (non-fragmented) circumstances. The discontig block group patches effectively turned off that feature. Fix this by correctly calculating what the next group hint should be. Acked-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com> Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.de> Signed-off-by: Tao Ma <tao.ma@oracle.com>
-
By Tao Ma
We have now added discontig block groups, so an inode can be allocated in a discontig block group. Handle it in ocfs2_get_suballoc_slot_bit. The old ocfs2_test_suballoc_bit got the group block number from the allocation inode, which is wrong. Fix it by passing in the right group. Acked-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Tao Ma <tao.ma@oracle.com>
-
By Jan Kara
When the 'barrier' mount option is specified, we have to issue a cache flush during fdatasync(2). We have to do this even if the inode doesn't have I_DIRTY_DATASYNC set, because we still have to get the written *data* to disk so that it is not lost in case of a crash. Acked-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Tao Ma <tao.ma@oracle.com>
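A sketch of the fsync path this implies (hedged: the flush call and its flags match the block-layer API of this kernel era, but the surrounding code is illustrative):

```c
/*
 * fdatasync(2): even when the inode carries no I_DIRTY_DATASYNC state,
 * the freshly written data may still sit in the disk's volatile cache,
 * so with barriers enabled we must flush the device before returning.
 */
if (datasync && !(inode->i_state & I_DIRTY_DATASYNC)) {
	err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL,
				 NULL, BLKDEV_IFL_WAIT);
	goto bail;
}
```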
-
By Tao Ma
__ocfs2_page_mkwrite is currently broken in its handling of the end of the file: 1. the last page should be the page that contains i_size - 1; 2. the length within the last page is also calculated wrongly. Change both accordingly. Acked-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Tao Ma <tao.ma@oracle.com>
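The corrected arithmetic, as a sketch (using the page cache macros of this kernel era):

```c
/* The last legitimately writable page is the one holding byte i_size-1. */
last_index = (i_size_read(inode) - 1) >> PAGE_CACHE_SHIFT;

if (page->index == last_index)
	/* Partial page: valid bytes end exactly at i_size. */
	len = ((i_size_read(inode) - 1) & ~PAGE_CACHE_MASK) + 1;
else
	len = PAGE_CACHE_SIZE;
```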
-
By Sunil Mushran
For local mounts, ocfs2_read_locked_inode() calls ocfs2_read_blocks_sync() to read the inode off the disk. The latter first checks whether that block is cached in the journal and, if so, returns that block. That is ok. But ocfs2_read_locked_inode() then goes wrong when it tries to validate the checksum of such blocks. Blocks that are cached in the journal may not have had their checksum computed yet. We should not validate the checksums of such blocks. Fixes ossbz#1282 http://oss.oracle.com/bugzilla/show_bug.cgi?id=1282 Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Cc: stable@kernel.org Signed-off-by: Tao Ma <tao.ma@oracle.com>
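The essence of the fix is a one-line guard; a sketch using the jbd2 buffer-state test:

```c
/* Buffers still owned by the journal may not have a checksum yet, so
 * only validate what actually came from disk. */
if (!buffer_jbd(bh))
	status = ocfs2_validate_inode_block(osb->sb, bh);
```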
-
By Sunil Mushran
Like the userspace tools, the checksum validation function now prints the values in hex. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Tao Ma <tao.ma@oracle.com>
-
By Valerie Aurora
Sanity check the flags passed to change_mnt_propagation(). Exactly one flag should be set; return -EINVAL otherwise. Userspace can pass arbitrary combinations of MS_* flags to mount(). do_change_type() is called if any of MS_SHARED, MS_PRIVATE, MS_SLAVE, or MS_UNBINDABLE is set. do_change_type() clears MS_REC and then calls change_mnt_propagation() with the rest of the user-supplied flags. change_mnt_propagation() clearly assumes only one flag is set, but do_change_type() does not check that this is true. For example, mount() with the flags MS_SHARED | MS_RDONLY does not actually make the mount shared or read-only, but it does clear MNT_UNBINDABLE. Signed-off-by: Valerie Aurora <vaurora@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
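A sketch of the added check in do_change_type() (close to what the description dictates; treat the surrounding details as illustrative):

```c
static int do_change_type(struct path *path, int flag)
{
	int type = flag & ~(MS_REC | MS_SILENT);

	/* Exactly one propagation type must remain after masking. */
	if (type != MS_SHARED && type != MS_PRIVATE &&
	    type != MS_SLAVE && type != MS_UNBINDABLE)
		return -EINVAL;

	/* ... existing recursion/propagation logic follows ... */
	return 0;
}
```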
-
- 07 September 2010, 2 commits
-
-
By Miklos Szeredi
Sparse doesn't understand lock annotations of the form __releases(&foo->lock). Change them to __releases(foo->lock). Same for __acquires(). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
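For illustration, with a hypothetical struct foo:

```c
/* Sparse accepts this spelling of the annotation... */
static void foo_unlock(struct foo *f) __releases(f->lock);

/* ...but cannot parse the address-of form: */
static void foo_unlock_bad(struct foo *f) __releases(&f->lock);
```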
-
By Miklos Szeredi
David Bartley reported that fuse can hang in fuse_get_req_nofail() when the connection to the filesystem server is no longer active. If bg_queue is not empty, then flush_bg_queue() called from request_end() can put more requests onto the pending queue. If this happens while ending requests on the processing queue, those background requests will be queued to the pending list and never ended. Another problem is that fuse_dev_release() didn't wake up processes sleeping on blocked_waitq. Solve this by: a) flushing the background queue before calling end_requests() on the pending and processing queues; b) setting blocked = 0 and waking up processes waiting on blocked_waitq. Thanks to David for an excellent bug report. Reported-by: David Bartley <andareed@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: stable@kernel.org
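A sketch of the resulting teardown order (an illustrative fragment; fuse's real teardown carries more locking around it):

```c
fc->connected = 0;

/* (a) drain bg_queue first, so request_end() can no longer repopulate
 * the pending list after we have already ended it */
flush_bg_queue(fc);
end_requests(fc, &fc->pending);
end_requests(fc, &fc->processing);

/* (b) unblock anyone throttled while waiting for background slots */
fc->blocked = 0;
wake_up_all(&fc->blocked_waitq);
```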
-
- 04 September 2010, 1 commit
-
-
By Dan Carpenter
d_path() returns an ERR_PTR on failure; it doesn't return NULL. Signed-off-by: Dan Carpenter <error27@gmail.com> Cc: stable <stable@kernel.org> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
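The correct calling pattern, as a short sketch:

```c
char *path = d_path(&file->f_path, buf, sizeof(buf));
if (IS_ERR(path))	/* d_path() never returns NULL on failure */
	return PTR_ERR(path);
```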
-
- 03 September 2010, 3 commits
-
-
By Tao Ma
In xfs_vn_fiemap, we set bmv_count to fi_extent_max + 1 and want to return fi_extent_max extents, but this doesn't actually work for a sparse file. The reason is that xfs_getbmap calculates holes and stores them in 'out', while 'out' is allocated with bmv_count (fi_extent_max + 1) entries, which doesn't account for holes. So in the worst case, if the 'out' vector looks like [hole, extent, hole, extent, hole, ..., hole, extent, hole], we will only return half of fi_extent_max extents. This patch adds a new flag, BMV_IF_NO_HOLES, for bmv_iflags. With this flag set, we don't consume an 'out' entry in xfs_getbmap for a hole. The solution is a bit ugly in that it just doesn't increase the index of the 'out' vector. I felt it was not easy to skip holes at the very beginning, since we have complicated checks and functions like xfs_getbmapx_fix_eof_hole that adjust 'out'. Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Alex Elder <aelder@sgi.com>
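A sketch of the caller side in xfs_vn_fiemap (illustrative and abbreviated; the getbmapx field names are the on-disk/API ones):

```c
struct getbmapx bm = { 0 };

bm.bmv_count = fieinfo->fi_extents_max + 1;
/* New: don't burn 'out' slots on holes in sparse files. */
bm.bmv_iflags = BMV_IF_PREALLOC | BMV_IF_NO_HOLES;

error = xfs_getbmap(ip, &bm, xfs_fiemap_format, fieinfo);
```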
-
By Dave Chinner
If we attempt to preallocate more than 2^32 blocks of space in a single syscall, the transaction block reservation will overflow, leading to a hang in the superblock block accounting code. This is trivially reproduced with xfs_io. Fix the problem by capping the allocation reservation to the maximum number of blocks a single xfs_bmapi() call can allocate (2^21 blocks). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
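A hedged sketch of the capping; the loop structure is illustrative, while MAXEXTLEN is XFS's real per-extent limit of 2^21 - 1 blocks:

```c
while (allocatesize_fsb && !error) {
	xfs_extlen_t this_chunk;

	/* Never reserve more than one xfs_bmapi() call can allocate. */
	this_chunk = (xfs_extlen_t)min_t(xfs_fileoff_t,
					 allocatesize_fsb, MAXEXTLEN);
	resblks = XFS_DIOSTRAT_SPACE_RES(mp, this_chunk);

	/* ... xfs_trans_reserve(), xfs_bmapi(), commit, advance offset ... */
	allocatesize_fsb -= this_chunk;
}
```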
-
By J. Bruce Fields
This fixes an unnecessary BUG(). Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
- 02 September 2010, 2 commits
-
-
By Arkadiusz Miśkiewicz
Currently the on-disk structure can keep only a 16-bit project quota id, so disallow 32-bit ones. This fixes a problem where parts of the kernel structures holding the project quota id are 32-bit while other (on-disk) parts are 16-bit variables, which causes project quota member files to be inaccessible for some operations (like mv/rm). Signed-off-by: Arkadiusz Miśkiewicz <arekm@maven.pl> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
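The guard itself is tiny; a sketch of the ioctl-side check:

```c
/* The on-disk dinode stores the project id in 16 bits; reject wider. */
if (fa->fsx_projid > (__uint16_t)-1)	/* i.e. > 0xffff */
	return XFS_ERROR(EINVAL);
```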
-
By Dave Chinner
When doing large parallel file creates on a 16p machine, a large amount of time is spent in _xfs_buf_find(). A system-wide profile with perf top shows this:

1134740.00  19.3%  _xfs_buf_find
 733142.00  12.5%  __ticket_spin_lock

The problem is that the hash contains 45,000 buffers while the hash table width is only 256 buffers. That means we have around 200 buffers per chain, and searching it is quite expensive. The hash table size needs to increase. Secondly, every time we do a lookup, we promote the buffer we find to the head of the hash chain. This dirties cachelines and invalidates cachelines across all CPUs that may have walked the hash chain recently; hence every walk of the hash chain is effectively a cold cache walk. Remove the promotion to avoid this invalidation. The results are:

1045043.00  21.2%  __ticket_spin_lock
 326184.00   6.6%  _xfs_buf_find

A 70% drop in CPU usage when looking up buffers. Unfortunately that does not translate into an increase in performance under this workload, as contention on the inode_lock soaks up most of the reduction in CPU usage. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
- 30 August 2010, 2 commits
-
-
By Dan Carpenter
p9_client_walk() can return error values if we run out of space or there is a problem with the network. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
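The correct calling pattern, sketched:

```c
fid = p9_client_walk(dfid, 1, &name, 1);
if (IS_ERR(fid))	/* ENOMEM, network failures, ... - never NULL */
	return PTR_ERR(fid);
```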
-
By Ryusuke Konishi
If load_nilfs() gets an error while doing recovery, it fails to free the shadow inode of the DAT (nilfs->ns_gc_dat). This fixes the leak. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
-
- 28 August 2010, 2 commits
-
-
By Eric Paris
The fsnotify main loop has two bools which indicate whether we processed the inode or vfsmount mark in that particular pass through the loop. These bools can be replaced with the inode_group and vfsmount_group variables, which actually makes the code a little easier to understand. Signed-off-by: Eric Paris <eparis@redhat.com>
-
By Eric Paris
Marks were stored on the inode and vfsmount mark lists in order from highest memory address to lowest memory address. The code that walks those lists thought they were ordered from lowest to highest, with unpredictable results when trying to match up marks from each. It was possible for extra events to be sent to userspace when inode marks that ignore events didn't get matched with the corresponding vfsmount marks. This problem only affected fanotify when using both vfsmount and inode marks simultaneously. Signed-off-by: Eric Paris <eparis@redhat.com>
-