提交 · dcc6d073225b6b732a52477c91bd4edc9b4d5502 · bug2833 / cloud-kernel

21 5月, 2011 1 次提交

btrfs: implement delayed inode items operation · 16cdcec7

由 Miao Xie 提交于 4月 22, 2011

Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
  root's radix tree, and letting btrfs inodes go.

Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
  Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
  Itaru Kitayama.

Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
  inode in time.

Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
  balance, which is reported by Tsutomu Itoh.
- Modify the patch address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason

Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
  which is created for every directory and file, and used to manage the
  delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.

Compare with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. the reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.

If we can do some delayed b+ tree insertion or deletion, we can improve the
performance, so we made this patch which implemented delayed directory name
index insertion/deletion and delayed inode update.

Implementation:
- introduce a delayed root object into the filesystem, that use two lists to
  manage the delayed nodes which are created for every file/directory.
  One is used to manage all the delayed nodes that have delayed items. And the
  other is used to manage the delayed nodes which is waiting to be dealt with
  by the work thread.
- Every delayed node has two rb-tree, one is used to manage the directory name
  index which is going to be inserted into b+ tree, and the other is used to
  manage the directory name index which is going to be deleted from b+ tree.
- introduce a worker to deal with the delayed operation. This worker is used
  to deal with the works of the delayed directory name index items insertion
  and deletion and the delayed inode update.
  When the delayed items is beyond the lower limit, we create works for some
  delayed nodes and insert them into the work queue of the worker, and then
  go back.
  When the delayed items is beyond the upper bound, we create works for all
  the delayed nodes that haven't been dealt with, and insert them into the work
  queue of the worker, and then wait for that the untreated items is below some
  threshold value.
- When we want to insert a directory name index into b+ tree, we just add the
  information into the delayed inserting rb-tree.
  And then we check the number of the delayed items and do delayed items
  balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search it
  in the inserting rb-tree at first. If we look it up, just drop it. If not,
  add the key of it into the delayed deleting rb-tree.
  Similar to the delayed inserting rb-tree, we also check the number of the
  delayed items and do delayed items balance.
  (The same to inserting manipulation)
- When we want to update the metadata of some inode, we cached the data of the
  inode into the delayed node. the worker will flush it into the b+ tree after
  dealing with the delayed insertion and deletion.
- We will move the delayed node to the tail of the list after we access the
  delayed node, By this way, we can cache more delayed items and merge more
  inode updates.
- If we want to commit transaction, we will deal with all the delayed node.
- the delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
  and the delayed inode update.

I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.

Before applying this patch:
Create files:
        Total files: 50000
        Total time: 1.096108
        Average time: 0.000022
Delete files:
        Total files: 50000
        Total time: 1.510403
        Average time: 0.000030

After applying this patch:
Create files:
        Total files: 50000
        Total time: 0.932899
        Average time: 0.000019
Delete files:
        Total files: 50000
        Total time: 1.215732
        Average time: 0.000024

[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3

Many thanks for Kitayama-san's help!
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Reviewed-by: NDavid Sterba <dave@jikos.cz>
Tested-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: NItaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

16cdcec7

18 5月, 2011 4 次提交

configfs: Fix race between configfs_readdir() and configfs_d_iput() · 24307aa1

由 Joel Becker 提交于 5月 18, 2011

configfs_readdir() will use the existing inode numbers of inodes in the
dcache, but it makes them up for attribute files that aren't currently
instantiated.  There is a race where a closing attribute file can be
tearing down at the same time as configfs_readdir() is trying to get its
inode number.

We want to get the inode number of open attribute files, because they
should match while instantiated.  We can't lock down the transition
where dentry->d_inode is set to NULL, so we just check for NULL there.
We can, however, ensure that an inode we find isn't iput() in
configfs_d_iput() until after we've accessed it.
Signed-off-by: NJoel Becker <jlbec@evilplan.org>

24307aa1

configfs: Don't try to d_delete() negative dentries. · df7f9967

由 Joel Becker 提交于 2月 22, 2011

When configfs is faking mkdir() on its subsystem or default group
objects, it starts by adding a negative dentry.  It then tries to
instantiate the group.  If that should fail, it must clean up after
itself.

I was using d_delete() here, but configfs_attach_group() promises to
return an empty dentry on error.  d_delete() explodes with the entry
dentry.  Let's try d_drop() instead.  The unhashing is what we want for
our dentry.
Signed-off-by: NJoel Becker <jlbec@evilplan.org>

df7f9967

cifs: fix cifsConvertToUCS() for the mapchars case · 11379b5e

由 Jeff Layton 提交于 5月 17, 2011

As Metze pointed out, commit 84cdf74e broke mapchars option:

    Commit "cifs: fix unaligned accesses in cifsConvertToUCS"
    (84cdf74e) does multiple steps
    in just one commit (moving the function and changing it without
    testing).

    put_unaligned_le16(temp, &target[j]); is never called for any
    codepoint the goes via the 'default' switch statement. As a result
    we put just zero (or maybe uninitialized) bytes into the target
    buffer.

His proposed patch looks correct, but doesn't apply to the current head
of the tree. This patch should also fix it.

Cc: <stable@kernel.org> # .38.x: 581ade4d: cifs: clean up various nits in unicode routines (try #2)
Reported-by: NStefan Metzmacher <metze@samba.org>
Signed-off-by: NJeff Layton <jlayton@redhat.com>
Signed-off-by: NSteve French <sfrench@us.ibm.com>

11379b5e

cifs: add fallback in is_path_accessible for old servers · 221d1d79

由 Jeff Layton 提交于 5月 17, 2011

The is_path_accessible check uses a QPathInfo call, which isn't
supported by ancient win9x era servers. Fall back to an older
SMBQueryInfo call if it fails with the magic error codes.

Cc: stable@kernel.org
Reported-and-Tested-by: NSandro Bonazzola <sandro.bonazzola@gmail.com>
Signed-off-by: NJeff Layton <jlayton@redhat.com>
Signed-off-by: NSteve French <sfrench@us.ibm.com>

221d1d79

15 5月, 2011 5 次提交

Btrfs: fix FS_IOC_SETFLAGS ioctl · ebcb904d

由 Li Zefan 提交于 4月 15, 2011

Steps to reproduce the bug:

  - Call FS_IOC_SETLFAGS ioctl with flags=FS_COMPR_FL
  - Call FS_IOC_SETFLAGS ioctl with flags=0
  - Call FS_IOC_GETFLAGS ioctl, and you'll see FS_COMPR_FL is still set!
Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

ebcb904d

Btrfs: fix FS_IOC_GETFLAGS ioctl · d0092bdd

由 Li Zefan 提交于 4月 15, 2011

As we've added per file compression/cow support.
Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

d0092bdd

fs: remove FS_COW_FL · e1e8fb6a

由 Li Zefan 提交于 4月 15, 2011

FS_COW_FL and FS_NOCOW_FL were newly introduced to control per file
COW in btrfs, but FS_NOCOW_FL is sufficient.

The fact is we don't have corresponding BTRFS_INODE_COW flag.

COW is default, and FS_NOCOW_FL can be used to switch off COW for
a single file.

If we mount btrfs with nodatacow, a newly created file will be set with
the FS_NOCOW_FL flag. So to turn on COW for it, we can just clear the
FS_NOCOW_FL flag.
Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

e1e8fb6a

Btrfs: fix easily get into ENOSPC in mixed case · 1aba86d6

由 liubo 提交于 4月 08, 2011

When a btrfs disk is created by mixed data & metadata option, it will have no
pure data or pure metadata space info.

In btrfs's for-linus branch, commit 78b1ea13838039cd88afdd62519b40b344d6c920
(Btrfs: fix OOPS of empty filesystem after balance) initializes space infos at
the very beginning.  The problem is this initialization does not take the mixed
case into account, which will cause btrfs will easily get into ENOSPC in mixed
case.
Signed-off-by: NLiu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

1aba86d6

Prevent oopsing in posix_acl_valid() · f5de9391

由 Daniel J Blueman 提交于 5月 03, 2011

If posix_acl_from_xattr() returns an error code, a negative address is
dereferenced causing an oops; fix by checking for error code first.
Signed-off-by: NDaniel J Blueman <daniel.blueman@gmail.com>
Reviewed-by: NJosef Bacik <josef@redhat.com>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

f5de9391

14 5月, 2011 8 次提交

vfs: micro-optimize acl_permission_check() · 26cf46be

由 Linus Torvalds 提交于 5月 13, 2011

It's a hot function, and we're better off not mixing types in the mask
calculations.  The compiler just ends up mixing 16-bit and 32-bit
operations, for no good reason.

So do everything in 'unsigned int' rather than mixing 'unsigned int'
masking with a 'umode_t' (16-bit) mode variable.

This, together with the parent commit (47a150ed: "Cache user_ns in
struct cred") makes acl_permission_check() much nicer.
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

26cf46be

ocfs2/dlm: Target node death during resource migration leads to thread spin · df016c66

由 Sunil Mushran 提交于 5月 04, 2011

During resource migration, if the target node were to die, the thread doing
the migration spins until the target node is not removed from the domain map.
This patch slows the spin by making the thread wait for the recovery to kick in.
Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: NJoel Becker <jlbec@evilplan.org>

df016c66

ocfs2: Skip mount recovery for hard-ro mounts · 10b3dd76

由 Sunil Mushran 提交于 5月 04, 2011

Patch skips mount recovery for hard-ro mounts which otherwise leads to an oops.
Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com>
Acked-by: NMark Fasheh <mfasheh@suse.com>
Signed-off-by: NJoel Becker <jlbec@evilplan.org>

10b3dd76

ocfs2/cluster: Heartbeat mismatch message improved · 33c12a54

由 Sunil Mushran 提交于 5月 04, 2011

If o2hb finds unexpected values in the heartbeat slot, it prints a message
"ERROR: Device "dm-6": another node is heartbeating in our slot!"

This message could be misleading. This patch adds two more messages to
help users better diagnose the problem.
Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com>
Acked-by: NMark Fasheh <mfasheh@suse.com>
Signed-off-by: NJoel Becker <jlbec@evilplan.org>

33c12a54

ocfs2/cluster: Increase the live threshold for global heartbeat · 76d9fc29

由 Sunil Mushran 提交于 5月 04, 2011

We have seen isolated cases (very few, I might add) of o2hb not detecting all
live nodes on startup. One plausible reasoning for it is that other node had
a hb io delay at the same time. The live threshold set at 2 (as low as it can
be) could be increased to ameliorate the situation.

But increasing the threshold directly affects mount time. Currently it takes
around 5 secs to mount a volume in o2cb cluster with local heartbeat. Increasing
the threshold will make mounts even slower. As the issue itself is rare, we have
left things as they are for the local heartbeat mode.

However we can improve the situation for global heartbeat mode as in that mode,
we start the heartbeat much before the mount.

This patch doubles the live threshold for the start of the first region in
global heartbeat mode.

Addresses internal Oracle bug#10635585.
Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com>
Acked-by: NMark Fasheh <mfasheh@suse.com>
Signed-off-by: NJoel Becker <jlbec@evilplan.org>

76d9fc29

ocfs2/dlm: Use negotiated o2dlm protocol version · 4da6dc29

由 Sunil Mushran 提交于 5月 04, 2011

Patch fixes a bug in the o2dlm protocol negotiation in that it is using
the builtin version rather than the negotiated version during the domain
join. This causes join errors when a node having kernel >= 2.6.37 joins
a cluster with nodes having kernels < 2.6.37.

This only affects the o2cb cluster stack.
Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com>
Reported-by: NJacek Stepniewski <Jacek.Stepniewski@agora.pl>
Acked-by: NMark Fasheh <mfasheh@suse.com>
Signed-off-by: NJoel Becker <jlbec@evilplan.org>

4da6dc29

ocfs2: skip existing hole when removing the last extent_rec in punching-hole codes. · 9a790ba1

由 Tristan Ye 提交于 5月 12, 2011

In the case of removing a partial extent record which covers a hole, current
punching-hole logic will try to remove more than the length of whole extent
record, which leads to the failure of following assert(fs/ocfs2/alloc.c):

5507 BUG_ON(cpos < le32_to_cpu(rec->e_cpos) || trunc_range > rec_range);

This patch tries to skip existing hole at the last attempt of removing a partial
extent record, what's more, it also adds some necessary comments for better
understanding of punching-hole codes.
Signed-off-by: NTristan Ye <tristan.ye@oracle.com>
Signed-off-by: NJoel Becker <jlbec@evilplan.org>

9a790ba1

ocfs2: Initialize data_ac (might be used uninitialized) · 5d44670f

由 Marcus Meissner 提交于 5月 05, 2011

CLANG found that there is a path that has data_ac uninitialized,
this place
	2917	/* This gets us the dx_root */
	2918	ret = ocfs2_reserve_new_metadata_blocks(osb, 1, &meta_ac);
	2919	if (ret) {

	3
		Taking true branch
	2920	mlog_errno(ret);
	2921	goto out;

	4
		Control jumps to line 3168
	2922	}

Goes to the out: label without data_ac being initialized.

Ciao, Marcus
Signed-Off-By: NMarcus Meissner <meissner@suse.de>
Signed-off-by: NMark Fasheh <mfasheh@suse.com>
Signed-off-by: NJoel Becker <jlbec@evilplan.org>

5d44670f

12 5月, 2011 6 次提交

NFSv4.1: Ensure that layoutget uses the correct gfp modes · a75b9df9

由 Trond Myklebust 提交于 5月 11, 2011

Currently, writebacks may end up recursing back into the filesystem due to
GFP_KERNEL direct reclaims in the pnfs subsystem.
Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>

a75b9df9

NFSv4.1: remove pnfs_layout_hdr from pnfs_destroy_all_layouts tmp_list · 2887fe45

由 Andy Adamson 提交于 5月 11, 2011

Prevents an infinite loop as list was never emptied.
Signed-off-by: NAndy Adamson <andros@netapp.com>
Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>

2887fe45

NFSv41: Resend on NFS4ERR_RETRY_UNCACHED_REP · a8a4ae3a

由 Andy Adamson 提交于 5月 03, 2011

Free the slot and resend the RPC with new session <slot#,seq#>.

For nfs4_async_handle_error, return -EAGAIN and set the task->tk_status to 0
to restart the async rpc in the rpc_restart_call_prepare state which resets
the slot.

For nfs4_handle_exception, retrying a call that uses nfs4_call_sync will
reset the slot via nfs41_call_sync_prepare.

For open/close/lock/locku/delegreturn/layoutcommit/unlink/rename/write
cachethis is true, so these operations will not trigger an
NFS4ERR_RETRY_UNCACHED_REP.
Signed-off-by: NAndy Adamson <andros@netapp.com>
Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>

a8a4ae3a

ceph: do not use i_wrbuffer_ref as refcount for Fb cap · d3d0720d

由 Henry C Chang 提交于 5月 11, 2011

We increments i_wrbuffer_ref when taking the Fb cap. This breaks
the dirty page accounting and causes looping in
__ceph_do_pending_vmtruncate, and ceph client hangs.

This bug can be reproduced occasionally by running blogbench.

Add a new field i_wb_ref to inode and dedicate it to Fb reference
counting.
Signed-off-by: NHenry C Chang <henry.cy.chang@gmail.com>
Signed-off-by: NSage Weil <sage@newdream.net>

d3d0720d

ceph: fix list_add in ceph_put_snap_realm · a26a185d

由 Henry C Chang 提交于 5月 11, 2011

Signed-off-by: NHenry C Chang <henry.cy.chang@gmail.com>
Signed-off-by: NSage Weil <sage@newdream.net>

a26a185d

ceph: print debug message before put mds session · 7d8e18a6

由 Henry C Chang 提交于 5月 11, 2011

The mds session, s, could be freed during ceph_put_mds_session.
Move dout before ceph_put_mds_session.
Signed-off-by: NHenry C Chang <henry.cy.chang@gmail.com>
Signed-off-by: NSage Weil <sage@newdream.net>

7d8e18a6

10 5月, 2011 16 次提交

fuse: fix oops in revalidate when called with NULL nameidata · d2433905

由 Miklos Szeredi 提交于 5月 10, 2011

Some cases (e.g. ecryptfs) can call ->dentry_revalidate with NULL
nameidata.

https://bugzilla.kernel.org/show_bug.cgi?id=34732

Tyler Hicks pointed out that this bug was introduced by commit
e7c0a167 "fuse: make fuse_dentry_revalidate() RCU aware"
Reported-by: NWitold Baryluk <baryluk@smp.if.uj.edu.pl>
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>

d2433905

nilfs2: fix infinite loop in nilfs_palloc_freev function · 349dbc36

由 Ryusuke Konishi 提交于 5月 10, 2011

After having applied commit 9954e7af ("nilfs2: add free
entries count only if clear bit operation succeeded"), a free routine
of nilfs came to fall into an infinite loop, outputting the same
message endlessly:

 nilfs_palloc_freev: entry number 29497 already freed
 nilfs_palloc_freev: entry number 29497 already freed
 nilfs_palloc_freev: entry number 29497 already freed
 nilfs_palloc_freev: entry number 29497 already freed
 nilfs_palloc_freev: entry number 29497 already freed ...

That patch broke the routine so that a loop counter is never updated
in an abnormal state.  This fixes the regression.
Signed-off-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>

349dbc36

xfs: fix race condition in AIL push trigger · 7ac95657