Commits · db38d5ad323362bfca118b52fe5906f97a69fb45 · openanolis / cloud-kernel

20 Nov, 2009 23 commits

nilfs2: add cache framework for persistent object allocator · db38d5ad

Committed by Ryusuke Konishi on Nov 14, 2009

This adds setup and cleanup routines of the persistent object
allocator cache.

According to ftrace analyses, accessing buffers of the DAT file
repeatedly incurs non-negligible overhead.  To mitigate the overhead,
this introduces a cache framework for the persistent object allocator
(palloc), which the DAT file and ifile use.

struct nilfs_palloc_cache represents the cache object per metadata
file using palloc.

The cache is initialized through nilfs_palloc_setup_cache() and
destroyed by nilfs_palloc_destroy_cache(); callers of the former
function will be added to the individual allocators of DAT and ifile
in subsequent patches.

nilfs_palloc_destroy_cache() will be called from nilfs_mdt_destroy()
if the cache is attached to a metadata file.  A companion function,
nilfs_palloc_clear_cache(), is provided to allow releasing buffer head
references independently of the cleanup task.  This auxiliary
function will be used before invalidating pages of a metadata file
with the cache.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>

nilfs2: unfold nilfs_palloc_block_get_bitmap function · 141bbdba

Committed by Ryusuke Konishi on Nov 14, 2009

This expands a trivial address calculation in the function into each
of its call sites.  This expansion improves the readability of the callers.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>

nilfs2: eliminate nilfs_btnode_get function · 1376e931

Committed by Ryusuke Konishi on Nov 13, 2009

This removes the obsolete nilfs_btnode_get() function and makes
nilfs_btree_get_block() directly call nilfs_btnode_submit_block().

This expansion will provide a better opportunity for code optimization.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>

nilfs2: remove newblk argument from nilfs_btnode_submit_block · 75f65edf

Committed by Ryusuke Konishi on Nov 13, 2009

This removes the obsolete argument from nilfs_btnode_submit_block().
This completes the separation of the create function for btree nodes.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>

nilfs2: use nilfs_btnode_create_block function · 45f4910b

Committed by Ryusuke Konishi on Nov 13, 2009

This replaces the use of nilfs_btnode_get() for creating a new btree
node block with nilfs_btnode_create_block().
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>

nilfs2: separate function for creating new btree node block · d501d736

Committed by Ryusuke Konishi on Nov 13, 2009

This adds a separate routine for creating a btree node block.  It
prepares for reducing the depth of function calls when submitting a
btree node buffer.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>

nilfs2: avoid readahead on metadata file for create mode · b34a6506

Committed by Ryusuke Konishi on Nov 14, 2009

This turns off readahead on a metadata file if the nilfs_mdt_get_block()
function was called with a create flag.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>

nilfs2: simplify nilfs_sufile_get_ncleansegs function · ef7d4757

Committed by Ryusuke Konishi on Nov 13, 2009

Previously, this function took a status code to return possible error
codes.  The ("nilfs2: add local variable to cache the number of clean
segments") patch removed the possibility of returning errors.

So, this simplifies the function definition to make it directly return
the number of clean segments.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>

nilfs2: add local variable to cache the number of clean segments · aa474a22

Committed by Ryusuke Konishi on Nov 13, 2009

This makes it possible for sufile to get the number of clean segments
faster.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>

nilfs2: unfold nilfs_sufile_block_get_header function · 7b16c8a2

Committed by Ryusuke Konishi on Nov 13, 2009

This unfolds the nilfs_sufile_block_get_header() function for
simplicity.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>

nilfs2: hide nilfs_mdt_clear calls in nilfs_mdt_destroy · fd66c0d5

Committed by Ryusuke Konishi on Nov 13, 2009

This hides the call to nilfs_mdt_clear() inside
nilfs_mdt_destroy().

This ensures that nilfs_mdt_destroy() performs the cleanup jobs
included in nilfs_mdt_clear().
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>

nilfs2: eliminate inlines to directly read/write inode of metadata files · 3961f0e2

Committed by Ryusuke Konishi on Nov 13, 2009

Removes two inline functions: nilfs_mdt_read_inode_direct() and
nilfs_mdt_write_inode_direct().
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>

nilfs2: separate read method of meta data files on super root block · 8707df38

Committed by Ryusuke Konishi on Nov 13, 2009

This replaces the nilfs_mdt_read_inode_direct() function with individual
read methods: nilfs_dat_read(), nilfs_sufile_read(), and nilfs_cpfile_read().

This provides the opportunity to initialize local variables of each
metadata file after reading the inode.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>

nilfs2: separate constructor of metadata files · 79739565

Committed by Ryusuke Konishi on Nov 12, 2009

This replaces the nilfs_mdt_new() constructor with individual
metadata file constructors like nilfs_dat_new(), nilfs_sufile_new(),
nilfs_cpfile_new(), and nilfs_ifile_new().

This makes it possible for each metadata file to have its own
initialization code.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>

nilfs2: add size option of private object to metadata file allocator · 5731e191

Committed by Ryusuke Konishi on Nov 12, 2009

This adds an optional "object size" argument to the nilfs_mdt_new_common()
function; the argument specifies the size of the private object attached
to a newly allocated metadata file inode.

This provides space to keep local variables for metadata files.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>

nilfs2: move out mark_inode_dirty calls from bmap routines · 9cb4e0d2

Committed by Ryusuke Konishi on Nov 06, 2009

Previously, nilfs_bmap_add_blocks() and nilfs_bmap_sub_blocks() called
mark_inode_dirty() after they changed the number of data blocks.

This moves these calls out to the outermost bmap functions such as
nilfs_bmap_insert() or nilfs_bmap_truncate().

This mitigates overhead for truncate or delete operations since
they repeatedly remove sets of blocks.  Nearly 10 percent improvement
was observed for removal of a large file:

 # dd if=/dev/zero of=/test/aaa bs=1M count=512
 # time rm /test/aaa

  real  2.968s -> 2.705s

Further optimization may be possible by eliminating these
mark_inode_dirty() uses, though I avoid mixing separate changes here.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
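
The idea of moving the dirty-marking call from the per-block helpers to the outermost operation can be sketched in plain C. This is only an illustration under stated assumptions: `demo_inode`, `demo_mark_dirty()`, and both truncate helpers are hypothetical stand-ins, not nilfs2 code, and the counter merely makes the number of calls visible.

```c
/* Hypothetical stand-in for an inode; not the kernel's struct inode. */
struct demo_inode {
    unsigned long nblocks;     /* number of data blocks */
    unsigned long dirty_marks; /* how many times the inode was marked dirty */
};

/* Stand-in for mark_inode_dirty(); here it only counts invocations. */
static void demo_mark_dirty(struct demo_inode *inode)
{
    inode->dirty_marks++;
}

/* Before: every single block removal marks the inode dirty. */
static void demo_truncate_inner(struct demo_inode *inode, unsigned long n)
{
    while (n-- > 0 && inode->nblocks > 0) {
        inode->nblocks--;
        demo_mark_dirty(inode);
    }
}

/* After: blocks are removed in the loop, the inode is marked dirty once. */
static void demo_truncate_outer(struct demo_inode *inode, unsigned long n)
{
    while (n-- > 0 && inode->nblocks > 0)
        inode->nblocks--;
    demo_mark_dirty(inode);
}
```

Removing five blocks with the first helper marks the inode dirty five times, while the second marks it once; the final block count is the same in both cases.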

nilfs2: stop marking metadata inode dirty within btree operations · 09bf4aae

Committed by Ryusuke Konishi on Nov 05, 2009

Since metadata file routines mark the inode dirty after they
successfully changed bmap objects, nilfs_mdt_mark_dirty() calls in
nilfs_bmap_add_blocks() and nilfs_bmap_sub_blocks() are redundant.

This removes these overlapping calls from the bmap routines.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>

nilfs2: remove buffer locking from btree code · 30db4e6c

Committed by Ryusuke Konishi on Nov 11, 2009

The uses of lock_buffer() and unlock_buffer() in btree.c can be
eliminated because btree functions obtain buffer heads through
nilfs_btnode_get(), which never returns an on-the-fly buffer.

Although nilfs_clear_dirty_page() and nilfs_copy_back_pages() in
nilfs_commit_gcdat_inode() juggle btree node buffers of DAT, this is
safe because these operations are protected by a log writer lock or
the metadata file semaphore of DAT.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>

nilfs2: remove buffer locking in nilfs_mark_inode_dirty · a49762fd

Committed by Ryusuke Konishi on Nov 11, 2009

This lock can be eliminated because inodes on the buffer can be updated
independently.  Although a log writer also fills in bmap data on the
on-disk inodes, this update is performed exclusively under a log writer lock.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>

nilfs2: cleanup unused match_bool function · e2073e78

Committed by Jiro SEKIBA on Nov 12, 2009

The match_bool function is no longer used.
Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>

nilfs2: Using nobarrier option instead of barrier=off · 91f1953b

Committed by Jiro SEKIBA on Nov 12, 2009

Since most filesystems use nofoobar-style options,
this changes the barrier=off option to nobarrier.
Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>

nilfs2: move definition of struct nilfs_btree_node · 6600b9dd

Committed by Jiro SEKIBA on Nov 09, 2009

This is a trivial patch to expose struct nilfs_btree_node.
The struct should be exposed outside the kernel because it is part of
the on-disk format.
Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>

nilfs2: get rid of BUG_ON use in btree lookup routines · 9b945d53

Committed by Ryusuke Konishi on Oct 10, 2009

The current btree lookup routines cause a kernel oops when they detect
an inconsistency in btree blocks.  These routines should instead return a
proper error code because the inconsistency usually comes from
corruption of on-disk metadata.

This fixes the issue by converting BUG_ON calls to proper error
handling.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
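
The general pattern of reporting an inconsistency through a return value instead of an assertion can be sketched as follows. This is a minimal sketch, not the nilfs2 implementation: `demo_node`, `DEMO_MAX_CHILDREN`, and `demo_check_node()` are hypothetical, and the choice of `-EINVAL` as the error code is an assumption made for illustration.

```c
#include <errno.h>

/* Hypothetical in-memory view of a tree node read from disk. */
struct demo_node {
    int level;     /* level recorded in the node */
    int nchildren; /* number of child pointers in the node */
};

#define DEMO_MAX_CHILDREN 64

/*
 * Return 0 if the node looks sane for the expected level, or -EINVAL
 * if it does not.  An assertion here would bring the whole system down
 * on corrupted metadata; an error code lets the caller fail one lookup.
 */
static int demo_check_node(const struct demo_node *node, int expected_level)
{
    if (node->level != expected_level)
        return -EINVAL;
    if (node->nchildren < 0 || node->nchildren > DEMO_MAX_CHILDREN)
        return -EINVAL;
    return 0;
}
```

A caller would propagate the negative value upward so that the failed lookup surfaces as an I/O-style error rather than an oops.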

18 Nov, 2009 4 commits

fcntl: rename F_OWNER_GID to F_OWNER_PGRP · 978b4053

Committed by Peter Zijlstra on Nov 17, 2009

This is for consistency with various ioctl() operations that include the
suffix "PGRP" in their names, and also for consistency with PRIO_PGRP,
used with setpriority() and getpriority().  Also, using PGRP instead of
GID avoids confusion with the common abbreviation of "group ID".

I'm fine with anything that makes it more consistent, and if PGRP is what
is the predominant abbreviation then I see no need to further confuse
matters by adding a third one.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

procfs: fix /proc/<pid>/stat stack pointer for kernel threads · 9ebd4eba

Committed by Stefani Seibold on Nov 17, 2009

Fix a small issue for the stack pointer in /proc/<pid>/stat.  In case of a
kernel thread the value of the printed stack pointer should be 0.
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

xfs: copy li_lsn before dropping AIL lock · 6c06f072

Committed by Nathaniel W. Turner on Nov 16, 2009

Access to log items on the AIL is generally protected by m_ail_lock;
this is particularly needed when we're getting or setting the 64-bit
li_lsn on a 32-bit platform.  This patch fixes a couple places where we
were accessing the log item after dropping the AIL lock on 32-bit
machines.

This can result in a partially-zeroed log->l_tail_lsn if
xfs_trans_ail_delete is racing with xfs_trans_ail_update, and in at
least some cases, this can leave the l_tail_lsn with a zero cycle
number, which means xlog_space_left will think the log is full (unless
CONFIG_XFS_DEBUG is set, in which case we'll trip an ASSERT), leading to
processes stuck forever in xlog_grant_log_space.

Thanks to Adrian VanderSpek for first spotting the race potential and to
Dave Chinner for debug assistance.
Signed-off-by: Nathaniel W. Turner <nate@houseofnate.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
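
The "copy the value before dropping the lock" pattern can be sketched in user-space C. This is not the XFS code: `demo_log_item`, `demo_ail`, and `demo_ail_min_lsn()` are hypothetical names, and a pthread mutex stands in for the kernel lock. The point is only that the 64-bit field is read into a local while the lock is held, so a concurrent writer cannot produce a half-updated value on a machine that loads 64 bits in two steps.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical log item carrying a 64-bit sequence number. */
struct demo_log_item {
    uint64_t lsn; /* not guaranteed to be read atomically on 32-bit CPUs */
};

/* Hypothetical list head protected by a lock. */
struct demo_ail {
    pthread_mutex_t lock;
    struct demo_log_item *min_item; /* may be null when the list is empty */
};

/* Snapshot the LSN while the lock is still held, then return the copy. */
static uint64_t demo_ail_min_lsn(struct demo_ail *ail)
{
    uint64_t lsn = 0;

    pthread_mutex_lock(&ail->lock);
    if (ail->min_item)
        lsn = ail->min_item->lsn; /* copy under the lock */
    pthread_mutex_unlock(&ail->lock);

    return lsn; /* the item itself is never touched after unlock */
}
```

The buggy shape would instead return a pointer to the item, unlock, and read `lsn` afterwards, which leaves a window for a torn read.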

XFS bug in log recover with quota (bugzilla id 855) · 8ec6dba2

Committed by Jan Rekorajski on Nov 16, 2009

Hi,
I was hit by a bug in linux 2.6.31 when XFS is not able to recover the
log after a crash if fs was mounted with quotas. Gory details in XFS
bugzilla: http://oss.sgi.com/bugzilla/show_bug.cgi?id=855.

It looks like wrong struct is used in buffer length check, and the following
patch should fix the problem.

xfs_dqblk_t has a size of 104+32 bytes, while xfs_disk_dquot_t is 104 bytes
long, and this is exactly what I see in system logs - "XFS: dquot too small
(104) in xlog_recover_do_dquot_trans."
Signed-off-by: Jan Rekorajski <baggins@sith.mimuw.edu.pl>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
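
The size mismatch described above can be reproduced with two stand-in structures. These are not the XFS definitions: `demo_disk_dquot` and `demo_dqblk` are hypothetical types sized to match the figures in the report (104 bytes and 104+32 bytes), and the two check functions only illustrate how comparing against the larger type rejects a valid 104-byte record.

```c
#include <stddef.h>

/* Stand-ins sized like the figures in the report: 104 and 104+32 bytes. */
struct demo_disk_dquot {
    unsigned char core[104];
};

struct demo_dqblk {
    struct demo_disk_dquot dd;
    unsigned char fill[32];
};

/* Buggy check: compares the logged length against the larger wrapper. */
static int demo_len_ok_buggy(size_t len)
{
    return len >= sizeof(struct demo_dqblk);
}

/* Fixed check: compares against the structure that is actually logged. */
static int demo_len_ok_fixed(size_t len)
{
    return len >= sizeof(struct demo_disk_dquot);
}
```

With a 104-byte record, the first check fails (the "dquot too small (104)" symptom) and the second passes.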

16 Nov, 2009 1 commit

cifs: clear server inode number flag while autodisabling · f534dc99

Committed by Suresh Jayaraman on Nov 16, 2009

Fix commit ec06aedd, which was intended to turn off querying for server
inode numbers when the server doesn't consistently support inode numbers.
That commit didn't actually clear the CIFS_MOUNT_SERVER_INUM flag,
presumably because of a typo.
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Acked-by: Jeff Layton <jlayton@redhat.com>
Cc: Stable <stable@kernel.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>

15 Nov, 2009 2 commits

nilfs2: deleted inconsistent comment in nilfs_load_inode_block() · 18dafac1

Committed by Jiro SEKIBA on Nov 15, 2009

The comment says, "Caller of this function MUST lock s_inode_lock".
However, just above the comment, the function itself takes s_inode_lock.
Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>

Fix memory corruption caused by nfsd readdir+ · 479c2553

Committed by Petr Vandrovec on Nov 14, 2009

Commit 8177e6d6 ("nfsd: clean up readdirplus encoding") introduced a
single-character typo in the nfs3 readdir+ implementation.  Unfortunately
that typo has quite bad side effects: random memory corruption, followed
(on my box) by an immediate spontaneous reboot.

Using 'p1' instead of 'p' fixes my Linux box rebooting whenever a VMware
ESXi box tries to list the contents of my home directory.
Signed-off-by: Petr Vandrovec <petr@vandrovec.name>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

13 Nov, 2009 1 commit

nilfs2: fix lock order reversal in chcp operation · c1ea985c

Committed by Ryusuke Konishi on Nov 12, 2009

This fixes the following lock order reversal detected by lockdep:

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.32-rc6 #7
-------------------------------------------------------
chcp/30157 is trying to acquire lock:
 (&nilfs->ns_mount_mutex){+.+.+.}, at: [<fed7cfcc>] nilfs_cpfile_change_cpmode+0x46/0x752 [nilfs2]

but task is already holding lock:
 (&nilfs->ns_segctor_sem){++++.+}, at: [<fed7ca32>] nilfs_transaction_begin+0xba/0x110 [nilfs2]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&nilfs->ns_segctor_sem){++++.+}:
       [<c105799c>] __lock_acquire+0x109c/0x139d
       [<c1057d26>] lock_acquire+0x89/0xa0
       [<c14151e2>] down_read+0x31/0x45
       [<fed6d77b>] nilfs_attach_checkpoint+0x8f/0x16b [nilfs2]
       [<fed6e393>] nilfs_get_sb+0x3e7/0x653 [nilfs2]
       [<c10c0ccb>] vfs_kern_mount+0x8b/0x124
       [<c10c0db2>] do_kern_mount+0x37/0xc3
       [<c10d7517>] do_mount+0x64d/0x69d
       [<c10d75cd>] sys_mount+0x66/0x95
       [<c1002a14>] sysenter_do_call+0x12/0x32

-> #1 (&type->s_umount_key#31/1){+.+.+.}:
       [<c105799c>] __lock_acquire+0x109c/0x139d
       [<c1057d26>] lock_acquire+0x89/0xa0
       [<c104c0f3>] down_write_nested+0x34/0x52
       [<c10c08fe>] sget+0x22e/0x389
       [<fed6e133>] nilfs_get_sb+0x187/0x653 [nilfs2]
       [<c10c0ccb>] vfs_kern_mount+0x8b/0x124
       [<c10c0db2>] do_kern_mount+0x37/0xc3
       [<c10d7517>] do_mount+0x64d/0x69d
       [<c10d75cd>] sys_mount+0x66/0x95
       [<c1002a14>] sysenter_do_call+0x12/0x32

-> #0 (&nilfs->ns_mount_mutex){+.+.+.}:
       [<c1057727>] __lock_acquire+0xe27/0x139d
       [<c1057d26>] lock_acquire+0x89/0xa0
       [<c1414d63>] mutex_lock_nested+0x41/0x23e
       [<fed7cfcc>] nilfs_cpfile_change_cpmode+0x46/0x752 [nilfs2]
       [<fed801b2>] nilfs_ioctl+0x11a/0x7da [nilfs2]
       [<c10cca12>] vfs_ioctl+0x27/0x6e
       [<c10ccf93>] do_vfs_ioctl+0x491/0x4db
       [<c10cd022>] sys_ioctl+0x45/0x5f
       [<c1002a14>] sysenter_do_call+0x12/0x32
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>

12 Nov, 2009 9 commits

__generic_block_fiemap(): fix for files bigger than 4GB · e04b5ef8

Committed by Mike Hommey on Nov 11, 2009

Because of an integer overflow on start_blk, various kinds of wrong results
would be returned by the generic_block_fiemap() handler, such as no
extents when there is a 4GB+ hole at the beginning of the file, or a wrong
fe_logical when an extent starts after the first 4GB.
Signed-off-by: Mike Hommey <mh@glandium.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Eric Sandeen <sandeen@sgi.com>
Cc: Josef Bacik <jbacik@redhat.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
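
The kind of overflow described here can be shown with a block-number-to-byte-offset conversion. This is a sketch under stated assumptions, not the kernel's fiemap code: both helper names are hypothetical, and the example assumes a platform where `unsigned int` is 32 bits wide, so a shift performed on a 32-bit operand wraps at 4 GiB.

```c
#include <stdint.h>

/* Buggy pattern: the shift is evaluated in 32 bits and wraps at 4 GiB. */
static uint64_t demo_blk_to_logical_narrow(uint32_t blk, unsigned int blkbits)
{
    return blk << blkbits;
}

/* Fixed pattern: widen to 64 bits first, then shift. */
static uint64_t demo_blk_to_logical_wide(uint32_t blk, unsigned int blkbits)
{
    return (uint64_t)blk << blkbits;
}
```

For block 1048576 with 4096-byte blocks (`blkbits` of 12), the true offset is exactly 4 GiB; the narrow version wraps to 0 while the wide version returns 4294967296.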

exec: setup_arg_pages() fails to return errors · fc63cf23

Committed by Anton Blanchard on Nov 11, 2009

In setup_arg_pages we work hard to assign a value to ret, but on exit we
always return 0.

Also remove a now duplicated exit path and branch to out_unlock instead.
Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
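
The shape of this bug, where an error is stored in `ret` but the function still returns 0, can be sketched as below. The functions are hypothetical stand-ins and do not reproduce setup_arg_pages(); only the `out_unlock` label name is borrowed from the message, and `-ENOMEM` is an assumed example error.

```c
#include <errno.h>

/* Hypothetical step that can fail; stands in for the real work. */
static int demo_step(int should_fail)
{
    return should_fail ? -ENOMEM : 0;
}

/* Buggy shape: ret is assigned, but the function always returns 0. */
static int demo_setup_buggy(int should_fail)
{
    int ret;

    ret = demo_step(should_fail);
    if (ret)
        goto out_unlock;
    /* ... further setup would go here ... */
out_unlock:
    return 0; /* the error stored in ret is lost */
}

/* Fixed shape: the single exit path returns whatever ret holds. */
static int demo_setup_fixed(int should_fail)
{
    int ret;

    ret = demo_step(should_fail);
    if (ret)
        goto out_unlock;
    /* ... further setup would go here ... */
out_unlock:
    return ret;
}
```

With a failing step, the buggy variant reports success and the fixed variant returns the negative error code to its caller.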

fs: add missing compat_ptr handling for FS_IOC_RESVSP ioctl · 7779d7be

Committed by Heiko Carstens on Nov 11, 2009

For FS_IOC_RESVSP and FS_IOC_RESVSP64 compat_sys_ioctl() uses its
arg argument as a pointer to userspace. However it is missing
a call to compat_ptr() which will do a proper pointer conversion.

This was introduced with 3e63cbb1 "fs: Add new pre-allocation ioctls
to vfs for compatibility with legacy xfs ioctls".
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ankit Jain <me@ankitjain.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Arnd Bergmann <arndbergmann@googlemail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: <stable@kernel.org>		[2.6.31.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

pidns: fix a leak in /proc dentries and inodes with pid namespaces. · 29f12ca3

Committed by Sukadev Bhattiprolu on Nov 11, 2009

Daniel Lezcano reported a leak in 'struct pid' and 'struct pid_namespace'
that is discussed in:

	http://lkml.org/lkml/2009/10/2/159.

To summarize the thread, when container-init is terminated, it sets the
PF_EXITING flag, zaps other processes in the container and waits to reap
them.  As a part of reaping, the container-init should flush any /proc
dentries associated with the processes.  But because the container-init is
itself exiting and the following PF_EXITING check, the dentries are not
flushed, resulting in leak in /proc inodes and dentries.

This fix reverts the commit 7766755a ("Fix /proc dcache deadlock
in do_exit") which introduced the check for PF_EXITING.  At the time of
the commit, shrink_dcache_parent() flushed dentries from other filesystems
also and could have caused a deadlock which the commit fixed.  But as
pointed out by Eric Biederman, after commit 0feae5c4,
shrink_dcache_parent() no longer affects other filesystems.  So reverting
the commit is now safe.

As pointed out by Jan Kara, the leak is not as critical since the
unclaimed space will be reclaimed under memory pressure or by:

	echo 3 > /proc/sys/vm/drop_caches

But since this check is no longer required, it's best to remove it.
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Reported-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Jan Kara <jack@ucw.cz>
Cc: Andrea Arcangeli <andrea@cpushare.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fs/jbd: Export log_start_commit to fix ext3 build. · ff5e4b51

Committed by Stefan Schmidt on Nov 12, 2009

This fixes:
ERROR: "log_start_commit" [fs/ext3/ext3.ko] undefined!
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>

Btrfs: fix panic when trying to destroy a newly allocated · a6dbd429

Committed by Josef Bacik on Nov 11, 2009

There is a problem where iget5_locked will look for an inode, not find it, and
then subsequently try to allocate it. Another CPU will have raced in and
allocated the inode instead, so when iget5_locked gets the inode spin lock again
and does a search, it finds the new inode. So it goes ahead and calls
destroy_inode on the inode it just allocated. The problem is we don't set
BTRFS_I(inode)->root until the new inode is completely initialized. This patch
makes us set root to NULL when alloc'ing a new inode, so when we get to
btrfs_destroy_inode and we see that root is NULL we can just free up the memory
and continue on. This fixes the panic

http://www.kerneloops.org/submitresult.php?number=812690

Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: allow more metadata chunk preallocation · 33b25808

Committed by Chris Mason on Nov 11, 2009

On an FS where all of the space has not been allocated into chunks yet,
the enospc code can return ENOSPC just because the existing metadata chunks
are full.

We get around this by allowing more metadata chunks to be allocated up
to a certain limit, and finding the right limit is a little fuzzy.  The
problem is the reservations for delalloc would preallocate way too much
of the FS as metadata.  We need to start saying no and just force some
IO to happen.

But we also need to let a reasonable amount of the FS become metadata.
This bumps the hard limit up; later releases will have a better system.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: fallback on uncompressed io if compressed io fails · f5a84ee3

Committed by Josef Bacik on Nov 10, 2009

Currently compressed IO does not deal with not being able to allocate its
entire extent.  So if we have enough free space to allocate for the extent, but
it's not contiguous, it will fail spectacularly.  This patch fixes this by
falling back on uncompressed IO, which lets us spread the delalloc extent across
multiple extents.  I tested this by making us randomly think the reservation had
failed so that it falls back on the uncompressed IO path, and it seemed to work
fine.  Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: find ideal block group for caching · ccf0e725

Committed by Josef Bacik on Nov 10, 2009

This patch changes a few things.  Hopefully the comments are helpful, but
I'll try to be as verbose as I can here.

Problem:

My fedora box was taking 1 minute and 21 seconds to boot with btrfs as root.
Part of this problem was we pick the first block group we can find and start
caching it, even if it may not have enough free space. The other problem is
we only search for cached block groups the first time around, which we won't
find any cached block groups because this is a newly mounted fs, so we end up
caching several block groups during bootup, which with a lot of fragmentation
takes around 30-45 seconds to complete, which bogs down the system. So

Solution:

1) Don't cache block groups willy-nilly at first. Instead try and figure out
which block group has the most free, and therefore will take the least amount
of time to cache.

2) Don't be so picky about cached block groups. The other problem is once
we've filled up a cluster, if the block group isn't finished caching the next
time we try and do the allocation we'll completely ignore the cluster and
start searching from the beginning of the space, which makes us cache more
block groups, which slows us down even more. So instead of skipping block
groups that are not finished caching when we have a hint, only skip the block
group if it hasn't started caching yet.

There is one other tweak in here. Before if we allocated a chunk and still
couldn't find new space, we'd end up switching the space info to force another
chunk allocation. This could make us end up with way too many chunks, so keep
track of this particular case.

With this patch and my previous cluster fixes my fedora box now boots in 43
seconds, and according to the bootchart is not held up by our block group
caching at all.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
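
Point 1 of the solution, choosing the block group with the most free space before caching it, amounts to a maximum search. The following is a simplified sketch and not the Btrfs allocator: `demo_block_group` and `demo_pick_ideal_group()` are hypothetical, and the real code weighs more state than a single free-space figure.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of a block group. */
struct demo_block_group {
    uint64_t free_bytes; /* free space recorded for the group */
};

/* Return the index of the group with the most free space, or -1 if none. */
static int demo_pick_ideal_group(const struct demo_block_group *g, size_t n)
{
    int best = -1;
    size_t i;

    for (i = 0; i < n; i++) {
        if (best < 0 || g[i].free_bytes > g[best].free_bytes)
            best = (int)i;
    }
    return best;
}
```

Given groups with 10, 500, and 30 free bytes, the function selects index 1; with an empty array it returns -1 so the caller can fall back to another strategy.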

openanolis / cloud-kernel synced successfully almost 2 years ago