Commits · 03a816b46d7eba78da11e4025f0af195b32fa464 · openeuler / Kernel

16 Dec, 2009 · 6 commits

nfsd: restrict filehandles accepted in V4ROOT case · 03a816b4

Authored by Steve Dickson on Sep 09, 2009

On V4ROOT exports, only accept filehandles that are the *root* of some
export.  This allows mountd to allow or deny access to individual
directories and symlinks on the pseudofilesystem.

Note that the checks in readdir and lookup are not enough, since a
malicious host with access to the network could guess filehandles that
they weren't able to obtain through lookup or readdir.

Signed-off-by: Steve Dickson <steved@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
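
The rule in this commit can be pictured with a small user-space sketch. It is an illustration under assumptions, not the nfsd code: `struct export_sketch`, its fields, and `fh_acceptable()` are hypothetical names, and dentries are reduced to opaque pointers.

```c
/* Hypothetical, simplified export record (not the kernel's structure). */
struct export_sketch {
    int v4root;          /* nonzero if this is a V4ROOT export */
    const void *root;    /* opaque stand-in for the export's root dentry */
};

/* Returns 1 if a filehandle resolving to `dentry` should be accepted. */
static int fh_acceptable(const struct export_sketch *exp, const void *dentry)
{
    if (!exp->v4root)
        return 1;                 /* ordinary exports are not restricted here */
    return dentry == exp->root;   /* V4ROOT: only the export root itself */
}
```

Because the comparison is made on the object the filehandle resolves to, a guessed filehandle for a non-root object is rejected even though it never went through lookup or readdir.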

nfsd: allow exports of symlinks · f2ca7153

Authored by J. Bruce Fields on Nov 12, 2009

We want to allow exports of symlinks, to allow mountd to communicate to
the kernel which symlinks lead to exports, and hence which symlinks need
to be visible on the pseudofilesystem.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>

nfsd: filter readdir results in V4ROOT case · 3227fa41

Authored by J. Bruce Fields on Oct 25, 2009

As with lookup, we treat every object as a mountpoint and pretend it
doesn't exist if it isn't exported.

The preexisting code here is confusing, but I haven't yet figured out
how to make it clearer.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>

nfsd: filter lookup results in V4ROOT case · 82ead7fe

Authored by J. Bruce Fields on Oct 25, 2009

We treat every object as a mountpoint and pretend it doesn't exist if
it isn't exported.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>

nfsd4: don't continue "under" mounts in V4ROOT case · 3b6cee7b

Authored by J. Bruce Fields on Oct 25, 2009

If /A/mount/point/ has filesystem "B" mounted on top of it, and if "A"
is exported, but not "B", then the nfs server has always returned to the
client a filehandle for the mountpoint, instead of for the root of "B",
allowing the client to see the subtree of "A" that would otherwise be
hidden by B.

Disable this behavior in the case of V4ROOT exports; we implement the
path restrictions of V4ROOT exports by treating *every* directory as if
it were a mountpoint, and allowing traversal *only* if the new directory
is exported.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>

nfsd: introduce export flag for v4 pseudoroot · eb4c86c6

Authored by Steve Dickson on Sep 09, 2009

NFSv4 differs from v2 and v3 in that it presents a single unified
filesystem tree, whereas v2 and v3 exported multiple filesystems (whose
roots could be found using a separate mount protocol).

Our original NFSv4 server implementation asked the administrator to
designate a single filesystem as the NFSv4 root, then to mount
filesystems they wished to export underneath.  (Often using bind mounts
of already-existing filesystems.)

This was conceptually simple, and allowed easy implementation, but
created a serious obstacle to upgrading from v2/v3: since the paths
to v4 filesystems were different, administrators would have to adjust
all the paths in client-side mount commands when switching to v4.

Various workarounds are possible.  For example, the administrator could
export "/" and designate it as the v4 root.  However, the security risks
of that approach are obvious, and in any case we shouldn't be requiring
the administrator to take extra steps to fix this problem; instead, the
server should present consistent paths across different versions by
default.

These patches take a modified version of that approach: we provide a new
export option which exports only a subset of a filesystem.  With this
flag, it becomes safe for mountd to export "/" by default, with no need
for additional configuration.

We begin just by defining the new flag.

Signed-off-by: Steve Dickson <steved@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>

15 Dec, 2009 · 8 commits

nfsd: let "insecure" flag vary by pseudoflavor · 12045a6e

Authored by J. Bruce Fields on Dec 08, 2009

This was an oversight; it should be among the export flags that can be
allowed to vary by pseudoflavor.  This allows an administrator to (for
example) allow auth_sys mounts only from low ports, but allow auth_krb5
mounts to use any port.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>

nfsd: new interface to advertise export features · e8e8753f

Authored by J. Bruce Fields on Dec 14, 2009

Soon we will add the new V4ROOT flag, and allow the INSECURE flag to
vary by pseudoflavor. It would be useful for nfs-utils (for example,
for improved exportfs error reporting) to be able to know when this
happens. Use this new interface for that purpose.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>

nfsd: Move private headers to source directory · 9a74af21

Authored by Boaz Harrosh on Dec 03, 2009

Lots of include/linux/nfsd/* headers are only used by
nfsd module. Move them to the source directory

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>

vfs: nfsctl.c un-used nfsd #includes · 68590c38

Authored by Boaz Harrosh on Dec 03, 2009

Only linux/nfsd/syscall.h is actually used. Remove the
other nfsd #includes, so they can be moved to source
directory.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>

lockd: Remove un-used nfsd headers #includes · 0296f55f

Authored by Boaz Harrosh on Dec 03, 2009

In what history were these ever needed? Well, not
any more.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>

compat.c: Remove dependence on nfsd private headers · 370d5600

Authored by Boaz Harrosh on Dec 03, 2009

Two nfsd-related headers were included but never actually
used. The linux/nfsd/nfsd.h file will eventually be moved
to fs/nfsd directory as it is only needed by nfsd itself.

There are 3 more compat.c files in the Kernel at other ARCHs
that wrongly #include nfsd headers. Once these are fixed the
headers can be moved.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>

nfsd: Source files #include cleanups · 341eb184

Authored by Boaz Harrosh on Dec 03, 2009

Now that the headers are fixed and carry their own weight, all fs/nfsd/
source files can include a minimal set of headers and still compile just
fine.

This patch should improve the compilation speed of the nfsd module.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>

nfsd4: fix share mode permissions · 57ecb34f

Authored by J. Bruce Fields on Dec 01, 2009

NFSv4 opens may function as locks denying other NFSv4 users the rights
to open a file.

We're requiring a user to have write permissions before they can deny
write.  We're *not* requiring a user to have write permissions to deny
read, which is if anything a more drastic denial.

What was intended was to require write permissions for DENY_READ.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
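
As a rough illustration of the intended rule, the sketch below models only the last sentence of the message: a request that denies read needs write permission. The two `OPEN4_SHARE_DENY_*` values are the protocol constants from RFC 3530; the helper names are hypothetical, and how the server treats `DENY_WRITE` on its own is deliberately not modeled.

```c
/* Share-deny bits as defined by the NFSv4 protocol (RFC 3530). */
#define OPEN4_SHARE_DENY_READ  0x1u
#define OPEN4_SHARE_DENY_WRITE 0x2u

/* Hypothetical helper: does this deny mask fall under the rule
 * "denying read requires write permission"? */
static int deny_read_needs_write(unsigned int share_deny)
{
    return (share_deny & OPEN4_SHARE_DENY_READ) != 0;
}

/* Hypothetical check: may a caller with or without write permission
 * request this deny mask under that single rule? */
static int deny_allowed(unsigned int share_deny, int has_write_perm)
{
    if (deny_read_needs_write(share_deny) && !has_write_perm)
        return 0;
    return 1;
}
```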

26 Nov, 2009 · 1 commit

nfsd: simplify fh_verify access checks · 864f0f61

Authored by J. Bruce Fields on Nov 25, 2009

All nfsd security depends on the security checks in fh_verify, and
especially on nfsd_setuser().

It therefore bothers me that the nfsd_setuser call may be made from
three different places, depending on whether the filehandle has already
been mapped to a dentry, and on whether subtreechecking is in force.

Instead, make an unconditional call in fh_verify(), so it's trivial to
verify that the call always occurs.

That leaves us with a redundant nfsd_setuser() call in the subtreecheck
case--it needs the correct user set earlier in order to check execute
permissions on the path to this filehandle--but I'm willing to accept
that minor inefficiency in the subtreecheck case in return for more
straightforward permission checking.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>

18 Nov, 2009 · 4 commits

fcntl: rename F_OWNER_GID to F_OWNER_PGRP · 978b4053

Authored by Peter Zijlstra on Nov 17, 2009

This is for consistency with various ioctl() operations that include the
suffix "PGRP" in their names, and also for consistency with PRIO_PGRP,
used with setpriority() and getpriority().  Also, using PGRP instead of
GID avoids confusion with the common abbreviation of "group ID".

I'm fine with anything that makes it more consistent, and if PGRP is what
is the predominant abbreviation then I see no need to further confuse
matters by adding a third one.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

procfs: fix /proc/<pid>/stat stack pointer for kernel threads · 9ebd4eba

Authored by Stefani Seibold on Nov 17, 2009

Fix a small issue for the stack pointer in /proc/<pid>/stat.  In case of a
kernel thread the value of the printed stack pointer should be 0.

Signed-off-by: Stefani Seibold <stefani@seibold.net>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

xfs: copy li_lsn before dropping AIL lock · 6c06f072

Authored by Nathaniel W. Turner on Nov 16, 2009

Access to log items on the AIL is generally protected by m_ail_lock;
this is particularly needed when we're getting or setting the 64-bit
li_lsn on a 32-bit platform.  This patch fixes a couple places where we
were accessing the log item after dropping the AIL lock on 32-bit
machines.

This can result in a partially-zeroed log->l_tail_lsn if
xfs_trans_ail_delete is racing with xfs_trans_ail_update, and in at
least some cases, this can leave the l_tail_lsn with a zero cycle
number, which means xlog_space_left will think the log is full (unless
CONFIG_XFS_DEBUG is set, in which case we'll trip an ASSERT), leading to
processes stuck forever in xlog_grant_log_space.

Thanks to Adrian VanderSpek for first spotting the race potential and to
Dave Chinner for debug assistance.

Signed-off-by: Nathaniel W. Turner <nate@houseofnate.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>

XFS bug in log recover with quota (bugzilla id 855) · 8ec6dba2

Authored by Jan Rekorajski on Nov 16, 2009

Hi,
I was hit by a bug in linux 2.6.31 when XFS is not able to recover the
log after a crash if fs was mounted with quotas. Gory details in XFS
bugzilla: http://oss.sgi.com/bugzilla/show_bug.cgi?id=855.

It looks like the wrong struct is used in the buffer length check, and the following
patch should fix the problem.

xfs_dqblk_t has a size of 104+32 bytes, while xfs_disk_dquot_t is 104 bytes
long, and this is exactly what I see in system logs - "XFS: dquot too small
(104) in xlog_recover_do_dquot_trans."

Signed-off-by: Jan Rekorajski <baggins@sith.mimuw.edu.pl>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
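
The size mismatch quoted in the message can be reproduced with two stand-in types. This is only a sketch: `disk_dquot_sketch` and `dqblk_sketch` are hypothetical byte-array structs that model the 104-byte and 104+32-byte sizes, not the real XFS field layouts, and both helper names are invented for the example.

```c
#include <stddef.h>

/* Only the sizes quoted in the message are modeled; real fields are not. */
struct disk_dquot_sketch {
    unsigned char raw[104];
};

struct dqblk_sketch {
    struct disk_dquot_sketch dd;
    unsigned char fill[32];
};

/* Length check against the larger container type:
 * a valid 104-byte item is rejected as "too small". */
static int len_ok_wrong_struct(size_t len)
{
    return len >= sizeof(struct dqblk_sketch);
}

/* Length check against the on-disk dquot itself. */
static int len_ok_right_struct(size_t len)
{
    return len >= sizeof(struct disk_dquot_sketch);
}
```

With `len` equal to 104, the first check fails and the second passes, which matches the "dquot too small (104)" message reported above.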

16 Nov, 2009 · 1 commit

cifs: clear server inode number flag while autodisabling · f534dc99

Authored by Suresh Jayaraman on Nov 16, 2009

Fix the commit ec06aedd that intended to turn off querying for server inode
numbers when server doesn't consistently support inode numbers. Presumably
the commit didn't actually clear the CIFS_MOUNT_SERVER_INUM flag, perhaps a
typo.

Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Acked-by: Jeff Layton <jlayton@redhat.com>
Cc: Stable <stable@kernel.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>

15 Nov, 2009 · 2 commits

nilfs2: deleted inconsistent comment in nilfs_load_inode_block() · 18dafac1

Authored by Jiro SEKIBA on Nov 15, 2009

The comment says, "Caller of this function MUST lock s_inode_lock",
however just above the comment, it locks s_inode_lock in the function.

Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>

Fix memory corruption caused by nfsd readdir+ · 479c2553

Authored by Petr Vandrovec on Nov 14, 2009

Commit 8177e6d6 ("nfsd: clean up
readdirplus encoding") introduced a single-character typo in the nfs3 readdir+
implementation.  Unfortunately that typo has quite bad side effects:
random memory corruption, followed (on my box) with immediate
spontaneous box reboot.

Using 'p1' instead of 'p' fixes my Linux box rebooting whenever VMware
ESXi box tries to list contents of my home directory.

Signed-off-by: Petr Vandrovec <petr@vandrovec.name>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

14 Nov, 2009 · 1 commit

nfsd: make fs/nfsd/vfs.h for common includes · 0a3adade

Authored by J. Bruce Fields on Nov 04, 2009

None of this stuff is used outside nfsd, so move it out of the common
linux include directory.

Actually, probably none of the stuff in include/linux/nfsd/nfsd.h really
belongs there, so later we may remove that file entirely.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>

13 Nov, 2009 · 1 commit

nilfs2: fix lock order reversal in chcp operation · c1ea985c

Authored by Ryusuke Konishi on Nov 12, 2009

Fixes the following lock order reversal detected by lockdep:

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.32-rc6 #7
-------------------------------------------------------
chcp/30157 is trying to acquire lock:
 (&nilfs->ns_mount_mutex){+.+.+.}, at: [<fed7cfcc>] nilfs_cpfile_change_cpmode+0x46/0x752 [nilfs2]

but task is already holding lock:
 (&nilfs->ns_segctor_sem){++++.+}, at: [<fed7ca32>] nilfs_transaction_begin+0xba/0x110 [nilfs2]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&nilfs->ns_segctor_sem){++++.+}:
       [<c105799c>] __lock_acquire+0x109c/0x139d
       [<c1057d26>] lock_acquire+0x89/0xa0
       [<c14151e2>] down_read+0x31/0x45
       [<fed6d77b>] nilfs_attach_checkpoint+0x8f/0x16b [nilfs2]
       [<fed6e393>] nilfs_get_sb+0x3e7/0x653 [nilfs2]
       [<c10c0ccb>] vfs_kern_mount+0x8b/0x124
       [<c10c0db2>] do_kern_mount+0x37/0xc3
       [<c10d7517>] do_mount+0x64d/0x69d
       [<c10d75cd>] sys_mount+0x66/0x95
       [<c1002a14>] sysenter_do_call+0x12/0x32

-> #1 (&type->s_umount_key#31/1){+.+.+.}:
       [<c105799c>] __lock_acquire+0x109c/0x139d
       [<c1057d26>] lock_acquire+0x89/0xa0
       [<c104c0f3>] down_write_nested+0x34/0x52
       [<c10c08fe>] sget+0x22e/0x389
       [<fed6e133>] nilfs_get_sb+0x187/0x653 [nilfs2]
       [<c10c0ccb>] vfs_kern_mount+0x8b/0x124
       [<c10c0db2>] do_kern_mount+0x37/0xc3
       [<c10d7517>] do_mount+0x64d/0x69d
       [<c10d75cd>] sys_mount+0x66/0x95
       [<c1002a14>] sysenter_do_call+0x12/0x32

-> #0 (&nilfs->ns_mount_mutex){+.+.+.}:
       [<c1057727>] __lock_acquire+0xe27/0x139d
       [<c1057d26>] lock_acquire+0x89/0xa0
       [<c1414d63>] mutex_lock_nested+0x41/0x23e
       [<fed7cfcc>] nilfs_cpfile_change_cpmode+0x46/0x752 [nilfs2]
       [<fed801b2>] nilfs_ioctl+0x11a/0x7da [nilfs2]
       [<c10cca12>] vfs_ioctl+0x27/0x6e
       [<c10ccf93>] do_vfs_ioctl+0x491/0x4db
       [<c10cd022>] sys_ioctl+0x45/0x5f
       [<c1002a14>] sysenter_do_call+0x12/0x32

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>

12 Nov, 2009 · 15 commits

__generic_block_fiemap(): fix for files bigger than 4GB · e04b5ef8

Authored by Mike Hommey on Nov 11, 2009

Because of an integer overflow on start_blk, various kinds of wrong results
would be returned by the generic_block_fiemap() handler, such as no
extents when there is a 4GB+ hole at the beginning of the file, or wrong
fe_logical when an extent starts after the first 4GB.

Signed-off-by: Mike Hommey <mh@glandium.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Eric Sandeen <sandeen@sgi.com>
Cc: Josef Bacik <jbacik@redhat.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
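
The class of overflow described here is easy to reproduce in user space. The sketch below is not the kernel code: `blkbits` and both helper names are hypothetical, and the 32-bit variant only mimics converting a block number to a byte offset in a type too narrow for files beyond 4GB.

```c
#include <stdint.h>

/* Hypothetical helper: byte offset of a block, computed in 32 bits.
 * The shifted value wraps modulo 2^32 once it no longer fits. */
static uint64_t logical_offset_32(uint32_t blk, unsigned int blkbits)
{
    uint32_t off = blk << blkbits;
    return off;
}

/* Same computation, with the block number widened to 64 bits first. */
static uint64_t logical_offset_64(uint64_t blk, unsigned int blkbits)
{
    return blk << blkbits;
}
```

With 4096-byte blocks (`blkbits` of 12), block 1048576 sits exactly at the 4GB boundary: the 32-bit helper yields 0 while the 64-bit helper yields 4294967296, which is the kind of wrong logical offset the message mentions.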

exec: setup_arg_pages() fails to return errors · fc63cf23

Authored by Anton Blanchard on Nov 11, 2009

In setup_arg_pages we work hard to assign a value to ret, but on exit we
always return 0.

Also remove a now duplicated exit path and branch to out_unlock instead.

Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
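
The shape of this bug, and of the single-exit fix, can be sketched as follows. Everything here is hypothetical (the stub, both function names, and the error codes chosen); it only illustrates "assign `ret`, then return 0 anyway" versus "branch to one exit label and return `ret`".

```c
#include <errno.h>

/* Hypothetical stand-in for a step that can fail. */
static int expand_stack_stub(int should_fail)
{
    return should_fail ? -ENOMEM : 0;
}

/* Buggy shape: ret is computed, but the function always returns 0. */
static int setup_buggy(int should_fail)
{
    int ret = expand_stack_stub(should_fail);

    if (ret)
        ret = -EFAULT;
    return 0;
}

/* Fixed shape: every path leaves through one label and returns ret. */
static int setup_fixed(int should_fail)
{
    int ret;

    ret = expand_stack_stub(should_fail);
    if (ret) {
        ret = -EFAULT;
        goto out_unlock;
    }
    ret = 0;
out_unlock:
    /* a real implementation would drop its lock here */
    return ret;
}
```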

fs: add missing compat_ptr handling for FS_IOC_RESVSP ioctl · 7779d7be

Authored by Heiko Carstens on Nov 11, 2009

For FS_IOC_RESVSP and FS_IOC_RESVSP64 compat_sys_ioctl() uses its
arg argument as a pointer to userspace. However, it is missing a
call to compat_ptr(), which does the proper pointer conversion.

This was introduced with 3e63cbb1 "fs: Add new pre-allocation ioctls
to vfs for compatibility with legacy xfs ioctls".

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ankit Jain <me@ankitjain.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Arnd Bergmann <arndbergmann@googlemail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: <stable@kernel.org>		[2.6.31.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

pidns: fix a leak in /proc dentries and inodes with pid namespaces. · 29f12ca3

Authored by Sukadev Bhattiprolu on Nov 11, 2009

Daniel Lezcano reported a leak in 'struct pid' and 'struct pid_namespace'
that is discussed in:

	http://lkml.org/lkml/2009/10/2/159.

To summarize the thread, when container-init is terminated, it sets the
PF_EXITING flag, zaps other processes in the container and waits to reap
them.  As a part of reaping, the container-init should flush any /proc
dentries associated with the processes.  But because the container-init is
itself exiting and the following PF_EXITING check, the dentries are not
flushed, resulting in leak in /proc inodes and dentries.

This fix reverts the commit 7766755a ("Fix /proc dcache deadlock
in do_exit") which introduced the check for PF_EXITING.  At the time of
the commit, shrink_dcache_parent() flushed dentries from other filesystems
also and could have caused a deadlock which the commit fixed.  But as
pointed out by Eric Biederman, after commit 0feae5c4,
shrink_dcache_parent() no longer affects other filesystems.  So reverting
the commit is now safe.

As pointed out by Jan Kara, the leak is not as critical since the
unclaimed space will be reclaimed under memory pressure or by:

	echo 3 > /proc/sys/vm/drop_caches

But since this check is no longer required, it's best to remove it.

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Reported-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Jan Kara <jack@ucw.cz>
Cc: Andrea Arcangeli <andrea@cpushare.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fs/jbd: Export log_start_commit to fix ext3 build. · ff5e4b51

Authored by Stefan Schmidt on Nov 12, 2009

This fixes:
ERROR: "log_start_commit" [fs/ext3/ext3.ko] undefined!

Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>

Btrfs: fix panic when trying to destroy a newly allocated · a6dbd429

Authored by Josef Bacik on Nov 11, 2009

There is a problem where iget5_locked will look for an inode, not find it, and
then subsequently try to allocate it. Another CPU will have raced in and
allocated the inode instead, so when iget5_locked gets the inode spin lock again
and does a search, it finds the new inode. So it goes ahead and calls
destroy_inode on the inode it just allocated. The problem is we don't set
BTRFS_I(inode)->root until the new inode is completely initialized. This patch
makes us set root to NULL when alloc'ing a new inode, so when we get to
btrfs_destroy_inode and we see that root is NULL we can just free up the memory
and continue on. This fixes the panic

http://www.kerneloops.org/submitresult.php?number=812690

Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: allow more metadata chunk preallocation · 33b25808

Authored by Chris Mason on Nov 11, 2009

On an FS where all of the space has not been allocated into chunks yet,
the ENOSPC check can return ENOSPC just because the existing metadata chunks
are full.

We get around this by allowing more metadata chunks to be allocated up
to a certain limit, and finding the right limit is a little fuzzy.  The
problem is the reservations for delalloc would preallocate way too much
of the FS as metadata.  We need to start saying no and just force some
IO to happen.

But we also need to let a reasonable amount of the FS become metadata.
This bumps the hard limit up; later releases will have a better system.

Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: fallback on uncompressed io if compressed io fails · f5a84ee3

Authored by Josef Bacik on Nov 10, 2009

Currently compressed IO does not deal with not having its entire extent able to
be allocated. So if we have enough free space to allocate for the extent, but
it's not contiguous, it will fail spectacularly. This patch fixes this by
falling back on uncompressed IO which lets us spread the delalloc extent across
multiple extents. I tested this by making us randomly think the reservation had
failed to make it fallback on the uncompressed io way and it seemed to work
fine. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: find ideal block group for caching · ccf0e725

Authored by Josef Bacik on Nov 10, 2009

This patch changes a few things. Hopefully the comments are helpful, but
I'll try to be verbose here as well.

Problem:

My fedora box was taking 1 minute and 21 seconds to boot with btrfs as root.
Part of this problem was we pick the first block group we can find and start
caching it, even if it may not have enough free space. The other problem is
we only search for cached block groups the first time around, which we won't
find any cached block groups because this is a newly mounted fs, so we end up
caching several block groups during bootup, which with a lot of fragmentation
takes around 30-45 seconds to complete, which bogs down the system. So

Solution:

1) Don't cache block groups willy-nilly at first. Instead try and figure out
which block group has the most free, and therefore will take the least amount
of time to cache.

2) Don't be so picky about cached block groups. The other problem is once
we've filled up a cluster, if the block group isn't finished caching the next
time we try and do the allocation we'll completely ignore the cluster and
start searching from the beginning of the space, which makes us cache more
block groups, which slows us down even more. So instead of skipping block
groups that are not finished caching when we have a hint, only skip the block
group if it hasn't started caching yet.

There is one other tweak in here. Before if we allocated a chunk and still
couldn't find new space, we'd end up switching the space info to force another
chunk allocation. This could make us end up with way too many chunks, so keep
track of this particular case.

With this patch and my previous cluster fixes my fedora box now boots in 43
seconds, and according to the bootchart is not held up by our block group
caching at all.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
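
Point 1 of the solution amounts to a "pick the candidate with the most free space" scan. A minimal sketch of that idea, assuming a flat array of hypothetical `block_group_sketch` records (Btrfs itself tracks block groups differently, and `pick_most_free()` is an invented name):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified block group record. */
struct block_group_sketch {
    uint64_t free_bytes;
};

/* Returns the index of the group with the most free space,
 * or -1 when the array is empty. Ties keep the earliest index. */
static long pick_most_free(const struct block_group_sketch *bg, size_t n)
{
    long best = -1;
    size_t i;

    for (i = 0; i < n; i++) {
        if (best < 0 || bg[i].free_bytes > bg[(size_t)best].free_bytes)
            best = (long)i;
    }
    return best;
}
```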

Btrfs: avoid null deref in unpin_extent_cache() · 4eb3991c

Authored by Dan Carpenter on Nov 10, 2009

I reordered the checks to avoid dereferencing "em" if it was null.

Found by smatch static checker.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
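
The reordering relies on short-circuit evaluation: the null test has to come first so the field access is never reached for a null pointer. A minimal sketch, assuming a hypothetical one-field `extent_map_sketch` type and an invented helper name rather than the real Btrfs structures:

```c
#include <stddef.h>

/* Hypothetical, one-field stand-in for an extent map. */
struct extent_map_sketch {
    unsigned long long start;
};

/* Returns 1 if the lookup result is missing or does not start at `start`.
 * `!em` is evaluated first, so `em->start` is only read when em is non-null. */
static int mapping_mismatch(const struct extent_map_sketch *em,
                            unsigned long long start)
{
    return !em || em->start != start;
}
```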

Btrfs: skip btrfs_release_path in btrfs_update_root and btrfs_del_root · df66916e

Authored by Li Dongyang on Nov 06, 2009

We don't need to call btrfs_release_path because btrfs_free_path will do
that for us.

Signed-off-by: Li Dongyang <Jerry87905@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: fix some metadata enospc issues · 5df6a9f6

Authored by Josef Bacik on Nov 10, 2009

We weren't reserving metadata space for rename, rmdir and unlink, which could
cause problems.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: fix how we set max_size for free space clusters · 01dea1ef

Authored by Josef Bacik on Nov 10, 2009

This patch fixes a problem where max_size can be set to 0 even though we
filled the cluster properly. We set max_size to 0 if we restart the cluster
window, but if the new start entry is big enough to be our new cluster then we
could return with a max_size set to 0, which will mean the next time we try to
allocate from this cluster it will fail. So set max_extent to the entry's
size. Tested this on my box and now we actually allocate from the cluster
after we fill it. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: cleanup transaction starting and fix journal_info usage · 249ac1e5

Authored by Josef Bacik on Nov 10, 2009

We use journal_info to tell if we're in a nested transaction to make sure we
don't commit the transaction within a nested transaction. We use another
method to see if there are any outstanding ioctl trans handles, so if we're
starting one do not set current->journal_info, since it will screw with other
filesystems. This patch also cleans up the starting stuff so there aren't any
magic numbers.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: fix data allocation hint start · 6346c939

Authored by Josef Bacik on Nov 10, 2009

Sometimes our start allocation hint when we cow a file can be either
EXTENT_HOLE or some other such place holder, which is not optimal. So if we
find that our em->block_start is one of these special values, check to see
where the first block of the inode is stored, and use that as a hint. If that
block is also a special value, just fallback on a hint of 0 and let the
allocator figure out a good place to put the data.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

11 Nov, 2009 · 1 commit

JBD/JBD2: free j_wbuf if journal init fails. · 7b02bec0

Authored by Tao Ma on Nov 10, 2009

If journal init fails, we need to free j_wbuf.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
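
The cleanup pattern behind this fix can be sketched in user space with `malloc` and `free`. The struct, function name, `later_step_fails` parameter, and error codes below are hypothetical and stand in for the journal setup steps that follow the buffer allocation.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical, minimal journal with only the write-buffer array. */
struct journal_sketch {
    void **wbuf;
};

/* Allocates wbuf, then simulates a later init step that may fail.
 * On failure the buffer is released instead of being leaked. */
static int journal_init_sketch(struct journal_sketch *j, size_t n,
                               int later_step_fails)
{
    j->wbuf = malloc(n * sizeof(*j->wbuf));
    if (!j->wbuf)
        return -ENOMEM;

    if (later_step_fails) {
        free(j->wbuf);
        j->wbuf = NULL;
        return -EIO;
    }
    return 0;
}
```

Resetting the pointer to NULL after freeing is a choice made for this sketch so that callers cannot reuse a stale pointer; the message above only says the buffer must be freed.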
