Commits · 240f62c8756df285da11469259b3900f32883168 · openeuler / Kernel

28 Mar, 2011 1 commit

Btrfs: use RCU instead of a spinlock to protect the root node · 240f62c8

Committed by Chris Mason 13 years ago

The pointer to the extent buffer for the root of each tree
is protected by a spinlock so that we can safely read the pointer
and take a reference on the extent buffer.

But now that the extent buffers are freed via RCU, we can safely
use rcu_read_lock instead.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

26 Mar, 2011 3 commits

Btrfs: mark the bio with an error if we have a failure in dio · c0da7aa1

Committed by Josef Bacik 13 years ago

I noticed that dio_end_io calls the appropriate endio function with an error,
but the endio functions don't actually do anything with that error, they assume
that if there was an error then the bio will not be uptodate. So if we had
checksum failures we would never pass back EIO. So if there is an error in our
endio functions make sure to clear the uptodate flag on the bio. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>

Btrfs: don't allocate dip->csums when doing writes · 98bc3149

Committed by Josef Bacik 13 years ago

When doing direct writes we store the checksums in the ordered sum stuff in the
ordered extent for writing them when the write completes, so we don't even use
the dip->csums array. So if we're writing, don't bother allocating dip->csums
since we won't use it anyway. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>

Btrfs: cleanup how we setup free space clusters · 4e69b598

Committed by Josef Bacik 13 years ago

This patch makes the free space cluster refilling code a little easier to
understand, and fixes some things with the bitmap part of it. Currently we
either want to refill a cluster with

1) All normal extent entries (those without bitmaps)
2) A bitmap entry with enough space

The current code has this ugly jump around logic that will first try and fill up
the cluster with extent entries and then if it can't do that it will try and
find a bitmap to use. So instead split this out into two functions, one that
tries to find only normal entries, and one that tries to find bitmaps.

This also fixes a suboptimal thing we would do with bitmaps. If we used a
bitmap we would just tell the cluster that we were pointing at a bitmap and it
would do the tree search in the block group for that entry every time we tried
to make an allocation. Instead of doing that, we now just add it to the
cluster's group.

I tested this with my ENOSPC tests and xfstests and it survived.
Signed-off-by: Josef Bacik <josef@redhat.com>

21 Mar, 2011 3 commits

Btrfs: don't be as aggressive about using bitmaps · 32cb0840

Committed by Josef Bacik 13 years ago

We have been creating bitmaps for small extents unconditionally forever. This
was great when testing to make sure the bitmap stuff was working, but is
overkill normally. So instead of always adding small chunks of free space to
bitmaps, only start doing it if we go past half of our extent threshold. This
will keep us from creating a bitmap for just one small free extent at the front
of the block group, and will make the allocator a little faster as a result.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
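
The "half of our extent threshold" rule described above can be sketched in userspace C roughly as follows. This is an illustrative sketch only: `sk_block_group`, `sk_use_bitmap` and the `small_limit` parameter are hypothetical names, and the choice to keep large extents out of bitmaps is an assumption of the sketch, not something the commit message states.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a block group's free space state. */
struct sk_block_group {
    unsigned int free_extents;   /* extent entries currently tracked */
    unsigned int extents_thresh; /* budget for plain extent entries */
};

/*
 * Decide whether a newly freed range should be stored in a bitmap.
 * Small ranges only go to bitmaps once more than half of the extent
 * budget is in use. Assumption of this sketch: ranges larger than
 * small_limit always stay as plain extent entries.
 */
bool sk_use_bitmap(const struct sk_block_group *bg, uint64_t bytes,
                   uint64_t small_limit)
{
    if (bytes > small_limit)
        return false;
    return (uint64_t)bg->free_extents * 2 > bg->extents_thresh;
}
```

With a threshold of 100, a small extent would stay a plain entry while 50 or fewer entries exist and would move to a bitmap from the 51st entry on.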

Btrfs: deal with min_bytes appropriately when looking for a cluster · d0a365e8

Committed by Josef Bacik 13 years ago

We do all this fun stuff with min_bytes, but either don't use it in the case of
just normal extents, or use it completely wrong in the case of bitmaps.  So fix
this for both cases:

1) In the extent case, stop looking for space with window_free >= min_bytes
instead of bytes + empty_size.

2) In the bitmap case, we were looking for stretches of free space that were at
least min_bytes in size, which was not right at all.  So instead search for
stretches of free space that are at least bytes in size (this will make a
difference when we have > page size blocks) and then only search for min_bytes
amount of free space.

Thanks,
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Josef Bacik <josef@redhat.com>

Btrfs: check free space in block group before searching for a cluster · 7d0d2e8e

Committed by Josef Bacik 13 years ago

The free space cluster stuff is heavy duty, so there is no sense in going
through the entire song and dance if there isn't enough space in the block group
to begin with.  Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>

18 Mar, 2011 16 commits

Btrfs: add checks to verify dir items are correct · 22a94d44

Committed by Josef Bacik 13 years ago

We need to make sure the dir items we get are valid dir items.  So any time we
try and read one check it with verify_dir_item, which will do various sanity
checks to make sure it looks sane.  Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>

Btrfs: check return value of btrfs_search_slot properly · 41415730

Committed by Josef Bacik 13 years ago

Doing an audit of where we use btrfs_search_slot only showed one place where we
don't check the return value of btrfs_search_slot properly.  Just fix
mark_extent_written to see if btrfs_search_slot failed and act accordingly.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>

Btrfs: check items for correctness as we search · a826d6dc

Committed by Josef Bacik 13 years ago

Currently if we have corrupted items things will blow up in spectacular ways.
So as we read in blocks and they are leaves, check the entire leaf to make sure
all of the items are correct and point to valid parts in the leaf for the item
data they are responsible for. If the item is corrupt we will kick back EIO and
not read any of the copies since they are likely to not be correct either. This
will catch generic corruptions; it will be up to the individual callers of
btrfs_search_slot to make sure their items are right. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
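
A minimal userspace sketch of the kind of per-leaf check described above, assuming a simplified leaf where every item records the offset and size of its data inside the leaf's data area. The `sk_item` layout and `sk_check_leaf` are hypothetical and much simpler than the real btrfs structures.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical item header: where the item's data lives inside the leaf. */
struct sk_item {
    uint32_t offset; /* start of the item data within the leaf data area */
    uint32_t size;   /* length of the item data in bytes */
};

/*
 * Walk every item in a leaf and make sure its data range lies inside the
 * leaf's data area. Returns 0 if all items look sane, -EIO otherwise.
 */
int sk_check_leaf(const struct sk_item *items, size_t nritems,
                  uint32_t data_size)
{
    for (size_t i = 0; i < nritems; i++) {
        /* Use a 64-bit sum so offset + size cannot wrap around. */
        uint64_t end = (uint64_t)items[i].offset + items[i].size;

        if (end > data_size)
            return -EIO;
    }
    return 0;
}
```

The sketch only covers the bounds check; any further consistency rules between neighbouring items are left out.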

Btrfs: return error if the range we want to map is bogus · 85026533

Committed by Josef Bacik 13 years ago

Currently if we have corrupt metadata map_extent_buffer will complain about it,
but not return an error so the caller has no idea a problem was hit.  Fix this.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>

Btrfs: add a comment explaining what btrfs_cont_expand does · 695a0d0d

Committed by Josef Bacik 14 years ago

Every time I have to deal with btrfs_cont_expand I stare at it for 20 minutes
trying to remember what exactly it does and why the hell we need it. So add a
comment to save future-Josef some time. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>

Btrfs: use mark_inode_dirty when expanding the file · 930f028a

Committed by Josef Bacik 14 years ago

Mark_inode_dirty will call btrfs_dirty_inode which will take care of updating
the inode.  This makes setsize a little cleaner since we don't have to start a
transaction and update the inode in there, we can just call mark_inode_dirty.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>

Btrfs: only add orphan items when truncating · f0cd846e

Committed by Josef Bacik 14 years ago

We don't need an orphan item when expanding files, we just need them for
truncating them, so only add the orphan item in btrfs_truncate instead of in
btrfs_setsize.  Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>

Btrfs: make sure to remove the orphan item from the in-memory list · ded5db9d

Committed by Josef Bacik 14 years ago

This fixes a problem where if truncate fails the inode will still be on the
in-memory orphan list. This will make us complain when the inode gets destroyed
because it's still on the orphan list. So if we fail, just remove us from the
in-memory list and carry on.
Signed-off-by: Josef Bacik <josef@redhat.com>

Btrfs: handle errors in btrfs_orphan_cleanup · 66b4ffd1

Committed by Josef Bacik 14 years ago

If we cannot truncate an inode for some reason we will never delete the orphan
item associated with that inode, which means that we will loop forever in
btrfs_orphan_cleanup. Instead of doing this just return error so we fail to
mount. It sucks, but hey it's better than hanging. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>

Btrfs: cleanup error handling in the truncate path · 3893e33b

Committed by Josef Bacik 14 years ago

Now that we can handle having errors in the truncate path, let's make sure we
return errors instead of doing BUG_ON() and such.  Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>

Btrfs: convert to the new truncate sequence · a41ad394

Committed by Josef Bacik 14 years ago

->truncate() is going away, instead all of the work needs to be done in
->setattr().  So this converts us over to do this.  It's fairly straightforward,
just get rid of our .truncate inode operation and call btrfs_truncate() directly
from btrfs_setsize.  This works out better for us since truncate can technically
return ENOSPC, and before we had no way of letting anybody know.  Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>

Btrfs: use a slab for the free space entries · dc89e982

Committed by Josef Bacik 14 years ago

Since we alloc/free free space entries a whole lot, let's use a slab to keep
track of them. This makes some of my tests slightly faster. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>

Btrfs: change reserved_extents to an atomic_t · 57a45ced

Committed by Josef Bacik 14 years ago

We track delayed allocation per inode via two counters:
outstanding_extents and reserved_extents. Outstanding_extents is already an
atomic_t, but reserved_extents is not and is protected by a spinlock. So
convert this to an atomic_t and instead of using a spinlock, use atomic_cmpxchg
when releasing delalloc bytes. This makes our inode 72 bytes smaller, and
reduces locking overhead (albeit it was minimal to begin with). Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
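
The general idea of replacing a spinlock-protected counter with a compare-and-exchange loop can be sketched in userspace with C11 atomics. This is not the kernel code: `sk_inode_counters` and `sk_release_reserved` are hypothetical, and the "never drop below zero" policy is an assumption chosen to make the retry loop meaningful.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the per-inode reservation counter. */
struct sk_inode_counters {
    atomic_int reserved_extents;
};

/*
 * Release up to nr reservations without taking a lock and without letting
 * the counter go negative. Returns how many reservations were released.
 */
int sk_release_reserved(struct sk_inode_counters *c, int nr)
{
    int old = atomic_load(&c->reserved_extents);
    int drop;

    do {
        drop = nr < old ? nr : old;
        if (drop <= 0)
            return 0;
        /*
         * If another thread changed the counter, the exchange fails,
         * old is refreshed with the current value, and we recompute.
         */
    } while (!atomic_compare_exchange_weak(&c->reserved_extents,
                                           &old, old - drop));

    return drop;
}
```

The loop retries only when the counter changed underneath it, so the uncontended path is a single atomic operation.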

Btrfs: fix how we deal with the pages array in the write path · 4a64001f

Committed by Josef Bacik 14 years ago

Really we don't need to memset the pages array at all, since we know how many
pages we're going to use in the array and pass that around. So don't memset,
just trust we're not idiots and we pass num_pages around properly.
Signed-off-by: Josef Bacik <josef@redhat.com>

Btrfs: simplify our write path · d0215f3e

Committed by Josef Bacik 14 years ago

Our aio_write function is huge and kind of hard to follow at times. So this
patch fixes this by breaking the buffered and direct write paths out into
separate functions so it's a little clearer what's going on. I've also fixed
some wrong typing that we had and added the ability to handle getting an error
back from btrfs_set_extent_delalloc. Tested this with xfstests and everything
came out fine. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>

Btrfs: fix formatting in file.c · 9f570b8d

Committed by Josef Bacik 14 years ago

Sorry, but these were bugging me.  Just cleanup some of the formatting in
file.c.
Signed-off-by: Josef Bacik <josef@redhat.com>

15 Mar, 2011 1 commit

Fix corrupted OSF partition table parsing · 1eafbfeb

Committed by Timo Warns 13 years ago

The kernel automatically evaluates partition tables of storage devices.
The code for evaluating OSF partitions contains a bug that leaks data
from kernel heap memory to userspace for certain corrupted OSF
partitions.

In more detail:

  for (i = 0 ; i < le16_to_cpu(label->d_npartitions); i++, partition++) {

iterates from 0 to d_npartitions - 1, where d_npartitions is read from
the partition table without validation and partition is a pointer to an
array of at most 8 d_partitions.

Add the proper and obvious validation.
Signed-off-by: Timo Warns <warns@pre-sense.de>
Cc: stable@kernel.org
[ Changed the patch trivially to not repeat the whole le16_to_cpu()
  thing, and to use an explicit constant for the magic value '8' ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
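
The validation described above amounts to bounding the on-disk count before walking the array. A minimal userspace sketch, assuming a simplified label layout: `MAX_OSF_PARTITIONS`, `struct osf_label` as written here, and both helper functions are hypothetical, and the endianness conversion (`le16_to_cpu`) is left out for brevity.

```c
#include <stdint.h>

/* The commit text says the array holds at most 8 entries. */
#define MAX_OSF_PARTITIONS 8

struct osf_partition {
    uint32_t p_size;
    uint32_t p_offset;
};

/* Simplified, hypothetical label: only the fields this sketch needs. */
struct osf_label {
    uint16_t d_npartitions; /* untrusted value read from the disk */
    struct osf_partition d_partitions[MAX_OSF_PARTITIONS];
};

/* Returns the validated partition count, or -1 for a corrupted label. */
int osf_checked_npartitions(const struct osf_label *label)
{
    unsigned int npartitions = label->d_npartitions;

    if (npartitions > MAX_OSF_PARTITIONS)
        return -1;
    return (int)npartitions;
}

/* Sums the sizes of the first n partitions, refusing corrupted labels. */
int64_t osf_total_size(const struct osf_label *label)
{
    int n = osf_checked_npartitions(label);
    int64_t total = 0;

    if (n < 0)
        return -1;
    for (int i = 0; i < n; i++)
        total += label->d_partitions[i].p_size;
    return total;
}
```

Reading the count once into a local variable, then validating and looping on that local, also mirrors the note in the commit about not repeating the conversion in the loop condition.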

14 Mar, 2011 1 commit

compat breakage in preadv() and pwritev() · c44ed965

Committed by Al Viro 13 years ago

Fix for a dumb preadv()/pwritev() compat bug - unlike the native
variants, the compat_...  ones forget to check FMODE_P{READ,WRITE}, so
e.g.  on pipe the native preadv() will fail with -ESPIPE and compat one
will act as readv() and succeed.

Not critical, but it's a clear bug with trivial fix, so IMO it's OK for
-final.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
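
The missing check can be illustrated with a small userspace model. Everything here is hypothetical (`sk_file`, the `SK_FMODE_*` values, `sk_preadv_check`); the only behaviour taken from the commit message is that a positional read on a file without the positional-read mode bit should fail with -ESPIPE instead of silently behaving like readv().

```c
#include <errno.h>

/* Hypothetical mode bits for this sketch; not the kernel's values. */
#define SK_FMODE_READ  0x1u
#define SK_FMODE_PREAD 0x2u

/* Hypothetical stand-in for an open file. */
struct sk_file {
    unsigned int f_mode;
};

/*
 * Gate for a positional read: files that cannot be read at an explicit
 * offset (a pipe, for example) are rejected with -ESPIPE.
 */
long sk_preadv_check(const struct sk_file *file)
{
    if (!(file->f_mode & SK_FMODE_PREAD))
        return -ESPIPE;
    return 0;
}
```

Both the native and the compat entry points would need to pass through the same gate for their behaviour to match.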

12 Mar, 2011 7 commits

Btrfs: break out of shrink_delalloc earlier · 36e39c40

Committed by Chris Mason 13 years ago

Josef had changed shrink_delalloc to exit after three shrink
attempts, which wasn't quite enough because new writers could
race in and steal free space.

But it also fixed deadlocks and stalls as we tried to recover
delalloc reservations.  The code was tweaked to loop 1024
times, and would reset the counter any time a small amount
of progress was made.  This was too drastic, and with a
lot of writers we can end up stuck in shrink_delalloc forever.

The shrink_delalloc loop is fairly complex because the caller is looping
too, and the caller will go ahead and force a transaction commit to make
sure we reclaim space.

This reworks things to exit shrink_delalloc when we've forced some
writeback and the delalloc reservations have gone down.  This means
the writeback has not just started but has also finished at
least some of the metadata changes required to reclaim delalloc
space.

If we've got this wrong, we're returning ENOSPC too early, which
is a big improvement over the current behavior of hanging the machine.

Test 224 in xfstests hammers on this nicely, and with 1000 writers
trying to fill a 1GB drive we get our first ENOSPC at 93% full.  The
other writers are able to continue until we get 100%.

This is a worst case test for btrfs because the 1000 writers are doing
small IO, and the small FS size means we don't have a lot of room
for metadata chunks.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

NFS: NFSROOT should default to "proto=udp" · 53d47375

Committed by Chuck Lever 13 years ago

There have been a number of recent reports that NFSROOT is no longer
working with default mount options, but fails only with certain NICs.

Brian Downing <bdowning@lavos.net> bisected to commit 56463e50 "NFS:
Use super.c for NFSROOT mount option parsing".  Among other things,
this commit changes the default mount options for NFSROOT to use TCP
instead of UDP as the underlying transport.

TCP seems less able to deal with NICs that are slow to initialize.
The system logs that have accompanied reports of problems all show
that NFSROOT attempts to establish a TCP connection before the NIC is
fully initialized, and thus the TCP connection attempt fails.

When a TCP connection attempt fails during a mount operation, the
NFS stack needs to fail the operation.  Usually user space knows how
and when to retry it.  The network layer does not report a distinct
error code for this particular failure mode.  Thus, there isn't a
clean way for the RPC client to see that it needs to retry in this
case, but not in others.

Because NFSROOT is used in some environments where it is not possible
to update the kernel command line to specify "udp", the proper thing
to do is change NFSROOT to use UDP by default, as it did before commit
56463e50.

To make it easier to see how to change default mount options for
NFSROOT and to distinguish default settings from mandatory settings,
I've adjusted a couple of areas to document the specifics.

root_nfs_cat() is also modified to deal with commas properly when
concatenating strings containing mount option lists.  This keeps
root_nfs_cat() call sites simpler, now that we may be concatenating
multiple mount option strings.
Tested-by: Brian Downing <bdowning@lavos.net>
Tested-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: <stable@kernel.org> # 2.6.37
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
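
The comma handling described for root_nfs_cat() can be sketched in portable userspace C. `sk_opts_cat` is a hypothetical helper written for illustration, not the kernel function; the choice to leave the destination untouched on overflow is an assumption of the sketch.

```c
#include <stddef.h>
#include <string.h>

/*
 * Append the option list src to dest, inserting a comma first when dest
 * is non-empty and does not already end in one. Returns 0 on success, or
 * -1 (leaving dest unchanged) if the result would not fit in destlen
 * bytes including the terminating NUL.
 */
int sk_opts_cat(char *dest, const char *src, size_t destlen)
{
    size_t len = strlen(dest);
    size_t srclen = strlen(src);
    int need_comma = len > 0 && dest[len - 1] != ',';

    if (len + (size_t)need_comma + srclen + 1 > destlen)
        return -1;
    if (need_comma)
        dest[len++] = ',';
    memcpy(dest + len, src, srclen + 1); /* copies the NUL as well */
    return 0;
}
```

Callers can then chain several option strings without tracking whether a separator is needed.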

nfs4: remove duplicated #include · 57df216b

Committed by Huang Weiyi 14 years ago

Remove duplicated #include('s) in
  fs/nfs/nfs4proc.c
Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFSv4: nfs4_state_mark_reclaim_nograce() should be static · f9feab1e

Committed by Trond Myklebust 14 years ago

There are no more external users of nfs4_state_mark_reclaim_nograce() or
nfs4_state_mark_reclaim_reboot(), so mark them as static.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFSv4: Fix the setlk error handler · ecac799a

Committed by Trond Myklebust 14 years ago

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFSv4.1: Fix the handling of the SEQUENCE status bits · b4410c2f

Committed by Trond Myklebust 14 years ago

We want SEQUENCE status bits to be handled by the state manager in order
to avoid threading issues.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFSv4/4.1: Fix nfs4_schedule_state_recovery abuses · 0400a6b0

Committed by Trond Myklebust 14 years ago

nfs4_schedule_state_recovery() should only be used when we need to force
the state manager to check the lease. If we just want to start the
state manager in order to handle a state recovery situation, we should be
using nfs4_schedule_state_manager().

This patch fixes the abuses of nfs4_schedule_state_recovery() by replacing
its use with a set of helper functions that do the right thing.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

11 Mar, 2011 8 commits

NFSv4.1 reclaim complete must wait for completion · c34c32ea

Committed by Andy Adamson 14 years ago

Signed-off-by: Andy Adamson <andros@netapp.com>
[Trond: fix whitespace errors]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFSv4: remove duplicate clientid in struct nfs_client · 114f64b5

Committed by Andy Adamson 14 years ago

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFSv4.1: Retry CREATE_SESSION on NFS4ERR_DELAY · 7d6d63d6

Committed by Ricardo Labiaga 14 years ago

Fix a bug where we currently retry the EXCHANGEID call again, even though
we already have a valid clientid.  Instead, delay and retry the CREATE_SESSION
call.
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

(try3-resend) Fix nfs_compat_user_ino64 so it doesn't cause problems if bit 31... · 3fa0b4e2

Committed by Frank Filz 14 years ago

(try3-resend) Fix nfs_compat_user_ino64 so it doesn't cause problems if bit 31 or 63 are set in fileid

The problem was use of an int32, which when converted to a uint64
is sign extended resulting in a fileid that doesn't fit in 32 bits
even though the intent of the function is to fit the fileid into
32 bits.
Signed-off-by: Frank Filz <ffilzlnx@us.ibm.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
[Trond: Added an include for compat.h]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
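
The sign-extension problem described above can be reproduced in a few lines of userspace C. Both functions are hypothetical illustrations, not the kernel's nfs_compat_user_ino64(); the XOR fold of the high and low halves is an assumption about how a 64-bit fileid might be squeezed into 32 bits. The signed variant also relies on the usual two's-complement wrap when narrowing to `int32_t`, which is implementation-defined in C but is what GCC and Clang do.

```c
#include <stdint.h>

/*
 * Buggy variant: the folded value is held in a signed 32-bit temporary.
 * When bit 31 of the result is set, converting back to 64 bits
 * sign-extends and fills the upper 32 bits with ones.
 */
uint64_t sk_fold_fileid_signed(uint64_t fileid)
{
    int32_t ino = (int32_t)(uint32_t)fileid;

    ino ^= (int32_t)(uint32_t)(fileid >> 32);
    return (uint64_t)ino;
}

/*
 * Fixed variant: an unsigned 32-bit temporary zero-extends, so the
 * result always fits in 32 bits.
 */
uint64_t sk_fold_fileid_unsigned(uint64_t fileid)
{
    uint32_t ino = (uint32_t)fileid;

    ino ^= (uint32_t)(fileid >> 32);
    return ino;
}
```

For a fileid of 0x80000000 the signed variant yields 0xFFFFFFFF80000000, while the unsigned variant yields 0x80000000.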

nfs: fix compilation warning · 43b7c3f0

Committed by Jovi Zhang 14 years ago

This commit fixes the following compilation warning:
linux-2.6/fs/nfs/nfs4proc.c:3265: warning: comparison of distinct pointer types lacks a cast
Signed-off-by: Jovi Zhang <bookjovi@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

nfs: add kmalloc return value check in decode_and_add_ds · b9f81057

Committed by Stanislav Fomichev 14 years ago

add kmalloc return value check in decode_and_add_ds
Signed-off-by: Stanislav Fomichev <kernel@fomichev.me>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

nfs: close NFSv4 COMMIT vs. CLOSE race · d2224e7a

Committed by Jeff Layton 14 years ago

I've been adding in more artificial delays in the NFSv4 commit and close
codepaths to uncover races. The kernel I'm testing has the patch to
close the race in __rpc_wait_for_completion_task that's in Trond's
cthon2011 branch. The reproducer I've been using does this in a loop:

	mkdir("DIR");
	fd = open("DIR/FILE", O_WRONLY|O_CREAT|O_EXCL, 0644);
	write(fd, "abcdefg", 7);
	close(fd);
	unlink("DIR/FILE");
	rmdir("DIR");

The above reproducer shouldn't result in any silly-renaming. However,
when I add a "msleep(100)" just after the nfs_commit_clear_lock call in
nfs_commit_release, I can almost always force one to occur. If I can
force it to occur with that, then it can happen without that delay
given the right timing.

nfs_commit_inode waits for the NFS_INO_COMMIT bit to clear when called
with FLUSH_SYNC set. nfs_commit_rpcsetup on the other hand does not wait
for the task to complete before putting its reference to it, so the last
reference gets put in rpc_release task and gets queued to a workqueue.

In this situation, the last open context reference may be put by the
COMMIT release instead of the close() syscall. The close() syscall
returns too quickly and the unlink runs while the d_count is still
high since the COMMIT release hasn't put its dentry reference yet.

Fix this by having nfs_commit_rpcsetup wait for the RPC call to complete
before putting the task reference when FLUSH_SYNC is set. With this, the
last reference is put by the process that's initiating the FLUSH_SYNC
commit and the race is closed.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

SUNRPC: Close a race in __rpc_wait_for_completion_task() · bf294b41

Committed by Trond Myklebust 14 years ago

Although they run as rpciod background tasks, under normal operation
(i.e. no SIGKILL), functions like nfs_sillyrename(), nfs4_proc_unlck()
and nfs4_do_close() want to be fully synchronous. This means that when we
exit, we want all references to the rpc_task to be gone, and we want
any dentry references etc. held by that task to be released.

For this reason these functions call __rpc_wait_for_completion_task(),
followed by rpc_put_task() in the expectation that the latter will be
releasing the last reference to the rpc_task, and thus ensuring that the
callback_ops->rpc_release() has been called synchronously.

This patch fixes a race which exists due to the fact that
rpciod calls rpc_complete_task() (in order to wake up the callers of
__rpc_wait_for_completion_task()) and then subsequently calls
rpc_put_task() without ensuring that these two steps are done atomically.

In order to avoid adding new spin locks, the patch uses the existing
waitqueue spin lock to order the rpc_task reference count releases between
the waiting process and rpciod.
The common case where nobody is waiting for completion is optimised for by
checking if the RPC_TASK_ASYNC flag is cleared and/or if the rpc_task
reference count is 1: in those cases we drop trying to grab the spin lock,
and immediately free up the rpc_task.

Those few processes that need to put the rpc_task from inside an
asynchronous context and that do not care about ordering are given a new
helper: rpc_put_task_async().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

openeuler / Kernel: last synced successfully more than 1 year ago
