1. 23 September 2014 (2 commits)
    • xfs: track collapse via file offset rather than extent index · 2c845f5a
      Brian Foster authored
      The collapse range implementation uses a transaction per extent shift.
      The progress of the overall operation is tracked via the current extent
      index of the in-core extent list. This is racy because the ilock must be
      dropped and reacquired for each transaction according to locking and log
      reservation rules. Therefore, writeback to prior regions of the file is
      possible and can change the extent count. This changes the extent to
      which the current index refers and causes the collapse to fail mid
      operation. To avoid this problem, the entire file is currently written
      back before the collapse operation starts.
      
      To eliminate the need to flush the entire file, use the file offset
      (fsb) to track the progress of the overall extent shift operation rather
      than the extent index. Modify xfs_bmap_shift_extents() to
      unconditionally convert the start_fsb parameter to an extent index and
      return the file offset of the extent where the shift left off, if
      further extents exist. The bulk of this function can remain based on
      the extent index, as the ilock is held by the caller. xfs_collapse_file_space()
      now uses the fsb output as the starting point for the subsequent shift.
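      
      A rough sketch of the resulting caller loop (the argument list of
      xfs_bmap_shift_extents() is simplified here for illustration):
      
        xfs_fileoff_t   next_fsb = start_fsb;
        int             done = 0;
        int             error = 0;
      
        while (!error && !done) {
                /* allocate a transaction, reserve log space, take the ilock */
                error = xfs_bmap_shift_extents(tp, ip, &next_fsb, shift_fsb,
                                               &done);
                /*
                 * Commit and drop the ilock. If more extents remain, the
                 * next iteration resumes at next_fsb, which stays valid even
                 * if writeback changes the extent count in the meantime.
                 */
        }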
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: ensure WB_SYNC_ALL writeback handles partial pages correctly · 0d085a52
      Dave Chinner authored
      XFS has been having trouble with stray delayed allocation extents
      beyond EOF for a long time. Recent changes to the collapse range
      code have triggered erroneous EBUSY errors on page invalidation for
      filesystems with a block size smaller than the page size. These
      have been caused by dirty buffers beyond EOF on a partial page which
      do not get written to disk during a sync.
      
      The issue is that write-ahead in xfs_cluster_write() finds such a
      partial page and handles it by leaving the page dirty but pushing it
      into a writeback state. This used to work just fine, as the
      write_cache_pages() code would then find the dirty partial page in
      the next mapping tree lookup as the dirty tag is still set.
      
      Unfortunately, when we moved to a mark and sweep approach to
      writeback to fix other writeback sync issues, we broke this. The
      act of marking the page as under writeback now clears the TOWRITE
      tag in the radix tree, even though the page is still dirty. Hence
      the next lookup on the mapping tree does not find the dirty partial
      page and so doesn't try to write it again.
      
      This same writeback bug was found recently in ext4 and fixed in
      commit 1c8349a1 ("ext4: fix data integrity sync in ordered mode")
      without communication to the wider filesystem community. We can use
      exactly the same fix here so the TOWRITE flag is not cleared on
      partial page writes.
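      
      The shape of the fix, as a sketch of XFS's writeback start helper
      (assuming the set_page_writeback_keepwrite() helper the ext4 commit
      introduced; the argument list is trimmed for illustration):
      
        static void
        xfs_start_page_writeback(
                struct page     *page,
                int             clear_dirty)
        {
                ASSERT(PageLocked(page));
                ASSERT(!PageWriteback(page));
      
                /*
                 * If the page was not fully cleaned, keep it dirty and do
                 * not clear the TOWRITE tag, so the next pass of a data
                 * integrity sync still finds and writes the partial page.
                 */
                if (clear_dirty) {
                        clear_page_dirty_for_io(page);
                        set_page_writeback(page);
                } else
                        set_page_writeback_keepwrite(page);
      
                unlock_page(page);
        }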
      
      cc: stable@vger.kernel.org # dependent on 1c8349a1
      Root-cause-found-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  2. 02 September 2014 (7 commits)
    • xfs: trim eofblocks before collapse range · 41b9d726
      Brian Foster authored
      xfs_collapse_file_space() currently writes back the entire file
      undergoing collapse range to settle things down for the extent shift
      algorithm. While this prevents changes to the extent list during the
      collapse operation, the writeback itself is not enough to prevent
      unnecessary collapse failures.
      
      The current shift algorithm uses the extent index to iterate the in-core
      extent list. If a post-eof delalloc extent persists after the writeback
      (e.g., a prior zero range operation whose end aligns with EOF can leave
      the post-eof blocks unwritten and unconverted), xfs_bmap_shift_extents()
      becomes confused over the encoded br_startblock value and fails the
      collapse.
      
      As with the full writeback, this is a temporary fix until the algorithm
      is improved to cope with a volatile extent list and avoid attempts to
      shift post-eof extents.
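      
      The shape of the workaround, as a sketch (the eofblocks helper names
      and signatures are assumed from the XFS tree of this era, not quoted
      from the patch):
      
        /* in xfs_collapse_file_space(), before flushing and shifting */
        if (xfs_can_free_eofblocks(ip, true)) {
                error = xfs_free_eofblocks(mp, ip, false);
                if (error)
                        return error;
        }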
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: xfs_file_collapse_range is delalloc challenged · 1669a8ca
      Dave Chinner authored
      If we have delalloc extents on a file before we run a collapse range
      operation, we sync the range that we are going to collapse to
      convert delalloc extents in that region to real extents to simplify
      the shift operation.
      
      However, the shift operation then assumes that the extent list is
      not going to change as it iterates over the extent list moving
      things about. Unfortunately, this isn't true because we can't hold
      the ILOCK over all the operations. We can prevent new IO from
      modifying the extent list by holding the IOLOCK, but that doesn't
      prevent writeback from running....
      
      And when writeback runs, it can convert delalloc extents in the
      range of the file prior to the region being collapsed, and this
      changes the indexes of all the extents in the file. That causes the
      collapse range operation to Go Bad.
      
      The right fix is to rewrite the extent shift operation not to be
      dependent on the extent list not changing across the entire
      operation, but this is a fairly significant piece of work to do.
      Hence, as a short-term workaround for the problem, sync the entire
      file before starting a collapse operation to remove all delalloc
      ranges from the file and so avoid the problem of concurrent
      writeback changing the extent list.
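      
      As a sketch, the workaround amounts to writing back and waiting on the
      whole mapping before the shift starts, instead of just the range being
      collapsed:
      
        error = filemap_write_and_wait(VFS_I(ip)->i_mapping);
        if (error)
                return error;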
      Diagnosed-and-Reported-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: don't log inode unless extent shift makes extent modifications · ca446d88
      Brian Foster authored
      The file collapse mechanism uses xfs_bmap_shift_extents() to collapse
      all subsequent extents down into the specified, previously punched out,
      region. This function performs some validation, such as whether a
      sufficient hole exists in the target region of the collapse, then shifts
      the remaining extents downward.
      
      The exit path of the function currently logs the inode unconditionally.
      While we must log the inode (and abort) if an error occurs and the
      transaction is dirty, the initial validation paths can generate errors
      before the transaction has been dirtied. Logging the inode in those
      paths dirties the transaction anyway, so the caller ends up cancelling
      a dirty transaction and triggers an unnecessary filesystem shutdown.
      
      Modify xfs_bmap_shift_extents() to OR the logflags bits as modifications
      are made to the inode bmap. Only log the inode in the exit path if
      logflags has been set. This ensures we only have to cancel a dirty
      transaction if modifications have been made and prevents an unnecessary
      filesystem shutdown otherwise.
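      
      A sketch of the resulting pattern (the flag values are the standard XFS
      inode logging flags; surrounding code is elided):
      
        int logflags = 0;
      
        /* validation failures return here without dirtying the transaction */
      
        /* only set once the bmap is actually modified */
        logflags |= XFS_ILOG_CORE | XFS_ILOG_DEXT;
      
        /* exit path: log the inode only if something changed */
        if (logflags)
                xfs_trans_log_inode(tp, ip, logflags);
        return error;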
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: use ranged writeback and invalidation for direct IO · 7d4ea3ce
      Dave Chinner authored
      Now that we are not doing silly things with dirtying buffers beyond EOF
      and are using invalidation correctly, we can finally reduce the ranges
      of writeback and invalidation used by direct IO to match that of the IO
      being issued.
      
      Bring the writeback and invalidation ranges back to match the
      generic direct IO code - this will greatly reduce the perturbation
      of cached data when direct IO and buffered IO are mixed, but still
      provide the same buffered vs direct IO coherency behaviour we
      currently have.
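      
      A minimal sketch of the narrowed flush, matching what the generic
      direct IO path does (pos and count stand for the offset and length of
      the IO being issued and are illustrative):
      
        if (mapping->nrpages) {
                ret = filemap_write_and_wait_range(mapping, pos,
                                                   pos + count - 1);
                if (ret)
                        goto out;
        }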
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: don't zero partial page cache pages during O_DIRECT writes · 834ffca6
      Dave Chinner authored
      Similar to direct IO reads, direct IO writes are using 
      truncate_pagecache_range to invalidate the page cache. This is
      incorrect due to the sub-block zeroing in the page cache that
      truncate_pagecache_range() triggers.
      
      This patch fixes things by using invalidate_inode_pages2_range
      instead.  It preserves the page cache invalidation, but won't zero
      any pages.
      
      cc: stable@vger.kernel.org
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: don't zero partial page cache pages during O_DIRECT writes · 85e584da
      Chris Mason authored
      xfs is using truncate_pagecache_range to invalidate the page cache
      during DIO reads.  This is different from the other filesystems who
      only invalidate pages during DIO writes.
      
      truncate_pagecache_range is meant to be used when we are freeing the
      underlying data structs from disk, so it will zero any partial
      ranges in the page.  This means a DIO read can zero out part of the
      page cache page, and it is possible the page will stay in cache.
      
      Buffered reads will then find an up-to-date page containing zeros
      instead of the data actually on disk.
      
      This patch fixes things by using invalidate_inode_pages2_range
      instead.  It preserves the page cache invalidation, but won't zero
      any pages.
      
      [dchinner: catch error and warn if it fails. Comment.]
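      
      A sketch of the replacement call, including the warning that was added
      on failure (pos and count here stand for the offset and length of the
      direct IO and are illustrative):
      
        ret = invalidate_inode_pages2_range(mapping,
                        pos >> PAGE_CACHE_SHIFT,
                        (pos + count - 1) >> PAGE_CACHE_SHIFT);
        WARN_ON_ONCE(ret);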
      
      cc: stable@vger.kernel.org
      Signed-off-by: Chris Mason <clm@fb.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: don't dirty buffers beyond EOF · 22e757a4
      Dave Chinner authored
      generic/263 is failing fsx at this point with a page spanning
      EOF that cannot be invalidated. The operations are:
      
      1190 mapwrite   0x52c00 thru    0x5e569 (0xb96a bytes)
      1191 mapread    0x5c000 thru    0x5d636 (0x1637 bytes)
      1192 write      0x5b600 thru    0x771ff (0x1bc00 bytes)
      
      where 1190 extends EOF from 0x54000 to 0x5e569. When the direct IO
      write attempts to invalidate the cached page over this range, it
      fails with -EBUSY and so any attempt to do page invalidation fails.
      
      The real question is this: Why can't that page be invalidated after
      it has been written to disk and cleaned?
      
      Well, there's data on the first two buffers in the page (1k block
      size, 4k page), but the third buffer on the page (i.e. beyond EOF)
      is failing drop_buffers because it's bh->b_state == 0x3, which is
      BH_Uptodate | BH_Dirty.  IOWs, there's dirty buffers beyond EOF. Say
      what?
      
      OK, set_buffer_dirty() is called on all buffers from
      __set_page_dirty_buffers(), regardless of whether the buffer is
      beyond EOF or not, which means that when we get to ->writepage,
      we have buffers marked dirty beyond EOF that we need to clean.
      So, we need to implement our own .set_page_dirty method that
      doesn't dirty buffers beyond EOF.
      
      This is messy because the buffer code is not meant to be shared
      and it has interesting locking issues on the buffer dirty bits.
      So just copy and paste it and then modify it to suit what we need.
      
      Note: the solutions the other filesystems and generic block code use
      of marking the buffers clean in ->writepage does not work for XFS.
      It still leaves dirty buffers beyond EOF and invalidations still
      fail. Hence rather than play whack-a-mole, this patch simply
      prevents those buffers from being dirtied in the first place.
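      
      A trimmed sketch of the idea (the locking and radix tree dirty tagging
      that the real method copies from the generic buffer code are omitted):
      
        static int
        xfs_vm_set_page_dirty(struct page *page)
        {
                struct inode    *inode = page->mapping->host;
                loff_t          end_offset = i_size_read(inode);
                loff_t          offset = page_offset(page);
      
                if (page_has_buffers(page)) {
                        struct buffer_head *head = page_buffers(page);
                        struct buffer_head *bh = head;
      
                        do {
                                /* never dirty a buffer that sits beyond EOF */
                                if (offset < end_offset)
                                        set_buffer_dirty(bh);
                                bh = bh->b_this_page;
                                offset += 1 << inode->i_blkbits;
                        } while (bh != head);
                }
                return !TestSetPageDirty(page);
        }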
      
      cc: <stable@kernel.org>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  3. 25 August 2014 (1 commit)
    • aio: fix reqs_available handling · d856f32a
      Benjamin LaHaise authored
      As reported by Dan Aloni, commit f8567a38 ("aio: fix aio request
      leak when events are reaped by userspace") introduces a regression when
      user code attempts to perform io_submit() with more events than are
      available in the ring buffer.  Reverting that commit would reintroduce a
      regression when user space event reaping is used.
      
      Fixing this bug is a bit more involved than the previous attempts to fix
      this regression.  Since we do not have a single point at which we can
      count events as being reaped by user space and io_getevents(), we have
      to track event completion by looking at the number of events left in the
      event ring.  So long as there are as many events in the ring buffer as
      there have been completion events generate, we cannot call
      put_reqs_available().  The code to check for this is now placed in
      refill_reqs_available().
      
      A test program from Dan, modified by me to verify this bug, is available
      at http://www.kvack.org/~bcrl/20140824-aio_bug.c .
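      
      A sketch of the counting logic described above (ctx->completed_events
      is the per-context count of completion events generated so far; the
      details are illustrative):
      
        static void refill_reqs_available(struct kioctx *ctx, unsigned head,
                                          unsigned tail)
        {
                unsigned events_in_ring, completed;
      
                head %= ctx->nr_events;
                if (head <= tail)
                        events_in_ring = tail - head;
                else
                        events_in_ring = ctx->nr_events - (head - tail);
      
                /* only hand back requests whose events have been reaped */
                completed = ctx->completed_events;
                if (completed <= events_in_ring)
                        return;
                completed -= events_in_ring;
      
                ctx->completed_events -= completed;
                put_reqs_available(ctx, completed);
        }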
      Reported-by: Dan Aloni <dan@kernelim.com>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      Acked-by: Dan Aloni <dan@kernelim.com>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Petr Matousek <pmatouse@redhat.com>
      Cc: stable@vger.kernel.org      # v3.16 and anything that f8567a38 was backported to
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 23 August 2014 (8 commits)
  5. 20 August 2014 (3 commits)
  6. 18 August 2014 (3 commits)
  7. 17 August 2014 (3 commits)
  8. 16 August 2014 (3 commits)
  9. 15 August 2014 (9 commits)
    • btrfs: disable strict file flushes for renames and truncates · 8d875f95
      Chris Mason authored
      Truncates and renames are often used to replace old versions of a file
      with new versions.  Applications often expect this to be an atomic
      replacement, even if they haven't done anything to make sure the new
      version is fully on disk.
      
      Btrfs has strict flushing in place to make sure that renaming over an
      old file with a new file will fully flush out the new file before
      allowing the transaction commit with the rename to complete.
      
      This ordering means the commit code needs to be able to lock file pages,
      and there are a few paths in the filesystem where we will try to end a
      transaction with the page lock held.  It's rare, but these things can
      deadlock.
      
      This patch removes the ordered flushes and switches to a best effort
      filemap_flush like ext4 uses. It's not perfect, but it should fix the
      deadlocks.
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix csum tree corruption, duplicate and outdated checksums · 27b9a812
      Filipe Manana authored
      Under rare circumstances we can end up leaving 2 versions of a checksum
      for the same file extent range.
      
      The reason for this is that after calling btrfs_next_leaf we process
      slot 0 of the leaf it returns, instead of processing the slot set in
      path->slots[0]. Most of the time (by far) path->slots[0] is 0, but after
      btrfs_next_leaf() releases the path and before it searches for the next
      leaf, another task might cause a split of the next leaf, which migrates
      some of its keys to the leaf we were processing before calling
      btrfs_next_leaf(). In this case btrfs_next_leaf() returns again the
      same leaf but with path->slots[0] having a slot number corresponding
      to the first new key it got, that is, a slot number that didn't exist
      before calling btrfs_next_leaf(), as the leaf now has more keys than
      it had before. So we must really process the returned leaf starting at
      path->slots[0] always, as it isn't always 0, and the key at slot 0 can
      have an offset much lower than our search offset/bytenr.
      
      For example, consider the following scenario, where we have:
      
      sums->bytenr: 40157184, sums->len: 16384, sums end: 40173568
      four 4kb file data blocks with offsets 40157184, 40161280, 40165376, 40169472
      
        Leaf N:
      
          slot = 0                           slot = btrfs_header_nritems() - 1
        |-------------------------------------------------------------------|
        | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] |
        |-------------------------------------------------------------------|
      
        Leaf N + 1:
      
            slot = 0                          slot = btrfs_header_nritems() - 1
        |--------------------------------------------------------------------|
  | [(CSUM CSUM 40161280), size 32] ... [(CSUM CSUM 40615936), size 8] |
        |--------------------------------------------------------------------|
      
      Because we are at the last slot of leaf N, we call btrfs_next_leaf() to
      find the next highest key, which releases the current path and then searches
      for that next key. However after releasing the path and before finding that
      next key, the item at slot 0 of leaf N + 1 gets moved to leaf N, due to a call
      to ctree.c:push_leaf_left() (via ctree.c:split_leaf()), and therefore
      btrfs_next_leaf() will return us a path again with leaf N but with the slot
      pointing to its new last key (CSUM CSUM 40161280). This new version of leaf N
      is then:
      
          slot = 0                        slot = btrfs_header_nritems() - 2  slot = btrfs_header_nritems() - 1
        |----------------------------------------------------------------------------------------------------|
        | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4]  [(CSUM CSUM 40161280), size 32] |
        |----------------------------------------------------------------------------------------------------|
      
      And incorrectly using slot 0 makes us set next_offset to 39239680, and we jump
      into the "insert:" label, which will set tmp to:
      
          tmp = min((sums->len - total_bytes) >> blocksize_bits,
              (next_offset - file_key.offset) >> blocksize_bits) =
          min((16384 - 0) >> 12, (39239680 - 40157184) >> 12) =
          min(4, (u64)-917504 = 18446744073708634112 >> 12) = 4
      
      and
      
         ins_size = csum_size * tmp = 4 * 4 = 16 bytes.
      
      In other words, we insert a new csum item in the tree with key
      (CSUM_OBJECTID CSUM_KEY 40157184 = sums->bytenr) that contains the checksums
      for all the data (4 blocks of 4096 bytes each = sums->len). Which is wrong,
      because the item with key (CSUM CSUM 40161280) (the one that was moved from
      leaf N + 1 to the end of leaf N) contains the old checksums of the last 12288
      bytes of our data and won't get those old checksums removed.
      
      So this leaves us 2 different checksums for 3 4kb blocks of data in the tree,
      and breaks the logical rule:
      
         Key_N+1.offset >= Key_N.offset + length_of_data_its_checksums_cover
      
      An obvious bad effect of this is that a subsequent csum tree lookup to get
      the checksum of any of the blocks with logical offset of 40161280, 40165376
      or 40169472 (the last 3 4kb blocks of file data), will get the old checksums.
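      
      The fix, in sketch form, is simply to honour the slot that
      btrfs_next_leaf() hands back instead of assuming slot 0:
      
        ret = btrfs_next_leaf(root, path);
        if (ret < 0)
                goto fail;
        if (ret > 0)
                goto insert;
        leaf = path->nodes[0];
        /* use path->slots[0], which may be non-zero after a concurrent split */
        btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);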
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: Fix memory corruption by ulist_add_merge() on 32bit arch · 4eb1f66d
      Takashi Iwai authored
      We've got bug reports that btrfs crashes when quota is enabled on
      32bit kernel, typically with the Oops like below:
       BUG: unable to handle kernel NULL pointer dereference at 00000004
       IP: [<f9234590>] find_parent_nodes+0x360/0x1380 [btrfs]
       *pde = 00000000
       Oops: 0000 [#1] SMP
       CPU: 0 PID: 151 Comm: kworker/u8:2 Tainted: G S      W 3.15.2-1.gd43d97e-default #1
       Workqueue: btrfs-qgroup-rescan normal_work_helper [btrfs]
       task: f1478130 ti: f147c000 task.ti: f147c000
       EIP: 0060:[<f9234590>] EFLAGS: 00010213 CPU: 0
       EIP is at find_parent_nodes+0x360/0x1380 [btrfs]
       EAX: f147dda8 EBX: f147ddb0 ECX: 00000011 EDX: 00000000
       ESI: 00000000 EDI: f147dda4 EBP: f147ddf8 ESP: f147dd38
        DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
       CR0: 8005003b CR2: 00000004 CR3: 00bf3000 CR4: 00000690
       Stack:
        00000000 00000000 f147dda4 00000050 00000001 00000000 00000001 00000050
        00000001 00000000 d3059000 00000001 00000022 000000a8 00000000 00000000
        00000000 000000a1 00000000 00000000 00000001 00000000 00000000 11800000
       Call Trace:
        [<f923564d>] __btrfs_find_all_roots+0x9d/0xf0 [btrfs]
        [<f9237bb1>] btrfs_qgroup_rescan_worker+0x401/0x760 [btrfs]
        [<f9206148>] normal_work_helper+0xc8/0x270 [btrfs]
        [<c025e38b>] process_one_work+0x11b/0x390
        [<c025eea1>] worker_thread+0x101/0x340
        [<c026432b>] kthread+0x9b/0xb0
        [<c0712a71>] ret_from_kernel_thread+0x21/0x30
        [<c0264290>] kthread_create_on_node+0x110/0x110
      
      This indicates a NULL corruption in prefs_delayed list.  The further
      investigation and bisection pointed that the call of ulist_add_merge()
      results in the corruption.
      
      ulist_add_merge() takes a u64 as aux and writes a 64bit value into
      old_aux.  The callers of this function in backref.c, however, pass a
      pointer of a pointer to old_aux.  That is, the function writes a 64bit
      value over a 32bit pointer.  This put a NULL into the adjacent
      variable, in this case, prefs_delayed.
      
      Here is a quick attempt to band-aid over this: a new function,
      ulist_add_merge_ptr() is introduced to properly pass/store a pointer
      value instead of a u64.  There are still ugly void ** casts remaining
      in the callers because void ** cannot be converted implicitly.  But
      it's safer than an explicit cast to u64, anyway.
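      
      Roughly, the new helper stores the pointer through a 64bit temporary on
      32bit builds (a sketch of the idea, not a verbatim quote of the patch):
      
        static inline int ulist_add_merge_ptr(struct ulist *ulist, u64 val,
                                              void *aux, void **old_aux,
                                              gfp_t gfp_mask)
        {
        #if BITS_PER_LONG == 32
                u64 old64 = (uintptr_t)*old_aux;
                int ret = ulist_add_merge(ulist, val, (uintptr_t)aux,
                                          &old64, gfp_mask);
                *old_aux = (void *)(uintptr_t)old64;
                return ret;
        #else
                return ulist_add_merge(ulist, val, (u64)aux,
                                       (u64 *)old_aux, gfp_mask);
        #endif
        }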
      
      Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=887046
      Cc: <stable@vger.kernel.org> [v3.11+]
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix compressed write corruption on enospc · ce62003f
      Liu Bo authored
      When failing to allocate space for the whole compressed extent, we'll
      fall back to uncompressed IO, but we've forgotten to redirty the pages
      which belong to this compressed extent. These 'clean' pages then simply
      skip the 'submit' part and go straight to endio, and in the end we get
      data corruption because we write nothing.
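      
      A sketch of the missing step in the fallback path (the redirty helper
      is what the patch adds; its name and the async_extent fields are
      assumed here):
      
        /* make the pages dirty again so the uncompressed path submits them */
        extent_range_redirty_for_io(inode, async_extent->start,
                        async_extent->start + async_extent->ram_size - 1);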
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Tested-by: Martin Steigerwald <martin@lichtvoll.de>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: correctly handle return from ulist_add · f90e579c
      Mark Fasheh authored
      ulist_add() can return '1' on success, which qgroup_subtree_accounting()
      doesn't take into account. As a result, that value can be bubbled up to
      callers, causing an error to be printed. Fix this by only returning the
      value of ulist_add() when it indicates an error.
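      
      In sketch form, the call site now only propagates genuine errors (the
      surrounding variables are illustrative):
      
        ret = ulist_add(ulist, qg->qgroupid, (uintptr_t)qg, GFP_ATOMIC);
        if (ret < 0)
                goto out;
        /* ret == 1 just means a new entry was added, not an error */
        ret = 0;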
      Signed-off-by: Mark Fasheh <mfasheh@suse.de>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: qgroup: account shared subtrees during snapshot delete · 1152651a
      Mark Fasheh authored
      During its tree walk, btrfs_drop_snapshot() will skip any shared
      subtrees it encounters. This is incorrect when we have qgroups
      turned on as those subtrees need to have their contents
      accounted. In particular, the case we're concerned with is when
      removing our snapshot root leaves the subtree with only one root
      reference.
      
      In those cases we need to find the last remaining root and add
      each extent in the subtree to the corresponding qgroup exclusive
      counts.
      
      This patch implements the shared subtree walk and a new qgroup
      operation, BTRFS_QGROUP_OPER_SUB_SUBTREE. When an operation of
      this type is encountered during qgroup accounting, we search for
      any root references to that extent and in the case that we find
      only one reference left, we go ahead and do the math on its
      exclusive counts.
      Signed-off-by: Mark Fasheh <mfasheh@suse.de>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: read lock extent buffer while walking backrefs · 6f7ff6d7
      Filipe Manana authored
      Before processing the extent buffer, acquire a read lock on it, so
      that we're safe against concurrent updates on the extent buffer.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: __btrfs_mod_ref should always use no_quota · e339a6b0
      Josef Bacik authored
      I originally extended the no_quota arg to btrfs_dec/inc_ref because I didn't
      understand how snapshot delete was using it and assumed that we needed the
      quota operations there.  With Mark's work this has turned out not to be the
      case: we _always_ need to use no_quota for btrfs_dec/inc_ref, so just drop
      the argument and make __btrfs_mod_ref call its process function with
      no_quota always set.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: adjust statfs calculations according to raid profiles · ba7b6e62
      David Sterba authored
      This has been discussed in thread:
      http://thread.gmane.org/gmane.comp.file-systems.btrfs/32528
      
      and this patch implements this proposal:
      http://thread.gmane.org/gmane.comp.file-systems.btrfs/32536
      
      Works fine for "clean" raid profiles where the raid factor correction
      does the right job. Otherwise it's pessimistic and may show low space
      although there's still some left.
      
      The df numbers are slightly wrong in case of mixed block groups, but this
      is not a major usecase and can be addressed later.
      
      The RAID56 numbers are wrong almost the same way as before and will be
      addressed separately.
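      
      The correction is essentially a per-profile replication factor applied
      to the raw free space, sketched here for the simple mirrored profiles
      (the real code derives this per allocation profile):
      
        u64 factor = 1;
      
        if (profile & (BTRFS_BLOCK_GROUP_DUP |
                       BTRFS_BLOCK_GROUP_RAID1 |
                       BTRFS_BLOCK_GROUP_RAID10))
                factor = 2;
      
        /* raw free bytes on the devices buy only 1/factor of usable space */
        avail = div_u64(raw_free_bytes, factor);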
      
      CC: Hugo Mills <hugo@carfax.org.uk>
      CC: cwillu <cwillu@cwillu.com>
      CC: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
  10. 14 August 2014 (1 commit)