提交 · c9e15f25f514a76d906be01e621f400cdee94558 · openeuler / raspberrypi-kernel

03 4月, 2015 1 次提交

debugfs: allow bad parent pointers to be passed in · c9e15f25

由 Greg KH 提交于 3月 30, 2015

If something went wrong with creating a debugfs file/symlink/directory,
that value could be passed down into debugfs again as a parent dentry.
To make caller code simpler, just error out if this happens, and don't
crash the kernel.
Reported-by: NAlex Elder <elder@linaro.org>
Reviewed-by: NViresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: NAlex Elder <elder@linaro.org>

c9e15f25

25 3月, 2015 2 次提交

sysfs: Only accept read/write permissions for file attributes · d8bf8c92

由 Vivien Didelot 提交于 3月 12, 2015

For sysfs file attributes, only read and write permissions make sense.
Mask provided attribute permissions accordingly and send a warning
to the console if invalid permission bits are set.

This patch is originally from Guenter [1] and includes the fixup
explained in the thread, that is printing permissions in octal format
and limiting the scope of attributes to SYSFS_PREALLOC | 0664.

[1] https://lkml.org/lkml/2015/1/19/599Signed-off-by: NVivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: NGuenter Roeck <linux@roeck-us.net>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

d8bf8c92

sysfs: Use only return value from is_visible for the file mode · da4759c7

由 Guenter Roeck 提交于 3月 12, 2015

Up to now, is_visible can only be used to either remove visibility
of a file entirely or to add permissions, but not to reduce permissions.
This makes it impossible, for example, to use DEVICE_ATTR_RW to define
file attributes and reduce permissions to read-only.

This behavior is undesirable and unnecessarily complicates code which
needs to reduce permissions; instead of just returning the desired
permissions, it has to ensure that the permissions in the attribute
variable declaration only reflect the minimal permissions ever needed.

Change semantics of is_visible to only use the permissions returned
from it instead of oring the returned value with the hard-coded
permissions.
Signed-off-by: NGuenter Roeck <linux@roeck-us.net>
Signed-off-by: NVivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

da4759c7

20 3月, 2015 1 次提交

Subject: nfsd: don't recursively call nfsd4_cb_layout_fail · 133d5582

由 Christoph Hellwig 提交于 3月 05, 2015

Due to a merge error when creating c5c707f9 ("nfsd: implement pNFS
layout recalls"), we recursively call nfsd4_cb_layout_fail from itself,
leading to stack overflows.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Fixes:  c5c707f9 ("nfsd: implement pNFS layout recalls")
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
---
 fs/nfsd/nfs4layouts.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/nfsd/nfs4layouts.c b/fs/nfsd/nfs4layouts.c
index 3c1bfa1..1028a06 100644
--- a/fs/nfsd/nfs4layouts.c
+++ b/fs/nfsd/nfs4layouts.c
@@ -587,8 +587,6 @@ nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls)

 	rpc_ntop((struct sockaddr *)&clp->cl_addr, addr_str, sizeof(addr_str));

-	nfsd4_cb_layout_fail(ls);
-
 	printk(KERN_WARNING
 		"nfsd: client %s failed to respond to layout recall. "
 		"  Fencing..\n", addr_str);
--
1.9.1

133d5582

19 3月, 2015 1 次提交

fuse: explicitly set /dev/fuse file's private_data · 94e4fe2c

由 Tom Van Braeckel 提交于 1月 12, 2015

The misc subsystem (which is used for /dev/fuse) initializes private_data to
point to the misc device when a driver has registered a custom open file
operation, and initializes it to NULL when a custom open file operation has
*not* been provided.

This subtle quirk is confusing, to the point where kernel code registers
*empty* file open operations to have private_data point to the misc device
structure. And it leads to bugs, where the addition or removal of a custom open
file operation surprisingly changes the initial contents of a file's
private_data structure.

So to simplify things in the misc subsystem, a patch [1] has been proposed to
*always* set the private_data to point to the misc device, instead of only
doing this when a custom open file operation has been registered.

But before this patch can be applied we need to modify drivers that make the
assumption that a misc device file's private_data is initialized to NULL
because they didn't register a custom open file operation, so they don't rely
on this assumption anymore. FUSE uses private_data to store the fuse_conn and
errors out if this is not initialized to NULL at mount time.

Hence, we now set a file's private_data to NULL explicitly, to be independent
of whatever value the misc subsystem initializes it to by default.

[1] https://lkml.org/lkml/2014/12/4/939Reported-by: NGiedrius Statkevicius <giedriuswork@gmail.com>
Reported-by: NThierry Reding <thierry.reding@gmail.com>
Signed-off-by: NTom Van Braeckel <tomvanbraeckel@gmail.com>
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>

94e4fe2c

18 3月, 2015 8 次提交

ovl: upper fs should not be R/O · 71cbad7e

由 hujianyang 提交于 1月 15, 2015

After importing multi-lower layer support, users could mount a r/o
partition as the left most lowerdir instead of using it as upperdir.
And a r/o upperdir may cause an error like

	overlayfs: failed to create directory ./workdir/work

during mount.

This patch check the *s_flags* of upper fs and return an error if
it is a r/o partition. The checking of *upper_mnt->mnt_sb->s_flags*
can be removed now.

This patch also remove

	/* FIXME: workdir is not needed for a R/O mount */

from ovl_fill_super() because:

1) for upper fs r/o case
Setting a r/o partition as upper is prevented, no need to care about
workdir in this case.

2) for "mount overlay -o ro" with a r/w upper fs case
Users could remount overlayfs to r/w in this case, so workdir should
not be omitted.
Signed-off-by: Nhujianyang <hujianyang@huawei.com>
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>

71cbad7e

ovl: check lowerdir amount for non-upper mount · 6be4506e

由 hujianyang 提交于 1月 15, 2015

Recently multi-lower layer mount support allow upperdir and workdir
to be omitted, then cause overlayfs can be mount with only one
lowerdir directory. This action make no sense and have potential risk.

This patch check the total number of lower directories to prevent
mounting overlayfs with only one directory.

Also, an error message is added to indicate lower directories exceed
OVL_MAX_STACK limit.
Signed-off-by: Nhujianyang <hujianyang@huawei.com>
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>

6be4506e

ovl: print error message for invalid mount options · bead55ef

由 hujianyang 提交于 1月 15, 2015

Overlayfs should print an error message if an incorrect mount option
is caught like other filesystems.

After this patch, improper option input could be clearly known.
Reported-by: NFabian Sturm <fabian.sturm@aduu.de>
Signed-off-by: Nhujianyang <hujianyang@huawei.com>
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>

bead55ef

Btrfs: fix outstanding_extents accounting in DIO · e1cbbfa5

由 Josef Bacik 提交于 3月 17, 2015

We are keeping track of how many extents we need to reserve properly based on
the amount we want to write, but we were still incrementing outstanding_extents
if we wrote less than what we requested. This isn't quite right since we will
be limited to our max extent size. So instead lets do something horrible! Keep
track of how many outstanding_extents we reserved, and decrement each time we
allocate an extent. If we use our entire reserve make sure to jack up
outstanding_extents on the inode so the accounting works out properly. Thanks,
Reported-by: NFilipe Manana <fdmanana@suse.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>

e1cbbfa5

Btrfs: add sanity test for outstanding_extents accounting · 6a3891c5

由 Josef Bacik 提交于 3月 16, 2015

I introduced a regression wrt outstanding_extents accounting.  These are tricky
areas that aren't easily covered by xfstests as we could change MAX_EXTENT_SIZE
at any time.  So add sanity tests to cover the various conditions that are
tricky in order to make sure we don't introduce regressions in the future.
Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>

6a3891c5

Btrfs: just free dummy extent buffers · bcb7e449

由 Josef Bacik 提交于 3月 16, 2015

If we fail during our sanity tests we could get NULL deref's because we unload
the module before the dummy extent buffers are free'd via RCU. So check for
this case and just free the things directly. Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>

bcb7e449

Btrfs: account merges/splits properly · ba117213

由 Josef Bacik 提交于 3月 13, 2015

My fix

Btrfs: fix merge delalloc logic

only fixed half of the problems, it didn't fix the case where we have two large
extents on either side and then join them together with a new small extent.  We
need to instead keep track of how many extents we have accounted for with each
side of the new extent, and then see how many extents we need for the new large
extent.  If they match then we know we need to keep our reservation, otherwise
we need to drop our reservation.  This shows up with a case like this

[BTRFS_MAX_EXTENT_SIZE+4K][4K HOLE][BTRFS_MAX_EXTENT_SIZE+4K]

Previously the logic would have said that the number extents required for the
new size (3) is larger than the number of extents required for the largest side
(2) therefore we need to keep our reservation.  But this isn't the case, since
both sides require a reservation of 2 which leads to 4 for the whole range
currently reserved, but we only need 3, so we need to drop one of the
reservations.  The same problem existed for splits, we'd think we only need 3
extents when creating the hole but in reality we need 4.  Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>

ba117213

pagemap: do not leak physical addresses to non-privileged userspace · ab676b7d

由 Kirill A. Shutemov 提交于 3月 09, 2015

As pointed by recent post[1] on exploiting DRAM physical imperfection,
/proc/PID/pagemap exposes sensitive information which can be used to do
attacks.

This disallows anybody without CAP_SYS_ADMIN to read the pagemap.

[1] http://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html

[ Eventually we might want to do anything more finegrained, but for now
  this is the simple model.   - Linus ]
Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: NAndy Lutomirski <luto@amacapital.net>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Seaborn <mseaborn@chromium.org>
Cc: stable@vger.kernel.org
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

ab676b7d

17 3月, 2015 2 次提交

Btrfs: prepare block group cache before writing · dcdf7f6d

由 Josef Bacik 提交于 3月 02, 2015

Writing the block group cache will modify the extent tree quite a bit because it
truncates the old space cache and pre-allocates new stuff. To try and cut down
on the churn lets do the setup dance first, then later on hopefully we can avoid
looping with newly dirtied roots. Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>

dcdf7f6d

kernfs: handle poll correctly on 'direct_read' files. · 7cff4b18

由 NeilBrown 提交于 3月 16, 2015

Kernfs supports two styles of read: direct_read and seqfile_read.

The latter supports 'poll' correctly thanks to the update of
'->event' in kernfs_seq_show.
The former does not as '->event' is never updated on a read.

So add an appropriate update in kernfs_file_direct_read().

This was noticed because some 'md' sysfs attributes were
recently changed to use direct reads.
Reported-by: NPrakash Punnoor <prakash@punnoor.de>
Reported-by: NTorsten Kaiser <just.for.lkml@googlemail.com>
Fixes: 750f199eSigned-off-by: NNeilBrown <neilb@suse.de>
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

7cff4b18

14 3月, 2015 7 次提交

locks: fix generic_delete_lease tracepoint to use victim pointer · a9b1b455

由 Jeff Layton 提交于 3月 14, 2015

It's possible that "fl" won't point at a valid lock at this point, so
use "victim" instead which is either a valid lock or NULL.
Signed-off-by: NJeff Layton <jeff.layton@primarydata.com>

a9b1b455

Btrfs: fix ASSERT(list_empty(&cur_trans->dirty_bgs_list) · ea526d18

由 Josef Bacik 提交于 3月 13, 2015

Dave could hit this assert consistently running btrfs/078. This is because
when we update the block groups we could truncate the free space, which would
try to delete the csums for that range and dirty the csum root. For this to
happen we have to have already written out the csum root so it's kind of hard to
hit this case. This patch fixes this by changing the logic to only write the
dirty block groups if the dirty_cowonly_roots list is empty. This will get us
the same effect as before since we add the extent root last, and will cover the
case that we dirty some other root again but not the extent root. Thanks,
Reported-by: NDavid Sterba <dsterba@suse.cz>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

ea526d18

Btrfs: account for the correct number of extents for delalloc reservations · 6a41dd09

由 Josef Bacik 提交于 3月 13, 2015

Direct IO can easily pass in an buffer that is greater than
BTRFS_MAX_EXTENT_SIZE, so take this into account when reserving extents in the
delalloc reservation code.  Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

6a41dd09

Btrfs: fix merge delalloc logic · 8461a3de

由 Josef Bacik 提交于 3月 13, 2015

My patch to properly count outstanding extents wrt MAX_EXTENT_SIZE introduced a
regression when re-dirtying already dirty areas. We have logic in split to make
sure we are taking the largest space into account but didn't have it for merge,
so it was sometimes making us think we were turning a tiny extent into a huge
extent, when in reality we already had a huge extent and needed to use the other
side in our logic. This fixes the regression that was reported by a user on
list. Thanks,
Reported-by: NMarkus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

8461a3de

Btrfs: fix comp_oper to get right order · 48da5f0a

由 Liu Bo 提交于 3月 13, 2015

Case (oper1->seq > oper2->seq) should differ with case (oper1->seq < oper2->seq).
Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
Reviewed-by: NDavid Sterba <dsterba@suse.cz>
Reviewed-by: NFilipe Manana <fdmanana@suse.com>
Signed-off-by: NChris Mason <clm@fb.com>

48da5f0a

Btrfs: catch transaction abortion after waiting for it · b4924a0f

由 Liu Bo 提交于 3月 06, 2015

This problem is uncovered by a test case: http://patchwork.ozlabs.org/patch/244297.

Fsync() can report success when it actually doesn't. When we
have several threads running fsync() at the same tiem and in one fsync() we
get a transaction abortion due to some problems(in the test case it's disk
failures), and other fsync()s may return successfully which makes userspace
programs think that data is now safely flushed into disk.

It's because that after fsyncs() fail btrfs_sync_log() due to disk failures,
they get to try btrfs_commit_transaction() where it finds that there is
already a transaction being committed, and they'll just call wait_for_commit()
and return. Note that we actually check "trans->aborted" in btrfs_end_transaction,
but it's likely that the error message is still not yet throwed out and only after
wait_for_commit() we're sure whether the transaction is committed successfully.

This add the necessary check and it now passes the test.
Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
Signed-off-by: NChris Mason <clm@fb.com>

b4924a0f

btrfs: fix sizeof format specifier in btrfs_check_super_valid() · d2207129

由 Fabian Frederick 提交于 2月 14, 2015

This patch fixes mips compilation warning:

fs/btrfs/disk-io.c: In function 'btrfs_check_super_valid':
fs/btrfs/disk-io.c:3927:21: warning: format '%lu' expects argument
of type 'long unsigned int', but argument 3 has type 'unsigned int' [-Wformat]
Signed-off-by: NFabian Frederick <fabf@skynet.be>
Acked-by: NGeert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: NChris Mason <clm@fb.com>

d2207129

13 3月, 2015 3 次提交

fanotify: fix event filtering with FAN_ONDIR set · b3c1030d

由 Suzuki K. Poulose 提交于 3月 12, 2015

With FAN_ONDIR set, the user can end up getting events, which it hasn't
marked.  This was revealed with fanotify04 testcase failure on
Linux-4.0-rc1, and is a regression from 3.19, revealed with 66ba93c0
("fanotify: don't set FAN_ONDIR implicitly on a marks ignored mask").

   # /opt/ltp/testcases/bin/fanotify04
   [ ... ]
  fanotify04    7  TPASS  :  event generated properly for type 100000
  fanotify04    8  TFAIL  :  fanotify04.c:147: got unexpected event 30
  fanotify04    9  TPASS  :  No event as expected

The testcase sets the adds the following marks : FAN_OPEN | FAN_ONDIR for
a fanotify on a dir.  Then does an open(), followed by close() of the
directory and expects to see an event FAN_OPEN(0x20).  However, the
fanotify returns (FAN_OPEN|FAN_CLOSE_NOWRITE(0x10)).  This happens due to
the flaw in the check for event_mask in fanotify_should_send_event() which
does:

	if (event_mask & marks_mask & ~marks_ignored_mask)
		return true;

where, event_mask == (FAN_ONDIR | FAN_CLOSE_NOWRITE),
       marks_mask == (FAN_ONDIR | FAN_OPEN),
       marks_ignored_mask == 0

Fix this by masking the outgoing events to the user, as we already take
care of FAN_ONDIR and FAN_EVENT_ON_CHILD.
Signed-off-by: NSuzuki K. Poulose <suzuki.poulose@arm.com>
Tested-by: NLino Sanfilippo <LinoSanfilippo@gmx.de>
Reviewed-by: NJan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b3c1030d

nilfs2: fix deadlock of segment constructor during recovery · 283ee148

由 Ryusuke Konishi 提交于 3月 12, 2015

According to a report from Yuxuan Shui, nilfs2 in kernel 3.19 got stuck
during recovery at mount time.  The code path that caused the deadlock was
as follows:

  nilfs_fill_super()
    load_nilfs()
      nilfs_salvage_orphan_logs()
        * Do roll-forwarding, attach segment constructor for recovery,
          and kick it.

        nilfs_segctor_thread()
          nilfs_segctor_thread_construct()
           * A lock is held with nilfs_transaction_lock()
             nilfs_segctor_do_construct()
               nilfs_segctor_drop_written_files()
                 iput()
                   iput_final()
                     write_inode_now()
                       writeback_single_inode()
                         __writeback_single_inode()
                           do_writepages()
                             nilfs_writepage()
                               nilfs_construct_dsync_segment()
                                 nilfs_transaction_lock() --> deadlock

This can happen if commit 7ef3ff2f ("nilfs2: fix deadlock of segment
constructor over I_SYNC flag") is applied and roll-forward recovery was
performed at mount time.  The roll-forward recovery can happen if datasync
write is done and the file system crashes immediately after that.  For
instance, we can reproduce the issue with the following steps:

 < nilfs2 is mounted on /nilfs (device: /dev/sdb1) >
 # dd if=/dev/zero of=/nilfs/test bs=4k count=1 && sync
 # dd if=/dev/zero of=/nilfs/test conv=notrunc oflag=dsync bs=4k
 count=1 && reboot -nfh
 < the system will immediately reboot >
 # mount -t nilfs2 /dev/sdb1 /nilfs

The deadlock occurs because iput() can run segment constructor through
writeback_single_inode() if MS_ACTIVE flag is not set on sb->s_flags.  The
above commit changed segment constructor so that it calls iput()
asynchronously for inodes with i_nlink == 0, but that change was
imperfect.

This fixes the another deadlock by deferring iput() in segment constructor
even for the case that mount is not finished, that is, for the case that
MS_ACTIVE flag is not set.
Signed-off-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reported-by: NYuxuan Shui <yshuiv7@gmail.com>
Tested-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

283ee148

ocfs2: make append_dio an incompat feature · 18d585f0

由 Mark Fasheh 提交于 3月 12, 2015

It turns out that making this feature ro_compat isn't quite enough to
prevent accidental corruption on mount from older kernels.  Ocfs2 (like
other file systems) will process orphaned inodes even when the user mounts
in 'ro' mode.  So for the case of a filesystem not knowing the append_dio
feature, mounting the filesystem could result in orphaned-for-dio files
being deleted, which we clearly don't want.

So instead, turn this into an incompat flag.

Btw, this is kind of my fault - initially I asked that we add a flag to
cover the feature and even suggested that we use an ro flag.  It wasn't
until I was looking through our commits for v4.0-rc1 that I realized we
actually want this to be incompat.
Signed-off-by: NMark Fasheh <mfasheh@suse.de>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

18d585f0

06 3月, 2015 3 次提交

Btrfs:__add_inode_ref: out of bounds memory read when looking for extended ref. · dd9ef135

由 Quentin Casasnovas 提交于 3月 03, 2015

Improper arithmetics when calculting the address of the extended ref could
lead to an out of bounds memory read and kernel panic.
Signed-off-by: NQuentin Casasnovas <quentin.casasnovas@oracle.com>
Reviewed-by: NDavid Sterba <dsterba@suse.cz>
cc: stable@vger.kernel.org # v3.7+
Signed-off-by: NChris Mason <clm@fb.com>

dd9ef135

Btrfs: fix data loss in the fast fsync path · 3a8b36f3

由 Filipe Manana 提交于 3月 01, 2015

When using the fast file fsync code path we can miss the fact that new
writes happened since the last file fsync and therefore return without
waiting for the IO to finish and write the new extents to the fsync log.

Here's an example scenario where the fsync will miss the fact that new
file data exists that wasn't yet durably persisted:

1. fs_info->last_trans_committed == N - 1 and current transaction is
   transaction N (fs_info->generation == N);

2. do a buffered write;

3. fsync our inode, this clears our inode's full sync flag, starts
   an ordered extent and waits for it to complete - when it completes
   at btrfs_finish_ordered_io(), the inode's last_trans is set to the
   value N (via btrfs_update_inode_fallback -> btrfs_update_inode ->
   btrfs_set_inode_last_trans);

4. transaction N is committed, so fs_info->last_trans_committed is now
   set to the value N and fs_info->generation remains with the value N;

5. do another buffered write, when this happens btrfs_file_write_iter
   sets our inode's last_trans to the value N + 1 (that is
   fs_info->generation + 1 == N + 1);

6. transaction N + 1 is started and fs_info->generation now has the
   value N + 1;

7. transaction N + 1 is committed, so fs_info->last_trans_committed
   is set to the value N + 1;

8. fsync our inode - because it doesn't have the full sync flag set,
   we only start the ordered extent, we don't wait for it to complete
   (only in a later phase) therefore its last_trans field has the
   value N + 1 set previously by btrfs_file_write_iter(), and so we
   have:

       inode->last_trans <= fs_info->last_trans_committed
           (N + 1)              (N + 1)

   Which made us not log the last buffered write and exit the fsync
   handler immediately, returning success (0) to user space and resulting
   in data loss after a crash.

This can actually be triggered deterministically and the following excerpt
from a testcase I made for xfstests triggers the issue. It moves a dummy
file across directories and then fsyncs the old parent directory - this
is just to trigger a transaction commit, so moving files around isn't
directly related to the issue but it was chosen because running 'sync' for
example does more than just committing the current transaction, as it
flushes/waits for all file data to be persisted. The issue can also happen
at random periods, since the transaction kthread periodicaly commits the
current transaction (about every 30 seconds by default).
The body of the test is:

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our main test file 'foo', the one we check for data loss.
  # By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync'
  # bit from its flags (btrfs inode specific flags).
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \
                  -c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io

  # Now create one other file and 2 directories. We will move this second file
  # from one directory to the other later because it forces btrfs to commit its
  # currently open transaction if we fsync the old parent directory. This is
  # necessary to trigger the data loss bug that affected btrfs.
  mkdir $SCRATCH_MNT/testdir_1
  touch $SCRATCH_MNT/testdir_1/bar
  mkdir $SCRATCH_MNT/testdir_2

  # Make sure everything is durably persisted.
  sync

  # Write more 8Kb of data to our file.
  $XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io

  # Move our 'bar' file into a new directory.
  mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar

  # Fsync our first directory. Because it had a file moved into some other
  # directory, this made btrfs commit the currently open transaction. This is
  # a condition necessary to trigger the data loss bug.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1

  # Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of
  # data we wrote previously to be persisted and available if a crash happens.
  # This did not happen with btrfs, because of the transaction commit that
  # happened when we fsynced the parent directory.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # Now check that all data we wrote before are available.
  echo "File content after log replay:"
  od -t x1 $SCRATCH_MNT/foo

  status=0
  exit

The expected golden output for the test, which is what we get with this
fix applied (or when running against ext3/4 and xfs), is:

  wrote 8192/8192 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote 8192/8192 bytes at offset 8192
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File content after log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
  *
  0040000

Without this fix applied, the output shows the test file does not have
the second 8Kb extent that we successfully fsynced:

  wrote 8192/8192 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote 8192/8192 bytes at offset 8192
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File content after log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0020000

So fix this by skipping the fsync only if we're doing a full sync and
if the inode's last_trans is <= fs_info->last_trans_committed, or if
the inode is already in the log. Also remove setting the inode's
last_trans in btrfs_file_write_iter since it's useless/unreliable.

Also because btrfs_file_write_iter no longer sets inode->last_trans to
fs_info->generation + 1, don't set last_trans to 0 if we bail out and don't
bail out if last_trans is 0, otherwise something as simple as the following
example wouldn't log the second write on the last fsync:

  1. write to file

  2. fsync file

  3. fsync file
       |--> btrfs_inode_in_log() returns true and it set last_trans to 0

  4. write to file
       |--> btrfs_file_write_iter() no longers sets last_trans, so it
            remained with a value of 0
  5. fsync
       |--> inode->last_trans == 0, so it bails out without logging the
            second write

A test case for xfstests will be sent soon.

CC: <stable@vger.kernel.org>
Signed-off-by: NFilipe Manana <fdmanana@suse.com>
Signed-off-by: NChris Mason <clm@fb.com>

3a8b36f3

Btrfs: remove extra run_delayed_refs in update_cowonly_root · f5c0a122

由 Josef Bacik 提交于 3月 02, 2015

This got added with my dirty_bgs patch, it's not needed.  Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

f5c0a122

05 3月, 2015 1 次提交

locks: fix fasync_struct memory leak in lease upgrade/downgrade handling · 0164bf02

由 Jeff Layton 提交于 3月 04, 2015

Commit 8634b51f (locks: convert lease handling to file_lock_context)
introduced a regression in the handling of lease upgrade/downgrades.

In the event that we already have a lease on a file and are going to
either upgrade or downgrade it, we skip doing any list insertion or
deletion and simply re-call lm_setup on the existing lease.

As of commit 8634b51f however, we end up calling lm_setup on the
lease that was passed in, instead of on the existing lease. This causes
us to leak the fasync_struct that was allocated in the event that there
was not already an existing one (as it always appeared that there
wasn't one).

Fixes: 8634b51f (locks: convert lease handling to file_lock_context)
Reported-and-Tested-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: NJeff Layton <jeff.layton@primarydata.com>

0164bf02

04 3月, 2015 4 次提交

NFSv4.1: Clear the old state by our client id before establishing a new lease · e11259f9

由 Trond Myklebust 提交于 3月 03, 2015

If the call to exchange-id returns with the EXCHGID4_FLAG_CONFIRMED_R flag
set, then that means our lease was established by a previous mount instance.
Ensure that we detect this situation, and that we clear the state held by
that mount.
Reported-by: NJorge Mora <Jorge.Mora@netapp.com>
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

e11259f9

NFSv4: Fix a race in NFSv4.1 server trunking discovery · 48d66b97

由 Trond Myklebust 提交于 3月 03, 2015

We do not want to allow a race with another NFS mount to cause
nfs41_walk_client_list() to establish a lease on our nfs_client before
we're done checking for trunking.
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

48d66b97

NFS: Don't write enable new pages while an invalidation is proceeding · ef070dcb

由 Trond Myklebust 提交于 3月 03, 2015

nfs_vm_page_mkwrite() should wait until the page cache invalidation
is finished. This is the second patch in a 2 patch series to deprecate
the NFS client's reliance on nfs_release_page() in the context of
nfs_invalidate_mapping().
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

ef070dcb

NFS: Fix a regression in the read() syscall · 874f9463

由 Trond Myklebust 提交于 3月 02, 2015

When invalidating the page cache for a regular file, we want to first
sync all dirty data to disk and then call invalidate_inode_pages2().
The latter relies on nfs_launder_page() and nfs_release_page() to deal
respectively with dirty pages, and unstable written pages.

When commit 95905446 ("NFS: avoid deadlocks with loop-back mounted
NFS filesystems.") changed the behaviour of nfs_release_page(), then it
made it possible for invalidate_inode_pages2() to fail with an EBUSY.
Unfortunately, that error is then propagated back to read().

Let's therefore work around the problem for now by protecting the call
to sync the data and invalidate_inode_pages2() so that they are atomic
w.r.t. the addition of new writes.
Later on, we can revisit whether or not we still need nfs_launder_page()
and nfs_release_page().
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

874f9463

03 3月, 2015 7 次提交

eCryptfs: don't pass fs-specific ioctl commands through · 6d65261a

由 Tyler Hicks 提交于 2月 24, 2015

eCryptfs can't be aware of what to expect when after passing an
arbitrary ioctl command through to the lower filesystem. The ioctl
command may trigger an action in the lower filesystem that is
incompatible with eCryptfs.

One specific example is when one attempts to use the Btrfs clone
ioctl command when the source file is in the Btrfs filesystem that
eCryptfs is mounted on top of and the destination fd is from a new file
created in the eCryptfs mount. The ioctl syscall incorrectly returns
success because the command is passed down to Btrfs which thinks that it
was able to do the clone operation. However, the result is an empty
eCryptfs file.

This patch allows the trim, {g,s}etflags, and {g,s}etversion ioctl
commands through and then copies up the inode metadata from the lower
inode to the eCryptfs inode to catch any changes made to the lower
inode's metadata. Those five ioctl commands are mostly common across all
filesystems but the whitelist may need to be further pruned in the
future.

https://bugzilla.kernel.org/show_bug.cgi?id=93691
https://launchpad.net/bugs/1305335Signed-off-by: NTyler Hicks <tyhicks@canonical.com>
Cc: Rocko <rockorequin@hotmail.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: stable@vger.kernel.org # v2.6.36+: c43f7b8f eCryptfs: Handle ioctl calls with unlocked and compat functions

6d65261a

NFSv4: Ensure we skip delegations that are already being returned · ec3ca4e5

由 Trond Myklebust 提交于 2月 26, 2015

In nfs_client_return_marked_delegations() and nfs_delegation_reap_unclaimed()
we want to optimise the loop traversal by skipping delegations that are
already in the process of being returned.
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

ec3ca4e5

NFSv4: Pin the superblock while we're returning the delegation · 9f0f8e12

由 Trond Myklebust 提交于 2月 26, 2015

This patch ensures that the superblock doesn't go ahead and disappear
underneath us while the state manager thread is returning delegations.
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

9f0f8e12

NFSv4: Ensure we honour NFS_DELEGATION_RETURNING in nfs_inode_set_delegation() · ade04647

由 Trond Myklebust 提交于 2月 27, 2015

Ensure that nfs_inode_set_delegation() doesn't inadvertently detach a
delegation that is already in the process of being returned.
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

ade04647

T
NFSv4: Ensure that we don't reap a delegation that is being returned · b04b22f4
由 Trond Myklebust 提交于 2月 26, 2015
```
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
```
b04b22f4

NFS: Fix stateid used for NFS v4 closes · 369d6b7f

由 Anna Schumaker 提交于 3月 02, 2015

After 566fcec6 the client uses the "current stateid" from the
nfs4_state structure to close a file.  This could potentially contain a
delegation stateid, which is disallowed by the protocol and causes
servers to return NFS4ERR_BAD_STATEID.  This patch restores the
(correct) behavior of sending the open stateid to close a file.
Reported-by: NOlga Kornievskaia <kolga@netapp.com>
Fixes: 566fcec6 (NFSv4: Fix an atomicity problem in CLOSE)
Signed-off-by: NAnna Schumaker <Anna.Schumaker@netapp.com>
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

369d6b7f

Btrfs: incremental send, don't rename a directory too soon · 84471e24

由 Filipe Manana 提交于 2月 28, 2015

There's one more case where we can't issue a rename operation for a
directory as soon as we process it. We used to delay directory renames
only if they have some ancestor directory with a higher inode number
that got renamed too, but there's another case where we need to delay
the rename too - when a directory A is renamed to the old name of a
directory B but that directory B has its rename delayed because it
has now (in the send root) an ancestor with a higher inode number that
was renamed. If we don't delay the directory rename in this case, the
receiving end of the send stream will attempt to rename A to the old
name of B before B got renamed to its new name, which results in a
"directory not empty" error. So fix this by delaying directory renames
for this case too.

Steps to reproduce:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ mkdir /mnt/a
  $ mkdir /mnt/b
  $ mkdir /mnt/c
  $ touch /mnt/a/file

  $ btrfs subvolume snapshot -r /mnt /mnt/snap1

  $ mv /mnt/c /mnt/x
  $ mv /mnt/a /mnt/x/y
  $ mv /mnt/b /mnt/a

  $ btrfs subvolume snapshot -r /mnt /mnt/snap2

  $ btrfs send /mnt/snap1 -f /tmp/1.send
  $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.send

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt2
  $ btrfs receive /mnt2 -f /tmp/1.send
  $ btrfs receive /mnt2 -f /tmp/2.send
  ERROR: rename b -> a failed. Directory not empty

A test case for xfstests follows soon.
Reported-by: NAmes Cornish <ames@cornishes.net>
Signed-off-by: NFilipe Manana <fdmanana@suse.com>
Signed-off-by: NChris Mason <clm@fb.com>

84471e24