提交 · 20e5506baf3fd651e245bc970d8c11a734ee1b8a · openeuler / Kernel

07 1月, 2016 11 次提交

D
btrfs: constify remaining structs with function pointers · 20e5506b
由 David Sterba 提交于 11月 19, 2015
```
* struct extent_io_ops
* struct btrfs_free_space_op
Signed-off-by: NDavid Sterba <dsterba@suse.com>
```
20e5506b

btrfs tests: replace whole ops structure for free space tests · 28f0779a

由 David Sterba 提交于 11月 19, 2015

Preparatory work for making btrfs_free_space_op constant. In
test_steal_space_from_bitmap_to_extent, we substitute use_bitmap with
own version thus preventing constification. We can rework it so we
replace the whole structure with the correct function pointers.
Signed-off-by: NDavid Sterba <dsterba@suse.com>

28f0779a

btrfs: don't use slab cache for struct btrfs_delalloc_work · 100d5702

由 David Sterba 提交于 12月 08, 2015

Although we prefer to use separate caches for various structs, it seems
better not to do that for struct btrfs_delalloc_work. Objects of this
type are allocated rarely, when transaction commit calls
btrfs_start_delalloc_roots, requesting delayed iputs.

The objects are temporary (with some IO involved) but still allocated
and freed within __start_delalloc_inodes. Memory allocation failure is
handled.

The slab cache is empty most of the time (observed on several systems),
so if we need to allocate a new slab object, the first one has to
allocate a full page. In a potential case of low memory conditions this
might fail with higher probability compared to using the generic slab
caches.
Signed-off-by: NDavid Sterba <dsterba@suse.com>

100d5702

btrfs: drop duplicate prefix from scrub workqueues · 0de270fa

由 David Sterba 提交于 12月 01, 2015

The helper btrfs_alloc_workqueue will add the "btrfs-" prefix.
Signed-off-by: NDavid Sterba <dsterba@suse.com>

0de270fa

D
btrfs: verbose error when we find an unexpected item in sys_array · 93a3d467
由 David Sterba 提交于 11月 30, 2015
```
Signed-off-by: NDavid Sterba <dsterba@suse.com>
```
93a3d467

btrfs: handle invalid num_stripes in sys_array · f5cdedd7

由 David Sterba 提交于 11月 30, 2015

We can handle the special case of num_stripes == 0 directly inside
btrfs_read_sys_array. The BUG_ON in btrfs_chunk_item_size is there to
catch other unhandled cases where we fail to validate external data.

A crafted or corrupted image crashes at mount time:

BTRFS: device fsid 9006933e-2a9a-44f0-917f-514252aeec2c devid 1 transid 7 /dev/loop0
BTRFS info (device loop0): disk space caching is enabled
BUG: failure at fs/btrfs/ctree.h:337/btrfs_chunk_item_size()!
Kernel panic - not syncing: BUG!
CPU: 0 PID: 313 Comm: mount Not tainted 4.2.5-00657-ge047887-dirty #25
Stack:
 637af890 60062489 602aeb2e 604192ba
 60387961 00000011 637af8a0 6038a835
 637af9c0 6038776b 634ef32b 00000000
Call Trace:
 [<6001c86d>] show_stack+0xfe/0x15b
 [<6038a835>] dump_stack+0x2a/0x2c
 [<6038776b>] panic+0x13e/0x2b3
 [<6020f099>] btrfs_read_sys_array+0x25d/0x2ff
 [<601cfbbe>] open_ctree+0x192d/0x27af
 [<6019c2c1>] btrfs_mount+0x8f5/0xb9a
 [<600bc9a7>] mount_fs+0x11/0xf3
 [<600d5167>] vfs_kern_mount+0x75/0x11a
 [<6019bcb0>] btrfs_mount+0x2e4/0xb9a
 [<600bc9a7>] mount_fs+0x11/0xf3
 [<600d5167>] vfs_kern_mount+0x75/0x11a
 [<600d710b>] do_mount+0xa35/0xbc9
 [<600d7557>] SyS_mount+0x95/0xc8
 [<6001e884>] handle_syscall+0x6b/0x8e
Reported-by: NJiri Slaby <jslaby@suse.com>
Reported-by: NVegard Nossum <vegard.nossum@oracle.com>
CC: stable@vger.kernel.org	# 3.19+
Signed-off-by: NDavid Sterba <dsterba@suse.com>

f5cdedd7

btrfs: better packing of btrfs_delayed_extent_op · 35b3ad50

由 David Sterba 提交于 11月 30, 2015

btrfs_delayed_extent_op can be packed in a better way, it's 40 bytes now
and has 8 unused bytes. Reducing the level type to u8 makes it possible
to squeeze it to the padding byte after key. The bitfields were switched
to bool as there's space to store the full byte without increasing the
whole structure, besides that the generated assembly is smaller.

struct btrfs_delayed_extent_op {
	struct btrfs_disk_key      key;                  /*     0    17 */
	u8                         level;                /*    17     1 */
	bool                       update_key;           /*    18     1 */
	bool                       update_flags;         /*    19     1 */
	bool                       is_data;              /*    20     1 */

	/* XXX 3 bytes hole, try to pack */

	u64                        flags_to_set;         /*    24     8 */

	/* size: 32, cachelines: 1, members: 6 */
	/* sum members: 29, holes: 1, sum holes: 3 */
	/* last cacheline: 32 bytes */
};

The final size is 32 bytes which gives +26 object per slab page.

   text	   data	    bss	    dec	    hex	filename
 938811	  43670	  23144	1005625	  f5839	fs/btrfs/btrfs.ko.before
 938747	  43670	  23144	1005561	  f57f9	fs/btrfs/btrfs.ko.after
Signed-off-by: NDavid Sterba <dsterba@suse.com>

35b3ad50

btrfs: put delayed item hook into inode · 8089fe62

由 David Sterba 提交于 11月 19, 2015

Inodes for delayed iput allocate a trivial helper structure, let's place
the list hook directly into the inode and save a kmalloc (killing a
__GFP_NOFAIL as a bonus) at the cost of increasing size of btrfs_inode.

The inode can be put into the delayed_iputs list more than once and we
have to keep the count. This means we can't use the list_splice to
process a bunch of inodes because we'd lost track of the count if the
inode is put into the delayed iputs again while it's processed.
Signed-off-by: NDavid Sterba <dsterba@suse.com>

8089fe62

btrfs: Support convert to -d dup for btrfs-convert · c5ca8781

由 Zhao Lei 提交于 11月 19, 2015

Since we will add support for -d dup for non-mixed filesystem,
kernel need to support converting to this raid-type.

This patch remove limitation of above case.

Tested by following script:
(combination of dup conversion with fsck):

export TEST_DEV='/dev/vdc'
export TEST_DIR='/var/ltf/tester/mnt'

do_dup_test()
{
    local m_from="$1"
    local d_from="$2"
    local m_to="$3"
    local d_to="$4"

    echo "Convert from -m $m_from -d $d_from to -m $m_to -d $d_to"

    umount "$TEST_DIR" &>/dev/null
    ./mkfs.btrfs -f -m "$m_from" -d "$d_from" "$TEST_DEV" >/dev/null || return 1
    mount "$TEST_DEV" "$TEST_DIR" || return 1

    cp -a /sbin/* "$TEST_DIR"

    [[ "$m_from" != "$m_to" ]] && {
        ./btrfs balance start -f -mconvert="$m_to" "$TEST_DIR" || return 1
    }

    [[ "$d_from" != "$d_to" ]] && {
	local opt=()
	[[ "$d_to" == single ]] && opt+=("-f")
        ./btrfs balance start "${opt[@]}" -dconvert="$d_to" "$TEST_DIR" || return 1
    }

    umount "$TEST_DIR" || return 1
    ./btrfsck "$TEST_DEV" || return 1
    echo

    return 0
}

test_all()
{
    for m_from in single dup; do
    for d_from in single dup; do
    for m_to in single dup; do
    for d_to in single dup; do
    do_dup_test "$m_from" "$d_from" "$m_to" "$d_to" || return 1
    done
    done
    done
    done
}

test_all
Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

c5ca8781

Btrfs: igrab inode in writepage · be7bd730

由 Josef Bacik 提交于 10月 22, 2015

We hit this panic on a few of our boxes this week where we have an
ordered_extent with an NULL inode. We do an igrab() of the inode in writepages,
but weren't doing it in writepage which can be called directly from the VM on
dirty pages. If the inode has been unlinked then we could have I_FREEING set
which means igrab() would return NULL and we get this panic. Fix this by trying
to igrab in btrfs_writepage, and if it returns NULL then just redirty the page
and return AOP_WRITEPAGE_ACTIVATE; so the VM knows it wasn't successful. Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

be7bd730

Btrfs: add missing brelse when superblock checksum fails · b2acdddf

由 Anand Jain 提交于 10月 07, 2015

Looks like oversight, call brelse() when checksum fails. Further down the
code, in the non error path, we do call brelse() and so we don't see
brelse() in the goto error paths.
Signed-off-by: NAnand Jain <anand.jain@oracle.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

b2acdddf

19 12月, 2015 1 次提交

proc: fix -ESRCH error when writing to /proc/$pid/coredump_filter · 41a0c249

由 Colin Ian King 提交于 12月 18, 2015

Writing to /proc/$pid/coredump_filter always returns -ESRCH because commit
774636e1 ("proc: convert to kstrto*()/kstrto*_from_user()") removed
the setting of ret after the get_proc_task call and incorrectly left it as
-ESRCH.  Instead, return 0 when successful.

Example breakage:

  echo 0 > /proc/self/coredump_filter
  bash: echo: write error: No such process

Fixes: 774636e1 ("proc: convert to kstrto*()/kstrto*_from_user()")
Signed-off-by: NColin Ian King <colin.king@canonical.com>
Acked-by: NKees Cook <keescook@chromium.org>
Cc: <stable@vger.kernel.org> [4.3+]
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

41a0c249

16 12月, 2015 2 次提交

Btrfs: check prepare_uptodate_page() error code earlier · bb1591b4

由 Chris Mason 提交于 12月 14, 2015

prepare_pages() may end up calling prepare_uptodate_page() twice if our
write only spans a single page.  But if the first call returns an error,
our page will be unlocked and its not safe to call it again.

This bug goes all the way back to 2011, and it's not something commonly
hit.

While we're here, add a more explicit check for the page being truncated
away.  The bare lock_page() alone is protected only by good thoughts and
i_mutex, which we're sure to regret eventually.
Reported-by: NDave Jones <dsj@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

bb1591b4

Btrfs: check for empty bitmap list in setup_cluster_bitmaps · 1b9b922a

由 Chris Mason 提交于 12月 15, 2015

Dave Jones found a warning from kasan in setup_cluster_bitmaps()

==================================================================
BUG: KASAN: stack-out-of-bounds in setup_cluster_bitmap+0xc4/0x5a0 at
addr ffff88039bef6828
Read of size 8 by task nfsd/1009
page:ffffea000e6fbd80 count:0 mapcount:0 mapping:          (null)
index:0x0
flags: 0x8000000000000000()
page dumped because: kasan: bad access detected
CPU: 1 PID: 1009 Comm: nfsd Tainted: G        W
4.4.0-rc3-backup-debug+ #1
 ffff880065647b50 000000006bb712c2 ffff88039bef6640 ffffffffa680a43e
 0000004559c00000 ffff88039bef66c8 ffffffffa62638d1 ffffffffa61121c0
 ffff8803a5769de8 0000000000000296 ffff8803a5769df0 0000000000046280
Call Trace:
 [<ffffffffa680a43e>] dump_stack+0x4b/0x6d
 [<ffffffffa62638d1>] kasan_report_error+0x501/0x520
 [<ffffffffa61121c0>] ? debug_show_all_locks+0x1e0/0x1e0
 [<ffffffffa6263948>] kasan_report+0x58/0x60
 [<ffffffffa6814b00>] ? rb_last+0x10/0x40
 [<ffffffffa66f8af4>] ? setup_cluster_bitmap+0xc4/0x5a0
 [<ffffffffa6262ead>] __asan_load8+0x5d/0x70
 [<ffffffffa66f8af4>] setup_cluster_bitmap+0xc4/0x5a0
 [<ffffffffa66f675a>] ? setup_cluster_no_bitmap+0x6a/0x400
 [<ffffffffa66fcd16>] btrfs_find_space_cluster+0x4b6/0x640
 [<ffffffffa66fc860>] ? btrfs_alloc_from_cluster+0x4e0/0x4e0
 [<ffffffffa66fc36e>] ? btrfs_return_cluster_to_free_space+0x9e/0xb0
 [<ffffffffa702dc37>] ? _raw_spin_unlock+0x27/0x40
 [<ffffffffa666a1a1>] find_free_extent+0xba1/0x1520

Andrey noticed this was because we were doing list_first_entry on a list
that might be empty.  Rework the tests a bit so we don't do that.
Signed-off-by: NChris Mason <clm@fb.com>
Reprorted-by: NAndrey Ryabinin <ryabinin.a.a@gmail.com>
Reported-by: NDave Jones <dsj@fb.com>

1b9b922a

14 12月, 2015 1 次提交

sched/wait: Fix the signal handling fix · dfd01f02

由 Peter Zijlstra 提交于 12月 13, 2015

Jan Stancek reported that I wrecked things for him by fixing things for
Vladimir :/

His report was due to an UNINTERRUPTIBLE wait getting -EINTR, which
should not be possible, however my previous patch made this possible by
unconditionally checking signal_pending().

We cannot use current->state as was done previously, because the
instruction after the store to that variable it can be changed.  We must
instead pass the initial state along and use that.

Fixes: 68985633 ("sched/wait: Fix signal handling in bit wait helpers")
Reported-by: NJan Stancek <jstancek@redhat.com>
Reported-by: NChris Mason <clm@fb.com>
Tested-by: NJan Stancek <jstancek@redhat.com>
Tested-by: NVladimir Murzin <vladimir.murzin@arm.com>
Tested-by: NChris Mason <clm@fb.com>
Reviewed-by: NPaul Turner <pjt@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: tglx@linutronix.de
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: hpa@zytor.com
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

dfd01f02

13 12月, 2015 2 次提交

ocfs2: fix SGID not inherited issue · 854ee2e9

由 Junxiao Bi 提交于 12月 11, 2015

Commit 8f1eb487 ("ocfs2: fix umask ignored issue") introduced an
issue, SGID of sub dir was not inherited from its parents dir.  It is
because SGID is set into "inode->i_mode" in ocfs2_get_init_inode(), but
is overwritten by "mode" which don't have SGID set later.

Fixes: 8f1eb487 ("ocfs2: fix umask ignored issue")
Signed-off-by: NJunxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Acked-by: NSrinivas Eeda <srinivas.eeda@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

854ee2e9

osd fs: __r4w_get_page rely on PageUptodate for uptodate · 3066a967

由 Hugh Dickins 提交于 12月 11, 2015

Commit 42cb14b1 ("mm: migrate dirty page without
clear_page_dirty_for_io etc") simplified the migration of a PageDirty
pagecache page: one stat needs moving from zone to zone and that's about
all.

It's convenient and safest for it to shift the PageDirty bit from old
page to new, just before updating the zone stats: before copying data
and marking the new PageUptodate.  This is all done while both pages are
isolated and locked, just as before; and just as before, there's a
moment when the new page is visible in the radix_tree, but not yet
PageUptodate.  What's new is that it may now be briefly visible as
PageDirty before it is PageUptodate.

When I scoured the tree to see if this could cause a problem anywhere,
the only places I found were in two similar functions __r4w_get_page():
which look up a page with find_get_page() (not using page lock), then
claim it's uptodate if it's PageDirty or PageWriteback or PageUptodate.

I'm not sure whether that was right before, but now it might be wrong
(on rare occasions): only claim the page is uptodate if PageUptodate.
Or perhaps the page in question could never be migratable anyway?
Signed-off-by: NHugh Dickins <hughd@google.com>
Tested-by: NBoaz Harrosh <ooo@electrozaur.com>
Cc: Benny Halevy <bhalevy@panasas.com>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3066a967

10 12月, 2015 3 次提交

btrfs: fix misleading warning when space cache failed to load · 94356889

由 Holger Hoffstätte 提交于 11月 27, 2015

When an inconsistent space cache is detected during loading we log a
warning that users frequently mistake as instruction to invalidate the
cache manually, even though this is not required. Fix the message to
indicate that the cache will be rebuilt automatically.
Signed-off-by: NHolger Hoffstätte <holger.hoffstaette@googlemail.com>
Acked-by: NFilipe Manana <fdmanana@suse.com>

94356889

Btrfs: fix transaction handle leak in balance · 8a7d656f

由 Filipe Manana 提交于 12月 10, 2015

If we fail to allocate a new data chunk, we were jumping to the error path
without release the transaction handle we got before. Fix this by always
releasing it before doing the jump.

Fixes: 2c9fe835 ("btrfs: Fix lost-data-profile caused by balance bg")
Signed-off-by: NFilipe Manana <fdmanana@suse.com>

8a7d656f

Btrfs: fix unprotected list move from unused_bgs to deleted_bgs list · 348a0013

由 Filipe Manana 提交于 11月 27, 2015

As of my previous change titled "Btrfs: fix scrub preventing unused block
groups from being deleted", the following warning at
extent-tree.c:btrfs_delete_unused_bgs() can be hit when we mount the a
filesysten with "-o discard":

 10263  void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 10264  {
 (...)
 10405                  if (trimming) {
 10406                          WARN_ON(!list_empty(&block_group->bg_list));
 10407                          spin_lock(&trans->transaction->deleted_bgs_lock);
 10408                          list_move(&block_group->bg_list,
 10409                                    &trans->transaction->deleted_bgs);
 10410                          spin_unlock(&trans->transaction->deleted_bgs_lock);
 10411                          btrfs_get_block_group(block_group);
 10412                  }
 (...)

This happens because scrub can now add back the block group to the list of
unused block groups (fs_info->unused_bgs). This is dangerous because we
are moving the block group from the unused block groups list to the list
of deleted block groups without holding the lock that protects the source
list (fs_info->unused_bgs_lock).

The following diagram illustrates how this happens:

            CPU 1                                     CPU 2

 cleaner_kthread()
   btrfs_delete_unused_bgs()

     sees bg X in list
      fs_info->unused_bgs

     deletes bg X from list
      fs_info->unused_bgs

                                            scrub_enumerate_chunks()

                                              searches device tree using
                                              its commit root

                                              finds device extent for
                                              block group X

                                              gets block group X from the tree
                                              fs_info->block_group_cache_tree
                                              (via btrfs_lookup_block_group())

                                              sets bg X to RO (again)

                                              scrub_chunk(bg X)

                                              sets bg X back to RW mode

                                              adds bg X to the list
                                              fs_info->unused_bgs again,
                                              since it's still unused and
                                              currently not in that list

     sets bg X to RO mode

     btrfs_remove_chunk(bg X)

     --> discard is enabled and bg X
         is in the fs_info->unused_bgs
         list again so the warning is
         triggered
     --> we move it from that list into
         the transaction's delete_bgs
         list, but we can have another
         task currently manipulating
         the first list (fs_info->unused_bgs)

Fix this by using the same lock (fs_info->unused_bgs_lock) to protect both
the list of unused block groups and the list of deleted block groups. This
makes it safe and there's not much worry for more lock contention, as this
lock is seldom used and only the cleaner kthread adds elements to the list
of deleted block groups. The warning goes away too, as this was previously
an impossible case (and would have been better a BUG_ON/ASSERT) but it's
not impossible anymore.
Reproduced with fstest btrfs/073 (using MOUNT_OPTIONS="-o discard").
Signed-off-by: NFilipe Manana <fdmanana@suse.com>

348a0013

09 12月, 2015 2 次提交

fix the regression from "direct-io: Fix negative return from dio read beyond eof" · 2d4594ac

由 Al Viro 提交于 12月 08, 2015

Sure, it's better to bail out of past-the-eof read and return 0 than return
a bogus negative value on such.  Only we'd better make sure we are bailing out
with 0 and not -ENOMEM...

Cc: stable@vger.kernel.org
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

2d4594ac

9p: ->evict_inode() should kick out ->i_data, not ->i_mapping · 4ad78628

由 Al Viro 提交于 12月 08, 2015

For block devices the pagecache is associated with the inode
on bdevfs, not with the aliasing ones on the mountable filesystems.
The latter have its own ->i_data empty and ->i_mapping pointing
to the (unique per major/minor) bdevfs inode.  That guarantees
cache coherence between all block device inodes with the same
device number.

Eviction of an alias inode has no business trying to evict the
pages belonging to bdevfs one; moreover, ->i_mapping is only
safe to access when the thing is opened.  At the time of
->evict_inode() the victim is definitely *not* opened.  We are
about to kill the address space embedded into struct inode
(inode->i_data) and that's what we need to empty of any pages.

9p instance tries to empty inode->i_mapping instead, which is
both unsafe and bogus - if we have several device nodes with
the same device number in different places, closing one of them
should not try to empty the (shared) page cache.

Fortunately, other instances in the tree are OK; they are
evicting from &inode->i_data instead, as 9p one should.

Cc: stable@vger.kernel.org # v2.6.32+, ones prior to 2.6.36 need only half of that
Reported-by: N"Suzuki K. Poulose" <Suzuki.Poulose@arm.com>
Tested-by: N"Suzuki K. Poulose" <Suzuki.Poulose@arm.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

4ad78628

08 12月, 2015 1 次提交

SUNRPC: Fix callback channel · 756b9b37

由 Trond Myklebust 提交于 12月 07, 2015

The NFSv4.1 callback channel is currently broken because the receive
message will keep shrinking because the backchannel receive buffer size
never gets reset.
The easiest solution to this problem is instead of changing the receive
buffer, to rather adjust the copied request.

Fixes: 38b7631f ("nfs4: limit callback decoding to received bytes")
Cc: Benjamin Coddington <bcodding@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

756b9b37

07 12月, 2015 3 次提交

Don't reset ->total_link_count on nested calls of vfs_path_lookup() · 2788cc47

由 Al Viro 提交于 12月 06, 2015

we already zero it on outermost set_nameidata(), so initialization in
path_init() is pointless and wrong.  The same DoS exists on pre-4.2
kernels, but there a slightly different fix will be needed.

Cc: stable@vger.kernel.org # v4.2
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

2788cc47

A
ovl: get rid of the dead code left from broken (and disabled) optimizations · 0f7ff2da
由 Al Viro 提交于 12月 06, 2015
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
0f7ff2da

ovl: fix permission checking for setattr · acff81ec

由 Miklos Szeredi 提交于 12月 04, 2015

[Al Viro] The bug is in being too enthusiastic about optimizing ->setattr()
away - instead of "copy verbatim with metadata" + "chmod/chown/utimes"
(with the former being always safe and the latter failing in case of
insufficient permissions) it tries to combine these two.  Note that copyup
itself will have to do ->setattr() anyway; _that_ is where the elevated
capabilities are right.  Having these two ->setattr() (one to set verbatim
copy of metadata, another to do what overlayfs ->setattr() had been asked
to do in the first place) combined is where it breaks.
Signed-off-by: NMiklos Szeredi <miklos@szeredi.hu>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

acff81ec

05 12月, 2015 2 次提交

block: detach bdev inode from its wb in __blkdev_put() · 43d1c0eb

由 Ilya Dryomov 提交于 11月 20, 2015

Since 52ebea74 ("writeback: make backing_dev_info host
cgroup-specific bdi_writebacks") inode, at some point in its lifetime,
gets attached to a wb (struct bdi_writeback).  Detaching happens on
evict, in inode_detach_wb() called from __destroy_inode(), and involves
updating wb.

However, detaching an internal bdev inode from its wb in
__destroy_inode() is too late.  Its bdi and by extension root wb are
embedded into struct request_queue, which has different lifetime rules
and can be freed long before the final bdput() is called (can be from
__fput() of a corresponding /dev inode, through dput() - evict() -
bd_forget().  bdevs hold onto the underlying disk/queue pair only while
opened; as soon as bdev is closed all bets are off.  In fact,
disk/queue can be gone before __blkdev_put() even returns:

1499 static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
1500 {
...
1518         if (bdev->bd_contains == bdev) {
1519                 if (disk->fops->release)
1520                         disk->fops->release(disk, mode);

[ Driver puts its references to disk/queue ]

1521         }
1522         if (!bdev->bd_openers) {
1523                 struct module *owner = disk->fops->owner;
1524
1525                 disk_put_part(bdev->bd_part);
1526                 bdev->bd_part = NULL;
1527                 bdev->bd_disk = NULL;
1528                 if (bdev != bdev->bd_contains)
1529                         victim = bdev->bd_contains;
1530                 bdev->bd_contains = NULL;
1531
1532                 put_disk(disk);

[ We put ours, the queue is gone
  The last bdput() would result in a write to invalid memory ]

1533                 module_put(owner);
...
1539 }

Since bdev inodes are special anyway, detach them in __blkdev_put()
after clearing inode's dirty bits, turning the problematic
inode_detach_wb() in __destroy_inode() into a noop.

add_disk() grabs its disk->queue since 523e1d39 ("block: make
gendisk hold a reference to its queue"), so the old ->release comment
is removed in favor of the new inode_detach_wb() comment.

Cc: stable@vger.kernel.org # 4.2+, needs backporting
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Acked-by: NTejun Heo <tj@kernel.org>
Tested-by: NRaghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

43d1c0eb

jbd2: fix null committed data return in undo_access · 087ffd4e

由 Junxiao Bi 提交于 12月 04, 2015

introduced jbd2_write_access_granted() to improve write|undo_access
speed, but missed to check the status of b_committed_data which caused
a kernel panic on ocfs2.

[ 6538.405938] ------------[ cut here ]------------
[ 6538.406686] kernel BUG at fs/ocfs2/suballoc.c:2400!
[ 6538.406686] invalid opcode: 0000 [#1] SMP
[ 6538.406686] Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront xen_netfront xen_fbfront parport_pc parport pcspkr i2c_piix4 acpi_cpufreq ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix cirrus ttm drm_kms_helper drm fb_sys_fops sysimgblt sysfillrect i2c_core syscopyarea dm_mirror dm_region_hash dm_log dm_mod
[ 6538.406686] CPU: 1 PID: 16265 Comm: mmap_truncate Not tainted 4.3.0 #1
[ 6538.406686] Hardware name: Xen HVM domU, BIOS 4.3.1OVM 05/14/2014
[ 6538.406686] task: ffff88007c2bab00 ti: ffff880075b78000 task.ti: ffff880075b78000
[ 6538.406686] RIP: 0010:[<ffffffffa06a286b>]  [<ffffffffa06a286b>] ocfs2_block_group_clear_bits+0x23b/0x250 [ocfs2]
[ 6538.406686] RSP: 0018:ffff880075b7b7f8  EFLAGS: 00010246
[ 6538.406686] RAX: ffff8800760c5b40 RBX: ffff88006c06a000 RCX: ffffffffa06e6df0
[ 6538.406686] RDX: 0000000000000000 RSI: ffff88007a6f6ea0 RDI: ffff88007a760430
[ 6538.406686] RBP: ffff880075b7b878 R08: 0000000000000002 R09: 0000000000000001
[ 6538.406686] R10: ffffffffa06769be R11: 0000000000000000 R12: 0000000000000001
[ 6538.406686] R13: ffffffffa06a1750 R14: 0000000000000001 R15: ffff88007a6f6ea0
[ 6538.406686] FS:  00007f17fde30720(0000) GS:ffff88007f040000(0000) knlGS:0000000000000000
[ 6538.406686] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6538.406686] CR2: 0000000000601730 CR3: 000000007aea0000 CR4: 00000000000406e0
[ 6538.406686] Stack:
[ 6538.406686]  ffff88007c2bb5b0 ffff880075b7b8e0 ffff88007a7604b0 ffff88006c640800
[ 6538.406686]  ffff88007a7604b0 ffff880075d77390 0000000075b7b878 ffffffffa06a309d
[ 6538.406686]  ffff880075d752d8 ffff880075b7b990 ffff880075b7b898 0000000000000000
[ 6538.406686] Call Trace:
[ 6538.406686]  [<ffffffffa06a309d>] ? ocfs2_read_group_descriptor+0x6d/0xa0 [ocfs2]
[ 6538.406686]  [<ffffffffa06a3654>] _ocfs2_free_suballoc_bits+0xe4/0x320 [ocfs2]
[ 6538.406686]  [<ffffffffa06a1750>] ? ocfs2_put_slot+0xf0/0xf0 [ocfs2]
[ 6538.406686]  [<ffffffffa06a397e>] _ocfs2_free_clusters+0xee/0x210 [ocfs2]
[ 6538.406686]  [<ffffffffa06a1750>] ? ocfs2_put_slot+0xf0/0xf0 [ocfs2]
[ 6538.406686]  [<ffffffffa06a1750>] ? ocfs2_put_slot+0xf0/0xf0 [ocfs2]
[ 6538.406686]  [<ffffffffa0682d50>] ? ocfs2_extend_trans+0x50/0x1a0 [ocfs2]
[ 6538.406686]  [<ffffffffa06a3ad5>] ocfs2_free_clusters+0x15/0x20 [ocfs2]
[ 6538.406686]  [<ffffffffa065072c>] ocfs2_replay_truncate_records+0xfc/0x290 [ocfs2]
[ 6538.406686]  [<ffffffffa06843ac>] ? ocfs2_start_trans+0xec/0x1d0 [ocfs2]
[ 6538.406686]  [<ffffffffa0654600>] __ocfs2_flush_truncate_log+0x140/0x2d0 [ocfs2]
[ 6538.406686]  [<ffffffffa0654394>] ? ocfs2_reserve_blocks_for_rec_trunc.clone.0+0x44/0x170 [ocfs2]
[ 6538.406686]  [<ffffffffa065acd4>] ocfs2_remove_btree_range+0x374/0x630 [ocfs2]
[ 6538.406686]  [<ffffffffa017486b>] ? jbd2_journal_stop+0x25b/0x470 [jbd2]
[ 6538.406686]  [<ffffffffa065d5b5>] ocfs2_commit_truncate+0x305/0x670 [ocfs2]
[ 6538.406686]  [<ffffffffa0683430>] ? ocfs2_journal_access_eb+0x20/0x20 [ocfs2]
[ 6538.406686]  [<ffffffffa067adb7>] ocfs2_truncate_file+0x297/0x380 [ocfs2]
[ 6538.406686]  [<ffffffffa01759e4>] ? jbd2_journal_begin_ordered_truncate+0x64/0xc0 [jbd2]
[ 6538.406686]  [<ffffffffa067c7a2>] ocfs2_setattr+0x572/0x860 [ocfs2]
[ 6538.406686]  [<ffffffff810e4a3f>] ? current_fs_time+0x3f/0x50
[ 6538.406686]  [<ffffffff812124b7>] notify_change+0x1d7/0x340
[ 6538.406686]  [<ffffffff8121abf9>] ? generic_getxattr+0x79/0x80
[ 6538.406686]  [<ffffffff811f5876>] do_truncate+0x66/0x90
[ 6538.406686]  [<ffffffff81120e30>] ? __audit_syscall_entry+0xb0/0x110
[ 6538.406686]  [<ffffffff811f5bb3>] do_sys_ftruncate.clone.0+0xf3/0x120
[ 6538.406686]  [<ffffffff811f5bee>] SyS_ftruncate+0xe/0x10
[ 6538.406686]  [<ffffffff816aa2ae>] entry_SYSCALL_64_fastpath+0x12/0x71
[ 6538.406686] Code: 28 48 81 ee b0 04 00 00 48 8b 92 50 fb ff ff 48 8b 80 b0 03 00 00 48 39 90 88 00 00 00 0f 84 30 fe ff ff 0f 0b eb fe 0f 0b eb fe <0f> 0b 0f 1f 00 eb fb 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00
[ 6538.406686] RIP  [<ffffffffa06a286b>] ocfs2_block_group_clear_bits+0x23b/0x250 [ocfs2]
[ 6538.406686]  RSP <ffff880075b7b7f8>
[ 6538.691128] ---[ end trace 31cd7011d6770d7e ]---
[ 6538.694492] Kernel panic - not syncing: Fatal exception
[ 6538.695484] Kernel Offset: disabled

Fixes: de92c8ca("jbd2: speedup jbd2_journal_get_[write|undo]_access()")
Cc: <stable@vger.kernel.org>
Signed-off-by: NJunxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

087ffd4e

02 12月, 2015 1 次提交

net: rename SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA · 9cd3e072

由 Eric Dumazet 提交于 11月 29, 2015

This patch is a cleanup to make following patch easier to
review.

Goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
from (struct socket)->flags to a (struct socket_wq)->flags
to benefit from RCU protection in sock_wake_async()

To ease backports, we rename both constants.

Two new helpers, sk_set_bit(int nr, struct sock *sk)
and sk_clear_bit(int net, struct sock *sk) are added so that
following patch can change their implementation.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9cd3e072

01 12月, 2015 1 次提交

direct-io: Fix negative return from dio read beyond eof · 74cedf9b

由 Jan Kara 提交于 11月 30, 2015

Assume a filesystem with 4KB blocks. When a file has size 1000 bytes and
we issue direct IO read at offset 1024, blockdev_direct_IO() reads the
tail of the last block and the logic for handling short DIO reads in
dio_complete() results in a return value -24 (1000 - 1024) which
obviously confuses userspace.

Fix the problem by bailing out early once we sample i_size and can
reliably check that direct IO read starts beyond i_size.
Reported-by: NAvi Kivity <avi@scylladb.com>
Fixes: 9fe55eea
CC: stable@vger.kernel.org
CC: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@fb.com>

74cedf9b

27 11月, 2015 3 次提交

ext4: add "static" to ext4_seq_##name##_fops struct · 681c46b1

由 Xu Cang 提交于 11月 26, 2015

to fix sparse warning, add static to ext4_seq_##name##_fops struct.
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

681c46b1

ext4: fix an endianness bug in ext4_encrypted_follow_link() · 5a1c7f47

由 Al Viro 提交于 11月 26, 2015

applying le32_to_cpu() to 16bit value is a bad idea...
    
Cc: stable@vger.kernel.org # v4.1+
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

5a1c7f47

ext4: fix an endianness bug in ext4_encrypted_zeroout() · e2c9e0b2

由 Al Viro 提交于 11月 26, 2015

ex->ee_block is not host-endian (note that accesses of other fields
of *ex right next to that line go through the helpers that do proper
conversion from little-endian to host-endian; it might make sense
to add similar for ->ee_block to avoid reintroducing that kind of
bugs...)

Cc: stable@vger.kernel.org # v4.1+
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

e2c9e0b2

26 11月, 2015 3 次提交

nfs4: resend LAYOUTGET when there is a race that changes the seqid · 4f2e9dce

由 Jeff Layton 提交于 11月 25, 2015

pnfs_layout_process will check the returned layout stateid against what
the kernel has in-core. If it turns out that the stateid we received is
older, then we should resend the LAYOUTGET instead of falling back to
MDS I/O.
Signed-off-by: NJeff Layton <jeff.layton@primarydata.com>
Cc: stable@vger.kernel.org # 3.18+
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

4f2e9dce

nfs: if we have no valid attrs, then don't declare the attribute cache valid · c812012f

由 Jeff Layton 提交于 11月 25, 2015

If we pass in an empty nfs_fattr struct to nfs_update_inode, it will
(correctly) not update any of the attributes, but it then clears the
NFS_INO_INVALID_ATTR flag, which indicates that the attributes are
up to date. Don't clear the flag if the fattr struct has no valid
attrs to apply.
Reviewed-by: NSteve French <steve.french@primarydata.com>
Signed-off-by: NJeff Layton <jeff.layton@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

c812012f

nfs: ensure that attrcache is revalidated after a SETATTR · 616c3196

由 Jeff Layton 提交于 11月 25, 2015

If we get no post-op attributes back from a SETATTR operation, then no
attributes will of course be updated during the call to
nfs_update_inode.

We know however that the attributes are invalid at that point, since we
just changed some of them. At the very least, the ctime will be bogus.
If we get no post-op attributes back on the call, mark the attrcache
invalid to reflect that fact.
Reviewed-by: NSteve French <steve.french@primarydata.com>
Signed-off-by: NJeff Layton <jeff.layton@primarydata.com>
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

616c3196

25 11月, 2015 4 次提交

btrfs: fix balance range usage filters in 4.4-rc · dba72cb3

由 Holger Hoffstätte 提交于 11月 17, 2015

There's a regression in 4.4-rc since commit bc309467
(btrfs: extend balance filter usage to take minimum and maximum) in that
existing (non-ranged) balance with -dusage=x no longer works; all chunks
are skipped.

After staring at the code for a while and wondering why a non-ranged
balance would even need min and max thresholds (..which then were not
set correctly, leading to the bug) I realized that the only problem
was the fact that the filter functions were named wrong, thanks to
patching copypasta. Simply renaming both functions lets the existing
btrfs-progs call balance with -dusage=x and now the non-ranged filter
function is invoked, properly using only a single chunk limit.
Signed-off-by: NHolger Hoffstätte <holger.hoffstaette@googlemail.com>
Fixes: bc309467 ("btrfs: extend balance filter usage to take minimum and maximum")
Reviewed-by: NFilipe Manana <fdmanana@suse.com>
Signed-off-by: NChris Mason <clm@fb.com>

dba72cb3

btrfs: qgroup: account shared subtree during snapshot delete · 82bd101b

由 Mark Fasheh 提交于 11月 05, 2015

Commit 0ed4792a ('btrfs: qgroup: Switch to new extent-oriented qgroup
mechanism.') removed our qgroup accounting during
btrfs_drop_snapshot(). Predictably, this results in qgroup numbers
going bad shortly after a snapshot is removed.

Fix this by adding a dirty extent record when we encounter extents during
our shared subtree walk. This effectively restores the functionality we had
with the original shared subtree walking code in 1152651a (btrfs: qgroup:
account shared subtrees during snapshot delete).

The idea with the original patch (and this one) is that shared subtrees can
get skipped during drop_snapshot. The shared subtree walk then allows us a
chance to visit those extents and add them to the qgroup work for later
processing. This ultimately makes the accounting for drop snapshot work.

The new qgroup code nicely handles all the other extents during the tree
walk via the ref dec/inc functions so we don't have to add actions beyond
what we had originally.
Signed-off-by: NMark Fasheh <mfasheh@suse.de>
Signed-off-by: NChris Mason <clm@fb.com>

82bd101b

Btrfs: use btrfs_get_fs_root in resolve_indirect_ref · 2d9e9776

由 Josef Bacik 提交于 11月 05, 2015

The backref code will look up the fs_root we're trying to resolve our indirect
refs for, unfortunately we use btrfs_read_fs_root_no_name, which returns -ENOENT
if the ref is 0. This isn't helpful for the qgroup stuff with snapshot delete
as it won't be able to search down the snapshot we are deleting, which will
cause us to miss roots. So use btrfs_get_fs_root and send false for check_ref
so we can always get the root we're looking for. Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NMark Fasheh <mfasheh@suse.de>
Signed-off-by: NChris Mason <clm@fb.com>

2d9e9776

btrfs: qgroup: fix quota disable during rescan · 967ef513

由 Justin Maggard 提交于 11月 06, 2015

There's a race condition that leads to a NULL pointer dereference if you
disable quotas while a quota rescan is running.  To fix this, we just need
to wait for the quota rescan worker to actually exit before tearing down
the quota structures.
Signed-off-by: NJustin Maggard <jmaggard@netgear.com>
Signed-off-by: NChris Mason <clm@fb.com>

967ef513

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功