提交 · 51090c5d6de08cfc86b2d861775dedddd9a2c023 · openeuler / Kernel

07 12月, 2017 6 次提交

proc: show si_ptr in /proc/<pid>/timers without hashing · ba3edf1f

由 Linus Torvalds 提交于 12月 06, 2017

It's a user pointer, and while the permissions of the file are pretty
questionable (should it really be readable to everybody), hashing the
pointer isn't going to be the solution.

We should take a closer look at more of the /proc/<pid> file permissions
in general.  Sure, we do want many of them to often be readable (for
'ps' and friends), but I think we should probably do a few conversions
from S_IRUGO to S_IRUSR.
Reported-by: NAlexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

ba3edf1f

btrfs: Fix possible off-by-one in btrfs_search_path_in_tree · c8bcbfbd

由 Nikolay Borisov 提交于 12月 01, 2017

The name char array passed to btrfs_search_path_in_tree is of size
BTRFS_INO_LOOKUP_PATH_MAX (4080). So the actual accessible char indexes
are in the range of [0, 4079]. Currently the code uses the define but this
represents an off-by-one.

Implications:

Size of btrfs_ioctl_ino_lookup_args is 4096, so the new byte will be
written to extra space, not some padding that could be provided by the
allocator.

btrfs-progs store the arguments on stack, but kernel does own copy of
the ioctl buffer and the off-by-one overwrite does not affect userspace,
but the ending 0 might be lost.

Kernel ioctl buffer is allocated dynamically so we're overwriting
somebody else's memory, and the ioctl is privileged if args.objectid is
not 256. Which is in most cases, but resolving a subvolume stored in
another directory will trigger that path.

Before this patch the buffer was one byte larger, but then the -1 was
not added.

Fixes: ac8e9819 ("Btrfs: add search and inode lookup ioctls")
Signed-off-by: NNikolay Borisov <nborisov@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
[ added implications ]
Signed-off-by: NDavid Sterba <dsterba@suse.com>

c8bcbfbd

Btrfs: disable FUA if mounted with nobarrier · 1b9e619c

由 Omar Sandoval 提交于 12月 05, 2017

I was seeing disk flushes still happening when I mounted a Btrfs
filesystem with nobarrier for testing. This is because we use FUA to
write out the first super block, and on devices without FUA support, the
block layer translates FUA to a flush. Even on devices supporting true
FUA, using FUA when we asked for no barriers is surprising.

Fixes: 387125fc ("Btrfs: fix barrier flushes")
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Reviewed-by: NQu Wenruo <wqu@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

1b9e619c

btrfs: fix missing error return in btrfs_drop_snapshot · e19182c0

由 Jeff Mahoney 提交于 12月 04, 2017

If btrfs_del_root fails in btrfs_drop_snapshot, we'll pick up the
error but then return 0 anyway due to mixing err and ret.

Fixes: 79787eaa ("btrfs: replace many BUG_ONs with proper error handling")
Cc: <stable@vger.kernel.org> # v3.4+
Signed-off-by: NJeff Mahoney <jeffm@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

e19182c0

btrfs: handle errors while updating refcounts in update_ref_for_cow · 692826b2

由 Jeff Mahoney 提交于 11月 21, 2017

Since commit fb235dc0 (btrfs: qgroup: Move half of the qgroup
accounting time out of commit trans) the assumption that
btrfs_add_delayed_{data,tree}_ref can only return 0 or -ENOMEM has
been false.  The qgroup operations call into btrfs_search_slot
and friends and can now return the full spectrum of error codes.

Fortunately, the fix here is easy since update_ref_for_cow failing
is already handled so we just need to bail early with the error
code.

Fixes: fb235dc0 (btrfs: qgroup: Move half of the qgroup accounting ...)
Cc: <stable@vger.kernel.org> # v4.11+
Signed-off-by: NJeff Mahoney <jeffm@suse.com>
Reviewed-by: NEdmund Nadolski <enadolski@suse.com>
Reviewed-by: NQu Wenruo <wqu@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

692826b2

btrfs: Fix quota reservation leak on preallocated files · b430b775

由 Justin Maggard 提交于 10月 30, 2017

Commit c6887cd1 ("Btrfs: don't do nocow check unless we have to")
changed the behavior of __btrfs_buffered_write() so that it first tries
to get a data space reservation, and then skips the relatively expensive
nocow check if the reservation succeeded.

If we have quotas enabled, the data space reservation also includes a
quota reservation.  But in the rewrite case, the space has already been
accounted for in qgroups.  So btrfs_check_data_free_space() increases
the quota reservation, but it never gets decreased when the data
actually gets written and overwrites the pre-existing data.  So we're
left with both the qgroup and qgroup reservation accounting for the same
space.

This commit adds the missing btrfs_qgroup_free_data() call in the case
of BTRFS_ORDERED_PREALLOC extents.

Fixes: c6887cd1 ("Btrfs: don't do nocow check unless we have to")
Signed-off-by: NJustin Maggard <jmaggard@netgear.com>
Reviewed-by: NQu Wenruo <wqu@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

b430b775

01 12月, 2017 6 次提交

afs: Properly reset afs_vnode (inode) fields · f8de483e

由 David Howells 提交于 12月 01, 2017

When an AFS inode is allocated by afs_alloc_inode(), the allocated
afs_vnode struct isn't necessarily reset from the last time it was used as
an inode because the slab constructor is only invoked once when the memory
is obtained from the page allocator.

This means that information can leak from one inode to the next because
we're not calling kmem_cache_zalloc().  Some of the information isn't
reset, in particular the permit cache pointer.

Bring the clearances up to date.
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Tested-by: NMarc Dionne <marc.dionne@auristor.com>

f8de483e

afs: Fix permit refcounting · 1bcab125

由 David Howells 提交于 12月 01, 2017

Fix four refcount bugs in afs_cache_permit():

 (1) When checking the result of the kzalloc(), we can't just return, but
     must put 'permits'.

 (2) We shouldn't put permits immediately after hashing a new permit as we
     need to keep the pointer stable so that we can check to see if
     vnode->permit_cache has changed before we decide whether to assign to
     it.

 (3) 'permits' is being put twice.

 (4) We need to put either the replacement or the thing replaced after the
     assignment to vnode->permit_cache.

Without this, lots of the following are seen:

  Kernel BUG at ffffffffa039857b [verbose debug info unavailable]
  ------------[ cut here ]------------
  Kernel BUG at ffffffffa039858a [verbose debug info unavailable]
  ------------[ cut here ]------------

The addresses are in the .text..refcount section of the kafs.ko module.
Following the relocation records for the __ex_table section shows one to be
due to the decrement in afs_put_permits() and the other to be key_get() in
afs_cache_permit().

Occasionally, the following is seen:

  refcount_t overflow at afs_cache_permit+0x57d/0x5c0 [kafs] in cc1[562], uid/euid: 0/0
  WARNING: CPU: 0 PID: 562 at kernel/panic.c:657 refcount_error_report+0x9c/0xac
  ...
Reported-by: NMarc Dionne <marc.dionne@auristor.com>
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Tested-by: NMarc Dionne <marc.dionne@auristor.com>

1bcab125

xfs: Properly retry failed dquot items in case of error during buffer writeback · 373b0589

由 Carlos Maiolino 提交于 11月 28, 2017

Once the inode item writeback errors is already fixed, it's time to fix the same
problem in dquot code.

Although there were no reports of users hitting this bug in dquot code (at least
none I've seen), the bug is there and I was already planning to fix it when the
correct approach to fix the inodes part was decided.

This patch aims to fix the same problem in dquot code, regarding failed buffers
being unable to be resubmitted once they are flush locked.

Tested with the recently test-case sent to fstests list by Hou Tao.
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Signed-off-by: NCarlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

373b0589

xfs: scrub inode mode properly · 3b42d385

由 Darrick J. Wong 提交于 11月 27, 2017

Since we've used up all the bits in i_mode, the existing mode check
doesn't actually do anything useful.  However, we've not used all the
bit values in the format portion of i_mode, so we /do/ need to test
that for bad values.

Fixes: 80e4e126 ("xfs: scrub inodes")
Fixes-coverity-id: 1423992
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>

3b42d385

xfs: remove unused parameter from xfs_writepage_map · 2d5f4b5b

由 Darrick J. Wong 提交于 11月 27, 2017

The first thing that xfs_writepage_map does is clobber the offset
parameter. Since we never use the passed-in value, turn the parameter
into a local variable. This gets rid of an UBSAN warning in generic/466.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>

2d5f4b5b

xfs: ubsan fixes · 22a6c837

由 Darrick J. Wong 提交于 11月 27, 2017

Fix some complaints from the UBSAN about signed integer addition overflows.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>

22a6c837

30 11月, 2017 8 次提交

fs/hugetlbfs/inode.c: change put_page/unlock_page order in hugetlbfs_fallocate() · 72639e6d

由 Nadav Amit 提交于 11月 29, 2017

hugetlfs_fallocate() currently performs put_page() before unlock_page().
This scenario opens a small time window, from the time the page is added
to the page cache, until it is unlocked, in which the page might be
removed from the page-cache by another core.  If the page is removed
during this time windows, it might cause a memory corruption, as the
wrong page will be unlocked.

It is arguable whether this scenario can happen in a real system, and
there are several mitigating factors.  The issue was found by code
inspection (actually grep), and not by actually triggering the flow.
Yet, since putting the page before unlocking is incorrect it should be
fixed, if only to prevent future breakage or someone copy-pasting this
code.

Mike said:
 "I am of the opinion that this does not need to be sent to stable.
  Although the ordering is current code is incorrect, there is no way
  for this to be a problem with current locking. In addition, I verified
  that the perhaps bigger issue with sys_fadvise64(POSIX_FADV_DONTNEED)
  for hugetlbfs and other filesystems is addressed in 3a77d214 ("mm:
  fadvise: avoid fadvise for fs without backing device")"

Link: http://lkml.kernel.org/r/20170826191124.51642-1-namit@vmware.com
Fixes: 70c3547e ("hugetlbfs: add hugetlbfs_fallocate()")
Signed-off-by: NNadav Amit <namit@vmware.com>
Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

72639e6d

autofs: revert "autofs: fix AT_NO_AUTOMOUNT not being honored" · 5d38f049

由 Ian Kent 提交于 11月 29, 2017

Commit 42f46148 ("autofs: fix AT_NO_AUTOMOUNT not being honored")
allowed the fstatat(2) system call to properly honor the AT_NO_AUTOMOUNT
flag but introduced a semantic change.

In order to honor AT_NO_AUTOMOUNT a semantic change was made to the
negative dentry case for stat family system calls in follow_automount().

This changed the unconditional triggering of an automount in this case
to no longer be done and an error returned instead.

This has caused more problems than I expected so reverting the change is
needed.

In a discussion with Neil Brown it was concluded that the automount(8)
daemon can implement this change without kernel modifications.  So that
will be done instead and the autofs module documentation updated with a
description of the problem and what needs to be done by module users for
this specific case.

Link: http://lkml.kernel.org/r/151174730120.6162.3848002191530283984.stgit@pluto.themaw.net
Fixes: 42f46148 ("autofs: fix AT_NO_AUTOMOUNT not being honored")
Signed-off-by: NIan Kent <raven@themaw.net>
Cc: Neil Brown <neilb@suse.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Colin Walters <walters@redhat.com>
Cc: Ondrej Holy <oholy@redhat.com>
Cc: <stable@vger.kernel.org>	[4.11+]
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5d38f049

autofs: revert "autofs: take more care to not update last_used on path walk" · 43694d4b

由 Ian Kent 提交于 11月 29, 2017

While commit 092a5345 ("autofs: take more care to not update
last_used on path walk") helped (partially) resolve a problem where
automounts were not expiring due to aggressive accesses from user space
it has a side effect for very large environments.

This change helps with the expire problem by making the expire more
aggressive but, for very large environments, that means more mount
requests from clients.  When there are a lot of clients that can mean
fairly significant server load increases.

It turns out I put the last_used in this position to solve this very
problem and failed to update my own thinking of the autofs expire
policy.  So the patch being reverted introduces a regression which
should be fixed.

Link: http://lkml.kernel.org/r/151174729420.6162.1832622523537052460.stgit@pluto.themaw.net
Fixes: 092a5345 ("autofs: take more care to not update last_used on path walk")
Signed-off-by: NIan Kent <raven@themaw.net>
Reviewed-by: NNeilBrown <neilb@suse.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: <stable@vger.kernel.org>	[4.11+]
Cc: Colin Walters <walters@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Ondrej Holy <oholy@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

43694d4b

fs/fat/inode.c: fix sb_rdonly() change · b6e8e12c

由 OGAWA Hirofumi 提交于 11月 29, 2017

Commit bc98a42c ("VFS: Convert sb->s_flags & MS_RDONLY to
sb_rdonly(sb)") converted fat_remount():new_rdonly from a bool to an
int.

However fat_remount() depends upon the compiler's conversion of a
non-zero integer into boolean `true'.

Fix it by switching `new_rdonly' back into a bool.

Link: http://lkml.kernel.org/r/87mv3d5x51.fsf@mail.parknet.co.jp
Fixes: bc98a42c ("VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)")
Signed-off-by: NOGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Joe Perches <joe@perches.com>
Cc: David Howells <dhowells@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b6e8e12c

fs/mbcache.c: make count_objects() more robust · d5dabd63

由 Jiang Biao 提交于 11月 29, 2017

When running ltp stress test for 7*24 hours, vmscan occasionally emits
the following warning continuously:

  mb_cache_scan+0x0/0x3f0 negative objects to delete
  nr=-9232265467809300450
  ...

Tracing shows the freeable(mb_cache_count returns) is -1, which causes
the continuous accumulation and overflow of total_scan.

This patch makes sure that mb_cache_count() cannot return a negative
value, which makes the mbcache shrinker more robust.

Link: http://lkml.kernel.org/r/1511753419-52328-1-git-send-email-jiang.biao2@zte.com.cnSigned-off-by: NJiang Biao <jiang.biao2@zte.com.cn>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <zhong.weidong@zte.com.cn>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d5dabd63

exec: avoid RLIMIT_STACK races with prlimit() · 04e35f44

由 Kees Cook 提交于 11月 29, 2017

While the defense-in-depth RLIMIT_STACK limit on setuid processes was
protected against races from other threads calling setrlimit(), I missed
protecting it against races from external processes calling prlimit().
This adds locking around the change and makes sure that rlim_max is set
too.

Link: http://lkml.kernel.org/r/20171127193457.GA11348@beast
Fixes: 64701dee ("exec: Use sane stack rlimit under secureexec")
Signed-off-by: NKees Cook <keescook@chromium.org>
Reported-by: NBen Hutchings <ben.hutchings@codethink.co.uk>
Reported-by: NBrad Spengler <spender@grsecurity.net>
Acked-by: NSerge Hallyn <serge@hallyn.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

04e35f44

mm: replace pmd_write with pmd_access_permitted in fault + gup paths · c7da82b8

由 Dan Williams 提交于 11月 29, 2017

The 'access_permitted' helper is used in the gup-fast path and goes
beyond the simple _PAGE_RW check to also:

 - validate that the mapping is writable from a protection keys
   standpoint

 - validate that the pte has _PAGE_USER set since all fault paths where
   pmd_write is must be referencing user-memory.

Link: http://lkml.kernel.org/r/151043111049.2842.15241454964150083466.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: NDan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c7da82b8

NFSv4: Ensure gcc 4.4.4 can compile initialiser for "invalid_stateid" · 445f288d

由 Trond Myklebust 提交于 11月 18, 2017

gcc 4.4.4 is too old to have full C11 anonymous union support, so
the current initialiser fails to compile.
Reported-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
(compile-)Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: NGeert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

445f288d

29 11月, 2017 6 次提交

quota: Check for register_shrinker() failure. · 88bc0ede

由 Tetsuo Handa 提交于 11月 29, 2017

register_shrinker() might return -ENOMEM error since Linux 3.12.
Call panic() as with other failure checks in this function if
register_shrinker() failed.

Fixes: 1d3d4437 ("vmscan: per-node deferred work")
Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Jan Kara <jack@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Reviewed-by: NMichal Hocko <mhocko@suse.com>
Signed-off-by: NJan Kara <jack@suse.cz>

88bc0ede

xfs: calculate correct offset in xfs_scrub_quota_item · 712d361d

由 Eric Sandeen 提交于 11月 27, 2017

It's only used for tracepoints so it's relatively harmless,
but the offset is calculated incorrectly in xfs_scrub_quota_item.

qi_dqperchunk is the nr. of dquots per "chunk" which we have
conveniently *cough* defined to always be 1 FSB.  Therefore
block_offset * qi_dqperchunk == first id in that chunk,
and so offset = id / qi_dqperchunk

id * dqperchunk is ... meaningless.

Fixes-coverity-id: 1423965
Fixes: c2fc338c ("xfs: scrub quota information")
Signed-off-by: NEric Sandeen <sandeen@redhat.com>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

712d361d

xfs: fix uninitialized variable in xfs_scrub_quota · eda6bc27

由 Eric Sandeen 提交于 11月 27, 2017

On the first pass through the while(1) loop, we get to
xfs_scrub_should_terminate() which can test the uninitialized
error variable.

Fixes-coverity-id: 1423737
Fixes: c2fc338c ("xfs: scrub quota information")
Signed-off-by: NEric Sandeen <sandeen@redhat.com>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

eda6bc27

xfs: fix leaks on corruption errors in xfs_bmap.c · d41c6172

由 Eric Sandeen 提交于 11月 27, 2017

Use _GOTO instead of _RETURN so we can free the allocated
cursor on error.

Fixes: bf806280 ("xfs: remove xfs_bmse_shift_one")
Fixes-coverity-id: 1423813, 1423676
Signed-off-by: NEric Sandeen <sandeen@redhat.com>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

d41c6172

xfs: fortify xfs_alloc_buftarg error handling · d210a987

由 Michal Hocko 提交于 11月 23, 2017

percpu_counter_init failure path doesn't clean up &btp->bt_lru list.
Call list_lru_destroy in that error path. Similarly register_shrinker
error path is not handled.

While it is unlikely to trigger these error path, it is not impossible
especially the later might fail with large NUMAs.  Let's handle the
failure to make the code more robust.
Noticed-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

d210a987

Btrfs: incremental send, fix wrong unlink path after renaming file · ea37d599

由 Filipe Manana 提交于 11月 17, 2017

Under some circumstances, an incremental send operation can issue wrong
paths for unlink commands related to files that have multiple hard links
and some (or all) of those links were renamed between the parent and send
snapshots. Consider the following example:

Parent snapshot

 .                                                      (ino 256)
 |---- a/                                               (ino 257)
 |     |---- b/                                         (ino 259)
 |     |     |---- c/                                   (ino 260)
 |     |     |---- f2                                   (ino 261)
 |     |
 |     |---- f2l1                                       (ino 261)
 |
 |---- d/                                               (ino 262)
       |---- f1l1_2                                     (ino 258)
       |---- f2l2                                       (ino 261)
       |---- f1_2                                       (ino 258)

Send snapshot

 .                                                      (ino 256)
 |---- a/                                               (ino 257)
 |     |---- f2l1/                                      (ino 263)
 |             |---- b2/                                (ino 259)
 |                   |---- c/                           (ino 260)
 |                   |     |---- d3                     (ino 262)
 |                   |           |---- f1l1_2           (ino 258)
 |                   |           |---- f2l2_2           (ino 261)
 |                   |           |---- f1_2             (ino 258)
 |                   |
 |                   |---- f2                           (ino 261)
 |                   |---- f1l2                         (ino 258)
 |
 |---- d                                                (ino 261)

When computing the incremental send stream the following steps happen:

1) When processing inode 261, a rename operation is issued that renames
   inode 262, which currently as a path of "d", to an orphan name of
   "o262-7-0". This is done because in the send snapshot, inode 261 has
   of its hard links with a path of "d" as well.

2) Two link operations are issued that create the new hard links for
   inode 261, whose names are "d" and "f2l2_2", at paths "/" and
   "o262-7-0/" respectively.

3) Still while processing inode 261, unlink operations are issued to
   remove the old hard links of inode 261, with names "f2l1" and "f2l2",
   at paths "a/" and "d/". However path "d/" does not correspond anymore
   to the directory inode 262 but corresponds instead to a hard link of
   inode 261 (link command issued in the previous step). This makes the
   receiver fail with a ENOTDIR error when attempting the unlink
   operation.

The problem happens because before sending the unlink operation, we failed
to detect that inode 262 was one of ancestors for inode 261 in the parent
snapshot, and therefore we didn't recompute the path for inode 262 before
issuing the unlink operation for the link named "f2l2" of inode 262. The
detection failed because the function "is_ancestor()" only follows the
first hard link it finds for an inode instead of all of its hard links
(as it was originally created for being used with directories only, for
which only one hard link exists). So fix this by making "is_ancestor()"
follow all hard links of the input inode.

A test case for fstests follows soon.
Signed-off-by: NFilipe Manana <fdmanana@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

ea37d599

28 11月, 2017 14 次提交

quota: propagate error from __dquot_initialize · 1a6152d3

由 Chao Yu 提交于 11月 28, 2017

In commit 6184fc0b ("quota: Propagate error from ->acquire_dquot()"),
we have propagated error from __dquot_initialize to caller, but we forgot
to handle such error in add_dquot_ref(), so, currently, during quota
accounting information initialization flow, if we failed for some of
inodes, we just ignore such error, and do account for others, which is
not a good implementation.

In this patch, we choose to let user be aware of such error, so after
turning on quota successfully, we can make sure all inodes disk usage
can be accounted, which will be more reasonable.
Suggested-by: NJan Kara <jack@suse.cz>
Signed-off-by: NChao Yu <yuchao0@huawei.com>
Signed-off-by: NJan Kara <jack@suse.cz>

1a6152d3

btrfs: tree-checker: Fix false panic for sanity test · 69fc6cbb

由 Qu Wenruo 提交于 11月 08, 2017

[BUG]
If we run btrfs with CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y, it will
instantly cause kernel panic like:

------
...
assertion failed: 0, file: fs/btrfs/disk-io.c, line: 3853
...
Call Trace:
 btrfs_mark_buffer_dirty+0x187/0x1f0 [btrfs]
 setup_items_for_insert+0x385/0x650 [btrfs]
 __btrfs_drop_extents+0x129a/0x1870 [btrfs]
...
-----

[Cause]
Btrfs will call btrfs_check_leaf() in btrfs_mark_buffer_dirty() to check
if the leaf is valid with CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y.

However quite some btrfs_mark_buffer_dirty() callers(*) don't really
initialize its item data but only initialize its item pointers, leaving
item data uninitialized.

This makes tree-checker catch uninitialized data as error, causing
such panic.

*: These callers include but not limited to
setup_items_for_insert()
btrfs_split_item()
btrfs_expand_item()

[Fix]
Add a new parameter @check_item_data to btrfs_check_leaf().
With @check_item_data set to false, item data check will be skipped and
fallback to old btrfs_check_leaf() behavior.

So we can still get early warning if we screw up item pointers, and
avoid false panic.

Cc: Filipe Manana <fdmanana@gmail.com>
Reported-by: NLakshmipathi.G <lakshmipathi.g@gmail.com>
Signed-off-by: NQu Wenruo <wqu@suse.com>
Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

69fc6cbb

proc: don't report kernel addresses in /proc/<pid>/stack · 8f5abe84

由 Linus Torvalds 提交于 11月 27, 2017

This just changes the file to report them as zero, although maybe even
that could be removed.  I checked, and at least procps doesn't actually
seem to parse the 'stack' file at all.

And since the file doesn't necessarily even exist (it requires
CONFIG_STACKTRACE), possibly other tools don't really use it either.

That said, in case somebody parses it with tools, just having that zero
there should keep such tools happy.
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8f5abe84

lockd: fix "list_add double add" caused by legacy signal interface · 81833de1

由 Vasily Averin 提交于 11月 13, 2017

restart_grace() uses hardcoded init_net.
It can cause to "list_add double add" in following scenario:

1) nfsd and lockd was started in several net namespaces
2) nfsd in init_net was stopped (lockd was not stopped because
 it have users from another net namespaces)
3) lockd got signal, called restart_grace() -> set_grace_period()
 and enabled lock_manager in hardcoded init_net.
4) nfsd in init_net is started again,
 its lockd_up() calls set_grace_period() and tries to add
 lock_manager into init_net 2nd time.

Jeff Layton suggest:
"Make it safe to call locks_start_grace multiple times on the same
lock_manager. If it's already on the global grace_list, then don't try
to add it again.  (But we don't intentionally add twice, so for now we
WARN about that case.)

With this change, we also need to ensure that the nfsd4 lock manager
initializes the list before we call locks_start_grace. While we're at
it, move the rest of the nfsd_net initialization into
nfs4_state_create_net. I see no reason to have it spread over two
functions like it is today."

Suggested patch was updated to generate warning in described situation.
Suggested-by: NJeff Layton <jlayton@redhat.com>
Signed-off-by: NVasily Averin <vvs@virtuozzo.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

81833de1

nlm_shutdown_hosts_net() cleanup · 9e137ed5

由 Vasily Averin 提交于 10月 30, 2017

nlm_complain_hosts() walks through nlm_server_hosts hlist, which should
be protected by nlm_host_mutex.
Signed-off-by: NVasily Averin <vvs@virtuozzo.com>
Reviewed-by: NJeff Layton <jlayton@redhat.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

9e137ed5

race of nfsd inetaddr notifiers vs nn->nfsd_serv change · 2317dc55

由 Vasily Averin 提交于 11月 10, 2017

nfsd_inet[6]addr_event uses nn->nfsd_serv without taking nfsd_mutex,
which can be changed during execution of notifiers and crash the host.

Moreover if notifiers were enabled in one net namespace they are enabled
in all other net namespaces, from creation until destruction.

This patch allows notifiers to access nn->nfsd_serv only after the
pointer is correctly initialized and delays cleanup until notifiers are
no longer in use.
Signed-off-by: NVasily Averin <vvs@virtuozzo.com>
Tested-by: NScott Mayhew <smayhew@redhat.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

2317dc55

race of lockd inetaddr notifiers vs nlmsvc_rqst change · 6b18dd1c

由 Vasily Averin 提交于 11月 10, 2017

lockd_inet[6]addr_event use nlmsvc_rqst without taken nlmsvc_mutex,
nlmsvc_rqst can be changed during execution of notifiers and crash the host.

Patch enables access to nlmsvc_rqst only when it was correctly initialized
and delays its cleanup until notifiers are no longer in use.

Note that nlmsvc_rqst can be temporally set to ERR_PTR, so the "if
(nlmsvc_rqst)" check in notifiers is insufficient on its own.
Signed-off-by: NVasily Averin <vvs@virtuozzo.com>
Tested-by: NScott Mayhew <smayhew@redhat.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

6b18dd1c

NFSD: make cache_detail structures const · ae2e408e

由 Bhumika Goyal 提交于 10月 17, 2017

Make these const as they are only getting passed to the function
cache_create_net having the argument as const.
Signed-off-by: NBhumika Goyal <bhumirks@gmail.com>
Reviewed-by: NJeff Layton <jlayton@redhat.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

ae2e408e

nfsd: check for use of the closed special stateid · ae254dac

由 Andrew Elble 提交于 11月 09, 2017

Prevent the use of the closed (invalid) special stateid by clients.
Signed-off-by: NAndrew Elble <aweits@rit.edu>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

ae254dac

nfsd: fix panic in posix_unblock_lock called from nfs4_laundromat · 64ebe124

由 Naofumi Honda 提交于 11月 09, 2017

From kernel 4.9, my two nfsv4 servers sometimes suffer from
    "panic: unable to handle kernel page request"
in posix_unblock_lock() called from nfs4_laundromat().

These panics diseappear if we revert the commit "nfsd: add a LRU list
for blocked locks".

The cause appears to be a typo in nfs4_laundromat(), which is also
present in nfs4_state_shutdown_net().

Cc: stable@vger.kernel.org
Fixes: 7919d0a2 "nfsd: add a LRU list for blocked locks"
Cc: jlayton@redhat.com
Reveiwed-by: NJeff Layton <jlayton@redhat.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

64ebe124

lockd: lost rollback of set_grace_period() in lockd_down_net() · 3a2b19d1

由 Vasily Averin 提交于 11月 02, 2017

Commit efda760f ("lockd: fix lockd shutdown race") is incorrect,
it removes lockd_manager and disarm grace_period_end for init_net only.

If nfsd was started from another net namespace lockd_up_net() calls
set_grace_period() that adds lockd_manager into per-netns list
and queues grace_period_end delayed work.

These action should be reverted in lockd_down_net().
Otherwise it can lead to double list_add on after restart nfsd in netns,
and to use-after-free if non-disarmed delayed work will be executed after netns destroy.

Fixes: efda760f ("lockd: fix lockd shutdown race")
Cc: stable@vger.kernel.org
Signed-off-by: NVasily Averin <vvs@virtuozzo.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

3a2b19d1

lockd: added cleanup checks in exit_net hook · a3152f14

由 Vasily Averin 提交于 11月 06, 2017

Signed-off-by: NVasily Averin <vvs@virtuozzo.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

a3152f14

V
grace: replace BUG_ON by WARN_ONCE in exit_net hook · b8722857
由 Vasily Averin 提交于 11月 06, 2017
```
Signed-off-by: NVasily Averin <vvs@virtuozzo.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
```
b8722857

nfsd: fix locking validator warning on nfs4_ol_stateid->st_mutex class · 4f34bd05

由 Andrew Elble 提交于 11月 08, 2017

The use of the st_mutex has been confusing the validator. Use the
proper nested notation so as to not produce warnings.
Signed-off-by: NAndrew Elble <aweits@rit.edu>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

4f34bd05

openeuler / Kernel 大约 1 年 前同步成功

openeuler / Kernel
大约 1 年前同步成功