1. 12 Apr, 2018 14 commits
    • ovl: allocate anon bdev per unique lower fs · 5148626b
      Amir Goldstein committed
      Instead of allocating an anonymous bdev per lower layer, allocate
      one anonymous bdev per every unique lower fs that is different than
      upper fs.
      
      Every unique lower fs is assigned an fsid > 0 and the number of
      unique lower fs are stored in ofs->numlowerfs.
      
      The assigned fsid is stored in the lower layer struct and will be
      used also for inode number multiplexing.
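The fsid assignment described above can be modelled in a few lines of user-space Python. This is only an illustrative sketch, not the kernel code: the function name `assign_fsids`, the string labels for filesystems, and the use of fsid 0 for layers that share the upper fs are assumptions made for the example.

```python
def assign_fsids(upper_fs, lower_layer_fs):
    """Return (fsid per lower layer, number of unique lower fs).

    Hypothetical model: a layer on the same fs as upper gets fsid 0
    (an assumption of this sketch); every other unique fs gets the
    next fsid > 0, shared by all layers on that fs.
    """
    fsid_of = {}       # filesystem identity -> assigned fsid
    layer_fsids = []
    for fs in lower_layer_fs:
        if fs == upper_fs:
            layer_fsids.append(0)
            continue
        if fs not in fsid_of:
            fsid_of[fs] = len(fsid_of) + 1   # first unique lower fs gets 1
        layer_fsids.append(fsid_of[fs])
    return layer_fsids, len(fsid_of)


layer_fsids, numlowerfs = assign_fsids("fs-A", ["fs-A", "fs-B", "fs-B", "fs-C"])
print(layer_fsids, numlowerfs)
```

In this model, four lower layers on three filesystems need only two anonymous bdevs, because one filesystem is shared with upper and another is shared by two layers.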
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • ovl: factor out ovl_map_dev_ino() helper · da309e8c
      Amir Goldstein committed
      A helper for ovl_getattr() to map the values of st_dev and st_ino
      according to constant st_ino rules.
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • ovl: cleanup ovl_update_time() · 8f35cf51
      Miklos Szeredi committed
      No need to mess with an alias, the upperdentry can be retrieved directly
      from the overlay inode.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • ovl: add WARN_ON() for non-dir redirect cases · 3a291774
      Miklos Szeredi committed
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • ovl: cleanup setting OVL_INDEX · 0471a9cd
      Vivek Goyal committed
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • ovl: set d->is_dir and d->opaque for last path element · 102b0d11
      Vivek Goyal committed
      Certain properties in ovl_lookup_data should be set only for the last
      element of the path. IOW, if we are calling ovl_lookup_single() for an
      absolute redirect, then d->is_dir and d->opaque do not make much sense
      for intermediate path elements. Instead set them only if the dentry being
      looked up is the last path element.
      
      As of now we do not seem to be making use of d->opaque if it is set for
      a path/dentry in lower. But just define the semantics so that future code
      can make use of this assumption.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • ovl: Do not check for redirect if this is last layer · e9b77f90
      Vivek Goyal committed
      If we are looking in the last layer, then there should not be any need to
      process redirects. Redirect information is used only for lookup in the next
      lower layer and there is no more lower layer to look into. So no need
      to process redirects.
      
      IOW, ignore redirects on lowest layer.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Reviewed-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • ovl: lookup in inode cache first when decoding lower file handle · 8b58924a
      Amir Goldstein committed
      When decoding a lower file handle, we need to check if lower file was
      copied up and indexed and if it has a whiteout index, we need to check
      if this is an unlinked but open non-dir before returning -ESTALE.
      
      To find out if this is an unlinked but open non-dir we need to lookup
      an overlay inode in inode cache by lower inode and that requires decoding
      the lower file handle before looking in inode cache.
      
      Before this change, if the lower inode turned out to be a directory, we
      may have paid an expensive cost to reconnect that lower directory for
      nothing.
      
      After this change, we start by decoding a disconnected lower dentry and
      using the lower inode for looking up an overlay inode in inode cache.
      If we find overlay inode and dentry in cache, we avoid the index lookup
      overhead. If we don't find an overlay inode and dentry in cache, then we
      only need to decode a connected lower dentry in case the lower dentry is
      a non-indexed directory.
      
      The xfstests group overlay/exportfs tests decoding overlayfs file
      handles after drop_caches with different states of the file at encode
      and decode time. Overall the tests in the group call ovl_lower_fh_to_d()
      89 times to decode a lower file handle.
      
      Before this change, the tests called ovl_get_index_fh() 75 times and
      reconnect_one() 61 times.
      After this change, the tests call ovl_get_index_fh() 70 times and
      reconnect_one() 59 times. The 2 cases where reconnect_one() was avoided
      are cases where a non-upper directory file handle was encoded, then the
      directory removed and then file handle was decoded.
      
      To demonstrate the effect on decoding file handles with hot inode/dentry
      cache, the drop_caches call in the tests was disabled. Without
      drop_caches, there are no reconnect_one() calls at all before or after
      the change. Before the change, there are 75 calls to ovl_get_index_fh(),
      exactly as the case with drop_caches. After the change, there are only
      10 calls to ovl_get_index_fh().
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • ovl: do not try to reconnect a disconnected origin dentry · 8a22efa1
      Amir Goldstein committed
      On lookup of non directory, we try to decode the origin file handle
      stored in upper inode. The origin file handle is supposed to be decoded
      to a disconnected non-dir dentry, which is fine, because we only need
      the lower inode of a copy up origin.
      
      However, if the origin file handle somehow turns out to be a directory
      we pay the expensive cost of reconnecting the directory dentry, only to
      get a mismatch file type and drop the dentry.
      
      Optimize this case by explicitly opting out of reconnecting the dentry.
      Opting-out of reconnect is done by passing a NULL acceptable callback
      to exportfs_decode_fh().
      
      While the case described above is a strange corner case that does not
      really need to be optimized, the API added for this optimization will
      be used by a following patch to optimize a more common case of decoding
      an overlayfs file handle.
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • ovl: disambiguate ovl_encode_fh() · 5b2cccd3
      Amir Goldstein committed
      Rename ovl_encode_fh() to ovl_encode_real_fh() to differentiate from the
      exportfs function ovl_encode_inode_fh() and change the latter to
      ovl_encode_fh() to match the exportfs method name.
      
      Rename ovl_decode_fh() to ovl_decode_real_fh() for consistency.
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • ovl: set lower layer st_dev only if setting lower st_ino · 9f99e50d
      Amir Goldstein committed
      For broken hardlinks, we do not return lower st_ino, so we should
      also not return lower pseudo st_dev.
      
      Fixes: a0c5ad30 ("ovl: relax same fs constraint for constant st_ino")
      Cc: <stable@vger.kernel.org> #v4.15
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • ovl: fix lookup with middle layer opaque dir and absolute path redirects · 3ec9b3fa
      Amir Goldstein committed
      As of now if we encounter an opaque dir while looking for a dentry, we set
      d->last=true. This means that there is no need to look further in any of
      the lower layers. This works fine as long as there are no redirects or
      only relative redirects. But what if there is an absolute redirect on a
      child dentry of the opaque directory? We still need to continue to look
      into the next lower layer. This patch fixes it.
      
      Here is an example to demonstrate the issue. Say you have following setup.
      
      upper:  /redirect (redirect=/a/b/c)
      lower1: /a/[b]/c       ([b] is opaque) (c has absolute redirect=/a/b/d/)
      lower0: /a/b/d/foo
      
      Now "redirect" dir should merge with lower1:/a/b/c/ and lower0:/a/b/d.
      Note, despite the fact that lower1:/a/[b] is opaque, we need to continue to
      look into lower0 because child c has an absolute redirect.
      
      Following is a reproducer.
      
      Watch me make foo disappear:
      
       $ mkdir lower middle upper work work2 merged
       $ mkdir lower/origin
       $ touch lower/origin/foo
       $ mount -t overlay none merged/ \
               -olowerdir=lower,upperdir=middle,workdir=work2
       $ mkdir merged/pure
       $ mv merged/origin merged/pure/redirect
       $ umount merged
       $ mount -t overlay none merged/ \
               -olowerdir=middle:lower,upperdir=upper,workdir=work
       $ mv merged/pure/redirect merged/redirect
      
      Now you see foo inside a twice redirected merged dir:
      
       $ ls merged/redirect
       foo
       $ umount merged
       $ mount -t overlay none merged/ \
               -olowerdir=middle:lower,upperdir=upper,workdir=work
      
      After mount cycle you don't see foo inside the same dir:
      
       $ ls merged/redirect
      
      During middle layer lookup, the opaqueness of middle/pure is left in
      the lookup state and then middle/pure/redirect is wrongly treated as
      opaque.
      
      Fixes: 02b69b28 ("ovl: lookup redirects")
      Cc: <stable@vger.kernel.org> #v4.10
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • ovl: Set d->last properly during lookup · 452061fd
      Vivek Goyal committed
      d->last signifies that this is the last layer we are looking into and there
      are no more. This allows for some optimization opportunities
      during lookup. For example, in ovl_lookup_single() we don't have to check
      for the opaque xattr of a directory if this is the last layer we are looking
      into (d->last = true).
      
      But knowing for sure whether we are looking into the last layer can be very
      tricky. If redirects are not enabled, then we can look at poe->numlower and
      figure out if the lookup we are about to do is in the last layer or not. But if
      redirects are enabled then it is possible that poe->numlower suggests that we
      are looking in the last layer, but there is an absolute redirect present in
      the found element that redirects us to a layer in root, and that means lookup
      will continue in lower layers further.
      
      For example, consider following.
      
      /upperdir/pure (opaque=y)
      /upperdir/pure/foo (opaque=y,redirect=/bar)
      /lowerdir/bar
      
      In this case pure is "pure upper". When we look for "foo", at that time
      poe->numlower=0. But that alone does not mean that we will not search for a
      merge candidate in /lowerdir. Absolute redirect changes that.
      
      IOW, d->last should not be set just based on poe->numlower if redirects are
      enabled. That can lead to setting d->last while it should not have and that
      means we will not check for opaque xattr while we should have.
      
      So do this.
      
       - If redirects are not enabled, then continue to rely on poe->numlower
         information to determine if it is last layer or not.
      
       - If redirects are enabled, then set d->last = true only if this is the
         last layer in root ovl_entry (roe).
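The two rules above can be written as a small decision function. The following Python sketch is a simplified model rather than the kernel implementation; the function name and its parameters (counts of lower layers still below the current one, as seen from the parent entry and from the root entry) are hypothetical and chosen only for illustration.

```python
def should_set_last(redirects_enabled, lowers_left_in_parent, lowers_left_in_root):
    """Decide whether d->last may be set for the layer about to be searched.

    lowers_left_in_parent: lower layers of the parent entry (poe) still
        to be searched after this one (hypothetical parameter).
    lowers_left_in_root: lower layers of the root entry (roe) that lie
        below this one (hypothetical parameter).
    """
    if not redirects_enabled:
        # Without redirects the parent's layer list is authoritative.
        return lowers_left_in_parent == 0
    # With redirects an absolute redirect can restart the walk from root,
    # so only the root's layer list can prove this is the last layer.
    return lowers_left_in_root == 0


# The example from the text: "foo" is looked up under a pure upper dir
# (poe->numlower = 0) while the overlay root still has one lower layer.
without_redirects = should_set_last(False, 0, 1)
with_redirects = should_set_last(True, 0, 1)
print(without_redirects, with_redirects)
```

With redirects enabled the function reports that this is not the last layer, so the opaque xattr check is not skipped and the absolute redirect to /bar can still be followed.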
      Suggested-by: Amir Goldstein <amir73il@gmail.com>
      Reviewed-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Fixes: 02b69b28 ("ovl: lookup redirects")
      Cc: <stable@vger.kernel.org> #v4.10
    • ovl: set i_ino to the value of st_ino for NFS export · 695b46e7
      Amir Goldstein committed
      Eddie Horng reported that readdir of an overlayfs directory that
      was exported via NFSv3 returns entries with d_type set to DT_UNKNOWN.
      The reason is that while preparing the response for readdirplus, nfsd
      checks inside encode_entryplus_baggage() that a child dentry's inode
      number matches the value of d_ino returned by the overlayfs readdir iterator.
      
      Because the overlayfs inodes use arbitrary inode numbers that are not
      correlated with the values of st_ino/d_ino, NFSv3 falls back to not
      encoding d_type. Although this is an allowed behavior, we can fix it for
      the case of all overlayfs layers on the same underlying filesystem.
      
      When NFS export is enabled and d_ino is consistent with st_ino
      (samefs), set the same value also to i_ino in ovl_fill_inode() for all
      overlayfs inodes, so that nfsd readdirplus sanity checks will pass.
      ovl_fill_inode() may be called from ovl_new_inode(), before the real inode
      was created, with ino arg 0. In that case, i_ino will be updated to the real
      upper inode i_ino on ovl_inode_init() or ovl_inode_update().
      Reported-by: Eddie Horng <eddiehorng.tw@gmail.com>
      Tested-by: Eddie Horng <eddiehorng.tw@gmail.com>
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Fixes: 8383f174 ("ovl: wire up NFS export operations")
      Cc: <stable@vger.kernel.org> #v4.16
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  2. 23 Mar, 2018 1 commit
  3. 20 Mar, 2018 2 commits
  4. 16 Mar, 2018 2 commits
    • Revert "btrfs: use proper endianness accessors for super_copy" · 093e037c
      David Sterba committed
      This reverts commit 3c181c12.
      
      The offending patch was merged in 4.16-rc4 and was promptly applied to
      stable kernels 4.14.25 and 4.15.8.
      
      The patch causes a corruption in several superblock items on big-endian
      machines because of messed up endianness conversions. The damage is
      manually repairable. A filesystem cannot be mounted again after it has
      been unmounted once.
      
      We do a full revert and not a fixup so stable can pick that patch ASAP.
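As background on why a wrong endianness conversion only shows up on big-endian machines, the following Python sketch reads a little-endian on-disk field both ways. It does not reproduce the reverted patch; the field name `generation` and the value are illustrative assumptions. It relies only on the fact that btrfs stores its on-disk fields in little-endian order.

```python
import struct

generation = 0x1234                      # illustrative value, not from the report

# On-disk format: 64-bit little-endian.
on_disk = struct.pack("<Q", generation)

# Correct read: convert from little-endian regardless of the host CPU.
correct = struct.unpack("<Q", on_disk)[0]

# What a big-endian CPU sees if the raw bytes are used without conversion
# (or, equivalently, if a conversion is applied one time too many).
misread = struct.unpack(">Q", on_disk)[0]

print(hex(correct), hex(misread))
```

On a little-endian host a missing or doubled conversion is a no-op, which is one reason such bugs tend to escape testing on x86.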
      
      Fixes: 3c181c12 ("btrfs: use proper endianness accessors for super_copy")
      Link: https://lkml.kernel.org/r/1521139304@msgid.manchmal.in-ulm.de
      CC: stable@vger.kernel.org # 4.14+
      Reported-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • fs: Teach path_connected to handle nfs filesystems with multiple roots. · 95dd7758
      Eric W. Biederman committed
      On nfsv2 and nfsv3 the nfs server can export subsets of the same
      filesystem and report the same filesystem identifier, so that the nfs
      client can know they are the same filesystem.  The subsets can be from
      disjoint directory trees.  The nfsv2 and nfsv3 filesystems provide no
      way to find the common root of all directory trees exported from the
      server with the same filesystem identifier.
      
      The practical result is that in struct super, s_root for nfs is
      not necessarily the root of the filesystem.  The nfs mount code sets
      s_root to the root of the first subset of the nfs filesystem that the
      kernel mounts.
      
      This affects the dcache invalidation code in generic_shutdown_super,
      currently called shrink_dcache_for_umount, and that code for years
      has gone through an additional list of dentries that might be dentry
      trees that need to be freed to accommodate nfs.
      
      When I wrote path_connected I did not realize nfs was so special, and
      its heuristic for avoiding calling is_subdir can fail.
      
      The practical case where this fails is when there is a move of a
      directory from the subtree exposed by one nfs mount to the subtree
      exposed by another nfs mount.  This move can happen either locally or
      remotely.  The remote case requires that the moved directory be cached
      before the move and that after the move someone walks the path
      to where the moved directory now exists and in so doing causes the
      already cached directory to be moved in the dcache through the magic
      of d_splice_alias.
      
      If someone whose working directory is in the moved directory or a
      subdirectory and now starts calling .. from the initial mount of nfs
      (where s_root == mnt_root), then path_connected as a heuristic will
      not bother with the is_subdir check.  As s_root really is not the root
      of the nfs filesystem this heuristic is wrong, and the path may
      actually not be connected and path_connected can fail.
      
      The is_subdir function might be cheap enough that we can call it
      unconditionally.  Verifying that will take some benchmarking and
      the result may not be the same on all kernels this fix needs
      to be backported to.  So I am avoiding that for now.
      
      Filesystems with snapshots such as nilfs and btrfs do something
      similar.  But as the directory tree of the snapshots are disjoint
      from one another and from the main directory tree rename won't move
      things between them and this problem will not occur.
      
      Cc: stable@vger.kernel.org
      Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
      Fixes: 397d425d ("vfs: Test for and handle paths that are unreachable from their mnt_root")
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  5. 15 Mar, 2018 4 commits
    • btrfs: add missing initialization in btrfs_check_shared · 18bf591b
      Edmund Nadolski committed
      This patch addresses an issue that causes fiemap to falsely
      report a shared extent.  The test case is as follows:
      
      xfs_io -f -d -c "pwrite -b 16k 0 64k" -c "fiemap -v" /media/scratch/file5
      sync
      xfs_io  -c "fiemap -v" /media/scratch/file5
      
      which gives the resulting output:
      
      wrote 65536/65536 bytes at offset 0
      64 KiB, 4 ops; 0.0000 sec (121.359 MiB/sec and 7766.9903 ops/sec)
      /media/scratch/file5:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..127]:        24576..24703       128 0x2001
      /media/scratch/file5:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..127]:        24576..24703       128   0x1
      
      This is because btrfs_check_shared calls find_parent_nodes
      repeatedly in a loop, passing a share_check struct to report
      the count of shared extents. But btrfs_check_shared does not
      re-initialize the count value to zero for subsequent calls
      from the loop, resulting in a false share count value. This
      is a regression from 4.13.
      
      With proper re-initialization the test result is as follows:
      
      wrote 65536/65536 bytes at offset 0
      64 KiB, 4 ops; 0.0000 sec (110.035 MiB/sec and 7042.2535 ops/sec)
      /media/scratch/file5:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..127]:        24576..24703       128   0x1
      /media/scratch/file5:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..127]:        24576..24703       128   0x1
      
      which corrects the regression.
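The class of bug described here, a counter that is carried over between loop iterations, can be shown with a small Python model. This is a sketch under stated assumptions and not the btrfs code: `find_parent_nodes` below is a hypothetical stand-in that simply adds a reference count to the shared counter, and `check_shared` is a simplified loop around it.

```python
def find_parent_nodes(refs_found, share_check):
    """Hypothetical stand-in: accumulate references into the counter and
    report "shared" once more than one reference has been counted."""
    share_check["share_count"] += refs_found
    return share_check["share_count"] > 1


def check_shared(refs_per_call, reinitialize):
    """Call find_parent_nodes once per entry, as the loop in the text does."""
    share_check = {"share_count": 0}
    for refs_found in refs_per_call:
        if reinitialize:
            share_check["share_count"] = 0   # the missing initialization
        if find_parent_nodes(refs_found, share_check):
            return True
    return False


# Two calls that each see a single, unshared reference.
buggy = check_shared([1, 1], reinitialize=False)
fixed = check_shared([1, 1], reinitialize=True)
print(buggy, fixed)
```

Without the reset, the second call starts from a stale count and the extent is reported as shared even though no single call saw more than one reference.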
      
      Fixes: 3ec4d323 ("btrfs: allow backref search checks for shared extents")
      Signed-off-by: Edmund Nadolski <enadolski@suse.com>
      [ add text from cover letter to changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Fix NULL pointer exception in find_bio_stripe · 047fdea6
      Dmitriy Gorokh committed
      On detaching of a disk which is a part of a RAID6 filesystem, the
      following kernel OOPS may happen:
      
      [63122.680461] BTRFS error (device sdo): bdev /dev/sdo errs: wr 0, rd 0, flush 1, corrupt 0, gen 0
      [63122.719584] BTRFS warning (device sdo): lost page write due to IO error on /dev/sdo
      [63122.719587] BTRFS error (device sdo): bdev /dev/sdo errs: wr 1, rd 0, flush 1, corrupt 0, gen 0
      [63122.803516] BTRFS warning (device sdo): lost page write due to IO error on /dev/sdo
      [63122.803519] BTRFS error (device sdo): bdev /dev/sdo errs: wr 2, rd 0, flush 1, corrupt 0, gen 0
      [63122.863902] BTRFS critical (device sdo): fatal error on device /dev/sdo
      [63122.935338] BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
      [63122.946554] IP: fail_bio_stripe+0x58/0xa0 [btrfs]
      [63122.958185] PGD 9ecda067 P4D 9ecda067 PUD b2b37067 PMD 0
      [63122.971202] Oops: 0000 [#1] SMP
      [63123.006760] CPU: 0 PID: 3979 Comm: kworker/u8:9 Tainted: G W 4.14.2-16-scst34x+ #8
      [63123.007091] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      [63123.007402] Workqueue: btrfs-worker btrfs_worker_helper [btrfs]
      [63123.007595] task: ffff880036ea4040 task.stack: ffffc90006384000
      [63123.007796] RIP: 0010:fail_bio_stripe+0x58/0xa0 [btrfs]
      [63123.007968] RSP: 0018:ffffc90006387ad8 EFLAGS: 00010287
      [63123.008140] RAX: 0000000000000002 RBX: ffff88004beaa0b8 RCX: ffff8800b2bd5690
      [63123.008359] RDX: 0000000000000000 RSI: ffff88007bb43500 RDI: ffff88004beaa000
      [63123.008621] RBP: ffffc90006387ae8 R08: 0000000099100000 R09: ffff8800b2bd5600
      [63123.008840] R10: 0000000000000004 R11: 0000000000010000 R12: ffff88007bb43500
      [63123.009059] R13: 00000000fffffffb R14: ffff880036fc5180 R15: 0000000000000004
      [63123.009278] FS: 0000000000000000(0000) GS:ffff8800b7000000(0000) knlGS:0000000000000000
      [63123.009564] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [63123.009748] CR2: 0000000000000080 CR3: 00000000b0866000 CR4: 00000000000406f0
      [63123.009969] Call Trace:
      [63123.010085] raid_write_end_io+0x7e/0x80 [btrfs]
      [63123.010251] bio_endio+0xa1/0x120
      [63123.010378] generic_make_request+0x218/0x270
      [63123.010921] submit_bio+0x66/0x130
      [63123.011073] finish_rmw+0x3fc/0x5b0 [btrfs]
      [63123.011245] full_stripe_write+0x96/0xc0 [btrfs]
      [63123.011428] raid56_parity_write+0x117/0x170 [btrfs]
      [63123.011604] btrfs_map_bio+0x2ec/0x320 [btrfs]
      [63123.011759] ? ___cache_free+0x1c5/0x300
      [63123.011909] __btrfs_submit_bio_done+0x26/0x50 [btrfs]
      [63123.012087] run_one_async_done+0x9c/0xc0 [btrfs]
      [63123.012257] normal_work_helper+0x19e/0x300 [btrfs]
      [63123.012429] btrfs_worker_helper+0x12/0x20 [btrfs]
      [63123.012656] process_one_work+0x14d/0x350
      [63123.012888] worker_thread+0x4d/0x3a0
      [63123.013026] ? _raw_spin_unlock_irqrestore+0x15/0x20
      [63123.013192] kthread+0x109/0x140
      [63123.013315] ? process_scheduled_works+0x40/0x40
      [63123.013472] ? kthread_stop+0x110/0x110
      [63123.013610] ret_from_fork+0x25/0x30
      [63123.014469] RIP: fail_bio_stripe+0x58/0xa0 [btrfs] RSP: ffffc90006387ad8
      [63123.014678] CR2: 0000000000000080
      [63123.016590] ---[ end trace a295ea7259c17880 ]---
      
      This is reproducible in a cycle, where a series of writes is followed by
      a SCSI device delete command. The test may take up to a few minutes.
      
      Fixes: 74d46992 ("block: replace bi_bdev with a gendisk pointer and partitions index")
      [ no signed-off-by provided ]
      Author: Dmitriy Gorokh <Dmitriy.Gorokh@wdc.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • fs/aio: Use RCU accessors for kioctx_table->table[] · d0264c01
      Tejun Heo committed
      While converting ioctx index from a list to a table, db446a08
      ("aio: convert the ioctx list to table lookup v3") missed tagging
      kioctx_table->table[] as an array of RCU pointers and using the
      appropriate RCU accessors.  This introduces a small window in the
      lookup path where init and access may race.
      
      Mark kioctx_table->table[] with __rcu and use the appropriate RCU
      accessors when using the field.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Jann Horn <jannh@google.com>
      Fixes: db446a08 ("aio: convert the ioctx list to table lookup v3")
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: stable@vger.kernel.org # v3.12+
    • fs/aio: Add explicit RCU grace period when freeing kioctx · a6d7cff4
      Tejun Heo committed
      While fixing refcounting, e34ecee2 ("aio: Fix a trinity splat")
      incorrectly removed explicit RCU grace period before freeing kioctx.
      The intention seems to be depending on the internal RCU grace periods
      of percpu_ref; however, percpu_ref uses a different flavor of RCU,
      sched-RCU.  This can lead to kioctx being freed while RCU read
      protected dereferences are still in progress.
      
      Fix it by updating free_ioctx() to go through call_rcu() explicitly.
      
      v2: Comment added to explain double bouncing.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Jann Horn <jannh@google.com>
      Fixes: e34ecee2 ("aio: Fix a trinity splat")
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: stable@vger.kernel.org # v3.13+
  6. 09 Mar, 2018 3 commits
  7. 08 Mar, 2018 1 commit
  8. 07 Mar, 2018 1 commit
  9. 02 Mar, 2018 3 commits
  10. 01 Mar, 2018 9 commits
    • ceph: fix potential memory leak in init_caches() · 1c789249
      Chengguang Xu committed
      A cache destroy operation for ceph_file_cachep is missing
      when failing from fscache register.
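The general pattern behind this fix, releasing everything that was created so far when a later initialization step fails, can be sketched in Python. This is a user-space illustration only: the cache names, `cache_create`, `cache_destroy` and the `register_fscache` callback are hypothetical stand-ins, not the ceph functions.

```python
live_caches = set()   # stands in for the caches currently allocated


def cache_create(name):
    live_caches.add(name)
    return name


def cache_destroy(name):
    live_caches.discard(name)


def init_caches(register_fscache):
    """Create all caches, then register; undo every creation on failure."""
    created = []
    try:
        for name in ("inode_cache", "dentry_cache", "file_cache"):
            created.append(cache_create(name))
        register_fscache()            # hypothetical step that may raise OSError
    except OSError:
        # Unwind in reverse order so that no cache is left behind.
        for name in reversed(created):
            cache_destroy(name)
        return False
    return True


def failing_register():
    raise OSError("fscache registration failed")


ok = init_caches(failing_register)
print(ok, sorted(live_caches))
```

Tracking what has been created in one list makes it hard to forget a single cache in the unwind path, which is the kind of omission the commit addresses.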
      Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
      Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • Btrfs: fix log replay failure after unlink and link combination · 1f250e92
      Filipe Manana committed
      If we have a file with 2 (or more) hard links in the same directory,
      remove one of the hard links, create a new file (or link an existing file)
      in the same directory with the name of the removed hard link, and then
      finally fsync the new file, we end up with a log that fails to replay,
      causing a mount failure.
      
      Example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/testdir
        $ touch /mnt/testdir/foo
        $ ln /mnt/testdir/foo /mnt/testdir/bar
      
        $ sync
      
        $ unlink /mnt/testdir/bar
        $ touch /mnt/testdir/bar
        $ xfs_io -c "fsync" /mnt/testdir/bar
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        mount: mount(2) failed: /mnt: No such file or directory
      
      When replaying the log, for that example, we also see the following in
      dmesg/syslog:
      
        [71813.671307] BTRFS info (device dm-0): failed to delete reference to bar, inode 258 parent 257
        [71813.674204] ------------[ cut here ]------------
        [71813.675694] BTRFS: Transaction aborted (error -2)
        [71813.677236] WARNING: CPU: 1 PID: 13231 at fs/btrfs/inode.c:4128 __btrfs_unlink_inode+0x17b/0x355 [btrfs]
        [71813.679669] Modules linked in: btrfs xfs f2fs dm_flakey dm_mod dax ghash_clmulni_intel ppdev pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper evdev psmouse i2c_piix4 parport_pc i2c_core pcspkr sg serio_raw parport button sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel floppy virtio e1000 scsi_mod [last unloaded: btrfs]
        [71813.679669] CPU: 1 PID: 13231 Comm: mount Tainted: G        W        4.15.0-rc9-btrfs-next-56+ #1
        [71813.679669] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
        [71813.679669] RIP: 0010:__btrfs_unlink_inode+0x17b/0x355 [btrfs]
        [71813.679669] RSP: 0018:ffffc90001cef738 EFLAGS: 00010286
        [71813.679669] RAX: 0000000000000025 RBX: ffff880217ce4708 RCX: 0000000000000001
        [71813.679669] RDX: 0000000000000000 RSI: ffffffff81c14bae RDI: 00000000ffffffff
        [71813.679669] RBP: ffffc90001cef7c0 R08: 0000000000000001 R09: 0000000000000001
        [71813.679669] R10: ffffc90001cef5e0 R11: ffffffff8343f007 R12: ffff880217d474c8
        [71813.679669] R13: 00000000fffffffe R14: ffff88021ccf1548 R15: 0000000000000101
        [71813.679669] FS:  00007f7cee84c480(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
        [71813.679669] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [71813.679669] CR2: 00007f7cedc1abf9 CR3: 00000002354b4003 CR4: 00000000001606e0
        [71813.679669] Call Trace:
        [71813.679669]  btrfs_unlink_inode+0x17/0x41 [btrfs]
        [71813.679669]  drop_one_dir_item+0xfa/0x131 [btrfs]
        [71813.679669]  add_inode_ref+0x71e/0x851 [btrfs]
        [71813.679669]  ? __lock_is_held+0x39/0x71
        [71813.679669]  ? replay_one_buffer+0x53/0x53a [btrfs]
        [71813.679669]  replay_one_buffer+0x4a4/0x53a [btrfs]
        [71813.679669]  ? rcu_read_unlock+0x3a/0x57
        [71813.679669]  ? __lock_is_held+0x39/0x71
        [71813.679669]  walk_up_log_tree+0x101/0x1d2 [btrfs]
        [71813.679669]  walk_log_tree+0xad/0x188 [btrfs]
        [71813.679669]  btrfs_recover_log_trees+0x1fa/0x31e [btrfs]
        [71813.679669]  ? replay_one_extent+0x544/0x544 [btrfs]
        [71813.679669]  open_ctree+0x1cf6/0x2209 [btrfs]
        [71813.679669]  btrfs_mount_root+0x368/0x482 [btrfs]
        [71813.679669]  ? trace_hardirqs_on_caller+0x14c/0x1a6
        [71813.679669]  ? __lockdep_init_map+0x176/0x1c2
        [71813.679669]  ? mount_fs+0x64/0x10b
        [71813.679669]  mount_fs+0x64/0x10b
        [71813.679669]  vfs_kern_mount+0x68/0xce
        [71813.679669]  btrfs_mount+0x13e/0x772 [btrfs]
        [71813.679669]  ? trace_hardirqs_on_caller+0x14c/0x1a6
        [71813.679669]  ? __lockdep_init_map+0x176/0x1c2
        [71813.679669]  ? mount_fs+0x64/0x10b
        [71813.679669]  mount_fs+0x64/0x10b
        [71813.679669]  vfs_kern_mount+0x68/0xce
        [71813.679669]  do_mount+0x6e5/0x973
        [71813.679669]  ? memdup_user+0x3e/0x5c
        [71813.679669]  SyS_mount+0x72/0x98
        [71813.679669]  entry_SYSCALL_64_fastpath+0x1e/0x8b
        [71813.679669] RIP: 0033:0x7f7cedf150ba
        [71813.679669] RSP: 002b:00007ffca71da688 EFLAGS: 00000206
        [71813.679669] Code: 7f a0 e8 51 0c fd ff 48 8b 43 50 f0 0f ba a8 30 2c 00 00 02 72 17 41 83 fd fb 74 11 44 89 ee 48 c7 c7 7d 11 7f a0 e8 38 f5 8d e0 <0f> ff 44 89 e9 ba 20 10 00 00 eb 4d 48 8b 4d b0 48 8b 75 88 4c
        [71813.679669] ---[ end trace 83bd473fc5b4663b ]---
        [71813.854764] BTRFS: error (device dm-0) in __btrfs_unlink_inode:4128: errno=-2 No such entry
        [71813.886994] BTRFS: error (device dm-0) in btrfs_replay_log:2307: errno=-2 No such entry (Failed to recover log tree)
        [71813.903357] BTRFS error (device dm-0): cleaner transaction attach returned -30
        [71814.128078] BTRFS error (device dm-0): open_ctree failed
      
      This happens because the log has inode reference items for both inode 258
      (the first file we created) and inode 259 (the second file created), and
      when processing the reference item for inode 258, we replace the
      corresponding item in the subvolume tree (which has two names, "foo" and
      "bar") with the one in the log (which only has one name, "foo") without
      removing the corresponding dir index keys from the parent directory.
      Later, when processing the inode reference item for inode 259, which has
      a name of "bar" associated to it, we notice that dir index entries exist
      for that name and for a different inode, so we attempt to unlink that
      name, which fails because the inode reference item for inode 258 no longer
      has the name "bar" associated to it, making a call to btrfs_unlink_inode()
      fail with a -ENOENT error.
      
      Fix this by unlinking all the names in an inode reference item from a
      subvolume tree that are not present in the inode reference item found in
      the log tree, before overwriting it with the item from the log tree.
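
      The set of names to unlink can be modeled as a difference between the
      two inode reference items. This is only a minimal sketch, not the kernel
      code: the function name names_to_unlink is hypothetical, and names are
      modeled as plain strings rather than btrfs inode reference items.

```python
def names_to_unlink(subvol_names, log_names):
    """Return the names held by the subvolume tree's inode reference item
    that are missing from the log tree's item, keeping their order."""
    log_set = set(log_names)
    return [name for name in subvol_names if name not in log_set]


# Inode 258 from the scenario above: the subvolume tree item has the names
# "foo" and "bar", while the item in the log only has "foo".
print(names_to_unlink(["foo", "bar"], ["foo"]))  # ['bar']
```

      Unlinking "bar" from inode 258 first means that the later reference item
      for inode 259 no longer finds a dir index entry for that name pointing
      at a different inode.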
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1f250e92
    • F
      Btrfs: fix log replay failure after linking special file and fsync · 9a6509c4
      Filipe Manana committed
      If, in the same transaction, we rename a special file (fifo,
      character/block device or symbolic link), create a hard link for it with
      its old name, and then sync the log, we end up with a log that cannot be
      replayed: when attempting to replay it, an EEXIST error is returned and
      mounting the filesystem fails. Example scenario:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
        $ mkdir /mnt/testdir
        $ mkfifo /mnt/testdir/foo
        # Make sure everything done so far is durably persisted.
        $ sync
      
        # Create some unrelated file and fsync it, this is just to create a log
        # tree. The file must be in the same directory as our special file.
        $ touch /mnt/testdir/f1
        $ xfs_io -c "fsync" /mnt/testdir/f1
      
        # Rename our special file and then create a hard link with its old name.
        $ mv /mnt/testdir/foo /mnt/testdir/bar
        $ ln /mnt/testdir/bar /mnt/testdir/foo
      
        # Create some other unrelated file and fsync it, this is just to persist
        # the log tree which was modified by the previous rename and link
        # operations. Alternatively we could have modified file f1 and fsync it.
        $ touch /mnt/f2
        $ xfs_io -c "fsync" /mnt/f2
      
        <power failure>
      
        $ mount /dev/sdc /mnt
        mount: mount /dev/sdc on /mnt failed: File exists
      
      This happens because both the log tree and the subvolume's tree have
      an entry in the directory "testdir" with the same name, that is, there
      is one key (258 INODE_REF 257) in the subvolume tree and another one in
      the log tree (where 258 is the inode number of our special file and 257
      is the inode number of directory "testdir"). Only the data of those two
      keys differs: in the subvolume tree the index field of the inode
      reference has a value of 3, while in the log tree it has a value of 5.
      Because the same key exists in both trees, but with a different index,
      the log replay fails with an -EEXIST error when attempting to replay the
      inode reference from the log tree.
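
      The conflict described above can be illustrated with a toy model. This
      is only a sketch under stated assumptions: each tree is modeled as a
      plain dictionary mapping a key tuple to the index value, and the helper
      name conflicting_refs is hypothetical. It only detects the mismatch and
      does not model how log replay actually processes items.

```python
def conflicting_refs(subvol_tree, log_tree):
    """Return keys present in both trees whose index values differ,
    mapped to a (subvolume index, log index) pair."""
    shared = subvol_tree.keys() & log_tree.keys()
    return {
        key: (subvol_tree[key], log_tree[key])
        for key in shared
        if subvol_tree[key] != log_tree[key]
    }


# Key and index values taken from the description above.
key = (258, "INODE_REF", 257)
subvol_tree = {key: 3}
log_tree = {key: 5}
print(conflicting_refs(subvol_tree, log_tree))
```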
      
      Fix this by setting the last_unlink_trans field of the inode (our special
      file) to the current transaction id when a hard link is created, as this
      forces logging the parent directory inode, solving the conflict at log
      replay time.
      
      A new generic test case for fstests was also submitted.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9a6509c4
    • F
      Btrfs: send, fix issuing write op when processing hole in no data mode · d4dfc0f4
      Filipe Manana committed
      When doing an incremental send of a filesystem with the no-holes feature
      enabled, we end up issuing a write operation when using the no data mode
      send flag, instead of issuing an update extent operation. Fix this by
      issuing the update extent operation instead.
      
      Trivial reproducer:
      
        $ mkfs.btrfs -f -O no-holes /dev/sdc
        $ mkfs.btrfs -f /dev/sdd
        $ mount /dev/sdc /mnt/sdc
        $ mount /dev/sdd /mnt/sdd
      
        $ xfs_io -f -c "pwrite -S 0xab 0 32K" /mnt/sdc/foobar
        $ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap1
      
        $ xfs_io -c "fpunch 8K 8K" /mnt/sdc/foobar
        $ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap2
      
        $ btrfs send /mnt/sdc/snap1 | btrfs receive /mnt/sdd
        $ btrfs send --no-data -p /mnt/sdc/snap1 /mnt/sdc/snap2 \
             | btrfs receive -vv /mnt/sdd
      
      Before this change the output of the second receive command is:
      
        receiving snapshot snap2 uuid=f6922049-8c22-e544-9ff9-fc6755918447...
        utimes
        write foobar, offset 8192, len 8192
        utimes foobar
        BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=f6922049-8c22-e544-9ff9-...
      
      After this change it is:
      
        receiving snapshot snap2 uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64...
        utimes
        update_extent foobar: offset=8192, len=8192
        utimes foobar
        BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64...
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d4dfc0f4
    • A
      btrfs: use proper endianness accessors for super_copy · 3c181c12
      Anand Jain committed
      The fs_info::super_copy is a byte copy of the on-disk structure and all
      members must use the accessor macros/functions to obtain the right
      value.  This was missing in update_super_roots and in sysfs readers.
      
      Moving a filesystem between hosts of opposite endianness will report
      bogus numbers in sysfs, and mount may fail as the root will not be
      restored correctly. If the filesystem is always used on hosts of the
      same endianness, this will not be a problem.
      
      Fix this by using the btrfs_set_super...() functions to set
      fs_info::super_copy values, and for the sysfs, use the cached
      fs_info::nodesize/sectorsize values.
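
      Why a raw read goes wrong can be shown outside the kernel. The following
      is a minimal sketch, not btrfs code: it assumes a 32-bit little-endian
      on-disk field, the helper names read_le32 and
      raw_read_on_big_endian_host are hypothetical, and 16384 is only an
      example value.

```python
import struct


def read_le32(raw):
    """Decode a 32-bit little-endian on-disk field, independent of the
    byte order of the host doing the read (the role of an accessor)."""
    return struct.unpack("<I", raw)[0]


def raw_read_on_big_endian_host(raw):
    """Model a big-endian host interpreting the same bytes natively,
    that is, without going through an accessor."""
    return struct.unpack(">I", raw)[0]


on_disk = struct.pack("<I", 16384)  # example value stored little-endian
print(read_le32(on_disk))                    # 16384
print(raw_read_on_big_endian_host(on_disk))  # 4194304
```

      On a little-endian host both reads agree, which is why the problem only
      shows up when a filesystem moves between hosts of opposite endianness.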
      
      CC: stable@vger.kernel.org
      Fixes: df93589a ("btrfs: export more from FS_INFO to sysfs")
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      3c181c12
    • H
      btrfs: alloc_chunk: fix DUP stripe size handling · 92e222df
      Hans van Kranenburg committed
      In case of using DUP, we search for enough unallocated disk space on a
      device to hold two stripes.
      
      The devices_info[ndevs-1].max_avail that holds the amount of unallocated
      space found is directly assigned to stripe_size, while it's actually
      twice the stripe size.
      
      Later on in the code, an unconditional division of stripe_size by
      dev_stripes corrects the value, but before that there is a check that
      stripe_size does not exceed max_chunk_size. Since stripe_size is twice
      the intended amount during this check, the check reduces stripe_size to
      max_chunk_size whenever the correct stripe_size is more than half of
      max_chunk_size.
      
      The unconditional division later tries to correct stripe_size, but will
      actually make sure we can't allocate more than half the max_chunk_size.
      
      Fix this by moving the division by dev_stripes before the max chunk size
      check, so it always contains the right value, instead of putting a duct
      tape division in further on to get it fixed again.
      
      Since dev_stripes is 1 in all cases other than DUP, this change only
      affects DUP.
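
      The effect of the ordering can be checked with a small calculation. This
      is a simplified model, not the allocator code: the size check is reduced
      to a plain min(), any rounding the real code may apply is left out, the
      function names are hypothetical, and the 4 GiB / 1 GiB figures are
      assumed example values.

```python
GIB = 1 << 30
MIB = 1 << 20


def dup_stripe_size_before(max_avail, max_chunk_size, dev_stripes=2):
    """Old order: clamp the still-doubled value, then divide."""
    stripe_size = min(max_avail, max_chunk_size)
    return stripe_size // dev_stripes


def dup_stripe_size_after(max_avail, max_chunk_size, dev_stripes=2):
    """New order: divide first, so the clamp sees the real stripe size."""
    stripe_size = max_avail // dev_stripes
    return min(stripe_size, max_chunk_size)


# Assumed example: 4 GiB of unallocated space, 1 GiB max chunk size.
print(dup_stripe_size_before(4 * GIB, 1 * GIB) // MIB, "MiB")  # 512 MiB
print(dup_stripe_size_after(4 * GIB, 1 * GIB) // MIB, "MiB")   # 1024 MiB
```

      With dev_stripes set to 1, both functions return the same value, which
      matches the statement that only DUP is affected.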
      
      Other attempts in the past were made to fix this:
      * 37db63a4 "Btrfs: fix max chunk size check in chunk allocator" tried
      to fix the same problem, but still resulted in part of the code acting
      on a wrongly doubled stripe_size value.
      * 86db2578 "Btrfs: fix max chunk size on raid5/6" unintentionally
      broke this fix again.
      
      The real problem was already introduced with the rest of the code in
      73c5de00.
      
      The user visible result however will be that the max chunk size for DUP
      will suddenly double, while it's actually acting according to the limits
      in the code again like it was 5 years ago.
      Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
      Link: https://www.spinics.net/lists/linux-btrfs/msg69752.html
      Fixes: 73c5de00 ("btrfs: quasi-round-robin for chunk allocation")
      Fixes: 86db2578 ("Btrfs: fix max chunk size on raid5/6")
      Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update comment ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      92e222df
    • N
      btrfs: Handle btrfs_set_extent_delalloc failure in relocate_file_extent_cluster · 765f3ceb
      Nikolay Borisov committed
      Essentially duplicate the error handling from the above block which
      handles the !PageUptodate(page) case and additionally clear
      EXTENT_BOUNDARY.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      765f3ceb
    • N
      btrfs: handle failure of add_pending_csums · ac01f26a
      Nikolay Borisov committed
      add_pending_csums was added as part of the new data=ordered
      implementation in e6dcd2dc ("Btrfs: New data=ordered
      implementation"). Even back then it called btrfs_csum_file_blocks,
      which can fail, but it never handled the failure. In an ENOMEM
      situation this could lead to the filesystem failing to write the
      checksums for a particular extent without detecting it. On read this
      could lead to the filesystem erroring out due to a crc mismatch. Fix it
      by propagating failures from add_pending_csums and handling them.
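
      The propagation pattern can be sketched in a few lines. This is an
      illustration only, not the kernel function: checksums are modeled as
      plain strings, the writer callback stands in for the call that can
      fail, and failing_writer is a hypothetical helper that simulates an
      out-of-memory failure on one item.

```python
import errno


def add_pending_csums(sums, csum_file_blocks):
    """Write each pending checksum and return the first nonzero error
    code to the caller instead of silently dropping it."""
    for item in sums:
        ret = csum_file_blocks(item)
        if ret:
            return ret
    return 0


def failing_writer(item):
    """Hypothetical writer that fails with -ENOMEM on the second item."""
    return -errno.ENOMEM if item == "sum2" else 0


ret = add_pending_csums(["sum1", "sum2", "sum3"], failing_writer)
print(ret == -errno.ENOMEM)  # True
```

      The caller can then act on the returned error rather than continuing as
      if all checksums had been written.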
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ac01f26a
    • J
      btrfs: use kvzalloc to allocate btrfs_fs_info · a8fd1f71
      Jeff Mahoney committed
      The srcu_struct in btrfs_fs_info scales in size with NR_CPUS.  On
      kernels built with NR_CPUS=8192, this can result in kmalloc failures
      that prevent mounting.
      
      There is work in progress to try to resolve this for every user of
      srcu_struct but using kvzalloc will work around the failures until
      that is complete.
      
      As an example with NR_CPUS=512 on x86_64: the overall size of
      subvol_srcu is 3460 bytes, fs_info is 6496.
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a8fd1f71