1. 13 12月, 2018 7 次提交
    • D
      xfs: const-ify xfs_owner_info arguments · 66e3237e
      Darrick J. Wong 提交于
      Only certain functions actually change the contents of an
      xfs_owner_info; the rest can accept a const struct pointer.  This will
      enable us to save stack space by hoisting static owner info types to
      be const global variables.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      66e3237e
    • D
      xfs: streamline defer op type handling · 02b100fb
      Darrick J. Wong 提交于
      There's no need to bundle a pointer to the defer op type into the defer
      op control structure.  Instead, store the defer op type enum, which
      enables us to shorten some of the lines.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      02b100fb
    • D
      xfs: idiotproof defer op type configuration · bc9f2b7c
      Darrick J. Wong 提交于
      Recently, we forgot to port a new defer op type to xfsprogs, which
      caused us some userspace pain.  Reorganize the way we make libxfs
      clients supply defer op type information so that all type information
      has to be provided at build time instead of risky runtime dynamic
      configuration.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      bc9f2b7c
    • D
      xfs: zero length symlinks are not valid · 43feeea8
      Dave Chinner 提交于
      A log recovery failure has been reproduced where a symlink inode has
      a zero length in extent form. It was caused by a shutdown during a
      combined fstress+fsmark workload.
      
      The underlying problem is the issue in xfs_inactive_symlink(): the
      inode is unlocked between the symlink inactivation/truncation and
      the inode being freed. This opens a window for the inode to be
      written to disk before it xfs_ifree() removes it from the unlinked
      list, marks it free in the inobt and zeros the mode.
      
      For shortform inodes, the fix is simple. xfs_ifree() clears the data
      fork state, so there's no need to do it in xfs_inactive_symlink().
      This means the shortform fork verifier will not see a zero length
      data fork as it mirrors the inode size through to xfs_ifree()), and
      hence if the inode gets written back and the fork verifiers are run
      they will still see a fork that matches the on-disk inode size.
      
      For extent form (remote) symlinks, it is a little more tricky. Here
      we explicitly set the inode size to zero, so the above race can lead
      to zero length symlinks on disk. Because the inode is unlinked at
      this point (i.e. on the unlinked list) and unreferenced, it can
      never be seen again by a user. Hence when we set the inode size to
      zeor, also change the type to S_IFREG. xfs_ifree() expects S_IFREG
      inodes to be of zero length, and so this avoids all the problems of
      zero length symlinks ever hitting the disk. It also avoids the
      problem of needing to handle zero length symlink inodes in log
      recovery to replay the extent free intents and the remaining
      deferops to free the extents the symlink used.
      
      Also add a couple of asserts to warn us if zero length symlinks end
      up in either the symlink create or inactivation paths.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      43feeea8
    • C
      xfs: clean up indentation issues, remove an unwanted space · 8c4ce794
      Colin Ian King 提交于
      There is a statement that has an unwanted space in the indentation.
      Remove it.
      Signed-off-by: NColin Ian King <colin.king@canonical.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      8c4ce794
    • P
      xfs: libxfs: move xfs_perag_put late · fe5ed6c2
      Pan Bian 提交于
      The function xfs_alloc_get_freelist calls xfs_perag_put to drop the
      reference. However, pag->pagf_btreeblks is read and written after the
      put operation. This patch moves the put operation later.
      Signed-off-by: NPan Bian <bianpan2016@163.com>
      Reviewed-by: NCarlos Maiolino <cmaiolino@redhat.com>
      [darrick: minor changelog edits]
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      fe5ed6c2
    • D
      xfs: split up the xfs_reflink_end_cow work into smaller transactions · d6f215f3
      Darrick J. Wong 提交于
      In xfs_reflink_end_cow, we allocate a single transaction for the entire
      end_cow operation and then loop the CoW fork mappings to move them to
      the data fork.  This design fails on a heavily fragmented filesystem
      where an inode's data fork has exactly one more extent than would fit in
      an extents-format fork, because the unmap can collapse the data fork
      into extents format (freeing the bmbt block) but the remap can expand
      the data fork back into a (newly allocated) bmbt block.  If the number
      of extents we end up remapping is large, we can overflow the block
      reservation because we reserved blocks assuming that we were adding
      mappings into an already-cleared area of the data fork.
      
      Let's say we have 8 extents in the data fork, 8 extents in the CoW fork,
      and the data fork can hold at most 7 extents before needing to convert
      to btree format; and that blocks A-P are discontiguous single-block
      extents:
      
         0......7
      D: ABCDEFGH
      C: IJKLMNOP
      
      When a write to file blocks 0-7 completes, we must remap I-P into the
      data fork.  We start by removing H from the btree-format data fork.  Now
      we have 7 extents, so we convert the fork to extents format, freeing the
      bmbt block.   We then move P into the data fork and it now has 8 extents
      again.  We must convert the data fork back to btree format, requiring a
      block allocation.  If we repeat this sequence for blocks 6-5-4-3-2-1-0,
      we'll need a total of 8 block allocations to remap all 8 blocks.  We
      reserved only enough blocks to handle one btree split (5 blocks on a 4k
      block filesystem), which means we overflow the block reservation.
      
      To fix this issue, create a separate helper function to remap a single
      extent, and change _reflink_end_cow to call it in a tight loop over the
      entire range we're completing.  As a side effect this also removes the
      size restrictions on how many extents we can end_cow at a time, though
      nobody ever hit that.  It is not reasonable to reserve N blocks to remap
      N blocks.
      
      Note that this can be reproduced after ~320 million fsx ops while
      running generic/938 (long soak directio fsx exerciser):
      
      XFS: Assertion failed: tp->t_blk_res >= tp->t_blk_res_used, file: fs/xfs/xfs_trans.c, line: 116
      <machine registers snipped>
      Call Trace:
       xfs_trans_dup+0x211/0x250 [xfs]
       xfs_trans_roll+0x6d/0x180 [xfs]
       xfs_defer_trans_roll+0x10c/0x3b0 [xfs]
       xfs_defer_finish_noroll+0xdf/0x740 [xfs]
       xfs_defer_finish+0x13/0x70 [xfs]
       xfs_reflink_end_cow+0x2c6/0x680 [xfs]
       xfs_dio_write_end_io+0x115/0x220 [xfs]
       iomap_dio_complete+0x3f/0x130
       iomap_dio_rw+0x3c3/0x420
       xfs_file_dio_aio_write+0x132/0x3c0 [xfs]
       xfs_file_write_iter+0x8b/0xc0 [xfs]
       __vfs_write+0x193/0x1f0
       vfs_write+0xba/0x1c0
       ksys_write+0x52/0xc0
       do_syscall_64+0x50/0x160
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      d6f215f3
  2. 07 12月, 2018 1 次提交
  3. 06 12月, 2018 2 次提交
  4. 05 12月, 2018 6 次提交
    • M
      dax: Fix unlock mismatch with updated API · 27359fd6
      Matthew Wilcox 提交于
      Internal to dax_unlock_mapping_entry(), dax_unlock_entry() is used to
      store a replacement entry in the Xarray at the given xas-index with the
      DAX_LOCKED bit clear. When called, dax_unlock_entry() expects the unlocked
      value of the entry relative to the current Xarray state to be specified.
      
      In most contexts dax_unlock_entry() is operating in the same scope as
      the matched dax_lock_entry(). However, in the dax_unlock_mapping_entry()
      case the implementation needs to recall the original entry. In the case
      where the original entry is a 'pmd' entry it is possible that the pfn
      performed to do the lookup is misaligned to the value retrieved in the
      Xarray.
      
      Change the api to return the unlock cookie from dax_lock_page() and pass
      it to dax_unlock_page(). This fixes a bug where dax_unlock_page() was
      assuming that the page was PMD-aligned if the entry was a PMD entry with
      signatures like:
      
       WARNING: CPU: 38 PID: 1396 at fs/dax.c:340 dax_insert_entry+0x2b2/0x2d0
       RIP: 0010:dax_insert_entry+0x2b2/0x2d0
       [..]
       Call Trace:
        dax_iomap_pte_fault.isra.41+0x791/0xde0
        ext4_dax_huge_fault+0x16f/0x1f0
        ? up_read+0x1c/0xa0
        __do_fault+0x1f/0x160
        __handle_mm_fault+0x1033/0x1490
        handle_mm_fault+0x18b/0x3d0
      
      Link: https://lkml.kernel.org/r/20181130154902.GL10377@bombadil.infradead.org
      Fixes: 9f32d221 ("dax: Convert dax_lock_mapping_entry to XArray")
      Reported-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NMatthew Wilcox <willy@infradead.org>
      Tested-by: NDan Williams <dan.j.williams@intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      27359fd6
    • D
      iomap: partially revert 4721a601 (simulated directio short read on EFAULT) · 8f67b5ad
      Darrick J. Wong 提交于
      In commit 4721a601, we tried to fix a problem wherein directio reads
      into a splice pipe will bounce EFAULT/EAGAIN all the way out to
      userspace by simulating a zero-byte short read.  This happens because
      some directio read implementations (xfs) will call
      bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous
      reads, but as soon as we run out of pipe buffers that _get_pages call
      returns EFAULT, which the splice code translates to EAGAIN and bounces
      out to userspace.
      
      In that commit, the iomap code catches the EFAULT and simulates a
      zero-byte read, but that causes assertion errors on regular splice reads
      because xfs doesn't allow short directio reads.  This causes infinite
      splice() loops and assertion failures on generic/095 on overlayfs
      because xfs only permit total success or total failure of a directio
      operation.  The underlying issue in the pipe splice code has now been
      fixed by changing the pipe splice loop to avoid avoid reading more data
      than there is space in the pipe.
      
      Therefore, it's no longer necessary to simulate the short directio, so
      remove the hack from iomap.
      
      Fixes: 4721a601 ("iomap: dio data corruption and spurious errors when pipes fill")
      Reported-by: NMurphy Zhou <jencce.kernel@gmail.com>
      Ranted-by: NAmir Goldstein <amir73il@gmail.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      8f67b5ad
    • D
      splice: don't read more than available pipe space · 17614445
      Darrick J. Wong 提交于
      In commit 4721a601, we tried to fix a problem wherein directio reads
      into a splice pipe will bounce EFAULT/EAGAIN all the way out to
      userspace by simulating a zero-byte short read.  This happens because
      some directio read implementations (xfs) will call
      bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous
      reads, but as soon as we run out of pipe buffers that _get_pages call
      returns EFAULT, which the splice code translates to EAGAIN and bounces
      out to userspace.
      
      In that commit, the iomap code catches the EFAULT and simulates a
      zero-byte read, but that causes assertion errors on regular splice reads
      because xfs doesn't allow short directio reads.
      
      The brokenness is compounded by splice_direct_to_actor immediately
      bailing on do_splice_to returning <= 0 without ever calling ->actor
      (which empties out the pipe), so if userspace calls back we'll EFAULT
      again on the full pipe, and nothing ever gets copied.
      
      Therefore, teach splice_direct_to_actor to clamp its requests to the
      amount of free space in the pipe and remove the simulated short read
      from the iomap directio code.
      
      Fixes: 4721a601 ("iomap: dio data corruption and spurious errors when pipes fill")
      Reported-by: NMurphy Zhou <jencce.kernel@gmail.com>
      Ranted-by: NAmir Goldstein <amir73il@gmail.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      17614445
    • D
      vfs: allow some remap flags to be passed to vfs_clone_file_range · 6744557b
      Darrick J. Wong 提交于
      In overlayfs, ovl_remap_file_range calls vfs_clone_file_range on the
      lower filesystem's inode, passing through whatever remap flags it got
      from its caller.  Since vfs_copy_file_range first tries a filesystem's
      remap function with REMAP_FILE_CAN_SHORTEN, this can get passed through
      to the second vfs_copy_file_range call, and this isn't an issue.
      Change the WARN_ON to look only for the DEDUP flag.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NAmir Goldstein <amir73il@gmail.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      6744557b
    • E
      xfs: fix inverted return from xfs_btree_sblock_verify_crc · 7d048df4
      Eric Sandeen 提交于
      xfs_btree_sblock_verify_crc is a bool so should not be returning
      a failaddr_t; worse, if xfs_log_check_lsn fails it returns
      __this_address which looks like a boolean true (i.e. success)
      to the caller.
      
      (interestingly xfs_btree_lblock_verify_crc doesn't have the issue)
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      7d048df4
    • D
      xfs: fix PAGE_MASK usage in xfs_free_file_space · a579121f
      Darrick J. Wong 提交于
      In commit e53c4b59, I *tried* to teach xfs to force writeback when we
      fzero/fpunch right up to EOF so that if EOF is in the middle of a page,
      the post-EOF part of the page gets zeroed before we return to userspace.
      Unfortunately, I missed the part where PAGE_MASK is ~(PAGE_SIZE - 1),
      which means that we totally fail to zero if we're fpunching and EOF is
      within the first page.  Worse yet, the same PAGE_MASK thinko plagues the
      filemap_write_and_wait_range call, so we'd initiate writeback of the
      entire file, which (mostly) masked the thinko.
      
      Drop the tricky PAGE_MASK and replace it with correct usage of PAGE_SIZE
      and the proper rounding macros.
      
      Fixes: e53c4b59 ("xfs: ensure post-EOF zeroing happens after zeroing part of a file")
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      a579121f
  5. 04 12月, 2018 2 次提交
    • R
      Revert "exec: make de_thread() freezable" · a72173ec
      Rafael J. Wysocki 提交于
      Revert commit c2239788 "exec: make de_thread() freezable" as
      requested by Ingo Molnar:
      
      "So there's a new regression in v4.20-rc4, my desktop produces this
      lockdep splat:
      
      [ 1772.588771] WARNING: pkexec/4633 still has locks held!
      [ 1772.588773] 4.20.0-rc4-custom-00213-g93a49841322b #1 Not tainted
      [ 1772.588775] ------------------------------------
      [ 1772.588776] 1 lock held by pkexec/4633:
      [ 1772.588778]  #0: 00000000ed85fbf8 (&sig->cred_guard_mutex){+.+.}, at: prepare_bprm_creds+0x2a/0x70
      [ 1772.588786] stack backtrace:
      [ 1772.588789] CPU: 7 PID: 4633 Comm: pkexec Not tainted 4.20.0-rc4-custom-00213-g93a49841322b #1
      [ 1772.588792] Call Trace:
      [ 1772.588800]  dump_stack+0x85/0xcb
      [ 1772.588803]  flush_old_exec+0x116/0x890
      [ 1772.588807]  ? load_elf_phdrs+0x72/0xb0
      [ 1772.588809]  load_elf_binary+0x291/0x1620
      [ 1772.588815]  ? sched_clock+0x5/0x10
      [ 1772.588817]  ? search_binary_handler+0x6d/0x240
      [ 1772.588820]  search_binary_handler+0x80/0x240
      [ 1772.588823]  load_script+0x201/0x220
      [ 1772.588825]  search_binary_handler+0x80/0x240
      [ 1772.588828]  __do_execve_file.isra.32+0x7d2/0xa60
      [ 1772.588832]  ? strncpy_from_user+0x40/0x180
      [ 1772.588835]  __x64_sys_execve+0x34/0x40
      [ 1772.588838]  do_syscall_64+0x60/0x1c0
      
      The warning gets triggered by an ancient lockdep check in the freezer:
      
      (gdb) list *0xffffffff812ece06
      0xffffffff812ece06 is in flush_old_exec (./include/linux/freezer.h:57).
      52	 * DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION
      53	 * If try_to_freeze causes a lockdep warning it means the caller may deadlock
      54	 */
      55	static inline bool try_to_freeze_unsafe(void)
      56	{
      57		might_sleep();
      58		if (likely(!freezing(current)))
      59			return false;
      60		return __refrigerator(false);
      61	}
      
      I reviewed the ->cred_guard_mutex code, and the mutex is held across all
      of exec() - and we always did this.
      
      But there's this recent -rc4 commit:
      
      > Chanho Min (1):
      >       exec: make de_thread() freezable
      
        c2239788: exec: make de_thread() freezable
      
      I believe this commit is bogus, you cannot call try_to_freeze() from
      de_thread(), because it's holding the ->cred_guard_mutex."
      Reported-by: NIngo Molnar <mingo@kernel.org>
      Tested-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      a72173ec
    • Q
      btrfs: tree-checker: Don't check max block group size as current max chunk size limit is unreliable · 10950929
      Qu Wenruo 提交于
      [BUG]
      A completely valid btrfs will refuse to mount, with error message like:
        BTRFS critical (device sdb2): corrupt leaf: root=2 block=239681536 slot=172 \
          bg_start=12018974720 bg_len=10888413184, invalid block group size, \
          have 10888413184 expect (0, 10737418240]
      
      This has been reported several times as the 4.19 kernel is now being
      used. The filesystem refuses to mount, but is otherwise ok and booting
      4.18 is a workaround.
      
      Btrfs check returns no error, and all kernels used on this fs is later
      than 2011, which should all have the 10G size limit commit.
      
      [CAUSE]
      For a 12 devices btrfs, we could allocate a chunk larger than 10G due to
      stripe stripe bump up.
      
      __btrfs_alloc_chunk()
      |- max_stripe_size = 1G
      |- max_chunk_size = 10G
      |- data_stripe = 11
      |- if (1G * 11 > 10G) {
             stripe_size = 976128930;
             stripe_size = round_up(976128930, SZ_16M) = 989855744
      
      However the final stripe_size (989855744) * 11 = 10888413184, which is
      still larger than 10G.
      
      [FIX]
      For the comprehensive check, we need to do the full check at chunk read
      time, and rely on bg <-> chunk mapping to do the check.
      
      We could just skip the length check for now.
      
      Fixes: fce466ea ("btrfs: tree-checker: Verify block_group_item")
      Cc: stable@vger.kernel.org # v4.19+
      Reported-by: NWang Yugui <wangyugui@e16-tech.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      10950929
  6. 02 12月, 2018 2 次提交
    • D
      nfs: don't dirty kernel pages read by direct-io · ad3cba22
      Dave Kleikamp 提交于
      When we use direct_IO with an NFS backing store, we can trigger a
      WARNING in __set_page_dirty(), as below, since we're dirtying the page
      unnecessarily in nfs_direct_read_completion().
      
      To fix, replicate the logic in commit 53cbf3b1 ("fs: direct-io:
      don't dirtying pages for ITER_BVEC/ITER_KVEC direct read").
      
      Other filesystems that implement direct_IO handle this; most use
      blockdev_direct_IO(). ceph and cifs have similar logic.
      
      mount 127.0.0.1:/export /nfs
      dd if=/dev/zero of=/nfs/image bs=1M count=200
      losetup --direct-io=on -f /nfs/image
      mkfs.btrfs /dev/loop0
      mount -t btrfs /dev/loop0 /mnt/
      
      kernel: WARNING: CPU: 0 PID: 8067 at fs/buffer.c:580 __set_page_dirty+0xaf/0xd0
      kernel: Modules linked in: loop(E) nfsv3(E) rpcsec_gss_krb5(E) nfsv4(E) dns_resolver(E) nfs(E) fscache(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) fuse(E) tun(E) ip6t_rpfilter(E) ipt_REJECT(E) nf_
      kernel:  snd_seq(E) snd_seq_device(E) snd_pcm(E) video(E) snd_timer(E) snd(E) soundcore(E) ip_tables(E) xfs(E) libcrc32c(E) sd_mod(E) sr_mod(E) cdrom(E) ata_generic(E) pata_acpi(E) crc32c_intel(E) ahci(E) li
      kernel: CPU: 0 PID: 8067 Comm: kworker/0:2 Tainted: G            E     4.20.0-rc1.master.20181111.ol7.x86_64 #1
      kernel: Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      kernel: Workqueue: nfsiod rpc_async_release [sunrpc]
      kernel: RIP: 0010:__set_page_dirty+0xaf/0xd0
      kernel: Code: c3 48 8b 02 f6 c4 04 74 d4 48 89 df e8 ba 05 f7 ff 48 89 c6 eb cb 48 8b 43 08 a8 01 75 1f 48 89 d8 48 8b 00 a8 04 74 02 eb 87 <0f> 0b eb 83 48 83 e8 01 eb 9f 48 83 ea 01 0f 1f 00 eb 8b 48 83 e8
      kernel: RSP: 0000:ffffc1c8825b7d78 EFLAGS: 00013046
      kernel: RAX: 000fffffc0020089 RBX: fffff2b603308b80 RCX: 0000000000000001
      kernel: RDX: 0000000000000001 RSI: ffff9d11478115c8 RDI: ffff9d11478115d0
      kernel: RBP: ffffc1c8825b7da0 R08: 0000646f6973666e R09: 8080808080808080
      kernel: R10: 0000000000000001 R11: 0000000000000000 R12: ffff9d11478115d0
      kernel: R13: ffff9d11478115c8 R14: 0000000000003246 R15: 0000000000000001
      kernel: FS:  0000000000000000(0000) GS:ffff9d115ba00000(0000) knlGS:0000000000000000
      kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      kernel: CR2: 00007f408686f640 CR3: 0000000104d8e004 CR4: 00000000000606f0
      kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      kernel: Call Trace:
      kernel:  __set_page_dirty_buffers+0xb6/0x110
      kernel:  set_page_dirty+0x52/0xb0
      kernel:  nfs_direct_read_completion+0xc4/0x120 [nfs]
      kernel:  nfs_pgio_release+0x10/0x20 [nfs]
      kernel:  rpc_free_task+0x30/0x70 [sunrpc]
      kernel:  rpc_async_release+0x12/0x20 [sunrpc]
      kernel:  process_one_work+0x174/0x390
      kernel:  worker_thread+0x4f/0x3e0
      kernel:  kthread+0x102/0x140
      kernel:  ? drain_workqueue+0x130/0x130
      kernel:  ? kthread_stop+0x110/0x110
      kernel:  ret_from_fork+0x35/0x40
      kernel: ---[ end trace 01341980905412c9 ]---
      Signed-off-by: NDave Kleikamp <dave.kleikamp@oracle.com>
      Signed-off-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
      
      [forward-ported to v4.20]
      Signed-off-by: NCalum Mackay <calum.mackay@oracle.com>
      Reviewed-by: NDave Kleikamp <dave.kleikamp@oracle.com>
      Reviewed-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com>
      ad3cba22
    • T
      flexfiles: enforce per-mirror stateid only for v4 DSes · 320f35b7
      Tigran Mkrtchyan 提交于
      Since commit bb21ce0a we always enforce per-mirror stateid.
      However, this makes sense only for v4+ servers.
      Signed-off-by: NTigran Mkrtchyan <tigran.mkrtchyan@desy.de>
      Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com>
      320f35b7
  7. 01 12月, 2018 8 次提交
  8. 30 11月, 2018 6 次提交
    • N
      fscache: fix race between enablement and dropping of object · c5a94f43
      NeilBrown 提交于
      
      It was observed that a process blocked indefintely in
      __fscache_read_or_alloc_page(), waiting for FSCACHE_COOKIE_LOOKING_UP
      to be cleared via fscache_wait_for_deferred_lookup().
      
      At this time, ->backing_objects was empty, which would normaly prevent
      __fscache_read_or_alloc_page() from getting to the point of waiting.
      This implies that ->backing_objects was cleared *after*
      __fscache_read_or_alloc_page was was entered.
      
      When an object is "killed" and then "dropped",
      FSCACHE_COOKIE_LOOKING_UP is cleared in fscache_lookup_failure(), then
      KILL_OBJECT and DROP_OBJECT are "called" and only in DROP_OBJECT is
      ->backing_objects cleared.  This leaves a window where
      something else can set FSCACHE_COOKIE_LOOKING_UP and
      __fscache_read_or_alloc_page() can start waiting, before
      ->backing_objects is cleared
      
      There is some uncertainty in this analysis, but it seems to be fit the
      observations.  Adding the wake in this patch will be handled correctly
      by __fscache_read_or_alloc_page(), as it checks if ->backing_objects
      is empty again, after waiting.
      
      Customer which reported the hang, also report that the hang cannot be
      reproduced with this fix.
      
      The backtrace for the blocked process looked like:
      
      PID: 29360  TASK: ffff881ff2ac0f80  CPU: 3   COMMAND: "zsh"
       #0 [ffff881ff43efbf8] schedule at ffffffff815e56f1
       #1 [ffff881ff43efc58] bit_wait at ffffffff815e64ed
       #2 [ffff881ff43efc68] __wait_on_bit at ffffffff815e61b8
       #3 [ffff881ff43efca0] out_of_line_wait_on_bit at ffffffff815e625e
       #4 [ffff881ff43efd08] fscache_wait_for_deferred_lookup at ffffffffa04f2e8f [fscache]
       #5 [ffff881ff43efd18] __fscache_read_or_alloc_page at ffffffffa04f2ffe [fscache]
       #6 [ffff881ff43efd58] __nfs_readpage_from_fscache at ffffffffa0679668 [nfs]
       #7 [ffff881ff43efd78] nfs_readpage at ffffffffa067092b [nfs]
       #8 [ffff881ff43efda0] generic_file_read_iter at ffffffff81187a73
       #9 [ffff881ff43efe50] nfs_file_read at ffffffffa066544b [nfs]
      #10 [ffff881ff43efe70] __vfs_read at ffffffff811fc756
      #11 [ffff881ff43efee8] vfs_read at ffffffff811fccfa
      #12 [ffff881ff43eff18] sys_read at ffffffff811fda62
      #13 [ffff881ff43eff50] entry_SYSCALL_64_fastpath at ffffffff815e986e
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      c5a94f43
    • M
      fs: fix lost error code in dio_complete · 41e817bc
      Maximilian Heyne 提交于
      commit e2592217 ("fs: simplify the
      generic_write_sync prototype") reworked callers of generic_write_sync(),
      and ended up dropping the error return for the directio path. Prior to
      that commit, in dio_complete(), an error would be bubbled up the stack,
      but after that commit, errors passed on to dio_complete were eaten up.
      
      This was reported on the list earlier, and a fix was proposed in
      https://lore.kernel.org/lkml/20160921141539.GA17898@infradead.org/, but
      never followed up with.  We recently hit this bug in our testing where
      fencing io errors, which were previously erroring out with EIO, were
      being returned as success operations after this commit.
      
      The fix proposed on the list earlier was a little short -- it would have
      still called generic_write_sync() in case `ret` already contained an
      error. This fix ensures generic_write_sync() is only called when there's
      no pending error in the write. Additionally, transferred is replaced
      with ret to bring this code in line with other callers.
      
      Fixes: e2592217 ("fs: simplify the generic_write_sync prototype")
      Reported-by: NRavi Nankani <rnankani@amazon.com>
      Signed-off-by: NMaximilian Heyne <mheyne@amazon.de>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      CC: Torsten Mehlan <tomeh@amazon.de>
      CC: Uwe Dannowski <uwed@amazon.de>
      CC: Amit Shah <aams@amazon.de>
      CC: David Woodhouse <dwmw@amazon.co.uk>
      CC: stable@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      41e817bc
    • D
      afs: Use d_instantiate() rather than d_add() and don't d_drop() · 73116df7
      David Howells 提交于
      Use d_instantiate() rather than d_add() and don't d_drop() in
      afs_vnode_new_inode().  The dentry shouldn't be removed as it's not
      changing its name.
      Reported-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      73116df7
    • D
      afs: Fix missing net error handling · 4584ae96
      David Howells 提交于
      kAFS can be given certain network errors (EADDRNOTAVAIL, EHOSTDOWN and
      ERFKILL) that it doesn't handle in its server/address rotation algorithms.
      They cause the probing and rotation to abort immediately rather than
      rotating.
      
      Fix this by:
      
       (1) Abstracting out the error prioritisation from the VL and FS rotation
           algorithms into a common function and expand usage into the server
           probing code.
      
           When multiple errors are available, this code selects the one we'd
           prefer to return.
      
       (2) Add handling for EADDRNOTAVAIL, EHOSTDOWN and ERFKILL.
      
      Fixes: 0fafdc9f ("afs: Fix file locking")
      Fixes: 0338747d8454 ("afs: Probe multiple fileservers simultaneously")
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      4584ae96
    • D
      afs: Fix validation/callback interaction · ae3b7361
      David Howells 提交于
      When afs_validate() is called to validate a vnode (inode), there are two
      unhandled cases in the fastpath at the top of the function:
      
       (1) If the vnode is promised (AFS_VNODE_CB_PROMISED is set), the break
           counters match and the data has expired, then there's an implicit case
           in which the vnode needs revalidating.
      
           This has no consequences since the default "valid = false" set at the
           top of the function happens to do the right thing.
      
       (2) If the vnode is not promised and it hasn't been deleted
           (AFS_VNODE_DELETED is not set) then there's a default case we're not
           handling in which the vnode is invalid.  If the vnode is invalid, we
           need to bring cb_s_break and cb_v_break up to date before we refetch
           the status.
      
           As a consequence, once the server loses track of the client
           (ie. sufficient time has passed since we last sent it an operation),
           it will send us a CB.InitCallBackState* operation when we next try to
           talk to it.  This calls afs_init_callback_state() which increments
           afs_server::cb_s_break, but this then doesn't propagate to the
           afs_vnode record.
      
           The result being that every afs_validate() call thereafter sends a
           status fetch operation to the server.
      
      Clarify and fix this by:
      
       (A) Setting valid in all the branches rather than initialising it at the
           top so that the compiler catches where we've missed.
      
       (B) Restructuring the logic in the 'promised' branch so that we set valid
           to false if the callback is due to expire (or has expired) and so that
           the final case is that the vnode is still valid.
      
       (C) Adding an else-statement that ups cb_s_break and cb_v_break if the
           promised and deleted cases don't match.
      
      Fixes: c435ee34 ("afs: Overhaul the callback handling")
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      ae3b7361
    • K
      pstore/ram: Correctly calculate usable PRZ bytes · 89d328f6
      Kees Cook 提交于
      The actual number of bytes stored in a PRZ is smaller than the
      bytes requested by platform data, since there is a header on each
      PRZ. Additionally, if ECC is enabled, there are trailing bytes used
      as well. Normally this mismatch doesn't matter since PRZs are circular
      buffers and the leading "overflow" bytes are just thrown away. However, in
      the case of a compressed record, this rather badly corrupts the results.
      
      This corruption was visible with "ramoops.mem_size=204800 ramoops.ecc=1".
      Any stored crashes would not be uncompressable (producing a pstorefs
      "dmesg-*.enc.z" file), and triggering errors at boot:
      
        [    2.790759] pstore: crypto_comp_decompress failed, ret = -22!
      
      Backporting this depends on commit 70ad35db ("pstore: Convert console
      write to use ->write_buf")
      Reported-by: NJoel Fernandes <joel@joelfernandes.org>
      Fixes: b0aad7a9 ("pstore: Add compression support to pstore")
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Reviewed-by: NJoel Fernandes (Google) <joel@joelfernandes.org>
      89d328f6
  9. 29 11月, 2018 2 次提交
  10. 28 11月, 2018 2 次提交
    • K
      cachefiles: Fix page leak in cachefiles_read_backing_file while vmscan is active · 9a24ce5b
      Kiran Kumar Modukuri 提交于
      [Description]
      
      In a heavily loaded system where the system pagecache is nearing memory
      limits and fscache is enabled, pages can be leaked by fscache while trying
      read pages from cachefiles backend.  This can happen because two
      applications can be reading same page from a single mount, two threads can
      be trying to read the backing page at same time.  This results in one of
      the threads finding that a page for the backing file or netfs file is
      already in the radix tree.  During the error handling cachefiles does not
      clean up the reference on backing page, leading to page leak.
      
      [Fix]
      The fix is straightforward, to decrement the reference when error is
      encountered.
      
        [dhowells: Note that I've removed the clearance and put of newpage as
         they aren't attested in the commit message and don't appear to actually
         achieve anything since a new page is only allocated is newpage!=NULL and
         any residual new page is cleared before returning.]
      
      [Testing]
      I have tested the fix using following method for 12+ hrs.
      
      1) mkdir -p /mnt/nfs ; mount -o vers=3,fsc <server_ip>:/export /mnt/nfs
      2) create 10000 files of 2.8MB in a NFS mount.
      3) start a thread to simulate heavy VM presssure
         (while true ; do echo 3 > /proc/sys/vm/drop_caches ; sleep 1 ; done)&
      4) start multiple parallel reader for data set at same time
         find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
         find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
         find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
         ..
         ..
         find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
         find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
      5) finally check using cat /proc/fs/fscache/stats | grep -i pages ;
         free -h , cat /proc/meminfo and page-types -r -b lru
         to ensure all pages are freed.
      Reviewed-by: NDaniel Axtens <dja@axtens.net>
      Signed-off-by: NShantanu Goel <sgoel01@yahoo.com>
      Signed-off-by: NKiran Kumar Modukuri <kiran.modukuri@gmail.com>
      [dja: forward ported to current upstream]
      Signed-off-by: NDaniel Axtens <dja@axtens.net>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      9a24ce5b
    • D
      cachefiles: Fix an assertion failure when trying to update a failed object · e6bc06fa
      David Howells 提交于
      If cachefiles gets an error other then ENOENT when trying to look up an
      object in the cache (in this case, EACCES), the object state machine will
      eventually transition to the DROP_OBJECT state.
      
      This state invokes fscache_drop_object() which tries to sync the auxiliary
      data with the cache (this is done lazily since commit 402cb8dd) on an
      incomplete cache object struct.
      
      The problem comes when cachefiles_update_object_xattr() is called to
      rewrite the xattr holding the data.  There's an assertion there that the
      cache object points to a dentry as we're going to update its xattr.  The
      assertion trips, however, as dentry didn't get set.
      
      Fix the problem by skipping the update in cachefiles if the object doesn't
      refer to a dentry.  A better way to do it could be to skip the update from
      the DROP_OBJECT state handler in fscache, but that might deny the cache the
      opportunity to update intermediate state.
      
      If this error occurs, the kernel log includes lines that look like the
      following:
      
       CacheFiles: Lookup failed error -13
       CacheFiles:
       CacheFiles: Assertion failed
       ------------[ cut here ]------------
       kernel BUG at fs/cachefiles/xattr.c:138!
       ...
       Workqueue: fscache_object fscache_object_work_func [fscache]
       RIP: 0010:cachefiles_update_object_xattr.cold.4+0x18/0x1a [cachefiles]
       ...
       Call Trace:
        cachefiles_update_object+0xdd/0x1c0 [cachefiles]
        fscache_update_aux_data+0x23/0x30 [fscache]
        fscache_drop_object+0x18e/0x1c0 [fscache]
        fscache_object_work_func+0x74/0x2b0 [fscache]
        process_one_work+0x18d/0x340
        worker_thread+0x2e/0x390
        ? pwq_unbound_release_workfn+0xd0/0xd0
        kthread+0x112/0x130
        ? kthread_bind+0x30/0x30
        ret_from_fork+0x35/0x40
      
      Note that there are actually two issues here: (1) EACCES happened on a
      cache object and (2) an oops occurred.  I think that the second is a
      consequence of the first (it certainly looks like it ought to be).  This
      patch only deals with the second.
      
      Fixes: 402cb8dd ("fscache: Attach the index key and aux data to the cookie")
      Reported-by: NZhibin Li <zhibli@redhat.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      e6bc06fa
  11. 27 11月, 2018 2 次提交