1. 15 October 2018 (5 commits)
  2. 13 October 2018 (3 commits)
  3. 12 October 2018 (3 commits)
    • afs: Fix afs_server struct leak · f014ffb0
      Committed by David Howells
      Fix a leak of afs_server structs.  The routine that installs them in the
      various lookup lists and trees takes a ref on leaving the function, whether
      it added the server or found that one already exists.  It should not take
      that extra ref in the case where it added the server.
      
      The effect of this is that "rmmod kafs" will hang waiting for the leaked
      server to become unused.
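
      As a rough sketch of the pattern the fix implies (helper and field names
      here are illustrative, not the exact fs/afs code), the install routine
      should only take the extra reference when handing back a pre-existing
      server:

      	/* exit path of the install routine; names are illustrative */
      	if (found_existing)
      		atomic_inc(&server->usage);	/* caller needs its own ref */
      	/* a freshly added server already carries the caller's ref, so
      	 * taking another one here is what leaked the struct */
      	return server;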
      
      Fixes: d2ddc776 ("afs: Overhaul volume and server record caching and fileserver rotation")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f014ffb0
    • gfs2: Fix iomap buffered write support for journaled files (2) · fee5150c
      Committed by Andreas Gruenbacher
      It turns out that the fix in commit 6636c3cc56 is bad; the assertion
      that the iomap code no longer creates buffer heads is incorrect for
      filesystems that set the IOMAP_F_BUFFER_HEAD flag.
      
      Instead, what's happening is that gfs2_iomap_begin_write treats all
      files that have the jdata flag set as journaled files, which is
      incorrect as long as those files are inline ("stuffed").  We're handling
      stuffed files directly via the page cache, which is why we ended up with
      pages without buffer heads in gfs2_page_add_databufs.
      
      Fix this by handling stuffed journaled files correctly in
      gfs2_iomap_begin_write.
      
      This reverts commit 6636c3cc5690c11631e6366cf9a28fb99c8b25bb.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      fee5150c
    • afs: Fix cell proc list · 6b3944e4
      Committed by David Howells
      Access to the list of cells by /proc/net/afs/cells has a couple of
      problems:
      
       (1) It should be checking against SEQ_START_TOKEN for keying the
           header line.
      
       (2) It's only holding the RCU read lock, so it can't just walk over the
           list without following the proper RCU methods.
      
      Fix these by using an hlist instead of an ordinary list and using the
      appropriate accessor functions to follow it with RCU.
      
      Since the code that adds a cell to the list must also necessarily change,
      sort the list on insertion whilst we're at it.
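
      The resulting walk looks roughly like this (a sketch only; the real code
      drives it through the seq_file start/next/stop callbacks and uses
      SEQ_START_TOKEN for the header line, and the field names here are
      illustrative):

      	rcu_read_lock();
      	hlist_for_each_entry_rcu(cell, &net->proc_cells, proc_link)
      		seq_printf(m, "%s\n", cell->name);
      	rcu_read_unlock();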
      
      Fixes: 989782dc ("afs: Overhaul cell database management")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6b3944e4
  4. 10 October 2018 (1 commit)
  5. 09 October 2018 (1 commit)
    • filesystem-dax: Fix dax_layout_busy_page() livelock · d7782145
      Committed by Dan Williams
      In the presence of multi-order entries the typical
      pagevec_lookup_entries() pattern may loop forever:
      
      	while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
      				min(end - index, (pgoff_t)PAGEVEC_SIZE),
      				indices)) {
      		...
      		for (i = 0; i < pagevec_count(&pvec); i++) {
      			index = indices[i];
      			...
      		}
      		index++; /* BUG */
      	}
      
      The loop updates 'index' for each index found and then increments to the
      next possible page to continue the lookup. However, if the last entry in
      the pagevec is multi-order then the next possible page index is more
      than 1 page away. Fix this locally for the filesystem-dax case by
      checking for dax-multi-order entries. Going forward new users of
      multi-order entries need to be similarly careful, or we need a generic
      way to report the page increment in the radix iterator.
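
      A sketch of the corrected pattern for the dax case (dax_entry_order() is
      an illustrative helper returning 0 for PTE-sized entries and the PMD
      order for huge-page entries):

      	while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
      				min(end - index, (pgoff_t)PAGEVEC_SIZE),
      				indices)) {
      		pgoff_t nr_pages = 1;

      		for (i = 0; i < pagevec_count(&pvec); i++) {
      			index = indices[i];
      			nr_pages = 1UL << dax_entry_order(pvec.pages[i]);
      			...
      		}
      		/* step past the whole last entry, not just one page */
      		index = indices[pagevec_count(&pvec) - 1] + nr_pages;
      	}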
      
      Fixes: 5fac7408 ("mm, fs, dax: handle layout changes to pinned dax...")
      Cc: <stable@vger.kernel.org>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      d7782145
  6. 06 October 2018 (5 commits)
    • xfs: fix data corruption w/ unaligned reflink ranges · b3998900
      Committed by Dave Chinner
      When reflinking sub-file ranges, a data corruption can occur when
      the source file range includes a partial EOF block. This shares the
      unknown data beyond EOF into the second file at a position inside
      EOF, exposing stale data in the second file.
      
      XFS only supports whole block sharing, but we still need to
      support whole file reflink correctly.  Hence if the reflink
      request includes the last block of the source file, only proceed with
      the reflink operation if it lands at or past the destination file's
      current EOF. If it lands within the destination file EOF, reject the
      entire request with -EINVAL and make the caller go the hard way.
      
      This avoids the data corruption vector, but also avoids disruption
      of returning EINVAL to userspace for the common case of whole file
      cloning.
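
      Stripped of the XFS details, the rule is a plain boundary check (a
      simplified sketch; variable names are illustrative):

      	/* source range reaches into a partial EOF block and would land
      	 * inside the destination's EOF: make the caller fall back */
      	if (pos_in + len > ALIGN_DOWN(i_size_read(inode_in), blocksize) &&
      	    pos_out + len < i_size_read(inode_out))
      		return -EINVAL;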
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      b3998900
    • xfs: fix data corruption w/ unaligned dedupe ranges · dceeb47b
      Committed by Dave Chinner
      A deduplication data corruption is exposed by fstests generic/505 on
      XFS. It is caused by extending the block match range to include the
      partial EOF block, but then allowing unknown data beyond EOF to be
      considered a "match" to data in the destination file because the
      comparison is only made to the end of the source file. This corrupts
      the destination file when the source extent is shared with it.
      
      XFS only supports whole block dedupe, but we still need to appear to
      support whole file dedupe correctly.  Hence if the dedupe request
      includes the last block of the source file, don't include it in the
      actual XFS dedupe operation. If the rest of the range dedupes
      successfully, then report the partial last block as deduped, too, so
      that userspace sees it as a successful dedupe rather than return
      EINVAL because we can't dedupe unaligned blocks.
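
      A simplified sketch of that behaviour (do_block_dedupe() is a stand-in
      for the real remapping call, not an XFS function):

      	blk_len = round_down(len, blocksize);	/* whole blocks only */
      	if (blk_len != len && pos_in + len == i_size_read(inode_in)) {
      		error = do_block_dedupe(pos_in, pos_out, blk_len);
      		if (!error)
      			*bytes_deduped = len;	/* report the partial tail too */
      		return error;
      	}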
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      dceeb47b
    • ocfs2: fix locking for res->tracking and dlm->tracking_list · cbe355f5
      Committed by Ashish Samant
      In dlm_init_lockres() we access and modify res->tracking and
      dlm->tracking_list without holding dlm->track_lock.  This can cause list
      corruptions and can end up in kernel panic.
      
      Fix this by locking res->tracking and dlm->tracking_list with
      dlm->track_lock instead of dlm->spinlock.
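
      A minimal sketch of the corrected locking in dlm_init_lockres()
      (surrounding context trimmed):

      	spin_lock(&dlm->track_lock);	/* was done under dlm->spinlock only */
      	list_add_tail(&res->tracking, &dlm->tracking_list);
      	spin_unlock(&dlm->track_lock);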
      
      Link: http://lkml.kernel.org/r/1529951192-4686-1-git-send-email-ashish.samant@oracle.com
      Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
      Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
      Acked-by: Joseph Qi <jiangqi903@gmail.com>
      Acked-by: Jun Piao <piaojun@huawei.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      cbe355f5
    • proc: restrict kernel stack dumps to root · f8a00cef
      Committed by Jann Horn
      Currently, you can use /proc/self/task/*/stack to cause a stack walk on
      a task you control while it is running on another CPU.  That means that
      the stack can change under the stack walker.  The stack walker does
      have guards against going completely off the rails and into random
      kernel memory, but it can interpret random data from your kernel stack
      as instruction pointers and stack pointers.  This can cause exposure of
      kernel stack contents to userspace.
      
      Restrict the ability to inspect kernel stacks of arbitrary tasks to root
      in order to prevent a local attacker from exploiting racy stack unwinding
      to leak kernel task stack contents.  See the added comment for a longer
      rationale.
      
      There don't seem to be any users of this userspace API that can't
      gracefully bail out if reading from the file fails.  Therefore, I believe
      that this change is unlikely to break things.  In the case that this patch
      does end up needing a revert, the next-best solution might be to fake a
      single-entry stack based on wchan.
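
      The gatekeeping amounts to an early privilege check in the /proc show
      routine, roughly like the sketch below (the exact check and error code
      in the real patch may differ):

      	static int proc_pid_stack(struct seq_file *m, struct pid_namespace *ns,
      				  struct pid *pid, struct task_struct *task)
      	{
      		/* refuse unprivileged readers before unwinding the target task */
      		if (!capable(CAP_SYS_ADMIN))
      			return -EACCES;

      		/* ... existing stack-trace collection elided ... */
      		return 0;
      	}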
      
      Link: http://lkml.kernel.org/r/20180927153316.200286-1-jannh@google.com
      Fixes: 2ec220e2 ("proc: add /proc/*/stack")
      Signed-off-by: Jann Horn <jannh@google.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f8a00cef
    • ocfs2: fix crash in ocfs2_duplicate_clusters_by_page() · 69eb7765
      Committed by Larry Chen
      ocfs2_duplicate_clusters_by_page() may crash if one of the extent's pages
      is dirty.  When a page has not been written back, it is still in dirty
      state.  If ocfs2_duplicate_clusters_by_page() is called against the dirty
      page, the crash happens.
      
      To fix this bug, we can just unlock the page and wait until the page is
      no longer dirty.
      
      The following is the backtrace:
      
      kernel BUG at /root/code/ocfs2/refcounttree.c:2961!
      [exception RIP: ocfs2_duplicate_clusters_by_page+822]
      __ocfs2_move_extent+0x80/0x450 [ocfs2]
      ? __ocfs2_claim_clusters+0x130/0x250 [ocfs2]
      ocfs2_defrag_extent+0x5b8/0x5e0 [ocfs2]
      __ocfs2_move_extents_range+0x2a4/0x470 [ocfs2]
      ocfs2_move_extents+0x180/0x3b0 [ocfs2]
      ? ocfs2_wait_for_recovery+0x13/0x70 [ocfs2]
      ocfs2_ioctl_move_extents+0x133/0x2d0 [ocfs2]
      ocfs2_ioctl+0x253/0x640 [ocfs2]
      do_vfs_ioctl+0x90/0x5f0
      SyS_ioctl+0x74/0x80
      do_syscall_64+0x74/0x140
      entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      Once we find the page is dirty, we do not wait until it is clean; rather,
      we use write_one_page() to write it back.
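
      A sketch of that approach (labels are illustrative; the real code sits in
      ocfs2_duplicate_clusters_by_page()):

      	if (PageDirty(page)) {
      		/*
      		 * write_one_page() takes a locked page, writes it back and
      		 * unlocks it, so the duplication can be retried on a clean,
      		 * up-to-date page.
      		 */
      		ret = write_one_page(page);
      		if (ret)
      			goto unlock;
      		goto retry;
      	}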
      
      Link: http://lkml.kernel.org/r/20180829074740.9438-1-lchen@suse.com
      [lchen@suse.com: update comments]
        Link: http://lkml.kernel.org/r/20180830075041.14879-1-lchen@suse.com
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Larry Chen <lchen@suse.com>
      Acked-by: Changwei Ge <ge.changwei@h3c.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      69eb7765
  7. 05 October 2018 (3 commits)
  8. 04 October 2018 (2 commits)
    • ovl: fix format of setxattr debug · 1a8f8d2a
      Committed by Miklos Szeredi
      Format has a typo: it was meant to be "%.*s", not "%*s".  But at some point
      callers grew nonprintable values as well, so use "%*pE" instead with a
      maximized length.
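
      The difference is easy to demonstrate in plain C: "%*s" only sets a
      minimum field width and keeps printing until a NUL, while "%.*s" bounds
      the number of bytes printed (the kernel-side "%*pE" mentioned above
      additionally escapes nonprintable bytes):

      	#include <stdio.h>

      	int main(void)
      	{
      		const char *fh = "fh-followed-by-trailing-junk";

      		printf("[%*s]\n", 2, fh);	/* width only: prints the whole string */
      		printf("[%.*s]\n", 2, fh);	/* precision: prints exactly "fh" */
      		return 0;
      	}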
      Reported-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Fixes: 3a1e819b ("ovl: store file handle of lower inode on copy up")
      Cc: <stable@vger.kernel.org> # v4.12
      1a8f8d2a
    • ovl: fix access beyond unterminated strings · 601350ff
      Committed by Amir Goldstein
      KASAN detected slab-out-of-bounds access in printk from overlayfs,
      because string format used %*s instead of %.*s.
      
      > BUG: KASAN: slab-out-of-bounds in string+0x298/0x2d0 lib/vsprintf.c:604
      > Read of size 1 at addr ffff8801c36c66ba by task syz-executor2/27811
      >
      > CPU: 0 PID: 27811 Comm: syz-executor2 Not tainted 4.19.0-rc5+ #36
      ...
      >  printk+0xa7/0xcf kernel/printk/printk.c:1996
      >  ovl_lookup_index.cold.15+0xe8/0x1f8 fs/overlayfs/namei.c:689
      
      Reported-by: syzbot+376cea2b0ef340db3dd4@syzkaller.appspotmail.com
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Fixes: 359f392c ("ovl: lookup index entry for copy up origin")
      Cc: <stable@vger.kernel.org> # v4.13
      601350ff
  9. 03 October 2018 (4 commits)
    • smb3: fix lease break problem introduced by compounding · 7af929d6
      Committed by Steve French
      Fixes problem (discovered by Aurelien) introduced by recent commit:
      commit b24df3e3
      ("cifs: update receive_encrypted_standard to handle compounded responses")
      
      which broke the ability to respond to some lease breaks
      (lease breaks being ignored is a problem since they can block
      server responses for the duration of the lease break timeout).
      Signed-off-by: Steve French <stfrench@microsoft.com>
      Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
      Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
      7af929d6
    • cifs: only wake the thread for the very last PDU in a compound · 4e34feb5
      Committed by Ronnie Sahlberg
      For compounded PDUs we should only wake the waiting thread for the
      very last PDU of the compound.
      We do this so that we are guaranteed that the demultiplex_thread will
      not process or access any of those MIDs any more once the send/recv
      thread starts processing.
      
      Else there is a race where at the end of the send/recv processing we
      will try to delete all the mids of the compound. If the multiplex
      thread still has other mids to process at this point for this compound
      this can lead to an oops.
      
      Needed to fix recent commit:
      commit 730928c8
      ("cifs: update smb2_queryfs() to use compounding")
      Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
      Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
      4e34feb5
    • cifs: add a warning if we try to dequeue a deleted mid · ddf83afb
      Committed by Ronnie Sahlberg
      cifs_delete_mid() is called once we are finished handling a mid and we
      expect no more work done on this mid.
      
      Needed to fix recent commit:
      commit 730928c8
      ("cifs: update smb2_queryfs() to use compounding")
      
      Add a warning if someone tries to dequeue a mid that has already been
      flagged to be deleted.
      Also change list_del() to list_del_init() so that if we have similar bugs
      resurface in the future we will not oops.
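
      A sketch of both changes in the dequeue path (the lock, flag and field
      names are meant to be illustrative of the cifs mid handling rather than
      exact):

      	spin_lock(&GlobalMid_Lock);
      	if (mid->mid_flags & MID_DELETED)
      		pr_warn_once("trying to dequeue a deleted mid\n");
      	else
      		list_del_init(&mid->qhead);	/* safe to repeat, unlike list_del() */
      	spin_unlock(&GlobalMid_Lock);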
      Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
      Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
      ddf83afb
    • smb2: fix missing files in root share directory listing · 0595751f
      Committed by Aurelien Aptel
      When mounting a Windows share that is the root of a drive (eg. C$)
      the server does not return . and .. directory entries. This results in
      the smb2 code path erroneously skipping the first 2 entries.
      
      Pseudo-code of the readdir() code path:
      
      cifs_readdir(struct file, struct dir_context)
          initiate_cifs_search            <-- if no response cached yet
              server->ops->query_dir_first
      
          dir_emit_dots
              dir_emit                    <-- adds "." and ".." if we're at pos=0
      
          find_cifs_entry
              initiate_cifs_search        <-- if pos < start of current response
                                               (restart search)
              server->ops->query_dir_next <-- if pos > end of current response
                                               (fetch next search res)
      
          for(...)                        <-- loops over cur response entries
                                                starting at pos
              cifs_filldir                <-- skip . and .., emit entry
                  cifs_fill_dirent
                  dir_emit
      	pos++
      
      A) dir_emit_dots() always adds . & ..
         and sets the current dir pos to 2 (0 and 1 are done).
      
      Therefore we always want the index_to_find to be 2, regardless of
      whether the response has . and ..
      
      B) smb1 code initializes index_of_last_entry with a +2 offset
      
        in cifssmb.c CIFSFindFirst():
      		psrch_inf->index_of_last_entry = 2 /* skip . and .. */ +
      			psrch_inf->entries_in_buffer;
      
      Later in find_cifs_entry() we want to find the next dir entry at pos=2
      as a result of (A)
      
      	first_entry_in_buffer = cfile->srch_inf.index_of_last_entry -
      					cfile->srch_inf.entries_in_buffer;
      
      This var is the dir pos that the first entry in the buffer will
      have; therefore it must be 2 in the first call.
      
      If we don't offset index_of_last_entry by 2 (like in (B)),
      first_entry_in_buffer=0, but we were instructed to get pos=2, so this
      code in find_cifs_entry() skips the first 2 entries, which is ok for
      non-root shares, as it skips . and .. from the response, but is not ok
      for root shares, where the first 2 entries are actual files:
      
      		pos_in_buf = index_to_find - first_entry_in_buffer;
                      // pos_in_buf=2
      		// we skip 2 first response entries :(
      		for (i = 0; (i < (pos_in_buf)) && (cur_ent != NULL); i++) {
      			/* go entry by entry figuring out which is first */
      			cur_ent = nxt_dir_entry(cur_ent, end_of_smb,
      						cfile->srch_inf.info_level);
      		}
      
      C) cifs_filldir() skips . and .. so we can safely ignore them for now.
      
      Sample program:
      
      #include <stdio.h>
      #include <dirent.h>
      #include <errno.h>

      int main(int argc, char **argv)
      {
      	const char *path = argc >= 2 ? argv[1] : ".";
      	DIR *dh;
      	struct dirent *de;
      
      	printf("listing path <%s>\n", path);
      	dh = opendir(path);
      	if (!dh) {
      		printf("opendir error %d\n", errno);
      		return 1;
      	}
      
      	while (1) {
      		errno = 0;	/* distinguish end of directory from an error */
      		de = readdir(dh);
      		if (!de) {
      			if (errno) {
      				printf("readdir error %d\n", errno);
      				return 1;
      			}
      			printf("end of listing\n");
      			break;
      		}
      		printf("off=%lu <%s>\n", de->d_off, de->d_name);
      	}
      
      	return 0;
      }
      
      Before the fix with SMB1 on root shares:
      
      <.>            off=1
      <..>           off=2
      <$Recycle.Bin> off=3
      <bootmgr>      off=4
      
      and on non-root shares:
      
      <.>    off=1
      <..>   off=4  <-- after adding .., the offsets jumps to +2 because
      <2536> off=5       we skipped . and .. from response buffer (C)
      <411>  off=6       but still incremented pos
      <file> off=7
      <fsx>  off=8
      
      Therefore the fix for smb2 is to mimic smb1 behaviour and offset the
      index_of_last_entry by 2.
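
      i.e. something along these lines in the SMB2 query-directory path,
      mirroring the SMB1 snippet quoted in (B) (srch_inf being the cached
      search state):

      	srch_inf->index_of_last_entry = 2 /* skip . and .. */ +
      		srch_inf->entries_in_buffer;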
      
      Test results comparing smb1 and smb2 before/after the fix on root
      share, non-root shares and on large directories (ie. multi-response
      dir listing):
      
      PRE FIX
      =======
      pre-1-root VS pre-2-root:
              ERR pre-2-root is missing [bootmgr, $Recycle.Bin]
      pre-1-nonroot VS pre-2-nonroot:
              OK~ same files, same order, different offsets
      pre-1-nonroot-large VS pre-2-nonroot-large:
              OK~ same files, same order, different offsets
      
      POST FIX
      ========
      post-1-root VS post-2-root:
              OK same files, same order, same offsets
      post-1-nonroot VS post-2-nonroot:
              OK same files, same order, same offsets
      post-1-nonroot-large VS post-2-nonroot-large:
              OK same files, same order, same offsets
      
      REGRESSION?
      ===========
      pre-1-root VS post-1-root:
              OK same files, same order, same offsets
      pre-1-nonroot VS post-1-nonroot:
              OK same files, same order, same offsets
      
      BugLink: https://bugzilla.samba.org/show_bug.cgi?id=13107
      Signed-off-by: Aurelien Aptel <aaptel@suse.com>
      Signed-off-by: Paulo Alcantara <palcantara@suse.de>
      Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
      CC: Stable <stable@vger.kernel.org>
      0595751f
  10. 01 October 2018 (2 commits)
    • xfs: fix error handling in xfs_bmap_extents_to_btree · e55ec4dd
      Committed by Dave Chinner
      Commit 01239d77 ("xfs: fix a null pointer dereference in
      xfs_bmap_extents_to_btree") attempted to fix a null pointer
      dereference when a fuzzing corruption of some kind was found.
      This fix was flawed, resulting in assert failures like:
      
      XFS: Assertion failed: ifp->if_broot == NULL, file: fs/xfs/libxfs/xfs_bmap.c, line: 715
      .....
      Call Trace:
        xfs_bmap_extents_to_btree+0x6b9/0x7b0
        __xfs_bunmapi+0xae7/0xf00
        ? xfs_log_reserve+0x1c8/0x290
        xfs_reflink_remap_extent+0x20b/0x620
        xfs_reflink_remap_blocks+0x7e/0x290
        xfs_reflink_remap_range+0x311/0x530
        vfs_dedupe_file_range_one+0xd7/0xe0
        vfs_dedupe_file_range+0x15b/0x1a0
        do_vfs_ioctl+0x267/0x6c0
      
      The problem is that the error handling code now asserts that the
      inode fork is not in btree format before the error handling code
      undoes the modifications that put the fork back in extent format.
      Fix this by moving the assert back to after the xfs_iroot_realloc()
      call that returns the fork to extent format, and clean up the jump
      labels to be meaningful.
      
      Also, returning ENOSPC when xfs_btree_get_bufl() fails to
      instantiate the buffer that was allocated (the actual fix in the
      commit mentioned above) is incorrect. This is a fatal error - only
      an invalid block address or a filesystem shutdown can result in
      failing to get a buffer here.
      
      Hence change this to EFSCORRUPTED so that the higher layer knows
      this was a corruption related failure and should not treat it as an
      ENOSPC error.  This should result in a shutdown (via cancelling a
      dirty transaction) which is necessary as we do not attempt to clean
      up the (invalid) block that we have already allocated.
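
      A sketch of the error-code change (the label is illustrative):

      	/* failed to get the buffer for the block we just allocated */
      	if (!abp) {
      		error = -EFSCORRUPTED;		/* was -ENOSPC */
      		goto out_unreserve_dquot;
      	}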
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      e55ec4dd
    • pstore/ram: Fix failure-path memory leak in ramoops_init · bac6f6cd
      Committed by Kees Cook
      As reported by nixiaoming, with some minor clarifications:
      
      1) memory leak in ramoops_register_dummy():
         dummy_data = kzalloc(sizeof(*dummy_data), GFP_KERNEL);
         but no kfree() if platform_device_register_data() fails.
      
      2) memory leak in ramoops_init():
         Missing platform_device_unregister(dummy) and kfree(dummy_data)
         if platform_driver_register(&ramoops_driver) fails.
      
      I've clarified the purpose of ramoops_register_dummy(), and added a
      common cleanup routine for all three failure paths to call.
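
      The shape of that cleanup, as a sketch (dummy and dummy_data are the
      module-scope pointers described above; the helper name is illustrative):

      	static void ramoops_unregister_dummy(void)
      	{
      		platform_device_unregister(dummy);
      		dummy = NULL;

      		kfree(dummy_data);
      		dummy_data = NULL;
      	}

      	static int __init ramoops_init(void)
      	{
      		int ret;

      		ramoops_register_dummy();
      		ret = platform_driver_register(&ramoops_driver);
      		if (ret != 0)
      			ramoops_unregister_dummy();
      		return ret;
      	}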
      Reported-by: nixiaoming <nixiaoming@huawei.com>
      Cc: stable@vger.kernel.org
      Cc: Anton Vorontsov <anton@enomsg.org>
      Cc: Colin Cross <ccross@android.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Geliang Tang <geliangtang@gmail.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      bac6f6cd
  11. 29 September 2018 (11 commits)
    • iomap: set page dirty after partial delalloc on mkwrite · 561295a3
      Committed by Brian Foster
      The iomap page fault mechanism currently dirties the associated page
      after the full block range of the page has been allocated. This
      leaves the page susceptible to delayed allocations without ever
      being set dirty on sub-page block sized filesystems.
      
      For example, consider a page fault on a page with one preexisting
      real (non-delalloc) block allocated in the middle of the page. The
      first iomap_apply() iteration performs delayed allocation on the
      range up to the preexisting block, the next iteration finds the
      preexisting block, and the last iteration attempts to perform
      delayed allocation on the range after the preexisting block to the
      end of the page. If the first allocation succeeds and the final
      allocation fails with -ENOSPC, iomap_apply() returns the error and
      iomap_page_mkwrite() fails to dirty the page having already
      performed partial delayed allocation. This eventually results in the
      page being invalidated without ever converting the delayed
      allocation to real blocks.
      
      This problem is reliably reproduced by generic/083 on XFS on ppc64
      systems (64k page size, 4k block size). It results in leaked
      delalloc blocks on inode reclaim, which triggers an assert failure
      in xfs_fs_destroy_inode() and filesystem accounting inconsistency.
      
      Move the set_page_dirty() call from iomap_page_mkwrite() to the
      actor callback, similar to how the buffer head implementation works.
      The actor callback is called iff ->iomap_begin() returns success, so
      this ensures the page is dirtied as soon as possible after an allocation.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      561295a3
    • xfs: remove invalid log recovery first/last cycle check · ec2ed0b5
      Committed by Brian Foster
      One of the first steps of log recovery is to check for the special
      case of a zeroed log. If the first cycle in the log is zero or the
      tail portion of the log is zeroed, the head is set to the first
      instance of cycle 0. xlog_find_zeroed() includes a sanity check that
      enforces that the first cycle in the log must be 1 if the last cycle
      is 0. While this is true in most cases, the check is not totally
      valid because it doesn't consider the case where the filesystem
      crashed after a partial/out of order log buffer completion that
      wraps around the end of the physical log.
      
      For example, consider a filesystem that has completed most of the
      first cycle of the log, reaches the end of the physical log and
      splits the next single log buffer write into two in order to wrap
      around the end of the log. If these I/Os are reordered, the second
      (wrapped) I/O completes and the first happens to fail, the log is
      left in a state where the last cycle of the log is 0 and the first
      cycle is 2. This causes the xlog_find_zeroed() sanity check to fail
      and prevents the filesystem from mounting. This situation has been
      reproduced on particular systems via repeated runs of generic/475.
      
      This is an expected state that log recovery already knows how to
      deal with, however. Since the log is still partially zeroed, the
      head is detected correctly and points to a valid tail. The
      subsequent stale block detection clears blocks beyond the head up to
      the tail (within a maximum range), with the express purpose of
      clearing such out of order writes. As expected, this removes the out
      of order cycle 2 blocks at the physical start of the log.
      
      In other words, the only thing that prevents a clean mount and
      recovery of the filesystem in this scenario is the specific (last ==
      0 && first != 1) sanity check in xlog_find_zeroed(). Since the log
      head/tail are now independently validated via cycle, log record and
      CRC checks, this highly specific first cycle check is of dubious
      value. Remove it and rely on the higher level validation to
      determine whether log content is sane and recoverable.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      ec2ed0b5
    • xfs: validate inode di_forkoff · 339e1a3f
      Committed by Eric Sandeen
      Verify the inode di_forkoff, lifted from xfs_repair's
      process_check_inode_forkoff().
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      339e1a3f
    • xfs: skip delalloc COW blocks in xfs_reflink_end_cow · f5f3f959
      Committed by Christoph Hellwig
      The iomap direct I/O code issues a single ->end_io call for the whole
      I/O request, and if some of the extents covered needed a COW operation
      it will call xfs_reflink_end_cow over the whole range.
      
      When we do AIO writes we drop the iolock after doing the initial setup,
      but before the I/O completion.  Between dropping the lock and completing
      the I/O we can have a racing buffered write create new delalloc COW fork
      extents in the region covered by the outstanding direct I/O write, and
      thus see delalloc COW fork extents in xfs_reflink_end_cow.  As
      concurrent writes are fundamentally racy and no guarantees are given we
      can simply skip those.
      
      This can be easily reproduced with xfstests generic/208 in always_cow
      mode.
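
      In the extent walk of xfs_reflink_end_cow() this amounts to something
      like the following (the label is illustrative):

      	/* a racing buffered write may have added delalloc extents to the
      	 * COW fork after the iolock was dropped; they have no disk blocks
      	 * yet, so there is nothing to remap - skip them */
      	if (isnullstartblock(got.br_startblock))
      		goto prev_extent;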
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      f5f3f959
    • xfs: don't treat unknown di_flags2 as corruption in scrub · f369a13c
      Committed by Eric Sandeen
      xchk_inode_flags2() currently treats any di_flags2 values that the
      running kernel doesn't recognize as corruption, and calls
      xchk_ino_set_corrupt() if they are set.  However, it's entirely possible
      that these flags were set in some newer kernel and are quite valid,
      but ignored in this kernel.
      
      (Validators don't care one bit about unknown di_flags2.)
      
      Call xchk_ino_set_warning instead, because this may or may not actually
      indicate a problem.
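
      A sketch of the change in xchk_inode_flags2() (mask name as in the
      on-disk format headers; treat the exact form as illustrative):

      	if (flags2 & ~XFS_DIFLAG2_ANY)
      		xchk_ino_set_warning(sc, ino);	/* was: xchk_ino_set_corrupt() */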
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      f369a13c
    • xfs: remove duplicated include from alloc.c · 2863c2eb
      Committed by YueHaibing
      Remove duplicated include xfs_alloc.h
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      2863c2eb
    • xfs: don't bring in extents in xfs_bmap_punch_delalloc_range · 0065b541
      Committed by Christoph Hellwig
      This function is only used to punch out delayed allocations on I/O
      failure, which means we need to have read the extents earlier.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      0065b541
    • xfs: fix transaction leak in xfs_reflink_allocate_cow() · df307077
      Committed by Dave Chinner
      When xfs_reflink_allocate_cow() allocates a transaction, it drops
      the ILOCK to perform the operation. This introduces a race condition
      where another thread modifying the file can perform the COW
      allocation operation underneath us. This results in the retry loop
      finding an allocated block and jumping straight to the conversion
      code. It does not, however, cancel the transaction it holds and so
      this gets leaked. This results in a lockdep warning:
      
      ================================================
      WARNING: lock held when returning to user space!
      4.18.5 #1 Not tainted
      ------------------------------------------------
      worker/6123 is leaving the kernel with locks still held!
      1 lock held by worker/6123:
       #0: 000000009eab4f1b (sb_internal#2){.+.+}, at: xfs_trans_alloc+0x17c/0x220
      
      And eventually the filesystem deadlocks because it runs out of log
      space that is reserved by the leaked transaction and never gets
      released.
      
      The logic flow in xfs_reflink_allocate_cow() is a convoluted mess of
      gotos - it's no surprise that it has a bug where the flow through
      several goto jumps then fails to clean up context from a non-obvious
      logic path. Clean up the logic flow and make sure every path does
      the right thing.
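
      The essential piece of that cleanup can be sketched as (labels and
      variable names illustrative; the real patch restructures the whole
      function):

      	/* the retry found the block already allocated by a racing thread:
      	 * the transaction we allocated is no longer needed, so cancel it
      	 * before jumping to the conversion code */
      	if (tp) {
      		xfs_trans_cancel(tp);
      		tp = NULL;
      	}
      	goto convert;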
      Reported-by: Alexander Y. Fomichev <git.user@gmail.com>
      Tested-by: Alexander Y. Fomichev <git.user@gmail.com>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200981
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      [hch: slight refactor]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      df307077
    • xfs: avoid lockdep false positives in xfs_trans_alloc · 8683edb7
      Committed by Dave Chinner
      We've had a few reports of lockdep tripping over memory reclaim
      context vs filesystem freeze "deadlocks". They all have looked
      to be false positives on analysis, but it seems that they are
      being tripped because we take freeze references before we run
      a GFP_KERNEL allocation for the struct xfs_trans.
      
      We can avoid this false positive vector just by re-ordering the
      operations in xfs_trans_alloc(). That is, we need to allocate the
      structure before we take the freeze reference and enter the GFP_NOFS
      allocation context that follows the xfs_trans around. This prevents
      lockdep from seeing the GFP_KERNEL allocation inside the transaction
      context, and that prevents it from triggering the freeze level vs
      alloc context vs reclaim warnings.
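
      A sketch of the reordering in xfs_trans_alloc() (allocation flags
      simplified; treat the details as illustrative):

      	/* allocate first, outside the freeze-protected region ... */
      	tp = kmem_zone_zalloc(xfs_trans_zone,
      			      (flags & XFS_TRANS_NOFS) ? KM_NOFS : KM_SLEEP);

      	/* ... then take the freeze reference that used to come first */
      	if (!(flags & XFS_TRANS_NO_WRITECOUNT))
      		sb_start_intwrite(mp->m_super);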
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      8683edb7
    • xfs: refactor xfs_buf_log_item reference count handling · 95808459
      Committed by Brian Foster
      The xfs_buf_log_item structure has a reference counter with slightly
      tricky semantics. In the common case, a buffer is logged and
      committed in a transaction, committed to the on-disk log (added to
      the AIL) and then finally written back and removed from the AIL. The
      bli refcount covers two potentially overlapping timeframes:
      
       1. the bli is held in an active transaction
       2. the bli is pinned by the log
      
      The caveat to this approach is that the reference counter does not
      purely dictate the lifetime of the bli. IOW, when a dirty buffer is
      physically logged and unpinned, the bli refcount may go to zero as
      the log item is inserted into the AIL. Only once the buffer is
      written back can the bli finally be freed.
      
      The above semantics means that it is not enough for the various
      refcount decrementing contexts to release the bli on decrement to
      zero. xfs_trans_brelse(), transaction commit (->iop_unlock()) and
      unpin (->iop_unpin()) must all drop the associated reference and
      make additional checks to determine if the current context is
      responsible for freeing the item.
      
      For example, if a transaction holds but does not dirty a particular
      bli, the commit may drop the refcount to zero. If the bli itself is
      clean, it is also not AIL resident and must be freed at this time.
      The same is true for xfs_trans_brelse(). If the transaction dirties
      a bli and then aborts or an unpin results in an abort due to a log
      I/O error, the last reference count holder is expected to explicitly
      remove the item from the AIL and release it (since an abort means
      filesystem shutdown and metadata writeback will never occur).
      
      This leads to fairly complex checks being replicated in a few
      different places. Since ->iop_unlock() and xfs_trans_brelse() are
      nearly identical, refactor the logic into a common helper that
      implements and documents the semantics in one place. This patch does
      not change behavior.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      95808459
    • xfs: clean up xfs_trans_brelse() · 23420d05
      Committed by Brian Foster
      xfs_trans_brelse() is a bit of a historical mess, similar to
      xfs_buf_item_unlock(). It is unnecessarily verbose, has snippets of
      commented out code, inconsistency with regard to stale items, etc.
      
      Clean up xfs_trans_brelse() to use similar logic and flow as
      xfs_buf_item_unlock() with regard to bli reference count handling.
      This patch makes no functional changes, but facilitates further
      refactoring of the common bli reference count handling code.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      23420d05