1. 24 5月, 2017 7 次提交
  2. 17 5月, 2017 1 次提交
  3. 06 5月, 2017 2 次提交
    • B
      GFS2: Allow glocks to be unlocked after withdraw · ed17545d
      Bob Peterson 提交于
      This bug fixes a regression introduced by patch 0d1c7ae9.
      
      The intent of the patch was to stop promoting glocks after a
      file system is withdrawn due to a variety of errors, because doing
      so results in a BUG(). (You should be able to unmount after a
      withdraw rather than having the kernel panic.)
      
      Unfortunately, it also stopped demotions, so glocks could not be
      unlocked after withdraw, which means the unmount would hang.
      
      This patch allows function do_xmote to demote locks to an
      unlocked state after a withdraw, but not promote them.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      ed17545d
    • E
      xfs: fix use-after-free in xfs_finish_page_writeback · 161f55ef
      Eryu Guan 提交于
      Commit 28b783e4 ("xfs: bufferhead chains are invalid after
      end_page_writeback") fixed one use-after-free issue by
      pre-calculating the loop conditionals before calling bh->b_end_io()
      in the end_io processing loop, but it assigned 'next' pointer before
      checking end offset boundary & breaking the loop, at which point the
      bh might be freed already, and caused use-after-free.
      
      This is caught by KASAN when running fstests generic/127 on sub-page
      block size XFS.
      
      [ 2517.244502] run fstests generic/127 at 2017-04-27 07:30:50
      [ 2747.868840] ==================================================================
      [ 2747.876949] BUG: KASAN: use-after-free in xfs_destroy_ioend+0x3d3/0x4e0 [xfs] at addr ffff8801395ae698
      ...
      [ 2747.918245] Call Trace:
      [ 2747.920975]  dump_stack+0x63/0x84
      [ 2747.924673]  kasan_object_err+0x21/0x70
      [ 2747.928950]  kasan_report+0x271/0x530
      [ 2747.933064]  ? xfs_destroy_ioend+0x3d3/0x4e0 [xfs]
      [ 2747.938409]  ? end_page_writeback+0xce/0x110
      [ 2747.943171]  __asan_report_load8_noabort+0x19/0x20
      [ 2747.948545]  xfs_destroy_ioend+0x3d3/0x4e0 [xfs]
      [ 2747.953724]  xfs_end_io+0x1af/0x2b0 [xfs]
      [ 2747.958197]  process_one_work+0x5ff/0x1000
      [ 2747.962766]  worker_thread+0xe4/0x10e0
      [ 2747.966946]  kthread+0x2d3/0x3d0
      [ 2747.970546]  ? process_one_work+0x1000/0x1000
      [ 2747.975405]  ? kthread_create_on_node+0xc0/0xc0
      [ 2747.980457]  ? syscall_return_slowpath+0xe6/0x140
      [ 2747.985706]  ? do_page_fault+0x30/0x80
      [ 2747.989887]  ret_from_fork+0x2c/0x40
      [ 2747.993874] Object at ffff8801395ae690, in cache buffer_head size: 104
      [ 2748.001155] Allocated:
      [ 2748.003782] PID = 8327
      [ 2748.006411]  save_stack_trace+0x1b/0x20
      [ 2748.010688]  save_stack+0x46/0xd0
      [ 2748.014383]  kasan_kmalloc+0xad/0xe0
      [ 2748.018370]  kasan_slab_alloc+0x12/0x20
      [ 2748.022648]  kmem_cache_alloc+0xb8/0x1b0
      [ 2748.027024]  alloc_buffer_head+0x22/0xc0
      [ 2748.031399]  alloc_page_buffers+0xd1/0x250
      [ 2748.035968]  create_empty_buffers+0x30/0x410
      [ 2748.040730]  create_page_buffers+0x120/0x1b0
      [ 2748.045493]  __block_write_begin_int+0x17a/0x1800
      [ 2748.050740]  iomap_write_begin+0x100/0x2f0
      [ 2748.055308]  iomap_zero_range_actor+0x253/0x5c0
      [ 2748.060362]  iomap_apply+0x157/0x270
      [ 2748.064347]  iomap_zero_range+0x5a/0x80
      [ 2748.068624]  iomap_truncate_page+0x6b/0xa0
      [ 2748.073227]  xfs_setattr_size+0x1f7/0xa10 [xfs]
      [ 2748.078312]  xfs_vn_setattr_size+0x68/0x140 [xfs]
      [ 2748.083589]  xfs_file_fallocate+0x4ac/0x820 [xfs]
      [ 2748.088838]  vfs_fallocate+0x2cf/0x780
      [ 2748.093021]  SyS_fallocate+0x48/0x80
      [ 2748.097006]  do_syscall_64+0x18a/0x430
      [ 2748.101186]  return_from_SYSCALL_64+0x0/0x6a
      [ 2748.105948] Freed:
      [ 2748.108189] PID = 8327
      [ 2748.110816]  save_stack_trace+0x1b/0x20
      [ 2748.115093]  save_stack+0x46/0xd0
      [ 2748.118788]  kasan_slab_free+0x73/0xc0
      [ 2748.122969]  kmem_cache_free+0x7a/0x200
      [ 2748.127247]  free_buffer_head+0x41/0x80
      [ 2748.131524]  try_to_free_buffers+0x178/0x250
      [ 2748.136316]  xfs_vm_releasepage+0x2e9/0x3d0 [xfs]
      [ 2748.141563]  try_to_release_page+0x100/0x180
      [ 2748.146325]  invalidate_inode_pages2_range+0x7da/0xcf0
      [ 2748.152087]  xfs_shift_file_space+0x37d/0x6e0 [xfs]
      [ 2748.157557]  xfs_collapse_file_space+0x49/0x120 [xfs]
      [ 2748.163223]  xfs_file_fallocate+0x2a7/0x820 [xfs]
      [ 2748.168462]  vfs_fallocate+0x2cf/0x780
      [ 2748.172642]  SyS_fallocate+0x48/0x80
      [ 2748.176629]  do_syscall_64+0x18a/0x430
      [ 2748.180810]  return_from_SYSCALL_64+0x0/0x6a
      
      Fixed it by checking on offset against end & breaking out first,
      dereference bh only if there're still bufferheads to process.
      Signed-off-by: NEryu Guan <eguan@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      161f55ef
  4. 05 5月, 2017 5 次提交
  5. 04 5月, 2017 25 次提交
    • E
      ext4: clean up ext4_match() and callers · d9b9f8d5
      Eric Biggers 提交于
      When ext4 encryption was originally merged, we were encrypting the
      user-specified filename in ext4_match(), introducing a lot of additional
      complexity into ext4_match() and its callers.  This has since been
      changed to encrypt the filename earlier, so we can remove the gunk
      that's no longer needed.  This more or less reverts ext4_search_dir()
      and ext4_find_dest_de() to the way they were in the v4.0 kernel.
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      d9b9f8d5
    • E
      f2fs: switch to using fscrypt_match_name() · 1f73d491
      Eric Biggers 提交于
      Switch f2fs directory searches to use the fscrypt_match_name() helper
      function.  There should be no functional change.
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Acked-by: NJaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      1f73d491
    • E
      ext4: switch to using fscrypt_match_name() · 067d1023
      Eric Biggers 提交于
      Switch ext4 directory searches to use the fscrypt_match_name() helper
      function.  There should be no functional change.
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      067d1023
    • E
      fscrypt: introduce helper function for filename matching · 17159420
      Eric Biggers 提交于
      Introduce a helper function fscrypt_match_name() which tests whether a
      fscrypt_name matches a directory entry.  Also clean up the magic numbers
      and document things properly.
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      17159420
    • E
      fscrypt: avoid collisions when presenting long encrypted filenames · 6b06cdee
      Eric Biggers 提交于
      When accessing an encrypted directory without the key, userspace must
      operate on filenames derived from the ciphertext names, which contain
      arbitrary bytes.  Since we must support filenames as long as NAME_MAX,
      we can't always just base64-encode the ciphertext, since that may make
      it too long.  Currently, this is solved by presenting long names in an
      abbreviated form containing any needed filesystem-specific hashes (e.g.
      to identify a directory block), then the last 16 bytes of ciphertext.
      This needs to be sufficient to identify the actual name on lookup.
      
      However, there is a bug.  It seems to have been assumed that due to the
      use of a CBC (ciphertext block chaining)-based encryption mode, the last
      16 bytes (i.e. the AES block size) of ciphertext would depend on the
      full plaintext, preventing collisions.  However, we actually use CBC
      with ciphertext stealing (CTS), which handles the last two blocks
      specially, causing them to appear "flipped".  Thus, it's actually the
      second-to-last block which depends on the full plaintext.
      
      This caused long filenames that differ only near the end of their
      plaintexts to, when observed without the key, point to the wrong inode
      and be undeletable.  For example, with ext4:
      
          # echo pass | e4crypt add_key -p 16 edir/
          # seq -f "edir/abcdefghijklmnopqrstuvwxyz012345%.0f" 100000 | xargs touch
          # find edir/ -type f | xargs stat -c %i | sort | uniq | wc -l
          100000
          # sync
          # echo 3 > /proc/sys/vm/drop_caches
          # keyctl new_session
          # find edir/ -type f | xargs stat -c %i | sort | uniq | wc -l
          2004
          # rm -rf edir/
          rm: cannot remove 'edir/_A7nNFi3rhkEQlJ6P,hdzluhODKOeWx5V': Structure needs cleaning
          ...
      
      To fix this, when presenting long encrypted filenames, encode the
      second-to-last block of ciphertext rather than the last 16 bytes.
      
      Although it would be nice to solve this without depending on a specific
      encryption mode, that would mean doing a cryptographic hash like SHA-256
      which would be much less efficient.  This way is sufficient for now, and
      it's still compatible with encryption modes like HEH which are strong
      pseudorandom permutations.  Also, changing the presented names is still
      allowed at any time because they are only provided to allow applications
      to do things like delete encrypted directories.  They're not designed to
      be used to persistently identify files --- which would be hard to do
      anyway, given that they're encrypted after all.
      
      For ease of backports, this patch only makes the minimal fix to both
      ext4 and f2fs.  It leaves ubifs as-is, since ubifs doesn't compare the
      ciphertext block yet.  Follow-on patches will clean things up properly
      and make the filesystems use a shared helper function.
      
      Fixes: 5de0b4d0 ("ext4 crypto: simplify and speed up filename encryption")
      Reported-by: NGwendal Grignou <gwendal@chromium.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      6b06cdee
    • J
      f2fs: check entire encrypted bigname when finding a dentry · 6332cd32
      Jaegeuk Kim 提交于
      If user has no key under an encrypted dir, fscrypt gives digested dentries.
      Previously, when looking up a dentry, f2fs only checks its hash value with
      first 4 bytes of the digested dentry, which didn't handle hash collisions fully.
      This patch enhances to check entire dentry bytes likewise ext4.
      
      Eric reported how to reproduce this issue by:
      
       # seq -f "edir/abcdefghijklmnopqrstuvwxyz012345%.0f" 100000 | xargs touch
       # find edir -type f | xargs stat -c %i | sort | uniq | wc -l
      100000
       # sync
       # echo 3 > /proc/sys/vm/drop_caches
       # keyctl new_session
       # find edir -type f | xargs stat -c %i | sort | uniq | wc -l
      99999
      
      Cc: <stable@vger.kernel.org>
      Reported-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      (fixed f2fs_dentry_hash() to work even when the hash is 0)
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      6332cd32
    • E
      ubifs: check for consistent encryption contexts in ubifs_lookup() · 413d5a9e
      Eric Biggers 提交于
      As ext4 and f2fs do, ubifs should check for consistent encryption
      contexts during ->lookup() in an encrypted directory.  This protects
      certain users of filesystem encryption against certain types of offline
      attacks.
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      413d5a9e
    • E
      f2fs: sync f2fs_lookup() with ext4_lookup() · faac7fd9
      Eric Biggers 提交于
      As for ext4, now that fscrypt_has_permitted_context() correctly handles
      the case where we have the key for the parent directory but not the
      child, f2fs_lookup() no longer has to work around it.  Also add the same
      warning message that ext4 uses.
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      faac7fd9
    • E
      ext4: remove "nokey" check from ext4_lookup() · 8c68084b
      Eric Biggers 提交于
      Now that fscrypt_has_permitted_context() correctly handles the case
      where we have the key for the parent directory but not the child, we
      don't need to try to work around this in ext4_lookup().
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      8c68084b
    • E
      fscrypt: fix context consistency check when key(s) unavailable · 272f98f6
      Eric Biggers 提交于
      To mitigate some types of offline attacks, filesystem encryption is
      designed to enforce that all files in an encrypted directory tree use
      the same encryption policy (i.e. the same encryption context excluding
      the nonce).  However, the fscrypt_has_permitted_context() function which
      enforces this relies on comparing struct fscrypt_info's, which are only
      available when we have the encryption keys.  This can cause two
      incorrect behaviors:
      
      1. If we have the parent directory's key but not the child's key, or
         vice versa, then fscrypt_has_permitted_context() returned false,
         causing applications to see EPERM or ENOKEY.  This is incorrect if
         the encryption contexts are in fact consistent.  Although we'd
         normally have either both keys or neither key in that case since the
         master_key_descriptors would be the same, this is not guaranteed
         because keys can be added or removed from keyrings at any time.
      
      2. If we have neither the parent's key nor the child's key, then
         fscrypt_has_permitted_context() returned true, causing applications
         to see no error (or else an error for some other reason).  This is
         incorrect if the encryption contexts are in fact inconsistent, since
         in that case we should deny access.
      
      To fix this, retrieve and compare the fscrypt_contexts if we are unable
      to set up both fscrypt_infos.
      
      While this slightly hurts performance when accessing an encrypted
      directory tree without the key, this isn't a case we really need to be
      optimizing for; access *with* the key is much more important.
      Furthermore, the performance hit is barely noticeable given that we are
      already retrieving the fscrypt_context and doing two keyring searches in
      fscrypt_get_encryption_info().  If we ever actually wanted to optimize
      this case we might start by caching the fscrypt_contexts.
      
      Cc: stable@vger.kernel.org # 4.0+
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      272f98f6
    • J
      jbd2: cleanup write flags handling from jbd2_write_superblock() · 17f423b5
      Jan Kara 提交于
      Currently jbd2_write_superblock() silently adds REQ_SYNC to flags with
      which journal superblock is written. Make this explicit by making flags
      passed down to jbd2_write_superblock() contain REQ_SYNC.
      
      CC: linux-ext4@vger.kernel.org
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      17f423b5
    • J
      ext4: mark superblock writes synchronous for nobarrier mounts · 00473374
      Jan Kara 提交于
      Commit b685d3d6 "block: treat REQ_FUA and REQ_PREFLUSH as
      synchronous" removed REQ_SYNC flag from WRITE_FUA implementation.
      generic_make_request_checks() however strips REQ_FUA flag from a bio
      when the storage doesn't report volatile write cache and thus write
      effectively becomes asynchronous which can lead to performance
      regressions. This affects superblock writes for ext4. Fix the problem
      by marking superblock writes always as synchronous.
      
      Fixes: b685d3d6
      CC: linux-ext4@vger.kernel.org
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      00473374
    • J
      nfs: Fix bdi handling for cloned superblocks · 9052c7cf
      Jan Kara 提交于
      In commit 0d3b12584972 "nfs: Convert to separately allocated bdi" I have
      wrongly cloned bdi reference in nfs_clone_super(). Further inspection
      has shown that originally the code was actually allocating a new bdi (in
      ->clone_server callback) which was later registered in
      nfs_fs_mount_common() and used for sb->s_bdi in nfs_initialise_sb().
      This could later result in bdi for the original superblock not getting
      unregistered when that superblock got shutdown (as the cloned sb still
      held bdi reference) and later when a new superblock was created under
      the same anonymous device number, a clash in sysfs has happened on bdi
      registration:
      
      ------------[ cut here ]------------
      WARNING: CPU: 1 PID: 10284 at /linux-next/fs/sysfs/dir.c:31 sysfs_warn_dup+0x64/0x74
      sysfs: cannot create duplicate filename '/devices/virtual/bdi/0:32'
      Modules linked in: axp20x_usb_power gpio_axp209 nvmem_sunxi_sid sun4i_dma sun4i_ss virt_dma
      CPU: 1 PID: 10284 Comm: mount.nfs Not tainted 4.11.0-rc4+ #14
      Hardware name: Allwinner sun7i (A20) Family
      [<c010f19c>] (unwind_backtrace) from [<c010bc74>] (show_stack+0x10/0x14)
      [<c010bc74>] (show_stack) from [<c03c6e24>] (dump_stack+0x78/0x8c)
      [<c03c6e24>] (dump_stack) from [<c0122200>] (__warn+0xe8/0x100)
      [<c0122200>] (__warn) from [<c0122250>] (warn_slowpath_fmt+0x38/0x48)
      [<c0122250>] (warn_slowpath_fmt) from [<c02ac178>] (sysfs_warn_dup+0x64/0x74)
      [<c02ac178>] (sysfs_warn_dup) from [<c02ac254>] (sysfs_create_dir_ns+0x84/0x94)
      [<c02ac254>] (sysfs_create_dir_ns) from [<c03c8b8c>] (kobject_add_internal+0x9c/0x2ec)
      [<c03c8b8c>] (kobject_add_internal) from [<c03c8e24>] (kobject_add+0x48/0x98)
      [<c03c8e24>] (kobject_add) from [<c048d75c>] (device_add+0xe4/0x5a0)
      [<c048d75c>] (device_add) from [<c048ddb4>] (device_create_groups_vargs+0xac/0xbc)
      [<c048ddb4>] (device_create_groups_vargs) from [<c048dde4>] (device_create_vargs+0x20/0x28)
      [<c048dde4>] (device_create_vargs) from [<c02075c8>] (bdi_register_va+0x44/0xfc)
      [<c02075c8>] (bdi_register_va) from [<c023d378>] (super_setup_bdi_name+0x48/0xa4)
      [<c023d378>] (super_setup_bdi_name) from [<c0312ef4>] (nfs_fill_super+0x1a4/0x204)
      [<c0312ef4>] (nfs_fill_super) from [<c03133f0>] (nfs_fs_mount_common+0x140/0x1e8)
      [<c03133f0>] (nfs_fs_mount_common) from [<c03335cc>] (nfs4_remote_mount+0x50/0x58)
      [<c03335cc>] (nfs4_remote_mount) from [<c023ef98>] (mount_fs+0x14/0xa4)
      [<c023ef98>] (mount_fs) from [<c025cba0>] (vfs_kern_mount+0x54/0x128)
      [<c025cba0>] (vfs_kern_mount) from [<c033352c>] (nfs_do_root_mount+0x80/0xa0)
      [<c033352c>] (nfs_do_root_mount) from [<c0333818>] (nfs4_try_mount+0x28/0x3c)
      [<c0333818>] (nfs4_try_mount) from [<c0313874>] (nfs_fs_mount+0x2cc/0x8c4)
      [<c0313874>] (nfs_fs_mount) from [<c023ef98>] (mount_fs+0x14/0xa4)
      [<c023ef98>] (mount_fs) from [<c025cba0>] (vfs_kern_mount+0x54/0x128)
      [<c025cba0>] (vfs_kern_mount) from [<c02600f0>] (do_mount+0x158/0xc7c)
      [<c02600f0>] (do_mount) from [<c0260f98>] (SyS_mount+0x8c/0xb4)
      [<c0260f98>] (SyS_mount) from [<c0107840>] (ret_fast_syscall+0x0/0x3c)
      
      Fix the problem by always creating new bdi for a superblock as we used
      to do.
      Reported-and-tested-by: NCorentin Labbe <clabbe.montjoie@gmail.com>
      Fixes: 0d3b12584972ce5781179ad3f15cca3cdb5cae05
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      9052c7cf
    • S
      SMB3: Work around mount failure when using SMB3 dialect to Macs · 7db0a6ef
      Steve French 提交于
      Macs send the maximum buffer size in response on ioctl to validate
      negotiate security information, which causes us to fail the mount
      as the response buffer is larger than the expected response.
      
      Changed ioctl response processing to allow for padding of validate
      negotiate ioctl response and limit the maximum response size to
      maximum buffer size.
      Signed-off-by: NSteve French <steve.french@primarydata.com>
      CC: Stable <stable@vger.kernel.org>
      7db0a6ef
    • Y
      f2fs: fix a mount fail for wrong next_scan_nid · e9cdd307
      Yunlei He 提交于
      -write_checkpoint
         -do_checkpoint
            -next_free_nid    <--- something wrong with next free nid
      
      -f2fs_fill_super
         -build_node_manager
            -build_free_nids
                -get_current_nat_page
                   -__get_meta_page   <--- attempt to access beyond end of device
      Signed-off-by: NYunlei He <heyunlei@huawei.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      e9cdd307
    • D
      cifs: fix CIFS_IOC_GET_MNT_INFO oops · d8a6e505
      David Disseldorp 提交于
      An open directory may have a NULL private_data pointer prior to readdir.
      
      Fixes: 0de1f4c6 ("Add way to query server fs info for smb3")
      Cc: stable@vger.kernel.org
      Signed-off-by: NDavid Disseldorp <ddiss@suse.de>
      Signed-off-by: NSteve French <smfrench@gmail.com>
      d8a6e505
    • B
      CIFS: fix mapping of SFM_SPACE and SFM_PERIOD · b704e70b
      Björn Jacke 提交于
      - trailing space maps to 0xF028
      - trailing period maps to 0xF029
      
      This fix corrects the mapping of file names which have a trailing character
      that would otherwise be illegal (period or space) but is allowed by POSIX.
      Signed-off-by: NBjoern Jacke <bjacke@samba.org>
      CC: Stable <stable@vger.kernel.org>
      Signed-off-by: NSteve French <smfrench@gmail.com>
      b704e70b
    • A
      fs/block_dev: always invalidate cleancache in invalidate_bdev() · a5f6a6a9
      Andrey Ryabinin 提交于
      invalidate_bdev() calls cleancache_invalidate_inode() iff ->nrpages != 0
      which doen't make any sense.
      
      Make sure that invalidate_bdev() always calls cleancache_invalidate_inode()
      regardless of mapping->nrpages value.
      
      Fixes: c515e1fd ("mm/fs: add hooks to support cleancache")
      Link: http://lkml.kernel.org/r/20170424164135.22350-3-aryabinin@virtuozzo.comSigned-off-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Acked-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alexey Kuznetsov <kuznet@virtuozzo.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Nikolay Borisov <n.borisov.lkml@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a5f6a6a9
    • A
      fs: fix data invalidation in the cleancache during direct IO · 55635ba7
      Andrey Ryabinin 提交于
      Patch series "Properly invalidate data in the cleancache", v2.
      
      We've noticed that after direct IO write, buffered read sometimes gets
      stale data which is coming from the cleancache.  The reason for this is
      that some direct write hooks call call invalidate_inode_pages2[_range]()
      conditionally iff mapping->nrpages is not zero, so we may not invalidate
      data in the cleancache.
      
      Another odd thing is that we check only for ->nrpages and don't check
      for ->nrexceptional, but invalidate_inode_pages2[_range] also
      invalidates exceptional entries as well.  So we invalidate exceptional
      entries only if ->nrpages != 0? This doesn't feel right.
      
       - Patch 1 fixes direct IO writes by removing ->nrpages check.
       - Patch 2 fixes similar case in invalidate_bdev().
           Note: I only fixed conditional cleancache_invalidate_inode() here.
             Do we also need to add ->nrexceptional check in into invalidate_bdev()?
      
       - Patches 3-4: some optimizations.
      
      This patch (of 4):
      
      Some direct IO write fs hooks call invalidate_inode_pages2[_range]()
      conditionally iff mapping->nrpages is not zero.  This can't be right,
      because invalidate_inode_pages2[_range]() also invalidate data in the
      cleancache via cleancache_invalidate_inode() call.  So if page cache is
      empty but there is some data in the cleancache, buffered read after
      direct IO write would get stale data from the cleancache.
      
      Also it doesn't feel right to check only for ->nrpages because
      invalidate_inode_pages2[_range] invalidates exceptional entries as well.
      
      Fix this by calling invalidate_inode_pages2[_range]() regardless of
      nrpages state.
      
      Note: nfs,cifs,9p doesn't need similar fix because the never call
      cleancache_get_page() (nor directly, nor via mpage_readpage[s]()), so
      they are not affected by this bug.
      
      Fixes: c515e1fd ("mm/fs: add hooks to support cleancache")
      Link: http://lkml.kernel.org/r/20170424164135.22350-2-aryabinin@virtuozzo.comSigned-off-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Acked-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alexey Kuznetsov <kuznet@virtuozzo.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Nikolay Borisov <n.borisov.lkml@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      55635ba7
    • M
      jbd2: make the whole kjournald2 kthread NOFS safe · eb52da3f
      Michal Hocko 提交于
      kjournald2 is central to the transaction commit processing.  As such any
      potential allocation from this kernel thread has to be GFP_NOFS.  Make
      sure to mark the whole kernel thread GFP_NOFS by the memalloc_nofs_save.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20170306131408.9828-8-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Suggested-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eb52da3f
    • M
      jbd2: mark the transaction context with the scope GFP_NOFS context · 81378da6
      Michal Hocko 提交于
      now that we have memalloc_nofs_{save,restore} api we can mark the whole
      transaction context as implicitly GFP_NOFS.  All allocations will
      automatically inherit GFP_NOFS this way.  This means that we do not have
      to mark any of those requests with GFP_NOFS and moreover all the
      ext4_kv[mz]alloc(GFP_NOFS) are also safe now because even the hardcoded
      GFP_KERNEL allocations deep inside the vmalloc will be NOFS now.
      
      [akpm@linux-foundation.org: tweak comments]
      Link: http://lkml.kernel.org/r/20170306131408.9828-7-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      81378da6
    • M
      xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio* · 9ba1fb2c
      Michal Hocko 提交于
      kmem_zalloc_large and _xfs_buf_map_pages use memalloc_noio_{save,restore}
      API to prevent from reclaim recursion into the fs because vmalloc can
      invoke unconditional GFP_KERNEL allocations and these functions might be
      called from the NOFS contexts.  The memalloc_noio_save will enforce
      GFP_NOIO context which is even weaker than GFP_NOFS and that seems to be
      unnecessary.  Let's use memalloc_nofs_{save,restore} instead as it
      should provide exactly what we need here - implicit GFP_NOFS context.
      
      Link: http://lkml.kernel.org/r/20170306131408.9828-6-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9ba1fb2c
    • M
      mm: introduce memalloc_nofs_{save,restore} API · 7dea19f9
      Michal Hocko 提交于
      GFP_NOFS context is used for the following 5 reasons currently:
      
       - to prevent from deadlocks when the lock held by the allocation
         context would be needed during the memory reclaim
      
       - to prevent from stack overflows during the reclaim because the
         allocation is performed from a deep context already
      
       - to prevent lockups when the allocation context depends on other
         reclaimers to make a forward progress indirectly
      
       - just in case because this would be safe from the fs POV
      
       - silence lockdep false positives
      
      Unfortunately overuse of this allocation context brings some problems to
      the MM.  Memory reclaim is much weaker (especially during heavy FS
      metadata workloads), OOM killer cannot be invoked because the MM layer
      doesn't have enough information about how much memory is freeable by the
      FS layer.
      
      In many cases it is far from clear why the weaker context is even used
      and so it might be used unnecessarily.  We would like to get rid of
      those as much as possible.  One way to do that is to use the flag in
      scopes rather than isolated cases.  Such a scope is declared when really
      necessary, tracked per task and all the allocation requests from within
      the context will simply inherit the GFP_NOFS semantic.
      
      Not only this is easier to understand and maintain because there are
      much less problematic contexts than specific allocation requests, this
      also helps code paths where FS layer interacts with other layers (e.g.
      crypto, security modules, MM etc...) and there is no easy way to convey
      the allocation context between the layers.
      
      Introduce memalloc_nofs_{save,restore} API to control the scope of
      GFP_NOFS allocation context.  This is basically copying
      memalloc_noio_{save,restore} API we have for other restricted allocation
      context GFP_NOIO.  The PF_MEMALLOC_NOFS flag already exists and it is
      just an alias for PF_FSTRANS which has been xfs specific until recently.
      There are no more PF_FSTRANS users anymore so let's just drop it.
      
      PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
      implicitly same as PF_MEMALLOC_NOIO drops __GFP_IO.  memalloc_noio_flags
      is renamed to current_gfp_context because it now cares about both
      PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts.  Xfs code paths preserve
      their semantic.  kmem_flags_convert() doesn't need to evaluate the flag
      anymore.
      
      This patch shouldn't introduce any functional changes.
      
      Let's hope that filesystems will drop direct GFP_NOFS (resp.  ~__GFP_FS)
      usage as much as possible and only use a properly documented
      memalloc_nofs_{save,restore} checkpoints where they are appropriate.
      
      [akpm@linux-foundation.org: fix comment typo, reflow comment]
      Link: http://lkml.kernel.org/r/20170306131408.9828-5-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7dea19f9
    • M
      xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS · 9070733b
      Michal Hocko 提交于
      xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
      some time ago.  We would like to make this concept more generic and use
      it for other filesystems as well.  Let's start by giving the flag a more
      generic name PF_MEMALLOC_NOFS which is in line with an exiting
      PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
      contexts.  Replace all PF_FSTRANS usage from the xfs code in the first
      step before we introduce a full API for it as xfs uses the flag directly
      anyway.
      
      This patch doesn't introduce any functional change.
      
      Link: http://lkml.kernel.org/r/20170306131408.9828-4-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9070733b
    • S
      proc: show MADV_FREE pages info in smaps · cf8496ea
      Shaohua Li 提交于
      Show MADV_FREE pages info of each vma in smaps.  The interface is for
      diganose or monitoring purpose, userspace could use it to understand
      what happens in the application.  Since userspace could dirty MADV_FREE
      pages without notice from kernel, this interface is the only place we
      can get accurate accounting info about MADV_FREE pages.
      
      [mhocko@kernel.org: update Documentation/filesystems/proc.txt]
      Link: http://lkml.kernel.org/r/89efde633559de1ec07444f2ef0f4963a97a2ce8.1487965799.git.shli@fb.comSigned-off-by: NShaohua Li <shli@fb.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cf8496ea