1. 30 Apr, 2017 8 commits
    • E
      ext4: evict inline data when writing to memory map · 7b4cc978
      Eric Biggers committed
      Currently the case of writing via mmap to a file with inline data is not
      handled.  This is maybe a rare case since it requires a writable memory
      map of a very small file, but it is trivial to trigger on an
      inline_data filesystem, and it causes the
      'BUG_ON(ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA));' in
      ext4_writepages() to be hit:
      
          mkfs.ext4 -O inline_data /dev/vdb
          mount /dev/vdb /mnt
          xfs_io -f /mnt/file \
      	-c 'pwrite 0 1' \
      	-c 'mmap -w 0 1m' \
      	-c 'mwrite 0 1' \
      	-c 'fsync'
      
      	kernel BUG at fs/ext4/inode.c:2723!
      	invalid opcode: 0000 [#1] SMP
      	CPU: 1 PID: 2532 Comm: xfs_io Not tainted 4.11.0-rc1-xfstests-00301-g071d9acf3d1f #633
      	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
      	task: ffff88003d3a8040 task.stack: ffffc90000300000
      	RIP: 0010:ext4_writepages+0xc89/0xf8a
      	RSP: 0018:ffffc90000303ca0 EFLAGS: 00010283
      	RAX: 0000028410000000 RBX: ffff8800383fa3b0 RCX: ffffffff812afcdc
      	RDX: 00000a9d00000246 RSI: ffffffff81e660e0 RDI: 0000000000000246
      	RBP: ffffc90000303dc0 R08: 0000000000000002 R09: 869618e8f99b4fa5
      	R10: 00000000852287a2 R11: 00000000a03b49f4 R12: ffff88003808e698
      	R13: 0000000000000000 R14: 7fffffffffffffff R15: 7fffffffffffffff
      	FS:  00007fd3e53094c0(0000) GS:ffff88003e400000(0000) knlGS:0000000000000000
      	CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      	CR2: 00007fd3e4c51000 CR3: 000000003d554000 CR4: 00000000003406e0
      	Call Trace:
      	 ? _raw_spin_unlock+0x27/0x2a
      	 ? kvm_clock_read+0x1e/0x20
      	 do_writepages+0x23/0x2c
      	 ? do_writepages+0x23/0x2c
      	 __filemap_fdatawrite_range+0x80/0x87
      	 filemap_write_and_wait_range+0x67/0x8c
      	 ext4_sync_file+0x20e/0x472
      	 vfs_fsync_range+0x8e/0x9f
      	 ? syscall_trace_enter+0x25b/0x2d0
      	 vfs_fsync+0x1c/0x1e
      	 do_fsync+0x31/0x4a
      	 SyS_fsync+0x10/0x14
      	 do_syscall_64+0x69/0x131
      	 entry_SYSCALL64_slow_path+0x25/0x25
      
      We could try to be smart and keep the inline data in this case, or at
      least support delayed allocation when allocating the block, but these
      solutions would be more complicated and don't seem worthwhile given how
      rare this case seems to be.  So just fix the bug by calling
      ext4_convert_inline_data() when we're asked to make a page writable, so
      that any inline data gets evicted, with the block allocated immediately.
      Reported-by: Nick Alcock <nick.alcock@oracle.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • E
      ext4: remove ext4_xattr_check_entry() · 6ba644b9
      Eric Biggers committed
      ext4_xattr_check_entry() was redundant with validation of the full xattr
      entries list in ext4_xattr_check_entries(), which all callers also did.
      ext4_xattr_check_entry() also didn't actually do correct validation;
      specifically, it never checked that the value doesn't overlap the xattr
      names, nor did it account for padding when checking whether the xattr
      value overflows the available space.  So remove it to eliminate any
      potential confusion.
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • E
      ext4: rename ext4_xattr_check_names() to ext4_xattr_check_entries() · 2c4f9923
      Eric Biggers committed
      ext4_xattr_check_names() actually validates both the xattr names and
      values, not just the names.  So rename it to ext4_xattr_check_entries()
      to avoid confusion.
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • E
      ext4: merge ext4_xattr_list() into ext4_listxattr() · ba7ea1d8
      Eric Biggers committed
      There's no difference between ext4_xattr_list() and ext4_listxattr(), so
      merge them together and just have ext4_listxattr().  Some years ago they
      took different arguments, but that's no longer the case.
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • E
      ext4: constify static data that is never modified · d6006186
      Eric Biggers committed
      Constify static data in ext4 that is never (intentionally) modified so
      that it is placed in .rodata and benefits from memory protection.
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • E
      ext4: trim return value and 'dir' argument from ext4_insert_dentry() · 1bc0af60
      Eric Biggers committed
      In the initial implementation of ext4 encryption, the filename was
      encrypted in ext4_insert_dentry(), which could fail and also required
      access to the 'dir' inode.  Since then ext4 filename encryption has been
      changed to encrypt the filename earlier, so we can revert the additions
      to ext4_insert_dentry().
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • J
      jbd2: fix dbench4 performance regression for 'nobarrier' mounts · 5052b069
      Jan Kara committed
      Commit b685d3d6 "block: treat REQ_FUA and REQ_PREFLUSH as
      synchronous" removed REQ_SYNC flag from WRITE_FUA implementation. Since
      JBD2 strips REQ_FUA and REQ_FLUSH flags from submitted IO when the
      filesystem is mounted with nobarrier mount option, journal superblock
      writes ended up being async writes after this patch and that caused
      heavy performance regression for dbench4 benchmark with high number of
      processes. In my test setup with HP RAID array with non-volatile write
      cache and 32 GB ram, dbench4 runs with 8 processes regressed by ~25%.
      
      Fix the problem by making sure journal superblock writes are always
      treated as synchronous since they generally block progress of the
      journalling machinery and thus the whole filesystem.
      
      Fixes: b685d3d6
      CC: stable@vger.kernel.org
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • J
      jbd2: Fix lockdep splat with generic/270 test · c52c47e4
      Jan Kara committed
      I've hit a lockdep splat with generic/270 test complaining that:
      
      3216.fsstress.b/3533 is trying to acquire lock:
       (jbd2_handle){++++..}, at: [<ffffffff813152e0>] jbd2_log_wait_commit+0x0/0x150
      
      but task is already holding lock:
       (jbd2_handle){++++..}, at: [<ffffffff8130bd3b>] start_this_handle+0x35b/0x850
      
      The underlying problem is that jbd2_journal_force_commit_nested()
      (called from ext4_should_retry_alloc()) may get called while a
      transaction handle is started. In such case it takes care to not wait
      for commit of the running transaction (which would deadlock) but only
      for a commit of a transaction that is already committing (which is safe
      as that doesn't wait for any filesystem locks).
      
      In fact there are also other callers of jbd2_log_wait_commit() that take
      care to pass tid of a transaction that is already committing and for
      those cases, the lockdep instrumentation is too restrictive and leads
      to false positive reports. Fix the problem by calling
      jbd2_might_wait_for_commit() from jbd2_log_wait_commit() only if the
      transaction isn't already committing.
      
      Fixes: 1eaa566d
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  2. 08 Apr, 2017 5 commits
    • N
      sysfs: be careful of error returns from ops->show() · c8a139d0
      NeilBrown committed
      ops->show() can return a negative error code.
      Commit 65da3484 ("sysfs: correctly handle short reads on PREALLOC attrs.")
      (in v4.4) caused this to be stored in an unsigned 'size_t' variable, so errors
      would look like large numbers.
      As a result, if an error is returned, sysfs_kf_read() will return the
      value of 'count', typically 4096.
      
      Commit 17d0774f ("sysfs: correctly handle read offset on PREALLOC attrs")
      (in v4.8) extended this error to use the unsigned large 'len' as a size for
      memmove().
      Consequently, if ->show returns an error, then the first read() on the
      sysfs file will return 4096 and could return uninitialized memory to
      user-space.
      If the application performs a subsequent read, this will trigger a memmove()
      with an extremely large count, and is likely to crash the machine in bizarre ways.
      
      This bug can currently only be triggered by reading from an md
      sysfs attribute declared with __ATTR_PREALLOC() during the
      brief period between when mddev_put() deletes an mddev from
      the ->all_mddevs list, and when mddev_delayed_delete() - which is
      scheduled on a workqueue - completes.
      Before this, an error won't be returned by ->show().
      After this, ->show() won't be called.
      
      I can reproduce it reliably only by adding a delay such as
      	usleep_range(500000,700000);
      early in mddev_delayed_delete(). Then after creating an
      md device md0 run
        echo clear > /sys/block/md0/md/array_state; cat /sys/block/md0/md/array_state
      
      The bug can be triggered without the usleep.
      
      Fixes: 65da3484 ("sysfs: correctly handle short reads on PREALLOC attrs.")
      Fixes: 17d0774f ("sysfs: correctly handle read offset on PREALLOC attrs")
      Cc: stable@vger.kernel.org
      Signed-off-by: NeilBrown <neilb@suse.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Miroslav Benes <mbenes@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
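The signedness problem described in the sysfs commit above can be reproduced in user space. The following is a minimal sketch, not the sysfs code itself: `failing_show()`, `read_buggy()` and `read_fixed()` are hypothetical names, and the error value `-EBUSY` is chosen only for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for an attribute ->show() callback that fails. */
static ssize_t failing_show(char *buf)
{
	(void)buf;
	return -EBUSY;
}

/*
 * Buggy pattern: the signed return value is stored in an unsigned
 * size_t, so a negative error code turns into a huge length that is
 * then clamped to 'count' and reported as a successful read.
 */
ssize_t read_buggy(char *buf, size_t count)
{
	size_t len = (size_t)failing_show(buf);

	if (len > count)
		len = count;
	return (ssize_t)len;
}

/* Fixed pattern: keep the value signed and propagate errors first. */
ssize_t read_fixed(char *buf, size_t count)
{
	ssize_t len = failing_show(buf);

	if (len < 0)
		return len;
	if ((size_t)len > count)
		len = (ssize_t)count;
	return len;
}
```

With a request size of 4096, the buggy variant reports 4096 bytes read even though nothing was written to the buffer, which matches the behaviour described in the commit message; the fixed variant returns the error code.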
    • R
      dax: fix radix tree insertion race · e11f8b7b
      Ross Zwisler committed
      While running generic/340 in my test setup I hit the following race.  It
      can happen with kernels that support FS DAX PMDs, so v4.10 through
      v4.11-rc5.
      
      Thread 1				Thread 2
      --------				--------
      dax_iomap_pmd_fault()
        grab_mapping_entry()
          spin_lock_irq()
          get_unlocked_mapping_entry()
          'entry' is NULL, can't call lock_slot()
          spin_unlock_irq()
          radix_tree_preload()
      					dax_iomap_pmd_fault()
      					  grab_mapping_entry()
      					    spin_lock_irq()
      					    get_unlocked_mapping_entry()
      					    ...
      					    lock_slot()
      					    spin_unlock_irq()
      					  dax_pmd_insert_mapping()
      					    <inserts a PMD mapping>
          spin_lock_irq()
          __radix_tree_insert() fails with -EEXIST
          <fall back to 4k fault, and die horribly
           when inserting a 4k entry where a PMD exists>
      
      The issue is that we have to drop mapping->tree_lock while calling
      radix_tree_preload(), but since we didn't have a radix tree entry to
      lock (unlike in the pmd_downgrade case) we have no protection against
      Thread 2 coming along and inserting a PMD at the same index.  For 4k
      entries we handled this with a special-case response to -EEXIST coming
      from the __radix_tree_insert(), but this doesn't save us for PMDs
      because the -EEXIST case can also mean that we collided with a 4k entry
      in the radix tree at a different index, but one that is covered by our
      PMD range.
      
      So, correctly handle both the 4k and 2M collision cases by explicitly
      re-checking the radix tree for an entry at our index once we reacquire
      mapping->tree_lock.
      
      This patch has made it through a clean xfstests run with the current
      v4.11-rc5 based linux/master, and it also ran generic/340 500 times in a
      loop.  It used to fail within the first 10 iterations.
      
      Link: http://lkml.kernel.org/r/20170406212944.2866-1-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: <stable@vger.kernel.org>    [4.10+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      userfaultfd: report actual registered features in fdinfo · 045098e9
      Mike Rapoport committed
      fdinfo for userfault file descriptor reports UFFD_API_FEATURES.  Up
      until recently, the UFFD_API_FEATURES was defined as 0, therefore
      corresponding field in fdinfo always contained zero.  Now, with
      introduction of several additional features, UFFD_API_FEATURES is no
      longer 0 and it seems better to report actual features requested for the
      userfaultfd object described by the fdinfo.
      
      First, the applications that were using userfault will still see zero at
      the features field in fdinfo.  Next, reporting actual features rather
      than available features, gives clear indication of what userfault
      features are used by an application.
      
      Link: http://lkml.kernel.org/r/1491140181-22121-1-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      orangefs: move features validation to fix filesystem hang · cefdc26e
      Martin Brandenburg committed
      Without this fix (and another to the userspace component itself
      described later), the kernel will be unable to process any OrangeFS
      requests after the userspace component is restarted (due to a crash or
      at the administrator's behest).
      
      The bug here is that inside orangefs_remount, the orangefs_request_mutex
      is locked.  When the userspace component restarts while the filesystem
      is mounted, it sends an ORANGEFS_DEV_REMOUNT_ALL ioctl to the device,
      which causes the kernel to send it a few requests aimed at synchronizing
      the state between the two.  While this is happening the
      orangefs_request_mutex is locked to prevent any other requests going
      through.
      
      This is only half of the bugfix.  The other half is in the userspace
      component which outright ignores(!) requests made before it considers
      the filesystem remounted, which is after the ioctl returns.  Of course
      the ioctl doesn't return until after the userspace component responds to
      the request it ignores.  The userspace component has been changed to
      allow ORANGEFS_VFS_OP_FEATURES regardless of the mount status.
      
      Mike Marshall says:
       "I've tested this patch against the fixed userspace part. This patch is
        real important, I hope it can make it into 4.11...
      
        Here's what happens when the userspace daemon is restarted, without
        the patch:
      
          =============================================
          [ INFO: possible recursive locking detected ]
          [   4.10.0-00007-ge98bdb30 #1 Not tainted    ]
          ---------------------------------------------
          pvfs2-client-co/29032 is trying to acquire lock:
           (orangefs_request_mutex){+.+.+.}, at: service_operation+0x3c7/0x7b0 [orangefs]
                        but task is already holding lock:
           (orangefs_request_mutex){+.+.+.}, at: dispatch_ioctl_command+0x1bf/0x330 [orangefs]
      
          CPU: 0 PID: 29032 Comm: pvfs2-client-co Not tainted 4.10.0-00007-ge98bdb30 #1
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
          Call Trace:
           __lock_acquire+0x7eb/0x1290
           lock_acquire+0xe8/0x1d0
           mutex_lock_killable_nested+0x6f/0x6e0
           service_operation+0x3c7/0x7b0 [orangefs]
           orangefs_remount+0xea/0x150 [orangefs]
           dispatch_ioctl_command+0x227/0x330 [orangefs]
           orangefs_devreq_ioctl+0x29/0x70 [orangefs]
           do_vfs_ioctl+0xa3/0x6e0
           SyS_ioctl+0x79/0x90"
      Signed-off-by: Martin Brandenburg <martin@omnibond.com>
      Acked-by: Mike Marshall <hubcap@omnibond.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • L
      sysctl: add sanity check for proc_douintvec · 1680a386
      Liping Zhang committed
      Commit e7d316a0 ("sysctl: handle error writing UINT_MAX to u32
      fields") introduced the proc_douintvec helper function, but it forgot to
      add the related sanity check when doing register_sysctl_table.  So add
      it now.
      Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
      Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 07 Apr, 2017 5 commits
  4. 04 Apr, 2017 3 commits
    • D
      xfs: fix kernel memory exposure problems · bf9216f9
      Darrick J. Wong committed
      Fix a memory exposure problem in inumbers where we allocate an array of
      structures with holes, fail to zero the holes, then blindly copy the
      kernel memory contents (junk and all) into userspace.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
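The class of bug described in the xfs commit above (structures with holes copied out without zeroing) can be illustrated with a small user-space sketch. The struct layout and the names `demo_inogrp`, `demo_fill()` and `demo_hole_is_zero()` are assumptions made for this example and are not taken from the XFS sources; the point is only that zeroing the whole buffer first leaves the padding bytes in a known state.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record with an alignment hole after 'alloccount'. */
struct demo_inogrp {
	uint64_t startino;
	uint32_t alloccount;
	/* compilers typically insert 4 bytes of padding here */
	uint64_t allocmask;
};

/* Zero the whole array first, then fill in the named fields. */
void demo_fill(struct demo_inogrp *buf, size_t n)
{
	memset(buf, 0, n * sizeof(*buf));
	for (size_t i = 0; i < n; i++) {
		buf[i].startino = 64 * i;
		buf[i].alloccount = 64;
		buf[i].allocmask = ~(uint64_t)0;
	}
}

/* Returns 1 if every byte between alloccount and allocmask is zero. */
int demo_hole_is_zero(const struct demo_inogrp *g)
{
	const unsigned char *p = (const unsigned char *)g;
	size_t start = offsetof(struct demo_inogrp, alloccount) +
		       sizeof(g->alloccount);
	size_t end = offsetof(struct demo_inogrp, allocmask);

	for (size_t i = start; i < end; i++)
		if (p[i] != 0)
			return 0;
	return 1;
}
```

Without the `memset()`, the bytes in the hole would keep whatever the buffer previously contained, and copying the array to another address space would carry that stale data along.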
    • C
      xfs: Honor FALLOC_FL_KEEP_SIZE when punching ends of files · 3dd09d5a
      Calvin Owens committed
      When punching past EOF on XFS, fallocate(mode=PUNCH_HOLE|KEEP_SIZE) will
      round the file size up to the nearest multiple of PAGE_SIZE:
      
        calvinow@vm-disks/generic-xfs-1 ~$ dd if=/dev/urandom of=test bs=2048 count=1
        calvinow@vm-disks/generic-xfs-1 ~$ stat test
          Size: 2048            Blocks: 8          IO Block: 4096   regular file
        calvinow@vm-disks/generic-xfs-1 ~$ fallocate -n -l 2048 -o 2048 -p test
        calvinow@vm-disks/generic-xfs-1 ~$ stat test
          Size: 4096            Blocks: 8          IO Block: 4096   regular file
      
      Commit 3c2bdc91 ("xfs: kill xfs_zero_remaining_bytes") replaced
      xfs_zero_remaining_bytes() with calls to iomap helpers. The new helpers
      don't enforce that [pos,offset) lies strictly on [0,i_size) when being
      called from xfs_free_file_space(), so by "leaking" these ranges into
      xfs_zero_range() we get this buggy behavior.
      
      Fix this by reintroducing the checks xfs_zero_remaining_bytes() did
      against i_size at the bottom of xfs_free_file_space().
      Reported-by: Aaron Gao <gzh@fb.com>
      Fixes: 3c2bdc91 ("xfs: kill xfs_zero_remaining_bytes")
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: <stable@vger.kernel.org> # 4.8+
      Signed-off-by: Calvin Owens <calvinowens@fb.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
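The kind of bounds check the KEEP_SIZE commit above refers to can be sketched as a range clamp against the file size. This is an illustration under assumptions, not the code in xfs_free_file_space(): the helper name `demo_zeroable_len()` is hypothetical, and it only shows that a zeroing range starting at or beyond i_size should contribute nothing.

```c
#include <stdint.h>

/*
 * Return how many bytes of [offset, offset + len) lie below i_size.
 * Only that part may be zeroed without extending the file.
 */
uint64_t demo_zeroable_len(uint64_t offset, uint64_t len, uint64_t i_size)
{
	if (offset >= i_size)
		return 0;               /* range starts at or past EOF */
	if (len > i_size - offset)
		return i_size - offset; /* trim the tail that crosses EOF */
	return len;
}
```

For the reproduction shown in the commit message (a 2048-byte file punched at offset 2048 for 2048 bytes), the clamp yields zero bytes, so the file size would stay at 2048.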
    • D
      xfs: rework the inline directory verifiers · 78420281
      Darrick J. Wong committed
      The inline directory verifiers should be called on the inode fork data,
      which means after iformat_local on the read side, and prior to
      ifork_flush on the write side.  This makes the fork verifier more
      consistent with the way buffer verifiers work -- i.e. they will operate
      on the memory buffer that the code will be reading and writing directly.
      
      Furthermore, revise the verifier function to return -EFSCORRUPTED so
      that we don't flood the logs with corruption messages and assert
      notices.  This has been a particular problem with xfs/348, which
      triggers the XFS_WANT_CORRUPTED_RETURN assertions, which halts the
      kernel when CONFIG_XFS_DEBUG=y.  Disk corruption isn't supposed to do
      that, at least not in a verifier.
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  5. 03 Apr, 2017 7 commits
    • D
      statx: Include a mask for stx_attributes in struct statx · 3209f68b
      David Howells committed
      Include a mask in struct statx to indicate which bits of stx_attributes the
      filesystem actually supports.
      
      This would also be useful if we add another system call that allows you to
      do a 'bulk attribute set' and pass in a statx struct with the masks
      appropriately set to say what you want to set.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • D
      statx: Reserve the top bit of the mask for future struct expansion · 47071aee
      David Howells committed
      Reserve the top bit of the mask for future expansion of the statx struct
      and give an error if statx() sees it set.  All the other bits are ignored
      if we see them set but don't support the bit; we just clear the bit in the
      returned mask.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
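The mask handling described in the commit above (error on the reserved top bit, silently clear other unsupported bits) can be sketched as follows. The constant names and the supported-bits value are assumptions for this example, not the kernel's definitions.

```c
#include <errno.h>
#include <stdint.h>

/* Illustrative values only; real definitions live in kernel headers. */
#define DEMO_MASK_RESERVED  0x80000000u /* top bit, reserved for expansion */
#define DEMO_MASK_SUPPORTED 0x00000fffu /* bits this sketch "implements" */

/*
 * Reject a request that sets the reserved bit; otherwise report back
 * only the requested bits that are actually supported.
 */
int demo_filter_mask(uint32_t requested, uint32_t *result)
{
	if (requested & DEMO_MASK_RESERVED)
		return -EINVAL;
	*result = requested & DEMO_MASK_SUPPORTED;
	return 0;
}
```

A caller can compare the returned mask with what it asked for to learn which fields were filled in, while a set reserved bit fails loudly instead of being ignored.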
    • D
      xfs: report crtime and attribute flags to statx · 5f955f26
      Darrick J. Wong committed
      statx has the ability to report inode creation times and inode flags, so
      hook up di_crtime and di_flags to that functionality.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • D
      ext4: Add statx support · 99652ea5
      David Howells committed
      Return enhanced file attributes from the Ext4 filesystem.  This includes
      the following:
      
       (1) The inode creation time (i_crtime) as stx_btime, setting STATX_BTIME.
      
       (2) Certain FS_xxx_FL flags are mapped to stx_attributes flags.
      
      This requires that all ext4 inodes have a getattr call, not just some of
      them, so to this end, split the ext4_getattr() function and only call part
      of it where appropriate.
      
      Example output:
      
      	[root@andromeda ~]# touch foo
      	[root@andromeda ~]# chattr +ai foo
      	[root@andromeda ~]# /tmp/test-statx foo
      	statx(foo) = 0
      	results=fff
      	  Size: 0               Blocks: 0          IO Block: 4096    regular file
      	Device: 08:12           Inode: 2101950     Links: 1
      	Access: (0644/-rw-r--r--)  Uid:     0   Gid:     0
      	Access: 2016-02-11 17:08:29.031795451+0000
      	Modify: 2016-02-11 17:08:29.031795451+0000
      	Change: 2016-02-11 17:11:11.987790114+0000
      	 Birth: 2016-02-11 17:08:29.031795451+0000
      	Attributes: 0000000000000030 (-------- -------- -------- -------- -------- -------- -------- --ai----)
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • E
      statx: optimize copy of struct statx to userspace · 64bd7204
      Eric Biggers committed
      I found that statx() was significantly slower than stat().  As a
      microbenchmark, I compared 10,000,000 invocations of fstat() on a tmpfs
      file to the same with statx() passed a NULL path:
      
      	$ time ./stat_benchmark
      
      	real	0m1.464s
      	user	0m0.275s
      	sys	0m1.187s
      
      	$ time ./statx_benchmark
      
      	real	0m5.530s
      	user	0m0.281s
      	sys	0m5.247s
      
      statx is expected to be a little slower than stat because struct statx
      is larger than struct stat, but not by *that* much.  It turns out that
      most of the overhead was in copying struct statx to userspace, mostly in
      all the stac/clac instructions that got generated for each __put_user()
      call.  (This was on x86_64, but some other architectures, e.g. arm64,
      have something similar now too.)
      
      stat() instead initializes its struct on the stack and copies it to
      userspace with a single call to copy_to_user().  This turns out to be
      much faster, and changing statx to do this makes it almost as fast as
      stat:
      
      	$ time ./statx_benchmark
      
      	real	0m1.624s
      	user	0m0.270s
      	sys	0m1.354s
      
      For zeroing the reserved fields, start by zeroing the full struct with
      memset.  This makes it clear that every byte copied to userspace is
      initialized, even implicit padding bytes (though there are none
      currently).  In the scenarios I tested, it also performed the same as a
      designated initializer.  Manually initializing each field was still
      slightly faster, but would have been more error-prone and less
      verifiable.
      
      Also rename statx_set_result() to cp_statx() for consistency with
      cp_old_stat() et al., and make it noinline so that struct statx doesn't
      add to the stack usage during the main portion of the syscall execution.
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
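The approach described in the commit above (build the result on the stack, zero it fully, then copy it out in one call) might look roughly like this in user space. The struct is a hypothetical, much-reduced stand-in for struct statx, `demo_cp_statx()` is an invented name, and `memcpy()` stands in for copy_to_user(), which is not available outside the kernel.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, much-reduced stand-in for struct statx. */
struct demo_statx {
	uint32_t mask;
	uint32_t blksize;
	uint64_t size;
	uint64_t reserved[4]; /* must reach the caller as zeroes */
};

/*
 * Fill a local struct and hand it over with a single copy.
 * memcpy() stands in for copy_to_user() in this sketch.
 */
int demo_cp_statx(uint64_t size, uint32_t blksize, void *user_buf)
{
	struct demo_statx tmp;

	/* Zero everything first so reserved fields and padding are defined. */
	memset(&tmp, 0, sizeof(tmp));
	tmp.mask = 0x1; /* hypothetical "size is valid" bit */
	tmp.blksize = blksize;
	tmp.size = size;

	memcpy(user_buf, &tmp, sizeof(tmp));
	return 0;
}
```

The single bulk copy replaces one user-space store per field, which is where the commit message locates most of the overhead.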
    • E
      statx: remove incorrect part of vfs_statx() comment · b15fb70b
      Eric Biggers committed
      request_mask and query_flags are function arguments, not passed in
      struct kstat.  So remove the part of the comment which claims otherwise.
      This was apparently left over from an earlier version of the statx
      patch.
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • E
      statx: reject unknown flags when using NULL path · 8c7493aa
      Eric Biggers committed
      The statx() system call currently accepts unknown flags when called with
      a NULL path to operate on a file descriptor.  Left unchanged, this could
      make it hard to introduce new query flags in the future, since
      applications may not be able to tell whether a given flag is supported.
      
      Fix this by failing the system call with EINVAL if any flags other than
      KSTAT_QUERY_FLAGS are specified in combination with a NULL path.
      
      Arguably, we could still permit known lookup-related flags such as
      AT_SYMLINK_NOFOLLOW.  However, that would be inconsistent with how
      sys_utimensat() behaves when passed a NULL path, which seems to be the
      closest precedent.  And given that the NULL path case is (I believe)
      mainly intended to be used to implement a wrapper function like fstatx()
      that doesn't have a path argument, I think rejecting lookup-related
      flags too is probably the best choice.
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
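The flag check described in the commit above can be sketched as a small validation helper. The numeric values and the names `DEMO_QUERY_FLAGS`, `DEMO_SYMLINK_NOFOLLOW` and `demo_check_flags()` are placeholders chosen for this example; only the rule itself (with a NULL path, anything outside the query flags is rejected) comes from the commit message.

```c
#include <errno.h>
#include <stddef.h>

/* Illustrative values only; the real constants live in kernel headers. */
#define DEMO_QUERY_FLAGS      0x6000u /* stand-in for KSTAT_QUERY_FLAGS */
#define DEMO_SYMLINK_NOFOLLOW 0x0100u /* stand-in for a lookup flag */

/*
 * With a NULL path the call operates on a file descriptor, so any
 * flag outside the query flags, including lookup flags, is an error.
 */
int demo_check_flags(const char *path, unsigned int flags)
{
	if (path == NULL && (flags & ~DEMO_QUERY_FLAGS))
		return -EINVAL;
	return 0;
}
```

Rejecting unknown bits up front is what lets an application probe for a future flag: an unsupported flag produces EINVAL rather than being silently accepted.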
  6. 01 Apr, 2017 3 commits
    • M
      hugetlbfs: initialize shared policy as part of inode allocation · 4742a35d
      Mike Kravetz committed
      Any time after inode allocation, destroy_inode can be called.  The
      hugetlbfs inode contains a shared_policy structure, and
      mpol_free_shared_policy is unconditionally called as part of
      hugetlbfs_destroy_inode.  Initialize the policy as part of inode
      allocation so that any quick (error path) calls to destroy_inode will be
      handed an initialized policy.
      
      syzkaller fuzzer found this bug, that resulted in the following:
      
          BUG: KASAN: user-memory-access in atomic_inc
          include/asm-generic/atomic-instrumented.h:87 [inline] at addr
          000000131730bd7a
          BUG: KASAN: user-memory-access in __lock_acquire+0x21a/0x3a80
          kernel/locking/lockdep.c:3239 at addr 000000131730bd7a
          Write of size 4 by task syz-executor6/14086
          CPU: 3 PID: 14086 Comm: syz-executor6 Not tainted 4.11.0-rc3+ #364
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
          Call Trace:
           atomic_inc include/asm-generic/atomic-instrumented.h:87 [inline]
           __lock_acquire+0x21a/0x3a80 kernel/locking/lockdep.c:3239
           lock_acquire+0x1ee/0x590 kernel/locking/lockdep.c:3762
           __raw_write_lock include/linux/rwlock_api_smp.h:210 [inline]
           _raw_write_lock+0x33/0x50 kernel/locking/spinlock.c:295
           mpol_free_shared_policy+0x43/0xb0 mm/mempolicy.c:2536
           hugetlbfs_destroy_inode+0xca/0x120 fs/hugetlbfs/inode.c:952
           alloc_inode+0x10d/0x180 fs/inode.c:216
           new_inode_pseudo+0x69/0x190 fs/inode.c:889
           new_inode+0x1c/0x40 fs/inode.c:918
           hugetlbfs_get_inode+0x40/0x420 fs/hugetlbfs/inode.c:734
           hugetlb_file_setup+0x329/0x9f0 fs/hugetlbfs/inode.c:1282
           newseg+0x422/0xd30 ipc/shm.c:575
           ipcget_new ipc/util.c:285 [inline]
           ipcget+0x21e/0x580 ipc/util.c:639
           SYSC_shmget ipc/shm.c:673 [inline]
           SyS_shmget+0x158/0x230 ipc/shm.c:657
           entry_SYSCALL_64_fastpath+0x1f/0xc2
      
      Analysis provided by Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      
      Link: http://lkml.kernel.org/r/1490477850-7944-1-git-send-email-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • T
      nfs: flexfiles: fix kernel OOPS if MDS returns unsupported DS type · f17f8a14
      Tigran Mkrtchyan committed
      Fix the dereferencing of a mirror in an error state when the MDS
      returns an unsupported DS type (in other words, not v3), which causes the following oops:
      
      [  220.370709] BUG: unable to handle kernel NULL pointer dereference at 0000000000000065
      [  220.370842] IP: ff_layout_mirror_valid+0x2d/0x110 [nfs_layout_flexfiles]
      [  220.370920] PGD 0
      
      [  220.370972] Oops: 0000 [#1] SMP
      [  220.371013] Modules linked in: nfnetlink_queue nfnetlink_log bluetooth nfs_layout_flexfiles rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_raw ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security iptable_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_security ebtable_filter ebtables ip6table_filter ip6_tables binfmt_misc intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel btrfs kvm arc4 snd_hda_codec_hdmi iwldvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate mac80211 xor uvcvideo
      [  220.371814]  videobuf2_vmalloc videobuf2_memops snd_hda_codec_idt mei_wdt videobuf2_v4l2 snd_hda_codec_generic iTCO_wdt ppdev videobuf2_core iTCO_vendor_support dell_rbtn dell_wmi iwlwifi sparse_keymap dell_laptop dell_smbios snd_hda_intel dcdbas videodev snd_hda_codec dell_smm_hwmon snd_hda_core media cfg80211 intel_uncore snd_hwdep raid6_pq snd_seq intel_rapl_perf snd_seq_device joydev i2c_i801 rfkill lpc_ich snd_pcm parport_pc mei_me parport snd_timer dell_smo8800 mei snd shpchp soundcore tpm_tis tpm_tis_core tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc i915 nouveau mxm_wmi ttm i2c_algo_bit drm_kms_helper crc32c_intel e1000e drm sdhci_pci firewire_ohci sdhci serio_raw mmc_core firewire_core ptp crc_itu_t pps_core wmi fjes video
      [  220.372568] CPU: 7 PID: 4988 Comm: cat Not tainted 4.10.5-200.fc25.x86_64 #1
      [  220.372647] Hardware name: Dell Inc. Latitude E6520/0J4TFW, BIOS A06 07/11/2011
      [  220.372729] task: ffff94791f6ea580 task.stack: ffffb72b88c0c000
      [  220.372802] RIP: 0010:ff_layout_mirror_valid+0x2d/0x110 [nfs_layout_flexfiles]
      [  220.372883] RSP: 0018:ffffb72b88c0f970 EFLAGS: 00010246
      [  220.372945] RAX: 0000000000000000 RBX: ffff9479015ca600 RCX: ffffffffffffffed
      [  220.373025] RDX: ffffffffffffffed RSI: ffff9479753dc980 RDI: 0000000000000000
      [  220.373104] RBP: ffffb72b88c0f988 R08: 000000000001c980 R09: ffffffffc0ea6112
      [  220.373184] R10: ffffef17477d9640 R11: ffff9479753dd6c0 R12: ffff9479211c7440
      [  220.373264] R13: ffff9478f45b7790 R14: 0000000000000001 R15: ffff9479015ca600
      [  220.373345] FS:  00007f555fa3e700(0000) GS:ffff9479753c0000(0000) knlGS:0000000000000000
      [  220.373435] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  220.373506] CR2: 0000000000000065 CR3: 0000000196044000 CR4: 00000000000406e0
      [  220.373586] Call Trace:
      [  220.373627]  nfs4_ff_layout_prepare_ds+0x5e/0x200 [nfs_layout_flexfiles]
      [  220.373708]  ff_layout_pg_init_read+0x81/0x160 [nfs_layout_flexfiles]
      [  220.373806]  __nfs_pageio_add_request+0x11f/0x4a0 [nfs]
      [  220.373886]  ? nfs_create_request.part.14+0x37/0x330 [nfs]
      [  220.373967]  nfs_pageio_add_request+0xb2/0x260 [nfs]
      [  220.374042]  readpage_async_filler+0xaf/0x280 [nfs]
      [  220.374103]  read_cache_pages+0xef/0x1b0
      [  220.374166]  ? nfs_read_completion+0x210/0x210 [nfs]
      [  220.374239]  nfs_readpages+0x129/0x200 [nfs]
      [  220.374293]  __do_page_cache_readahead+0x1d0/0x2f0
      [  220.374352]  ondemand_readahead+0x17d/0x2a0
      [  220.374403]  page_cache_sync_readahead+0x2e/0x50
      [  220.374460]  generic_file_read_iter+0x6c8/0x950
      [  220.374532]  ? nfs_mapping_need_revalidate_inode+0x17/0x40 [nfs]
      [  220.374617]  nfs_file_read+0x6e/0xc0 [nfs]
      [  220.374670]  __vfs_read+0xe2/0x150
      [  220.374715]  vfs_read+0x96/0x130
      [  220.374758]  SyS_read+0x55/0xc0
      [  220.374801]  entry_SYSCALL_64_fastpath+0x1a/0xa9
      [  220.374856] RIP: 0033:0x7f555f570bd0
      [  220.374900] RSP: 002b:00007ffeb73e1b38 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
      [  220.374986] RAX: ffffffffffffffda RBX: 00007f555f839ae0 RCX: 00007f555f570bd0
      [  220.375066] RDX: 0000000000020000 RSI: 00007f555fa41000 RDI: 0000000000000003
      [  220.375145] RBP: 0000000000021010 R08: ffffffffffffffff R09: 0000000000000000
      [  220.375226] R10: 00007f555fa40010 R11: 0000000000000246 R12: 0000000000022000
      [  220.375305] R13: 0000000000021010 R14: 0000000000001000 R15: 0000000000002710
      [  220.375386] Code: 66 66 90 55 48 89 e5 41 54 53 49 89 fc 48 83 ec 08 48 85 f6 74 2e 48 8b 4e 30 48 89 f3 48 81 f9 00 f0 ff ff 77 1e 48 85 c9 74 15 <48> 83 79 78 00 b8 01 00 00 00 74 2c 48 83 c4 08 5b 41 5c 5d c3
      [  220.375653] RIP: ff_layout_mirror_valid+0x2d/0x110 [nfs_layout_flexfiles] RSP: ffffb72b88c0f970
      [  220.375748] CR2: 0000000000000065
      [  220.403538] ---[ end trace bcdca752211b7da9 ]---
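
One way to read the register dump above, offered as an observation rather than as part of the original changelog: RCX holds 0xffffffffffffffed, which is -19 as a signed 64-bit value (19 is ENODEV on Linux), and the faulting address in CR2 (0x65) equals that value plus 0x78, the displacement encoded in the faulting instruction bytes `48 83 79 78 00`. That is consistent with an error code stored in a pointer field being dereferenced at a member offset, although the trace alone does not prove it. The userspace sketch below only reproduces the arithmetic; `fault_address` and `as_signed` are hypothetical helpers, not kernel code.

```c
#include <stdint.h>

/*
 * Hypothetical helper: address touched when a pointer-sized value `base`
 * is dereferenced at byte offset `off`. Unsigned arithmetic wraps modulo
 * 2^64, which matches what the CPU does when forming the address.
 */
static uint64_t fault_address(uint64_t base, uint64_t off)
{
	return base + off;
}

/* Hypothetical helper: reinterpret a register value as a signed integer. */
static int64_t as_signed(uint64_t v)
{
	return (int64_t)v;
}
```

Calling `fault_address(0xffffffffffffffedULL, 0x78)` yields 0x65, the value reported in CR2.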
      Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • O
      NFSv4.1 fix infinite loop on IO BAD_STATEID error · 0e3d3e5d
      Olga Kornievskaia committed
      Commit 63d63cbf "NFSv4.1: Don't recheck delegations that
      have already been checked" introduced a regression: when a
      client received a BAD_STATEID error, it would not send any TEST_STATEID
      and would instead go into an infinite loop of resending the I/O that
      caused the BAD_STATEID.
      
      Fixes: 63d63cbf ("NFSv4.1: Don't recheck delegations that have already been checked")
      Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
      Cc: stable@vger.kernel.org # 4.9+
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  7. 31 Mar, 2017 1 commit
  8. 29 Mar, 2017 3 commits
    • D
      Btrfs: fix an integer overflow check · 457ae726
      Dan Carpenter committed
      This isn't super serious because you need CAP_SYS_ADMIN to run this code.
      
      I added this integer overflow check last year but apparently I am
      rubbish at writing integer overflow checks...  There are two issues.
      First, access_ok() works on unsigned long type and not u64 so on 32 bit
      systems the access_ok() could be checking a truncated size.  The other
      issue is that we should be using a stricter limit so we don't overflow
      the kzalloc() setting sctx->clone_roots later in the function after the
      access_ok():
      
      	alloc_size = sizeof(struct clone_root) * (arg->clone_sources_count + 1);
      	sctx->clone_roots = kzalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN);
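
A minimal userspace sketch of the two issues described above, assuming a hypothetical `struct clone_root_demo` in place of the kernel's `struct clone_root` and plain `size_t` arithmetic in place of `access_ok()` and `kzalloc()`. It is not the actual patch; it only illustrates bounding a 64-bit count so that `sizeof(elem) * (count + 1)` cannot wrap in the platform's `size_t`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel structure; only its size matters. */
struct clone_root_demo {
	uint64_t root;
	uint64_t ino;
	uint64_t offset;
};

/*
 * Return true if (count + 1) elements of elem_size bytes fit in size_t.
 * The comparison is done in 64 bits before any narrowing, so a count
 * supplied as a 64-bit value is not silently truncated on a 32-bit build.
 */
static bool clone_count_ok(uint64_t count, size_t elem_size)
{
	uint64_t max_elems;

	if (elem_size == 0)
		return false;

	max_elems = (uint64_t)SIZE_MAX / elem_size;

	/* count < max_elems implies count + 1 <= max_elems, without ever
	 * computing count + 1 (which could itself wrap). */
	return count < max_elems;
}
```

A caller would reject the request when `clone_count_ok()` returns false and only then compute the allocation size.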
      
      Fixes: f5ecec3c ("btrfs: send: silence an integer overflow warning")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ added comment ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • G
      btrfs: Change qgroup_meta_rsv to 64bit · ce0dcee6
      Goldwyn Rodrigues committed
      Using an int value is causing qg->reserved to become negative and
      exclusive -EDQUOT to be reached prematurely.
      
      This affects exclusive qgroups only.
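
The effect of a too-narrow counter can be sketched in userspace as follows. This is an illustration under stated assumptions, not the kernel code: `rsv_as_int32` and `rsv_as_int64` are hypothetical helpers, and the byte counts are illustrative rather than taken from the test case. The sketch shows how a reservation total that no longer fits in 32 signed bits appears negative, while a 64-bit value holds it correctly.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: the value a 32-bit signed counter would hold for
 * `bytes`, i.e. the low 32 bits reinterpreted as two's-complement int32_t.
 * memcpy is used so the reinterpretation is well defined in C.
 */
static int32_t rsv_as_int32(uint64_t bytes)
{
	uint32_t low = (uint32_t)bytes;
	int32_t out;

	memcpy(&out, &low, sizeof(out));
	return out;
}

/* Hypothetical helper: the same total held in a 64-bit signed counter. */
static int64_t rsv_as_int64(uint64_t bytes)
{
	return (int64_t)bytes;
}
```

For a 3 GiB total (`3ULL << 30`), the 32-bit view is -1073741824 while the 64-bit view is 3221225472.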
      
      TEST CASE:
      
      DEVICE=/dev/vdb
      MOUNTPOINT=/mnt
      SUBVOL=$MOUNTPOINT/tmp
      
      umount $SUBVOL
      umount $MOUNTPOINT
      
      mkfs.btrfs -f $DEVICE
      mount /dev/vdb $MOUNTPOINT
      btrfs quota enable $MOUNTPOINT
      btrfs subvol create $SUBVOL
      umount $MOUNTPOINT
      mount /dev/vdb $MOUNTPOINT
      mount -o subvol=tmp $DEVICE $SUBVOL
      btrfs qgroup limit -e 3G $SUBVOL
      
      btrfs quota rescan /mnt -w
      
      for i in `seq 1 44000`; do
        dd if=/dev/zero of=/mnt/tmp/test_$i bs=10k count=1
        if [[ $? > 0 ]]; then
           btrfs qgroup show -pcref $SUBVOL
           exit 1
        fi
      done

      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      [ add reproducer to changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • L
      Btrfs: bring back repair during read · 9d0d1c8b
      Liu Bo committed
      Commit 20a7db8a ("btrfs: add dummy callback for readpage_io_failed
      and drop checks") made a cleanup around readpage_io_failed_hook, and
      it was supposed to keep the original semantics, but it also
      unexpectedly disabled repair during read for dup, raid1 and raid10.
      
      This fixes the problem by letting data's inode call the generic
      readpage_io_failed callback by returning -EAGAIN from its
      readpage_io_failed_hook in order to notify end_bio_extent_readpage to
      do the rest.  We don't call it directly because the generic one takes
      an offset from end_bio_extent_readpage() to calculate the index in the
      checksum array and inode's readpage_io_failed_hook doesn't offer that
      offset.
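
The control flow described above can be sketched in userspace roughly as follows. All names here (`demo_data_hook`, `demo_other_hook`, `demo_generic_repair`, `demo_end_read`) are hypothetical and the bodies are placeholders; this is not the btrfs implementation, only an illustration of a hook returning -EAGAIN so that the caller, which knows the offset, runs the generic path itself.

```c
#include <errno.h>

/* Hypothetical hook type: returns -EAGAIN to ask the caller to run the
 * generic repair path, or another value if no generic repair is wanted. */
typedef int (*demo_io_failed_hook_t)(int failed_mirror);

/* Data inodes: defer to the generic path, since this hook has no offset. */
static int demo_data_hook(int failed_mirror)
{
	(void)failed_mirror;
	return -EAGAIN;
}

/* Some other owner that does not want generic repair (placeholder). */
static int demo_other_hook(int failed_mirror)
{
	(void)failed_mirror;
	return -EIO;
}

/* Placeholder for the generic repair; it needs the caller's offset. */
static int demo_generic_repair(unsigned int offset, int failed_mirror)
{
	(void)offset;
	(void)failed_mirror;
	return 0; /* pretend the repair was queued successfully */
}

/* Caller side: consult the hook, and on -EAGAIN do the rest here. */
static int demo_end_read(demo_io_failed_hook_t hook, unsigned int offset,
			 int failed_mirror, int *repair_attempted)
{
	int ret = hook(failed_mirror);

	*repair_attempted = 0;
	if (ret == -EAGAIN) {
		*repair_attempted = 1;
		ret = demo_generic_repair(offset, failed_mirror);
	}
	return ret;
}
```

The design point is that the hook only signals intent; the offset-dependent work stays in the caller, which is the only place that has the offset.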
      
      Cc: David Sterba <dsterba@suse.cz>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ keep the const function attribute ]
      Signed-off-by: David Sterba <dsterba@suse.com>
  9. 28 Mar, 2017 4 commits
  10. 26 Mar, 2017 1 commit