1. 27 7月, 2016 14 次提交
  2. 22 7月, 2016 2 次提交
    • M
      ovl: verify upper dentry in ovl_remove_and_whiteout() · cfc9fde0
      Maxim Patlasov 提交于
      The upper dentry may become stale before we call ovl_lock_rename_workdir.
      For example, someone could (mistakenly or maliciously) manually unlink(2)
      it directly from upperdir.
      
      To ensure it is not stale, let's lookup it after ovl_lock_rename_workdir
      and and check if it matches the upper dentry.
      
      Essentially, it is the same problem and similar solution as in
      commit 11f37104 ("ovl: verify upper dentry before unlink and rename").
      Signed-off-by: NMaxim Patlasov <mpatlasov@virtuozzo.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Cc: <stable@vger.kernel.org>
      cfc9fde0
    • B
      GFS2: Fix gfs2_replay_incr_blk for multiple journal sizes · e1cb6be9
      Bob Peterson 提交于
      Before this patch, if you used gfs2_jadd to add new journals of a
      size smaller than the existing journals, replaying those new journals
      would withdraw. That's because function gfs2_replay_incr_blk was
      using the number of journal blocks (jd_block) from the superblock's
      journal pointer. In other words, "My journal's max size" rather than
      "the journal we're replaying's size." This patch changes the function
      to use the size of the pertinent journal rather than always using the
      journal we happen to be using.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      e1cb6be9
  3. 16 7月, 2016 1 次提交
    • J
      xfs: fix type confusion in xfs_ioc_swapext · 3e0a3965
      Jann Horn 提交于
      Without this check, the following XFS_I invocations would return bad
      pointers when used on non-XFS inodes (perhaps pointers into preceding
      allocator chunks).
      
      This could be used by an attacker to trick xfs_swap_extents into
      performing locking operations on attacker-chosen structures in kernel
      memory, potentially leading to code execution in the kernel.  (I have
      not investigated how likely this is to be usable for an attack in
      practice.)
      Signed-off-by: NJann Horn <jann@thejh.net>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3e0a3965
  4. 15 7月, 2016 1 次提交
    • H
      x86/syscalls: Add compat_sys_preadv64v2/compat_sys_pwritev64v2 · 3ebfd81f
      H.J. Lu 提交于
      Don't use the same syscall numbers for 2 different syscalls:
      
       534	x32	preadv			compat_sys_preadv64
       535	x32	pwritev			compat_sys_pwritev64
       534	x32	preadv2			compat_sys_preadv2
       535	x32	pwritev2		compat_sys_pwritev2
      
      Add compat_sys_preadv64v2() and compat_sys_pwritev64v2() so that 64-bit offset
      is passed in one 64-bit register on x32, similar to compat_sys_preadv64()
      and compat_sys_pwritev64().
      Signed-off-by: NH.J. Lu <hjl.tools@gmail.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/CAMe9rOovCMf-RQfx_n1U_Tu_DX1BYkjtFr%3DQ4-_PFVSj9BCzUA@mail.gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      3ebfd81f
  5. 14 7月, 2016 1 次提交
    • F
      chardev: add missing line break in pr_warn · 077e2642
      Fengguang Wu 提交于
      To fix super long dmesg error lines like
      
        CHRDEV "dummy_stm.0" major number 224 goes below the dynamic allocation rangeCHRDEV "dummy_stm.1" major number 223 goes below the dynamic allocation rangeswapper: page allocation failure: order:8, mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)
      
      After fix, it should look like
      
        CHRDEV "dummy_stm.0" major number 224 goes below the dynamic allocation range
        CHRDEV "dummy_stm.1" major number 223 goes below the dynamic allocation range
        swapper: page allocation failure: order:8, mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)
      Reported-by: NPhilip Li <philip.li@intel.com>
      Signed-off-by: NFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      077e2642
  6. 13 7月, 2016 1 次提交
    • B
      GFS2: Check rs_free with rd_rsspin protection · 44f52122
      Bob Peterson 提交于
      For the last process to close a file opened for write, function
      gfs2_rsqa_delete was deleting the file's inode's block reservation
      out of the rgrp reservations tree. Then it was checking to make sure
      rs_free was 0, but it was performing the check outside the protection
      of rd_rsspin spin_lock. The rd_rsspin spin_lock protection is needed
      to prevent a race between the process freeing the reservation and
      another who is allocating a new set of blocks inside the same rgrp
      for the same inode, thus changing its value.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      44f52122
  7. 08 7月, 2016 2 次提交
  8. 06 7月, 2016 2 次提交
  9. 04 7月, 2016 2 次提交
    • V
      ovl: Copy up underlying inode's ->i_mode to overlay inode · 07a2daab
      Vivek Goyal 提交于
      Right now when a new overlay inode is created, we initialize overlay
      inode's ->i_mode from underlying inode ->i_mode but we retain only
      file type bits (S_IFMT) and discard permission bits.
      
      This patch changes it and retains permission bits too. This should allow
      overlay to do permission checks on overlay inode itself in task context.
      
      [SzM] It also fixes clearing suid/sgid bits on write.
      Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
      Reported-by: NEryu Guan <eguan@redhat.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Fixes: 4bacc9c9 ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay")
      Cc: <stable@vger.kernel.org>
      07a2daab
    • M
      ovl: handle ATTR_KILL* · b99c2d91
      Miklos Szeredi 提交于
      Before 4bacc9c9 ("overlayfs: Make f_path...") file->f_path pointed to
      the underlying file, hence suid/sgid removal on write worked fine.
      
      After that patch file->f_path pointed to the overlay file, and the file
      mode bits weren't copied to overlay_inode->i_mode.  So the suid/sgid
      removal simply stopped working.
      
      The fix is to copy the mode bits, but then ovl_setattr() needs to clear
      ATTR_MODE to avoid the BUG() in notify_change().  So do this first, then in
      the next patch copy the mode.
      Reported-by: NEryu Guan <eguan@redhat.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Fixes: 4bacc9c9 ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay")
      Cc: <stable@vger.kernel.org>
      b99c2d91
  10. 03 7月, 2016 1 次提交
  11. 01 7月, 2016 5 次提交
    • M
      locks: use file_inode() · 6343a212
      Miklos Szeredi 提交于
      (Another one for the f_path debacle.)
      
      ltp fcntl33 testcase caused an Oops in selinux_file_send_sigiotask.
      
      The reason is that generic_add_lease() used filp->f_path.dentry->inode
      while all the others use file_inode().  This makes a difference for files
      opened on overlayfs since the former will point to the overlay inode the
      latter to the underlying inode.
      
      So generic_add_lease() added the lease to the overlay inode and
      generic_delete_lease() removed it from the underlying inode.  When the file
      was released the lease remained on the overlay inode's lock list, resulting
      in use after free.
      Reported-by: NEryu Guan <eguan@redhat.com>
      Fixes: 4bacc9c9 ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Reviewed-by: NJeff Layton <jlayton@redhat.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      6343a212
    • A
      namespace: update event counter when umounting a deleted dentry · e06b933e
      Andrey Ulanov 提交于
      - m_start() in fs/namespace.c expects that ns->event is incremented each
        time a mount added or removed from ns->list.
      - umount_tree() removes items from the list but does not increment event
        counter, expecting that it's done before the function is called.
      - There are some codepaths that call umount_tree() without updating
        "event" counter. e.g. from __detach_mounts().
      - When this happens m_start may reuse a cached mount structure that no
        longer belongs to ns->list (i.e. use after free which usually leads
        to infinite loop).
      
      This change fixes the above problem by incrementing global event counter
      before invoking umount_tree().
      
      Change-Id: I622c8e84dcb9fb63542372c5dbf0178ee86bb589
      Cc: stable@vger.kernel.org
      Signed-off-by: NAndrey Ulanov <andreyu@google.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      e06b933e
    • M
      9p: use file_dentry() · b403f0e3
      Miklos Szeredi 提交于
      v9fs may be used as lower layer of overlayfs and accessing f_path.dentry
      can lead to a crash.  In this case it's a NULL pointer dereference in
      p9_fid_create().
      
      Fix by replacing direct access of file->f_path.dentry with the
      file_dentry() accessor, which will always return a native object.
      Reported-by: NAlessio Igor Bogani <alessioigorbogani@gmail.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Tested-by: NAlessio Igor Bogani <alessioigorbogani@gmail.com>
      Fixes: 4bacc9c9 ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b403f0e3
    • S
      lockd: unregister notifier blocks if the service fails to come up completely · cb7d224f
      Scott Mayhew 提交于
      If the lockd service fails to start up then we need to be sure that the
      notifier blocks are not registered, otherwise a subsequent start of the
      service could cause the same notifier to be registered twice, leading to
      soft lockups.
      Signed-off-by: NScott Mayhew <smayhew@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: 0751ddf7 "lockd: Register callbacks on the inetaddr_chain..."
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      cb7d224f
    • T
      writeback: inode cgroup wb switch should not call ihold() · 74524955
      Tahsin Erdogan 提交于
      Asynchronous wb switching of inodes takes an additional ref count on an
      inode to make sure inode remains valid until switchover is completed.
      
      However, anyone calling ihold() must already have a ref count on inode,
      but in this case inode->i_count may already be zero:
      
      ------------[ cut here ]------------
      WARNING: CPU: 1 PID: 917 at fs/inode.c:397 ihold+0x2b/0x30
      CPU: 1 PID: 917 Comm: kworker/u4:5 Not tainted 4.7.0-rc2+ #49
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
      01/01/2011
      Workqueue: writeback wb_workfn (flush-8:16)
       0000000000000000 ffff88007ca0fb58 ffffffff805990af 0000000000000000
       0000000000000000 ffff88007ca0fb98 ffffffff80268702 0000018d000004e2
       ffff88007cef40e8 ffff88007c9b89a8 ffff880079e3a740 0000000000000003
      Call Trace:
       [<ffffffff805990af>] dump_stack+0x4d/0x6e
       [<ffffffff80268702>] __warn+0xc2/0xe0
       [<ffffffff802687d8>] warn_slowpath_null+0x18/0x20
       [<ffffffff8035b4ab>] ihold+0x2b/0x30
       [<ffffffff80367ecc>] inode_switch_wbs+0x11c/0x180
       [<ffffffff80369110>] wbc_detach_inode+0x170/0x1a0
       [<ffffffff80369abc>] writeback_sb_inodes+0x21c/0x530
       [<ffffffff80369f7e>] wb_writeback+0xee/0x1e0
       [<ffffffff8036a147>] wb_workfn+0xd7/0x280
       [<ffffffff80287531>] ? try_to_wake_up+0x1b1/0x2b0
       [<ffffffff8027bb09>] process_one_work+0x129/0x300
       [<ffffffff8027be06>] worker_thread+0x126/0x480
       [<ffffffff8098cde7>] ? __schedule+0x1c7/0x561
       [<ffffffff8027bce0>] ? process_one_work+0x300/0x300
       [<ffffffff80280ff4>] kthread+0xc4/0xe0
       [<ffffffff80335578>] ? kfree+0xc8/0x100
       [<ffffffff809903cf>] ret_from_fork+0x1f/0x40
       [<ffffffff80280f30>] ? __kthread_parkme+0x70/0x70
      ---[ end trace aaefd2fd9f306bc4 ]---
      Signed-off-by: NTahsin Erdogan <tahsin@google.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      74524955
  12. 30 6月, 2016 2 次提交
    • M
      fuse: serialize dirops by default · 5c672ab3
      Miklos Szeredi 提交于
      Negotiate with userspace filesystems whether they support parallel readdir
      and lookup.  Disable parallelism by default for fear of breaking fuse
      filesystems.
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Fixes: 9902af79 ("parallel lookups: actual switch to rwsem")
      Fixes: d9b3dbdc ("fuse: switch to ->iterate_shared()")
      5c672ab3
    • M
      configfs: Remove ppos increment in configfs_write_bin_file · f8608985
      Marek Vasut 提交于
      The simple_write_to_buffer() already increments the @ppos on success,
      see fs/libfs.c simple_write_to_buffer() comment:
      
      "
      On success, the number of bytes written is returned and the offset @ppos
      advanced by this number, or negative value is returned on error.
      "
      
      If the configfs_write_bin_file() is invoked with @count smaller than the
      total length of the written binary file, it will be invoked multiple times.
      Since configfs_write_bin_file() increments @ppos on success, after calling
      simple_write_to_buffer(), the @ppos is incremented twice.
      
      Subsequent invocation of configfs_write_bin_file() will result in the next
      piece of data being written to the offset twice as long as the length of
      the previous write, thus creating buffer with "holes" in it.
      
      The simple testcase using DTO follows:
        $ mkdir /sys/kernel/config/device-tree/overlays/1
        $ dd bs=1 if=foo.dtbo of=/sys/kernel/config/device-tree/overlays/1/dtbo
      Without this patch, the testcase will result in twice as big buffer in the
      kernel, which is then passed to the cfs_overlay_item_dtbo_write() .
      Signed-off-by: NMarek Vasut <marex@denx.de>
      Cc: Geert Uytterhoeven <geert+renesas@glider.be>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Pantelis Antoniou <pantelis.antoniou@konsulko.com>
      f8608985
  13. 29 6月, 2016 3 次提交
    • M
      ovl: get_write_access() in truncate · 03bea604
      Miklos Szeredi 提交于
      When truncating a file we should check write access on the underlying
      inode.  And we should do so on the lower file as well (before copy-up) for
      consistency.
      
      Original patch and test case by Aihua Zhang.
      
       - - >o >o - - test.c - - >o >o - -
      #include <stdio.h>
      #include <errno.h>
      #include <unistd.h>
      
      int main(int argc, char *argv[])
      {
      	int ret;
      
      	ret = truncate(argv[0], 4096);
      	if (ret != -1) {
      		fprintf(stderr, "truncate(argv[0]) should have failed\n");
      		return 1;
      	}
      	if (errno != ETXTBSY) {
      		perror("truncate(argv[0])");
      		return 1;
      	}
      
      	return 0;
      }
       - - >o >o - - >o >o - - >o >o - -
      Reported-by: NAihua Zhang <zhangaihua1@huawei.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Cc: <stable@vger.kernel.org>
      03bea604
    • M
      ovl: fix dentry leak for default_permissions · a4859d75
      Miklos Szeredi 提交于
      When using the 'default_permissions' mount option, ovl_permission() on
      non-directories was missing a dput(alias), resulting in "BUG Dentry still
      in use".
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Fixes: 8d3095f4 ("ovl: default permissions")
      Cc: <stable@vger.kernel.org> # v4.5+
      a4859d75
    • T
      NFS: Fix another OPEN_DOWNGRADE bug · e547f262
      Trond Myklebust 提交于
      Olga Kornievskaia reports that the following test fails to trigger
      an OPEN_DOWNGRADE on the wire, and only triggers the final CLOSE.
      
      	fd0 = open(foo, RDRW)   -- should be open on the wire for "both"
      	fd1 = open(foo, RDONLY)  -- should be open on the wire for "read"
      	close(fd0) -- should trigger an open_downgrade
      	read(fd1)
      	close(fd1)
      
      The issue is that we're missing a check for whether or not the current
      state transitioned from an O_RDWR state as opposed to having transitioned
      from a combination of O_RDONLY and O_WRONLY.
      Reported-by: NOlga Kornievskaia <aglo@umich.edu>
      Fixes: cd9288ff ("NFSv4: Fix another bug in the close/open_downgrade code")
      Cc: stable@vger.kernel.org # 2.6.33+
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      e547f262
  14. 28 6月, 2016 1 次提交
    • E
      dax: fix offset overflow in dax_io · 02395435
      Eric Sandeen 提交于
      This isn't functionally apparent for some reason, but
      when we test io at extreme offsets at the end of the loff_t
      rang, such as in fstests xfs/071, the calculation of
      "max" in dax_io() can be wrong due to pos + size overflowing.
      
      For example,
      
      # xfs_io -c "pwrite 9223372036854771712 512" /mnt/test/file
      
      enters dax_io with:
      
      start 0x7ffffffffffff000
      end   0x7ffffffffffff200
      
      and the rounded up "size" variable is 0x1000.  This yields:
      
      pos + size 0x8000000000000000 (overflows loff_t)
             end 0x7ffffffffffff200
      
      Due to the overflow, the min() function picks the wrong
      value for the "max" variable, and when we send (max - pos)
      into i.e. copy_from_iter_pmem() it is also the wrong value.
      
      This somehow(tm) gets magically absorbed without incident,
      probably because iter->count is correct.  But it seems best
      to fix it up properly by comparing the two values as
      unsigned.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      02395435
  15. 27 6月, 2016 2 次提交
    • B
      gfs2: writeout truncated pages · fd4c5748
      Benjamin Marzinski 提交于
      When gfs2 attempts to write a page to a file that is being truncated,
      and notices that the page is completely outside of the file size, it
      tries to invalidate it.  However, this may require a transaction for
      journaled data files to revoke any buffers from the page on the active
      items list. Unfortunately, this can happen inside a log flush, where a
      transaction cannot be started. Also, gfs2 may need to be able to remove
      the buffer from the ail1 list before it can finish the log flush.
      
      To deal with this, when writing a page of a file with data journalling
      enabled gfs2 now skips the check to see if the write is outside the file
      size, and simply writes it anyway. This situation can only occur when
      the truncate code still has the file locked exclusively, and hasn't
      marked this block as free in the metadata (which happens later in
      truc_dealloc).  After gfs2 writes this page out, the truncation code
      will shortly invalidate it and write out any revokes if necessary.
      
      To do this, gfs2 now implements its own version of block_write_full_page
      without the check, and calls the newly exported __block_write_full_page.
      It also no longer calls gfs2_writepage_common from gfs2_jdata_writepage.
      Signed-off-by: NBenjamin Marzinski <bmarzins@redhat.com>
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      fd4c5748
    • B
      fs: export __block_write_full_page · b4bba389
      Benjamin Marzinski 提交于
      gfs2 needs to be able to skip the check to see if a page is outside of
      the file size when writing it out. gfs2 can get into a situation where
      it needs to flush its in-memory log to disk while a truncate is in
      progress. If the file being trucated has data journaling enabled, it is
      possible that there are data blocks in the log that are past the end of
      the file. gfs can't finish the log flush without either writing these
      blocks out or revoking them. Otherwise, if the node crashed, it could
      overwrite subsequent changes made by other nodes in the cluster when
      it's journal was replayed.
      
      Unfortunately, there is no way to add log entries to the log during a
      flush. So gfs2 simply writes out the page instead. This situation can
      only occur when the truncate code still has the file locked exclusively,
      and hasn't marked this block as free in the metadata (which happens
      later in truc_dealloc).  After gfs2 writes this page out, the truncation
      code will shortly invalidate it and write out any revokes if necessary.
      
      In order to make this work, gfs2 needs to be able to skip the check for
      writes outside the file size. Since the check exists in
      block_write_full_page, this patch exports __block_write_full_page, which
      doesn't have the check.
      Signed-off-by: NBenjamin Marzinski <bmarzins@redhat.com>
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      b4bba389