1. 25 6月, 2016 7 次提交
  2. 24 6月, 2016 4 次提交
    • C
      Btrfs: Force stripesize to the value of sectorsize · b7f67055
      Chandan Rajendra 提交于
      Btrfs code currently assumes stripesize to be same as
      sectorsize. However Btrfs-progs (until commit
      df05c7ed455f519e6e15e46196392e4757257305) has been setting
      btrfs_super_block->stripesize to a value of 4096.
      
      This commit makes sure that the value of btrfs_super_block->stripesize
      is a power of 2. Later, it unconditionally sets btrfs_root->stripesize
      to sectorsize.
      Signed-off-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      b7f67055
    • W
      btrfs: fix disk_i_size update bug when fallocate() fails · c0d2f610
      Wang Xiaoguang 提交于
      When doing truncate operation, btrfs_setsize() will first call
      truncate_setsize() to set new inode->i_size, but if later
      btrfs_truncate() fails, btrfs_setsize() will call
      "i_size_write(inode, BTRFS_I(inode)->disk_i_size)" to reset the
      inmemory inode size, now bug occurs. It's because for truncate
      case btrfs_ordered_update_i_size() directly uses inode->i_size
      to update BTRFS_I(inode)->disk_i_size, indeed we should use the
      "offset" argument to update disk_i_size. Here is the call graph:
      ==>btrfs_truncate()
      ====>btrfs_truncate_inode_items()
      ======>btrfs_ordered_update_i_size(inode, last_size, NULL);
      Here btrfs_ordered_update_i_size()'s offset argument is last_size.
      
      And below test case can reveal this bug:
      
      dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=100
      dev=$(losetup --show -f fs.img)
      mkdir -p /mnt/mntpoint
      mkfs.btrfs  -f $dev
      mount $dev /mnt/mntpoint
      cd /mnt/mntpoint
      
      echo "workdir is: /mnt/mntpoint"
      blocksize=$((128 * 1024))
      dd if=/dev/zero of=testfile bs=$blocksize count=1
      sync
      count=$((17*1024*1024*1024/blocksize))
      echo "file size is:" $((count*blocksize))
      for ((i = 1; i <= $count; i++)); do
      	i=$((i + 1))
      	dst_offset=$((blocksize * i))
      	xfs_io -f -c "reflink testfile 0 $dst_offset $blocksize"\
      		testfile > /dev/null
      done
      sync
      
      truncate --size 0 testfile
      ls -l testfile
      du -sh testfile
      exit
      
      In this case, truncate operation will fail for enospc reason and
      "du -sh testfile" returns value greater than 0, but testfile's
      size is 0, we need to reflect correct inode->i_size.
      Signed-off-by: NWang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      c0d2f610
    • L
      Btrfs: fix error handling in map_private_extent_buffer · 415b35a5
      Liu Bo 提交于
      map_private_extent_buffer() can return -EINVAL in two different cases,
      1. when the requested contents span two pages if nodesize is larger
         than pagesize,
      2. when it detects something insane.
      
      The 2nd one used to be only a WARN_ON(1), and we decided to return a error
      to callers, but we didn't fix up all its callers, which will be
      addressed by this patch.
      
      Without this, btrfs may end up with 'general protection', ie.
      reading invalid memory.
      Reported-by: NVegard Nossum <vegard.nossum@oracle.com>
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      415b35a5
    • W
      Btrfs: fix error return code in btrfs_init_test_fs() · 04e1b65a
      Wei Yongjun 提交于
      Fix to return a negative error code from the kern_mount() error handling
      case instead of 0(ret is set to 0 by register_filesystem), as done
      elsewhere in this function.
      Signed-off-by: NWei Yongjun <yongjun_wei@trendmicro.com.cn>
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      04e1b65a
  3. 23 6月, 2016 4 次提交
    • J
      Btrfs: don't do nocow check unless we have to · c6887cd1
      Josef Bacik 提交于
      Before we write into prealloc/nocow space we have to make sure that there are no
      references to the extents we are writing into, which means checking the extent
      tree and csum tree in the case of nocow.  So we don't want to do the nocow dance
      unless we can't reserve data space, since it's a serious drag on performance.
      With the following sequence
      
      fallocate -l10737418240 /mnt/btrfs-test/file
      cp --reflink /mnt/btrfs-test/file /mnt/btrfs-test/link
      fio --name=randwrite --rw=randwrite --bs=4k --filename=/mnt/btrfs-test/file \
      	--end_fsync=1
      
      we get the worst case scenario where we have to fall back on to doing the check
      anyway.
      
      Without this patch
      lat (usec): min=5, max=111598, avg=27.65, stdev=124.51
      write: io=10240MB, bw=126876KB/s, iops=31718, runt= 82646msec
      
      With this patch
      lat (usec): min=3, max=91210, avg=14.09, stdev=110.62
      write: io=10240MB, bw=212753KB/s, iops=53188, runt= 49286msec
      
      We get twice the throughput, half of the runtime, and half of the average
      latency.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      [ PAGE_CACHE_ removal related fixups ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      c6887cd1
    • C
      btrfs: fix deadlock in delayed_ref_async_start · 0f873eca
      Chris Mason 提交于
      "Btrfs: track transid for delayed ref flushing" was deadlocking on
      btrfs_attach_transaction because its not safe to call from the async
      delayed ref start code.  This commit brings back btrfs_join_transaction
      instead and checks for a blocked commit.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      0f873eca
    • J
      Btrfs: track transid for delayed ref flushing · 31b9655f
      Josef Bacik 提交于
      Using the offwakecputime bpf script I noticed most of our time was spent waiting
      on the delayed ref throttling.  This is what is supposed to happen, but
      sometimes the transaction can commit and then we're waiting for throttling that
      doesn't matter anymore.  So change this stuff to be a little smarter by tracking
      the transid we were in when we initiated the throttling.  If the transaction we
      get is different then we can just bail out.  This resulted in a 50% speedup in
      my fs_mark test, and reduced the amount of time spent throttling by 60 seconds
      over the entire run (which is about 30 minutes).  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      31b9655f
    • K
      UBIFS: Implement ->migratepage() · 4ac1c17b
      Kirill A. Shutemov 提交于
      During page migrations UBIFS might get confused
      and the following assert triggers:
      [  213.480000] UBIFS assert failed in ubifs_set_page_dirty at 1451 (pid 436)
      [  213.490000] CPU: 0 PID: 436 Comm: drm-stress-test Not tainted 4.4.4-00176-geaa802524636-dirty #1008
      [  213.490000] Hardware name: Allwinner sun4i/sun5i Families
      [  213.490000] [<c0015e70>] (unwind_backtrace) from [<c0012cdc>] (show_stack+0x10/0x14)
      [  213.490000] [<c0012cdc>] (show_stack) from [<c02ad834>] (dump_stack+0x8c/0xa0)
      [  213.490000] [<c02ad834>] (dump_stack) from [<c0236ee8>] (ubifs_set_page_dirty+0x44/0x50)
      [  213.490000] [<c0236ee8>] (ubifs_set_page_dirty) from [<c00fa0bc>] (try_to_unmap_one+0x10c/0x3a8)
      [  213.490000] [<c00fa0bc>] (try_to_unmap_one) from [<c00fadb4>] (rmap_walk+0xb4/0x290)
      [  213.490000] [<c00fadb4>] (rmap_walk) from [<c00fb1bc>] (try_to_unmap+0x64/0x80)
      [  213.490000] [<c00fb1bc>] (try_to_unmap) from [<c010dc28>] (migrate_pages+0x328/0x7a0)
      [  213.490000] [<c010dc28>] (migrate_pages) from [<c00d0cb0>] (alloc_contig_range+0x168/0x2f4)
      [  213.490000] [<c00d0cb0>] (alloc_contig_range) from [<c010ec00>] (cma_alloc+0x170/0x2c0)
      [  213.490000] [<c010ec00>] (cma_alloc) from [<c001a958>] (__alloc_from_contiguous+0x38/0xd8)
      [  213.490000] [<c001a958>] (__alloc_from_contiguous) from [<c001ad44>] (__dma_alloc+0x23c/0x274)
      [  213.490000] [<c001ad44>] (__dma_alloc) from [<c001ae08>] (arm_dma_alloc+0x54/0x5c)
      [  213.490000] [<c001ae08>] (arm_dma_alloc) from [<c035cecc>] (drm_gem_cma_create+0xb8/0xf0)
      [  213.490000] [<c035cecc>] (drm_gem_cma_create) from [<c035cf20>] (drm_gem_cma_create_with_handle+0x1c/0xe8)
      [  213.490000] [<c035cf20>] (drm_gem_cma_create_with_handle) from [<c035d088>] (drm_gem_cma_dumb_create+0x3c/0x48)
      [  213.490000] [<c035d088>] (drm_gem_cma_dumb_create) from [<c0341ed8>] (drm_ioctl+0x12c/0x444)
      [  213.490000] [<c0341ed8>] (drm_ioctl) from [<c0121adc>] (do_vfs_ioctl+0x3f4/0x614)
      [  213.490000] [<c0121adc>] (do_vfs_ioctl) from [<c0121d30>] (SyS_ioctl+0x34/0x5c)
      [  213.490000] [<c0121d30>] (SyS_ioctl) from [<c000f2c0>] (ret_fast_syscall+0x0/0x34)
      
      UBIFS is using PagePrivate() which can have different meanings across
      filesystems. Therefore the generic page migration code cannot handle this
      case correctly.
      We have to implement our own migration function which basically does a
      plain copy but also duplicates the page private flag.
      UBIFS is not a block device filesystem and cannot use buffer_migrate_page().
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      [rw: Massaged changelog, build fixes, etc...]
      Signed-off-by: NRichard Weinberger <richard@nod.at>
      Acked-by: NChristoph Hellwig <hch@lst.de>
      4ac1c17b
  4. 20 6月, 2016 1 次提交
  5. 18 6月, 2016 8 次提交
  6. 16 6月, 2016 3 次提交
  7. 15 6月, 2016 5 次提交
    • J
      nfsd4/rpc: move backchannel create logic into rpc code · d50039ea
      J. Bruce Fields 提交于
      Also simplify the logic a bit.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      Acked-by: NTrond Myklebust <trondmy@primarydata.com>
      d50039ea
    • M
      ovl: fix uid/gid when creating over whiteout · d0e13f5b
      Miklos Szeredi 提交于
      Fix a regression when creating a file over a whiteout.  The new
      file/directory needs to use the current fsuid/fsgid, not the ones from the
      mounter's credentials.
      
      The refcounting is a bit tricky: prepare_creds() sets an original refcount,
      override_creds() gets one more, which revert_cred() drops.  So
      
        1) we need to expicitly put the mounter's credentials when overriding
           with the updated one
      
        2) we need to put the original ref to the updated creds (and this can
           safely be done before revert_creds(), since we'll still have the ref
           from override_creds()).
      Reported-by: NStephen Smalley <sds@tycho.nsa.gov>
      Fixes: 3fe6e52f ("ovl: override creds with the ones from the superblock mounter")
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      d0e13f5b
    • N
      debugfs: open_proxy_open(): avoid double fops release · 75f0b68b
      Nicolai Stange 提交于
      Debugfs' open_proxy_open(), the ->open() installed at all inodes created
      through debugfs_create_file_unsafe(),
      - grabs a reference to the original file_operations instance passed to
        debugfs_create_file_unsafe() via fops_get(),
      - installs it at the file's ->f_op by means of replace_fops()
      - and calls fops_put() on it.
      
      Since the semantics of replace_fops() are such that the reference's
      ownership is transferred, the subsequent fops_put() will result in a double
      release when the file is eventually closed.
      
      Currently, this is not an issue since fops_put() basically does a
      module_put() on the file_operations' ->owner only and there don't exist any
      modules calling debugfs_create_file_unsafe() yet. This is expected to
      change in the future though, c.f. commit c6468808 ("debugfs: add
      support for self-protecting attribute file fops").
      
      Remove the call to fops_put() from open_proxy_open().
      
      Fixes: 9fd4dcec ("debugfs: prevent access to possibly dead
                            file_operations at file open")
      Signed-off-by: NNicolai Stange <nicstange@gmail.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      75f0b68b
    • N
      debugfs: full_proxy_open(): free proxy on ->open() failure · b10e3e90
      Nicolai Stange 提交于
      Debugfs' full_proxy_open(), the ->open() installed at all inodes created
      through debugfs_create_file(),
      - grabs a reference to the original struct file_operations instance passed
        to debugfs_create_file(),
      - dynamically allocates a proxy struct file_operations instance wrapping
        the original
      - and installs this at the file's ->f_op.
      
      Afterwards, it calls the original ->open() and passes its return value back
      to the VFS layer.
      
      Now, if that return value indicates failure, the VFS layer won't ever call
      ->release() and thus, neither the reference to the original file_operations
      nor the memory for the proxy file_operations will get released, i.e. both
      are leaked.
      
      Upon failure of the original fops' ->open(), undo the proxy installation.
      That is:
      - Set the struct file ->f_op to what it had been when full_proxy_open()
        was entered.
      - Drop the reference to the original file_operations.
      - Free the memory holding the proxy file_operations.
      
      Fixes: 49d200de ("debugfs: prevent access to removed files' private
                            data")
      Signed-off-by: NNicolai Stange <nicstange@gmail.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b10e3e90
    • E
      mnt: Account for MS_RDONLY in fs_fully_visible · 695e9df0
      Eric W. Biederman 提交于
      In rare cases it is possible for s_flags & MS_RDONLY to be set but
      MNT_READONLY to be clear.  This starting combination can cause
      fs_fully_visible to fail to ensure that the new mount is readonly.
      Therefore force MNT_LOCK_READONLY in the new mount if MS_RDONLY
      is set on the source filesystem of the mount.
      
      In general both MS_RDONLY and MNT_READONLY are set at the same for
      mounts so I don't expect any programs to care.  Nor do I expect
      MS_RDONLY to be set on proc or sysfs in the initial user namespace,
      which further decreases the likelyhood of problems.
      
      Which means this change should only affect system configurations by
      paranoid sysadmins who should welcome the additional protection
      as it keeps people from wriggling out of their policies.
      
      Cc: stable@vger.kernel.org
      Fixes: 8c6cf9cc ("mnt: Modify fs_fully_visible to deal with locked ro nodev and atime")
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      695e9df0
  8. 14 6月, 2016 1 次提交
  9. 12 6月, 2016 1 次提交
    • A
      autofs races · ea01a184
      Al Viro 提交于
      * make autofs4_expire_indirect() skip the dentries being in process of
      expiry
      * do *not* mess with list_move(); making sure that dentry with
      AUTOFS_INF_EXPIRING are not picked for expiry is enough.
      * do not remove NO_RCU when we set EXPIRING, don't bother with smp_mb()
      there.  Clear it at the same time we clear EXPIRING.  Makes a bunch of
      tests simpler.
      * rename NO_RCU to WANT_EXPIRE, which is what it really is.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      ea01a184
  10. 11 6月, 2016 2 次提交
  11. 10 6月, 2016 1 次提交
    • A
      much milder d_walk() race · ba65dc5e
      Al Viro 提交于
      d_walk() relies upon the tree not getting rearranged under it without
      rename_lock being touched.  And we do grab rename_lock around the
      places that change the tree topology.  Unfortunately, branch reordering
      is just as bad from d_walk() POV and we have two places that do it
      without touching rename_lock - one in handling of cursors (for ramfs-style
      directories) and another in autofs.  autofs one is a separate story; this
      commit deals with the cursors.
      	* mark cursor dentries explicitly at allocation time
      	* make __dentry_kill() leave ->d_child.next pointing to the next
      non-cursor sibling, making sure that it won't be moved around unnoticed
      before the parent is relocked on ascend-to-parent path in d_walk().
      	* make d_walk() skip cursors explicitly; strictly speaking it's
      not necessary (all callbacks we pass to d_walk() are no-ops on cursors),
      but it makes analysis easier.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      ba65dc5e
  12. 08 6月, 2016 3 次提交
    • M
      coredump: fix dumping through pipes · 1607f09c
      Mateusz Guzik 提交于
      The offset in the core file used to be tracked with ->written field of
      the coredump_params structure. The field was retired in favour of
      file->f_pos.
      
      However, ->f_pos is not maintained for pipes which leads to breakage.
      
      Restore explicit tracking of the offset in coredump_params. Introduce
      ->pos field for this purpose since ->written was already reused.
      
      Fixes: a0083939 ("get rid of coredump_params->written").
      Reported-by: NZbigniew Jędrzejewski-Szmek <zbyszek@in.waw.pl>
      Signed-off-by: NMateusz Guzik <mguzik@redhat.com>
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      1607f09c
    • A
      fix a regression in atomic_open() · a01e718f
      Al Viro 提交于
      open("/foo/no_such_file", O_RDONLY | O_CREAT) on should fail with
      EACCES when /foo is not writable; failing with ENOENT is obviously
      wrong.  That got broken by a braino introduced when moving the
      creat_error logics from atomic_open() to lookup_open().  Easy to
      fix, fortunately.
      Spotted-by: N"Yan, Zheng" <ukernel@gmail.com>
      Tested-by: N"Yan, Zheng" <ukernel@gmail.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      a01e718f
    • A
      fix d_walk()/non-delayed __d_free() race · 3d56c25e
      Al Viro 提交于
      Ascend-to-parent logics in d_walk() depends on all encountered child
      dentries not getting freed without an RCU delay.  Unfortunately, in
      quite a few cases it is not true, with hard-to-hit oopsable race as
      the result.
      
      Fortunately, the fix is simiple; right now the rule is "if it ever
      been hashed, freeing must be delayed" and changing it to "if it
      ever had a parent, freeing must be delayed" closes that hole and
      covers all cases the old rule used to cover.  Moreover, pipes and
      sockets remain _not_ covered, so we do not introduce RCU delay in
      the cases which are the reason for having that delay conditional
      in the first place.
      
      Cc: stable@vger.kernel.org # v3.2+ (and watch out for __d_materialise_dentry())
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      3d56c25e