1. 06 Jul, 2017 1 commit
    • fs: new infrastructure for writeback error handling and reporting · 5660e13d
      Committed by Jeff Layton
      Most filesystems currently use mapping_set_error and
      filemap_check_errors for setting and reporting/clearing writeback errors
      at the mapping level. filemap_check_errors is indirectly called from
      most of the filemap_fdatawait_* functions and from
      filemap_write_and_wait*. These functions are called from all sorts of
      contexts to wait on writeback to finish -- e.g. mostly in fsync, but
      also in truncate calls, getattr, etc.
      
      The non-fsync callers are problematic. We should be reporting writeback
      errors during fsync, but many places spread over the tree clear out
      errors before they can be properly reported, or report errors at
      nonsensical times.
      
      If I get -EIO on a stat() call, there is no reason for me to assume that
      it is because some previous writeback failed. The fact that it also
      clears out the error such that a subsequent fsync returns 0 is a bug,
      and a nasty one since that's potentially silent data corruption.
      
      This patch adds a small bit of new infrastructure for setting and
      reporting errors during address_space writeback. While the above was my
      original impetus for adding this, I think it's also the case that
      current fsync semantics are just problematic for userland. Most
      applications that call fsync do so to ensure that the data they wrote
      has hit the backing store.
      
      In the case where there are multiple writers to the file at the same
      time, this is really hard to determine. The first one to call fsync will
      see any stored error, and the rest get back 0. The processes with open
      fds may not be associated with one another in any way. They could even
      be in different containers, so ensuring coordination between all fsync
      callers is not really an option.
      
      One way to remedy this would be to track what file descriptor was used
      to dirty the file, but that's rather cumbersome and would likely be
      slow. However, there is a simpler way to improve the semantics here
      without incurring too much overhead.
      
      This set adds an errseq_t to struct address_space, and a corresponding
      one is added to struct file. Writeback errors are recorded in the
      mapping's errseq_t, and the one in struct file is used as the "since"
      value.
      
      This changes the semantics of the Linux fsync implementation such that
      applications can now use it to determine whether there were any
      writeback errors since fsync(fd) was last called (or since the file was
      opened in the case of fsync having never been called).
      
      Note that those writeback errors may have occurred when writing data
      that was dirtied via an entirely different fd, but that's the case now
      with the current mapping_set_error/filemap_check_errors infrastructure.
      This will at least prevent you from getting a false report of success.
      
      The new behavior is still consistent with the POSIX spec, and is more
      reliable for application developers. This patch just adds some basic
      infrastructure for doing this, and ensures that the f_wb_err "cursor"
      is properly set when a file is opened. Later patches will change the
      existing code to use this new infrastructure for reporting errors at
      fsync time.
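
The "since" cursor described above can be illustrated with a small userspace model. This is only a sketch under simplifying assumptions: the real errseq_t packs its state into a single 32-bit value, and every name below (errseq_model, model_set_error, model_open, model_fsync_check) is hypothetical, not a kernel API.

```c
#include <errno.h>

/* Simplified stand-in for errseq_t; the real type is one packed u32. */
struct errseq_model {
    unsigned int seq; /* bumped each time a writeback error is recorded */
    int err;          /* most recent negative errno, 0 if none */
};

/* Stand-in for struct address_space. */
struct mapping_model {
    struct errseq_model wb_err;
};

/* Stand-in for struct file: f_wb_err is the per-file "since" cursor. */
struct file_model {
    struct mapping_model *mapping;
    unsigned int f_wb_err;
};

/* Record a writeback error in the mapping. */
void model_set_error(struct mapping_model *m, int err)
{
    m->wb_err.err = err;
    m->wb_err.seq++;
}

/* On open, sample the current sequence so older errors are not reported. */
void model_open(struct file_model *f, struct mapping_model *m)
{
    f->mapping = m;
    f->f_wb_err = m->wb_err.seq;
}

/* At fsync time, report an error only if one was recorded since the cursor. */
int model_fsync_check(struct file_model *f)
{
    struct errseq_model *e = &f->mapping->wb_err;

    if (f->f_wb_err == e->seq)
        return 0;
    f->f_wb_err = e->seq; /* advance this file's cursor only */
    return e->err;
}
```

In this model, two files opened on the same mapping each see an error recorded after they were opened, and each sees it once, which is the property the commit message is after.
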
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
  2. 09 May, 2017 1 commit
    • block, dax: move "select DAX" from BLOCK to FS_DAX · ef510424
      Committed by Dan Williams
      For configurations that do not enable DAX filesystems or drivers, do not
      require the DAX core to be built.
      
      Given that the 'direct_access' method has been removed from
      'block_device_operations', we can also go ahead and move the
      block-related dax helper functions from fs/block_dev.c to
      drivers/dax/super.c. This keeps dax details out of the block layer and
      lets the DAX core be built as a module in the FS_DAX=n case.
      
      Filesystems need to include dax.h to call bdev_dax_supported().
      
      Cc: linux-xfs@vger.kernel.org
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.com>
      Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  3. 04 May, 2017 1 commit
  4. 02 May, 2017 1 commit
  5. 26 Apr, 2017 2 commits
  6. 22 Apr, 2017 1 commit
    • block: get rid of blk_integrity_revalidate() · 19b7ccf8
      Committed by Ilya Dryomov
      Commit 25520d55 ("block: Inline blk_integrity in struct gendisk")
      introduced blk_integrity_revalidate(), which seems to assume ownership
      of the stable pages flag and unilaterally clears it if no blk_integrity
      profile is registered:
      
          if (bi->profile)
                  disk->queue->backing_dev_info->capabilities |=
                          BDI_CAP_STABLE_WRITES;
          else
                  disk->queue->backing_dev_info->capabilities &=
                          ~BDI_CAP_STABLE_WRITES;
      
      It's called from revalidate_disk() and rescan_partitions(), making it
      impossible to enable stable pages for drivers that support partitions
      and don't use blk_integrity: while the call in revalidate_disk() can be
      trivially worked around (see zram, which doesn't support partitions and
      hence gets away with zram_revalidate_disk()), rescan_partitions() can
      be triggered from userspace at any time.  This breaks rbd, where the
      ceph messenger is responsible for generating/verifying CRCs.
      
      Since blk_integrity_{un,}register() "must" be used for (un)registering
      the integrity profile with the block layer, move BDI_CAP_STABLE_WRITES
      setting there.  This way drivers that call blk_integrity_register() and
      use integrity infrastructure won't interfere with drivers that don't
      but still want stable pages.
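
A minimal userspace sketch of the idea, assuming a much-simplified disk model: the stable-writes flag is only touched when an integrity profile is registered or unregistered, so a partition rescan leaves a driver-set flag alone. All names here (disk_model, model_integrity_register, and so on) are hypothetical stand-ins, not the kernel's functions.

```c
/* Hypothetical stand-in for BDI_CAP_STABLE_WRITES. */
#define BDI_CAP_STABLE_WRITES_MODEL 0x1u

/* Simplified disk: just the capability bits and a profile marker. */
struct disk_model {
    unsigned int bdi_capabilities;
    int has_integrity_profile;
};

/* Registering a profile turns stable writes on. */
void model_integrity_register(struct disk_model *d)
{
    d->has_integrity_profile = 1;
    d->bdi_capabilities |= BDI_CAP_STABLE_WRITES_MODEL;
}

/* Unregistering the profile turns stable writes off again. */
void model_integrity_unregister(struct disk_model *d)
{
    d->has_integrity_profile = 0;
    d->bdi_capabilities &= ~BDI_CAP_STABLE_WRITES_MODEL;
}

/* With the revalidate hook gone, a rescan no longer touches the flag. */
void model_rescan_partitions(struct disk_model *d)
{
    (void)d;
}
```

A driver that sets the flag itself and never registers a profile (the rbd case in the message) keeps its setting across any number of rescans in this model.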
      
      Fixes: 25520d55 ("block: Inline blk_integrity in struct gendisk")
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # 4.4+, needs backporting
      Tested-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  7. 21 Apr, 2017 2 commits
    • dax: introduce dax_direct_access() · b0686260
      Committed by Dan Williams
      Replace bdev_direct_access() with dax_direct_access() that uses
      dax_device and dax_operations instead of a block_device and
      block_device_operations for dax. Once all consumers of the old API have
      been converted, bdev_direct_access() will be deleted.
      
      Given that block device partitioning decisions can cause dax page
      alignment constraints to be violated this also introduces the
      bdev_dax_pgoff() helper. It handles calculating a logical pgoff relative
      to the dax_device and also checks for page alignment.
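
The offset calculation and alignment check can be sketched in userspace as follows. This is an illustration, not the kernel helper: the 512-byte sector and 4096-byte page sizes are assumptions, and model_dax_pgoff and its parameters are hypothetical names.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_SECTOR_SIZE 512u  /* assumed sector size */
#define MODEL_PAGE_SIZE   4096u /* assumed page size */

/*
 * Translate a sector within a partition into a page offset relative to
 * the start of the whole device, rejecting unaligned requests.
 */
int model_dax_pgoff(uint64_t part_start_sect, uint64_t sector,
                    size_t size, uint64_t *pgoff)
{
    uint64_t byte_off = (part_start_sect + sector) * MODEL_SECTOR_SIZE;

    if (pgoff)
        *pgoff = byte_off / MODEL_PAGE_SIZE;
    /* Both the starting offset and the length must be page aligned. */
    if (byte_off % MODEL_PAGE_SIZE || size % MODEL_PAGE_SIZE)
        return -EINVAL;
    return 0;
}
```

For example, a partition starting at sector 8 begins exactly one page into the device, while a partition starting at sector 1 can never satisfy the alignment check.
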
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • block: kill bdev_dax_capable() · d8f07aee
      Committed by Dan Williams
      This is leftover dead code that has since been replaced by
      bdev_dax_supported().
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  8. 09 Apr, 2017 2 commits
  9. 23 Mar, 2017 2 commits
  10. 02 Mar, 2017 1 commit
    • block: Initialize bd_bdi on inode initialization · a5a79d00
      Committed by Jan Kara
      So far we initialized bd_bdi only in bdget(). That is fine for normal
      bdev inodes; however, for the special case of the root inode of
      blockdev_superblock, that function is never called and thus bd_bdi is
      left uninitialized. As a result, bdev_evict_inode() may oops doing
      bdi_put(root->bd_bdi) on that inode, as can be seen when doing:
      
      mount -t bdev none /mnt
      
      Fix the problem by initializing bd_bdi when first allocating the inode
      and then reinitializing bd_bdi in bdev_evict_inode().
      
      Thanks to the syzkaller team for finding the problem.
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Fixes: b1d2dc56 ("block: Make blk_get_backing_dev_info() safe without open bdev")
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  11. 28 Feb, 2017 1 commit
  12. 22 Feb, 2017 1 commit
    • block: Revalidate i_bdev reference in bd_acquire() · cccd9fb9
      Committed by Jan Kara
      When a device gets removed, the block device inode is unhashed so that
      it is not used anymore (bdget() will not find it anymore). Later, when a
      new device gets created with the same device number, we create a new
      block device inode. However, there may be file system device inodes whose
      i_bdev still points to the original block device inode, and thus we get
      two active block device inodes for the same device. They will share the
      same gendisk, so the only visible differences will be that page caches
      will not be coherent and BDIs will be different (the old block device
      inode still points to the unregistered BDI).
      
      Fix the problem by checking in bd_acquire() whether i_bdev still points
      to an active block device inode and re-looking up the block device if
      not. That way any open of a block device happening after the old device
      has been removed will get the correct block device inode.
      Tested-by: Lekshmi Pillai <lekshmicpillai@in.ibm.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  13. 02 Feb, 2017 2 commits
    • block: Make blk_get_backing_dev_info() safe without open bdev · b1d2dc56
      Committed by Jan Kara
      Currently blk_get_backing_dev_info() is not safe to be called when the
      block device is not open, as bdev->bd_disk is NULL in that case. However,
      inode_to_bdi() uses this function and may be called from the flusher
      worker or other writeback-related functions without the bdev being open,
      which leads to crashes such as:
      
      [113031.075540] Unable to handle kernel paging request for data at address 0x00000000
      [113031.075614] Faulting instruction address: 0xc0000000003692e0
      0:mon> t
      [c0000000fb65f900] c00000000036cb6c writeback_sb_inodes+0x30c/0x590
      [c0000000fb65fa10] c00000000036ced4 __writeback_inodes_wb+0xe4/0x150
      [c0000000fb65fa70] c00000000036d33c wb_writeback+0x30c/0x450
      [c0000000fb65fb40] c00000000036e198 wb_workfn+0x268/0x580
      [c0000000fb65fc50] c0000000000f3470 process_one_work+0x1e0/0x590
      [c0000000fb65fce0] c0000000000f38c8 worker_thread+0xa8/0x660
      [c0000000fb65fd80] c0000000000fc4b0 kthread+0x110/0x130
      [c0000000fb65fe30] c0000000000098f0 ret_from_kernel_thread+0x5c/0x6c
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: Unhash block device inodes on gendisk destruction · f44f1ab5
      Committed by Jan Kara
      Currently, block device inodes stay around after the corresponding
      gendisk has died until memory reclaim finds them and frees them. Since we
      will make the block device inode pin the bdi, we want to free the block
      device inode as soon as the device goes away so that the bdi does not
      stay around unnecessarily. Furthermore, we need to avoid issues when a
      new device with the same major,minor pair gets created, since reusing the
      bdi structure would be rather difficult in this case.
      
      Unhashing the block device inode on gendisk destruction nicely deals with
      these problems. Once the last block device inode reference is dropped
      (which may be directly in del_gendisk()), the inode gets evicted.
      Furthermore, if the major,minor pair gets reallocated, we are guaranteed
      to get a new block device inode even if the old block device inode is not
      yet evicted, and thus we avoid issues with possible reuse of the bdi.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  14. 24 Jan, 2017 1 commit
  15. 25 Dec, 2016 1 commit
  16. 23 Dec, 2016 1 commit
  17. 14 Dec, 2016 2 commits
    • block_dev: don't update file access position for sync direct IO · 7a62a523
      Committed by Shaohua Li
      For sync direct IO, generic_file_direct_write/generic_file_read_iter
      will update the file access position. Don't duplicate the update in
      .direct_IO. This prevented my RAID array from assembling.
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jens Axboe <axboe@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block_dev: don't test bdev->bd_contains when it is not stable · bcc7f5b4
      Committed by NeilBrown
      bdev->bd_contains is not stable before calling __blkdev_get().
      When __blkdev_get() is called on a partition with ->bd_openers == 0
      it sets
        bdev->bd_contains = bdev;
      which is not correct for a partition.
      After a call to __blkdev_get() succeeds, ->bd_openers will be > 0
      and then ->bd_contains is stable.
      
      When FMODE_EXCL is used, blkdev_get() calls
         bd_start_claiming() ->  bd_prepare_to_claim() -> bd_may_claim()
      
      This call happens before __blkdev_get() is called, so ->bd_contains
      is not stable.  So bd_may_claim() cannot safely use ->bd_contains.
      It currently tries to use it, and this can lead to a BUG_ON().
      
      This happens when a whole device is already open with a bd_holder (in
      use by dm in my particular example) and two threads race to open a
      partition of that device for the first time, one opening with O_EXCL and
      one without.
      
      The thread that doesn't use O_EXCL gets through blkdev_get() to
      __blkdev_get(), gains the ->bd_mutex, and sets bdev->bd_contains = bdev;
      
      Immediately thereafter the other thread, using FMODE_EXCL, calls
      bd_start_claiming() from blkdev_get().  This should fail because the
      whole device has a holder, but because bdev->bd_contains == bdev
      bd_may_claim() incorrectly reports success.
      This thread continues and blocks on bd_mutex.
      
      The first thread then sets bdev->bd_contains correctly and drops the mutex.
      The thread using FMODE_EXCL then continues and when it calls bd_may_claim()
      again in:
          BUG_ON(!bd_may_claim(bdev, whole, holder));
      The BUG_ON fires.
      
      Fix this by removing the dependency on ->bd_contains in
      bd_may_claim().  As bd_may_claim() has direct access to the whole
      device, it can simply test if the target bdev is the whole device.
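
The fix can be sketched with a simplified model of the claim test: whether the target is the whole device is decided by comparing the two pointers that were passed in, so a stale ->bd_contains cannot affect the result. The struct and function names below are hypothetical, and the real function handles at least one more case (a device that is being partitioned) that is omitted here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified block device: only the fields relevant to claiming. */
struct claim_bdev_model {
    void *bd_holder;                      /* current exclusive holder */
    struct claim_bdev_model *bd_contains; /* may be stale before first open */
};

/* Decide whether holder may claim bdev, given its whole device. */
bool model_may_claim(const struct claim_bdev_model *bdev,
                     const struct claim_bdev_model *whole,
                     const void *holder)
{
    if (bdev->bd_holder == holder)
        return true;   /* already held by this holder */
    if (bdev->bd_holder != NULL)
        return false;  /* held by someone else */
    if (whole == bdev)
        return true;   /* unheld whole device; bd_contains is not consulted */
    if (whole->bd_holder != NULL)
        return false;  /* partition of a held device */
    return true;       /* partition of an unheld device */
}
```

With this shape, a partition whose ->bd_contains temporarily points at itself is still recognized as a partition of a held device, and the claim is refused up front instead of tripping the later BUG_ON.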
      
      Fixes: 6b4517a7 ("block: implement bd_claiming and claiming block")
      Cc: stable@vger.kernel.org (v2.6.35+)
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  18. 01 Dec, 2016 1 commit
    • block: protect iterate_bdevs() against concurrent close · af309226
      Committed by Rabin Vincent
      If a block device is closed while iterate_bdevs() is handling it, the
      following NULL pointer dereference occurs because bdev->bd_disk is NULL
      in bdev_get_queue(), which is called from blk_get_backing_dev_info() (in
      turn called by the mapping_cap_writeback_dirty() call in
      __filemap_fdatawrite_range()):
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000508
       IP: [<ffffffff81314790>] blk_get_backing_dev_info+0x10/0x20
       PGD 9e62067 PUD 9ee8067 PMD 0
       Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
       Modules linked in:
       CPU: 1 PID: 2422 Comm: sync Not tainted 4.5.0-rc7+ #400
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
       task: ffff880009f4d700 ti: ffff880009f5c000 task.ti: ffff880009f5c000
       RIP: 0010:[<ffffffff81314790>]  [<ffffffff81314790>] blk_get_backing_dev_info+0x10/0x20
       RSP: 0018:ffff880009f5fe68  EFLAGS: 00010246
       RAX: 0000000000000000 RBX: ffff88000ec17a38 RCX: ffffffff81a4e940
       RDX: 7fffffffffffffff RSI: 0000000000000000 RDI: ffff88000ec176c0
       RBP: ffff880009f5fe68 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000001 R11: 0000000000000000 R12: ffff88000ec17860
       R13: ffffffff811b25c0 R14: ffff88000ec178e0 R15: ffff88000ec17a38
       FS:  00007faee505d700(0000) GS:ffff88000fb00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
       CR2: 0000000000000508 CR3: 0000000009e8a000 CR4: 00000000000006e0
       Stack:
        ffff880009f5feb8 ffffffff8112e7f5 0000000000000000 7fffffffffffffff
        0000000000000000 0000000000000000 7fffffffffffffff 0000000000000001
        ffff88000ec178e0 ffff88000ec17860 ffff880009f5fec8 ffffffff8112e81f
       Call Trace:
        [<ffffffff8112e7f5>] __filemap_fdatawrite_range+0x85/0x90
        [<ffffffff8112e81f>] filemap_fdatawrite+0x1f/0x30
        [<ffffffff811b25d6>] fdatawrite_one_bdev+0x16/0x20
        [<ffffffff811bc402>] iterate_bdevs+0xf2/0x130
        [<ffffffff811b2763>] sys_sync+0x63/0x90
        [<ffffffff815d4272>] entry_SYSCALL_64_fastpath+0x12/0x76
       Code: 0f 1f 44 00 00 48 8b 87 f0 00 00 00 55 48 89 e5 <48> 8b 80 08 05 00 00 5d
       RIP  [<ffffffff81314790>] blk_get_backing_dev_info+0x10/0x20
        RSP <ffff880009f5fe68>
       CR2: 0000000000000508
       ---[ end trace 2487336ceb3de62d ]---
      
      The crash is easily reproducible by running the following command, if an
      msleep(100) is inserted before the call to func() in iterate_bdevs():
      
       while :; do head -c1 /dev/nullb0; done > /dev/null & while :; do sync; done
      
      Fix it by holding the bd_mutex across the func() call and only calling
      func() if the bdev is opened.
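
The locking pattern can be sketched in userspace with a pthread mutex standing in for bd_mutex. This is a model of the approach, not the kernel code; open_bdev_model, model_iterate_one, and model_count_cb are hypothetical names.

```c
#include <pthread.h>

/* Simplified block device: a mutex and an open count. */
struct open_bdev_model {
    pthread_mutex_t bd_mutex;
    int bd_openers;
};

typedef void (*model_bdev_fn)(struct open_bdev_model *bdev, void *arg);

/*
 * Visit one device: hold the mutex across the callback so a concurrent
 * close cannot tear the device down underneath it, and skip devices
 * that are not open at all.
 */
void model_iterate_one(struct open_bdev_model *bdev, model_bdev_fn func,
                       void *arg)
{
    pthread_mutex_lock(&bdev->bd_mutex);
    if (bdev->bd_openers)
        func(bdev, arg);
    pthread_mutex_unlock(&bdev->bd_mutex);
}

/* Example callback: count how many devices were actually visited. */
void model_count_cb(struct open_bdev_model *bdev, void *arg)
{
    (void)bdev;
    ++*(int *)arg;
}
```

A close path that takes the same mutex before dropping bd_openers to zero would then either finish before the callback starts or wait until it returns.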
      
      Cc: stable@vger.kernel.org
      Fixes: 5c0d6b60 ("vfs: Create function for iterating over block devices")
      Reported-and-tested-by: Wei Fang <fangwei1@huawei.com>
      Signed-off-by: Rabin Vincent <rabinv@axis.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  19. 22 Nov, 2016 3 commits
  20. 18 Nov, 2016 4 commits
  21. 12 Oct, 2016 1 commit
  22. 06 Oct, 2016 1 commit
  23. 14 Sep, 2016 1 commit
  24. 25 Aug, 2016 1 commit
  25. 22 Aug, 2016 1 commit
    • bdev: fix NULL pointer dereference · e9e5e3fa
      Committed by Vegard Nossum
      I got this:
      
          kasan: GPF could be caused by NULL-ptr deref or user memory access
          general protection fault: 0000 [#1] PREEMPT SMP KASAN
          Dumping ftrace buffer:
             (ftrace buffer empty)
          CPU: 0 PID: 5505 Comm: syz-executor Not tainted 4.8.0-rc2+ #161
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
          task: ffff880113415940 task.stack: ffff880118350000
          RIP: 0010:[<ffffffff8172cb32>]  [<ffffffff8172cb32>] bd_mount+0x52/0xa0
          RSP: 0018:ffff880118357ca0  EFLAGS: 00010207
          RAX: dffffc0000000000 RBX: ffffffffffffffff RCX: ffffc90000bb6000
          RDX: 0000000000000018 RSI: ffffffff846d6b20 RDI: 00000000000000c7
          RBP: ffff880118357cb0 R08: ffff880115967c68 R09: 0000000000000000
          R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801188211e8
          R13: ffffffff847baa20 R14: ffff8801139cb000 R15: 0000000000000080
          FS:  00007fa3ff6c0700(0000) GS:ffff88011aa00000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 00007fc1d8cc7e78 CR3: 0000000109f20000 CR4: 00000000000006f0
          DR0: 000000000000001e DR1: 000000000000001e DR2: 0000000000000000
          DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
          Stack:
           ffff880112cfd6c0 ffff8801188211e8 ffff880118357cf0 ffffffff8167f207
           ffffffff816d7a1e ffff880112a413c0 ffffffff847baa20 ffff8801188211e8
           0000000000000080 ffff880112cfd6c0 ffff880118357d38 ffffffff816dce0a
          Call Trace:
           [<ffffffff8167f207>] mount_fs+0x97/0x2e0
           [<ffffffff816d7a1e>] ? alloc_vfsmnt+0x55e/0x760
           [<ffffffff816dce0a>] vfs_kern_mount+0x7a/0x300
           [<ffffffff83c3247c>] ? _raw_read_unlock+0x2c/0x50
           [<ffffffff816dfc87>] do_mount+0x3d7/0x2730
           [<ffffffff81235fd4>] ? trace_do_page_fault+0x1f4/0x3a0
           [<ffffffff816df8b0>] ? copy_mount_string+0x40/0x40
           [<ffffffff8161ea81>] ? memset+0x31/0x40
           [<ffffffff816df73e>] ? copy_mount_options+0x1ee/0x320
           [<ffffffff816e2a02>] SyS_mount+0xb2/0x120
           [<ffffffff816e2950>] ? copy_mnt_ns+0x970/0x970
           [<ffffffff81005524>] do_syscall_64+0x1c4/0x4e0
           [<ffffffff83c3282a>] entry_SYSCALL64_slow_path+0x25/0x25
          Code: 83 e8 63 1b fc ff 48 85 c0 48 89 c3 74 4c e8 56 35 d1 ff 48 8d bb c8 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 36 4c 8b a3 c8 00 00 00 48 b8 00 00 00 00 00 fc
          RIP  [<ffffffff8172cb32>] bd_mount+0x52/0xa0
           RSP <ffff880118357ca0>
          ---[ end trace 13690ad962168b98 ]---
      
      mount_pseudo() returns ERR_PTR(), not NULL, on error.
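
Why a NULL test misses this failure can be shown with a userspace imitation of the error-pointer convention, where a small negative errno is encoded in the pointer value itself. The helpers below only mimic the kernel's ERR_PTR/IS_ERR/PTR_ERR idea; all names are hypothetical, and model_mount_pseudo is a stand-in that fails on request.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095

/* Encode a negative errno as a pointer value. */
static inline void *model_err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* Recover the errno from an error pointer. */
static inline long model_ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* True if the pointer falls in the top MODEL_MAX_ERRNO addresses. */
static inline int model_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MODEL_MAX_ERRNO;
}

/* Stand-in for a function that reports failure via an error pointer. */
void *model_mount_pseudo(int fail)
{
    static int fake_dentry;

    return fail ? model_err_ptr(-ENOMEM) : (void *)&fake_dentry;
}

/* Caller: the error pointer is non-NULL, so it must be tested with is_err. */
int model_bd_mount(int fail)
{
    void *dentry = model_mount_pseudo(fail);

    if (model_is_err(dentry))
        return (int)model_ptr_err(dentry);
    /* Only here is it safe to dereference dentry. */
    return 0;
}
```

A check of the form `if (dentry)` would pass for the error pointer and go on to dereference an invalid address, which matches the fault in the trace above.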
      
      Fixes: 3684aa70 ("block-dev: enable writeback cgroup support")
      Cc: Shaohua Li <shli@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  26. 08 Aug, 2016 1 commit
  27. 05 Aug, 2016 1 commit
  28. 04 Aug, 2016 1 commit
  29. 21 Jul, 2016 1 commit