1. 23 3月, 2017 2 次提交
  2. 02 3月, 2017 1 次提交
    • J
      block: Initialize bd_bdi on inode initialization · a5a79d00
      Jan Kara 提交于
      So far we initialized bd_bdi only in bdget(). That is fine for normal
      bdev inodes however for the special case of the root inode of
      blockdev_superblock that function is never called and thus bd_bdi is
      left uninitialized. As a result bdev_evict_inode() may oops doing
      bdi_put(root->bd_bdi) on that inode as can be seen when doing:
      
      mount -t bdev none /mnt
      
      Fix the problem by initializing bd_bdi when first allocating the inode
      and then reinitializing bd_bdi in bdev_evict_inode().
      
      Thanks to syzkaller team for finding the problem.
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Fixes: b1d2dc56 ("block: Make blk_get_backing_dev_info() safe without open bdev")
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      a5a79d00
  3. 28 2月, 2017 1 次提交
  4. 22 2月, 2017 1 次提交
    • J
      block: Revalidate i_bdev reference in bd_aquire() · cccd9fb9
      Jan Kara 提交于
      When a device gets removed, block device inode unhashed so that it is not
      used anymore (bdget() will not find it anymore). Later when a new device
      gets created with the same device number, we create new block device
      inode. However there may be file system device inodes whose i_bdev still
      points to the original block device inode and thus we get two active
      block device inodes for the same device. They will share the same
      gendisk so the only visible differences will be that page caches will
      not be coherent and BDIs will be different (the old block device inode
      still points to unregistered BDI).
      
      Fix the problem by checking in bd_acquire() whether i_bdev still points
      to active block device inode and re-lookup the block device if not. That
      way any open of a block device happening after the old device has been
      removed will get correct block device inode.
      Tested-by: NLekshmi Pillai <lekshmicpillai@in.ibm.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      cccd9fb9
  5. 02 2月, 2017 2 次提交
    • J
      block: Make blk_get_backing_dev_info() safe without open bdev · b1d2dc56
      Jan Kara 提交于
      Currenly blk_get_backing_dev_info() is not safe to be called when the
      block device is not open as bdev->bd_disk is NULL in that case. However
      inode_to_bdi() uses this function and may be call called from flusher
      worker or other writeback related functions without bdev being open
      which leads to crashes such as:
      
      [113031.075540] Unable to handle kernel paging request for data at address 0x00000000
      [113031.075614] Faulting instruction address: 0xc0000000003692e0
      0:mon> t
      [c0000000fb65f900] c00000000036cb6c writeback_sb_inodes+0x30c/0x590
      [c0000000fb65fa10] c00000000036ced4 __writeback_inodes_wb+0xe4/0x150
      [c0000000fb65fa70] c00000000036d33c wb_writeback+0x30c/0x450
      [c0000000fb65fb40] c00000000036e198 wb_workfn+0x268/0x580
      [c0000000fb65fc50] c0000000000f3470 process_one_work+0x1e0/0x590
      [c0000000fb65fce0] c0000000000f38c8 worker_thread+0xa8/0x660
      [c0000000fb65fd80] c0000000000fc4b0 kthread+0x110/0x130
      [c0000000fb65fe30] c0000000000098f0 ret_from_kernel_thread+0x5c/0x6c
      Signed-off-by: NJens Axboe <axboe@fb.com>
      b1d2dc56
    • J
      block: Unhash block device inodes on gendisk destruction · f44f1ab5
      Jan Kara 提交于
      Currently, block device inodes stay around after corresponding gendisk
      hash died until memory reclaim finds them and frees them. Since we will
      make block device inode pin the bdi, we want to free the block device
      inode as soon as the device goes away so that bdi does not stay around
      unnecessarily. Furthermore we need to avoid issues when new device with
      the same major,minor pair gets created since reusing the bdi structure
      would be rather difficult in this case.
      
      Unhashing block device inode on gendisk destruction nicely deals with
      these problems. Once last block device inode reference is dropped (which
      may be directly in del_gendisk()), the inode gets evicted. Furthermore if
      the major,minor pair gets reallocated, we are guaranteed to get new
      block device inode even if old block device inode is not yet evicted and
      thus we avoid issues with possible reuse of bdi.
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      f44f1ab5
  6. 24 1月, 2017 1 次提交
  7. 25 12月, 2016 1 次提交
  8. 23 12月, 2016 1 次提交
  9. 14 12月, 2016 2 次提交
    • S
      block_dev: don't update file access position for sync direct IO · 7a62a523
      Shaohua Li 提交于
      For sync direct IO, generic_file_direct_write/generic_file_read_iter
      will update file access position. Don't duplicate the update in
      .direct_IO. This cause my raid array can't assemble.
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jens Axboe <axboe@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      7a62a523
    • N
      block_dev: don't test bdev->bd_contains when it is not stable · bcc7f5b4
      NeilBrown 提交于
      bdev->bd_contains is not stable before calling __blkdev_get().
      When __blkdev_get() is called on a parition with ->bd_openers == 0
      it sets
        bdev->bd_contains = bdev;
      which is not correct for a partition.
      After a call to __blkdev_get() succeeds, ->bd_openers will be > 0
      and then ->bd_contains is stable.
      
      When FMODE_EXCL is used, blkdev_get() calls
         bd_start_claiming() ->  bd_prepare_to_claim() -> bd_may_claim()
      
      This call happens before __blkdev_get() is called, so ->bd_contains
      is not stable.  So bd_may_claim() cannot safely use ->bd_contains.
      It currently tries to use it, and this can lead to a BUG_ON().
      
      This happens when a whole device is already open with a bd_holder (in
      use by dm in my particular example) and two threads race to open a
      partition of that device for the first time, one opening with O_EXCL and
      one without.
      
      The thread that doesn't use O_EXCL gets through blkdev_get() to
      __blkdev_get(), gains the ->bd_mutex, and sets bdev->bd_contains = bdev;
      
      Immediately thereafter the other thread, using FMODE_EXCL, calls
      bd_start_claiming() from blkdev_get().  This should fail because the
      whole device has a holder, but because bdev->bd_contains == bdev
      bd_may_claim() incorrectly reports success.
      This thread continues and blocks on bd_mutex.
      
      The first thread then sets bdev->bd_contains correctly and drops the mutex.
      The thread using FMODE_EXCL then continues and when it calls bd_may_claim()
      again in:
      			BUG_ON(!bd_may_claim(bdev, whole, holder));
      The BUG_ON fires.
      
      Fix this by removing the dependency on ->bd_contains in
      bd_may_claim().  As bd_may_claim() has direct access to the whole
      device, it can simply test if the target bdev is the whole device.
      
      Fixes: 6b4517a7 ("block: implement bd_claiming and claiming block")
      Cc: stable@vger.kernel.org (v2.6.35+)
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      bcc7f5b4
  10. 01 12月, 2016 1 次提交
    • R
      block: protect iterate_bdevs() against concurrent close · af309226
      Rabin Vincent 提交于
      If a block device is closed while iterate_bdevs() is handling it, the
      following NULL pointer dereference occurs because bdev->b_disk is NULL
      in bdev_get_queue(), which is called from blk_get_backing_dev_info() (in
      turn called by the mapping_cap_writeback_dirty() call in
      __filemap_fdatawrite_range()):
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000508
       IP: [<ffffffff81314790>] blk_get_backing_dev_info+0x10/0x20
       PGD 9e62067 PUD 9ee8067 PMD 0
       Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
       Modules linked in:
       CPU: 1 PID: 2422 Comm: sync Not tainted 4.5.0-rc7+ #400
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
       task: ffff880009f4d700 ti: ffff880009f5c000 task.ti: ffff880009f5c000
       RIP: 0010:[<ffffffff81314790>]  [<ffffffff81314790>] blk_get_backing_dev_info+0x10/0x20
       RSP: 0018:ffff880009f5fe68  EFLAGS: 00010246
       RAX: 0000000000000000 RBX: ffff88000ec17a38 RCX: ffffffff81a4e940
       RDX: 7fffffffffffffff RSI: 0000000000000000 RDI: ffff88000ec176c0
       RBP: ffff880009f5fe68 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000001 R11: 0000000000000000 R12: ffff88000ec17860
       R13: ffffffff811b25c0 R14: ffff88000ec178e0 R15: ffff88000ec17a38
       FS:  00007faee505d700(0000) GS:ffff88000fb00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
       CR2: 0000000000000508 CR3: 0000000009e8a000 CR4: 00000000000006e0
       Stack:
        ffff880009f5feb8 ffffffff8112e7f5 0000000000000000 7fffffffffffffff
        0000000000000000 0000000000000000 7fffffffffffffff 0000000000000001
        ffff88000ec178e0 ffff88000ec17860 ffff880009f5fec8 ffffffff8112e81f
       Call Trace:
        [<ffffffff8112e7f5>] __filemap_fdatawrite_range+0x85/0x90
        [<ffffffff8112e81f>] filemap_fdatawrite+0x1f/0x30
        [<ffffffff811b25d6>] fdatawrite_one_bdev+0x16/0x20
        [<ffffffff811bc402>] iterate_bdevs+0xf2/0x130
        [<ffffffff811b2763>] sys_sync+0x63/0x90
        [<ffffffff815d4272>] entry_SYSCALL_64_fastpath+0x12/0x76
       Code: 0f 1f 44 00 00 48 8b 87 f0 00 00 00 55 48 89 e5 <48> 8b 80 08 05 00 00 5d
       RIP  [<ffffffff81314790>] blk_get_backing_dev_info+0x10/0x20
        RSP <ffff880009f5fe68>
       CR2: 0000000000000508
       ---[ end trace 2487336ceb3de62d ]---
      
      The crash is easily reproducible by running the following command, if an
      msleep(100) is inserted before the call to func() in iterate_devs():
      
       while :; do head -c1 /dev/nullb0; done > /dev/null & while :; do sync; done
      
      Fix it by holding the bd_mutex across the func() call and only calling
      func() if the bdev is opened.
      
      Cc: stable@vger.kernel.org
      Fixes: 5c0d6b60 ("vfs: Create function for iterating over block devices")
      Reported-and-tested-by: NWei Fang <fangwei1@huawei.com>
      Signed-off-by: NRabin Vincent <rabinv@axis.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      af309226
  11. 22 11月, 2016 3 次提交
  12. 18 11月, 2016 4 次提交
  13. 12 10月, 2016 1 次提交
  14. 06 10月, 2016 1 次提交
  15. 14 9月, 2016 1 次提交
  16. 25 8月, 2016 1 次提交
  17. 22 8月, 2016 1 次提交
    • V
      bdev: fix NULL pointer dereference · e9e5e3fa
      Vegard Nossum 提交于
      I got this:
      
          kasan: GPF could be caused by NULL-ptr deref or user memory access
          general protection fault: 0000 [#1] PREEMPT SMP KASAN
          Dumping ftrace buffer:
             (ftrace buffer empty)
          CPU: 0 PID: 5505 Comm: syz-executor Not tainted 4.8.0-rc2+ #161
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
          task: ffff880113415940 task.stack: ffff880118350000
          RIP: 0010:[<ffffffff8172cb32>]  [<ffffffff8172cb32>] bd_mount+0x52/0xa0
          RSP: 0018:ffff880118357ca0  EFLAGS: 00010207
          RAX: dffffc0000000000 RBX: ffffffffffffffff RCX: ffffc90000bb6000
          RDX: 0000000000000018 RSI: ffffffff846d6b20 RDI: 00000000000000c7
          RBP: ffff880118357cb0 R08: ffff880115967c68 R09: 0000000000000000
          R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801188211e8
          R13: ffffffff847baa20 R14: ffff8801139cb000 R15: 0000000000000080
          FS:  00007fa3ff6c0700(0000) GS:ffff88011aa00000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 00007fc1d8cc7e78 CR3: 0000000109f20000 CR4: 00000000000006f0
          DR0: 000000000000001e DR1: 000000000000001e DR2: 0000000000000000
          DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
          Stack:
           ffff880112cfd6c0 ffff8801188211e8 ffff880118357cf0 ffffffff8167f207
           ffffffff816d7a1e ffff880112a413c0 ffffffff847baa20 ffff8801188211e8
           0000000000000080 ffff880112cfd6c0 ffff880118357d38 ffffffff816dce0a
          Call Trace:
           [<ffffffff8167f207>] mount_fs+0x97/0x2e0
           [<ffffffff816d7a1e>] ? alloc_vfsmnt+0x55e/0x760
           [<ffffffff816dce0a>] vfs_kern_mount+0x7a/0x300
           [<ffffffff83c3247c>] ? _raw_read_unlock+0x2c/0x50
           [<ffffffff816dfc87>] do_mount+0x3d7/0x2730
           [<ffffffff81235fd4>] ? trace_do_page_fault+0x1f4/0x3a0
           [<ffffffff816df8b0>] ? copy_mount_string+0x40/0x40
           [<ffffffff8161ea81>] ? memset+0x31/0x40
           [<ffffffff816df73e>] ? copy_mount_options+0x1ee/0x320
           [<ffffffff816e2a02>] SyS_mount+0xb2/0x120
           [<ffffffff816e2950>] ? copy_mnt_ns+0x970/0x970
           [<ffffffff81005524>] do_syscall_64+0x1c4/0x4e0
           [<ffffffff83c3282a>] entry_SYSCALL64_slow_path+0x25/0x25
          Code: 83 e8 63 1b fc ff 48 85 c0 48 89 c3 74 4c e8 56 35 d1 ff 48 8d bb c8 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 36 4c 8b a3 c8 00 00 00 48 b8 00 00 00 00 00 fc
          RIP  [<ffffffff8172cb32>] bd_mount+0x52/0xa0
           RSP <ffff880118357ca0>
          ---[ end trace 13690ad962168b98 ]---
      
      mount_pseudo() returns ERR_PTR(), not NULL, on error.
      
      Fixes: 3684aa70 ("block-dev: enable writeback cgroup support")
      Cc: Shaohua Li <shli@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NVegard Nossum <vegard.nossum@oracle.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      e9e5e3fa
  18. 08 8月, 2016 1 次提交
  19. 05 8月, 2016 1 次提交
  20. 04 8月, 2016 1 次提交
  21. 21 7月, 2016 1 次提交
  22. 20 7月, 2016 1 次提交
    • A
      bdev: get rid of ->bd_inodes · a4a4f943
      Al Viro 提交于
      Since 2006 we have ->i_bdev pinning bdev in question, so there's no
      way to get to bdev ->evict_inode() while there's an aliasing inode
      anywhere.  In other words, the only place walking the list of aliases
      is guaranteed to do it only when the list is empty...
      
      Remove the detritus; it should've been done in "[PATCH] Fix a race
      condition between ->i_mapping and iput()", but nobody had noticed it
      back then.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      a4a4f943
  23. 24 6月, 2016 1 次提交
    • E
      vfs: Generalize filesystem nodev handling. · a2982cc9
      Eric W. Biederman 提交于
      Introduce a function may_open_dev that tests MNT_NODEV and a new
      superblock flab SB_I_NODEV.  Use this new function in all of the
      places where MNT_NODEV was previously tested.
      
      Add the new SB_I_NODEV s_iflag to proc, sysfs, and mqueuefs as those
      filesystems should never support device nodes, and a simple superblock
      flags makes that very hard to get wrong.  With SB_I_NODEV set if any
      device nodes somehow manage to show up on on a filesystem those
      device nodes will be unopenable.
      Acked-by: NSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      a2982cc9
  24. 21 5月, 2016 1 次提交
  25. 19 5月, 2016 1 次提交
  26. 17 5月, 2016 4 次提交
    • T
      block: Update blkdev_dax_capable() for consistency · a8078b1f
      Toshi Kani 提交于
      blkdev_dax_capable() is similar to bdev_dax_supported(), but needs
      to remain as a separate interface for checking dax capability of
      a raw block device.
      
      Rename and relocate blkdev_dax_capable() to keep them maintained
      consistently, and call bdev_direct_access() for the dax capability
      check.
      
      There is no change in the behavior.
      
      Link: https://lkml.org/lkml/2016/5/9/950Signed-off-by: NToshi Kani <toshi.kani@hpe.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Signed-off-by: NVishal Verma <vishal.l.verma@intel.com>
      a8078b1f
    • T
      block: Add bdev_dax_supported() for dax mount checks · 2d96afc8
      Toshi Kani 提交于
      DAX imposes additional requirements to a device.  Add
      bdev_dax_supported() which performs all the precondition checks
      necessary for filesystem to mount the device with dax option.
      
      Also add a new check to verify if a partition is aligned by 4KB.
      When a partition is unaligned, any dax read/write access fails,
      except for metadata update.
      Signed-off-by: NToshi Kani <toshi.kani@hpe.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Signed-off-by: NVishal Verma <vishal.l.verma@intel.com>
      2d96afc8
    • T
      block: Add vfs_msg() interface · 2af3a815
      Toshi Kani 提交于
      In preparation of moving DAX capability checks to the block layer
      from filesystem code, add a VFS message interface that aligns with
      filesystem's message format.
      
      For instance, a vfs_msg() message followed by XFS messages in case
      of a dax mount error may look like:
      
        VFS (pmem0p1): error: unaligned partition for dax
        XFS (pmem0p1): DAX unsupported by block device. Turning off DAX.
        XFS (pmem0p1): Mounting V5 Filesystem
         :
      
      vfs_msg() is largely based on ext4_msg().
      Signed-off-by: NToshi Kani <toshi.kani@hpe.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Signed-off-by: NVishal Verma <vishal.l.verma@intel.com>
      2af3a815
    • J
      dax: Remove complete_unwritten argument · 02fbd139
      Jan Kara 提交于
      Fault handlers currently take complete_unwritten argument to convert
      unwritten extents after PTEs are updated. However no filesystem uses
      this anymore as the code is racy. Remove the unused argument.
      Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NVishal Verma <vishal.l.verma@intel.com>
      02fbd139
  27. 02 5月, 2016 3 次提交