1. 16 Nov, 2018 1 commit
  2. 28 Sep, 2018 1 commit
  3. 22 Sep, 2018 1 commit
    • block: use nanosecond resolution for iostat · b57e99b4
      Omar Sandoval committed
      Klaus Kusche reported that the I/O busy time in /proc/diskstats was not
      updating properly on 4.18. This is because we started using ktime to
      track elapsed time, and we convert nanoseconds to jiffies when we update
      the partition counter. However, this gets rounded down, so any I/Os that
      take less than a jiffy are not accounted for. Previously in this case,
      the value of jiffies would sometimes increment while we were doing I/O,
      so at least some I/Os were accounted for.
      
      Let's convert the stats to use nanoseconds internally. We still report
      milliseconds as before, now more accurately than ever. The value is
      still truncated to 32 bits for backwards compatibility.
      
      Fixes: 522a7775 ("block: consolidate struct request timestamp fields")
      Cc: stable@vger.kernel.org
      Reported-by: Klaus Kusche <klaus.kusche@computerix.info>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
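The rounding problem described in this commit can be illustrated with a small userspace sketch. It is not the kernel code: `SKETCH_HZ`, the function names, and the sample durations are assumptions made for illustration. It compares converting each I/O duration to whole ticks before summing with summing nanoseconds and converting once when reporting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed tick rate for illustration; a real kernel fixes HZ at build time. */
#define SKETCH_HZ 250u
#define SKETCH_NSEC_PER_SEC 1000000000ull
#define SKETCH_NSEC_PER_MSEC 1000000ull

/* Old scheme: convert every duration to whole ticks, then accumulate.
 * Any I/O shorter than one tick contributes nothing. */
static uint64_t sketch_ticks_accumulated(const uint64_t *dur_ns, size_t n)
{
    uint64_t ticks = 0;

    assert(dur_ns != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        ticks += dur_ns[i] / (SKETCH_NSEC_PER_SEC / SKETCH_HZ); /* rounds down */
    return ticks;
}

/* New scheme: accumulate nanoseconds internally. */
static uint64_t sketch_ns_accumulated(const uint64_t *dur_ns, size_t n)
{
    uint64_t total = 0;

    assert(dur_ns != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        total += dur_ns[i];
    return total;
}

/* Reporting: milliseconds, truncated to 32 bits as the commit describes. */
static uint32_t sketch_report_msecs(uint64_t total_ns)
{
    return (uint32_t)(total_ns / SKETCH_NSEC_PER_MSEC);
}
```

With the assumed tick of 4 ms, ten I/Os of 1 ms each accumulate zero ticks under the old scheme, while the nanosecond total reports 10 ms.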
  4. 18 Jul, 2018 2 commits
  5. 25 May, 2018 1 commit
  6. 16 May, 2018 1 commit
  7. 26 Apr, 2018 1 commit
  8. 16 Mar, 2018 1 commit
  9. 27 Feb, 2018 4 commits
    • genhd: Fix BUG in blkdev_open() · 56c0908c
      Jan Kara committed
      When two blkdev_open() calls for a partition race with device removal
      and recreation, we can hit BUG_ON(!bd_may_claim(bdev, whole, holder)) in
      blkdev_open(). The race can happen as follows:
      
      CPU0				CPU1			CPU2
      							del_gendisk()
      							  bdev_unhash_inode(part1);
      
      blkdev_open(part1, O_EXCL)	blkdev_open(part1, O_EXCL)
        bdev = bd_acquire()		  bdev = bd_acquire()
        blkdev_get(bdev)
          bd_start_claiming(bdev)
            - finds old inode 'whole'
            bd_prepare_to_claim() -> 0
      							  bdev_unhash_inode(whole);
      							<device removed>
      							<new device under same
      							 number created>
      				  blkdev_get(bdev);
      				    bd_start_claiming(bdev)
      				      - finds new inode 'whole'
      				      bd_prepare_to_claim()
      					- this also succeeds as we have
      					  different 'whole' here...
      					- bad things happen now as we
      					  have two exclusive openers of
      					  the same bdev
      
      The problem here is that block device opens can see various intermediate
      states while gendisk is shutting down and then being recreated.
      
      We fix the problem by introducing a new lookup_sem in gendisk that
      synchronizes gendisk deletion with get_gendisk(), and furthermore by
      making sure that get_gendisk() does not return a gendisk that is being
      (or has been) deleted. This ensures that once we manage to look up a
      newly created bdev inode, the following get_gendisk() will either
      return failure (and we fail the open) or return the gendisk for the new
      device, and the following bdget_disk() will return the new bdev inode
      (i.e., blkdev_open() follows the path as if it ran entirely after the
      new device was created).
      Reported-and-analyzed-by: Hou Tao <houtao1@huawei.com>
      Tested-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • genhd: Add helper put_disk_and_module() · 9df6c299
      Jan Kara committed
      Add a proper counterpart to get_disk_and_module() -
      put_disk_and_module(). Currently it is opencoded in several places.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
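The pairing these commits describe, where one call takes both a disk reference and a module reference and therefore needs a counterpart that drops both, might look like the following userspace sketch. The `sketch_` types and plain integer counters are assumptions for illustration and stand in for the kernel's kobject and module reference counting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a driver module and a gendisk. */
struct sketch_module { int refs; };
struct sketch_disk { int kobj_refs; struct sketch_module *owner; };

/* Takes TWO references: one on the owning module, one on the disk. */
static struct sketch_disk *sketch_get_disk_and_module(struct sketch_disk *d)
{
    d->owner->refs++;
    d->kobj_refs++;
    return d;
}

/* Drops only the disk reference; the module reference is untouched. */
static void sketch_put_disk(struct sketch_disk *d)
{
    assert(d->kobj_refs > 0);
    d->kobj_refs--;
}

/* True counterpart of sketch_get_disk_and_module(): drops both. */
static void sketch_put_disk_and_module(struct sketch_disk *d)
{
    /* Save the owner first: the disk may go away with its last reference. */
    struct sketch_module *owner = d->owner;

    sketch_put_disk(d);
    assert(owner->refs > 0);
    owner->refs--;
}
```

Pairing `sketch_get_disk_and_module()` with `sketch_put_disk()` alone leaves the module count one too high, which is the kind of leak the d52987b5 commit further down this page fixes.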
    • genhd: Rename get_disk() to get_disk_and_module() · 3079c22e
      Jan Kara committed
      Rename get_disk() to get_disk_and_module() to make clear what the
      function does. It's not a great name but at least it is now clear that
      put_disk() is not its counterpart.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • genhd: Fix leaked module reference for NVME devices · d52987b5
      Jan Kara committed
      Commit 8ddcd653 "block: introduce GENHD_FL_HIDDEN" added handling of
      hidden devices to get_gendisk() but forgot to drop module reference
      which is also acquired by get_disk(). Drop the reference as necessary.
      
      Arguably the function naming here is misleading as put_disk() is *not*
      the counterpart of get_disk() but let's fix that in the follow up
      commit since that will be more intrusive.
      
      Fixes: 8ddcd653
      CC: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  10. 15 Jan, 2018 2 commits
    • block: allow gendisk's request_queue registration to be deferred · fa70d2e2
      Mike Snitzer committed
      For as long as I can remember, DM has forced the block layer to allow the
      allocation and initialization of the request_queue to be distinct
      operations.  The reason for this is that block/genhd.c:add_disk() requires
      that the request_queue (and associated bdi) be tied to the gendisk
      before add_disk() is called -- because add_disk() also deals with
      exposing the request_queue via blk_register_queue().
      
      DM's dynamic creation of arbitrary device types (and associated
      request_queue types) requires the DM device's gendisk be available so
      that DM table loads can establish a master/slave relationship with
      subordinate devices that are referenced by loaded DM tables -- using
      bd_link_disk_holder().  But until these DM tables, and their associated
      subordinate devices, are known DM cannot know what type of request_queue
      it needs -- nor what its queue_limits should be.
      
      This chicken and egg scenario has created all manner of problems for DM
      and, at times, the block layer.
      
      Summary of changes:
      
      - Add device_add_disk_no_queue_reg() and add_disk_no_queue_reg() variant
        that drivers may use to add a disk without also calling
        blk_register_queue().  Driver must call blk_register_queue() once its
        request_queue is fully initialized.
      
      - Return early from blk_unregister_queue() if QUEUE_FLAG_REGISTERED
        is not set.  It won't be set if driver used add_disk_no_queue_reg()
        but driver encounters an error and must del_gendisk() before calling
        blk_register_queue().
      
      - Export blk_register_queue().
      
      These changes allow DM to use add_disk_no_queue_reg() to anchor its
      gendisk as the "master" for master/slave relationships DM must establish
      with subordinate devices referenced in DM tables that get loaded.  Once
      all "slave" devices for a DM device are known its request_queue can be
      properly initialized and then advertised via sysfs -- important
      improvement being that no request_queue resource initialization
      performed by blk_register_queue() is missed for DM devices anymore.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
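The deferred-registration flow from the summary of changes can be sketched in userspace as follows. This is a simplified model, not the kernel implementation: the `sketch_` names, the boolean standing in for QUEUE_FLAG_REGISTERED, and the counter standing in for sysfs state are all assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical queue: 'registered' models QUEUE_FLAG_REGISTERED and
 * 'sysfs_entries' models whatever registration exposes. */
struct sketch_queue { bool registered; int sysfs_entries; };

static void sketch_register_queue(struct sketch_queue *q)
{
    assert(!q->registered);
    q->sysfs_entries++;
    q->registered = true;
}

static void sketch_unregister_queue(struct sketch_queue *q)
{
    /* Return early if registration never happened, e.g. the driver hit
     * an error between adding the disk and registering the queue. */
    if (!q->registered)
        return;
    q->sysfs_entries--;
    q->registered = false;
}

/* register_queue == false models the add_disk_no_queue_reg() variant. */
static void sketch_add_disk(struct sketch_queue *q, bool register_queue)
{
    if (register_queue)
        sketch_register_queue(q);
}

static void sketch_del_gendisk(struct sketch_queue *q)
{
    sketch_unregister_queue(q);
}
```

A driver in the DM position would call `sketch_add_disk(q, false)`, finish initializing its queue, and only then call `sketch_register_queue(q)`. If it fails before that point, `sketch_del_gendisk()` is still safe because of the early return.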
    • block: only bdi_unregister() in del_gendisk() if !GENHD_FL_HIDDEN · bc8d062c
      Mike Snitzer committed
      device_add_disk() will only call bdi_register_owner() if
      !GENHD_FL_HIDDEN, so it follows that del_gendisk() should only call
      bdi_unregister() if !GENHD_FL_HIDDEN.
      
      Found with code inspection.  bdi_unregister() won't do any harm if
      bdi_register_owner() wasn't used but best to avoid the unnecessary
      call to bdi_unregister().
      
      Fixes: 8ddcd653 ("block: introduce GENHD_FL_HIDDEN")
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  11. 20 Nov, 2017 2 commits
  12. 11 Nov, 2017 2 commits
  13. 04 Nov, 2017 2 commits
  14. 26 Oct, 2017 1 commit
    • block, locking/lockdep: Assign a lock_class per gendisk used for wait_for_completion() · e319e1fb
      Byungchul Park committed
      Darrick posted the following warning and Dave Chinner analyzed it:
      
      > ======================================================
      > WARNING: possible circular locking dependency detected
      > 4.14.0-rc1-fixes #1 Tainted: G        W
      > ------------------------------------------------------
      > loop0/31693 is trying to acquire lock:
      >  (&(&ip->i_mmaplock)->mr_lock){++++}, at: [<ffffffffa00f1b0c>] xfs_ilock+0x23c/0x330 [xfs]
      >
      > but now in release context of a crosslock acquired at the following:
      >  ((complete)&ret.event){+.+.}, at: [<ffffffff81326c1f>] submit_bio_wait+0x7f/0xb0
      >
      > which lock already depends on the new lock.
      >
      > the existing dependency chain (in reverse order) is:
      >
      > -> #2 ((complete)&ret.event){+.+.}:
      >        lock_acquire+0xab/0x200
      >        wait_for_completion_io+0x4e/0x1a0
      >        submit_bio_wait+0x7f/0xb0
      >        blkdev_issue_zeroout+0x71/0xa0
      >        xfs_bmapi_convert_unwritten+0x11f/0x1d0 [xfs]
      >        xfs_bmapi_write+0x374/0x11f0 [xfs]
      >        xfs_iomap_write_direct+0x2ac/0x430 [xfs]
      >        xfs_file_iomap_begin+0x20d/0xd50 [xfs]
      >        iomap_apply+0x43/0xe0
      >        dax_iomap_rw+0x89/0xf0
      >        xfs_file_dax_write+0xcc/0x220 [xfs]
      >        xfs_file_write_iter+0xf0/0x130 [xfs]
      >        __vfs_write+0xd9/0x150
      >        vfs_write+0xc8/0x1c0
      >        SyS_write+0x45/0xa0
      >        entry_SYSCALL_64_fastpath+0x1f/0xbe
      >
      > -> #1 (&xfs_nondir_ilock_class){++++}:
      >        lock_acquire+0xab/0x200
      >        down_write_nested+0x4a/0xb0
      >        xfs_ilock+0x263/0x330 [xfs]
      >        xfs_setattr_size+0x152/0x370 [xfs]
      >        xfs_vn_setattr+0x6b/0x90 [xfs]
      >        notify_change+0x27d/0x3f0
      >        do_truncate+0x5b/0x90
      >        path_openat+0x237/0xa90
      >        do_filp_open+0x8a/0xf0
      >        do_sys_open+0x11c/0x1f0
      >        entry_SYSCALL_64_fastpath+0x1f/0xbe
      >
      > -> #0 (&(&ip->i_mmaplock)->mr_lock){++++}:
      >        up_write+0x1c/0x40
      >        xfs_iunlock+0x1d0/0x310 [xfs]
      >        xfs_file_fallocate+0x8a/0x310 [xfs]
      >        loop_queue_work+0xb7/0x8d0
      >        kthread_worker_fn+0xb9/0x1f0
      >
      > Chain exists of:
      >   &(&ip->i_mmaplock)->mr_lock --> &xfs_nondir_ilock_class --> (complete)&ret.event
      >
      >  Possible unsafe locking scenario by crosslock:
      >
      >        CPU0                    CPU1
      >        ----                    ----
      >   lock(&xfs_nondir_ilock_class);
      >   lock((complete)&ret.event);
      >                                lock(&(&ip->i_mmaplock)->mr_lock);
      >                                unlock((complete)&ret.event);
      >
      >                *** DEADLOCK ***
      
      The warning is a false positive, caused by the fact that all
      wait_for_completion()s in submit_bio_wait() are waiting with the same
      lock class.
      
      However, some bios have nothing to do with others. For example, in the
      case of loop devices, there's no direct connection between the bios of an
      upper device and the bios of a lower device (the loop device).
      
      The safest way to assign different lock classes to different devices is
      to do it for each gendisk. In other words, this patch assigns a
      lockdep_map per gendisk and uses it when initializing completion in
      submit_bio_wait().
      Analyzed-by: Dave Chinner <david@fromorbit.com>
      Reported-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Byungchul Park <byungchul.park@lge.com>
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: amir73il@gmail.com
      Cc: axboe@kernel.dk
      Cc: david@fromorbit.com
      Cc: hch@infradead.org
      Cc: idryomov@gmail.com
      Cc: johan@kernel.org
      Cc: johannes.berg@intel.com
      Cc: kernel-team@lge.com
      Cc: linux-block@vger.kernel.org
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: linux-xfs@vger.kernel.org
      Cc: oleg@redhat.com
      Cc: tj@kernel.org
      Link: http://lkml.kernel.org/r/1508921765-15396-10-git-send-email-byungchul.park@lge.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  15. 24 Aug, 2017 2 commits
  16. 18 Aug, 2017 1 commit
  17. 10 Aug, 2017 3 commits
  18. 17 Jul, 2017 1 commit
    • block: order /proc/devices by major number · 133d55cd
      Logan Gunthorpe committed
      Presently, the order of the block devices listed in /proc/devices is not
      entirely sequential. If a block device has a major number greater than
      BLKDEV_MAJOR_HASH_SIZE (255), it will be ordered as if its major were
      taken modulo 255. For example, 511 appears after 1.
      
      This patch cleans that up and prints each major number in the correct
      order, regardless of where they are stored in the hash table.
      
      In order to do this, we introduce BLKDEV_MAJOR_MAX as an artificial
      limit (chosen to be 512). It will then print all devices in major
      number order from 0 to the maximum.
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
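The ordering issue can be reproduced with a small hash table keyed by major modulo 255. The sketch below is a userspace model, not the kernel code: the `sketch_` names, the device names, and the output arrays are assumptions for illustration. It contrasts walking the buckets in table order with walking major numbers from 0 up to an artificial maximum of 512.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_HASH_SIZE 255u  /* mirrors BLKDEV_MAJOR_HASH_SIZE */
#define SKETCH_MAJOR_MAX 512u  /* mirrors the artificial BLKDEV_MAJOR_MAX */

struct sketch_name {
    unsigned major;
    const char *name;
    struct sketch_name *next;
};

static struct sketch_name *sketch_table[SKETCH_HASH_SIZE];

/* Append to the chain of bucket (major % 255). */
static void sketch_register(struct sketch_name *n)
{
    struct sketch_name **p = &sketch_table[n->major % SKETCH_HASH_SIZE];

    assert(n->major < SKETCH_MAJOR_MAX);
    while (*p)
        p = &(*p)->next;
    n->next = NULL;
    *p = n;
}

/* Old listing: walk the buckets, so 511 (bucket 1) shows up next to 1. */
static size_t sketch_list_by_bucket(unsigned *out, size_t cap)
{
    size_t count = 0;

    for (unsigned b = 0; b < SKETCH_HASH_SIZE; b++)
        for (struct sketch_name *n = sketch_table[b]; n; n = n->next)
            if (count < cap)
                out[count++] = n->major;
    return count;
}

/* New listing: walk majors in ascending order and pick matching entries. */
static size_t sketch_list_sorted(unsigned *out, size_t cap)
{
    size_t count = 0;

    for (unsigned major = 0; major < SKETCH_MAJOR_MAX; major++)
        for (struct sketch_name *n = sketch_table[major % SKETCH_HASH_SIZE];
             n; n = n->next)
            if (n->major == major && count < cap)
                out[count++] = major;
    return count;
}
```

Registering majors 1, 8, and 511 in that order yields 1, 511, 8 from the bucket walk and 1, 8, 511 from the ordered walk.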
  19. 21 Jun, 2017 1 commit
  20. 28 Apr, 2017 1 commit
  21. 03 Apr, 2017 1 commit
    • kernel-api.rst: fix a series of errors when parsing C files · 0e056eb5
      mchehab@s-opensource.com committed
      ./lib/string.c:134: WARNING: Inline emphasis start-string without end-string.
      ./mm/filemap.c:522: WARNING: Inline interpreted text or phrase reference start-string without end-string.
      ./mm/filemap.c:1283: ERROR: Unexpected indentation.
      ./mm/filemap.c:3003: WARNING: Inline interpreted text or phrase reference start-string without end-string.
      ./mm/vmalloc.c:1544: WARNING: Inline emphasis start-string without end-string.
      ./mm/page_alloc.c:4245: ERROR: Unexpected indentation.
      ./ipc/util.c:676: ERROR: Unexpected indentation.
      ./drivers/pci/irq.c:35: WARNING: Block quote ends without a blank line; unexpected unindent.
      ./security/security.c:109: ERROR: Unexpected indentation.
      ./security/security.c:110: WARNING: Definition list ends without a blank line; unexpected unindent.
      ./block/genhd.c:275: WARNING: Inline strong start-string without end-string.
      ./block/genhd.c:283: WARNING: Inline strong start-string without end-string.
      ./include/linux/clk.h:134: WARNING: Inline emphasis start-string without end-string.
      ./include/linux/clk.h:134: WARNING: Inline emphasis start-string without end-string.
      ./ipc/util.c:477: ERROR: Unknown target name: "s".
      Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
      Acked-by: Bjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: Jonathan Corbet <corbet@lwn.net>
  22. 23 Mar, 2017 1 commit
    • block: Fix oops scsi_disk_get() · d01b2dcb
      Jan Kara committed
      When device open races with device shutdown, we can get the following
      oops in scsi_disk_get():
      
      [11863.044351] general protection fault: 0000 [#1] SMP
      [11863.045561] Modules linked in: scsi_debug xfs libcrc32c netconsole btrfs raid6_pq zlib_deflate lzo_compress xor [last unloaded: loop]
      [11863.047853] CPU: 3 PID: 13042 Comm: hald-probe-stor Tainted: G W      4.10.0-rc2-xen+ #35
      [11863.048030] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [11863.048030] task: ffff88007f438200 task.stack: ffffc90000fd0000
      [11863.048030] RIP: 0010:scsi_disk_get+0x43/0x70
      [11863.048030] RSP: 0018:ffffc90000fd3a08 EFLAGS: 00010202
      [11863.048030] RAX: 6b6b6b6b6b6b6b6b RBX: ffff88007f56d000 RCX: 0000000000000000
      [11863.048030] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffffffff81a8d880
      [11863.048030] RBP: ffffc90000fd3a18 R08: 0000000000000000 R09: 0000000000000001
      [11863.059217] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffffa
      [11863.059217] R13: ffff880078872800 R14: ffff880070915540 R15: 000000000000001d
      [11863.059217] FS:  00007f2611f71800(0000) GS:ffff88007f0c0000(0000) knlGS:0000000000000000
      [11863.059217] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [11863.059217] CR2: 000000000060e048 CR3: 00000000778d4000 CR4: 00000000000006e0
      [11863.059217] Call Trace:
      [11863.059217]  ? disk_get_part+0x22/0x1f0
      [11863.059217]  sd_open+0x39/0x130
      [11863.059217]  __blkdev_get+0x69/0x430
      [11863.059217]  ? bd_acquire+0x7f/0xc0
      [11863.059217]  ? bd_acquire+0x96/0xc0
      [11863.059217]  ? blkdev_get+0x350/0x350
      [11863.059217]  blkdev_get+0x126/0x350
      [11863.059217]  ? _raw_spin_unlock+0x2b/0x40
      [11863.059217]  ? bd_acquire+0x7f/0xc0
      [11863.059217]  ? blkdev_get+0x350/0x350
      [11863.059217]  blkdev_open+0x65/0x80
      ...
      
      As you can see, the RAX value is already poisoned, showing that the
      gendisk we got is already freed. The problem is that get_gendisk() looks
      up the device number in ext_devt_idr and then calls get_disk(), which
      does kobject_get() on the disk's kobject. However, the disk gets removed
      from ext_devt_idr only in disk_release() (through blk_free_devt()), at
      which moment it already has a refcount of 0 and is on its way to being
      freed. Indeed, we got a warning from kobject_get() about a 0 refcount
      shortly before the oops.

      We fix the problem by using kobject_get_unless_zero() in get_disk() so
      that get_disk() cannot get a reference on a disk that is already being
      freed.
      Tested-by: Lekshmi Pillai <lekshmicpillai@in.ibm.com>
      Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
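The "get a reference unless the count is already zero" idea behind this fix can be sketched with C11 atomics. This is a generic userspace illustration, not the kernel's kobject code; `struct sketch_ref` and both function names are assumptions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference-counted object header. */
struct sketch_ref { atomic_int count; };

/* Take a reference only if the object is still alive (count > 0).
 * A plain increment could resurrect an object that is being freed. */
static bool sketch_get_unless_zero(struct sketch_ref *r)
{
    int old = atomic_load(&r->count);

    while (old != 0) {
        /* On failure, 'old' is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference; returns true when the caller dropped the last one
 * and is therefore responsible for releasing the object. */
static bool sketch_put(struct sketch_ref *r)
{
    int old = atomic_fetch_sub(&r->count, 1);

    assert(old > 0);
    return old == 1;
}
```

A lookup path that finds an object still present in an index, as get_gendisk() does with ext_devt_idr, would treat a `false` return from `sketch_get_unless_zero()` the same as "not found".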
  23. 09 Mar, 2017 2 commits
  24. 03 Mar, 2017 1 commit
  25. 22 Feb, 2017 2 commits
  26. 02 Feb, 2017 2 commits