1. 23 Mar, 2017 13 commits
    • C
      blk-mq: improve blk_mq_try_issue_directly · 5eb6126e
      Christoph Hellwig committed
      Rename blk_mq_try_issue_directly to __blk_mq_try_issue_directly and add a
      new wrapper that takes care of RCU / SRCU locking to avoid having
      boilerplate code in the caller which would get duplicated with new callers.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      5eb6126e
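The wrapper pattern described in this commit can be sketched in plain C. This is a userspace illustration only, not the kernel code: the counters `rcu_depth` and `srcu_depth` are hypothetical stand-ins for the RCU and SRCU read-side sections, the `blocking` flag is an assumed way of choosing between them, and `try_issue_directly_unlocked` is an invented name for the inner helper.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the RCU and SRCU read-side sections. */
static int rcu_depth;
static int srcu_depth;

struct hw_queue {
    bool blocking;          /* assumed: selects the SRCU flavour */
};

struct request {
    bool issued_in_read_section;
};

/* Inner helper: expects the caller to already be in a read-side section. */
static void try_issue_directly_unlocked(struct hw_queue *hq, struct request *rq)
{
    (void)hq;
    rq->issued_in_read_section = (rcu_depth + srcu_depth) > 0;
}

/* Wrapper: picks the locking flavour once so callers do not repeat it. */
static void try_issue_directly(struct hw_queue *hq, struct request *rq)
{
    if (!hq->blocking) {
        rcu_depth++;
        try_issue_directly_unlocked(hq, rq);
        rcu_depth--;
    } else {
        srcu_depth++;
        try_issue_directly_unlocked(hq, rq);
        srcu_depth--;
    }
}
```

Callers only ever see `try_issue_directly()`, so adding a new caller does not copy the enter/leave pairing.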
    • C
      blk-mq: merge mq and sq make_request instances · 254d259d
      Christoph Hellwig committed
      They are mostly the same code anyway - there is just one small conditional
      for the plug case that is different for both variants.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      254d259d
    • C
      blk-mq: remove BLK_MQ_F_DEFER_ISSUE · 7642747d
      Christoph Hellwig committed
      This flag was never used since it was introduced.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      7642747d
    • J
      block: Fix oops scsi_disk_get() · d01b2dcb
      Jan Kara committed
      When device open races with device shutdown, we can get the following
      oops in scsi_disk_get():
      
      [11863.044351] general protection fault: 0000 [#1] SMP
      [11863.045561] Modules linked in: scsi_debug xfs libcrc32c netconsole btrfs raid6_pq zlib_deflate lzo_compress xor [last unloaded: loop]
      [11863.047853] CPU: 3 PID: 13042 Comm: hald-probe-stor Tainted: G W      4.10.0-rc2-xen+ #35
      [11863.048030] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [11863.048030] task: ffff88007f438200 task.stack: ffffc90000fd0000
      [11863.048030] RIP: 0010:scsi_disk_get+0x43/0x70
      [11863.048030] RSP: 0018:ffffc90000fd3a08 EFLAGS: 00010202
      [11863.048030] RAX: 6b6b6b6b6b6b6b6b RBX: ffff88007f56d000 RCX: 0000000000000000
      [11863.048030] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffffffff81a8d880
      [11863.048030] RBP: ffffc90000fd3a18 R08: 0000000000000000 R09: 0000000000000001
      [11863.059217] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffffa
      [11863.059217] R13: ffff880078872800 R14: ffff880070915540 R15: 000000000000001d
      [11863.059217] FS:  00007f2611f71800(0000) GS:ffff88007f0c0000(0000) knlGS:0000000000000000
      [11863.059217] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [11863.059217] CR2: 000000000060e048 CR3: 00000000778d4000 CR4: 00000000000006e0
      [11863.059217] Call Trace:
      [11863.059217]  ? disk_get_part+0x22/0x1f0
      [11863.059217]  sd_open+0x39/0x130
      [11863.059217]  __blkdev_get+0x69/0x430
      [11863.059217]  ? bd_acquire+0x7f/0xc0
      [11863.059217]  ? bd_acquire+0x96/0xc0
      [11863.059217]  ? blkdev_get+0x350/0x350
      [11863.059217]  blkdev_get+0x126/0x350
      [11863.059217]  ? _raw_spin_unlock+0x2b/0x40
      [11863.059217]  ? bd_acquire+0x7f/0xc0
      [11863.059217]  ? blkdev_get+0x350/0x350
      [11863.059217]  blkdev_open+0x65/0x80
      ...
      
      As you can see, the RAX value is already poisoned, showing that the gendisk
      we got is already freed. The problem is that get_gendisk() looks up the
      device number in ext_devt_idr and then calls get_disk(), which does
      kobject_get() on the disk's kobject. However, the disk gets removed from
      ext_devt_idr only in disk_release() (through blk_free_devt()), at which
      moment it already has a 0 refcount and is on its way to being freed.
      Indeed, we got a warning from kobject_get() about a 0 refcount shortly
      before the oops.
      
      We fix the problem by using kobject_get_unless_zero() in get_disk() so
      that get_disk() cannot get a reference on a disk that is already being
      freed.
      Tested-by: Lekshmi Pillai <lekshmicpillai@in.ibm.com>
      Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      d01b2dcb
    • J
      kobject: Export kobject_get_unless_zero() · c70c176f
      Jan Kara committed
      Make the function available for outside use and fortify it against NULL
      kobject.
      
      CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      c70c176f
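The "get a reference unless the count is already zero" idea behind the two commits above can be illustrated with C11 atomics. This is a minimal userspace sketch, not the kernel's kobject or kref code; `struct object` and `object_get_unless_zero()` are hypothetical names chosen for the example.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct object {
    atomic_int refcount;    /* 0 means the object is being torn down */
};

/*
 * Take a reference only if the object is still alive.
 * Returns false for NULL or for an object whose count already hit zero.
 */
static bool object_get_unless_zero(struct object *obj)
{
    if (obj == NULL)
        return false;

    int old = atomic_load(&obj->refcount);
    while (old != 0) {
        /* On failure, 'old' is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&obj->refcount, &old, old + 1))
            return true;
    }
    return false;
}
```

A plain increment would turn a count of 0 back into 1 and hand out a pointer to an object that is already on its way to being freed. The compare-and-swap loop refuses that transition, so a lookup racing with release fails cleanly.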
    • J
      block: Fix oops in locked_inode_to_wb_and_lock_list() · f759741d
      Jan Kara committed
      When block device is closed, we call inode_detach_wb() in __blkdev_put()
      which sets inode->i_wb to NULL. That is contrary to expectations that
      inode->i_wb stays valid once set during the whole inode's lifetime and
      leads to oops in wb_get() in locked_inode_to_wb_and_lock_list() because
      inode_to_wb() returned NULL.
      
      The reason why we called inode_detach_wb() is not valid anymore though.
      BDI is guaranteed to stay around until we call bdi_put() from
      bdev_evict_inode() so we can postpone calling inode_detach_wb() to that
      moment.
      
      Also add a warning to catch if someone uses inode_detach_wb() in a
      dangerous way.
      Reported-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      f759741d
    • J
      bdi: Rename cgwb_bdi_destroy() to cgwb_bdi_unregister() · b1c51afc
      Jan Kara committed
      Rename cgwb_bdi_destroy() to cgwb_bdi_unregister() as it gets called
      from bdi_unregister() which is not necessarily called from bdi_destroy()
      and thus the name is somewhat misleading.
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      b1c51afc
    • J
      bdi: Do not wait for cgwbs release in bdi_unregister() · 4514451e
      Jan Kara committed
      Currently we wait for all cgwbs to get released in cgwb_bdi_destroy()
      (called from bdi_unregister()). That is however unnecessary now when
      cgwb->bdi is a proper refcounted reference (thus bdi cannot get
      released before all cgwbs are released) and when cgwb_bdi_destroy()
      shuts down writeback directly.
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      4514451e
    • J
      bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy() · 5318ce7d
      Jan Kara committed
      Currently we wait for all cgwbs to get freed in cgwb_bdi_destroy(),
      which also means that writeback has been shut down on them. Since this
      wait is going away, directly shut down writeback on cgwbs from
      cgwb_bdi_destroy() to avoid live writeback structures after
      bdi_unregister() has finished. To make that safe with concurrent
      shutdown from cgwb_release_workfn(), we also have to make sure
      wb_shutdown() returns only after the bdi_writeback structure is really
      shut down.
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      5318ce7d
    • J
      bdi: Unify bdi->wb_list handling for root wb_writeback · e8cb72b3
      Jan Kara committed
      Currently root wb_writeback structure is added to bdi->wb_list in
      bdi_init() and never removed. That is different from all other
      wb_writeback structures which get added to the list when created and
      removed from it before wb_shutdown().
      
      So move list addition of root bdi_writeback to bdi_register() and list
      removal of all wb_writeback structures to wb_shutdown(). That way a
      wb_writeback structure is on bdi->wb_list if and only if it can handle
      writeback and it will make it easier for us to handle shutdown of all
      wb_writeback structures in bdi_unregister().
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      e8cb72b3
    • J
      bdi: Make wb->bdi a proper reference · 810df54a
      Jan Kara committed
      Make wb->bdi a proper refcounted reference to bdi for all bdi_writeback
      structures except for the one embedded inside struct backing_dev_info.
      That will allow us to simplify bdi unregistration.
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      810df54a
    • J
      bdi: Mark congested->bdi as internal · b7d680d7
      Jan Kara committed
      congested->bdi pointer is used only to be able to remove congested
      structure from bdi->cgwb_congested_tree on structure release. Moreover
      the pointer can become NULL when we unregister the bdi. Rename the field
      to __bdi and add a comment to make it more explicit this is internal
      stuff of memcg writeback code and people should not use the field as
      such use will be likely race prone.
      
      We do not bother with converting congested->bdi to a proper refcounted
      reference. It will be slightly ugly to special-case bdi->wb.congested to
      avoid effectively a cyclic reference of bdi to itself and the reference
      gets cleared from bdi_unregister() making it impossible to reference
      a freed bdi.
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      b7d680d7
    • J
      block: Fix bdi assignment to bdev inode when racing with disk delete · 03e26279
      Jan Kara committed
      When disk->fops->open() in __blkdev_get() returns -ERESTARTSYS, we
      restart the process of opening the block device. However we forget to
      switch bdev->bd_bdi back to noop_backing_dev_info and as a result bdev
      inode will be pointing to a stale bdi. Fix the problem by setting
      bdev->bd_bdi later when __blkdev_get() is already guaranteed to succeed.
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      03e26279
  2. 22 Mar, 2017 6 commits
    • J
      block: fix stacked driver stats init and free · a83b576c
      Jens Axboe committed
      If a driver allocates a queue for stacked usage, then it does
      not currently get stats allocated. This causes the later init
      of, e.g., writeback throttling to blow up. Move the init to the
      queue allocation instead.
      
      Additionally, allow a NULL callback unregistration. This avoids
      having the caller check for that, fixing another oops on
      removal of a block device that doesn't have poll stats allocated.
      
      Fixes: 34dbad5d ("blk-stat: convert to callback-based statistics reporting")
      Signed-off-by: Jens Axboe <axboe@fb.com>
      a83b576c
    • O
      blk-stat: convert to callback-based statistics reporting · 34dbad5d
      Omar Sandoval committed
      Currently, statistics are gathered in ~0.13s windows, and users grab the
      statistics whenever they need them. This is not ideal for both in-tree
      users:
      
      1. Writeback throttling wants its own dynamically sized window of
         statistics. Since the blk-stats statistics are reset after every
         window and the wbt windows don't line up with the blk-stats windows,
         wbt doesn't see every I/O.
      2. Polling currently grabs the statistics on every I/O. Again, depending
         on how the window lines up, we may miss some I/Os. It's also
         unnecessary overhead to get the statistics on every I/O; the hybrid
         polling heuristic would be just as happy with the statistics from the
         previous full window.
      
      This reworks the blk-stats infrastructure to be callback-based: users
      register a callback that they want called at a given time with all of
      the statistics from the window during which the callback was active.
      Users can dynamically bucketize the statistics. wbt and polling both
      currently use read vs. write, but polling can be extended to further
      subdivide based on request size.
      
      The callbacks are kept on an RCU list, and each callback has percpu
      stats buffers. There will only be a few users, so the overhead on the
      I/O completion side is low. The stats flushing is also simplified
      considerably: since the timer function is responsible for clearing the
      statistics, we don't have to worry about stale statistics.
      
      wbt is a trivial conversion. After the conversion, the windowing problem
      mentioned above is fixed.
      
      For polling, we register an extra callback that caches the previous
      window's statistics in the struct request_queue for the hybrid polling
      heuristic to use.
      
      Since we no longer have a single stats buffer for the request queue,
      this also removes the sysfs and debugfs stats entries. To replace those,
      we add a debugfs entry for the poll statistics.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      34dbad5d
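The callback-based design in this commit message can be sketched as follows. This is a simplified, single-threaded illustration, not the blk-stat API: all names (`stat_callback`, `stat_record`, `stat_window_expired`, `rw_bucket`, `cache_means`) are hypothetical, and the per-CPU buffers and RCU list mentioned in the commit are left out.

```c
#include <stdint.h>
#include <string.h>

#define MAX_BUCKETS 4

struct bucket_stat {
    uint64_t sum;   /* total latency recorded in this window */
    uint64_t nr;    /* number of samples in this window */
};

struct stat_callback {
    unsigned int buckets;                          /* buckets in use */
    unsigned int (*bucket_fn)(int is_write);       /* user-chosen bucketing */
    void (*timer_fn)(struct stat_callback *cb);    /* called at window end */
    struct bucket_stat stat[MAX_BUCKETS];
    void *data;                                    /* user context */
};

/* Completion side: account one I/O to the bucket the user selected. */
static void stat_record(struct stat_callback *cb, int is_write, uint64_t latency)
{
    unsigned int b = cb->bucket_fn(is_write);

    if (b >= cb->buckets)
        return;
    cb->stat[b].sum += latency;
    cb->stat[b].nr++;
}

/* Timer side: hand the full window to the user, then clear it. */
static void stat_window_expired(struct stat_callback *cb)
{
    cb->timer_fn(cb);
    memset(cb->stat, 0, sizeof(cb->stat));
}

/* Example user: read vs. write buckets, caching the mean of each. */
static unsigned int rw_bucket(int is_write)
{
    return is_write ? 1 : 0;
}

static void cache_means(struct stat_callback *cb)
{
    uint64_t *out = cb->data;

    for (unsigned int i = 0; i < cb->buckets; i++)
        out[i] = cb->stat[i].nr ? cb->stat[i].sum / cb->stat[i].nr : 0;
}
```

Because the timer side owns both the hand-off and the reset, a user always sees one complete window and never a partially cleared one. That is the property the commit message relies on to fix the windowing problem.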
    • O
      blk-stat: move BLK_RQ_STAT_BATCH definition to blk-stat.c · 4875253f
      Omar Sandoval committed
      This is an implementation detail that no-one outside of blk-stat.c uses.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      4875253f
    • O
      blk-stat: use READ and WRITE instead of BLK_STAT_{READ,WRITE} · fa2e39cb
      Omar Sandoval committed
      The stats buckets will become generic soon, so make the existing users
      use the common READ and WRITE definitions instead of one internal to
      blk-stat.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      fa2e39cb
    • O
      block: remove extra calls to wbt_exit() · 0315b159
      Omar Sandoval committed
      We always call wbt_exit() from blk_release_queue(), so these are
      unnecessary.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      0315b159
    • O
      blk-stat: fix blk_stat_sum() if all samples are batched · 7d8d0014
      Omar Sandoval committed
      We need to flush the batch _before_ we check the number of samples,
      otherwise we'll miss all of the batched samples.
      
      Fixes: cf43e6be ("block: add scalable completion tracking of requests")
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      7d8d0014
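The ordering problem fixed here is easy to show with a reduced model. The sketch below is an assumption-laden simplification, not the blk-stat source: `struct rq_stat` and its helpers are hypothetical, and only a running mean is tracked.

```c
#include <stdint.h>

struct rq_stat {
    uint64_t mean;        /* mean of the flushed samples */
    uint64_t nr_samples;  /* number of flushed samples */
    uint64_t batch;       /* sum of samples not yet flushed */
    uint64_t nr_batch;    /* number of samples not yet flushed */
};

/* Cheap fast path: samples only accumulate in the batch. */
static void stat_add(struct rq_stat *s, uint64_t value)
{
    s->batch += value;
    s->nr_batch++;
}

/* Fold the pending batch into mean / nr_samples. */
static void stat_flush_batch(struct rq_stat *s)
{
    if (!s->nr_batch)
        return;
    uint64_t total = s->mean * s->nr_samples + s->batch;
    s->nr_samples += s->nr_batch;
    s->mean = total / s->nr_samples;
    s->batch = 0;
    s->nr_batch = 0;
}

/* Merge src into dst. The flush must come before the emptiness check. */
static void stat_sum(struct rq_stat *dst, struct rq_stat *src)
{
    stat_flush_batch(src);
    if (!src->nr_samples)
        return;
    uint64_t total = dst->mean * dst->nr_samples + src->mean * src->nr_samples;
    dst->nr_samples += src->nr_samples;
    dst->mean = total / dst->nr_samples;
}
```

If the `nr_samples` check ran first, a source whose samples were all still sitting in the batch would look empty and be skipped entirely.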
  3. 20 Mar, 2017 6 commits
    • L
      Linux 4.11-rc3 · 97da3854
      Linus Torvalds committed
      97da3854
    • L
      mm/swap: don't BUG_ON() due to uninitialized swap slot cache · 452b94b8
      Linus Torvalds committed
      This BUG_ON() triggered for me once at shutdown, and I don't see a
      reason for the check.  The code correctly checks whether the swap slot
      cache is usable or not, so an uninitialized swap slot cache is not
      actually problematic afaik.
      
      I've temporarily just switched the BUG_ON() to a WARN_ON_ONCE(), since
      I'm not sure why that seemingly pointless check was there.  I suspect
      the real fix is to just remove it entirely, but for now we'll warn about
      it but not bring the machine down.
      
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      452b94b8
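The difference between halting on a violated expectation and warning once while carrying on can be sketched in userspace C. This is not the kernel's WARN_ON_ONCE() macro. `warn_on_once()`, `struct slot_cache` and `slot_cache_usable()` are hypothetical names, and the counter exists only to make the behaviour observable.

```c
#include <stdbool.h>
#include <stdio.h>

static int warnings_emitted;   /* kept only so the sketch is observable */

/* Report a violated expectation the first time only, and keep running. */
static bool warn_on_once(bool cond, bool *already_warned, const char *what)
{
    if (cond && !*already_warned) {
        *already_warned = true;
        warnings_emitted++;
        fprintf(stderr, "WARNING: %s\n", what);
    }
    return cond;
}

struct slot_cache {
    bool initialized;
};

/* Returns true when the cache can be used; callers fall back otherwise. */
static bool slot_cache_usable(const struct slot_cache *cache)
{
    static bool warned;

    if (warn_on_once(!cache->initialized, &warned, "slot cache not initialized"))
        return false;
    return true;
}
```

The caller still gets a correct answer for an uninitialized cache. The warning only records that the unexpected state was seen, which matches the intent stated in the commit message of warning without bringing the machine down.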
    • L
      Merge tag 'powerpc-4.11-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · a07a6e41
      Linus Torvalds committed
      Pull more powerpc fixes from Michael Ellerman:
       "A couple of minor powerpc fixes for 4.11:
      
         - wire up statx() syscall
      
         - don't print a warning on memory hotplug when HPT resizing isn't
           available
      
        Thanks to: David Gibson, Chandan Rajendra"
      
      * tag 'powerpc-4.11-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/pseries: Don't give a warning when HPT resizing isn't available
        powerpc: Wire up statx() syscall
      a07a6e41
    • L
      Merge branch 'parisc-4.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux · 4571bc5a
      Linus Torvalds committed
      Pull parisc fixes from Helge Deller:
      
       - Mikulas Patocka added support for R_PARISC_SECREL32 relocations in
         modules with CONFIG_MODVERSIONS.
      
       - Dave Anglin optimized the cache flushing for vmap ranges.
      
       - Arvind Yadav provided a fix for a potential NULL pointer dereference
         in the parisc perf code (and some code cleanups).
      
       - I wired up the new statx system call, fixed some compiler warnings
         with the access_ok() macro and fixed shutdown code to really halt a
         system at shutdown instead of crashing & rebooting.
      
      * 'parisc-4.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
        parisc: Fix system shutdown halt
        parisc: perf: Fix potential NULL pointer dereference
        parisc: Avoid compiler warnings with access_ok()
        parisc: Wire up statx system call
        parisc: Optimize flush_kernel_vmap_range and invalidate_kernel_vmap_range
        parisc: support R_PARISC_SECREL32 relocation in modules
      4571bc5a
    • L
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending · 8aa34172
      Linus Torvalds committed
      Pull SCSI target fixes from Nicholas Bellinger:
       "The bulk of the changes are in qla2xxx target driver code to address
        various issues found during Cavium/QLogic's internal testing (stable
        CC's included), along with a few other stability and smaller
        miscellaneous improvements.
      
        There are also a couple of different patch sets from Mike Christie,
        which have been a result of his work to use target-core ALUA logic
        together with tcm-user backend driver.
      
        Finally, a patch to address some long standing issues with
        pass-through SCSI export of TYPE_TAPE + TYPE_MEDIUM_CHANGER devices,
        which will make folks using physical (or virtual) magnetic tape happy"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (28 commits)
        qla2xxx: Update driver version to 9.00.00.00-k
        qla2xxx: Fix delayed response to command for loop mode/direct connect.
        qla2xxx: Change scsi host lookup method.
        qla2xxx: Add DebugFS node to display Port Database
        qla2xxx: Use IOCB interface to submit non-critical MBX.
        qla2xxx: Add async new target notification
        qla2xxx: Export DIF stats via debugfs
        qla2xxx: Improve T10-DIF/PI handling in driver.
        qla2xxx: Allow relogin to proceed if remote login did not finish
        qla2xxx: Fix sess_lock & hardware_lock lock order problem.
        qla2xxx: Fix inadequate lock protection for ABTS.
        qla2xxx: Fix request queue corruption.
        qla2xxx: Fix memory leak for abts processing
        qla2xxx: Allow vref count to timeout on vport delete.
        tcmu: Convert cmd_time_out into backend device attribute
        tcmu: make cmd timeout configurable
        tcmu: add helper to check if dev was configured
        target: fix race during implicit transition work flushes
        target: allow userspace to set state to transitioning
        target: fix ALUA transition timeout handling
        ...
      8aa34172
    • L
      Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm · 1b8df619
      Linus Torvalds committed
      Pull device-dax fixes from Dan Williams:
       "The device-dax driver was not being careful to handle falling back to
        smaller fault-granularity sizes.
      
        The driver already fails fault attempts that are smaller than the
        device's alignment, but it also needs to handle the cases where a
        larger page mapping could be established. For simplicity of the
        immediate fix the implementation just signals VM_FAULT_FALLBACK until
        fault-size == device-alignment.
      
        One fix is for -stable to address pmd-to-pte fallback from the
        original implementation, another fix is for the new (introduced in
        4.11-rc1) pud-to-pmd regression, and a typo fix comes along for the
        ride.
      
        These have received a build success notification from the kbuild
        robot"
      
      * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
        device-dax: fix debug output typo
        device-dax: fix pud fault fallback handling
        device-dax: fix pmd/pte fault fallback handling
      1b8df619
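The fallback rule described in this pull message can be written as a small decision function. This is a sketch of the stated policy only, not the device-dax driver code; the enum values and `dax_fault_sketch()` are hypothetical names.

```c
enum fault_result {
    FAULT_MAPPED,     /* mapping established at the requested size */
    FAULT_FALLBACK,   /* ask the caller to retry with a smaller size */
    FAULT_FAILED      /* request is smaller than the device alignment */
};

/* Decide how to answer a fault of fault_size bytes on a device
 * whose mappings must be dev_align bytes. */
static enum fault_result dax_fault_sketch(unsigned long fault_size,
                                          unsigned long dev_align)
{
    if (fault_size < dev_align)
        return FAULT_FAILED;
    if (fault_size > dev_align)
        return FAULT_FALLBACK;
    return FAULT_MAPPED;
}
```

With a 2 MiB device alignment, a 1 GiB fault falls back, the 2 MiB retry is mapped, and a 4 KiB fault fails.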
  4. 19 Mar, 2017 15 commits