1. 26 Oct 2017 (1 commit)
    • block, locking/lockdep: Assign a lock_class per gendisk used for wait_for_completion() · e319e1fb
      Committed by Byungchul Park
      Darrick posted the following warning and Dave Chinner analyzed it:
      
      > ======================================================
      > WARNING: possible circular locking dependency detected
      > 4.14.0-rc1-fixes #1 Tainted: G        W
      > ------------------------------------------------------
      > loop0/31693 is trying to acquire lock:
      >  (&(&ip->i_mmaplock)->mr_lock){++++}, at: [<ffffffffa00f1b0c>] xfs_ilock+0x23c/0x330 [xfs]
      >
      > but now in release context of a crosslock acquired at the following:
      >  ((complete)&ret.event){+.+.}, at: [<ffffffff81326c1f>] submit_bio_wait+0x7f/0xb0
      >
      > which lock already depends on the new lock.
      >
      > the existing dependency chain (in reverse order) is:
      >
      > -> #2 ((complete)&ret.event){+.+.}:
      >        lock_acquire+0xab/0x200
      >        wait_for_completion_io+0x4e/0x1a0
      >        submit_bio_wait+0x7f/0xb0
      >        blkdev_issue_zeroout+0x71/0xa0
      >        xfs_bmapi_convert_unwritten+0x11f/0x1d0 [xfs]
      >        xfs_bmapi_write+0x374/0x11f0 [xfs]
      >        xfs_iomap_write_direct+0x2ac/0x430 [xfs]
      >        xfs_file_iomap_begin+0x20d/0xd50 [xfs]
      >        iomap_apply+0x43/0xe0
      >        dax_iomap_rw+0x89/0xf0
      >        xfs_file_dax_write+0xcc/0x220 [xfs]
      >        xfs_file_write_iter+0xf0/0x130 [xfs]
      >        __vfs_write+0xd9/0x150
      >        vfs_write+0xc8/0x1c0
      >        SyS_write+0x45/0xa0
      >        entry_SYSCALL_64_fastpath+0x1f/0xbe
      >
      > -> #1 (&xfs_nondir_ilock_class){++++}:
      >        lock_acquire+0xab/0x200
      >        down_write_nested+0x4a/0xb0
      >        xfs_ilock+0x263/0x330 [xfs]
      >        xfs_setattr_size+0x152/0x370 [xfs]
      >        xfs_vn_setattr+0x6b/0x90 [xfs]
      >        notify_change+0x27d/0x3f0
      >        do_truncate+0x5b/0x90
      >        path_openat+0x237/0xa90
      >        do_filp_open+0x8a/0xf0
      >        do_sys_open+0x11c/0x1f0
      >        entry_SYSCALL_64_fastpath+0x1f/0xbe
      >
      > -> #0 (&(&ip->i_mmaplock)->mr_lock){++++}:
      >        up_write+0x1c/0x40
      >        xfs_iunlock+0x1d0/0x310 [xfs]
      >        xfs_file_fallocate+0x8a/0x310 [xfs]
      >        loop_queue_work+0xb7/0x8d0
      >        kthread_worker_fn+0xb9/0x1f0
      >
      > Chain exists of:
      >   &(&ip->i_mmaplock)->mr_lock --> &xfs_nondir_ilock_class --> (complete)&ret.event
      >
      >  Possible unsafe locking scenario by crosslock:
      >
      >        CPU0                    CPU1
      >        ----                    ----
      >   lock(&xfs_nondir_ilock_class);
      >   lock((complete)&ret.event);
      >                                lock(&(&ip->i_mmaplock)->mr_lock);
      >                                unlock((complete)&ret.event);
      >
      >                *** DEADLOCK ***
      
      The warning is a false positive, caused by the fact that all
      wait_for_completion()s in submit_bio_wait() are waiting with the same
      lock class.
      
      However, some bios have nothing to do with one another: in the case of
      loop devices, for example, there is no direct connection between the
      bios of an upper device and those of the lower (loop) device.
      
      The safest way to assign different lock classes to different devices is
      to do it for each gendisk. In other words, this patch assigns a
      lockdep_map per gendisk and uses it when initializing completion in
      submit_bio_wait().
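      
      A sketch of the resulting shape (the gendisk field and the
      DECLARE_COMPLETION_ONSTACK_MAP() helper come from this patch series;
      the class naming and lock_class_key handling are condensed here for
      illustration):
      
      ----
      struct gendisk {
              /* ... existing fields ... */
              struct lockdep_map lockdep_map;     /* one completion class per disk */
      };
      
      /* At disk allocation time, register a distinct lock class for this
       * disk (in the patch this happens in the alloc_disk_node() macro, so
       * each allocation site gets its own static lock_class_key __key). */
      lockdep_init_map(&disk->lockdep_map, "(completion)disk", &__key, 0);
      
      /* submit_bio_wait() then ties its on-stack completion to the map of
       * the disk the bio targets, so waits on unrelated disks (e.g. a loop
       * device and its backing device) no longer share one lock class. */
      int submit_bio_wait(struct bio *bio)
      {
              DECLARE_COMPLETION_ONSTACK_MAP(done, bio->bi_disk->lockdep_map);
              /* ... submit the bio, then wait_for_completion_io(&done) ... */
      }
      ----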
      Analyzed-by: Dave Chinner <david@fromorbit.com>
      Reported-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Byungchul Park <byungchul.park@lge.com>
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: amir73il@gmail.com
      Cc: axboe@kernel.dk
      Cc: david@fromorbit.com
      Cc: hch@infradead.org
      Cc: idryomov@gmail.com
      Cc: johan@kernel.org
      Cc: johannes.berg@intel.com
      Cc: kernel-team@lge.com
      Cc: linux-block@vger.kernel.org
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: linux-xfs@vger.kernel.org
      Cc: oleg@redhat.com
      Cc: tj@kernel.org
      Link: http://lkml.kernel.org/r/1508921765-15396-10-git-send-email-byungchul.park@lge.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e319e1fb
  2. 25 Oct 2017 (2 commits)
    • block: Use DECLARE_COMPLETION_ONSTACK() in submit_bio_wait() · 65e53aab
      Committed by Christoph Hellwig
      Simplify the code by getting rid of the submit_bio_ret structure.
      
      (This also helps address a lockdep false positive.)
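      
      Roughly what submit_bio_wait() looks like after the change (a
      reconstructed sketch; the status plumbing assumes the v4.14-era bio
      API):
      
      ----
      static void submit_bio_wait_endio(struct bio *bio)
      {
              complete(bio->bi_private);      /* wake up the waiter below */
      }
      
      int submit_bio_wait(struct bio *bio)
      {
              DECLARE_COMPLETION_ONSTACK(done);  /* replaces struct submit_bio_ret */
      
              bio->bi_private = &done;
              bio->bi_end_io = submit_bio_wait_endio;
              bio->bi_opf |= REQ_SYNC;
              submit_bio(bio);
              wait_for_completion_io(&done);
      
              return blk_status_to_errno(bio->bi_status);
      }
      ----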
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: amir73il@gmail.com
      Cc: axboe@kernel.dk
      Cc: darrick.wong@oracle.com
      Cc: david@fromorbit.com
      Cc: hch@infradead.org
      Cc: idryomov@gmail.com
      Cc: johan@kernel.org
      Cc: johannes.berg@intel.com
      Cc: kernel-team@lge.com
      Cc: linux-block@vger.kernel.org
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: linux-xfs@vger.kernel.org
      Cc: oleg@redhat.com
      Cc: tj@kernel.org
      Link: http://lkml.kernel.org/r/1508921765-15396-2-git-send-email-byungchul.park@lge.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      65e53aab
    • locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE() · 6aa7de05
      Committed by Mark Rutland
      
      Please do not apply this to mainline directly; instead, please re-run
      the coccinelle script shown below and apply its output.
      
      For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
      preference to ACCESS_ONCE(), and new code is expected to use one of the
      former. So far, there's been no reason to change most existing uses of
      ACCESS_ONCE(), as these aren't harmful, and changing them results in
      churn.
      
      However, for some features, the read/write distinction is critical to
      correct operation. To distinguish these cases, separate read/write
      accessors must be used. This patch migrates (most) remaining
      ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
      coccinelle script:
      
      ----
      // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
      // WRITE_ONCE()
      
      // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
      
      virtual patch
      
      @ depends on patch @
      expression E1, E2;
      @@
      
      - ACCESS_ONCE(E1) = E2
      + WRITE_ONCE(E1, E2)
      
      @ depends on patch @
      expression E;
      @@
      
      - ACCESS_ONCE(E)
      + READ_ONCE(E)
      ----
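      
      For a hypothetical field rq->state, the script rewrites a store and a
      load like this (illustrative example, not part of the patch):
      
      ----
      /* Before: ACCESS_ONCE() is used for both the store and the load. */
      ACCESS_ONCE(rq->state) = RQ_ACTIVE;
      if (ACCESS_ONCE(rq->state) == RQ_DONE)
              finish(rq);
      
      /* After: the script picks the accessor matching the access direction. */
      WRITE_ONCE(rq->state, RQ_ACTIVE);
      if (READ_ONCE(rq->state) == RQ_DONE)
              finish(rq);
      ----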
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: davem@davemloft.net
      Cc: linux-arch@vger.kernel.org
      Cc: mpe@ellerman.id.au
      Cc: shuah@kernel.org
      Cc: snitzer@redhat.com
      Cc: thor.thayer@linux.intel.com
      Cc: tj@kernel.org
      Cc: viro@zeniv.linux.org.uk
      Cc: will.deacon@arm.com
      Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6aa7de05
  3. 11 Oct 2017 (3 commits)
  4. 04 Oct 2017 (3 commits)
    • bsg-lib: fix use-after-free under memory-pressure · eab40cf3
      Committed by Benjamin Block
      When under memory pressure, the mempool that backs the
      'struct request_queue' can make use of up to BLKDEV_MIN_RQ preallocated
      emergency buffers in case it can't get a regular allocation. Once these
      buffers are in use, the pool is re-supplied with old finished requests
      from the same request_queue (see mempool_free()).
      
      The bug is that, when re-supplying the emergency pool, the old requests
      are not run through the callback mempool_t->alloc() again, and thus also
      not through the callback bsg_init_rq(). Initialization is therefore
      skipped, and while the sense-buffer should still be good,
      scsi_request->cmd might have become an invalid pointer in the meantime.
      When the request is initialized in bsg.c and the user's CDB is larger
      than BLK_MAX_CDB, bsg replaces it with a custom allocated buffer, which
      is freed when the user's command finishes, so it dangles afterwards.
      When a command is next sent by the user with a CDB no larger than
      BLK_MAX_CDB, bsg will assume that scsi_request->cmd is backed by
      scsi_request->__cmd, will not make a custom allocation, and will write
      into undefined memory.
      
      Fix this by splitting bsg_init_rq() into two functions (see the sketch
      after this list):
       - bsg_init_rq() is changed to only do the allocation of the
         sense-buffer, which is used to back the bsg job's reply buffer. This
         pointer should never change during the lifetime of a scsi_request, so
         it doesn't need re-initialization.
       - bsg_initialize_rq() is a new function that makes use of
         'struct request_queue's initialize_rq_fn callback (which was
         introduced in v4.12). This is always called before the request is
         given out via blk_get_request(). This function does the remaining
         initialization that was previously done in bsg_init_rq(), and will
         also do it when the request is taken from the emergency-pool of the
         backing mempool.
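      
      A condensed sketch of the split (based on the patch description; buffer
      sizes and the exact struct bsg_job layout are abridged):
      
      ----
      /* mempool_t->alloc(): runs only when a request is first created, so
       * only do the one-time sense/reply buffer allocation here. */
      static int bsg_init_rq(struct request_queue *q, struct request *req, gfp_t gfp)
      {
              struct bsg_job *job = blk_mq_rq_to_pdu(req);
      
              job->reply = kzalloc(SCSI_SENSE_BUFFERSIZE, gfp);
              return job->reply ? 0 : -ENOMEM;
      }
      
      /* initialize_rq_fn: runs on every blk_get_request(), including for
       * requests recycled through the mempool's emergency pool, so reset
       * everything else here (in particular scsi_request->cmd). */
      static void bsg_initialize_rq(struct request *req)
      {
              struct bsg_job *job = blk_mq_rq_to_pdu(req);
              void *reply = job->reply;       /* keep the long-lived buffer */
      
              memset(job, 0, sizeof(*job));
              scsi_req_init(&job->sreq);
              job->reply = reply;
      }
      ----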
      
      Fixes: 50b4d485 ("bsg-lib: fix kernel panic resulting from missing allocation of reply-buffer")
      Cc: <stable@vger.kernel.org> # 4.11+
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Benjamin Block <bblock@linux.vnet.ibm.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      eab40cf3
    • blk-mq-debugfs: fix device sched directory for default scheduler · 70e62f4b
      Committed by Omar Sandoval
      In blk_mq_debugfs_register(), I remembered to set up the per-hctx sched
      directories if a default scheduler was already configured by
      blk_mq_sched_init() from blk_mq_init_allocated_queue(), but I didn't do
      the same for the device-wide sched directory. Fix it.
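      
      A sketch of the fix, assuming the existing
      blk_mq_debugfs_register_sched() helper (surrounding registration code
      abridged):
      
      ----
      int blk_mq_debugfs_register(struct request_queue *q)
      {
              int ret;
      
              /* ... per-hctx registration loop, which already handled the
               * default-scheduler case for each hctx ... */
      
              /* also create the device-wide sched directory if an elevator
               * was set up before debugfs registration ran */
              if (q->elevator && !q->sched_debugfs_dir) {
                      ret = blk_mq_debugfs_register_sched(q);
                      if (ret)
                              goto err;
              }
              /* ... */
      }
      ----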
      
      Fixes: d332ce09 ("blk-mq-debugfs: allow schedulers to register debugfs attributes")
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      70e62f4b
    • blk-throttle: fix possible io stall when upgrade to max · 4f02fb76
      Committed by Joseph Qi
      There is a case that will lead to an I/O stall. The case is described
      as follows.
      /test1
        |-subtest1
      /test2
        |-subtest2
      And subtest1 and subtest2 each has 32 queued bios already.
      
      Now upgrade to max. In throtl_upgrade_state, it will try to dispatch
      bios as follows:
      1) tg=subtest1, do nothing;
      2) tg=test1, transfer 32 queued bios from subtest1 to test1; no pending
      left, no need to schedule next dispatch;
      3) tg=subtest2, do nothing;
      4) tg=test2, transfer 32 queued bios from subtest2 to test2; no pending
      left, no need to schedule next dispatch;
      5) tg=/, transfer 8 queued bios from test1 to /, 8 queued bios from
      test2 to /, 8 queued bios from test1 to /, and 8 queued bios from test2
      to /; note that test1 and test2 each still has 16 queued bios left;
      6) tg=/, try to schedule next dispatch, but since disptime is now
      (updated in tg_update_disptime with wait=0), the pending timer is not
      actually scheduled;
      7) in total, throtl_upgrade_state dispatches 32 queued bios, with 32
      left over: test1 and test2 each still has 16 queued bios;
      8) throtl_pending_timer_fn sees the left over bios, but could do
      nothing, because throtl_select_dispatch returns 0, and test1/test2 has
      no pending tg.
      
      The blktrace shows the following:
      8,32   0        0     2.539007641     0  m   N throtl upgrade to max
      8,32   0        0     2.539072267     0  m   N throtl /test2 dispatch nr_queued=16 read=0 write=16
      8,32   7        0     2.539077142     0  m   N throtl /test1 dispatch nr_queued=16 read=0 write=16
      
      So force schedule dispatch if there are pending children.
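      
      The fix boils down to passing 'force' when rescheduling the children's
      dispatch in throtl_upgrade_state() (sketch; surrounding code abridged):
      
      ----
      static void throtl_upgrade_state(struct throtl_data *td)
      {
              struct cgroup_subsys_state *pos_css;
              struct blkcg_gq *blkg;
      
              /* ... */
              blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
                      struct throtl_grp *tg = blkg_to_tg(blkg);
                      struct throtl_service_queue *sq = &tg->service_queue;
      
                      tg->disptime = jiffies - 1;
                      throtl_select_dispatch(sq);
                      /* was 'false': force scheduling so the bios still queued
                       * in the children (16 in test1, 16 in test2 above) are
                       * picked up even though wait == 0 */
                      throtl_schedule_next_dispatch(sq, true);
              }
              /* ... the same 'true' is passed for the top-level queue ... */
      }
      ----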
      Reviewed-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Joseph Qi <qijiang.qj@alibaba-inc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      4f02fb76
  5. 25 Sep 2017 (3 commits)
    • block: fix a crash caused by wrong API · f5c156c4
      Committed by Shaohua Li
      part_stat_show() takes a part device, not a disk, so we should use
      part_to_disk().
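      
      The relevant lines in part_stat_show(), sketched:
      
      ----
      static ssize_t part_stat_show(struct device *dev,
                                    struct device_attribute *attr, char *buf)
      {
              struct hd_struct *p = dev_to_part(dev);
              /* 'dev' is a partition device here, so dev_to_disk(dev) is wrong;
               * go through the partition to reach the owning disk's queue. */
              struct request_queue *q = part_to_disk(p)->queue;
              /* ... */
      }
      ----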
      
      Fixes: d62e26b3 ("block: pass in queue to inflight accounting")
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f5c156c4
    • blktrace: Fix potential deadlock between delete & sysfs ops · 5acb3cc2
      Committed by Waiman Long
      The lockdep code had reported the following unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(s_active#228);
                                     lock(&bdev->bd_mutex/1);
                                     lock(s_active#228);
        lock(&bdev->bd_mutex);
      
       *** DEADLOCK ***
      
      The deadlock may happen when one task (CPU1) is trying to delete a
      partition in a block device and another task (CPU0) is accessing
      tracing sysfs file (e.g. /sys/block/dm-1/trace/act_mask) in that
      partition.
      
      The s_active isn't an actual lock. It is a reference count (kn->count)
      on the sysfs (kernfs) file. Removal of a sysfs file, however, requires
      waiting until all the references are gone. The reference count is
      treated like a rwsem by the lockdep instrumentation code.
      
      The fact that a thread is in the sysfs callback method or in the
      ioctl call means there is a reference to the opened sysfs or device
      file. That should prevent the underlying block structure from being
      removed.
      
      Instead of using bd_mutex in the block_device structure, a new
      blk_trace_mutex is now added to the request_queue structure to protect
      access to the blk_trace structure.
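      
      A sketch of the change (field placement and the full set of converted
      call sites are abridged):
      
      ----
      struct request_queue {
              /* ... */
              struct mutex blk_trace_mutex;   /* serializes q->blk_trace setup/teardown */
      };
      
      int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg)
      {
              struct request_queue *q = bdev_get_queue(bdev);
      
              /* was: mutex_lock(&bdev->bd_mutex), which the partition-delete
               * path can hold while waiting for this task's sysfs reference */
              mutex_lock(&q->blk_trace_mutex);
              /* ... handle BLKTRACESETUP / BLKTRACESTOP / BLKTRACETEARDOWN ... */
              mutex_unlock(&q->blk_trace_mutex);
              /* ... */
      }
      ----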
      Suggested-by: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      
      Fix typo in patch subject line, and prune a comment detailing how
      the code used to work.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      5acb3cc2
    • bsg-lib: don't free job in bsg_prepare_job · f507b54d
      Committed by Christoph Hellwig
      The job structure is allocated as part of the request, so we should not
      free it in the error path of bsg_prepare_job.
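      
      A sketch of the corrected error path (labels as in bsg-lib.c; the gotos
      leading here are abridged). The point is only that kfree(job) is gone:
      
      ----
      static int bsg_prepare_job(struct device *dev, struct request *req)
      {
              struct bsg_job *job = blk_mq_rq_to_pdu(req);
      
              /* ... map buffers, take a reference on dev; on failure the
               * gotos below unwind ... */
      
      failjob_rls_rqst_payload:
              kfree(job->request_payload.sg_list);
      failjob_rls_job:
              /* no kfree(job) here: 'job' is embedded in the request's
               * per-request payload and is freed with the request itself */
              return -ENOMEM;
      }
      ----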
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Cc: stable@vger.kernel.org
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f507b54d
  6. 12 Sep 2017 (1 commit)
    • block: directly insert blk-mq request from blk_insert_cloned_request() · 157f377b
      Committed by Jens Axboe
      A NULL pointer crash was reported for the case of having the BFQ IO
      scheduler attached to the underlying blk-mq paths of a DM multipath
      device.  The crash occurred in blk_mq_sched_insert_request()'s call to
      e->type->ops.mq.insert_requests().
      
      Paolo Valente correctly summarized why the crash occurred:
      "the call chain (dm_mq_queue_rq -> map_request -> setup_clone ->
      blk_rq_prep_clone) creates a cloned request without invoking
      e->type->ops.mq.prepare_request for the target elevator e.  The cloned
      request is therefore not initialized for the scheduler, but it is
      however inserted into the scheduler by blk_mq_sched_insert_request."
      
      All said, a request-based DM multipath device's IO scheduler should be
      the only one used -- when the original requests are issued to the
      underlying paths as cloned requests they are inserted directly in the
      underlying dispatch queue(s) rather than through an additional elevator.
      
      But commit bd166ef1 ("blk-mq-sched: add framework for MQ capable IO
      schedulers") switched blk_insert_cloned_request() from using
      blk_mq_insert_request() to blk_mq_sched_insert_request().  Which
      incorrectly added elevator machinery into a call chain that isn't
      supposed to have any.
      
      To fix this introduce a blk-mq private blk_mq_request_bypass_insert()
      that blk_insert_cloned_request() calls to insert the request without
      involving any elevator that may be attached to the cloned request's
      request_queue.
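      
      The new helper, roughly as introduced (sketch following the patch):
      
      ----
      /* Insert directly into the hctx dispatch list, bypassing any attached
       * elevator; used by blk_insert_cloned_request() for dm-mpath clones. */
      void blk_mq_request_bypass_insert(struct request *rq)
      {
              struct blk_mq_ctx *ctx = rq->mq_ctx;
              struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, ctx->cpu);
      
              spin_lock(&hctx->lock);
              list_add_tail(&rq->queuelist, &hctx->dispatch);
              spin_unlock(&hctx->lock);
      
              blk_mq_run_hw_queue(hctx, false);
      }
      ----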
      
      Fixes: bd166ef1 ("blk-mq-sched: add framework for MQ capable IO schedulers")
      Cc: stable@vger.kernel.org
      Reported-by: Bart Van Assche <Bart.VanAssche@wdc.com>
      Tested-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      157f377b
  7. 11 Sep 2017 (2 commits)
  8. 09 Sep 2017 (2 commits)
  9. 02 Sep 2017 (5 commits)
  10. 01 Sep 2017 (1 commit)
    • compat_hdio_ioctl: Fix a declaration · 8363dae2
      Committed by Bart Van Assche
      This patch avoids that sparse reports the following warning messages:
      
      block/compat_ioctl.c:85:11: warning: incorrect type in assignment (different address spaces)
      block/compat_ioctl.c:85:11:    expected unsigned long *[noderef] <asn:1>p
      block/compat_ioctl.c:85:11:    got void [noderef] <asn:1>*
      block/compat_ioctl.c:91:21: warning: incorrect type in argument 1 (different address spaces)
      block/compat_ioctl.c:91:21:    expected void const volatile [noderef] <asn:1>*<noident>
      block/compat_ioctl.c:91:21:    got unsigned long *[noderef] <asn:1>p
      block/compat_ioctl.c:87:53: warning: dereference of noderef expression
      block/compat_ioctl.c:91:21: warning: dereference of noderef expression
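      
      The fix itself is a one-line declaration change: the __user annotation
      sat on the wrong side of the '*' (sketch):
      
      ----
      /* Before: 'p' itself is marked __user, but what it points to is
       * treated as kernel data as far as sparse is concerned. */
      unsigned long *__user p;
      
      /* After: 'p' is a pointer to a user-space unsigned long, matching how
       * it is used with compat_alloc_user_space() and put_user(). */
      unsigned long __user *p;
      ----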
      
      Fixes: commit d597580d ("generic ...copy_..._user primitives")
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      8363dae2
  11. 31 Aug 2017 (3 commits)
    • block, bfq: guarantee update_next_in_service always returns an eligible entity · 24d90bb2
      Committed by Paolo Valente
      If the function bfq_update_next_in_service is invoked as a consequence
      of the activation or requeueing of an entity, say E, then it doesn't
      invoke bfq_lookup_next_entity to get the next-in-service entity. In
      contrast, it follows a shorter path: if E happens to be eligible (see
      commit "bfq-sq-mq: make lookup_next_entity push up vtime on
      expirations" for details on eligibility) and to have a lower virtual
      finish time than the current candidate as next-in-service entity, then
      E directly becomes the next-in-service entity. Unfortunately, there is
      a corner case for which this shorter path makes
      bfq_update_next_in_service choose a non eligible entity: it occurs if
      both E and the current next-in-service entity happen to be non
      eligible when bfq_update_next_in_service is invoked. In this case, E
      is not set as next-in-service, and, since bfq_lookup_next_entity is
      not invoked, the state of the parent entity is not updated so as to
      end up with an eligible entity as the proper next-in-service entity.
      
      In this respect, next-in-service is actually allowed to be non
      eligible while some queue is in service: since no system-virtual-time
      push-up can be performed in that case (see again commit "bfq-sq-mq:
      make lookup_next_entity push up vtime on expirations" for details),
      next-in-service is chosen, speculatively, as a function of the
      possible value that the system virtual time may get after a push
      up. But the correctness of the schedule breaks if next-in-service is
      still a non eligible entity when it is time to set in service the next
      entity. Unfortunately, this may happen in the above corner case.
      
      This commit fixes this problem by making bfq_update_next_in_service
      invoke bfq_lookup_next_entity not only if the above shorter path
      cannot be taken, but also if the shorter path is taken but fails to
      yield an eligible next-in-service entity.
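      
      The resulting control flow, condensed. Note that
      bfq_entity_is_eligible() and bfq_finishes_before() are stand-in helpers
      for this sketch, not bfq's real names:
      
      ----
      /* Condensed control flow of bfq_update_next_in_service(); the
       * eligibility and finish-time tests are folded into stand-in helpers
       * for readability. */
      if (new_entity && bfq_entity_is_eligible(new_entity) &&
          bfq_finishes_before(new_entity, sd->next_in_service)) {
              /* shorter path: the activated/requeued entity wins directly */
              next_in_service = new_entity;
      } else {
              /*
               * Shorter path not taken, or taken but yielding a non-eligible
               * candidate: fall back to a full lookup, which also pushes up
               * the virtual time and so always returns an eligible entity
               * (when one is active).
               */
              next_in_service = bfq_lookup_next_entity(sd);
      }
      ----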
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      24d90bb2
    • block, bfq: remove direct switch to an entity in higher class · a02195ce
      Committed by Paolo Valente
      If the function bfq_update_next_in_service is invoked as a consequence
      of the activation or requeueing of an entity, say E, and finds out
      that E belongs to a higher-priority class than that of the current
      next-in-service entity, then it sets next_in_service directly to
      E. But this may lead to anomalous schedules, because E may happen not
      to be eligible for service: its virtual start time may be higher than
      the system virtual time for its service tree.
      
      This commit addresses this issue by simply removing this direct
      switch.
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a02195ce
    • block, bfq: make lookup_next_entity push up vtime on expirations · 80294c3b
      Committed by Paolo Valente
      To provide a very smooth service, bfq starts to serve a bfq_queue
      only if the queue is 'eligible', i.e., if the same queue would
      have started to be served in the ideal, perfectly fair system that
      bfq simulates internally. This is obtained by associating each
      queue with a virtual start time, and by computing a special system
      virtual time quantity: a queue is eligible only if the system
      virtual time has reached the virtual start time of the
      queue. Finally, bfq guarantees that, when a new queue must be set
      in service, there is always at least one eligible entity for each
      active parent entity in the scheduler. To provide this guarantee,
      the function __bfq_lookup_next_entity pushes up, for each parent
      entity on which it is invoked, the system virtual time to the
      minimum among the virtual start times of the entities in the
      active tree for the parent entity (more precisely, the push up
      occurs if the system virtual time happens to be lower than all
      such virtual start times).
      
      There is however a circumstance in which __bfq_lookup_next_entity
      cannot push up the system virtual time for a parent entity, even
      if the system virtual time is lower than the virtual start times
      of all the child entities in the active tree. It happens if one of
      the child entities is in service. In fact, in such a case, there
      is already an eligible entity, the in-service one, even if it may
      not be present in the active tree (because in-service entities
      may be removed from the active tree).
      
      Unfortunately, in the last re-design of the
      hierarchical-scheduling engine, the reset of the pointer to the
      in-service entity for a given parent entity--reset to be done as a
      consequence of the expiration of the in-service entity--always
      happens after the function __bfq_lookup_next_entity has been
      invoked. This causes the function to think that there is still an
      entity in service for the parent entity, and then that the system
      virtual time cannot be pushed up, even if actually such a
      no-more-in-service entity has already been properly reinserted
      into the active tree (or in some other tree if no more
      active). Yet, the system virtual time *had* to be pushed up, to be
      ready to correctly choose the next queue to serve. Because of the
      lack of this push up, bfq may wrongly set in service a queue that
      had been speculatively pre-computed as the possible
      next-in-service queue, but that would no more be the one to serve
      after the expiration and the reinsertion into the active trees of
      the previously in-service entities.
      
      This commit addresses this issue by making
      __bfq_lookup_next_entity properly push up the system virtual time
      if an expiration is occurring.
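      
      A sketch of the idea; bfq_push_up_vtime() is a stand-in for the actual
      virtual-time update, and the threading of the expiration flag through
      the lookup path is abridged:
      
      ----
      static struct bfq_entity *
      __bfq_lookup_next_entity(struct bfq_service_tree *st, bool in_service)
      {
              /*
               * Push the system virtual time up only if no child of this
               * parent entity is in service. During an expiration the caller
               * passes in_service == false even though the in-service
               * pointer has not been reset yet, so the push-up happens.
               */
              if (!in_service)
                      bfq_push_up_vtime(st);  /* stand-in: raise st->vtime to the
                                               * min virtual start time in the
                                               * active tree */
              /* ... pick the first eligible entity in the active tree ... */
      }
      
      /* caller side: */
      entity = __bfq_lookup_next_entity(st, sd->in_service_entity && !expiration);
      ----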
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      80294c3b
  12. 30 Aug 2017 (4 commits)
  13. 29 Aug 2017 (4 commits)
    • block: Make blk_dequeue_request() static · 5034435c
      Committed by Damien Le Moal
      The only caller of this function is blk_start_request() in the same
      file, so make it static. Fix the blk_start_request() description
      accordingly.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      5034435c
    • smp: Avoid using two cache lines for struct call_single_data · 966a9671
      Committed by Ying Huang
      struct call_single_data is used in IPIs to transfer information between
      CPUs.  Its size is bigger than sizeof(unsigned long) and less than
      cache line size.  Currently it is not allocated with any explicit alignment
      requirements.  This makes it possible for an allocated call_single_data
      to cross two cache lines, which doubles the number of cache lines that
      need to be transferred among CPUs.
      
      This can be fixed by requiring call_single_data to be aligned to the
      size of call_single_data.  Currently the size of call_single_data is a
      power of 2.  If we add new fields to call_single_data, we may need to
      add padding to make sure the size of the new definition is a power of 2
      as well.
      
      Fortunately, this is enforced by GCC, which will report bad sizes.
      
      To set the alignment requirement of call_single_data to the size of
      call_single_data, a struct definition and a typedef are used.
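      
      The definition, per the commit (note that __aligned() requires a
      power-of-2 value, which is why the size constraint above matters):
      
      ----
      struct __call_single_data {
              struct llist_node llist;
              smp_call_func_t func;
              void *info;
              unsigned int flags;
      };
      
      /* Alignment == size keeps one instance within a single cache line
       * (GCC rejects __aligned() values that are not powers of 2). */
      typedef struct __call_single_data call_single_data_t
              __aligned(sizeof(struct __call_single_data));
      ----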
      
      To test the effect of the patch, I used the vm-scalability multiple
      thread swap test case (swap-w-seq-mt).  The test will create multiple
      threads and each thread will eat memory until all RAM and part of swap
      is used, so that huge number of IPIs are triggered when unmapping
      memory.  In the test, the throughput of memory writing improves ~5%
      compared with misaligned call_single_data, because of faster IPIs.
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Huang, Ying <ying.huang@intel.com>
      [ Add call_single_data_t and align with size of call_single_data. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/87bmnqd6lz.fsf@yhuang-mobile.sh.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      966a9671
    • block: fix warning when I/O elevator is changed as request_queue is being removed · e9a823fb
      Committed by David Jeffery
      There is a race between changing I/O elevator and request_queue removal
      which can trigger the warning in kobject_add_internal.  A program can
      use sysfs to request a change of elevator at the same time another task
      is unregistering the request_queue the elevator would be attached to.
      The elevator's kobject will then attempt to be connected to the
      request_queue in the object tree when the request_queue has just been
      removed from sysfs.  This triggers the warning in kobject_add_internal
      as the request_queue no longer has a sysfs directory:
      
      kobject_add_internal failed for iosched (error: -2 parent: queue)
      ------------[ cut here ]------------
      WARNING: CPU: 3 PID: 14075 at lib/kobject.c:244 kobject_add_internal+0x103/0x2d0
      
      To fix this warning, we can check the QUEUE_FLAG_REGISTERED flag when
      changing the elevator and use the request_queue's sysfs_lock to
      serialize between clearing the flag and the elevator testing the flag.
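      
      A sketch of the two sides of the fix (function bodies abridged):
      
      ----
      /* elevator sysfs store path: refuse the switch once the queue has
       * been unregistered from sysfs */
      static int __elevator_change(struct request_queue *q, const char *name)
      {
              if (!test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags))
                      return -ENOENT;
              /* ... look up and switch to the named elevator ... */
      }
      
      /* removal path: clear the flag under the same sysfs_lock the store
       * path holds, so the check above cannot race with unregistration */
      void blk_unregister_queue(struct gendisk *disk)
      {
              struct request_queue *q = disk->queue;
      
              mutex_lock(&q->sysfs_lock);
              queue_flag_clear_unlocked(QUEUE_FLAG_REGISTERED, q);
              mutex_unlock(&q->sysfs_lock);
              /* ... remove the queue's sysfs entries ... */
      }
      ----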
      Signed-off-by: David Jeffery <djeffery@redhat.com>
      Tested-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e9a823fb
    • block, scheduler: convert xxx_var_store to void · 235f8da1
      Committed by weiping zhang
      The last parameter "count" is never used in xxx_var_store, so convert
      these functions to void.
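      
      For example, the pattern in the deadline scheduler's sysfs helpers
      (sketched):
      
      ----
      /* Before: 'count' was accepted and echoed back, but every caller
       * ignored the return value. */
      static ssize_t deadline_var_store(int *var, const char *page, size_t count)
      {
              char *p = (char *) page;
      
              *var = simple_strtol(p, &p, 10);
              return count;
      }
      
      /* After: same parsing, without the unused parameter and dead return. */
      static void deadline_var_store(int *var, const char *page)
      {
              char *p = (char *) page;
      
              *var = simple_strtol(p, &p, 10);
      }
      ----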
      Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      235f8da1
  14. 26 Aug 2017 (3 commits)
  15. 25 Aug 2017 (1 commit)
  16. 24 Aug 2017 (2 commits)
    • compat_hdio_ioctl: Fix a declaration · 6a934bb8
      Committed by Bart Van Assche
      This patch avoids that sparse reports the following warning messages:
      
      block/compat_ioctl.c:85:11: warning: incorrect type in assignment (different address spaces)
      block/compat_ioctl.c:85:11:    expected unsigned long *[noderef] <asn:1>p
      block/compat_ioctl.c:85:11:    got void [noderef] <asn:1>*
      block/compat_ioctl.c:91:21: warning: incorrect type in argument 1 (different address spaces)
      block/compat_ioctl.c:91:21:    expected void const volatile [noderef] <asn:1>*<noident>
      block/compat_ioctl.c:91:21:    got unsigned long *[noderef] <asn:1>p
      block/compat_ioctl.c:87:53: warning: dereference of noderef expression
      block/compat_ioctl.c:91:21: warning: dereference of noderef expression
      
      Fixes: commit d597580d ("generic ...copy_..._user primitives")
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      6a934bb8
    • block: remove blk_free_devt in add_partition · 47570848
      Committed by weiping zhang
      put_device(pdev) will eventually call pdev->type->release, i.e.
      part_release(), which already calls blk_free_devt(), so remove the
      redundant call.
      Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      47570848