1. 09 2月, 2019 5 次提交
    • C
      bcache: export backing_dev_uuid via sysfs · d4610456
      Coly Li 提交于
      When there are multiple bcache devices, after a reboot the name of
      bcache devices may change (e.g. current /dev/bcache1 was /dev/bcache0
      before reboot). Therefore we need the backing device UUID (sb.uuid) to
      identify each bcache device.
      
      Backing device uuid can be found by program bcache-super-show, but
      directly exporting backing_dev_uuid by sysfs file
      /sys/block/bcache<?>/bcache/backing_dev_uuid is a much simpler method.
      
      With backing_dev_uuid, and partition uuids from /dev/disk/by-partuuid/,
      now we can identify each bcache device and its partitions conveniently.
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d4610456
    • C
      bcache: export backing_dev_name via sysfs · 926d1946
      Coly Li 提交于
      This patch export dc->backing_dev_name to sysfs file
      /sys/block/bcache<?>/bcache/backing_dev_name, then people or user space
      tools may know the backing device name of this bcache device.
      
      Of cause it can be done by parsing sysfs links, but this method can be
      much simpler to find the link between bcache device and backing device.
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      926d1946
    • C
      bcache: not use hard coded memset size in bch_cache_accounting_clear() · 83ff9318
      Coly Li 提交于
      In stats.c:bch_cache_accounting_clear(), a hard coded number '7' is
      used in memset(). It is because in struct cache_stats, there are 7
      atomic_t type members. This is not good when new members added into
      struct stats, the hard coded number will only clear part of memory.
      
      This patch replaces 'sizeof(unsigned long) * 7' by more generic
      'sizeof(struct cache_stats))', to avoid potential error if new
      member added into struct cache_stats.
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      83ff9318
    • D
      bcache: never writeback a discard operation · 9951379b
      Daniel Axtens 提交于
      Some users see panics like the following when performing fstrim on a
      bcached volume:
      
      [  529.803060] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      [  530.183928] #PF error: [normal kernel read fault]
      [  530.412392] PGD 8000001f42163067 P4D 8000001f42163067 PUD 1f42168067 PMD 0
      [  530.750887] Oops: 0000 [#1] SMP PTI
      [  530.920869] CPU: 10 PID: 4167 Comm: fstrim Kdump: loaded Not tainted 5.0.0-rc1+ #3
      [  531.290204] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 12/27/2015
      [  531.693137] RIP: 0010:blk_queue_split+0x148/0x620
      [  531.922205] Code: 60 38 89 55 a0 45 31 db 45 31 f6 45 31 c9 31 ff 89 4d 98 85 db 0f 84 7f 04 00 00 44 8b 6d 98 4c 89 ee 48 c1 e6 04 49 03 70 78 <8b> 46 08 44 8b 56 0c 48
      8b 16 44 29 e0 39 d8 48 89 55 a8 0f 47 c3
      [  532.838634] RSP: 0018:ffffb9b708df39b0 EFLAGS: 00010246
      [  533.093571] RAX: 00000000ffffffff RBX: 0000000000046000 RCX: 0000000000000000
      [  533.441865] RDX: 0000000000000200 RSI: 0000000000000000 RDI: 0000000000000000
      [  533.789922] RBP: ffffb9b708df3a48 R08: ffff940d3b3fdd20 R09: 0000000000000000
      [  534.137512] R10: ffffb9b708df3958 R11: 0000000000000000 R12: 0000000000000000
      [  534.485329] R13: 0000000000000000 R14: 0000000000000000 R15: ffff940d39212020
      [  534.833319] FS:  00007efec26e3840(0000) GS:ffff940d1f480000(0000) knlGS:0000000000000000
      [  535.224098] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  535.504318] CR2: 0000000000000008 CR3: 0000001f4e256004 CR4: 00000000001606e0
      [  535.851759] Call Trace:
      [  535.970308]  ? mempool_alloc_slab+0x15/0x20
      [  536.174152]  ? bch_data_insert+0x42/0xd0 [bcache]
      [  536.403399]  blk_mq_make_request+0x97/0x4f0
      [  536.607036]  generic_make_request+0x1e2/0x410
      [  536.819164]  submit_bio+0x73/0x150
      [  536.980168]  ? submit_bio+0x73/0x150
      [  537.149731]  ? bio_associate_blkg_from_css+0x3b/0x60
      [  537.391595]  ? _cond_resched+0x1a/0x50
      [  537.573774]  submit_bio_wait+0x59/0x90
      [  537.756105]  blkdev_issue_discard+0x80/0xd0
      [  537.959590]  ext4_trim_fs+0x4a9/0x9e0
      [  538.137636]  ? ext4_trim_fs+0x4a9/0x9e0
      [  538.324087]  ext4_ioctl+0xea4/0x1530
      [  538.497712]  ? _copy_to_user+0x2a/0x40
      [  538.679632]  do_vfs_ioctl+0xa6/0x600
      [  538.853127]  ? __do_sys_newfstat+0x44/0x70
      [  539.051951]  ksys_ioctl+0x6d/0x80
      [  539.212785]  __x64_sys_ioctl+0x1a/0x20
      [  539.394918]  do_syscall_64+0x5a/0x110
      [  539.568674]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      We have observed it where both:
      1) LVM/devmapper is involved (bcache backing device is LVM volume) and
      2) writeback cache is involved (bcache cache_mode is writeback)
      
      On one machine, we can reliably reproduce it with:
      
       # echo writeback > /sys/block/bcache0/bcache/cache_mode
         (not sure whether above line is required)
       # mount /dev/bcache0 /test
       # for i in {0..10}; do
      	file="$(mktemp /test/zero.XXX)"
      	dd if=/dev/zero of="$file" bs=1M count=256
      	sync
      	rm $file
          done
        # fstrim -v /test
      
      Observing this with tracepoints on, we see the following writes:
      
      fstrim-18019 [022] .... 91107.302026: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 4260112 + 196352 hit 0 bypass 1
      fstrim-18019 [022] .... 91107.302050: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 4456464 + 262144 hit 0 bypass 1
      fstrim-18019 [022] .... 91107.302075: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 4718608 + 81920 hit 0 bypass 1
      fstrim-18019 [022] .... 91107.302094: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 5324816 + 180224 hit 0 bypass 1
      fstrim-18019 [022] .... 91107.302121: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 5505040 + 262144 hit 0 bypass 1
      fstrim-18019 [022] .... 91107.302145: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 5767184 + 81920 hit 0 bypass 1
      fstrim-18019 [022] .... 91107.308777: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 6373392 + 180224 hit 1 bypass 0
      <crash>
      
      Note the final one has different hit/bypass flags.
      
      This is because in should_writeback(), we were hitting a case where
      the partial stripe condition was returning true and so
      should_writeback() was returning true early.
      
      If that hadn't been the case, it would have hit the would_skip test, and
      as would_skip == s->iop.bypass == true, should_writeback() would have
      returned false.
      
      Looking at the git history from 'commit 72c27061 ("bcache: Write out
      full stripes")', it looks like the idea was to optimise for raid5/6:
      
             * If a stripe is already dirty, force writes to that stripe to
      	 writeback mode - to help build up full stripes of dirty data
      
      To fix this issue, make sure that should_writeback() on a discard op
      never returns true.
      
      More details of debugging:
      https://www.spinics.net/lists/linux-bcache/msg06996.html
      
      Previous reports:
       - https://bugzilla.kernel.org/show_bug.cgi?id=201051
       - https://bugzilla.kernel.org/show_bug.cgi?id=196103
       - https://www.spinics.net/lists/linux-bcache/msg06885.html
      
      (Coly Li: minor modification to follow maximum 75 chars per line rule)
      
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: stable@vger.kernel.org
      Fixes: 72c27061 ("bcache: Write out full stripes")
      Signed-off-by: NDaniel Axtens <dja@axtens.net>
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      9951379b
    • A
      block: avoid setting nr_requests to current value · e5fa8140
      Aleksei Zakharov 提交于
      There's no reason to freeze queue and set nr_requests value
      if current value is the same.
      Signed-off-by: NAleksei Zakharov <zakharov.a.g@yandex.ru>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e5fa8140
  2. 08 2月, 2019 1 次提交
    • G
      cdrom: Fix race condition in cdrom_sysctl_register · f25191bb
      Guenter Roeck 提交于
      The following traceback is sometimes seen when booting an image in qemu:
      
      [   54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
      [   54.611085] Fusion MPT base driver 3.04.20
      [   54.611877] Copyright (c) 1999-2008 LSI Corporation
      [   54.616234] Fusion MPT SAS Host driver 3.04.20
      [   54.635139] sysctl duplicate entry: /dev/cdrom//info
      [   54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
      [   54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
      [   54.641273] Workqueue: events_unbound async_run_entry_fn
      [   54.641273] Call Trace:
      [   54.641273]  dump_stack+0x67/0x90
      [   54.641273]  __register_sysctl_table+0x50b/0x570
      [   54.641273]  ? rcu_read_lock_sched_held+0x6f/0x80
      [   54.641273]  ? kmem_cache_alloc_trace+0x1c7/0x1f0
      [   54.646814]  __register_sysctl_paths+0x1c8/0x1f0
      [   54.646814]  cdrom_sysctl_register.part.7+0xc/0x5f
      [   54.646814]  register_cdrom.cold.24+0x2a/0x33
      [   54.646814]  sr_probe+0x4bd/0x580
      [   54.646814]  ? __driver_attach+0xd0/0xd0
      [   54.646814]  really_probe+0xd6/0x260
      [   54.646814]  ? __driver_attach+0xd0/0xd0
      [   54.646814]  driver_probe_device+0x4a/0xb0
      [   54.646814]  ? __driver_attach+0xd0/0xd0
      [   54.646814]  bus_for_each_drv+0x73/0xc0
      [   54.646814]  __device_attach+0xd6/0x130
      [   54.646814]  bus_probe_device+0x9a/0xb0
      [   54.646814]  device_add+0x40c/0x670
      [   54.646814]  ? __pm_runtime_resume+0x4f/0x80
      [   54.646814]  scsi_sysfs_add_sdev+0x81/0x290
      [   54.646814]  scsi_probe_and_add_lun+0x888/0xc00
      [   54.646814]  ? scsi_autopm_get_host+0x21/0x40
      [   54.646814]  __scsi_add_device+0x116/0x130
      [   54.646814]  ata_scsi_scan_host+0x93/0x1c0
      [   54.646814]  async_run_entry_fn+0x34/0x100
      [   54.646814]  process_one_work+0x237/0x5e0
      [   54.646814]  worker_thread+0x37/0x380
      [   54.646814]  ? rescuer_thread+0x360/0x360
      [   54.646814]  kthread+0x118/0x130
      [   54.646814]  ? kthread_create_on_node+0x60/0x60
      [   54.646814]  ret_from_fork+0x3a/0x50
      
      The only sensible explanation is that cdrom_sysctl_register() is called
      twice, once from the module init function and once from register_cdrom().
      cdrom_sysctl_register() is not mutex protected and may happily execute
      twice if the second call is made before the first call is complete.
      
      Use a static atomic to ensure that the function is executed exactly once.
      Signed-off-by: NGuenter Roeck <linux@roeck-us.net>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f25191bb
  3. 05 2月, 2019 4 次提交
  4. 04 2月, 2019 2 次提交
  5. 01 2月, 2019 17 次提交
    • J
      blk-mq: save default hctx into ctx->hctxs for not-supported type · bb94aea1
      Jianchao Wang 提交于
      Currently, we check whether the hctx type is supported every time
      in hot path. Actually, this is not necessary, we could save the
      default hctx into ctx->hctxs if the type is not supported when
      map swqueues and use it directly with ctx->hctxs[type].
      
      We also needn't check whether the poll is enabled or not, because
      the caller would clear the REQ_HIPRI in that case.
      Signed-off-by: NJianchao Wang <jianchao.w.wang@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      bb94aea1
    • J
      blk-mq: save queue mapping result into ctx directly · 8ccdf4a3
      Jianchao Wang 提交于
      Currently, the queue mapping result is saved in a two-dimensional
      array. In the hot path, to get a hctx, we need do following:
      
        q->queue_hw_ctx[q->tag_set->map[type].mq_map[cpu]]
      
      This isn't very efficient. We could save the queue mapping result into
      ctx directly with different hctx type, like,
      
        ctx->hctxs[type]
      Signed-off-by: NJianchao Wang <jianchao.w.wang@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      8ccdf4a3
    • P
      block, bfq: fix in-service-queue check for queue merging · 058fdecc
      Paolo Valente 提交于
      When a new I/O request arrives for a bfq_queue, say Q, bfq checks
      whether that request is close to
      (a) the head request of some other queue waiting to be served, or
      (b) the last request dispatched for the in-service queue (in case Q
      itself is not the in-service queue)
      
      If a queue, say Q2, is found for which the above condition holds, then
      bfq merges Q and Q2, to hopefully get a more sequential I/O in the
      resulting merged queue, and thus a possibly higher throughput.
      
      Case (b) is checked by comparing the new request for Q with the last
      request dispatched, assuming that the latter necessarily belonged to the
      in-service queue. Unfortunately, this assumption is no longer always
      correct, since commit d0edc247 ("block, bfq: inject other-queue I/O
      into seeky idle queues on NCQ flash").
      
      When the assumption does not hold, queues that must not be merged may be
      merged, causing unexpected loss of control on per-queue service
      guarantees.
      
      This commit solves this problem by adding an extra field, which stores
      the actual last request dispatched for the in-service queue, and by
      using this new field to correctly check case (b).
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      058fdecc
    • P
      block, bfq: do not overcharge writes in asymmetric scenarios · 02a6d787
      Paolo Valente 提交于
      Writes tend to starve reads. bfq counters this problem by overcharging
      writes with an inflated service w.r.t. the actual service (number of
      sector written) they receive.
      
      Yet his overcharging is useless, and actually causes unfairness in the
      opposite direction, when bfq happens to be enforcing strong I/O control.
      bfq does this enforcing when the scenario is asymmetric, i.e., when some
      bfq_queue or group of bfq_queues is to be granted a different bandwidth
      than some other bfq_queue or group of bfq_queues. So, in such a
      scenario, this commit disables write overcharging.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      02a6d787
    • P
      block, bfq: port commit "cfq-iosched: improve hw_tag detection" · b3c34981
      Paolo Valente 提交于
      The original commit is commit 1a1238a7 ("cfq-iosched: improve hw_tag
      detection") and has the following commit message:
      
      If active queue hasn't enough requests and idle window opens, cfq will
      not dispatch sufficient requests to hardware. In such situation, current
      code will zero hw_tag. But this is because cfq doesn't dispatch enough
      requests instead of hardware queue doesn't work. Don't zero hw_tag in
      such case.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b3c34981
    • P
      block, bfq: reduce threshold for detecting command queueing · a3c92560
      Paolo Valente 提交于
      bfq simple heuristic from cfq for detecting whether the drive performs
      command queueing: check whether the average number of in-flight requests
      is above a given threshold. Unfortunately this heuristic does fail to
      detect queueing (on drives with queueing) if processes doing I/O are few
      and issue I/O with a low depth.
      
      To reduce false negatives, this commit lowers the threshold.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a3c92560
    • P
      block, bfq: fix queue removal from weights tree · 9dee8b3b
      Paolo Valente 提交于
      bfq maintains an ordered list, through a red-black tree, of unique
      weights of active bfq_queues. This list is used to detect whether there
      are active queues with differentiated weights. The weight of a queue is
      removed from the list when both the following two conditions become
      true:
      
      (1) the bfq_queue is flagged as inactive
      (2) the has no in-flight request any longer;
      
      Unfortunately, in the rare cases where condition (2) becomes true before
      condition (1), the removal fails, because the function to remove the
      weight of the queue (bfq_weights_tree_remove) is rightly invoked in the
      path that deactivates the bfq_queue, but mistakenly invoked *before* the
      function that actually performs the deactivation (bfq_deactivate_bfqq).
      
      This commits moves the invocation of bfq_weights_tree_remove for
      condition (1) to after bfq_deactivate_bfqq. As a consequence of this
      move, it is necessary to add a further reference to the queue when the
      weight of a queue is added, because the queue might otherwise be freed
      before bfq_weights_tree_remove is invoked. This commit adds this
      reference and makes all related modifications.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      9dee8b3b
    • P
      block, bfq: fix sequential rq detection in rate estimation · d87447d8
      Paolo Valente 提交于
      In bfq_update_peak_rate, to check whether an I/O request rq is
      sequential, only the seek distance of rq w.r.t. the last request
      dispatched is controlled. This is not sufficient for non-rotational
      storage, where the size of rq is at least as relevant. This commit adds
      the missing control.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d87447d8
    • P
      block, bfq: unconditionally plug I/O in asymmetric scenarios · 530c4cbb
      Paolo Valente 提交于
      bfq detects the creation of multiple bfq_queues shortly after each
      other, namely a burst of queue creations in the terminology used in the
      code. If the burst is large, then no queue in the burst is granted
      - either I/O-dispatch plugging when the queue remains temporarily idle
        while in service;
      - or weight raising, because it causes even longer plugging.
      
      In fact, such a plugging tends to lower throughput, while these bursts
      are typically due to applications or services that spawn multiple
      processes, to reach a common goal as soon as possible. Examples are a
      "git grep" or the booting of a system.
      
      Unfortunately, disabling plugging may cause a loss of service guarantees
      in asymmetric scenarios, i.e., if queue weights are differentiated or if
      more than one group is active.
      
      This commit addresses this issue by no longer disabling I/O-dispatch
      plugging for queues in large bursts.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      530c4cbb
    • P
      block, bfq: do not plug I/O of in-service queue when harmful · ac8b0cb4
      Paolo Valente 提交于
      If the in-service bfq_queue is sync and remains temporarily idle, then
      I/O dispatching (from other queues) may be plugged. It may be dome for
      two reasons: either to boost throughput, or to preserve the bandwidth
      share of the in-service queue. In the first case, if the I/O of the
      in-service queue, when it finally arrives, consists only of one small
      I/O request, then it makes sense to plug even the I/O of the in-service
      queue. In fact, serving such a small request immediately is likely to
      lower throughput instead of boosting it, whereas waiting a little bit is
      likely to let that request grow, thanks to request merging, and become
      more profitable in terms of throughput (this is likely to happen exactly
      because the I/O of the queue has been detected to boost throughput).
      
      On the opposite end, if I/O dispatching is being plugged only to
      preserve the bandwidth of the in-service queue, then it would be better
      not to plug also the I/O of the in-service queue, because such a
      plugging is likely to cause only loss of bandwidth for the queue.
      
      Unfortunately, no distinction is made between the two cases, and the I/O
      of the in-service queue is always plugged in case just a small I/O
      request arrives. This commit draws this missing distinction and does not
      perform harmful plugging.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      ac8b0cb4
    • P
      block, bfq: split function bfq_better_to_idle · 05c2f5c3
      Paolo Valente 提交于
      This is a preparatory commit for commits that need to check only one of
      the two main reasons for idling. This change should also improve the
      quality of the code a little bit, by splitting a function that contains
      very long, non-trivial and little related comments.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      05c2f5c3
    • P
      block, bfq: consider also ioprio classes in symmetry detection · 73d58118
      Paolo Valente 提交于
      In asymmetric scenarios, i.e., when some bfq_queue or bfq_group needs to
      be guaranteed a different bandwidth than other bfq_queues or bfq_groups,
      these service guaranteed can be provided only by plugging I/O dispatch,
      completely or partially, when the queue in service remains temporarily
      empty. A case where asymmetry is particularly strong is when some active
      bfq_queues belong to a higher-priority class than some other active
      bfq_queues. Unfortunately, this important case is not considered at all
      in the code for detecting asymmetric scenarios. This commit adds the
      missing logic.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      73d58118
    • P
      block, bfq: remove case of redirected bic from insert_request · 03e565e4
      Paolo Valente 提交于
      Before commit 18e5a57d ("block, bfq: postpone rq preparation to
      insert or merge"), the destination queue for a request was chosen by a
      different hook than the one that then inserted the request. So, between
      the execution of the two hooks, the bic of the process generating the
      request could happen to be redirected to a different bfq_queue. As a
      consequence, the destination bfq_queue stored in the request could be
      wrong. Such an event does not need to ba handled any longer.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      03e565e4
    • P
      block, bfq: make sure queue budgets are not below service received · f3218ad8
      Paolo Valente 提交于
      With some unlucky sequences of events, the function bfq_updated_next_req
      updates the current budget of a bfq_queue to a lower value than the
      service received by the queue using such a budget. Unfortunately, if
      this happens, then the return value of the function bfq_bfqq_budget_left
      becomes inconsistent. This commit solves this problem by lower-bounding
      the budget computed in bfq_updated_next_req to the service currently
      charged to the queue.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f3218ad8
    • P
      block, bfq: avoid selecting a queue w/o budget · 218cb897
      Paolo Valente 提交于
      To boost throughput on devices with internal queueing and in scenarios
      where device idling is not strictly needed, bfq immediately starts
      serving a new bfq_queue if the in-service bfq_queue remains without
      pending I/O, even if new I/O may arrive soon for the latter queue. Then,
      if such I/O actually arrives soon, bfq preempts the new in-service
      bfq_queue so as to give the previous queue a chance to go on being
      served (in case the previous queue should actually be the one to be
      served, according to its timestamps).
      
      However, the in-service bfq_queue, say Q, may also be without further
      budget when it remains also pending I/O. Since bfq changes budgets
      dynamically to fit the needs of bfq_queues, this happens more often than
      one may expect. If this happens, then there is no point in trying to go
      on serving Q when new I/O arrives for it soon: Q would be expired
      immediately after being selected for service. This would only cause
      useless overhead. This commit avoids such a useless selection.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      218cb897
    • P
      block, bfq: do not consider interactive queues in srt filtering · 20cd3245
      Paolo Valente 提交于
      The speed at which a bfq_queue receives I/O is one of the parameters by
      which bfq decides whether the queue is soft real-time (i.e., whether the
      queue contains the I/O of a soft real-time application). In particular,
      when a bfq_queue remains without outstanding I/O requests, bfq computes
      the minimum time instant, named soft_rt_next_start, at which the next
      request of the queue may arrive for the queue to be deemed as soft real
      time.
      
      Unfortunately this filtering may cause problems with a queue in
      interactive weight raising. In fact, such a queue may be conveying the
      I/O needed to load a soft real-time application. The latter will
      actually exhibit a soft real-time I/O pattern after it finally starts
      doing its job. But, if soft_rt_next_start is updated for an interactive
      bfq_queue, and the queue has received a lot of service before remaining
      with no outstanding request (likely to happen on a fast device), then
      soft_rt_next_start is assigned such a high value that, for a very long
      time, the queue is prevented from being possibly considered as soft real
      time.
      
      This commit removes the updating of soft_rt_next_start for bfq_queues in
      interactive weight raising.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      20cd3245
    • C
      mtip32xx: ѕtop abusing the managed resource APIs · 22cb4e68
      Christoph Hellwig 提交于
      The mtip32xx driver uses managed resources for DMA coherent memory
      and irqs, but then always pairs them with free calls anyway, making
      the resource tracking rather pointless.  Given some DMA allocations
      are transient anyway, the irq freeing seems to require ordering vs
      other hardware access the best solution seems to be to stop using
      the managed resource API entirely.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      22cb4e68
  6. 28 1月, 2019 11 次提交
    • L
      Linux 5.0-rc4 · f17b5f06
      Linus Torvalds 提交于
      f17b5f06
    • L
      Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 8a5f0605
      Linus Torvalds 提交于
      Pull x86 fixes from Thomas Gleixner:
       "A set of fixes for x86:
      
         - Fix the swapped outb() parameters in the KASLR code
      
         - Fix the PKEY handling at fork which missed to preserve the pkey
           state for the child. Comes with a test case to validate that.
      
         - Fix the entry stack handling for XEN PV to respect that XEN PV
           systems enter the function already on the current thread stack and
           not on the trampoline.
      
         - Fix kexec load failure caused by using a stale value when the
           kexec_buf structure is reused for subsequent allocations.
      
         - Fix a bogus sizeof() in the memory encryption code
      
         - Enforce PCI dependency for the Intel Low Power Subsystem
      
         - Enforce PCI_LOCKLESS_CONFIG when PCI is enabled"
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/Kconfig: Select PCI_LOCKLESS_CONFIG if PCI is enabled
        x86/entry/64/compat: Fix stack switching for XEN PV
        x86/kexec: Fix a kexec_file_load() failure
        x86/mm/mem_encrypt: Fix erroneous sizeof()
        x86/selftests/pkeys: Fork() to check for state being preserved
        x86/pkeys: Properly copy pkey state at fork()
        x86/kaslr: Fix incorrect i8254 outb() parameters
        x86/intel/lpss: Make PCI dependency explicit
      8a5f0605
    • L
      Merge branch 'x86-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 351e1aa6
      Linus Torvalds 提交于
      Pull x86 timer fixes from Thomas Gleixner:
       "Two commits which were missed to be sent during the merge window.
      
         - The TSC calibration fix turns out to be more urgent as recent
           Skylake-X systems seem to have massive trouble with calibration
           disturbance. This should go back into stable for that reason and it
           the risk of breakage is rather low.
      
         - Drop an unused define"
      
      * 'x86-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/hpet: Remove unused FSEC_PER_NSEC define
        x86/tsc: Make calibration refinement more robust
      351e1aa6
    • L
      Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · f907bb4c
      Linus Torvalds 提交于
      Pull timer fix from Thomas Glexiner:
       "A single regression fix to address the unintended breakage of posix
        cpu timers.
      
        This is caused by a new sanity check in the common code, which fails
        for posix cpu timers under certain conditions because the posix cpu
        timer code never updates the variable which is checked"
      
      * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        posix-cpu-timers: Unbreak timer rearming
      f907bb4c
    • L
      Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 98810518
      Linus Torvalds 提交于
      Pull locking fixes from Thomas Gleixner:
       "A small series of fixes which all address possible missed wakeups:
      
         - Document and fix the wakeup ordering of wake_q
      
         - Add the missing barrier in rcuwait_wake_up(), which was documented
           in the comment but missing in the code
      
         - Fix the possible missed wakeups in the rwsem and futex code"
      
      * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        locking/rwsem: Fix (possible) missed wakeup
        futex: Fix (possible) missed wakeup
        sched/wake_q: Fix wakeup ordering for wake_q
        sched/wake_q: Document wake_q_add()
        sched/wait: Fix rcuwait_wake_up() ordering
      98810518
    • L
      Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 0d484375
      Linus Torvalds 提交于
      Pull irq fixes from Thomas Gleixner:
       "A small set of fixes for the interrupt subsystem:
      
         - Fix a double increment in the irq descriptor allocator which
           resulted in a sanity check only being done for every second
           affinity mask
      
         - Add a missing device tree translation in the stm32-exti driver.
           Without that the interrupt association is completely wrong.
      
         - Initialize the mutex in the GIC-V3 MBI driver
      
         - Fix the alignment for aliasing devices in the GIC-V3-ITS driver so
           multi MSI allocations work correctly
      
         - Ensure that the initial affinity of a interrupt is not empty at
           startup time.
      
         - Drop bogus include in the madera irq chip driver
      
         - Fix KernelDoc regression"
      
      * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        irqchip/gic-v3-its: Align PCI Multi-MSI allocation on their size
        genirq/irqdesc: Fix double increment in alloc_descs()
        genirq: Fix the kerneldoc comment for struct irq_affinity_desc
        irqchip/madera: Drop GPIO includes
        irqchip/gic-v3-mbi: Fix uninitialized mbi_lock
        irqchip/stm32-exti: Add domain translate function
        genirq: Make sure the initial affinity is not empty
      0d484375
    • L
      Merge tag 'edac_fix_for_5.0' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp · 98354243
      Linus Torvalds 提交于
      Pull EDAC fix from Borislav Petkov:
       "Fix persistent register offsets of altera_edac, from Thor Thayer"
      
      * tag 'edac_fix_for_5.0' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp:
        EDAC, altera: Fix S10 persistent register offset
      98354243
    • L
      Merge tag 'for-linus-20190127' of git://git.kernel.dk/linux-block · 419967d5
      Linus Torvalds 提交于
      Pull block revert from Jens Axboe:
       "Silly error snuck into a patch from the last series, let's do a revert
        to avoid a potential use-after-free"
      
      * tag 'for-linus-20190127' of git://git.kernel.dk/linux-block:
        Revert "block: cover another queue enter recursion via BIO_QUEUE_ENTERED"
      419967d5
    • L
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 1fc7f56d
      Linus Torvalds 提交于
      Pull KVM fixes from Paolo Bonzini:
       "Quite a few fixes for x86: nested virtualization save/restore, AMD
        nested virtualization and virtual APIC, 32-bit fixes, an important fix
        to restore operation on older processors, and a bunch of hyper-v
        bugfixes. Several are marked stable.
      
        There are also fixes for GCC warnings and for a GCC/objtool interaction"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: x86: Mark expected switch fall-throughs
        KVM: x86: fix TRACE_INCLUDE_PATH and remove -I. header search paths
        KVM: selftests: check returned evmcs version range
        x86/kvm/hyper-v: nested_enable_evmcs() sets vmcs_version incorrectly
        KVM: VMX: Move vmx_vcpu_run()'s VM-Enter asm blob to a helper function
        kvm: selftests: Fix region overlap check in kvm_util
        kvm: vmx: fix some -Wmissing-prototypes warnings
        KVM: nSVM: clear events pending from svm_complete_interrupts() when exiting to L1
        svm: Fix AVIC incomplete IPI emulation
        svm: Add warning message for AVIC IPI invalid target
        KVM: x86: WARN_ONCE if sending a PV IPI returns a fatal error
        KVM: x86: Fix PV IPIs for 32-bit KVM host
        x86/kvm/hyper-v: recommend using eVMCS only when it is enabled
        x86/kvm/hyper-v: don't recommend doing reset via synthetic MSR
        kvm: x86/vmx: Use kzalloc for cached_vmcs12
        KVM: VMX: Use the correct field var when clearing VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL
        KVM: x86: Fix single-step debugging
        x86/kvm/hyper-v: don't announce GUEST IDLE MSR support
      1fc7f56d
    • L
      Merge tag 'dma-mapping-5.0-2' of git://git.infradead.org/users/hch/dma-mapping · c180f1b0
      Linus Torvalds 提交于
      Pull dma-mapping fix from Christoph Hellwig:
       "Fix a xen-swiotlb regression on arm64"
      
      * tag 'dma-mapping-5.0-2' of git://git.infradead.org/users/hch/dma-mapping:
        arm64/xen: fix xen-swiotlb cache flushing
      c180f1b0
    • L
      Merge tag 'libnvdimm-fixes-5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm · 6a2651b5
      Linus Torvalds 提交于
      Pull libnvdimm fixes from Dan Williams:
       "A fix for namespace label support for non-Intel NVDIMMs that implement
        the ACPI standard label method.
      
        This has apparently never worked and could wait for v5.1. However it
        has enough visibility with hardware vendors [1] and distro bug
        trackers [2], and low enough risk that I decided it should go in for
        -rc4. The other fixups target the new, for v5.0, nvdimm security
        functionality. The larger init path fixup closes a memory leak and a
        potential userspace lockup due to missed notifications.
      
          [1] https://github.com/pmem/ndctl/issues/78
          [2] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1811785
      
        These have all soaked in -next for a week with no reported issues.
      
        Summary:
      
         - Fix support for NVDIMMs that implement the ACPI standard label
           methods.
      
         - Fix error handling for security overwrite (memory leak / userspace
           hang condition), and another one-line security cleanup"
      
      * tag 'libnvdimm-fixes-5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
        acpi/nfit: Fix command-supported detection
        acpi/nfit: Block function zero DSMs
        libnvdimm/security: Require nvdimm_security_setup_events() to succeed
        nfit_test: fix security state pull for nvdimm security nfit_test
      6a2651b5