1. 16 Jun, 2020 23 commits
  2. 15 Jun, 2020 15 commits
    • nvme: retain split access workaround for capability reads · c2fe0cbc
      Committed by Ard Biesheuvel
      task #28557808
      
      [ Upstream commit 3a8ecc935efabdad106b5e06d07b150c394b4465 ]
      
      Commit 7fd8930f
      
        "nvme: add a common helper to read Identify Controller data"
      
      has re-introduced an issue that we have attempted to work around in the
      past, in commit a310acd7 ("NVMe: use split lo_hi_{read,write}q").
      
      The problem is that some PCIe NVMe controllers do not implement 64-bit
      outbound accesses correctly, which is why the commit above switched
      to using lo_hi_[read|write]q for all 64-bit BAR accesses occurring in
      the code.
      
      In the meantime, the NVMe subsystem has been refactored, and now calls
      into the PCIe support layer for NVMe via a .reg_read64() method, which
      fails to use lo_hi_readq(), and thus reintroduces the problem that the
      workaround above aimed to address.
      
      Given that, at the moment, .reg_read64() is only used to read the
      capability register [which is known to tolerate split reads], let's
      switch .reg_read64() to lo_hi_readq() as well.
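
The split read described above can be sketched in plain C. This is a minimal user-space illustration, not the kernel's lo_hi_readq(): the names demo_readl and demo_lo_hi_readq are hypothetical, and an ordinary array stands in for the memory-mapped BAR.

```c
#include <stdint.h>

/* Hypothetical 32-bit register read; stands in for readl() on an MMIO BAR. */
static uint32_t demo_readl(const volatile uint32_t *addr)
{
    return *addr;
}

/*
 * Read a 64-bit register as two 32-bit accesses, low word first, so the
 * device never sees a single 64-bit access.
 */
static uint64_t demo_lo_hi_readq(const volatile uint32_t *addr)
{
    uint32_t lo = demo_readl(addr);
    uint32_t hi = demo_readl(addr + 1);

    return ((uint64_t)hi << 32) | lo;
}
```

The two halves are not read atomically, which is acceptable only for registers that tolerate split reads, as the message says of the capability register.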
      
      This fixes a boot issue on some ARM boxes with NVMe behind a Synopsys
      DesignWare PCIe host controller.
      
      Fixes: 7fd8930f ("nvme: add a common helper to read Identify Controller data")
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • nvme: Discard workaround for non-conformant devices · fd00911e
      Committed by Eduard Hasenleithner
      task #28557808
      
      [ Upstream commit 530436c45ef2e446c12538a400e465929a0b3ade ]
      
      Users observe IOMMU-related errors when performing discard on NVMe,
      caused by non-compliant NVMe devices reading beyond the end of the
      DMA-mapped ranges to discard.
      
      Two different variants of this behavior have been observed: SM22XX
      controllers round up the read size to a multiple of 512 bytes, and Phison
      E12 unconditionally reads the maximum discard size allowed by the spec
      (256 segments or 4kB).
      
      Make nvme_setup_discard() unconditionally allocate the maximum DSM buffer
      so that the range the driver DMA-maps always covers these over-sized reads.
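
As a rough sketch of the sizing rule, the buffer size can be made independent of the number of ranges actually in use. The 16-byte descriptor layout below is an assumption derived from the "256 segments or 4kB" figure in the message, and all demo_ names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_DSM_MAX_RANGES 256  /* maximum ranges per DSM command, per the text */

/* One range descriptor; 16 bytes is assumed from 4 kB / 256 ranges. */
struct demo_dsm_range {
    uint32_t cattr;
    uint32_t nlb;
    uint64_t slba;
};

/* Always size the buffer for the maximum, whatever the caller needs. */
static size_t demo_dsm_alloc_size(unsigned int nr_ranges)
{
    (void)nr_ranges; /* deliberately ignored: the device may read it all */
    return sizeof(struct demo_dsm_range) * DEMO_DSM_MAX_RANGES;
}
```

A device that rounds its read up to 512 bytes, or that always reads the full 4 kB, then stays inside the mapped range.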
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=202665 many
      Signed-off-by: Eduard Hasenleithner <eduard@hasenleithner.at>
      [changelog, use existing define, kernel coding style]
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • dm multipath: use updated MPATHF_QUEUE_IO on mapping for bio-based mpath · 533c5ba8
      Committed by Gabriel Krisman Bertazi
      task #28557827
      
      commit 5686dee34dbfe0238c0274e0454fa0174ac0a57a upstream.
      
      When adding devices that don't have a scsi_dh on a BIO based multipath,
      I was able to consistently hit the warning below and lock-up the system.
      
      The problem is that __map_bio reads the flag before it is potentially
      modified by choose_pgpath, and ends up using the older value.
      
      The WARN_ON below is not trivially linked to the issue. It goes like
      this: The activate_path delayed_work is not initialized for non-scsi_dh
      devices, but we always set MPATHF_QUEUE_IO, asking for initialization.
      That is fine, since MPATHF_QUEUE_IO would be cleared in choose_pgpath.
      Nevertheless, only for BIO-based mpath, we cache the flag before calling
      choose_pgpath, and use the older version when deciding if we should
      initialize the path.  Therefore, we end up trying to initialize the
      paths, and calling the non-initialized activate_path work.
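
The ordering problem can be illustrated with a small model. This is a simplified sketch under stated assumptions, not dm-mpath code: the struct, the field names and the rule that the flag is cleared when no hardware handler is present are all hypothetical stand-ins for MPATHF_QUEUE_IO and choose_pgpath.

```c
#include <stdbool.h>

struct demo_mpath {
    bool queue_io;        /* stands in for MPATHF_QUEUE_IO */
    bool has_hw_handler;  /* false for non-scsi_dh devices */
};

/* Path selection clears the flag when no path initialization is needed. */
static void demo_choose_pgpath(struct demo_mpath *m)
{
    if (!m->has_hw_handler)
        m->queue_io = false;
}

/* Buggy order: the flag is cached before path selection can clear it. */
static bool demo_map_stale(struct demo_mpath *m)
{
    bool queue_io = m->queue_io;

    demo_choose_pgpath(m);
    return queue_io;  /* true means "go and initialize the paths" */
}

/* Fixed order: read the flag only after path selection has run. */
static bool demo_map_fresh(struct demo_mpath *m)
{
    demo_choose_pgpath(m);
    return m->queue_io;
}
```

With the stale read, a device without a hardware handler still requests path initialization, which is the step that ends up queueing the uninitialized work item.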
      
      [   82.437100] ------------[ cut here ]------------
      [   82.437659] WARNING: CPU: 3 PID: 602 at kernel/workqueue.c:1624
        __queue_delayed_work+0x71/0x90
      [   82.438436] Modules linked in:
      [   82.438911] CPU: 3 PID: 602 Comm: systemd-udevd Not tainted 5.6.0-rc6+ #339
      [   82.439680] RIP: 0010:__queue_delayed_work+0x71/0x90
      [   82.440287] Code: c1 48 89 4a 50 81 ff 00 02 00 00 75 2a 4c 89 cf e9
      94 d6 07 00 e9 7f e9 ff ff 0f 0b eb c7 0f 0b 48 81 7a 58 40 74 a8 94 74
      a7 <0f> 0b 48 83 7a 48 00 74 a5 0f 0b eb a1 89 fe 4c 89 cf e9 c8 c4 07
      [   82.441719] RSP: 0018:ffffb738803977c0 EFLAGS: 00010007
      [   82.442121] RAX: ffffa086389f9740 RBX: 0000000000000002 RCX: 0000000000000000
      [   82.442718] RDX: ffffa086350dd930 RSI: ffffa0863d76f600 RDI: 0000000000000200
      [   82.443484] RBP: 0000000000000200 R08: 0000000000000000 R09: ffffa086350dd970
      [   82.444128] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa086350dd930
      [   82.444773] R13: ffffa0863d76f600 R14: 0000000000000000 R15: ffffa08636738008
      [   82.445427] FS:  00007f6abfe9dd40(0000) GS:ffffa0863dd80000(0000) knlGS:00000
      [   82.446040] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   82.446478] CR2: 0000557d288db4e8 CR3: 0000000078b36000 CR4: 00000000000006e0
      [   82.447104] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   82.447561] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [   82.448012] Call Trace:
      [   82.448164]  queue_delayed_work_on+0x6d/0x80
      [   82.448472]  __pg_init_all_paths+0x7b/0xf0
      [   82.448714]  pg_init_all_paths+0x26/0x40
      [   82.448980]  __multipath_map_bio.isra.0+0x84/0x210
      [   82.449267]  __map_bio+0x3c/0x1f0
      [   82.449468]  __split_and_process_non_flush+0x14a/0x1b0
      [   82.449775]  __split_and_process_bio+0xde/0x340
      [   82.450045]  ? dm_get_live_table+0x5/0xb0
      [   82.450278]  dm_process_bio+0x98/0x290
      [   82.450518]  dm_make_request+0x54/0x120
      [   82.450778]  generic_make_request+0xd2/0x3e0
      [   82.451038]  ? submit_bio+0x3c/0x150
      [   82.451278]  submit_bio+0x3c/0x150
      [   82.451492]  mpage_readpages+0x129/0x160
      [   82.451756]  ? bdev_evict_inode+0x1d0/0x1d0
      [   82.452033]  read_pages+0x72/0x170
      [   82.452260]  __do_page_cache_readahead+0x1ba/0x1d0
      [   82.452624]  force_page_cache_readahead+0x96/0x110
      [   82.452903]  generic_file_read_iter+0x84f/0xae0
      [   82.453192]  ? __seccomp_filter+0x7c/0x670
      [   82.453547]  new_sync_read+0x10e/0x190
      [   82.453883]  vfs_read+0x9d/0x150
      [   82.454172]  ksys_read+0x65/0xe0
      [   82.454466]  do_syscall_64+0x4e/0x210
      [   82.454828]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [...]
      [   82.462501] ---[ end trace bb39975e9cf45daa ]---
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • dm: fix potential for q->make_request_fn NULL pointer · e572e842
      Committed by Mike Snitzer
      task #28557827
      
      commit 47ace7e012b9f7ad71d43ac9063d335ea3d6820b upstream.
      
      Move blk_queue_make_request() to dm.c:alloc_dev() so that
      q->make_request_fn is never NULL during the lifetime of a DM device
      (even one that is created without a DM table).
      
      Otherwise generic_make_request() will crash simply by doing:
        dmsetup create -n test
        mount /dev/dm-N /mnt
      
      While at it, move ->congested_data initialization out of
      dm.c:alloc_dev() and into the bio-based specific init method.

      Reported-by: Stefan Bader <stefan.bader@canonical.com>
      BugLink: https://bugs.launchpad.net/bugs/1860231
      Fixes: ff36ab34 ("dm: remove request-based logic from make_request_fn wrapper")
      Depends-on: c12c9a3c ("dm: various cleanups to md->queue initialization code")
      Cc: stable@vger.kernel.org
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • dm crypt: fix benbi IV constructor crash if used in authenticated mode · e76e922c
      Committed by Milan Broz
      task #28557827
      
      commit 4ea9471fbd1addb25a4d269991dc724e200ca5b5 upstream.
      
      If benbi IV is used in AEAD construction, for example:
        cryptsetup luksFormat <device> --cipher twofish-xts-benbi --key-size 512 --integrity=hmac-sha256
      the constructor uses the wrong skcipher function and crashes:
      
       BUG: kernel NULL pointer dereference, address: 00000014
       ...
       EIP: crypt_iv_benbi_ctr+0x15/0x70 [dm_crypt]
       Call Trace:
        ? crypt_subkey_size+0x20/0x20 [dm_crypt]
        crypt_ctr+0x567/0xfc0 [dm_crypt]
        dm_table_add_target+0x15f/0x340 [dm_mod]
      
      Fix this by properly using crypt_aead_blocksize() in this case.
      
      Fixes: ef43aa38 ("dm crypt: add cryptographic data integrity protection (authenticated encryption)")
      Cc: stable@vger.kernel.org # v4.12+
      Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=941051
      Reported-by: Jerad Simpson <jbsimpson@gmail.com>
      Signed-off-by: Milan Broz <gmazyland@gmail.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • dm space map common: fix to ensure new block isn't already in use · d78b9658
      Committed by Joe Thornber
      task #28557827
      
      commit 4feaef830de7ffdd8352e1fe14ad3bf13c9688f8 upstream.
      
      The space-maps track the reference counts for disk blocks allocated by
      both the thin-provisioning and cache targets.  There are variants for
      tracking metadata blocks and data blocks.
      
      Transactionality is implemented by never touching blocks from the
      previous transaction, so we can roll back in the event of a crash.
      
      When allocating a new block we need to ensure the block is free (has
      reference count of 0) in both the current and previous transaction.
      Prior to this fix we were doing this by searching for a free block in
      the previous transaction, and relying on a 'begin' counter to track
      where the last allocation in the current transaction was.  This
      'begin' field was not being updated in all code paths (eg, increment
      of a data block reference count due to breaking sharing of a neighbour
      block in the same btree leaf).
      
      This fix keeps the 'begin' field, but now it's just a hint to speed up
      the search.  Instead the current transaction is searched for a free
      block, and then the old transaction is double checked to ensure it's
      free.  Much simpler.
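
The search described above might look roughly like this. It is a sketch over two plain reference-count arrays, not the dm-space-map code: the demo_new_block name, the array representation and the wrap-around from the hint are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Find a block whose reference count is 0 in both the current and the
 * previous transaction. 'begin' is only a hint for where to start.
 * Returns 0 and stores the index in *out, or -1 if no block qualifies.
 */
static int demo_new_block(const uint32_t *cur, const uint32_t *old,
                          size_t nr_blocks, size_t begin, size_t *out)
{
    for (size_t n = 0; n < nr_blocks; n++) {
        size_t b = (begin + n) % nr_blocks;

        if (cur[b] != 0)
            continue;   /* in use in the current transaction */
        if (old[b] != 0)
            continue;   /* still referenced by the previous transaction */
        *out = b;
        return 0;
    }
    return -1;
}
```

Because every candidate is checked against both transactions, a stale hint can only cost time and can no longer hand out a block that is already in use.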
      
      This fixes reports of sm_disk_new_block()'s BUG_ON() triggering when
      DM thin-provisioning's snapshots are heavily used.

      Reported-by: Eric Wheeler <dm-devel@lists.ewheeler.net>
      Cc: stable@vger.kernel.org
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • virtio-blk: handle block_device_operations callbacks after hot unplug · ef05ab79
      Committed by Stefan Hajnoczi
      task #28557821
      
      [ Upstream commit 90b5feb8c4bebc76c27fcaf3e1a0e5ca2d319e9e ]
      
      A userspace process holding a file descriptor to a virtio_blk device can
      still invoke block_device_operations after hot unplug.  This leads to a
      use-after-free accessing vblk->vdev in virtblk_getgeo() when
      ioctl(HDIO_GETGEO) is invoked:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000090
        IP: [<ffffffffc00e5450>] virtio_check_driver_offered_feature+0x10/0x90 [virtio]
        PGD 800000003a92f067 PUD 3a930067 PMD 0
        Oops: 0000 [#1] SMP
        CPU: 0 PID: 1310 Comm: hdio-getgeo Tainted: G           OE  ------------   3.10.0-1062.el7.x86_64 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        task: ffff9be5fbfb8000 ti: ffff9be5fa890000 task.ti: ffff9be5fa890000
        RIP: 0010:[<ffffffffc00e5450>]  [<ffffffffc00e5450>] virtio_check_driver_offered_feature+0x10/0x90 [virtio]
        RSP: 0018:ffff9be5fa893dc8  EFLAGS: 00010246
        RAX: ffff9be5fc3f3400 RBX: ffff9be5fa893e30 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff9be5fbc10b40
        RBP: ffff9be5fa893dc8 R08: 0000000000000301 R09: 0000000000000301
        R10: 0000000000000000 R11: 0000000000000000 R12: ffff9be5fdc24680
        R13: ffff9be5fbc10b40 R14: ffff9be5fbc10480 R15: 0000000000000000
        FS:  00007f1bfb968740(0000) GS:ffff9be5ffc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000090 CR3: 000000003a894000 CR4: 0000000000360ff0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         [<ffffffffc016ac37>] virtblk_getgeo+0x47/0x110 [virtio_blk]
         [<ffffffff8d3f200d>] ? handle_mm_fault+0x39d/0x9b0
         [<ffffffff8d561265>] blkdev_ioctl+0x1f5/0xa20
         [<ffffffff8d488771>] block_ioctl+0x41/0x50
         [<ffffffff8d45d9e0>] do_vfs_ioctl+0x3a0/0x5a0
         [<ffffffff8d45dc81>] SyS_ioctl+0xa1/0xc0
      
      A related problem is that virtblk_remove() leaks the vd_index_ida index
      when something still holds a reference to vblk->disk during hot unplug.
      This causes virtio-blk device names to be lost (vda, vdb, etc).
      
      Fix these issues by protecting vblk->vdev with a mutex and reference
      counting vblk so the vd_index_ida index can be removed in all cases.
      
      Fixes: 48e4043d ("virtio: add virtio disk geometry feature")
      Reported-by: Lance Digby <ldigby@redhat.com>
      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
      Link: https://lore.kernel.org/r/20200430140442.171016-1-stefanha@redhat.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • virtio-blk: improve virtqueue error to BLK_STS · e322fee8
      Committed by Halil Pasic
      task #28557821
      
      [ Upstream commit 3d973b2e9a625996ee997c7303cd793b9d197c65 ]
      
      Change the mapping from virtqueue_add() errors to BLK_STS statuses so
      that -ENOSPC, which indicates a full virtqueue, is still mapped to
      BLK_STS_DEV_RESOURCE, while -ENOMEM, which indicates a non-device-specific
      resource outage, is mapped to BLK_STS_RESOURCE.
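
The mapping can be written as a small switch. This is a sketch only: the enum is a stand-in for the kernel's blk_status_t with arbitrary values, the function name is hypothetical, and treating every other error as an I/O error is an assumption not stated in the message.

```c
#include <errno.h>

/* Stand-ins for blk_status_t values; the numeric values are arbitrary. */
enum demo_blk_sts {
    DEMO_BLK_STS_IOERR,
    DEMO_BLK_STS_RESOURCE,
    DEMO_BLK_STS_DEV_RESOURCE,
};

static enum demo_blk_sts demo_virtqueue_err_to_sts(int err)
{
    switch (err) {
    case -ENOSPC:   /* virtqueue full: a device-specific resource */
        return DEMO_BLK_STS_DEV_RESOURCE;
    case -ENOMEM:   /* global, non-device-specific outage */
        return DEMO_BLK_STS_RESOURCE;
    default:        /* assumption: anything else is a plain I/O error */
        return DEMO_BLK_STS_IOERR;
    }
}
```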

      Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
      Link: https://lore.kernel.org/r/20200213123728.61216-3-pasic@linux.ibm.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • virtio-blk: fix hw_queue stopped on arbitrary error · c93ef89d
      Committed by Halil Pasic
      task #28557821
      
      commit f5f6b95c72f7f8bb46eace8c5306c752d0133daa upstream.
      
      Since nobody else is going to restart our hw_queue for us, the
      blk_mq_start_stopped_hw_queues() call in virtblk_done() is not
      necessarily sufficient to ensure that the queue will get started again.
      In case of a global resource outage (-ENOMEM because of a mapping
      failure, e.g. because swiotlb is full) our virtqueue may be empty and we
      can get stuck with a stopped hw_queue.
      
      Let us not stop the queue on arbitrary errors, but only on -ENOSPC, which
      indicates a full virtqueue, where the hw_queue is guaranteed to get
      started by virtblk_done() when it makes sense to carry on submitting
      requests. Let us also remove a stale comment.
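
The stop/restart rule can be modelled with a single flag. This is a toy model with hypothetical names, not the virtio-blk driver: demo_on_add_error stands for the submission error path and demo_on_done for the completion handler that restarts stopped queues.

```c
#include <errno.h>
#include <stdbool.h>

struct demo_hw_queue {
    bool stopped;
};

/* Submission error path: stop the queue only when the virtqueue is full. */
static void demo_on_add_error(struct demo_hw_queue *hq, int err)
{
    if (err == -ENOSPC)
        hq->stopped = true;  /* a completion is guaranteed to follow */
    /* Other errors leave the queue running; nothing would restart it. */
}

/* Completion path: a request finished, so there is room again. */
static void demo_on_done(struct demo_hw_queue *hq)
{
    hq->stopped = false;
}
```

A full virtqueue always has requests in flight, so a completion will arrive and restart the queue. An empty virtqueue that failed with -ENOMEM has no such guarantee.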

      Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Fixes: f7728002c1c7 ("virtio_ring: fix return code on DMA mapping fails")
      Link: https://lore.kernel.org/r/20200213123728.61216-2-pasic@linux.ibm.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • block, bfq: fix use-after-free in bfq_idle_slice_timer_body · baecb6b1
      Committed by Zhiqiang Liu
      task #28557799
      
      [ Upstream commit 2f95fa5c955d0a9987ffdc3a095e2f4e62c5f2a9 ]
      
      In bfq_idle_slice_timer(), the read bfqq = bfqd->in_service_queue is
      not inside the bfqd->lock critical section. The bfqq, which is not
      NULL in bfq_idle_slice_timer(), may be freed after being passed to
      bfq_idle_slice_timer_body(), so we end up accessing freed memory.
      
      In addition, since the bfqq may be subject to a race, we should
      first check whether the bfqq is in service before doing anything
      with it in bfq_idle_slice_timer_body(). If the racing bfqq is not
      in service, it has already been expired through __bfq_bfqq_expire(),
      and its wait_request flag has been cleared in
      __bfq_bfqd_reset_in_service(). So there is no need to re-clear the
      wait_request flag of a bfqq that is not in service.
      
      KASAN log is given as follows:
      [13058.354613] ==================================================================
      [13058.354640] BUG: KASAN: use-after-free in bfq_idle_slice_timer+0xac/0x290
      [13058.354644] Read of size 8 at addr ffffa02cf3e63f78 by task fork13/19767
      [13058.354646]
      [13058.354655] CPU: 96 PID: 19767 Comm: fork13
      [13058.354661] Call trace:
      [13058.354667]  dump_backtrace+0x0/0x310
      [13058.354672]  show_stack+0x28/0x38
      [13058.354681]  dump_stack+0xd8/0x108
      [13058.354687]  print_address_description+0x68/0x2d0
      [13058.354690]  kasan_report+0x124/0x2e0
      [13058.354697]  __asan_load8+0x88/0xb0
      [13058.354702]  bfq_idle_slice_timer+0xac/0x290
      [13058.354707]  __hrtimer_run_queues+0x298/0x8b8
      [13058.354710]  hrtimer_interrupt+0x1b8/0x678
      [13058.354716]  arch_timer_handler_phys+0x4c/0x78
      [13058.354722]  handle_percpu_devid_irq+0xf0/0x558
      [13058.354731]  generic_handle_irq+0x50/0x70
      [13058.354735]  __handle_domain_irq+0x94/0x110
      [13058.354739]  gic_handle_irq+0x8c/0x1b0
      [13058.354742]  el1_irq+0xb8/0x140
      [13058.354748]  do_wp_page+0x260/0xe28
      [13058.354752]  __handle_mm_fault+0x8ec/0x9b0
      [13058.354756]  handle_mm_fault+0x280/0x460
      [13058.354762]  do_page_fault+0x3ec/0x890
      [13058.354765]  do_mem_abort+0xc0/0x1b0
      [13058.354768]  el0_da+0x24/0x28
      [13058.354770]
      [13058.354773] Allocated by task 19731:
      [13058.354780]  kasan_kmalloc+0xe0/0x190
      [13058.354784]  kasan_slab_alloc+0x14/0x20
      [13058.354788]  kmem_cache_alloc_node+0x130/0x440
      [13058.354793]  bfq_get_queue+0x138/0x858
      [13058.354797]  bfq_get_bfqq_handle_split+0xd4/0x328
      [13058.354801]  bfq_init_rq+0x1f4/0x1180
      [13058.354806]  bfq_insert_requests+0x264/0x1c98
      [13058.354811]  blk_mq_sched_insert_requests+0x1c4/0x488
      [13058.354818]  blk_mq_flush_plug_list+0x2d4/0x6e0
      [13058.354826]  blk_flush_plug_list+0x230/0x548
      [13058.354830]  blk_finish_plug+0x60/0x80
      [13058.354838]  read_pages+0xec/0x2c0
      [13058.354842]  __do_page_cache_readahead+0x374/0x438
      [13058.354846]  ondemand_readahead+0x24c/0x6b0
      [13058.354851]  page_cache_sync_readahead+0x17c/0x2f8
      [13058.354858]  generic_file_buffered_read+0x588/0xc58
      [13058.354862]  generic_file_read_iter+0x1b4/0x278
      [13058.354965]  ext4_file_read_iter+0xa8/0x1d8 [ext4]
      [13058.354972]  __vfs_read+0x238/0x320
      [13058.354976]  vfs_read+0xbc/0x1c0
      [13058.354980]  ksys_read+0xdc/0x1b8
      [13058.354984]  __arm64_sys_read+0x50/0x60
      [13058.354990]  el0_svc_common+0xb4/0x1d8
      [13058.354994]  el0_svc_handler+0x50/0xa8
      [13058.354998]  el0_svc+0x8/0xc
      [13058.354999]
      [13058.355001] Freed by task 19731:
      [13058.355007]  __kasan_slab_free+0x120/0x228
      [13058.355010]  kasan_slab_free+0x10/0x18
      [13058.355014]  kmem_cache_free+0x288/0x3f0
      [13058.355018]  bfq_put_queue+0x134/0x208
      [13058.355022]  bfq_exit_icq_bfqq+0x164/0x348
      [13058.355026]  bfq_exit_icq+0x28/0x40
      [13058.355030]  ioc_exit_icq+0xa0/0x150
      [13058.355035]  put_io_context_active+0x250/0x438
      [13058.355038]  exit_io_context+0xd0/0x138
      [13058.355045]  do_exit+0x734/0xc58
      [13058.355050]  do_group_exit+0x78/0x220
      [13058.355054]  __wake_up_parent+0x0/0x50
      [13058.355058]  el0_svc_common+0xb4/0x1d8
      [13058.355062]  el0_svc_handler+0x50/0xa8
      [13058.355066]  el0_svc+0x8/0xc
      [13058.355067]
      [13058.355071] The buggy address belongs to the object at ffffa02cf3e63e70
                      which belongs to the cache bfq_queue of size 464
      [13058.355075] The buggy address is located 264 bytes inside of
                      464-byte region [ffffa02cf3e63e70, ffffa02cf3e64040)
      [13058.355077] The buggy address belongs to the page:
      [13058.355083] page:ffff7e80b3cf9800 count:1 mapcount:0 mapping:ffff802db5c90780 index:0xffffa02cf3e606f0 compound_mapcount: 0
      [13058.366175] flags: 0x2ffffe0000008100(slab|head)
      [13058.370781] raw: 2ffffe0000008100 ffff7e80b53b1408 ffffa02d730c1c90 ffff802db5c90780
      [13058.370787] raw: ffffa02cf3e606f0 0000000000370023 00000001ffffffff 0000000000000000
      [13058.370789] page dumped because: kasan: bad access detected
      [13058.370791]
      [13058.370792] Memory state around the buggy address:
      [13058.370797]  ffffa02cf3e63e00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fb fb
      [13058.370801]  ffffa02cf3e63e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [13058.370805] >ffffa02cf3e63f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [13058.370808]                                                                 ^
      [13058.370811]  ffffa02cf3e63f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [13058.370815]  ffffa02cf3e64000: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
      [13058.370817] ==================================================================
      [13058.370820] Disabling lock debugging due to kernel taint
      
      Here, we directly pass the bfqd to bfq_idle_slice_timer_body func.
      --
      V2->V3: rewrite the comment as suggested by Paolo Valente
      V1->V2: add one comment, and add Fixes and Reported-by tag.
      
      Fixes: aee69d78 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
      Acked-by: Paolo Valente <paolo.valente@linaro.org>
      Reported-by: Wang Wang <wangwang2@huawei.com>
      Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
      Signed-off-by: Feilong Lin <linfeilong@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • block: Fix use-after-free issue accessing struct io_cq · fba123ba
      Committed by Sahitya Tummala
      task #28557799
      
      [ Upstream commit 30a2da7b7e225ef6c87a660419ea04d3cef3f6a7 ]
      
      There is a potential race between ioc_release_fn() and
      ioc_clear_queue(), shown below, which leads to the kernel crash
      that follows. It can also result in a use-after-free.
      
      context#1:				context#2:
      ioc_release_fn()			__ioc_clear_queue() gets the same icq
      ->spin_lock(&ioc->lock);		->spin_lock(&ioc->lock);
      ->ioc_destroy_icq(icq);
        ->list_del_init(&icq->q_node);
        ->call_rcu(&icq->__rcu_head,
        	icq_free_icq_rcu);
      ->spin_unlock(&ioc->lock);
      					->ioc_destroy_icq(icq);
      					  ->hlist_del_init(&icq->ioc_node);
      					  This results into below crash as this memory
      					  is now used by icq->__rcu_head in context#1.
      					  There is a chance that icq could be free'd
      					  as well.
      
      22150.386550:   <6> Unable to handle kernel write to read-only memory
      at virtual address ffffffaa8d31ca50
      ...
      Call trace:
      22150.607350:   <2>  ioc_destroy_icq+0x44/0x110
      22150.611202:   <2>  ioc_clear_queue+0xac/0x148
      22150.615056:   <2>  blk_cleanup_queue+0x11c/0x1a0
      22150.619174:   <2>  __scsi_remove_device+0xdc/0x128
      22150.623465:   <2>  scsi_forget_host+0x2c/0x78
      22150.627315:   <2>  scsi_remove_host+0x7c/0x2a0
      22150.631257:   <2>  usb_stor_disconnect+0x74/0xc8
      22150.635371:   <2>  usb_unbind_interface+0xc8/0x278
      22150.639665:   <2>  device_release_driver_internal+0x198/0x250
      22150.644897:   <2>  device_release_driver+0x24/0x30
      22150.649176:   <2>  bus_remove_device+0xec/0x140
      22150.653204:   <2>  device_del+0x270/0x460
      22150.656712:   <2>  usb_disable_device+0x120/0x390
      22150.660918:   <2>  usb_disconnect+0xf4/0x2e0
      22150.664684:   <2>  hub_event+0xd70/0x17e8
      22150.668197:   <2>  process_one_work+0x210/0x480
      22150.672222:   <2>  worker_thread+0x32c/0x4c8
      
      Fix this by adding a new ICQ_DESTROYED flag, set in ioc_destroy_icq(), to
      indicate that this icq has already been marked as destroyed. Also, ensure
      __ioc_clear_queue() accesses the icq within rcu_read_lock/unlock so
      that the icq cannot be freed while it is still in use.
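
The "destroy at most once" idea behind the flag can be sketched as follows. This is a single-threaded simplification with hypothetical names and an arbitrary bit value: it checks and sets the flag in one helper and omits the ioc->lock and RCU protection that the real code relies on, and where the check sits in the actual patch is not shown in this message.

```c
#include <stdbool.h>

#define DEMO_ICQ_DESTROYED (1u << 2)  /* bit position is arbitrary here */

struct demo_icq {
    unsigned int flags;
    int teardown_count;  /* counts how often teardown really ran */
};

/* Returns true if this call performed the teardown. */
static bool demo_destroy_icq(struct demo_icq *icq)
{
    if (icq->flags & DEMO_ICQ_DESTROYED)
        return false;  /* another path already destroyed it */

    icq->flags |= DEMO_ICQ_DESTROYED;
    icq->teardown_count++;  /* stands in for list removal and call_rcu() */
    return true;
}
```

The second caller sees the flag and backs off instead of unlinking the icq again.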

      Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
      Co-developed-by: Pradeep P V K <ppvk@codeaurora.org>
      Signed-off-by: Pradeep P V K <ppvk@codeaurora.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • block: keep bdi->io_pages in sync with max_sectors_kb for stacked devices · dc938b41
      Committed by Konstantin Khlebnikov
      task #28557799
      
      [ Upstream commit e74d93e96d721c4297f2a900ad0191890d2fc2b0 ]
      
      Field bdi->io_pages added in commit 9491ae4a ("mm: don't cap request
      size based on read-ahead setting") removes unneeded split of read requests.
      
      Stacked drivers do not call blk_queue_max_hw_sectors(). Instead they set
      limits of their devices by blk_set_stacking_limits() + disk_stack_limits().
      Field bdi->io_pages stays zero until the user sets max_sectors_kb via sysfs.
      
      This patch updates io_pages after merging limits in disk_stack_limits().
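
As a rough illustration of the update, io_pages is the sector limit converted to pages. The sketch assumes 512-byte sectors and 4 KiB pages; the macro and function names are hypothetical.

```c
#define DEMO_SECTOR_SHIFT 9   /* 512-byte sectors */
#define DEMO_PAGE_SHIFT   12  /* assumes 4 KiB pages */

/* Convert a max_sectors limit into a number of whole pages. */
static unsigned long demo_io_pages(unsigned int max_sectors)
{
    return (unsigned long)max_sectors >> (DEMO_PAGE_SHIFT - DEMO_SECTOR_SHIFT);
}
```

Under these assumptions a limit of 2560 sectors (1280 KiB) gives 320 pages.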
      
      Commit c6d6e9b0f6b4 ("dm: do not allow readahead to limit IO size") fixed
      the same problem for device-mapper devices, this one fixes MD RAIDs.
      
      Fixes: 9491ae4a ("mm: don't cap request size based on read-ahead setting")
      Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • block, bfq: fix overwrite of bfq_group pointer in bfq_find_set_group() · 8cf52b47
      Committed by Carlo Nonato
      task #28557799
      
      [ Upstream commit 14afc59361976c0ba39e3a9589c3eaa43ebc7e1d ]
      
      The bfq_find_set_group() function takes as input a blkcg (which represents
      a cgroup) and retrieves the corresponding bfq_group, then it updates the
      bfq internal group hierarchy (see comments inside the function for why
      this is needed) and finally it returns the bfq_group.
      In the hierarchy update cycle, the pointer holding the correct bfq_group
      that has to be returned is mistakenly used to traverse the hierarchy
      bottom to top, meaning that in each iteration it gets overwritten with the
      parent of the current group. Since the update cycle stops at root's
      children (depth = 2), the overwrite becomes a problem only if the blkcg
      describes a cgroup at a hierarchy level deeper than that (depth > 2). In
      this case the root's child that happens to be also an ancestor of the
      correct bfq_group is returned. The main consequence is that processes
      contained in a cgroup at depth greater than 2 are wrongly placed in the
      group described above by BFQ.
      
      This commit fixes this problem by using a different bfq_group pointer in
      the update cycle in order to avoid the overwrite of the variable holding
      the original group reference.
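
The pattern of the fix, walking the hierarchy with a separate cursor so the result pointer is never overwritten, can be sketched like this. The struct and function are hypothetical stand-ins, not BFQ code.

```c
#include <stddef.h>

struct demo_group {
    struct demo_group *parent;  /* NULL for the root */
    int linked;                 /* set when the update cycle visits it */
};

/*
 * Walk from 'grp' up towards the root, marking each level, and return
 * 'grp' itself. The walk stops at the root's children.
 */
static struct demo_group *demo_find_set_group(struct demo_group *grp)
{
    struct demo_group *cursor = grp;  /* 'grp' is never reassigned */

    while (cursor && cursor->parent) {
        cursor->linked = 1;
        cursor = cursor->parent;
    }
    return grp;
}
```

Had the loop advanced grp instead of cursor, a group at depth 3 would come back as its ancestor at depth 1, which is the misplacement the message describes.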

      Reported-by: Kwon Je Oh <kwonje.oh2@gmail.com>
      Signed-off-by: Carlo Nonato <carlo.nonato95@gmail.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • block: fix an integer overflow in logical block size · 8b05616d
      Committed by Mikulas Patocka
      task #28557799
      
      commit ad6bf88a6c19a39fb3b0045d78ea880325dfcf15 upstream.
      
      Logical block size has type unsigned short. That means that it can be at
      most 32768. However, there are architectures that can run with 64k pages
      (for example arm64) and on these architectures, it may be possible to
      create block devices with 64k block size.
      
      For example (run this on an architecture with 64k pages):
      
      Mount will fail with this error because it tries to read the superblock using 2-sector
      access:
        device-mapper: writecache: I/O is not aligned, sector 2, size 1024, block size 65536
        EXT4-fs (dm-0): unable to read superblock
      
      This patch changes the logical block size from unsigned short to unsigned
      int to avoid the overflow.
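
The truncation is easy to reproduce in isolation. The sketch assumes a 16-bit unsigned short, which holds on common platforms but is not guaranteed by the C standard; the function names are hypothetical.

```c
/* Old field type: a 64k block size wraps around (with a 16-bit short). */
static unsigned short demo_store_short(unsigned int size)
{
    return (unsigned short)size;
}

/* New field type: the value is kept intact. */
static unsigned int demo_store_int(unsigned int size)
{
    return size;
}
```

Block sizes are powers of two, so 32768 is the largest one that fits in 16 bits, and 65536 is stored as 0.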
      
      Cc: stable@vger.kernel.org
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • Y
      block: fix memleak when __blk_rq_map_user_iov() is failed · a15ce925
      Yang Yingliang committed
      task #28557799
      
      [ Upstream commit 3b7995a98ad76da5597b488fa84aa5a56d43b608 ]
      
      While doing fuzz testing, I got this memleak report:
      
      BUG: memory leak
      unreferenced object 0xffff88837af80000 (size 4096):
        comm "memleak", pid 3557, jiffies 4294817681 (age 112.499s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          20 00 00 00 10 01 00 00 00 00 00 00 01 00 00 00   ...............
        backtrace:
          [<000000001c894df8>] bio_alloc_bioset+0x393/0x590
          [<000000008b139a3c>] bio_copy_user_iov+0x300/0xcd0
          [<00000000a998bd8c>] blk_rq_map_user_iov+0x2f1/0x5f0
          [<000000005ceb7f05>] blk_rq_map_user+0xf2/0x160
          [<000000006454da92>] sg_common_write.isra.21+0x1094/0x1870
          [<00000000064bb208>] sg_write.part.25+0x5d9/0x950
          [<000000004fc670f6>] sg_write+0x5f/0x8c
          [<00000000b0d05c7b>] __vfs_write+0x7c/0x100
          [<000000008e177714>] vfs_write+0x1c3/0x500
          [<0000000087d23f34>] ksys_write+0xf9/0x200
          [<000000002c8dbc9d>] do_syscall_64+0x9f/0x4f0
          [<00000000678d8e9a>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      If __blk_rq_map_user_iov() fails in blk_rq_map_user_iov(),
      the bio(s) allocated before the failure will leak. The
      refcount of each bio is initialized to 1 and increased to 2 by
      bio_get(), but __blk_rq_unmap_user() only decreases it to 1, so
      the bio cannot be freed. Fix it by calling blk_rq_unmap_user().
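
The refcount reasoning can be followed with a toy model. This sketch is illustrative only: `struct toy_bio` and the `toy_*` helpers are hypothetical, and it assumes that blk_rq_unmap_user() performs the inner unmap and then drops the remaining reference.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct bio: a reference count and a "freed" flag. */
struct toy_bio {
	int refcnt;
	bool freed;
};

/* Drop one reference; the object is freed when the count reaches zero. */
static void toy_bio_put(struct toy_bio *bio)
{
	assert(bio->refcnt > 0);
	if (--bio->refcnt == 0)
		bio->freed = true;
}

/* State from the message: initialized to 1, raised to 2 by bio_get(). */
struct toy_bio toy_bio_mapped(void)
{
	struct toy_bio bio = { .refcnt = 2, .freed = false };

	return bio;
}

/* Models the old error path: a single put, leaving the count at 1. */
void toy_unmap_inner(struct toy_bio *bio)
{
	toy_bio_put(bio);
}

/* Models the fixed error path (assumed): inner unmap plus a final put. */
void toy_unmap_full(struct toy_bio *bio)
{
	toy_unmap_inner(bio);
	toy_bio_put(bio);
}
```

In this model, a bio released through `toy_unmap_inner()` is left with one reference and is never freed, while `toy_unmap_full()` brings the count to zero.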
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      a15ce925
  3. 12 Jun, 2020 2 commits
    • J
      io_uring: check file O_NONBLOCK state for accept · e8758c9b
      Jiufei Xue committed
      task #27774850
      
      commit e697deed834de15d2322d0619d51893022c90ea2 upstream.
      
      If the socket is O_NONBLOCK, we should complete the accept request
      with -EAGAIN when data is not ready.
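
For comparison, the plain accept(2) system call already behaves this way on a non-blocking listener. The sketch below shows only that user-space behavior, not the io_uring code path; the helper name is hypothetical, and it assumes a Linux host where a loopback TCP socket can be created.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: call accept() on a fresh non-blocking listener with
 * no pending connection. Returns the resulting errno, 0 if a connection was
 * unexpectedly accepted, or -1 if the socket setup itself failed.
 */
int accept_errno_on_idle_nonblocking_listener(void)
{
	struct sockaddr_in addr;
	int fd, flags, ret, err;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = 0; /* let the kernel pick a free port */

	flags = fcntl(fd, F_GETFL, 0);
	if (flags < 0 ||
	    fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ||
	    bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    listen(fd, 1) < 0) {
		close(fd);
		return -1;
	}

	/* Nobody has connected, so a non-blocking accept() must not wait. */
	ret = accept(fd, NULL, NULL);
	err = (ret < 0) ? errno : 0;
	if (ret >= 0)
		close(ret);
	close(fd);
	return err;
}
```

On such a host the helper is expected to return EAGAIN (or EWOULDBLOCK, which has the same value on Linux), matching the completion code the commit asks io_uring to report.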
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      e8758c9b
    • J
      ext4: fix partial cluster initialization when splitting extent · b4a0105f
      Jeffle Xu committed
      fix #27212891
      
      Fix the bug when calculating the physical block number of the first
      block in the split extent.
      
      This bug occasionally causes xfstests shared/298 to fail on ext4 with
      bigalloc enabled. Ext4 error messages indicate that previously freed
      blocks are being freed again, and the following fsck fails because the
      block bitmap and the bg descriptor are inconsistent.
      
      The following is an example case:
      
      1. First, initialize an ext4 filesystem with cluster size '16K' and block
      size '4K', in which case one cluster contains four blocks.
      
      2. Create one file (e.g., xxx.img) on this ext4 filesystem. Now the extent
      tree of this file is like:
      
      ...
      36864:[0]4:220160
      36868:[0]14332:145408
      51200:[0]2:231424
      ...
      
      3. Then execute PUNCH_HOLE fallocate on this file. The hole range is
      like:
      
      ..
      ext4_ext_remove_space: dev 254,16 ino 12 since 49506 end 49506 depth 1
      ext4_ext_remove_space: dev 254,16 ino 12 since 49544 end 49546 depth 1
      ext4_ext_remove_space: dev 254,16 ino 12 since 49605 end 49607 depth 1
      ...
      
      4. Then the extent tree of this file after punching is like:
      
      ...
      49507:[0]37:158047
      49547:[0]58:158087
      ...
      
      5. Detailed procedure of punching hole [49544, 49546]
      
      5.1. The block address space:
      ```
      lblk        ~49505  49506   49507~49543     49544~49546    49547~
      	  ---------+------+-------------+----------------+--------
      	    extent | hole |   extent	|	hole	 | extent
      	  ---------+------+-------------+----------------+--------
      pblk       ~158045  158046  158047~158083  158084~158086   158087~
      ```
      
      5.2. The detailed layout of cluster 39521:
      ```
      		cluster 39521
      	<------------------------------->
      
      		hole		  extent
      	<----------------------><--------
      
      lblk      49544   49545   49546   49547
      	+-------+-------+-------+-------+
      	|	|	|	|	|
      	+-------+-------+-------+-------+
      pblk     158084  158085  158086  158087
      ```
      
      5.3. The ftrace output when punching hole [49544, 49546]:
      - ext4_ext_remove_space (start 49544, end 49546)
        - ext4_ext_rm_leaf (start 49544, end 49546, last_extent [49507(158047), 40], partial [pclu 39522 lblk 0 state 2])
          - ext4_remove_blocks (extent [49507(158047), 40], from 49544 to 49546, partial [pclu 39522 lblk 0 state 2])
            - ext4_free_blocks: (block 158084 count 4)
              - ext4_mballoc_free (extent 1/6753/1)
      
      5.4. Ext4 error message in dmesg:
      EXT4-fs error (device vdb): mb_free_blocks:1457: group 1, block 158084:freeing already freed block (bit 6753); block bitmap corrupt.
      EXT4-fs error (device vdb): ext4_mb_generate_buddy:747: group 1, block bitmap and bg descriptor inconsistent: 19550 vs 19551 free clusters
      
      In this case, the whole cluster 39521 is freed mistakenly when freeing
      pblock 158084~158086 (i.e., the first three blocks of this cluster),
      although pblock 158087 (the last remaining block of this cluster) has
      not been freed yet.
      
      The root cause of this issue is that the pclu of the partial cluster is
      calculated mistakenly in ext4_ext_remove_space(). The correct
      partial_cluster.pclu (i.e., the cluster number of the first block in the
      next extent, that is, lblock 49547 (pblock 158087)) should be 39521 rather
      than 39522.
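
The cluster arithmetic can be checked with a short sketch. The helpers below are hypothetical and are not the expression used in ext4_ext_remove_space(), which is not shown here. They assume four blocks per cluster, as in step 1, and take the extent [49507(158047), 40] from the trace in step 5.3.

```c
#include <assert.h>

/* Blocks per cluster in the example: 16K cluster / 4K block. */
#define BLOCKS_PER_CLUSTER 4ULL

/* Hypothetical helper: physical cluster that contains a physical block. */
unsigned long long pblk_to_pclu(unsigned long long pblk)
{
	return pblk / BLOCKS_PER_CLUSTER;
}

/*
 * Hypothetical helper: physical block backing lblk, counted from an extent
 * that starts at logical block ee_block and physical block ee_start.
 */
unsigned long long extent_pblk(unsigned long long ee_block,
			       unsigned long long ee_start,
			       unsigned long long lblk)
{
	assert(lblk >= ee_block);
	return ee_start + (lblk - ee_block);
}
```

With these helpers, `extent_pblk(49507, 158047, 49547)` gives 158087, and `pblk_to_pclu(158087)` gives 39521, the same cluster as pblock 158084. A physical block one higher, 158088, already falls in cluster 39522, which is consistent with the off-by-one pclu seen in the trace.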
      
      Fixes: f4226d9e ("ext4: fix partial cluster initialization")
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Reviewed-by: Eric Whitney <enwlinux@gmail.com>
      Cc: stable@kernel.org # v3.19+
      Link: https://lore.kernel.org/r/1590121124-37096-1-git-send-email-jefflexu@linux.alibaba.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      b4a0105f