1. 18 3月, 2020 40 次提交
    • J
      io_uring: ensure req->file is cleared on allocation · deec7877
      Jens Axboe 提交于
      commit 60c112b0ada09826cc4ae6a4e55df677f76f1313 upstream.
      
      Stephen reports:
      
      I hit the following General Protection Fault when testing io_uring via
      the io_uring engine in fio. This was on a VM running 5.2-rc5 and the
      latest version of fio. The issue occurs for both null_blk and fake NVMe
      drives. I have not tested bare metal or real NVMe SSDs. The fio script
      used is given below.
      
      [io_uring]
      time_based=1
      runtime=60
      filename=/dev/nvme2n1 (note /dev/nullb0 also fails)
      ioengine=io_uring
      bs=4k
      rw=readwrite
      direct=1
      fixedbufs=1
      sqthread_poll=1
      sqthread_poll_cpu=0
      
      general protection fault: 0000 [#1] SMP PTI
      CPU: 0 PID: 872 Comm: io_uring-sq Not tainted 5.2.0-rc5-cpacket-io-uring #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
      RIP: 0010:fput_many+0x7/0x90
      Code: 01 48 85 ff 74 17 55 48 89 e5 53 48 8b 1f e8 a0 f9 ff ff 48 85 db 48 89 df 75 f0 5b 5d f3 c3 0f 1f 40 00 0f 1f 44 00 00 89 f6 <f0> 48 29 77 38 74 01 c3 55 48 89 e5 53 48 89 fb 65 48 \
      
      RSP: 0018:ffffadeb817ebc50 EFLAGS: 00010246
      RAX: 0000000000000004 RBX: ffff8f46ad477480 RCX: 0000000000001805
      RDX: 0000000000000000 RSI: 0000000000000001 RDI: f18b51b9a39552b5
      RBP: ffffadeb817ebc58 R08: ffff8f46b7a318c0 R09: 000000000000015d
      R10: ffffadeb817ebce8 R11: 0000000000000020 R12: ffff8f46ad4cd000
      R13: 00000000fffffff7 R14: ffffadeb817ebe30 R15: 0000000000000004
      FS:  0000000000000000(0000) GS:ffff8f46b7a00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000055828f0bbbf0 CR3: 0000000232176004 CR4: 00000000003606f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       ? fput+0x13/0x20
       io_free_req+0x20/0x40
       io_put_req+0x1b/0x20
       io_submit_sqe+0x40a/0x680
       ? __switch_to_asm+0x34/0x70
       ? __switch_to_asm+0x40/0x70
       io_submit_sqes+0xb9/0x160
       ? io_submit_sqes+0xb9/0x160
       ? __switch_to_asm+0x40/0x70
       ? __switch_to_asm+0x34/0x70
       ? __schedule+0x3f2/0x6a0
       ? __switch_to_asm+0x34/0x70
       io_sq_thread+0x1af/0x470
       ? __switch_to_asm+0x34/0x70
       ? wait_woken+0x80/0x80
       ? __switch_to+0x85/0x410
       ? __switch_to_asm+0x40/0x70
       ? __switch_to_asm+0x34/0x70
       ? __schedule+0x3f2/0x6a0
       kthread+0x105/0x140
       ? io_submit_sqes+0x160/0x160
       ? kthread+0x105/0x140
       ? io_submit_sqes+0x160/0x160
       ? kthread_destroy_worker+0x50/0x50
       ret_from_fork+0x35/0x40
      
      which occurs because using a kernel side submission thread isn't valid
      without using fixed files (registered through io_uring_register()). This
      causes io_uring to put the request after logging an error, but before
      the file field is set in the request. If it happens to be non-zero, we
      attempt to fput() garbage.
      
      Fix this by ensuring that req->file is initialized when the request is
      allocated.
      
      Cc: stable@vger.kernel.org # 5.1+
      Reported-by: NStephen Bates <sbates@raithlin.com>
      Tested-by: NStephen Bates <sbates@raithlin.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      deec7877
    • E
      io_uring: fix memory leak of UNIX domain socket inode · dc61e2f4
      Eric Biggers 提交于
      commit 355e8d26f719c207aa2e00e6f3cfab3acf21769b upstream.
      
      Opening and closing an io_uring instance leaks a UNIX domain socket
      inode.  This is because the ->file of the io_uring instance's internal
      UNIX domain socket is set to point to the io_uring file, but then
      sock_release() sees the non-NULL ->file and assumes the inode reference
      is held by the file so doesn't call iput().  That's not the case here,
      since the reference is still meant to be held by the socket; the actual
      inode of the io_uring file is different.
      
      Fix this leak by NULL-ing out ->file before releasing the socket.
      
      Reported-by: syzbot+111cb28d9f583693aefa@syzkaller.appspotmail.com
      Fixes: 2b188cc1bb85 ("Add io_uring IO interface")
      Cc: <stable@vger.kernel.org> # v5.1+
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      dc61e2f4
    • J
      io_uring: punt short reads to async context · c599edd9
      Jens Axboe 提交于
      commit 9d93a3f5a0c0d0f79aebc597d47c7cedc852aeb5 upstream.
      
      We can encounter a short read when we're doing buffered reads and the
      data is partially cached. Right now we just return the short read, but
      that forces the application to read that CQE, then issue another SQE
      to finish the read. That read will not be cached, and hence will result
      in an async punt.
      
      It's more efficient to do that async punt from within the kernel, as
      that will the not need two round trips more to the kernel.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      c599edd9
    • J
      uio: make import_iovec()/compat_import_iovec() return bytes on success · 0c13034a
      Jens Axboe 提交于
      commit 87e5e6dab6c2a21fab2620f37786276d202e2ce0 upstream.
      
      Currently these functions return < 0 on error, and 0 for success.
      Change that so that we return < 0 on error, but number of bytes
      for success.
      
      Some callers already treat the return value that way, others need a
      slight tweak.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      0c13034a
    • P
      io_uring: Fix __io_uring_register() false success · cb67bab8
      Pavel Begunkov 提交于
      commit a278682dad37fd2f8d2f30d8e84e376a856ab472 upstream.
      
      If io_copy_iov() fails, it will break the loop and report success,
      albeit partially completed operation.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      cb67bab8
    • J
      tools/io_uring: sync with liburing · a520af94
      Jens Axboe 提交于
      commit 004d564f908790efe815a6510a542ac1227ef2a2 upstream.
      
      Various fixes and changes have been applied to liburing since we
      copied some select bits to the kernel testing/examples part, sync
      up with liburing to get those changes.
      
      Most notable is the change that split the CQE reading into the peek
      and seen event, instead of being just a single function. Also fixes
      an unsigned wrap issue in io_uring_submit(), leak of 'fd' in setup
      if we fail, and various other little issues.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      a520af94
    • J
      tools/io_uring: fix Makefile for pthread library link · 16770031
      Jens Axboe 提交于
      commit 486f069253c3c738dec62daeb16f7232b2cca065 upstream.
      
      Currently fails with:
      
      io_uring-bench.o: In function `main':
      /home/axboe/git/linux-block/tools/io_uring/io_uring-bench.c:560: undefined reference to `pthread_create'
      /home/axboe/git/linux-block/tools/io_uring/io_uring-bench.c:588: undefined reference to `pthread_join'
      collect2: error: ld returned 1 exit status
      Makefile:11: recipe for target 'io_uring-bench' failed
      make: *** [io_uring-bench] Error 1
      
      Move -lpthread to the end.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      16770031
    • J
      blk-mq: fix NULL pointer deference in case no poll implementation · 1e41c505
      Joseph Qi 提交于
      In case some drivers such virtio-blk, poll function is not implementatin
      yet. Before commit 529262d5 ("block: remove ->poll_fn"), q->poll_fn
      is NULL and then blk_poll() won't do poll actually.
      So add a check for this to avoid NULL pointer dereference when calling
      q->mq_ops->poll.
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      1e41c505
    • J
      io_uring: use wait_event_interruptible for cq_wait conditional wait · fce831f9
      Jackie Liu 提交于
      commit fdb288a679cdf6a71f3c1ae6f348ba4dae742681 upstream.
      
      The previous patch has ensured that io_cqring_events contain
      smp_rmb memory barriers, Now we can use wait_event_interruptible
      to keep the code simple.
      Signed-off-by: NJackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      fce831f9
    • J
      io_uring: adjust smp_rmb inside io_cqring_events · e4fd982c
      Jackie Liu 提交于
      commit dc6ce4bc2b355a47f225a0205046b3ebf29a7f72 upstream.
      
      Whenever smp_rmb is required to use io_cqring_events,
      keep smp_rmb inside the function io_cqring_events.
      Signed-off-by: NJackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      e4fd982c
    • R
      io_uring: fix infinite wait in khread_park() on io_finish_async() · a35d7922
      Roman Penyaev 提交于
      commit 2bbcd6d3b36a75a19be4917807f54ae32dd26aba upstream.
      
      This fixes couple of races which lead to infinite wait of park completion
      with the following backtraces:
      
        [20801.303319] Call Trace:
        [20801.303321]  ? __schedule+0x284/0x650
        [20801.303323]  schedule+0x33/0xc0
        [20801.303324]  schedule_timeout+0x1bc/0x210
        [20801.303326]  ? schedule+0x3d/0xc0
        [20801.303327]  ? schedule_timeout+0x1bc/0x210
        [20801.303329]  ? preempt_count_add+0x79/0xb0
        [20801.303330]  wait_for_completion+0xa5/0x120
        [20801.303331]  ? wake_up_q+0x70/0x70
        [20801.303333]  kthread_park+0x48/0x80
        [20801.303335]  io_finish_async+0x2c/0x70
        [20801.303336]  io_ring_ctx_wait_and_kill+0x95/0x180
        [20801.303338]  io_uring_release+0x1c/0x20
        [20801.303339]  __fput+0xad/0x210
        [20801.303341]  task_work_run+0x8f/0xb0
        [20801.303342]  exit_to_usermode_loop+0xa0/0xb0
        [20801.303343]  do_syscall_64+0xe0/0x100
        [20801.303349]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        [20801.303380] Call Trace:
        [20801.303383]  ? __schedule+0x284/0x650
        [20801.303384]  schedule+0x33/0xc0
        [20801.303386]  io_sq_thread+0x38a/0x410
        [20801.303388]  ? __switch_to_asm+0x40/0x70
        [20801.303390]  ? wait_woken+0x80/0x80
        [20801.303392]  ? _raw_spin_lock_irqsave+0x17/0x40
        [20801.303394]  ? io_submit_sqes+0x120/0x120
        [20801.303395]  kthread+0x112/0x130
        [20801.303396]  ? kthread_create_on_node+0x60/0x60
        [20801.303398]  ret_from_fork+0x35/0x40
      
       o kthread_park() waits for park completion, so io_sq_thread() loop
         should check kthread_should_park() along with khread_should_stop(),
         otherwise if kthread_park() is called before prepare_to_wait()
         the following schedule() never returns:
      
         CPU#0                    CPU#1
      
         io_sq_thread_stop():     io_sq_thread():
      
                                     while(!kthread_should_stop() && !ctx->sqo_stop) {
      
            ctx->sqo_stop = 1;
            kthread_park()
      
      	                            prepare_to_wait();
                                          if (kthread_should_stop() {
      				    }
                                          schedule();   <<< nobody checks park flag,
      				                  <<< so schedule and never return
      
       o if the flag ctx->sqo_stop is observed by the io_sq_thread() loop
         it is quite possible, that kthread_should_park() check and the
         following kthread_parkme() is never called, because kthread_park()
         has not been yet called, but few moments later is is called and
         waits there for park completion, which never happens, because
         kthread has already exited:
      
         CPU#0                    CPU#1
      
         io_sq_thread_stop():     io_sq_thread():
      
            ctx->sqo_stop = 1;
                                     while(!kthread_should_stop() && !ctx->sqo_stop) {
                                         <<< observe sqo_stop and exit the loop
      			       }
      
      			       if (kthread_should_park())
      			           kthread_parkme();  <<< never called, since was
      					              <<< never parked
      
            kthread_park()           <<< waits forever for park completion
      
      In the current patch we quit the loop by only kthread_should_park()
      check (kthread_park() is synchronous, so kthread_should_stop() is
      never observed), and we abandon ->sqo_stop flag, since it is racy.
      At the end of the io_sq_thread() we unconditionally call parmke(),
      since we've exited the loop by the park flag.
      Signed-off-by: NRoman Penyaev <rpenyaev@suse.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-block@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      a35d7922
    • J
      io_uring: remove 'ev_flags' argument · 8c599bf7
      Jens Axboe 提交于
      commit c71ffb673cd9bb2ddc575ede9055f265b2535690 upstream.
      
      We always pass in 0 for the cqe flags argument, since the support for
      "this read hit page cache" hint was dropped.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      8c599bf7
    • J
      io_uring: fix failure to verify SQ_AFF cpu · b9dfcf6a
      Jens Axboe 提交于
      commit 44a9bd18a0f06bba19d155aeaa11e2edce898293 upstream.
      
      The test case we have is rightfully failing with the current kernel:
      
      io_uring_setup(1, 0x7ffe2cafebe0), flags: IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF, resv: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000, sq_thread_cpu: 4
      expected -1, got 3
      
      This is in a vm, and CPU3 is the last valid one, hence asking for 4
      should fail the setup with -EINVAL, not succeed. The problem is that
      we're using array_index_nospec() with nr_cpu_ids as the index, hence we
      wrap and end up using CPU0 instead of CPU4. This makes the setup
      succeed where it should be failing.
      
      We don't need to use array_index_nospec() as we're not indexing any
      array with this. Instead just compare with nr_cpu_ids directly. This
      is fine as we're checking with cpu_online() afterwards.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      b9dfcf6a
    • S
      io_uring: fix race condition reading SQE data · 8f2cc0e9
      Stefan Bühler 提交于
      commit e2033e33cb3821c26d4f9e70677910827d3b7885 upstream.
      
      When punting to workers the SQE gets copied after the initial try.
      There is a race condition between reading SQE data for the initial try
      and copying it for punting it to the workers.
      
      For example io_rw_done calls kiocb->ki_complete even if it was prepared
      for IORING_OP_FSYNC (and would be NULL).
      
      The easiest solution for now is to alway prepare again in the worker.
      
      req->file is safe to prepare though as long as it is checked before use.
      Signed-off-by: NStefan Bühler <source@stbuehler.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      8f2cc0e9
    • S
      io_uring: use cpu_online() to check p->sq_thread_cpu instead of cpu_possible() · b5ac46ff
      Shenghui Wang 提交于
      commit 7889f44dd9cee15aff1c3f7daf81ca4dfed48fc7 upstream.
      
      This issue is found by running liburing/test/io_uring_setup test.
      
      When test run, the testcase "attempt to bind to invalid cpu" would not
      pass with messages like:
         io_uring_setup(1, 0xbfc2f7c8), \
      flags: IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF, \
      resv: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000, \
      sq_thread_cpu: 2
         expected -1, got 3
         FAIL
      
      On my system, there is:
         CPU(s) possible : 0-3
         CPU(s) online   : 0-1
         CPU(s) offline  : 2-3
         CPU(s) present  : 0-1
      
      The sq_thread_cpu 2 is offline on my system, so the bind should fail.
      But cpu_possible() will pass the check. We shouldn't be able to bind
      to an offline cpu. Use cpu_online() to do the check.
      
      After the change, the testcase run as expected: EINVAL will be returned
      for cpu offlined.
      Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NShenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      b5ac46ff
    • C
      io_uring: fix shadowed variable ret return code being not checked · 91ee101a
      Colin Ian King 提交于
      commit efeb862bd5bc001636e690debf6f9fbba98e5bfd upstream.
      
      Currently variable ret is declared in a while-loop code block that
      shadows another variable ret. When an error occurs in the while-loop
      the error return in ret is not being set in the outer code block and
      so the error check on ret is always going to be checking on the wrong
      ret variable resulting in check that is always going to be true and
      a premature return occurs.
      
      Fix this by removing the declaration of the inner while-loop variable
      ret so that shadowing does not occur.
      
      Addresses-Coverity: ("'Constant' variable guards dead code")
      Fixes: 6b06314c47e1 ("io_uring: add file set registration")
      Signed-off-by: NColin Ian King <colin.king@canonical.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      91ee101a
    • S
      req->error only used for iopoll · f44bc1f1
      Stefan Bühler 提交于
      commit 5dcf877fb13f3c6a8ba0777ef766c4af32df725d upstream.
      
      No need to set it in io_poll_add; io_poll_complete doesn't use it to set
      the result in the CQE.
      Signed-off-by: NStefan Bühler <source@stbuehler.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      f44bc1f1
    • J
      io_uring: add support for eventfd notifications · 3707251b
      Jens Axboe 提交于
      commit 9b402849e80c85eee10bbd341aab3f1a0f942d4f upstream.
      
      Allow registration of an eventfd, which will trigger an event every
      time a completion event happens for this io_uring instance.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      3707251b
    • J
      io_uring: add support for IORING_OP_SYNC_FILE_RANGE · 2c12f33e
      Jens Axboe 提交于
      commit 5d17b4a4b7fa172b205be8a05051ae705d1dc3bb upstream.
      
      This behaves just like sync_file_range(2) does.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      2c12f33e
    • J
      fs: add sync_file_range() helper · ceda0208
      Jens Axboe 提交于
      commit 22f96b3808c12a218e9a3bce6e1bfbd74efbe374 upstream.
      
      This just pulls out the ksys_sync_file_range() code to work on a struct
      file instead of an fd, so we can use it elsewhere.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      ceda0208
    • J
      io_uring: add support for marking commands as draining · b52f2397
      Jens Axboe 提交于
      commit de0617e467171ba44c73efd1ba63f101b164a035 upstream.
      
      There are no ordering constraints between the submission and completion
      side of io_uring. But sometimes that would be useful to have. One common
      example is doing an fsync, for instance, and have it ordered with
      previous writes. Without support for that, the application must do this
      tracking itself.
      
      This adds a general SQE flag, IOSQE_IO_DRAIN. If a command is marked
      with this flag, then it will not be issued before previous commands have
      completed, and subsequent commands submitted after the drain will not be
      issued before the drain is started.. If there are no pending commands,
      setting this flag will not change the behavior of the issue of the
      command.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      b52f2397
    • J
      alinux: jbd2: fix build warnings · 54f827ce
      Joseph Qi 提交于
      Fix the following build warnings:
        fs/jbd2/transaction.o: In function `jbd2_journal_stop':
        (.text+0x2934): undefined reference to `__udivdi3'
        (.text+0x2970): undefined reference to `__udivdi3'
      
      Fixes: 861575c9 ("alinux: jbd2: track slow handle which is preventing transaction committing")
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      54f827ce
    • S
      drm/amdgpu/gmc: fix compiler errors [-Werror,-Wmissing-braces] (V2) · 5709b748
      Shirish S 提交于
      commit 05794eff1aa6060248bfca34ee936c613f94a942 upstream.
      
      Initializing structures with { } is known to be problematic since
      it doesn't necessararily initialize all bytes, in case of padding,
      causing random failures when structures are memcmp().
      
      This patch fixes the structure initialisation related compiler
      error by memset().
      
      V2: rectified missing piece in coding
      Signed-off-by: NShirish S <shirish.s@amd.com>
      Reviewed-by: NAlex Deucher <alexander.deucher@amd.com>
      Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      5709b748
    • L
      ACPI/IORT: Rename arm_smmu_v3_set_proximity() 'node' local variable · 8df5902b
      Lorenzo Pieralisi 提交于
      commit 3e77eeb7a27fc3dcf6b65e7ee01ac00bf5d2b4fb upstream.
      
      Commit 36a2ba07757d ("ACPI/IORT: Reject platform device creation on NUMA
      node mapping failure") introduced a local variable 'node' in
      arm_smmu_v3_set_proximity() that shadows the struct acpi_iort_node
      pointer function parameter.
      
      Execution was unaffected but it is prone to errors and can lead
      to subtle bugs.
      
      Rename the local variable to prevent any issue.
      Reviewed-by: NHanjun Guo <guohanjun@huawei.com>
      Reported-by: NWill Deacon <will@kernel.org>
      Signed-off-by: NLorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: NWill Deacon <will@kernel.org>
      Signed-off-by: Zou Cao<zoucao@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      8df5902b
    • Z
      irqchip/gic-v3-its: Remove the redundant set_bit for lpi_map · bfc5299f
      Zenghui Yu 提交于
      commit 342be1068d9b5b1fd364d270b4f731764e23de2b upstream.
      
      We try to find a free LPI region in device's lpi_map and allocate them
      (set them to 1) when we want to allocate LPIs for this device. This is
      what bitmap_find_free_region() has done for us. The following set_bit
      is redundant and a bit confusing (since we only set_bit against the first
      allocated LPI idx). Remove it, and make the set_bit explicit by comment.
      Signed-off-by: NZenghui Yu <yuzenghui@huawei.com>
      Signed-off-by: NMarc Zyngier <maz@kernel.org>
      Signed-off-by: Zou Cao<zoucao@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      bfc5299f
    • S
      perf/smmuv3: Enable HiSilicon Erratum 162001800 quirk · 54c387a7
      Shameer Kolothum 提交于
      commit 24062fe85860debfdae0eeaa495f27c9971ec163 upstream
      
      HiSilicon erratum 162001800 describes the limitation of
      SMMUv3 PMCG implementation on HiSilicon Hip08 platforms.
      
      On these platforms, the PMCG event counter registers
      (SMMU_PMCG_EVCNTRn) are read only and as a result it
      is not possible to set the initial counter period value
      on event monitor start.
      
      To work around this, the current value of the counter
      is read and used for delta calculations. OEM information
      from ACPI header is used to identify the affected hardware
      platforms.
      Signed-off-by: NShameer Kolothum <shameerali.kolothum.thodi@huawei.com>
      Reviewed-by: NHanjun Guo <hanjun.guo@linaro.org>
      Reviewed-by: NRobin Murphy <robin.murphy@arm.com>
      Acked-by: NLorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      [will: update silicon-errata.txt and add reason string to acpi match]
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: Zou Cao<zoucao@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      54c387a7
    • S
      perf/smmuv3: Add MSI irq support · 1aa99077
      Shameer Kolothum 提交于
      commit f202cdab3b48d8c2c1846c938ea69cb8aa897699 upstream
      
      This adds support for MSI-based counter overflow interrupt.
      Signed-off-by: NShameer Kolothum <shameerali.kolothum.thodi@huawei.com>
      Reviewed-by: NRobin Murphy <robin.murphy@arm.com>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: Zou Cao<zoucao@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      1aa99077
    • N
      perf/smmuv3: Add arm64 smmuv3 pmu driver · 47a6fe19
      Neil Leeder 提交于
      commit 7d839b4b9e00645e49345d6ce5dfa8edf53c1a21 upstream
      
      Adds a new driver to support the SMMUv3 PMU and add it into the
      perf events framework.
      
      Each SMMU node may have multiple PMUs associated with it, each of
      which may support different events.
      
      SMMUv3 PMCG devices are named as smmuv3_pmcg_<phys_addr_page> where
      <phys_addr_page> is the physical page address of the SMMU PMCG
      wrapped to 4K boundary. For example, the PMCG at 0xff88840000 is
      named smmuv3_pmcg_ff88840
      
      Filtering by stream id is done by specifying filtering parameters
      with the event. options are:
         filter_enable    - 0 = no filtering, 1 = filtering enabled
         filter_span      - 0 = exact match, 1 = pattern match
         filter_stream_id - pattern to filter against
      
      Example: perf stat -e smmuv3_pmcg_ff88840/transaction,filter_enable=1,
                             filter_span=1,filter_stream_id=0x42/ -a netperf
      
      Applies filter pattern 0x42 to transaction events, which means events
      matching stream ids 0x42 & 0x43 are counted as only upper StreamID
      bits are required to match the given filter. Further filtering
      information is available in the SMMU documentation.
      
      SMMU events are not attributable to a CPU, so task mode and sampling
      are not supported.
      Signed-off-by: NNeil Leeder <nleeder@codeaurora.org>
      Signed-off-by: NShameer Kolothum <shameerali.kolothum.thodi@huawei.com>
      Reviewed-by: NRobin Murphy <robin.murphy@arm.com>
      [will: fold in review feedback from Robin]
      [will: rewrite Kconfig text and allow building as a module]
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: Zou Cao<zoucao@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      47a6fe19
    • N
      ACPI/IORT: Add support for PMCG · 9c6dfb51
      Neil Leeder 提交于
      commit 24e516049360eda85cf3fe9903221d43886c2689 upstream.
      
      Add support for the SMMU Performance Monitor Counter Group
      information from ACPI. This is in preparation for its use
      in the SMMUv3 PMU driver.
      Signed-off-by: NNeil Leeder <nleeder@codeaurora.org>
      Signed-off-by: NHanjun Guo <guohanjun@huawei.com>
      Signed-off-by: NShameer Kolothum <shameerali.kolothum.thodi@huawei.com>
      Reviewed-by: NRobin Murphy <robin.murphy@arm.com>
      Acked-by: NLorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: Zou Cao<zoucao@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      9c6dfb51
    • G
      iommu/dma: Use NUMA aware memory allocations in __iommu_dma_alloc_pages() · 613ca475
      Ganapatrao Kulkarni 提交于
      commit c4b17afb0a4e8d042320efaf2acf55cb26795f78 upstream.
      
      Change function __iommu_dma_alloc_pages() to allocate pages for DMA from
      respective device NUMA node. The ternary operator which would be for
      alloc_pages_node() is tidied along with this.
      
      The motivation for this change is to have a policy for page allocation
      consistent with direct DMA mapping, which attempts to allocate pages local
      to the device, as mentioned in [1].
      
      In addition, for certain workloads it has been observed a marginal
      performance improvement. The patch caused an observation of 0.9% average
      throughput improvement for running tcrypt with HiSilicon crypto engine.
      
      We also include a modification to use kvzalloc() for kzalloc()/vzalloc()
      combination.
      
      [1] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1692998.htmlSigned-off-by: NGanapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
      [JPG: Added kvzalloc(), drop pages ** being device local, remove ternary operator, update message]
      Signed-off-by: NJohn Garry <john.garry@huawei.com>
      Reviewed-by: NRobin Murphy <robin.murphy@arm.com>
      Signed-off-by: NJoerg Roedel <jroedel@suse.de>
      Signed-off-by: Zou Cao<zoucao@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      613ca475
    • P
      mm/hotplug: make remove_memory() interface usable · f8502f80
      Pavel Tatashin 提交于
      commit eca499ab3749a4537dee77ffead47a1a2c0dee19 upstream
      
      Presently the remove_memory() interface is inherently broken.  It tries
      to remove memory but panics if some memory is not offline.  The problem
      is that it is impossible to ensure that all memory blocks are offline as
      this function also takes lock_device_hotplug that is required to change
      memory state via sysfs.
      
      So, between calling this function and offlining all memory blocks there
      is always a window when lock_device_hotplug is released, and therefore,
      there is always a chance for a panic during this window.
      
      Make this interface to return an error if memory removal fails.  This
      way it is safe to call this function without panicking machine, and also
      makes it symmetric to add_memory() which already returns an error.
      
      Link: http://lkml.kernel.org/r/20190517215438.6487-3-pasha.tatashin@soleen.comSigned-off-by: NPavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Nyinhe <yinhe@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      f8502f80
    • D
      mm/memory_hotplug: make remove_memory() take the device_hotplug_lock · d2097173
      David Hildenbrand 提交于
      commit d15e59260f62bd5e0f625cf5f5240f6ffac78ab6 upstream
      
      Patch series "mm: online/offline_pages called w.o. mem_hotplug_lock", v3.
      
      Reading through the code and studying how mem_hotplug_lock is to be used,
      I noticed that there are two places where we can end up calling
      device_online()/device_offline() - online_pages()/offline_pages() without
      the mem_hotplug_lock.  And there are other places where we call
      device_online()/device_offline() without the device_hotplug_lock.
      
      While e.g.
      	echo "online" > /sys/devices/system/memory/memory9/state
      is fine, e.g.
      	echo 1 > /sys/devices/system/memory/memory9/online
      Will not take the mem_hotplug_lock. However the device_lock() and
      device_hotplug_lock.
      
      E.g.  via memory_probe_store(), we can end up calling
      add_memory()->online_pages() without the device_hotplug_lock.  So we can
      have concurrent callers in online_pages().  We e.g.  touch in
      online_pages() basically unprotected zone->present_pages then.
      
      Looks like there is a longer history to that (see Patch #2 for details),
      and fixing it to work the way it was intended is not really possible.  We
      would e.g.  have to take the mem_hotplug_lock in device/base/core.c, which
      sounds wrong.
      
      Summary: We had a lock inversion on mem_hotplug_lock and device_lock().
      More details can be found in patch 3 and patch 6.
      
      I propose the general rules (documentation added in patch 6):
      
      1. add_memory/add_memory_resource() must only be called with
         device_hotplug_lock.
      2. remove_memory() must only be called with device_hotplug_lock. This is
         already documented and holds for all callers.
      3. device_online()/device_offline() must only be called with
         device_hotplug_lock. This is already documented and true for now in core
         code. Other callers (related to memory hotplug) have to be fixed up.
      4. mem_hotplug_lock is taken inside of add_memory/remove_memory/
         online_pages/offline_pages.
      
      To me, this looks way cleaner than what we have right now (and easier to
      verify).  And looking at the documentation of remove_memory, using
      lock_device_hotplug also for add_memory() feels natural.
      
      This patch (of 6):
      
      remove_memory() is exported right now but requires the
      device_hotplug_lock, which is not exported.  So let's provide a variant
      that takes the lock and only export that one.
      
      The lock is already held in
      	arch/powerpc/platforms/pseries/hotplug-memory.c
      	drivers/acpi/acpi_memhotplug.c
      	arch/powerpc/platforms/powernv/memtrace.c
      
      Apart from that, there are not other users in the tree.
      
      Link: http://lkml.kernel.org/r/20180925091457.28651-2-david@redhat.comSigned-off-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NPavel Tatashin <pavel.tatashin@microsoft.com>
      Reviewed-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Reviewed-by: NRashmica Gupta <rashmica.g@gmail.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Rashmica Gupta <rashmica.g@gmail.com>
      Cc: Michael Neuling <mikey@neuling.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: John Allen <jallen@linux.vnet.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Nyinhe <yinhe@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      d2097173
    • X
      alinux: mm: kidled: fix frame-larger-than build warning · 25f9e572
      Xu Yu 提交于
      This fix the following build warning:
      
      mm/memcontrol.c: In function 'mem_cgroup_idle_page_stats_show':
      mm/memcontrol.c:3866:1: warning: the frame size of 2160 bytes is larger than 2048 bytes [-Wframe-larger-than=]
      
      The root cause is that "mem_cgroup_idle_page_stats_show" has two
      "struct idle_page_stats" variables, each of which is 1056 bytes in
      size, on the stack, thus exceeding the 2048 max frame size.
      
      This fix the build warning by dynamically allocating memory to these two
      variables with kmalloc.
      
      Fixes: a29243e2 ("alinux: mm: Support kidled")
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      25f9e572
    • A
      mm: initialize MAX_ORDER_NR_PAGES at a time instead of doing larger sections · f38de7b3
      Alexander Duyck 提交于
      commit 0e56acae4b4dd4a9fbe897854ab83a109e2a9e11 upstream.
      
      Add yet another iterator, for_each_free_mem_range_in_zone_from, and then
      use it to support initializing and freeing pages in groups no larger than
      MAX_ORDER_NR_PAGES.  By doing this we can greatly improve the cache
      locality of the pages while we do several loops over them in the init and
      freeing process.
      
      We are able to tighten the loops further as a result of the "from"
      iterator as we can perform the initial checks for first_init_pfn in our
      first call to the iterator, and continue without the need for those checks
      via the "from" iterator.  I have added this functionality in the function
      called deferred_init_mem_pfn_range_in_zone that primes the iterator and
      causes us to exit if we encounter any failure.
      
      On my x86_64 test system with 384GB of memory per node I saw a reduction
      in initialization time from 1.85s to 1.38s as a result of this patch.
      
      Link: http://lkml.kernel.org/r/20190405221231.12227.85836.stgit@localhost.localdomainSigned-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: NPavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: <yi.z.zhang@linux.intel.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      f38de7b3
    • A
      mm: implement new zone specific memblock iterator · ad97e5e4
      Alexander Duyck 提交于
      commit 837566e7e08e3f89444166444836a8a49b9f9322 upstream.
      
      Introduce a new iterator for_each_free_mem_pfn_range_in_zone.
      
      This iterator will take care of making sure a given memory range provided
      is in fact contained within a zone.  It takes are of all the bounds
      checking we were doing in deferred_grow_zone, and deferred_init_memmap.
      In addition it should help to speed up the search a bit by iterating until
      the end of a range is greater than the start of the zone pfn range, and
      will exit completely if the start is beyond the end of the zone.
      
      Link: http://lkml.kernel.org/r/20190405221225.12227.22573.stgit@localhost.localdomainSigned-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: NPavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: NMike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <yi.z.zhang@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      ad97e5e4
    • A
      mm: drop meminit_pfn_in_nid as it is redundant · b065ceca
      Alexander Duyck 提交于
      commit 56ec43d8b02719402c9fcf984feb52ec2300f8a5 upstream.
      
      As best as I can tell the meminit_pfn_in_nid call is completely redundant.
      The deferred memory initialization is already making use of
      for_each_free_mem_range which in turn will call into __next_mem_range
      which will only return a memory range if it matches the node ID provided
      assuming it is not NUMA_NO_NODE.
      
      I am operating on the assumption that there are no zones or pgdata_t
      structures that have a NUMA node of NUMA_NO_NODE associated with them.  If
      that is the case then __next_mem_range will never return a memory range
      that doesn't match the zone's node ID and as such the check is redundant.
      
      So one piece I would like to verify on this is if this works for ia64.
      Technically it was using a different approach to get the node ID, but it
      seems to have the node ID also encoded into the memblock.  So I am
      assuming this is okay, but would like to get confirmation on that.
      
      On my x86_64 test system with 384GB of memory per node I saw a reduction
      in initialization time from 2.80s to 1.85s as a result of this patch.
      
      Link: http://lkml.kernel.org/r/20190405221219.12227.93957.stgit@localhost.localdomainSigned-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: NPavel Tatashin <pavel.tatashin@microsoft.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <yi.z.zhang@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      b065ceca
    • A
      mm: use mm_zero_struct_page from SPARC on all 64b architectures · e23b0cb5
      Alexander Duyck 提交于
      commit 5470dea49f5382257c242ac617d908267727f1a8 upstream.
      
      Patch series "Deferred page init improvements", v7.
      
      This patchset is essentially a refactor of the page initialization logic
      that is meant to provide for better code reuse while providing a
      significant improvement in deferred page initialization performance.
      
      In my testing on an x86_64 system with 384GB of RAM I have seen the
      following.  In the case of regular memory initialization the deferred init
      time was decreased from 3.75s to 1.38s on average.  This amounts to a 172%
      improvement for the deferred memory initialization performance.
      
      I have called out the improvement observed with each patch.
      
      This patch (of 4):
      
      Use the same approach that was already in use on Sparc on all the
      architectures that support a 64b long.
      
      This is mostly motivated by the fact that 7 to 10 store/move instructions
      are likely always going to be faster than having to call into a function
      that is not specialized for handling page init.
      
      An added advantage to doing it this way is that the compiler can get away
      with combining writes in the __init_single_page call.  As a result the
      memset call will be reduced to only about 4 write operations, or at least
      that is what I am seeing with GCC 6.2 as the flags, LRU pointers, and
      count/mapcount seem to be cancelling out at least 4 of the 8 assignments
      on my system.
      
      One change I had to make to the function was to reduce the minimum page
      size to 56 to support some powerpc64 configurations.
      
      This change should introduce no change on SPARC since it already had this
      code.  In the case of x86_64 I saw a reduction from 3.75s to 2.80s when
      initializing 384GB of RAM per node.  Pavel Tatashin tested on a system
      with Broadcom's Stingray CPU and 48GB of RAM and found that
      __init_single_page() takes 19.30ns / 64-byte struct page before this patch
      and with this patch it takes 17.33ns / 64-byte struct page.  Mike Rapoport
      ran a similar test on a OpenPower (S812LC 8348-21C) with Power8 processor
      and 128GB or RAM.  His results per 64-byte struct page were 4.68ns before,
      and 4.59ns after this patch.
      
      Link: http://lkml.kernel.org/r/20190405221213.12227.9392.stgit@localhost.localdomainSigned-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: NPavel Tatashin <pavel.tatashin@microsoft.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <yi.z.zhang@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      e23b0cb5
    • C
      nvme-mpath: remove I/O polling support · 5501fe75
      Christoph Hellwig 提交于
      commit 9d6610b76fa374eae3deb93bcbace4a06c2e3b95 upstream.
      
      The ->poll_fn has been stale for a while, as a lot of places check for mq
      ops.  But there is no real point in it anyway, as we don't even use
      the multipath code for subsystems without multiple ports, which is usually
      what we do high performance I/O to.  If it really becomes an issue we
      should rework the nvme code to also skip the multipath code for any
      private namespace, even if that could mean some trouble when rescanning.
      Reviewed-by: NKeith Busch <keith.busch@intel.com>
      Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NShile Zhang <shile.zhang@linux.alibaba-inc.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      5501fe75
    • D
      amd-gpu: Don't undefine READ and WRITE · fb7ad259
      David Howells 提交于
      commit 1fcb748d187d0c7732a75a509e924ead6d070e04 upstream.
      
      Remove the undefinition of READ and WRITE because these constants may be
      used elsewhere in subsequently included header files, thus breaking them.
      
      These constants don't actually appear to be used in the driver, so the
      undefinition seems pointless.
      
      Fixes: 4562236b ("drm/amd/dc: Add dc display driver (v2)")
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      fb7ad259
    • M
      blk-mq: grab .q_usage_counter when queuing request from plug code path · 4a55f77f
      Ming Lei 提交于
      commit e87eb301bee183d82bb3d04bd71b6660889a2588 upstream.
      
      Just like aio/io_uring, we need to grab 2 refcount for queuing one
      request, one is for submission, another is for completion.
      
      If the request isn't queued from plug code path, the refcount grabbed
      in generic_make_request() serves for submission. In theroy, this
      refcount should have been released after the sumission(async run queue)
      is done. blk_freeze_queue() works with blk_sync_queue() together
      for avoiding race between cleanup queue and IO submission, given async
      run queue activities are canceled because hctx->run_work is scheduled with
      the refcount held, so it is fine to not hold the refcount when
      running the run queue work function for dispatch IO.
      
      However, if request is staggered into plug list, and finally queued
      from plug code path, the refcount in submission side is actually missed.
      And we may start to run queue after queue is removed because the queue's
      kobject refcount isn't guaranteed to be grabbed in flushing plug list
      context, then kernel oops is triggered, see the following race:
      
      blk_mq_flush_plug_list():
              blk_mq_sched_insert_requests()
                      insert requests to sw queue or scheduler queue
                      blk_mq_run_hw_queue
      
      Because of concurrent run queue, all requests inserted above may be
      completed before calling the above blk_mq_run_hw_queue. Then queue can
      be freed during the above blk_mq_run_hw_queue().
      
      Fixes the issue by grab .q_usage_counter before calling
      blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This way is
      safe because the queue is absolutely alive before inserting request.
      
      Cc: Dongli Zhang <dongli.zhang@oracle.com>
      Cc: James Smart <james.smart@broadcom.com>
      Cc: linux-scsi@vger.kernel.org,
      Cc: Martin K . Petersen <martin.petersen@oracle.com>,
      Cc: Christoph Hellwig <hch@lst.de>,
      Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
      Reviewed-by: NBart Van Assche <bvanassche@acm.org>
      Tested-by: NJames Smart <james.smart@broadcom.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      [Joseph: use the passing 'q' directly]
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      4a55f77f