1. 19 Oct 2020, 12 commits
  2. 16 Oct 2020, 1 commit
  3. 15 Oct 2020, 11 commits
  4. 13 Oct 2020, 16 commits
    • M
      blk-mq: insert flush request to the front of dispatch queue · d76d43fb
      Ming Lei authored
      mainline inclusion
      from mainline-5.6-rc6
      commit cc3200ea
      category: bugfix
      bugzilla: 42777
      CVE: NA
      
      ---------------------------
      
      commit 01e99aec ("blk-mq: insert passthrough request into
      hctx->dispatch directly") may cause flush requests to be added to the
      tail of the dispatch queue, via the 'add_head' parameter of
      blk_mq_sched_insert_request().
      
      It turns out this causes a performance regression on NCQ controllers,
      because a flush is a non-NCQ command and cannot be queued while any
      NCQ command is in flight. When the flush rq is added to the front of
      hctx->dispatch, S_SCHED_RESTART makes it more likely that extra time
      is added to the flush rq's latency compared with adding it to the tail
      of the dispatch queue; that increases the chance of flush merging, so
      fewer flush requests are issued to the controller.
      
      So always insert flush requests at the front of the dispatch queue,
      just as before commit 01e99aec ("blk-mq: insert passthrough request
      into hctx->dispatch directly") was applied.
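
      The change boils down to choosing which end of the per-hctx dispatch
      list a request goes to. Below is a minimal, hedged sketch of
      blk_mq_request_bypass_insert() as it looks upstream after this change;
      the rq->mq_hctx field and the exact locking context are assumptions
      against this 4.19 backport:

          /* Hedged sketch, not the exact backported diff: flush requests pass
           * at_head = true so they land at the front of hctx->dispatch. */
          void blk_mq_request_bypass_insert(struct request *rq, bool at_head,
                                            bool run_queue)
          {
                  struct blk_mq_hw_ctx *hctx = rq->mq_hctx; /* assumed field */

                  spin_lock(&hctx->lock);
                  if (at_head)
                          list_add(&rq->queuelist, &hctx->dispatch);      /* front */
                  else
                          list_add_tail(&rq->queuelist, &hctx->dispatch); /* tail */
                  spin_unlock(&hctx->lock);

                  if (run_queue)
                          blk_mq_run_hw_queue(hctx, false);
          }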
      
      Cc: Damien Le Moal <Damien.LeMoal@wdc.com>
      Cc: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Fixes: 01e99aec ("blk-mq: insert passthrough request into hctx->dispatch directly")
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Reviewed-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      d76d43fb
    • D
      blk-mq: Rerun dispatching in the case of budget contention · 8083c82d
      Douglas Anderson authored
      mainline inclusion
      from mainline-5.8-rc1
      commit a0823421
      category: bugfix
      bugzilla: 42781
      CVE: NA
      
      ---------------------------
      
      If ever a thread running blk-mq code tries to get budget and fails it
      immediately stops doing work and assumes that whenever budget is freed
      up that queues will be kicked and whatever work the thread was trying
      to do will be tried again.
      
      One path where budget is freed and queues are kicked in the normal
      case can be seen in scsi_finish_command().  Specifically:
      - scsi_finish_command()
        - scsi_device_unbusy()
          - # Decrement "device_busy", AKA release budget
        - scsi_io_completion()
          - scsi_end_request()
            - blk_mq_run_hw_queues()
      
      The above is all well and good.  The problem comes up when a thread
      claims the budget but then releases it without actually dispatching
      any work.  Since we didn't schedule any work we'll never run the path
      of finishing work / kicking the queues.
      
      This isn't often actually a problem which is why this issue has
      existed for a while and nobody noticed.  Specifically we only get into
      this situation when we unexpectedly found that we weren't going to do
      any work.  Code that later receives new work kicks the queues.  All
      good, right?
      
      The problem shows up, however, if timing is just wrong and we hit a
      race.  To see this race let's think about the case where we only have
      a budget of 1 (only one thread can hold budget).  Now imagine that a
      thread got budget and then decided not to dispatch work.  It's about
      to call put_budget() but then the thread gets context switched out for
      a long, long time.  While in this state, any and all kicks of the
      queue (like when we receive new work) will be no-ops because
      nobody can get budget.  Finally the thread holding budget gets to run
      again and returns.  All the normal kicks will have been no-ops and we
      have an I/O stall.
      
      As you can see from the above, you need just the right timing to see
      the race.  To start with, it only happens if we thought we had work,
      actually managed to get the budget, but then turned out not to have
      any work.  That's pretty rare to start with.  Even then, there's
      usually a very small amount of time between realizing that there's no
      work and putting the budget.  During this small amount of time new
      work has to come in and the queue kick has to make it all the way to
      trying to get the budget and fail.  It's pretty unlikely.
      
      One case where this could have failed is illustrated by an example of
      threads running blk_mq_do_dispatch_sched():
      
      * Threads A and B both run has_work() at the same time with the same
        "hctx".  Imagine has_work() is exact.  There's no lock, so it's OK
        if Thread A and B both get back true.
      * Thread B gets interrupted for a long time right after it decides
        that there is work.  Maybe its CPU gets an interrupt and the
        interrupt handler is slow.
      * Thread A runs, gets budget, and dispatches work.
      * Thread A's work finishes and budget is released.
      * Thread B finally runs again and gets budget.
      * Since Thread A already took care of the work and no new work has
        come in, Thread B will get NULL from dispatch_request().  I believe
        this is specifically why dispatch_request() is allowed to return
        NULL in the first place if has_work() must be exact.
      * Thread B will now be holding the budget and is about to call
        put_budget(), but hasn't called it yet.
      * Thread B gets interrupted for a long time (again).  Dang interrupts.
      * Now Thread C (maybe with a different "hctx" but the same queue)
        comes along and runs blk_mq_do_dispatch_sched().
      * Thread C won't do anything because it can't get budget.
      * Finally Thread B will run again and put the budget without kicking
        any queues.
      
      Even though the example above is with blk_mq_do_dispatch_sched() I
      believe the race is possible any time someone is holding budget but
      doesn't do work.
      
      Unfortunately, the unlikely has become more likely if you happen to be
      using the BFQ I/O scheduler.  BFQ, by design, sometimes returns "true"
      for has_work() but then NULL for dispatch_request() and stays in this
      state for a while (currently up to 9 ms).  Suddenly you only need one
      race to hit, not two races in a row.  With my current setup this is
      easy to reproduce in reboot tests and traces have actually shown that
      we hit a race similar to the one described above.
      
      Note that we only need to fix blk_mq_do_dispatch_sched() and
      blk_mq_do_dispatch_ctx() and not the other places that put budget.  In
      other cases we know that we have work to do on at least one "hctx" and
      code already exists to kick that "hctx"'s queue.  When that work
      finally finishes all the queues will be kicked using the normal flow.
      
      One last note is that (at least in the SCSI case) budget is shared by
      all "hctx"s that have the same queue.  Thus we need to make sure to
      kick the whole queue, not just re-run dispatching on a single "hctx".
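
      A hedged sketch of the fix, modeled on the mainline change; the 3 ms
      constant and the 4.19 form of the elevator ops are assumptions:

          /* Hedged sketch of blk_mq_do_dispatch_sched(): if we grabbed budget
           * but the scheduler handed back no request, kick every hctx of the
           * queue after a short delay so the released budget is not "lost"
           * to the race described above. */
          #define BLK_MQ_BUDGET_DELAY 3 /* ms */

                  if (!blk_mq_get_dispatch_budget(hctx))
                          break;

                  rq = e->type->ops.mq.dispatch_request(hctx);
                  if (!rq) {
                          blk_mq_put_dispatch_budget(hctx);
                          /* Nobody else is guaranteed to kick the queue once
                           * we give the budget back, so do it ourselves. */
                          blk_mq_delay_run_hw_queues(hctx->queue,
                                                     BLK_MQ_BUDGET_DELAY);
                          break;
                  }
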
      Signed-off-by: Douglas Anderson <dianders@chromium.org>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Reviewed-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      8083c82d
    • D
      blk-mq: Add blk_mq_delay_run_hw_queues() API call · 9f7b6ce9
      Douglas Anderson authored
      mainline inclusion
      from mainline-5.8-rc1
      commit b9151e7b
      category: bugfix
      bugzilla: 42781
      CVE: NA
      
      ---------------------------
      
      We have:
      * blk_mq_run_hw_queue()
      * blk_mq_delay_run_hw_queue()
      * blk_mq_run_hw_queues()
      
      ...but not blk_mq_delay_run_hw_queues(), presumably because nobody
      needed it before now.  Since we need it for a later patch in this
      series, add it.
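
      A hedged sketch of the new helper, mirroring the existing
      blk_mq_run_hw_queues() loop (the stopped-hctx check is an assumption):

          /* Run all hardware queues of a request_queue after a delay. */
          void blk_mq_delay_run_hw_queues(struct request_queue *q,
                                          unsigned long msecs)
          {
                  struct blk_mq_hw_ctx *hctx;
                  int i;

                  queue_for_each_hw_ctx(q, hctx, i) {
                          if (blk_mq_hctx_stopped(hctx))
                                  continue;
                          blk_mq_delay_run_hw_queue(hctx, msecs);
                  }
          }
          EXPORT_SYMBOL(blk_mq_delay_run_hw_queues);
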
      Signed-off-by: Douglas Anderson <dianders@chromium.org>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Reviewed-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      9f7b6ce9
    • D
      blk-mq: In blk_mq_dispatch_rq_list() "no budget" is a reason to kick · ee4d180f
      Douglas Anderson authored
      mainline inclusion
      from mainline-5.8-rc1
      commit ab3cee37
      category: bugfix
      bugzilla: 42780
      CVE: NA
      
      ---------------------------
      
      In blk_mq_dispatch_rq_list(), if blk_mq_sched_needs_restart() returns
      true and the driver returns BLK_STS_RESOURCE then we'll kick the
      queue.  However, there's another case where we might need to kick it.
      If we were unable to get budget we can be in much the same state as
      when the driver returns BLK_STS_RESOURCE, so we should treat it the
      same.
      
      It should be noted that even if we add a whole bunch of extra kicking
      to the queue in other patches this patch is still important.
      Specifically any kicking that happened before we re-spliced leftover
      requests into 'hctx->dispatch' wouldn't have found any work, so we
      really need to make sure we kick ourselves after we've done the
      splicing.
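
      A hedged sketch of the tail of blk_mq_dispatch_rq_list() with this
      change; 'no_budget_avail' and the surrounding condition are
      approximated from the mainline patch and may not match the 4.19
      context line for line:

          /* After re-splicing leftover requests into hctx->dispatch, treat
           * "ran out of budget" like BLK_STS_RESOURCE and kick the queue. */
                  needs_restart = blk_mq_sched_needs_restart(hctx);
                  if (!needs_restart ||
                      (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
                          blk_mq_run_hw_queue(hctx, true);
                  else if (needs_restart &&
                           (ret == BLK_STS_RESOURCE || no_budget_avail))
                          blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
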
      Signed-off-by: Douglas Anderson <dianders@chromium.org>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Reviewed-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      ee4d180f
    • J
      blk-mq: Put driver tag in blk_mq_dispatch_rq_list() when no budget · 676b143f
      John Garry authored
      mainline inclusion
      from mainline-5.7-rc2
      commit 5fe56de7
      category: bugfix
      bugzilla: 42779
      CVE: NA
      
      ---------------------------
      
      If in blk_mq_dispatch_rq_list() we find no budget, then we break out
      of the dispatch loop, but the request may keep the driver tag,
      evaluated in 'nxt' in the previous loop iteration.
      
      Fix by putting the driver tag for that request.
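
      A hedged sketch of the fix in the dispatch loop (the commit was
      applied with a conflict in block/blk-mq.c, so the exact context here
      is an approximation):

          /* If budget cannot be obtained, drop the driver tag that may have
           * been grabbed for this request as 'nxt' in the previous iteration
           * before breaking out of the loop. */
                  rq = list_first_entry(list, struct request, queuelist);
                  if (!got_budget && !blk_mq_get_dispatch_budget(hctx)) {
                          blk_mq_put_driver_tag(rq);
                          break;
                  }
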
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: John Garry <john.garry@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      
      Conflict: block/blk-mq.c
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Reviewed-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      676b143f
    • M
      blk-mq: insert passthrough request into hctx->dispatch directly · 1d95b60d
      Ming Lei authored
      mainline inclusion
      from mainline-5.6-rc4
      commit 01e99aec
      category: bugfix
      bugzilla: 42777
      CVE: NA
      
      ---------------------------
      
      For some reason, a device may be in a state in which it can't handle
      FS requests, so STS_RESOURCE is always returned and the FS request
      will be added to hctx->dispatch. However, a passthrough request may
      be required at that time to fix the problem. If the passthrough
      request is added to the scheduler queue, there isn't any chance for
      blk-mq to dispatch it, given that we prioritize requests in
      hctx->dispatch. Then the FS IO request may never be completed, and an
      IO hang is caused.
      
      So passthrough request has to be added to hctx->dispatch directly
      for fixing the IO hang.
      
      Fix this issue by inserting passthrough requests into hctx->dispatch
      directly, together with adding FS requests to the tail of
      hctx->dispatch in blk_mq_dispatch_rq_list(). Actually we add FS
      requests to the tail of hctx->dispatch by default, see
      blk_mq_request_bypass_insert().
      
      This is then consistent with the original legacy IO request path, in
      which passthrough requests are always added to q->queue_head.
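
      A hedged sketch of the insertion decision, approximated from the
      mainline change; the at_head parameter of
      blk_mq_request_bypass_insert() is added by this very patch, and the
      4.19 conflict resolution may differ:

          /* In blk_mq_sched_insert_request(): route passthrough requests
           * straight to hctx->dispatch instead of the scheduler/sw queue, so
           * they cannot be starved behind FS requests stuck on STS_RESOURCE. */
                  if (blk_rq_is_passthrough(rq)) {
                          blk_mq_request_bypass_insert(rq, at_head, false);
                          goto run;
                  }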
      
      Cc: Dongli Zhang <dongli.zhang@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Ewan D. Milne <emilne@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      
      Conflicts:
        block/blk-flush.c
        block/blk-mq.c
        block/blk-mq-sched.c
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Reviewed-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      1d95b60d
    • F
      arm64/ascend: Fix register_persistent_clock definition · 8108557d
      Fang Lijun authored
      ascend inclusion
      category: bugfix
      bugzilla: NA
      CVE: NA
      
      -------------------------------------------------
      
      register_persistent_clock() will be called after kernel init, so it
      cannot be defined as __init.
      
      Fixes: 76ab899d73d6 ("arm64/ascend: Implement the read_persistend_clock64 for aarch64")
      Signed-off-by: Fang Lijun <fanglijun3@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Reviewed-by: Ding Tianhong <dingtianhong@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      8108557d
    • G
      Linux 4.19.150 · 1d41cd90
      Greg Kroah-Hartman authored
      Merge 38 patches from the 4.19.150 stable branch (39 total), excluding
      1 patch that was already merged:
      1c3886dc3023 net/packet: fix overflow in tpacket_rcv
      Tested-by: Jon Hunter <jonathanh@nvidia.com>
      Tested-by: Shuah Khan <skhan@linuxfoundation.org>
      Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
      Tested-by: Guenter Roeck <linux@roeck-us.net>
      Link: https://lore.kernel.org/r/20201005142108.650363140@linuxfoundation.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      1d41cd90
    • W
      netfilter: ctnetlink: add a range check for l3/l4 protonum · 1e44b5bf
      Will McVicker authored
      commit 1cc5ef91 upstream.
      
      The indexes to the nf_nat_l[34]protos arrays come from userspace. So
      check the tuple's family, e.g. l3num, when creating the conntrack in
      order to prevent an OOB memory access during setup.  Here is an example
      kernel panic on 4.14.180 when userspace passes in an index greater than
      NFPROTO_NUMPROTO.
      
      Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
      Modules linked in:...
      Process poc (pid: 5614, stack limit = 0x00000000a3933121)
      CPU: 4 PID: 5614 Comm: poc Tainted: G S      W  O    4.14.180-g051355490483
      Hardware name: Qualcomm Technologies, Inc. SM8150 V2 PM8150 Google Inc. MSM
      task: 000000002a3dfffe task.stack: 00000000a3933121
      pc : __cfi_check_fail+0x1c/0x24
      lr : __cfi_check_fail+0x1c/0x24
      ...
      Call trace:
      __cfi_check_fail+0x1c/0x24
      name_to_dev_t+0x0/0x468
      nfnetlink_parse_nat_setup+0x234/0x258
      ctnetlink_parse_nat_setup+0x4c/0x228
      ctnetlink_new_conntrack+0x590/0xc40
      nfnetlink_rcv_msg+0x31c/0x4d4
      netlink_rcv_skb+0x100/0x184
      nfnetlink_rcv+0xf4/0x180
      netlink_unicast+0x360/0x770
      netlink_sendmsg+0x5a0/0x6a4
      ___sys_sendmsg+0x314/0x46c
      SyS_sendmsg+0xb4/0x108
      el0_svc_naked+0x34/0x38
      
      This crash does not happen since 5.4+; however, ctnetlink still
      allows creating entries with an unsupported layer 3 protocol number.
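
      A hedged sketch of the kind of check the patch adds when the tuple is
      parsed (the exact placement inside ctnetlink and the error code are
      assumptions):

          /* l3num comes straight from the userspace netlink attribute;
           * reject families we do not support before it is used to index
           * the l3/l4 protocol arrays. */
          if (l3num != NFPROTO_IPV4 && l3num != NFPROTO_IPV6)
                  return -EOPNOTSUPP;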
      
      Fixes: c1d10adb ("[NETFILTER]: Add ctnetlink port for nf_conntrack")
      Signed-off-by: Will McVicker <willmcvicker@google.com>
      [pablo@netfilter.org: rebased original patch on top of nf.git]
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      1e44b5bf
    • A
      ep_create_wakeup_source(): dentry name can change under you... · b11afd4d
      Al Viro authored
      commit 3701cb59 upstream.
      
      or get freed, for that matter, if it's a long (separately stored)
      name.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      b11afd4d
    • A
      epoll: EPOLL_CTL_ADD: close the race in decision to take fast path · 333582e0
      Al Viro authored
      commit fe0a916c upstream.
      
      Checking for the lack of epitems referring to the epoll we want to insert into
      is not enough; we might have an insertion of that epoll into another one that
      has already collected the set of files to recheck for excessive reverse paths,
      but hasn't gotten to creating/inserting the epitem for it.
      
      However, any such insertion in progress can be detected - it will update the
      generation count in our epoll when it's done looking through it for files
      to check.  That gets done under ->mtx of our epoll and that allows us to
      detect that safely.
      
      We are *not* holding epmutex here, so the generation count is not stable.
      However, since both the update of ep->gen by loop check and (later)
      insertion into ->f_ep_link are done with ep->mtx held, we are fine -
      the sequence is
      	grab epmutex
      	bump loop_check_gen
      	...
      	grab tep->mtx		// 1
      	tep->gen = loop_check_gen
      	...
      	drop tep->mtx		// 2
      	...
      	grab tep->mtx		// 3
      	...
      	insert into ->f_ep_link
      	...
      	drop tep->mtx		// 4
      	bump loop_check_gen
      	drop epmutex
      and if the fastpath check in another thread happens for that
      eventpoll, it can come
      	* before (1) - in that case fastpath is just fine
      	* after (4) - we'll see non-empty ->f_ep_link, slow path
      taken
      	* between (2) and (3) - loop_check_gen is stable,
      with ->mtx providing barriers and we end up taking slow path.
      
      Note that ->f_ep_link emptiness check is slightly racy - we are protected
      against insertions into that list, but removals can happen right under us.
      Not a problem - in the worst case we'll end up taking a slow path for
      no good reason.
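
      A hedged sketch of the resulting fastpath decision; the surrounding
      condition is approximated, and only the ep->gen == loop_check_gen test
      is the point of this patch:

          /* EPOLL_CTL_ADD: take the slow (loop-checking) path if our epoll
           * already has watchers, or if a loop check has marked it in the
           * current generation, i.e. an insertion is in flight. */
                  if (!list_empty(&f.file->f_ep_links) ||
                      ep->gen == loop_check_gen)
                          full_check = 1;
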
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      333582e0
    • A
      epoll: replace ->visited/visited_list with generation count · e7beed09
      Al Viro authored
      commit 18306c40 upstream.
      
      removes the need to clear it, along with the races.
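
      A hedged sketch of the generation-count scheme that replaces the
      visited flag and visited_list (names follow the upstream patch):

          static u64 loop_check_gen = 0;  /* bumped once per loop-check pass */

          /* In ep_loop_check_proc(), instead of setting ep->visited and
           * collecting the epoll on visited_list for later clearing: */
                  if (ep->gen == loop_check_gen) /* already walked this pass */
                          return 0;
                  ep->gen = loop_check_gen;      /* nothing to clear afterwards */
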
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      e7beed09
    • L
      mm: don't rely on system state to detect hot-plug operations · 29b493b4
      Laurent Dufour authored
      commit f85086f9 upstream.
      
      In register_mem_sect_under_node() the system_state's value is checked to
      detect whether the call is made during boot time or during an hot-plug
      operation.  Unfortunately, that check against SYSTEM_BOOTING is wrong
      because regular memory is registered at SYSTEM_SCHEDULING state.  In
      addition, memory hot-plug operation can be triggered at this system
      state by the ACPI [1].  So checking against the system state is not
      enough.
      
      The consequence is that on systems with interleaved node ranges like this:
      
       Early memory node ranges
         node   1: [mem 0x0000000000000000-0x000000011fffffff]
         node   2: [mem 0x0000000120000000-0x000000014fffffff]
         node   1: [mem 0x0000000150000000-0x00000001ffffffff]
         node   0: [mem 0x0000000200000000-0x000000048fffffff]
         node   2: [mem 0x0000000490000000-0x00000007ffffffff]
      
      This can be seen on a PowerPC LPAR after multiple memory hot-plug and
      hot-unplug operations are done.  At the next reboot the node's memory
      ranges can be interleaved, and since the call to link_mem_sections()
      is made in topology_init() while the system is in the
      SYSTEM_SCHEDULING state, the node's id is not checked, and the
      sections are registered to multiple nodes:
      
        $ ls -l /sys/devices/system/memory/memory21/node*
        total 0
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node1 -> ../../node/node1
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node2 -> ../../node/node2
      
      In that case, the system is able to boot, but if one of these memory
      blocks is later hot-unplugged and then hot-plugged, the sysfs
      inconsistency is detected and this triggers a BUG_ON():
      
        kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
        Oops: Exception in kernel mode, sig: 5 [#1]
        LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
        CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
        Call Trace:
          add_memory_resource+0x23c/0x340 (unreliable)
          __add_memory+0x5c/0xf0
          dlpar_add_lmb+0x1b4/0x500
          dlpar_memory+0x1f8/0xb80
          handle_dlpar_errorlog+0xc0/0x190
          dlpar_store+0x198/0x4a0
          kobj_attr_store+0x30/0x50
          sysfs_kf_write+0x64/0x90
          kernfs_fop_write+0x1b0/0x290
          vfs_write+0xe8/0x290
          ksys_write+0xdc/0x130
          system_call_exception+0x160/0x270
          system_call_common+0xf0/0x27c
      
      This patch addresses the root cause by not relying on the system_state
      value to detect whether the call is due to a hot-plug operation.  An
      extra parameter is added to link_mem_sections() detailing whether the
      operation is due to a hot-plug operation.
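
      A hedged sketch of the resulting interface (the enum values come from
      the companion "meminit_context" patch; the 4.19 backport may differ
      in detail):

          /* Callers state explicitly why the sections are being linked,
           * instead of the callee guessing from system_state. */
          int link_mem_sections(int nid, unsigned long start_pfn,
                                unsigned long end_pfn,
                                enum meminit_context context);

          /* boot-time registration, e.g. from topology_init() */
          link_mem_sections(nid, start_pfn, end_pfn, MEMINIT_EARLY);

          /* hot-plug path, e.g. from add_memory_resource() */
          link_mem_sections(nid, start_pfn, end_pfn, MEMINIT_HOTPLUG);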
      
      [1] According to Oscar Salvador, using this qemu command line, ACPI
      memory hotplug operations are raised at SYSTEM_SCHEDULING state:
      
        $QEMU -enable-kvm -machine pc -smp 4,sockets=4,cores=1,threads=1 -cpu host -monitor pty \
              -m size=$MEM,slots=255,maxmem=4294967296k  \
              -numa node,nodeid=0,cpus=0-3,mem=512 -numa node,nodeid=1,mem=512 \
              -object memory-backend-ram,id=memdimm0,size=134217728 -device pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0 \
              -object memory-backend-ram,id=memdimm1,size=134217728 -device pc-dimm,node=0,memdev=memdimm1,id=dimm1,slot=1 \
              -object memory-backend-ram,id=memdimm2,size=134217728 -device pc-dimm,node=0,memdev=memdimm2,id=dimm2,slot=2 \
              -object memory-backend-ram,id=memdimm3,size=134217728 -device pc-dimm,node=0,memdev=memdimm3,id=dimm3,slot=3 \
              -object memory-backend-ram,id=memdimm4,size=134217728 -device pc-dimm,node=1,memdev=memdimm4,id=dimm4,slot=4 \
              -object memory-backend-ram,id=memdimm5,size=134217728 -device pc-dimm,node=1,memdev=memdimm5,id=dimm5,slot=5 \
              -object memory-backend-ram,id=memdimm6,size=134217728 -device pc-dimm,node=1,memdev=memdimm6,id=dimm6,slot=6 \
      
      Fixes: 4fbce633 ("mm/memory_hotplug.c: make register_mem_sect_under_node() a callback of walk_memory_range()")
      Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200915094143.79181-3-ldufour@linux.ibm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      29b493b4
    • L
      mm: replace memmap_context by meminit_context · bf133f4f
      Laurent Dufour authored
      commit c1d0da83 upstream.
      
      Patch series "mm: fix memory to node bad links in sysfs", v3.
      
      Sometimes, firmware may expose interleaved memory layout like this:
      
       Early memory node ranges
         node   1: [mem 0x0000000000000000-0x000000011fffffff]
         node   2: [mem 0x0000000120000000-0x000000014fffffff]
         node   1: [mem 0x0000000150000000-0x00000001ffffffff]
         node   0: [mem 0x0000000200000000-0x000000048fffffff]
         node   2: [mem 0x0000000490000000-0x00000007ffffffff]
      
      In that case, we can see memory blocks assigned to multiple nodes in
      sysfs:
      
        $ ls -l /sys/devices/system/memory/memory21
        total 0
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node1 -> ../../node/node1
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node2 -> ../../node/node2
        -rw-r--r-- 1 root root 65536 Aug 24 05:27 online
        -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device
        -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index
        drwxr-xr-x 2 root root     0 Aug 24 05:27 power
        -r--r--r-- 1 root root 65536 Aug 24 05:27 removable
        -rw-r--r-- 1 root root 65536 Aug 24 05:27 state
        lrwxrwxrwx 1 root root     0 Aug 24 05:25 subsystem -> ../../../../bus/memory
        -rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent
        -r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones
      
      The same applies in the node's directory with a memory21 link in both
      the node1 and node2's directory.
      
      This is wrong but doesn't prevent the system from running.  However,
      when one of these memory blocks is later hot-unplugged and then
      hot-plugged, the system detects an inconsistency in the sysfs layout
      and a BUG_ON() is raised:
      
        kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
        LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
        CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
        Call Trace:
          add_memory_resource+0x23c/0x340 (unreliable)
          __add_memory+0x5c/0xf0
          dlpar_add_lmb+0x1b4/0x500
          dlpar_memory+0x1f8/0xb80
          handle_dlpar_errorlog+0xc0/0x190
          dlpar_store+0x198/0x4a0
          kobj_attr_store+0x30/0x50
          sysfs_kf_write+0x64/0x90
          kernfs_fop_write+0x1b0/0x290
          vfs_write+0xe8/0x290
          ksys_write+0xdc/0x130
          system_call_exception+0x160/0x270
          system_call_common+0xf0/0x27c
      
      This has been seen on PowerPC LPAR.
      
      The root cause of this issue is that when node's memory is registered,
      the range used can overlap another node's range, thus the memory block
      is registered to multiple nodes in sysfs.
      
      There are two issues here:
      
       (a) The sysfs memory and node's layouts are broken due to these
           multiple links
      
       (b) The link errors in link_mem_sections() should not lead to a system
           panic.
      
      To address (a), register_mem_sect_under_node() should not rely on the
      system state to detect whether the link operation is triggered by a
      hot-plug operation or not.  This is addressed by patches 1 and 2 of
      this series.
      
      Issue (b) will be addressed separately.
      
      This patch (of 2):
      
      The memmap_context enum is used to detect whether a memory operation is
      due to a hot-add operation or happening at boot time.
      
      Make it general to the hotplug operation and rename it as
      meminit_context.
      
      There is no functional change introduced by this patch.
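
      A hedged sketch of the rename (old and new names as described above):

          /* Was: enum memmap_context { MEMMAP_EARLY, MEMMAP_HOTPLUG }; */
          enum meminit_context {
                  MEMINIT_EARLY,
                  MEMINIT_HOTPLUG,
          };
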
      Suggested-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J . Wysocki" <rafael@kernel.org>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com
      Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      bf133f4f
    • T
      random32: Restore __latent_entropy attribute on net_rand_state · 88ecb228
      Thibaut Sautereau authored
      [ Upstream commit 09a6b0bc ]
      
      Commit f227e3ec ("random32: update the net random state on interrupt
      and activity") broke compilation and was temporarily fixed by Linus in
      83bdc727 ("random32: remove net_rand_state from the latent entropy
      gcc plugin") by entirely moving net_rand_state out of the things handled
      by the latent_entropy GCC plugin.
      
      From what I understand when reading the plugin code, using the
      __latent_entropy attribute on a declaration was the wrong part and
      simply keeping the __latent_entropy attribute on the variable definition
      was the correct fix.
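
      A hedged sketch of the resulting definition in lib/random32.c (the
      exact file layout and the attribute-free declaration elsewhere are
      assumptions):

          /* The definition carries __latent_entropy so the GCC plugin can
           * seed it; the extern declaration in the header does not. */
          DEFINE_PER_CPU(struct rnd_state, net_rand_state) __latent_entropy;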
      
      Fixes: 83bdc727 ("random32: remove net_rand_state from the latent entropy gcc plugin")
      Acked-by: Willy Tarreau <w@1wt.eu>
      Cc: Emese Revfy <re.emese@gmail.com>
      Signed-off-by: Thibaut Sautereau <thibaut.sautereau@ssi.gouv.fr>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      88ecb228