1. 18 March 2020, 40 commits
    • blk-mq: Add a NULL check in blk_mq_free_map_and_requests() · 5f0efd1f
      Dan Carpenter authored
      commit 4e6db0f21c99c25980c8d183f95cdb6ad64cebd2 upstream.
      
      I recently found some code which called blk_mq_free_map_and_requests()
      with a NULL set->tags pointer.  I fixed the caller, but it seems like a
      good idea to add a NULL check here as well.  Now we can call:
      
      	blk_mq_free_tag_set(set);
      	blk_mq_free_tag_set(set);
      
      twice in a row and it's harmless.
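
      A minimal sketch of the resulting check, assuming the 4.19-era layout of
      blk_mq_free_map_and_requests() rather than the verbatim upstream diff:

          static void blk_mq_free_map_and_requests(struct blk_mq_tag_set *set,
                                                   unsigned int hctx_idx)
          {
                  /* set->tags is NULL once the tag set has been torn down, so a
                   * repeated blk_mq_free_tag_set() becomes a harmless no-op */
                  if (set->tags && set->tags[hctx_idx]) {
                          blk_mq_free_rqs(set, set->tags[hctx_idx], hctx_idx);
                          blk_mq_free_rq_map(set->tags[hctx_idx]);
                          set->tags[hctx_idx] = NULL;
                  }
          }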
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
    • blk-mq: place trace_block_getrq() in correct place · d23f53bf
      Xiaoguang Wang authored
      commit d6f1dda27251909a27b8d8aacb498628a1047978 upstream.
      
      trace_block_getrq() is meant to indicate that a request struct has been
      allocated for the queue, so place it where that has actually happened.
      Reviewed-by: Jianchao Wang <jianchao.w.wang@oracle.com>
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
    • blk-mq: protect debugfs_create_files() from failures · 9ec13b19
      Greg Kroah-Hartman authored
      commit 36991ca68db9dd43bac7f3519f080ee3939263ef upstream.
      
      If debugfs were to return a non-NULL error for a debugfs call, using
      that pointer later in debugfs_create_files() would crash.
      
      Fix that by properly checking the pointer before referencing it.
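
      A sketch of the guard, assuming the helper's 4.19-era signature:

          static bool debugfs_create_files(struct dentry *parent, void *data,
                                           const struct blk_mq_debugfs_attr *attr)
          {
                  /* debugfs may hand back NULL or an ERR_PTR; never dereference it */
                  if (IS_ERR_OR_NULL(parent))
                          return false;

                  d_inode(parent)->i_private = data;
                  /* ... create one debugfs file per attr entry, as before ... */
                  return true;
          }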
      Reported-by: Michal Hocko <mhocko@kernel.org>
      Reported-and-tested-by: syzbot+b382ba6a802a3d242790@syzkaller.appspotmail.com
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
    • blk-mq: not embed .mq_kobj and ctx->kobj into queue instance · 9ff28240
      Ming Lei authored
      commit 1db4909e76f64a85f4aaa187f0f683f5c85a471d upstream.
      
      Even though .mq_kobj, ctx->kobj and q->kobj share the same lifetime
      from the block layer's view, in reality they don't, because userspace
      may grab a kobject at any time via sysfs.
      
      This patch fixes the issue by the following approach:
      
      1) introduce 'struct blk_mq_ctxs' for holding .mq_kobj and managing
      all ctxs
      
      2) free all allocated ctxs and the 'blk_mq_ctxs' instance in release
      handler of .mq_kobj
      
      3) grab one ref of .mq_kobj before initializing each ctx->kobj, so that
      .mq_kobj is always released after all ctxs are freed.
      
      This patch fixes a kernel panic during boot when DEBUG_KOBJECT_RELEASE
      is enabled.
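
      A sketch of the approach described above, following the upstream commit;
      the exact function names may differ slightly in the backport:

          struct blk_mq_ctxs {
                  struct kobject kobj;                    /* the former q->mq_kobj */
                  struct blk_mq_ctx __percpu *queue_ctx;  /* all per-cpu ctxs */
          };

          static void blk_mq_sysfs_release(struct kobject *kobj)
          {
                  /* last reference dropped: now all ctxs can be freed at once */
                  struct blk_mq_ctxs *ctxs =
                          container_of(kobj, struct blk_mq_ctxs, kobj);

                  free_percpu(ctxs->queue_ctx);
                  kfree(ctxs);
          }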
      Reported-by: Guenter Roeck <linux@roeck-us.net>
      Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
      Tested-by: Guenter Roeck <linux@roeck-us.net>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
    • blk-mq: fallback to previous nr_hw_queues when updating fails · 4f3484ac
      Jianchao Wang authored
      commit e01ad46d53b59720c6ae69963ee1756506954c85 upstream.
      
      When we try to increase nr_hw_queues, we may fail due to a shortage of
      memory or some other reason, in which case blk_mq_realloc_hw_ctxs stops
      and some entries in q->queue_hw_ctx are left NULL. However, because the
      queue map has been updated with the new nr_hw_queues, some cpus have
      been mapped to a hw queue that just encountered the allocation failure,
      so blk_mq_map_queue could return NULL. This causes a panic in the
      following blk_mq_map_swqueue.
      
      To fix it, when increasing nr_hw_queues fails, fall back to the previous
      nr_hw_queues and post a warning. At the same time, a driver's
      .map_queues usually uses the completion irq affinity to map hw queues
      and cpus; with the fallback nr_hw_queues, some cpus would be left
      without a mapping to a hw queue, so use the default blk_mq_map_queues
      to do the mapping instead.
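
      A sketch of the fallback loop in __blk_mq_update_nr_hw_queues(),
      following the upstream commit:

          prev_nr_hw_queues = set->nr_hw_queues;
          set->nr_hw_queues = nr_hw_queues;
          blk_mq_update_queue_map(set);
      fallback:
          list_for_each_entry(q, &set->tag_list, tag_set_list) {
                  blk_mq_realloc_hw_ctxs(set, q);
                  if (q->nr_hw_queues != set->nr_hw_queues) {
                          pr_warn("Increasing nr_hw_queues to %d fails, fallback to %d\n",
                                  nr_hw_queues, prev_nr_hw_queues);
                          /* retry with the old count and a safe default mapping */
                          set->nr_hw_queues = prev_nr_hw_queues;
                          blk_mq_map_queues(set);
                          goto fallback;
                  }
                  blk_mq_map_swqueue(q);
          }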
      
      Reported-by: syzbot+83e8cbe702263932d9d4@syzkaller.appspotmail.com
      Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
    • blk-mq: realloc hctx when hw queue is mapped to another node · 0f44f194
      Jianchao Wang authored
      commit 34d11ffac1f56c3895dad32153abd6814452dc77 upstream.
      
      When the hw queues and mq_map are updated, a hctx could be mapped
      to a different numa node. At that point, we need to realloc the
      hctx. If that fails, keep using the previous hctx.
      Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
    • blk-mq: adjust debugfs and sysfs register when updating nr_hw_queues · 1d5199ea
      Jianchao Wang authored
      commit 477e19dedc9d3e1f4443a1d4ae00572a988120ea upstream.
      
      blk-mq debugfs and sysfs entries need to be removed before updating
      the queue map, otherwise we get wrong results there. This patch fixes
      that and removes the redundant debugfs and sysfs register/unregister
      operations during __blk_mq_update_nr_hw_queues.
      Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
    • mm/memblock.c: skip kmemleak for kasan_init() · 9f093569
      Qian Cai authored
      commit fed84c78527009d4f799a3ed9a566502fa026d82 upstream.
      
      Kmemleak does not play well with KASAN (tested on both HPE Apollo 70 and
      Huawei TaiShan 2280 aarch64 servers).
      
      After calling start_kernel()->setup_arch()->kasan_init(), the kmemleak
      early log buffer grew from something like 280 entries to 260000, which
      disabled kmemleak and made the crash dump memory reservation fail. The
      multitude of kmemleak_alloc() calls comes from nested loops while KASAN
      is setting up the full memory mappings, so let early kmemleak
      allocations skip the memblock_alloc_internal() calls that come from
      kasan_init(), given that those early KASAN memory mappings should not
      reference other memory. Hence, no kmemleak false positives.
      
      kasan_init
        kasan_map_populate [1]
          kasan_pgd_populate [2]
            kasan_pud_populate [3]
              kasan_pmd_populate [4]
                kasan_pte_populate [5]
                  kasan_alloc_zeroed_page
                    memblock_alloc_try_nid
                      memblock_alloc_internal
                        kmemleak_alloc
      
      [1] for_each_memblock(memory, reg)
      [2] while (pgdp++, addr = next, addr != end)
      [3] while (pudp++, addr = next, addr != end && pud_none(READ_ONCE(*pudp)))
      [4] while (pmdp++, addr = next, addr != end && pmd_none(READ_ONCE(*pmdp)))
      [5] while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep)))
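
      A sketch of the mechanism, assuming the MEMBLOCK_ALLOC_KASAN sentinel
      that the upstream commit introduces for the max_addr argument:

          /* special max_addr value: allocation is for KASAN, skip kmemleak */
          #define MEMBLOCK_ALLOC_KASAN    1

          /* in memblock_alloc_internal(), after a successful allocation: */
          if (max_addr != MEMBLOCK_ALLOC_KASAN)
                  /* early KASAN shadow pages never reference other objects */
                  kmemleak_alloc(ptr, size, 0, 0);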
      
      Link: http://lkml.kernel.org/r/1543442925-17794-1-git-send-email-cai@gmx.us
      Signed-off-by: Qian Cai <cai@gmx.us>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
    • alinux: jbd2: track slow handle which is preventing transaction committing · 83cd9d23
      Xiaoguang Wang authored
      While a transaction is going to commit, it first sets its state to
      T_LOCKED and waits for all outstanding handles to complete; the
      committing transaction stays in the locked state as long as it has
      outstanding handles. Meanwhile the whole fs is locked, and all later fs
      modification operations get stuck in wait_transaction_locked().
      
      It's hard to tell why handles are that slow, so here we add a new static
      tracepoint to track such slow handles and show the io wait time and
      sched wait time. The output looks like below:
        fsstress-20347 [024] ....  1570.305454: jbd2_slow_handle_stats: dev 254,17
      tid 15853 type 4 line_no 3101 interval 126 sync 0 requested_blocks 24
      dirtied_blocks 0 trans_wait 122 space_wait 0 sched_wait 0 io_wait 126
      
      "trans_wait 122" means that the currently committing transaction has
      been locked for 122ms because this handle did not complete quickly.
      
      From "io_wait 126", we can see that io is the major reason.
      
      This patch also adds a per-fs control file used to determine whether a
      handle is considered slow:
          /proc/fs/jbd2/vdb1-8/stall_thresh
      The default value is 100ms; users can set a new threshold by echoing a
      value to this file.
      
      Later I also plan to add a per-fs proc file to record this info.
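
      A hypothetical sketch of the stall check; the field and helper names
      here are illustrative, not the verbatim alinux patch:

          /* on handle completion, in the jbd2 stop-handle path */
          u64 interval = jiffies_to_msecs(jiffies - handle->h_start_jiffies);

          /* only handles that outlive the per-fs stall_thresh are traced */
          if (interval >= journal->j_stall_thresh)
                  trace_jbd2_slow_handle_stats(handle);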
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
    • alinux: fs: record page or bio info while process is waiting on it · 2864376f
      Xiaoguang Wang authored
      If a process context is stuck in wait_on_buffer(), lock_buffer(),
      lock_page(), wait_on_page_writeback() or wait_on_bit_io(), it's hard
      to tell the true reason, for example whether the page is under io or
      is just being held locked too long by another process context.
      
      Normally an io request has multiple bios, and every bio contains
      multiple pages which hold the data to be read from or written to the
      device, so here we record the page or bio info in task_struct while the
      process calls lock_page(), lock_buffer(), wait_on_page_writeback(),
      wait_on_buffer() or wait_on_bit_io(). We add a new proc interface:
      [lege@localhost linux]$ cat /proc/4516/wait_res
      1 ffffd0969f95d3c0 4295369599 4295381596
      
      The above info means that thread 4516 is waiting on a page whose
      address is ffffd0969f95d3c0 and has waited for 11997ms.
      
      The first field denotes the page address the process is waiting on.
      The second field denotes the wait moment and the third denotes the
      current moment.
      
      In practice, if we find a process waiting on one page for too long, we
      can get the page's address by reading /proc/$pid/wait_res, then search
      for this address in all block devices'
      /sys/kernel/debug/block/${devname}/rq_hang; if the search hits one, we
      get the request and know why this io request hangs that long.
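
      A hypothetical sketch of the bookkeeping; the task_struct field names
      are illustrative:

          static inline void task_set_wait_res(struct page *page)
          {
                  current->wait_res_addr  = (unsigned long)page; /* waited-on object */
                  current->wait_res_start = jiffies;             /* when waiting began */
          }

          static inline void task_clear_wait_res(void)
          {
                  current->wait_res_addr = 0;    /* wait finished */
          }

      lock_page() and friends would call these around their io_schedule()
      waits, and /proc/$pid/wait_res would simply print the recorded fields.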
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
    • alinux: blk: add iohang check function · 80d6ee24
      Xiaoguang Wang authored
      Background:
        We do not have a dependable block layer interface to determine
      whether a block device has io requests that have not completed for a
      long time. The current 'in_flight' interface counts the number of io
      requests that have been issued to the device driver but have not yet
      completed, and it does not include io requests that are in the queue
      but not yet issued to the device driver, which means it will not count
      io requests that are stuck inside the block layer.
        Also, with a steady stream of io requests issued to the device
      driver, 'in_flight' may always be non-zero, but you cannot determine
      whether any single io request has been outstanding for too long.
      
      Solution:
        To find io requests which have not completed for too long, this
      patch adds 3 new interfaces:
        /sys/block/vdb/queue/hang_threshold
      If an io request's running time has been greater than this value,
      count this io as hung.
      
        /sys/block/vdb/hang
      Show the read/write io requests' hang counters.
      
        /sys/kernel/debug/block/vdb/rq_hang
      Show detailed info for all hung io requests, like below:
        ffff97db96301200 {.op=WRITE, .cmd_flags=SYNC, .rq_flags=STARTED|
      ELVPRIV|IO_STAT|STATS, .state=in_flight, .tag=30, .internal_tag=169,
      .start_time_ns=140634088407, .io_start_time_ns=140634102958,
      .current_time=146497371953, .bio = ffff97db91e8e000,
      .bio_pages = { ffffd096a0602540 }, .bio = ffff97db91e8ec00,
      .bio_pages = { ffffd096a070eec0 }, .bio = ffff97db91e8f600,
      .bio_pages = { ffffd096a0424cc0 }, .bio = ffff97db91e8f300,
      .bio_pages = { ffffd096a0600a80 }}
      
      With the above info, we can easily see this request's latency
      distribution; see the next patch for how bio_pages is used.
      
      Note, /sys/kernel/debug/block/vdb/rq_hang only exists for blk-mq device
      drivers and needs CONFIG_BLK_DEBUG_FS enabled.
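
      A hypothetical sketch of the hang test; the names are illustrative:

          static bool blk_rq_hung(struct request *rq, u64 now_ns, u64 thresh_ms)
          {
                  /* a started request older than hang_threshold counts as hung */
                  return now_ns - rq->start_time_ns > thresh_ms * NSEC_PER_MSEC;
          }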
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • tpm: tpm_tis_spi: Introduce a flow control callback · 4cf2f5de
      Stephen Boyd authored
      commit 8ab5e82afa969b65b286d8949c12d2a64c83960c upstream.
      
      Cr50 firmware has a different flow control protocol than the one used by
      this TPM PTP SPI driver. Introduce a flow control callback so we can
      override the standard sequence with the custom one that Cr50 uses.
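
      A sketch of the hook, following the upstream commit:

          struct tpm_tis_spi_phy {
                  struct tpm_tis_data priv;
                  struct spi_device *spi_device;
                  /* overridable wait-state handling; Cr50 installs its own */
                  int (*flow_control)(struct tpm_tis_spi_phy *phy,
                                      struct spi_transfer *xfer);
                  u8 *iobuf;
          };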
      
      Cc: Andrey Pronin <apronin@chromium.org>
      Cc: Duncan Laurie <dlaurie@chromium.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <groeck@chromium.org>
      Cc: Alexander Steffen <Alexander.Steffen@infineon.com>
      Cc: Heiko Stuebner <heiko@sntech.de>
      Signed-off-by: Stephen Boyd <swboyd@chromium.org>
      Tested-by: Heiko Stuebner <heiko@sntech.de>
      Reviewed-by: Heiko Stuebner <heiko@sntech.de>
      Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
      Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
    • tcp: Add snd_wnd to TCP_INFO · ecee8235
      Thomas Higdon authored
      commit 8f7baad7f03543451af27f5380fc816b008aa1f2 upstream
      
      Neal Cardwell mentioned that snd_wnd would be useful for diagnosing TCP
      performance problems --
      > (1) Usually when we're diagnosing TCP performance problems, we do so
      > from the sender, since the sender makes most of the
      > performance-critical decisions (cwnd, pacing, TSO size, TSQ, etc).
      > From the sender-side the thing that would be most useful is to see
      > tp->snd_wnd, the receive window that the receiver has advertised to
      > the sender.
      
      This serves the purpose of adding an additional __u32 to avoid the
      would-be hole caused by the addition of the tcpi_rcv_ooopack field.
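
      A sketch of the change, following the upstream commit:

          /* include/uapi/linux/tcp.h: appended to struct tcp_info, filling
           * the hole after tcpi_rcv_ooopack */
          __u32 tcpi_snd_wnd;  /* peer-advertised receive window, after scaling (bytes) */

          /* net/ipv4/tcp.c, in tcp_get_info(): */
          info->tcpi_snd_wnd = tp->snd_wnd;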
      Signed-off-by: Thomas Higdon <tph@fb.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
      Acked-by: Dust Li <dust.li@linux.alibaba.com>
    • tcp: Add TCP_INFO counter for packets received out-of-order · 0107d737
      Thomas Higdon authored
      commit f9af2dbbfe01def62765a58af7fbc488351893c3 upstream
      
      For receive-heavy cases on the server-side, we want to track the
      connection quality for individual client IPs. This counter, similar to
      the existing system-wide TCPOFOQueue counter in /proc/net/netstat,
      tracks out-of-order packet reception. By providing this counter in
      TCP_INFO, it will allow understanding to what degree receive-heavy
      sockets are experiencing out-of-order delivery and packet drops
      indicating congestion.
      
      Please note that this is similar to the counter in NetBSD TCP_INFO, and
      has the same name.
      
      Also note that we avoid increasing the size of the tcp_sock struct by
      taking advantage of a hole.
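
      A sketch of the counting and reporting sites, following the upstream
      commit:

          /* net/ipv4/tcp_input.c, in tcp_data_queue_ofo(): a GSO skb may
           * carry several segments, so count at least one */
          tp->rcv_ooopack += max_t(u16, 1, skb_shinfo(skb)->gso_segs);

          /* net/ipv4/tcp.c, in tcp_get_info(): */
          info->tcpi_rcv_ooopack = tp->rcv_ooopack;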
      Signed-off-by: Thomas Higdon <tph@fb.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
      Acked-by: Dust Li <dust.li@linux.alibaba.com>
    • alinux: mm,memcg: export memory.{min,low} to cgroup v1 · 3e655c51
      Xu Yu authored
      Export "memory.min" and "memory.low" from cgroup v2 to v1.
      
      There is a subtle difference between v1 and v2: no task is allowed in
      intermediate memcgs under the v2 hierarchy, which can lead to different
      behaviour between the two. Protection requires all the intermediate
      nodes to have memory.min|low set; we must keep this in mind when using
      this feature under v1.
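
      A sketch of the export, assuming the cgroup v2 handlers are simply
      reused in the v1 cftype table:

          /* mm/memcontrol.c: entries appended to mem_cgroup_legacy_files[] */
          {
                  .name = "min",
                  .flags = CFTYPE_NOT_ON_ROOT,
                  .seq_show = memory_min_show,
                  .write = memory_min_write,
          },
          {
                  .name = "low",
                  .flags = CFTYPE_NOT_ON_ROOT,
                  .seq_show = memory_low_show,
                  .write = memory_low_write,
          },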
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
    • alinux: mm,memcg: export memory.{events,events.local} to v1 · 94285517
      Xu Yu authored
      Export "memory.events" and "memory.events.local" from cgroup v2 to v1.
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
    • mm: don't raise MEMCG_OOM event due to failed high-order allocation · 5213c279
      Roman Gushchin authored
      commit 7a1adfddaf0d11a39fdcaf6e82a88e9c0586e08b upstream.
      
      It was reported that on some of our machines containers were restarted
      with OOM symptoms without an obvious reason. Although there was almost
      no memory pressure and plenty of page cache, the MEMCG_OOM event was
      raised occasionally, causing the container management software to think
      that OOM had happened. However, no tasks had been killed.
      
      The following investigation showed that the problem is caused by a failing
      attempt to charge a high-order page.  In such case, the OOM killer is
      never invoked.  As shown below, it can happen under conditions, which are
      very far from a real OOM: e.g.  there is plenty of clean page cache and no
      memory pressure.
      
      There is no sense in raising an OOM event in this case, as it might
      confuse a user and lead to wrong and excessive actions (e.g.  restart the
      workload, as in my case).
      
      Let's look at the charging path in try_charge().  If the memory usage is
      about memory.max, which is absolutely natural for most memory cgroups, we
      try to reclaim some pages.  Even if we were able to reclaim enough memory
      for the allocation, the following check can fail due to a race with
      another concurrent allocation:
      
          if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
              goto retry;
      
      For regular pages the following condition will save us from triggering
      the OOM:
      
         if (nr_reclaimed && nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER))
             goto retry;
      
      But for high-order allocations this condition will intentionally fail.
      The reasoning behind it is that we'll likely fall back to regular pages
      anyway, so it's ok and even preferred to return ENOMEM.
      
      In this case the idea of raising MEMCG_OOM looks dubious.
      
      Fix this by moving MEMCG_OOM raising to mem_cgroup_oom() after allocation
      order check, so that the event won't be raised for high order allocations.
      This change doesn't affect regular pages allocation and charging.
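
      A sketch of the move, following the upstream commit:

          static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg,
                                                gfp_t mask, int order)
          {
                  /* costly allocations fall back to order-0 anyway, so a failed
                   * high-order charge is not a real OOM: skip the event */
                  if (order > PAGE_ALLOC_COSTLY_ORDER)
                          return OOM_SKIPPED;

                  memcg_memory_event(memcg, MEMCG_OOM);
                  /* ... invoke the OOM killer as before ... */
          }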
      
      Link: http://lkml.kernel.org/r/20181004214050.7417-1-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
    • mm, memcg: introduce memory.events.local · de7ca746
      Shakeel Butt authored
      commit 1e577f970f66a53d429cbee37b36177c9712f488 upstream.
      
      The memory controller in cgroup v2 exposes memory.events file for each
      memcg which shows the number of times events like low, high, max, oom
      and oom_kill have happened for the whole tree rooted at that memcg.
      Users can also poll or register notification to monitor the changes in
      that file.  Any event at any level of the tree rooted at memcg will
      notify all the listeners along the path till root_mem_cgroup.  There are
      existing users which depend on this behavior.
      
      However there are users which are only interested in the events
      happening at a specific level of the memcg tree and not in the events in
      the underlying tree rooted at that memcg.  One such use-case is a
      centralized resource monitor which can dynamically adjust the limits of
      the jobs running on a system.  The jobs can create their sub-hierarchy
      for their own sub-tasks.  The centralized monitor is only interested in
      the events at the top level memcgs of the jobs as it can then act and
      adjust the limits of the jobs.  Using the current memory.events for such
      centralized monitor is very inconvenient.  The monitor will keep
      receiving events which it is not interested and to find if the received
      event is interesting, it has to read memory.event files of the next
      level and compare it with the top level one.  So, let's introduce
      memory.events.local to the memcg which shows and notify for the events
      at the memcg level.
      
      Now, do memory.stat and memory.pressure need their local versions? IMHO
      no, due to the no-internal-process constraint of cgroup v2. The
      memory.stat file of the top level memcg of a job shows the stats and
      vmevents of the whole tree.  The local stats or vmevents of the top level
      memcg will only change if there is a process running in that memcg but v2
      does not allow that.  Similarly for memory.pressure there will not be any
      process in the internal nodes and thus no chance of local pressure.
      
      Link: http://lkml.kernel.org/r/20190527174643.209172-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Chris Down <chris@chrisdown.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
    • mm, memcg: consider subtrees in memory.events · 0a198bb9
      Chris Down authored
      commit 9852ae3fe5293264f01c49f2571ef7688f7823ce upstream.
      
      memory.stat and other files already consider subtrees in their output, and
      we should too in order to not present an inconsistent interface.
      
      The current situation is fairly confusing, because people interacting with
      cgroups expect hierarchical behaviour in the vein of memory.stat,
      cgroup.events, and other files.  For example, this causes confusion when
      debugging reclaim events under low, as currently these always read "0" at
      non-leaf memcg nodes, which frequently causes people to misdiagnose breach
      behaviour.  The same confusion applies to other counters in this file when
      debugging issues.
      
      Aggregation is done at write time instead of at read-time since these
      counters aren't hot (unlike memory.stat which is per-page, so it does it
      at read time), and it makes sense to bundle this with the file
      notifications.
      
      After this patch, events are propagated up the hierarchy:
      
          [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
          low 0
          high 0
          max 0
          oom 0
          oom_kill 0
          [root@ktst ~]# systemd-run -p MemoryMax=1 true
          Running as unit: run-r251162a189fb4562b9dabfdc9b0422f5.service
          [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
          low 0
          high 0
          max 7
          oom 1
          oom_kill 1
      
      As this is a change in behaviour, this can be reverted to the old
      behaviour by mounting with the `memory_localevents' flag set.  However, we
      use the new behaviour by default as there's a lack of evidence that there
      are any current users of memory.events that would find this change
      undesirable.
      
      akpm: this is a behaviour change, so Cc:stable. This is so that
      forthcoming distros which use cgroup v2 are more likely to pick up the
      revised behaviour.
      
      [xuyu: remove the new memory_localevents mount option because it is
      rarely used]
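
      A sketch of the resulting propagation, with the memory_localevents check
      dropped as noted above:

          static inline void memcg_memory_event(struct mem_cgroup *memcg,
                                                enum memcg_memory_event event)
          {
                  /* count and notify at every level up to the root */
                  do {
                          atomic_long_inc(&memcg->memory_events[event]);
                          cgroup_file_notify(&memcg->events_file);
                  } while ((memcg = parent_mem_cgroup(memcg)) &&
                           !mem_cgroup_is_root(memcg));
          }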
      
      Link: http://lkml.kernel.org/r/20190208224419.GA24772@chrisdown.name
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
    • alinux: mm,memcg: export memory.high to v1 · 7d36553b
      Xu Yu authored
      Export "memory.high" from cgroup v2 to v1 which can be used to archive
      some memory QoS.
      
      This is also a way of migrating to v2 gradually.
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
    • arm64: mm: add missing PTE_SPECIAL in pte_mkdevmap on arm64 · f72a099b
      Jia He authored
      commit 30e235389faadb9e3d918887b1f126155d7d761d upstream.
      
      Without this patch, the MAP_SYNC test case will cause a print_bad_pte
      warning on arm64 as follows:
      
      [   25.542693] BUG: Bad page map in process mapdax333 pte:2e8000448800f53 pmd:41ff5f003
      [   25.546360] page:ffff7e0010220000 refcount:1 mapcount:-1 mapping:ffff8003e29c7440 index:0x0
      [   25.550281] ext4_dax_aops
      [   25.550282] name:"__aaabbbcccddd__"
      [   25.551553] flags: 0x3ffff0000001002(referenced|reserved)
      [   25.555802] raw: 03ffff0000001002 ffff8003dfffa908 0000000000000000 ffff8003e29c7440
      [   25.559446] raw: 0000000000000000 0000000000000000 00000001fffffffe 0000000000000000
      [   25.563075] page dumped because: bad pte
      [   25.564938] addr:0000ffffbe05b000 vm_flags:208000fb anon_vma:0000000000000000 mapping:ffff8003e29c7440 index:0
      [   25.574272] file:__aaabbbcccddd__ fault:ext4_dax_fault mmmmap:ext4_file_mmap readpage:0x0
      [   25.578799] CPU: 1 PID: 1180 Comm: mapdax333 Not tainted 5.2.0+ #21
      [   25.581702] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
      [   25.585624] Call trace:
      [   25.587008]  dump_backtrace+0x0/0x178
      [   25.588799]  show_stack+0x24/0x30
      [   25.590328]  dump_stack+0xa8/0xcc
      [   25.591901]  print_bad_pte+0x18c/0x218
      [   25.593628]  unmap_page_range+0x778/0xc00
      [   25.595506]  unmap_single_vma+0x94/0xe8
      [   25.597304]  unmap_vmas+0x90/0x108
      [   25.598901]  unmap_region+0xc0/0x128
      [   25.600566]  __do_munmap+0x284/0x3f0
      [   25.602245]  __vm_munmap+0x78/0xe0
      [   25.603820]  __arm64_sys_munmap+0x34/0x48
      [   25.605709]  el0_svc_common.constprop.0+0x78/0x168
      [   25.607956]  el0_svc_handler+0x34/0x90
      [   25.609698]  el0_svc+0x8/0xc
      [...]
      
      The root cause is in _vm_normal_page: without the PTE_SPECIAL bit,
      the return value will be incorrectly set to pfn_to_page(pfn) instead
      of NULL. Besides, this patch also rewrites pmd_mkdevmap to avoid
      setting PTE_SPECIAL for the pmd case.
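
      The shape of the fix in arch/arm64/include/asm/pgtable.h:

          static inline pte_t pte_mkdevmap(pte_t pte)
          {
                  /* PTE_SPECIAL makes vm_normal_page() return NULL for devmap ptes */
                  return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
          }

          /* the pmd variant must NOT inherit PTE_SPECIAL */
          #define pmd_mkdevmap(pmd) \
                  pte_pmd(set_pte_bit(pmd_pte(pmd), __pgprot(PTE_DEVMAP)))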
      
      The MAP_SYNC test case is as follows (provided by Yibo Cai):
      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>
      #include <fcntl.h>      /* for open() and O_CREAT etc. */
      #include <sys/file.h>
      #include <sys/mman.h>
      
      #ifndef MAP_SYNC
      #define MAP_SYNC 0x80000
      #endif
      
      /* mount -o dax /dev/pmem0 /mnt */
      #define F "/mnt/__aaabbbcccddd__"
      
      int main(void)
      {
          int fd;
          char buf[4096];
          void *addr;
      
          if ((fd = open(F, O_CREAT|O_TRUNC|O_RDWR, 0644)) < 0) {
              perror("open1");
              return 1;
          }
      
          if (write(fd, buf, 4096) != 4096) {
              perror("lseek");
              return 1;
          }
      
          addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_SYNC, fd, 0);
          if (addr == MAP_FAILED) {
              perror("mmap");
              printf("did you mount with '-o dax'?\n");
              return 1;
          }
      
          memset(addr, 0x55, 4096);
      
          if (munmap(addr, 4096) == -1) {
              perror("munmap");
              return 1;
          }
      
          close(fd);
      
          return 0;
      }
      
      Fixes: 73b20c84d42d ("arm64: mm: implement pte_devmap support")
      Reported-by: Yibo Cai <Yibo.Cai@arm.com>
      Acked-by: Will Deacon <will@kernel.org>
      Acked-by: Robin Murphy <Robin.Murphy@arm.com>
      Signed-off-by: Jia He <justin.he@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Shannon Zhao <shannon.zhao@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
    • arm64: mm: implement pte_devmap support · 1820ca63
      Robin Murphy authored
      commit 73b20c84d42de14673a987816dd4d132c7b1f801 upstream.
      
      In order for things like get_user_pages() to work on ZONE_DEVICE memory,
      we need a software PTE bit to identify device-backed PFNs.  Hook this up
      along with the relevant helpers to join in with ARCH_HAS_PTE_DEVMAP.
      
      [robin.murphy@arm.com: build fixes]
        Link: http://lkml.kernel.org/r/13026c4e64abc17133bbfa07d7731ec6691c0bcd.1559050949.git.robin.murphy@arm.com
      Link: http://lkml.kernel.org/r/817d92886fc3b33bcbf6e105ee83a74babb3a5aa.1558547956.git.robin.murphy@arm.com
      Signed-off-by: Robin Murphy <robin.murphy@arm.com>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oliver O'Halloran <oohall@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Shannon Zhao <shannon.zhao@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
    • mm: introduce ARCH_HAS_PTE_DEVMAP · d7066a91
      Robin Murphy authored
      commit 175967318c3018d01931ac950c82adab5deb47ca upstream.
      
      ARCH_HAS_ZONE_DEVICE is somewhat meaningless in itself, and combined
      with the long-out-of-date comment it can lead to the impression that an
      architecture may just enable it (since __add_pages() now "comprehends
      device memory" for itself) and expect things to work.
      
      In practice, however, ZONE_DEVICE users have little chance of
      functioning correctly without __HAVE_ARCH_PTE_DEVMAP, so let's clean
      that up the same way as ARCH_HAS_PTE_SPECIAL and make it the proper
      dependency so the real situation is clearer.
      
      Link: http://lkml.kernel.org/r/87554aa78478a02a63f2c4cf60a847279ae3eb3b.1558547956.git.robin.murphy@arm.com
      Signed-off-by: Robin Murphy <robin.murphy@arm.com>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Acked-by: Oliver O'Halloran <oohall@gmail.com>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Shannon Zhao <shannon.zhao@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
    • arm64: ftrace: fix ifdeffery · a990f965
      Mark Rutland authored
      commit 70927d02d409b5a79c3ed040ace5017da8284ede upstream.
      
      When I tweaked the ftrace entry assembly in commit:
      
        3b23e4991fb66f6d ("arm64: implement ftrace with regs")
      
      ... my ifdeffery tweaks left ftrace_graph_caller undefined for
      CONFIG_DYNAMIC_FTRACE && CONFIG_FUNCTION_GRAPH_TRACER when ftrace is
      based on mcount.
      
      The kbuild test robot reported that this issue is detected at link time:
      
      | arch/arm64/kernel/entry-ftrace.o: In function `skip_ftrace_call':
      | arch/arm64/kernel/entry-ftrace.S:238: undefined reference to `ftrace_graph_caller'
      | arch/arm64/kernel/entry-ftrace.S:238:(.text+0x3c): relocation truncated to fit: R_AARCH64_CONDBR19 against undefined symbol
      | `ftrace_graph_caller'
      | arch/arm64/kernel/entry-ftrace.S:243: undefined reference to `ftrace_graph_caller'
      | arch/arm64/kernel/entry-ftrace.S:243:(.text+0x54): relocation truncated to fit: R_AARCH64_CONDBR19 against undefined symbol
      | `ftrace_graph_caller'
      
      This patch fixes the ifdeffery so that the mcount version of
      ftrace_graph_caller doesn't depend on CONFIG_DYNAMIC_FTRACE. At the same
      time, a redundant #else is removed from the ifdeffery for the
      patchable-function-entry version of ftrace_graph_caller.
      
      Fixes: 3b23e4991fb66f6d ("arm64: implement ftrace with regs")
      Reported-by: kbuild test robot <lkp@intel.com>
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Amit Daniel Kachhap <amit.kachhap@arm.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Torsten Duwe <duwe@lst.de>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
      Acked-by: Baoyou Xie <xie.baoyou@linux.alibaba.com>
    • alinux: arm64: fixed _mcount undefined reference error · 53b89c33
      Zou Cao authored
      Fixes the following link-time error:
      arm64ksyms.c:(___ksymtab+_mcount+0x0): undefined reference to `_mcount'
      Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
      Acked-by: Baoyou Xie <xie.baoyou@linux.alibaba.com>
    • arm64: ftrace: always pass instrumented pc in x0 · 54760b8d
      Mark Rutland authored
      commit 7dc48bf96aa0fc8aa5b38cc3e5c36ac03171e680 upstream.
      
      The core ftrace hooks take the instrumented PC in x0, but for some
      reason arm64's prepare_ftrace_return() takes this in x1.
      
      For consistency, let's flip the argument order and always pass the
      instrumented PC in x0.
      
      There should be no functional change as a result of this patch.
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Torsten Duwe <duwe@suse.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
      Acked-by: Baoyou Xie <xie.baoyou@linux.alibaba.com>
    • arm64: ftrace: remove return_regs macros · a7cd9c60
      Mark Rutland authored
      commit 49e258e05e8e56d53af20be481b311c43d7c286b upstream.
      
      The save_return_regs and restore_return_regs macros are only used by
      return_to_handler, and having them defined out-of-line only serves to
      obscure the logic.
      
      Before we complicate things further, let's clean this up and fold the
      logic directly into return_to_handler, saving a few lines of macro
      boilerplate in the process. At the same time, a missing trailing space
      is added to the comments, fixing a code style violation.
      
      There should be no functional change as a result of this patch.
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Torsten Duwe <duwe@suse.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
      Acked-by: Baoyou Xie <xie.baoyou@linux.alibaba.com>
    • linkage: add generic GLOBAL() macro · 37889fe9
      Mark Rutland authored
      commit ad697a1aecac19ec351063b5d8e6fc9d4bca7ee5 upstream.
      
      Declaring a global symbol in assembly is tedious, error-prone, and
      painful to read. While ENTRY() exists, it is supposed to be used for
      function entry points, and it affects alignment in a potentially
      undesirable manner.
      
      Instead, let's add a generic GLOBAL() macro for this, as x86 added
      locally in commit:
      
        95695547 ("x86: asm linkage - introduce GLOBAL macro")
      
      ... thus allowing us to use this more freely in the kernel.
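
      The macro as added to include/linux/linkage.h:

          #ifndef GLOBAL
          #define GLOBAL(name) \
                  .globl name ASM_NL \
                  name:
          #endif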
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Torsten Duwe <duwe@suse.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
      Acked-by: Baoyou Xie <xie.baoyou@linux.alibaba.com>
    • compiler.h: add CC_USING_PATCHABLE_FUNCTION_ENTRY · 7f887650
      Sven Schnelle authored
      commit 2809b392a62ae307da058a52d451b2fc3ce4de7e upstream.
      
      This can be used for architectures implementing dynamic
      ftrace via -fpatchable-function-entry.
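
      Roughly, the compiler header change teaches notrace about the new
      mechanism:

          #if defined(CC_USING_HOTPATCH)
          #define notrace __attribute__((hotpatch(0, 0)))
          #elif defined(CC_USING_PATCHABLE_FUNCTION_ENTRY)
          /* zero entry NOPs: such functions are never patched, hence untraced */
          #define notrace __attribute__((patchable_function_entry(0, 0)))
          #else
          #define notrace __attribute__((__no_instrument_function__))
          #endif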
      Signed-off-by: Sven Schnelle <svens@stackframe.org>
      Signed-off-by: Helge Deller <deller@gmx.de>
      Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
      Acked-by: Baoyou Xie <xie.baoyou@linux.alibaba.com>
    • arm64: implement ftrace with regs · 1f77f2fc
      Torsten Duwe authored
      commit 3b23e4991fb66f6d152f9055ede271a726ef9f21 upstream
      
      This patch implements FTRACE_WITH_REGS for arm64, which allows a traced
      function's arguments (and some other registers) to be captured into a
      struct pt_regs, allowing these to be inspected and/or modified. This is
      a building block for live-patching, where a function's arguments may be
      forwarded to another function. This is also necessary to enable ftrace
      and in-kernel pointer authentication at the same time, as it allows the
      LR value to be captured and adjusted prior to signing.
      
      Using GCC's -fpatchable-function-entry=N option, we can have the
      compiler insert a configurable number of NOPs between the function entry
      point and the usual prologue. This also ensures functions are AAPCS
      compliant (e.g. disabling inter-procedural register allocation).
      
      For example, with -fpatchable-function-entry=2, GCC 8.1.0 compiles the
      following:
      
      | unsigned long bar(void);
      |
      | unsigned long foo(void)
      | {
      |         return bar() + 1;
      | }
      
      ... to:
      
      | <foo>:
      |         nop
      |         nop
      |         stp     x29, x30, [sp, #-16]!
      |         mov     x29, sp
      |         bl      0 <bar>
      |         add     x0, x0, #0x1
      |         ldp     x29, x30, [sp], #16
      |         ret
      
      This patch builds the kernel with -fpatchable-function-entry=2,
      prefixing each function with two NOPs. To trace a function, we replace
      these NOPs with a sequence that saves the LR into a GPR, then calls an
      ftrace entry assembly function which saves this and other relevant
      registers:
      
      | mov	x9, x30
      | bl	<ftrace-entry>
      
      Since patchable functions are AAPCS compliant (and the kernel does not
      use x18 as a platform register), x9-x18 can be safely clobbered in the
      patched sequence and the ftrace entry code.
      
      There are now two ftrace entry functions, ftrace_regs_entry (which saves
      all GPRs), and ftrace_entry (which saves the bare minimum). A PLT is
      allocated for each within modules.
      Signed-off-by: Torsten Duwe <duwe@suse.de>
      [Mark: rework asm, comments, PLTs, initialization, commit message]
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Reviewed-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
      Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Reviewed-by: Torsten Duwe <duwe@suse.de>
      Tested-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
      Tested-by: Torsten Duwe <duwe@suse.de>
      Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Julien Thierry <jthierry@redhat.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
      Acked-by: Baoyou Xie <xie.baoyou@linux.alibaba.com>
    • arm64: asm-offsets: add S_FP · ecb06c7b
      Mark Rutland authored
      commit 1f377e043b3b8ef68caffe47bdad794f4e2cb030 upstream
      
      So that assembly code can more easily manipulate the FP (x29) within a
      pt_regs, add an S_FP asm-offsets definition.
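
      The definition is a one-liner in arch/arm64/kernel/asm-offsets.c:

          /* x29 is the AAPCS frame pointer */
          DEFINE(S_FP, offsetof(struct pt_regs, regs[29]));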
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Reviewed-by: Torsten Duwe <duwe@suse.de>
      Tested-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
      Tested-by: Torsten Duwe <duwe@suse.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
      Acked-by: Baoyou Xie <xie.baoyou@linux.alibaba.com>
    • arm64: insn: add encoder for MOV (register) · 88892d31
      Mark Rutland authored
      commit e3bf8a67f759b498e09999804c3837688e03b304 upstream
      
      For FTRACE_WITH_REGS, we're going to want to generate a MOV (register)
      instruction as part of the callsite initialization. As MOV (register) is
      an alias for ORR (shifted register), we can generate this with
      aarch64_insn_gen_logical_shifted_reg(), but it's somewhat verbose and
      difficult to read in-context.
      
      Add an aarch64_insn_gen_move_reg() wrapper for this case so that we can
      write callers in a more straightforward way.
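
      The wrapper, following the upstream commit (MOV Xd, Xm encodes as
      ORR Xd, XZR, Xm):

          u32 aarch64_insn_gen_move_reg(enum aarch64_insn_register dst,
                                        enum aarch64_insn_register src,
                                        enum aarch64_insn_variant variant)
          {
                  return aarch64_insn_gen_logical_shifted_reg(dst,
                                                              AARCH64_INSN_REG_ZR,
                                                              src, 0, variant,
                                                              AARCH64_INSN_LOGIC_ORR);
          }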
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Reviewed-by: Torsten Duwe <duwe@suse.de>
      Tested-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
      Tested-by: Torsten Duwe <duwe@suse.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
      Acked-by: Baoyou Xie <xie.baoyou@linux.alibaba.com>
    • arm64: module/ftrace: initialize PLT at load time · 19f2b4ae
      Mark Rutland authored
      commit f1a54ae9af0da4d76239256ed640a93ab3aadac0 upstream
      
      Currently we lazily-initialize a module's ftrace PLT at runtime when we
      install the first ftrace call. To do so we have to apply a number of
      sanity checks, transiently mark the module text as RW, and perform an
      IPI as part of handling Neoverse-N1 erratum #1542419.
      
      We only expect the ftrace trampoline to point at ftrace_caller() (AKA
      FTRACE_ADDR), so let's simplify all of this by initializing the PLT at
      module load time, before the module loader marks the module RO and
      performs the initial I-cache maintenance for the module.
      
      Thus we can rely on the module having been correctly initialized, and
      can simplify the runtime work necessary to install an ftrace call in a
      module. This will also allow for the removal of module_disable_ro().
      
      Tested by forcing ftrace_make_call() to use the module PLT, and then
      loading up a module after setting up ftrace with:
      
      | echo ":mod:<module-name>" > set_ftrace_filter;
      | echo function > current_tracer;
      | modprobe <module-name>
      
      Since FTRACE_ADDR is only defined when CONFIG_DYNAMIC_FTRACE is
      selected, we wrap its use along with most of module_init_ftrace_plt()
      with ifdeffery rather than using IS_ENABLED().
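
      A sketch of the load-time initialization; the exact PLT helper names may
      differ in the backport:

          static int module_init_ftrace_plt(const Elf_Ehdr *hdr,
                                            const Elf_Shdr *sechdrs,
                                            struct module *mod)
          {
          #if defined(CONFIG_ARM64_MODULE_PLTS) && defined(CONFIG_DYNAMIC_FTRACE)
                  const Elf_Shdr *s;
                  struct plt_entry *plt;

                  s = find_section(hdr, sechdrs, ".text.ftrace_trampoline");
                  if (!s)
                          return -ENOEXEC;

                  /* the trampoline only ever points at ftrace_caller() */
                  plt = (void *)s->sh_addr;
                  *plt = get_plt_entry(FTRACE_ADDR, plt);
                  mod->arch.ftrace_trampoline = plt;
          #endif
                  return 0;
          }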
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Reviewed-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
      Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Reviewed-by: Torsten Duwe <duwe@suse.de>
      Tested-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
      Tested-by: Torsten Duwe <duwe@suse.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
      Acked-by: Baoyou Xie <xie.baoyou@linux.alibaba.com>
    • arm64: module: rework special section handling · d4199e8c
      Mark Rutland authored
      commit bd8b21d3dd661658addc1cd4cc869bab11d28596 upstream
      
      When we load a module, we have to perform some special work for a couple
      of named sections. To do this, we iterate over all of the module's
      sections, and perform work for each section we recognize.
      
      To make it easier to handle the unexpected absence of a section, and to
      make the section-specific logic easier to read, let's factor the section
      search into a helper. Similar is already done in the core module loader
      and in other architectures (and ideally we'd unify these in future).
      
      If we expect a module to have an ftrace trampoline section, but it
      doesn't have one, we'll now reject loading the module. When
      ARM64_MODULE_PLTS is selected, any correctly built module should have
      one (and this is assumed by arm64's ftrace PLT code) and the absence of
      such a section implies something has gone wrong at build time.
      
      Subsequent patches will make use of the new helper.
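
      The helper, as introduced by the upstream commit:

          static const Elf_Shdr *find_section(const Elf_Ehdr *hdr,
                                              const Elf_Shdr *sechdrs,
                                              const char *name)
          {
                  const Elf_Shdr *s, *se;
                  const char *secstrs =
                          (void *)hdr + sechdrs[hdr->e_shstrndx].sh_offset;

                  for (s = sechdrs, se = sechdrs + hdr->e_shnum; s < se; s++) {
                          if (strcmp(name, secstrs + s->sh_name) == 0)
                                  return s;
                  }

                  return NULL;
          }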
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Reviewed-by: Torsten Duwe <duwe@suse.de>
      Tested-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
      Tested-by: Torsten Duwe <duwe@suse.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
      Acked-by: Baoyou Xie <xie.baoyou@linux.alibaba.com>
    • module/ftrace: handle patchable-function-entry · 0f596d7d
      Mark Rutland authored
      backport from a1326b17ac03a9012cb3d01e434aacb4d67a416c upstream
      
      When using patchable-function-entry, the compiler will record the
      callsites into a section named "__patchable_function_entries" rather
      than "__mcount_loc". Let's abstract this difference behind a new
      FTRACE_CALLSITE_SECTION, so that architectures don't have to handle this
      explicitly (e.g. with custom module linker scripts).
      
      As parisc currently handles this explicitly, it is fixed up accordingly,
      with its custom linker script removed. Since FTRACE_CALLSITE_SECTION is
      only defined when DYNAMIC_FTRACE is selected, the parisc module loading
      code is updated to only use the definition in that case. When
      DYNAMIC_FTRACE is not selected, modules shouldn't have this section, so
      this removes some redundant work in that case.
      
      To make sure that this is kept up-to-date for modules and the main
      kernel, a comment is added to vmlinux.lds.h, with the existing ifdeffery
      simplified for legibility.
      
      I built parisc generic-{32,64}bit_defconfig with DYNAMIC_FTRACE enabled,
      and verified that the section made it into the .ko files for modules.
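
      Roughly, the abstraction in include/linux/ftrace.h:

          #if defined(CC_USING_PATCHABLE_FUNCTION_ENTRY)
          #define FTRACE_CALLSITE_SECTION "__patchable_function_entries"
          #else
          #define FTRACE_CALLSITE_SECTION "__mcount_loc"
          #endif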
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Helge Deller <deller@gmx.de>
      Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Reviewed-by: Torsten Duwe <duwe@suse.de>
      Tested-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
      Tested-by: Sven Schnelle <svens@stackframe.org>
      Tested-by: Torsten Duwe <duwe@suse.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jessica Yu <jeyu@kernel.org>
      Cc: linux-parisc@vger.kernel.org
      Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
      Acked-by: Baoyou Xie <xie.baoyou@linux.alibaba.com>
    • ftrace: add ftrace_init_nop() · 21aca133
      Mark Rutland authored
      commit fbf6c73c5b264c25484fa9f449b5546569fe11f0 upstream
      
      Architectures may need to perform special initialization of ftrace
      callsites, and today they do so by special-casing ftrace_make_nop() when
      the expected branch address is MCOUNT_ADDR. In some cases (e.g. for
      patchable-function-entry), we don't have an mcount-like symbol and don't
      want a synthetic MCOUNT_ADDR, but we may need to perform some
      initialization of callsites.
      
      To make it possible to separate initialization from runtime
      modification, and to handle cases without an mcount-like symbol, this
      patch adds an optional ftrace_init_nop() function that architectures can
      implement, which does not pass a branch address.
      
      Where an architecture does not provide ftrace_init_nop(), we will fall
      back to the existing behaviour of calling ftrace_make_nop() with
      MCOUNT_ADDR.
      
      At the same time, ftrace_code_disable() is renamed to
      ftrace_nop_initialize() to make it clearer that it is intended to
      initialize a callsite into a disabled state, and is not for disabling a
      callsite that has been runtime enabled. The kerneldoc description of rec
      arguments is updated to cover non-mcount callsites.
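
      The fallback, roughly as the upstream commit defines it:

          #ifndef ftrace_init_nop
          /* architectures without special callsite init keep the old behaviour */
          static inline int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec)
          {
                  return ftrace_make_nop(mod, rec, MCOUNT_ADDR);
          }
          #endif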
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Reviewed-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
      Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Reviewed-by: Miroslav Benes <mbenes@suse.cz>
      Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Reviewed-by: Torsten Duwe <duwe@suse.de>
      Tested-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
      Tested-by: Sven Schnelle <svens@stackframe.org>
      Tested-by: Torsten Duwe <duwe@suse.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
      Acked-by: Baoyou Xie <xie.baoyou@linux.alibaba.com>
    • kasan: Makefile: Replace -pg with CC_FLAGS_FTRACE · d97b0fdf
      Torsten Duwe authored
      commit e2092740b72384e95b9c8a418536b35d628a1642 upstream.
      
      In preparation for arm64 supporting ftrace built on other compiler
      options, let's have Makefiles remove the $(CC_FLAGS_FTRACE) flags,
      whatever these may be, rather than assuming '-pg'.
      
      There should be no functional change as a result of this patch.
      Reviewed-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Torsten Duwe <duwe@suse.de>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
      Acked-by: Baoyou Xie <xie.baoyou@linux.alibaba.com>
    • efi/arm/arm64: Makefile: Replace -pg with CC_FLAGS_FTRACE · 92015762
      Torsten Duwe authored
      commit e1a7eafb7350ff118e7538aca76b3f79e9280a8f upstream.
      
      In preparation for arm64 supporting ftrace built on other compiler
      options, let's have Makefiles remove the $(CC_FLAGS_FTRACE) flags,
      whatever these may be, rather than assuming '-pg'.
      
      While at it, fix arm32 as well. There should be no functional change as
      a result of this patch.
      Reviewed-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Torsten Duwe <duwe@suse.de>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
      Acked-by: Baoyou Xie <xie.baoyou@linux.alibaba.com>
    • arm64: Makefile: Replace -pg with CC_FLAGS_FTRACE · 007a7aa5
      Torsten Duwe authored
      commit edf072d36dbfdf74465b66988f30084b6c996fbf upstream.
      
      In preparation for arm64 supporting ftrace built on other compiler
      options, let's have the arm64 Makefiles remove the $(CC_FLAGS_FTRACE)
      flags, whatever these may be, rather than assuming '-pg'.
      
      There should be no functional change as a result of this patch.
      Reviewed-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Torsten Duwe <duwe@suse.de>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
      Acked-by: Baoyou Xie <xie.baoyou@linux.alibaba.com>
    • arm64: ftrace: use GLOBAL() · 9709ac64
      Mark Rutland authored
      commit e4fe196642678565766815d99ab98a3a32d72dd4 upstream.
      
      The global exports of ftrace_call and ftrace_graph_call are somewhat
      painful to read. Let's use the generic GLOBAL() macro to ameliorate
      matters.
      
      There should be no functional change as a result of this patch.
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Torsten Duwe <duwe@suse.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
      Acked-by: Baoyou Xie <xie.baoyou@linux.alibaba.com>