1. 18 3月, 2020 40 次提交
    • T
      sched: Remove stale PF_MUTEX_TESTER bit · 04695e1f
      Thomas Gleixner 提交于
      commit 15917dc02841862840efcbfe1da0830f88078b5c upstream.
      
      The RTMUTEX tester was removed long ago but the PF bit stayed
      around. Remove it and free up the space.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      04695e1f
    • D
      io_uring: add set of tracing events · 6ffa9fe6
      Dmitrii Dolgov 提交于
      commit c826bd7a743f275e2b68c16d595534063b400deb upstream.
      
      To trace io_uring activity one can get an information from workqueue and
      io trace events, but looks like some parts could be hard to identify via
      this approach. Making what happens inside io_uring more transparent is
      important to be able to reason about many aspects of it, hence introduce
      the set of tracing events.
      
      All such events could be roughly divided into two categories:
      
      * those, that are helping to understand correctness (from both kernel
        and an application point of view). E.g. a ring creation, file
        registration, or waiting for available CQE. Proposed approach is to
        get a pointer to an original structure of interest (ring context, or
        request), and then find relevant events. io_uring_queue_async_work
        also exposes a pointer to work_struct, to be able to track down
        corresponding workqueue events.
      
      * those, that provide performance related information. Mostly it's about
        events that change the flow of requests, e.g. whether an async work
        was queued, or delayed due to some dependencies. Another important
        case is how io_uring optimizations (e.g. registered files) are
        utilized.
      Signed-off-by: NDmitrii Dolgov <9erthalion6@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      6ffa9fe6
    • J
      io_uring: add support for canceling timeout requests · f39f9b94
      Jens Axboe 提交于
      commit 11365043e5271fea4c92189a976833da477a3a44 upstream.
      
      We might have cases where the need for a specific timeout is gone, add
      support for canceling an existing timeout operation. This works like the
      POLL_REMOVE command, where the application passes in the user_data of
      the timeout it wishes to cancel in the sqe->addr field.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      f39f9b94
    • J
      io_uring: add support for absolute timeouts · 17da935f
      Jens Axboe 提交于
      commit a41525ab2e75987e809926352ebc6f1397da900e upstream.
      
      This is a pretty trivial addition on top of the relative timeouts
      we have now, but it's handy for ensuring tighter timing for those
      that are building scheduling primitives on top of io_uring.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      17da935f
    • J
      io_uring: allow application controlled CQ ring size · 29635426
      Jens Axboe 提交于
      commit 33a107f0a1b8df0ad925e39d8afc97bb78e0cec1 upstream.
      
      We currently size the CQ ring as twice the SQ ring, to allow some
      flexibility in not overflowing the CQ ring. This is done because the
      SQE life time is different than that of the IO request itself, the SQE
      is consumed as soon as the kernel has seen the entry.
      
      Certain application don't need a huge SQ ring size, since they just
      submit IO in batches. But they may have a lot of requests pending, and
      hence need a big CQ ring to hold them all. By allowing the application
      to control the CQ ring size multiplier, we can cater to those
      applications more efficiently.
      
      If an application wants to define its own CQ ring size, it must set
      IORING_SETUP_CQSIZE in the setup flags, and fill out
      io_uring_params->cq_entries. The value must be a power of two.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      29635426
    • J
      io_uring: add support for IORING_REGISTER_FILES_UPDATE · 54633bb3
      Jens Axboe 提交于
      commit c3a31e605620c279163c14068a60869ea3fda203 upstream.
      
      Allows the application to remove/replace/add files to/from a file set.
      Passes in a struct:
      
      struct io_uring_files_update {
      	__u32 offset;
      	__s32 *fds;
      };
      
      that holds an array of fds, size of array passed in through the usual
      nr_args part of the io_uring_register() system call. The logic is as
      follows:
      
      1) If ->fds[i] is -1, the existing file at i + ->offset is removed from
         the set.
      2) If ->fds[i] is a valid fd, the existing file at i + ->offset is
         replaced with ->fds[i].
      
      For case #2, is the existing file is currently empty (fd == -1), the
      new fd is simply added to the array.
      Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      54633bb3
    • J
      io_uring: IORING_OP_TIMEOUT support · 94601f2b
      Jens Axboe 提交于
      commit 5262f567987d3c30052b22e78c35c2313d07b230 upstream.
      
      There's been a few requests for functionality similar to io_getevents()
      and epoll_wait(), where the user can specify a timeout for waiting on
      events. I deliberately did not add support for this through the system
      call initially to avoid overloading the args, but I can see that the use
      cases for this are valid.
      
      This adds support for IORING_OP_TIMEOUT. If a user wants to get woken
      when waiting for events, simply submit one of these timeout commands
      with your wait call (or before). This ensures that the application
      sleeping on the CQ ring waiting for events will get woken. The timeout
      command is passed in as a pointer to a struct timespec. Timeouts are
      relative. The timeout command also includes a way to auto-cancel after
      N events has passed.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      94601f2b
    • J
      io_uring: expose single mmap capability · c15b5945
      Jens Axboe 提交于
      commit ac90f249e15cd2a850daa9e36e15f81ce1ff6550 upstream.
      
      After commit 75b28affdd6a we can get by with just a single mmap to
      map both the sq and cq ring. However, userspace doesn't know that.
      
      Add a features variable to io_uring_params, and notify userspace
      that the kernel has this ability. This can then be used in liburing
      (or in applications directly) to avoid the second mmap.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      c15b5945
    • S
      include/linux/notifier.h: SRCU: fix ctags · ffeba5d0
      Sam Protsenko 提交于
      commit 94e297c50b529f5d01cfd1dbc808d61e95180ab7 upstream.
      
      ctags indexing ("make tags" command) throws this warning:
      
          ctags: Warning: include/linux/notifier.h:125:
          null expansion of name pattern "\1"
      
      This is the result of DEFINE_PER_CPU() macro expansion.  Fix that by
      getting rid of line break.
      
      Similar fix was already done in commit 25528213 ("tags: Fix
      DEFINE_PER_CPU expansions"), but this one probably wasn't noticed.
      
      Link: http://lkml.kernel.org/r/20181030202808.28027-1-semen.protsenko@linaro.org
      Fixes: 9c80172b ("kernel/SRCU: provide a static initializer")
      Signed-off-by: NSam Protsenko <semen.protsenko@linaro.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NCambda Zhu <cambda@linux.alibaba.com>
      Acked-by: NXunlei Pang <xlpang@linux.alibaba.com>
      ffeba5d0
    • Z
      io_uring: track io length in async_list based on bytes · e280026f
      Zhengyuan Liu 提交于
      commit 9310a7ba6de8cce6209e3e8a3cdf733f824cdd9b upstream.
      
      We are using PAGE_SIZE as the unit to determine if the total len in
      async_list has exceeded max_pages, it's not fair for smaller io sizes.
      For example, if we are doing 1k-size io streams, we will never exceed
      max_pages since len >>= PAGE_SHIFT always gets zero. So use original
      bytes to make it more accurate.
      Signed-off-by: NZhengyuan Liu <liuzhengyuan@kylinos.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      e280026f
    • O
      signal: simplify set_user_sigmask/restore_user_sigmask · f12f9562
      Oleg Nesterov 提交于
      commit b772434be0891ed1081a08ae7cfd4666728f8e82 upstream.
      
      task->saved_sigmask and ->restore_sigmask are only used in the ret-from-
      syscall paths.  This means that set_user_sigmask() can save ->blocked in
      ->saved_sigmask and do set_restore_sigmask() to indicate that ->blocked
      was modified.
      
      This way the callers do not need 2 sigset_t's passed to set/restore and
      restore_user_sigmask() renamed to restore_saved_sigmask_unless() turns
      into the trivial helper which just calls restore_saved_sigmask().
      
      Link: http://lkml.kernel.org/r/20190606113206.GA9464@redhat.comSigned-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Deepa Dinamani <deepa.kernel@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Eric Wong <e@80x24.org>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: David Laight <David.Laight@aculab.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      f12f9562
    • J
      io_uring: add support for recvmsg() · 3962b3d0
      Jens Axboe 提交于
      commit aa1fa28fc73ea6b740ee7b62bf3b07141883dbb8 upstream.
      
      This is done through IORING_OP_RECVMSG. This opcode uses the same
      sqe->msg_flags that IORING_OP_SENDMSG added, and we pass in the
      msghdr struct in the sqe->addr field as well.
      
      We use MSG_DONTWAIT to force an inline fast path if recvmsg() doesn't
      block, and punt to async execution if it would have.
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      3962b3d0
    • J
      io_uring: add support for sendmsg() · 0cb8acf9
      Jens Axboe 提交于
      commit 0fa03c624d8fc9932d0f27c39a9deca6a37e0e17 upstream.
      
      This is done through IORING_OP_SENDMSG. There's a new sqe->msg_flags
      for the flags argument, and the msghdr struct is passed in the
      sqe->addr field.
      
      We use MSG_DONTWAIT to force an inline fast path if sendmsg() doesn't
      block, and punt to async execution if it would have.
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      0cb8acf9
    • C
      block: never take page references for ITER_BVEC · 709d159e
      Christoph Hellwig 提交于
      Cherry-pick from commit b620743077e291ae7d0debd21f50413a8c266229 upstream.
      
      If we pass pages through an iov_iter we always already have a reference
      in the caller.  Thus remove the ITER_BVEC_FLAG_NO_REF and don't take
      reference to pages by default for bvec backed iov_iters.
      
      [Joseph] Resolve conflicts since we don't have:
      81ba6abd2bcd "block: loop: mark bvec as ITER_BVEC_FLAG_NO_REF"
      7321ecbfc7cf "block: change how we get page references in bio_iov_iter_get_pages"
      Reviewed-by: NMinwoo Im <minwoo.im.dev@gmail.com>
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      709d159e
    • O
      signal: remove the wrong signal_pending() check in restore_user_sigmask() · a48e4674
      Oleg Nesterov 提交于
      commit 97abc889ee296faf95ca0e978340fb7b942a3e32 upstream.
      
      This is the minimal fix for stable, I'll send cleanups later.
      
      Commit 854a6ed56839 ("signal: Add restore_user_sigmask()") introduced
      the visible change which breaks user-space: a signal temporary unblocked
      by set_user_sigmask() can be delivered even if the caller returns
      success or timeout.
      
      Change restore_user_sigmask() to accept the additional "interrupted"
      argument which should be used instead of signal_pending() check, and
      update the callers.
      
      Eric said:
      
      : For clarity.  I don't think this is required by posix, or fundamentally to
      : remove the races in select.  It is what linux has always done and we have
      : applications who care so I agree this fix is needed.
      :
      : Further in any case where the semantic change that this patch rolls back
      : (aka where allowing a signal to be delivered and the select like call to
      : complete) would be advantage we can do as well if not better by using
      : signalfd.
      :
      : Michael is there any chance we can get this guarantee of the linux
      : implementation of pselect and friends clearly documented.  The guarantee
      : that if the system call completes successfully we are guaranteed that no
      : signal that is unblocked by using sigmask will be delivered?
      
      Link: http://lkml.kernel.org/r/20190604134117.GA29963@redhat.com
      Fixes: 854a6ed56839a40f6b5d02a2962f48841482eec4 ("signal: Add restore_user_sigmask()")
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Reported-by: NEric Wong <e@80x24.org>
      Tested-by: NEric Wong <e@80x24.org>
      Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NDeepa Dinamani <deepa.kernel@gmail.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: <stable@vger.kernel.org>	[5.0+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      a48e4674
    • J
      io_uring: add support for sqe links · fda445b3
      Jens Axboe 提交于
      commit 9e645e1105ca60fbbc6bddf2fd5ef7e57ed3dca8 upstream.
      
      With SQE links, we can create chains of dependent SQEs. One example
      would be queueing an SQE that's a read from one file descriptor, with
      the linked SQE being a write to another with the same set of buffers.
      
      An SQE link will not stall the pipeline, it'll just ensure that
      dependent SQEs aren't issued before the previous link has completed.
      
      Any error at submission or completion time will break the chain of SQEs.
      For completions, this also includes short reads or writes, as the next
      SQE could depend on the previous one being fully completed.
      
      Any SQE in a chain that gets canceled due to any of the above errors,
      will get an CQE fill with -ECANCELED as the error value.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      fda445b3
    • J
      uio: make import_iovec()/compat_import_iovec() return bytes on success · 0c13034a
      Jens Axboe 提交于
      commit 87e5e6dab6c2a21fab2620f37786276d202e2ce0 upstream.
      
      Currently these functions return < 0 on error, and 0 for success.
      Change that so that we return < 0 on error, but number of bytes
      for success.
      
      Some callers already treat the return value that way, others need a
      slight tweak.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      0c13034a
    • J
      io_uring: add support for eventfd notifications · 3707251b
      Jens Axboe 提交于
      commit 9b402849e80c85eee10bbd341aab3f1a0f942d4f upstream.
      
      Allow registration of an eventfd, which will trigger an event every
      time a completion event happens for this io_uring instance.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      3707251b
    • J
      io_uring: add support for IORING_OP_SYNC_FILE_RANGE · 2c12f33e
      Jens Axboe 提交于
      commit 5d17b4a4b7fa172b205be8a05051ae705d1dc3bb upstream.
      
      This behaves just like sync_file_range(2) does.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      2c12f33e
    • J
      fs: add sync_file_range() helper · ceda0208
      Jens Axboe 提交于
      commit 22f96b3808c12a218e9a3bce6e1bfbd74efbe374 upstream.
      
      This just pulls out the ksys_sync_file_range() code to work on a struct
      file instead of an fd, so we can use it elsewhere.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      ceda0208
    • J
      io_uring: add support for marking commands as draining · b52f2397
      Jens Axboe 提交于
      commit de0617e467171ba44c73efd1ba63f101b164a035 upstream.
      
      There are no ordering constraints between the submission and completion
      side of io_uring. But sometimes that would be useful to have. One common
      example is doing an fsync, for instance, and have it ordered with
      previous writes. Without support for that, the application must do this
      tracking itself.
      
      This adds a general SQE flag, IOSQE_IO_DRAIN. If a command is marked
      with this flag, then it will not be issued before previous commands have
      completed, and subsequent commands submitted after the drain will not be
      issued before the drain is started.. If there are no pending commands,
      setting this flag will not change the behavior of the issue of the
      command.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      b52f2397
    • S
      perf/smmuv3: Enable HiSilicon Erratum 162001800 quirk · 54c387a7
      Shameer Kolothum 提交于
      commit 24062fe85860debfdae0eeaa495f27c9971ec163 upstream
      
      HiSilicon erratum 162001800 describes the limitation of
      SMMUv3 PMCG implementation on HiSilicon Hip08 platforms.
      
      On these platforms, the PMCG event counter registers
      (SMMU_PMCG_EVCNTRn) are read only and as a result it
      is not possible to set the initial counter period value
      on event monitor start.
      
      To work around this, the current value of the counter
      is read and used for delta calculations. OEM information
      from ACPI header is used to identify the affected hardware
      platforms.
      Signed-off-by: NShameer Kolothum <shameerali.kolothum.thodi@huawei.com>
      Reviewed-by: NHanjun Guo <hanjun.guo@linaro.org>
      Reviewed-by: NRobin Murphy <robin.murphy@arm.com>
      Acked-by: NLorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      [will: update silicon-errata.txt and add reason string to acpi match]
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: Zou Cao<zoucao@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      54c387a7
    • N
      ACPI/IORT: Add support for PMCG · 9c6dfb51
      Neil Leeder 提交于
      commit 24e516049360eda85cf3fe9903221d43886c2689 upstream.
      
      Add support for the SMMU Performance Monitor Counter Group
      information from ACPI. This is in preparation for its use
      in the SMMUv3 PMU driver.
      Signed-off-by: NNeil Leeder <nleeder@codeaurora.org>
      Signed-off-by: NHanjun Guo <guohanjun@huawei.com>
      Signed-off-by: NShameer Kolothum <shameerali.kolothum.thodi@huawei.com>
      Reviewed-by: NRobin Murphy <robin.murphy@arm.com>
      Acked-by: NLorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: Zou Cao<zoucao@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      9c6dfb51
    • P
      mm/hotplug: make remove_memory() interface usable · f8502f80
      Pavel Tatashin 提交于
      commit eca499ab3749a4537dee77ffead47a1a2c0dee19 upstream
      
      Presently the remove_memory() interface is inherently broken.  It tries
      to remove memory but panics if some memory is not offline.  The problem
      is that it is impossible to ensure that all memory blocks are offline as
      this function also takes lock_device_hotplug that is required to change
      memory state via sysfs.
      
      So, between calling this function and offlining all memory blocks there
      is always a window when lock_device_hotplug is released, and therefore,
      there is always a chance for a panic during this window.
      
      Make this interface to return an error if memory removal fails.  This
      way it is safe to call this function without panicking machine, and also
      makes it symmetric to add_memory() which already returns an error.
      
      Link: http://lkml.kernel.org/r/20190517215438.6487-3-pasha.tatashin@soleen.comSigned-off-by: NPavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Nyinhe <yinhe@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      f8502f80
    • D
      mm/memory_hotplug: make remove_memory() take the device_hotplug_lock · d2097173
      David Hildenbrand 提交于
      commit d15e59260f62bd5e0f625cf5f5240f6ffac78ab6 upstream
      
      Patch series "mm: online/offline_pages called w.o. mem_hotplug_lock", v3.
      
      Reading through the code and studying how mem_hotplug_lock is to be used,
      I noticed that there are two places where we can end up calling
      device_online()/device_offline() - online_pages()/offline_pages() without
      the mem_hotplug_lock.  And there are other places where we call
      device_online()/device_offline() without the device_hotplug_lock.
      
      While e.g.
      	echo "online" > /sys/devices/system/memory/memory9/state
      is fine, e.g.
      	echo 1 > /sys/devices/system/memory/memory9/online
      Will not take the mem_hotplug_lock. However the device_lock() and
      device_hotplug_lock.
      
      E.g.  via memory_probe_store(), we can end up calling
      add_memory()->online_pages() without the device_hotplug_lock.  So we can
      have concurrent callers in online_pages().  We e.g.  touch in
      online_pages() basically unprotected zone->present_pages then.
      
      Looks like there is a longer history to that (see Patch #2 for details),
      and fixing it to work the way it was intended is not really possible.  We
      would e.g.  have to take the mem_hotplug_lock in device/base/core.c, which
      sounds wrong.
      
      Summary: We had a lock inversion on mem_hotplug_lock and device_lock().
      More details can be found in patch 3 and patch 6.
      
      I propose the general rules (documentation added in patch 6):
      
      1. add_memory/add_memory_resource() must only be called with
         device_hotplug_lock.
      2. remove_memory() must only be called with device_hotplug_lock. This is
         already documented and holds for all callers.
      3. device_online()/device_offline() must only be called with
         device_hotplug_lock. This is already documented and true for now in core
         code. Other callers (related to memory hotplug) have to be fixed up.
      4. mem_hotplug_lock is taken inside of add_memory/remove_memory/
         online_pages/offline_pages.
      
      To me, this looks way cleaner than what we have right now (and easier to
      verify).  And looking at the documentation of remove_memory, using
      lock_device_hotplug also for add_memory() feels natural.
      
      This patch (of 6):
      
      remove_memory() is exported right now but requires the
      device_hotplug_lock, which is not exported.  So let's provide a variant
      that takes the lock and only export that one.
      
      The lock is already held in
      	arch/powerpc/platforms/pseries/hotplug-memory.c
      	drivers/acpi/acpi_memhotplug.c
      	arch/powerpc/platforms/powernv/memtrace.c
      
      Apart from that, there are not other users in the tree.
      
      Link: http://lkml.kernel.org/r/20180925091457.28651-2-david@redhat.comSigned-off-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NPavel Tatashin <pavel.tatashin@microsoft.com>
      Reviewed-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Reviewed-by: NRashmica Gupta <rashmica.g@gmail.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Rashmica Gupta <rashmica.g@gmail.com>
      Cc: Michael Neuling <mikey@neuling.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: John Allen <jallen@linux.vnet.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Nyinhe <yinhe@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      d2097173
    • A
      mm: initialize MAX_ORDER_NR_PAGES at a time instead of doing larger sections · f38de7b3
      Alexander Duyck 提交于
      commit 0e56acae4b4dd4a9fbe897854ab83a109e2a9e11 upstream.
      
      Add yet another iterator, for_each_free_mem_range_in_zone_from, and then
      use it to support initializing and freeing pages in groups no larger than
      MAX_ORDER_NR_PAGES.  By doing this we can greatly improve the cache
      locality of the pages while we do several loops over them in the init and
      freeing process.
      
      We are able to tighten the loops further as a result of the "from"
      iterator as we can perform the initial checks for first_init_pfn in our
      first call to the iterator, and continue without the need for those checks
      via the "from" iterator.  I have added this functionality in the function
      called deferred_init_mem_pfn_range_in_zone that primes the iterator and
      causes us to exit if we encounter any failure.
      
      On my x86_64 test system with 384GB of memory per node I saw a reduction
      in initialization time from 1.85s to 1.38s as a result of this patch.
      
      Link: http://lkml.kernel.org/r/20190405221231.12227.85836.stgit@localhost.localdomainSigned-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: NPavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: <yi.z.zhang@linux.intel.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      f38de7b3
    • A
      mm: implement new zone specific memblock iterator · ad97e5e4
      Alexander Duyck 提交于
      commit 837566e7e08e3f89444166444836a8a49b9f9322 upstream.
      
      Introduce a new iterator for_each_free_mem_pfn_range_in_zone.
      
      This iterator will take care of making sure a given memory range provided
      is in fact contained within a zone.  It takes are of all the bounds
      checking we were doing in deferred_grow_zone, and deferred_init_memmap.
      In addition it should help to speed up the search a bit by iterating until
      the end of a range is greater than the start of the zone pfn range, and
      will exit completely if the start is beyond the end of the zone.
      
      Link: http://lkml.kernel.org/r/20190405221225.12227.22573.stgit@localhost.localdomainSigned-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: NPavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: NMike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <yi.z.zhang@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      ad97e5e4
    • A
      mm: use mm_zero_struct_page from SPARC on all 64b architectures · e23b0cb5
      Alexander Duyck 提交于
      commit 5470dea49f5382257c242ac617d908267727f1a8 upstream.
      
      Patch series "Deferred page init improvements", v7.
      
      This patchset is essentially a refactor of the page initialization logic
      that is meant to provide for better code reuse while providing a
      significant improvement in deferred page initialization performance.
      
      In my testing on an x86_64 system with 384GB of RAM I have seen the
      following.  In the case of regular memory initialization the deferred init
      time was decreased from 3.75s to 1.38s on average.  This amounts to a 172%
      improvement for the deferred memory initialization performance.
      
      I have called out the improvement observed with each patch.
      
      This patch (of 4):
      
      Use the same approach that was already in use on Sparc on all the
      architectures that support a 64b long.
      
      This is mostly motivated by the fact that 7 to 10 store/move instructions
      are likely always going to be faster than having to call into a function
      that is not specialized for handling page init.
      
      An added advantage to doing it this way is that the compiler can get away
      with combining writes in the __init_single_page call.  As a result the
      memset call will be reduced to only about 4 write operations, or at least
      that is what I am seeing with GCC 6.2 as the flags, LRU pointers, and
      count/mapcount seem to be cancelling out at least 4 of the 8 assignments
      on my system.
      
      One change I had to make to the function was to reduce the minimum page
      size to 56 to support some powerpc64 configurations.
      
      This change should introduce no change on SPARC since it already had this
      code.  In the case of x86_64 I saw a reduction from 3.75s to 2.80s when
      initializing 384GB of RAM per node.  Pavel Tatashin tested on a system
      with Broadcom's Stingray CPU and 48GB of RAM and found that
      __init_single_page() takes 19.30ns / 64-byte struct page before this patch
      and with this patch it takes 17.33ns / 64-byte struct page.  Mike Rapoport
      ran a similar test on a OpenPower (S812LC 8348-21C) with Power8 processor
      and 128GB or RAM.  His results per 64-byte struct page were 4.68ns before,
      and 4.59ns after this patch.
      
      Link: http://lkml.kernel.org/r/20190405221213.12227.9392.stgit@localhost.localdomainSigned-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: NPavel Tatashin <pavel.tatashin@microsoft.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <yi.z.zhang@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      e23b0cb5
    • M
      blk-mq: not embed .mq_kobj and ctx->kobj into queue instance · 9ff28240
      Ming Lei 提交于
      commit 1db4909e76f64a85f4aaa187f0f683f5c85a471d upstream.
      
      Even though .mq_kobj, ctx->kobj and q->kobj share same lifetime
      from block layer's view, actually they don't because userspace may
      grab one kobject anytime via sysfs.
      
      This patch fixes the issue by the following approach:
      
      1) introduce 'struct blk_mq_ctxs' for holding .mq_kobj and managing
      all ctxs
      
      2) free all allocated ctxs and the 'blk_mq_ctxs' instance in release
      handler of .mq_kobj
      
      3) grab one ref of .mq_kobj before initializing each ctx->kobj, so that
      .mq_kobj is always released after all ctxs are freed.
      
      This patch fixes kernel panic issue during booting when DEBUG_KOBJECT_RELEASE
      is enabled.
      Reported-by: NGuenter Roeck <linux@roeck-us.net>
      Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
      Tested-by: NGuenter Roeck <linux@roeck-us.net>
      Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      9ff28240
    • Q
      mm/memblock.c: skip kmemleak for kasan_init() · 9f093569
      Qian Cai 提交于
      commit fed84c78527009d4f799a3ed9a566502fa026d82 upstream.
      
      Kmemleak does not play well with KASAN (tested on both HPE Apollo 70 and
      Huawei TaiShan 2280 aarch64 servers).
      
      After calling start_kernel()->setup_arch()->kasan_init(), kmemleak early
      log buffer went from something like 280 to 260000 which caused kmemleak
      disabled and crash dump memory reservation failed.  The multitude of
      kmemleak_alloc() calls is from nested loops while KASAN is setting up full
      memory mappings, so let early kmemleak allocations skip those
      memblock_alloc_internal() calls came from kasan_init() given that those
      early KASAN memory mappings should not reference to other memory.  Hence,
      no kmemleak false positives.
      
      kasan_init
        kasan_map_populate [1]
          kasan_pgd_populate [2]
            kasan_pud_populate [3]
              kasan_pmd_populate [4]
                kasan_pte_populate [5]
                  kasan_alloc_zeroed_page
                    memblock_alloc_try_nid
                      memblock_alloc_internal
                        kmemleak_alloc
      
      [1] for_each_memblock(memory, reg)
      [2] while (pgdp++, addr = next, addr != end)
      [3] while (pudp++, addr = next, addr != end && pud_none(READ_ONCE(*pudp)))
      [4] while (pmdp++, addr = next, addr != end && pmd_none(READ_ONCE(*pmdp)))
      [5] while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep)))
      
      Link: http://lkml.kernel.org/r/1543442925-17794-1-git-send-email-cai@gmx.usSigned-off-by: NQian Cai <cai@gmx.us>
      Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      9f093569
    • X
      alinux: jbd2: track slow handle which is preventing transaction committing · 83cd9d23
      Xiaoguang Wang 提交于
      While transaction is going to commit, it first sets its state to be
      T_LOCKED and waits all outstanding handles to complete, and the
      committing transaction will always be in locked state so long as it
      has outstanding handles, also the whole fs will be locked and all later
      fs modification operations will be stucked in wait_transaction_locked().
      
      It's hard to tell why handles are that slow, so here we add a new staic
      tracepoint to track such slow handle, and show io wait time and sched
      wait time, output likes below:
        fsstress-20347 [024] ....  1570.305454: jbd2_slow_handle_stats: dev 254,17
      tid 15853 type 4 line_no 3101 interval 126 sync 0 requested_blocks 24
      dirtied_blocks 0 trans_wait 122 space_wait 0 sched_wait 0 io_wait 126
      
      "trans_wait 122" means that this current committing transaction has been
      locked for 122ms, due to this handle is not completed quickly.
      
      From "io_wait 126", we can see that io is the major reason.
      
      In this patch, we also add a per fs control file used to determine
      whether a handle can be considered to be slow.
          /proc/fs/jbd2/vdb1-8/stall_thresh
      default value is 100ms, users can set new threshold by echoing new value
      to this file.
      
      Later I also plan to add a proc file fs per fs to record these info.
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      83cd9d23
    • X
      alinux: fs: record page or bio info while process is waitting on it · 2864376f
      Xiaoguang Wang 提交于
      If one process context is stucked in wait_on_buffer(), lock_buffer(),
      lock_page() and wait_on_page_writeback() and wait_on_bit_io(), it's
      hard to tell ture reason, for example, whether this page is under io,
      or this page is just locked too long by other process context.
      
      Normally io request has multiple bios, and every bio contains multiple
      pages which will hold data to be read from or written to device, so here
      we record page info or bio info in task_struct while process calls
      lock_page(), lock_buffer(), wait_on_page_writeback(), wait_on_buffer()
      and wait_on_bit_io(), we add a new proce interface:
      [lege@localhost linux]$ cat /proc/4516/wait_res
      1 ffffd0969f95d3c0 4295369599 4295381596
      
      Above info means that thread 4516 is waitting on a page, address is
      ffffd0969f95d3c0, and has waited for 11997ms.
      
      First field denotes the page address process is waitting on.
      Second field denotes the wait moment and the third denotes current moment.
      
      In practice, if we found a process waitting on one page for too long time,
      we can get page's address by reading /proc/$pid/wait_page, and search this
      page address in all block devices' /sys/kernel/debug/block/${devname}/rq_hang,
      if search operation hits one, we can get the request and know why this io
      request hangs that long.
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      2864376f
    • X
      alinux: blk: add iohang check function · 80d6ee24
      Xiaoguang Wang 提交于
      Background:
        We do not have a dependable block layer interface to determine whether
      block device has io requests which have not been completed for somewhat
      long time. Currently we have 'in_flight' interface, it counts the number
      of I/O requests that have been issued to the device driver but have
      not yet completed, and it does not include I/O requests that are in the
      queue but not yet issued to the device driver, which means it will not
      count io requests that have been stucked in block layer.
        Also say that there are steady io requests issued to device driver,
      'in_flight' maybe always non-zero, but you could not determine whether
      there is one io request which has not been completed for too long.
      
      Solution:
        To find io requests which have not been completed for too long, here
      add 3 new inferfaces:
        /sys/block/vdb/queue/hang_threshold
      If one io request's running time has been greater than this value, count
      this io as hang.
      
        /sys/block/vdb/hang
      Show read/write io requests' hang counter.
      
        /sys/kernel/debug/block/vdb/rq_hang
      Show all hang io requests's detailed info, like below:
        ffff97db96301200 {.op=WRITE, .cmd_flags=SYNC, .rq_flags=STARTED|
      ELVPRIV|IO_STAT|STATS, .state=in_flight, .tag=30, .internal_tag=169,
      .start_time_ns=140634088407, .io_start_time_ns=140634102958,
      .current_time=146497371953, .bio = ffff97db91e8e000,
      .bio_pages = { ffffd096a0602540 }, .bio = ffff97db91e8ec00,
      .bio_pages = { ffffd096a070eec0 }, .bio = ffff97db91e8f600,
      .bio_pages = { ffffd096a0424cc0 }, .bio = ffff97db91e8f300,
      .bio_pages = { ffffd096a0600a80 }}
      
      With above info, we can easily see this request's latency distribution,
      and see next patch for bio_pages's usage.
      
      Note, /sys/kernel/debug/block/vdb/rq_hang only exists in blk-mq device driver
      and needs CONFIG_BLK_DEBUG_FS enabled.
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      80d6ee24
    • T
      tcp: Add snd_wnd to TCP_INFO · ecee8235
      Thomas Higdon 提交于
      commit 8f7baad7f03543451af27f5380fc816b008aa1f2 upstream
      
      Neal Cardwell mentioned that snd_wnd would be useful for diagnosing TCP
      performance problems --
      > (1) Usually when we're diagnosing TCP performance problems, we do so
      > from the sender, since the sender makes most of the
      > performance-critical decisions (cwnd, pacing, TSO size, TSQ, etc).
      > From the sender-side the thing that would be most useful is to see
      > tp->snd_wnd, the receive window that the receiver has advertised to
      > the sender.
      
      This serves the purpose of adding an additional __u32 to avoid the
      would-be hole caused by the addition of the tcpi_rcvi_ooopack field.
      Signed-off-by: NThomas Higdon <tph@fb.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NTony Lu <tonylu@linux.alibaba.com>
      Acked-by: NDust Li <dust.li@linux.alibaba.com>
      ecee8235
    • T
      tcp: Add TCP_INFO counter for packets received out-of-order · 0107d737
      Thomas Higdon 提交于
      commit f9af2dbbfe01def62765a58af7fbc488351893c3 upstream
      
      For receive-heavy cases on the server-side, we want to track the
      connection quality for individual client IPs. This counter, similar to
      the existing system-wide TCPOFOQueue counter in /proc/net/netstat,
      tracks out-of-order packet reception. By providing this counter in
      TCP_INFO, it will allow understanding to what degree receive-heavy
      sockets are experiencing out-of-order delivery and packet drops
      indicating congestion.
      
      Please note that this is similar to the counter in NetBSD TCP_INFO, and
      has the same name.
      
      Also note that we avoid increasing the size of the tcp_sock struct by
      taking advantage of a hole.
      Signed-off-by: NThomas Higdon <tph@fb.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NTony Lu <tonylu@linux.alibaba.com>
      Acked-by: NDust Li <dust.li@linux.alibaba.com>
      0107d737
    • S
      mm, memcg: introduce memory.events.local · de7ca746
      Shakeel Butt 提交于
      commit 1e577f970f66a53d429cbee37b36177c9712f488 upstream.
      
      The memory controller in cgroup v2 exposes memory.events file for each
      memcg which shows the number of times events like low, high, max, oom
      and oom_kill have happened for the whole tree rooted at that memcg.
      Users can also poll or register notification to monitor the changes in
      that file.  Any event at any level of the tree rooted at memcg will
      notify all the listeners along the path till root_mem_cgroup.  There are
      existing users which depend on this behavior.
      
      However there are users which are only interested in the events
      happening at a specific level of the memcg tree and not in the events in
      the underlying tree rooted at that memcg.  One such use-case is a
      centralized resource monitor which can dynamically adjust the limits of
      the jobs running on a system.  The jobs can create their sub-hierarchy
      for their own sub-tasks.  The centralized monitor is only interested in
      the events at the top level memcgs of the jobs as it can then act and
      adjust the limits of the jobs.  Using the current memory.events for such
      centralized monitor is very inconvenient.  The monitor will keep
      receiving events which it is not interested and to find if the received
      event is interesting, it has to read memory.event files of the next
      level and compare it with the top level one.  So, let's introduce
      memory.events.local to the memcg which shows and notify for the events
      at the memcg level.
      
      Now, does memory.stat and memory.pressure need their local versions.  IMHO
      no due to the no internal process contraint of the cgroup v2.  The
      memory.stat file of the top level memcg of a job shows the stats and
      vmevents of the whole tree.  The local stats or vmevents of the top level
      memcg will only change if there is a process running in that memcg but v2
      does not allow that.  Similarly for memory.pressure there will not be any
      process in the internal nodes and thus no chance of local pressure.
      
      Link: http://lkml.kernel.org/r/20190527174643.209172-1-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com>
      Reviewed-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Chris Down <chris@chrisdown.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      de7ca746
    • C
      mm, memcg: consider subtrees in memory.events · 0a198bb9
      Chris Down 提交于
      commit 9852ae3fe5293264f01c49f2571ef7688f7823ce upstream.
      
      memory.stat and other files already consider subtrees in their output, and
      we should too in order to not present an inconsistent interface.
      
      The current situation is fairly confusing, because people interacting with
      cgroups expect hierarchical behaviour in the vein of memory.stat,
      cgroup.events, and other files.  For example, this causes confusion when
      debugging reclaim events under low, as currently these always read "0" at
      non-leaf memcg nodes, which frequently causes people to misdiagnose breach
      behaviour.  The same confusion applies to other counters in this file when
      debugging issues.
      
      Aggregation is done at write time instead of at read-time since these
      counters aren't hot (unlike memory.stat which is per-page, so it does it
      at read time), and it makes sense to bundle this with the file
      notifications.
      
      After this patch, events are propagated up the hierarchy:
      
          [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
          low 0
          high 0
          max 0
          oom 0
          oom_kill 0
          [root@ktst ~]# systemd-run -p MemoryMax=1 true
          Running as unit: run-r251162a189fb4562b9dabfdc9b0422f5.service
          [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
          low 0
          high 0
          max 7
          oom 1
          oom_kill 1
      
      As this is a change in behaviour, this can be reverted to the old
      behaviour by mounting with the `memory_localevents' flag set.  However, we
      use the new behaviour by default as there's a lack of evidence that there
      are any current users of memory.events that would find this change
      undesirable.
      
      akpm: this is a behaviour change, so Cc:stable.  THis is so that
      forthcoming distros which use cgroup v2 are more likely to pick up the
      revised behaviour.
      
      [xuyu: remove the new memory_localevents mount option because it is
      rarely used]
      
      Link: http://lkml.kernel.org/r/20190208224419.GA24772@chrisdown.nameSigned-off-by: NChris Down <chris@chrisdown.name>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      0a198bb9
    • R
      mm: introduce ARCH_HAS_PTE_DEVMAP · d7066a91
      Robin Murphy 提交于
      commit 175967318c3018d01931ac950c82adab5deb47ca upstream.
      
      ARCH_HAS_ZONE_DEVICE is somewhat meaningless in itself, and combined
      with the long-out-of-date comment can lead to the impression than an
      architecture may just enable it (since __add_pages() now "comprehends
      device memory" for itself) and expect things to work.
      
      In practice, however, ZONE_DEVICE users have little chance of
      functioning correctly without __HAVE_ARCH_PTE_DEVMAP, so let's clean
      that up the same way as ARCH_HAS_PTE_SPECIAL and make it the proper
      dependency so the real situation is clearer.
      
      Link: http://lkml.kernel.org/r/87554aa78478a02a63f2c4cf60a847279ae3eb3b.1558547956.git.robin.murphy@arm.comSigned-off-by: NRobin Murphy <robin.murphy@arm.com>
      Acked-by: NDan Williams <dan.j.williams@intel.com>
      Reviewed-by: NIra Weiny <ira.weiny@intel.com>
      Acked-by: NOliver O'Halloran <oohall@gmail.com>
      Reviewed-by: NAnshuman Khandual <anshuman.khandual@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NShannon Zhao <shannon.zhao@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      d7066a91
    • M
      linkage: add generic GLOBAL() macro · 37889fe9
      Mark Rutland 提交于
      commit ad697a1aecac19ec351063b5d8e6fc9d4bca7ee5 upstream.
      
      Declaring a global symbol in assembly is tedious, error-prone, and
      painful to read. While ENTRY() exists, this is supposed to be used for
      function entry points, and this affects alignment in a potentially
      undesireable manner.
      
      Instead, let's add a generic GLOBAL() macro for this, as x86 added
      locally in commit:
      
        95695547 ("x86: asm linkage - introduce GLOBAL macro")
      
      ... thus allowing us to use this more freely in the kernel.
      Signed-off-by: NMark Rutland <mark.rutland@arm.com>
      Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Torsten Duwe <duwe@suse.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: Zou Cao<zoucao@linux.alibaba.com>
      Acked-by: NBaoyou Xie <xie.baoyou@linux.alibaba.com>
      37889fe9
    • S
      compiler.h: add CC_USING_PATCHABLE_FUNCTION_ENTRY · 7f887650
      Sven Schnelle 提交于
      commit 2809b392a62ae307da058a52d451b2fc3ce4de7e upstream.
      
      This can be used for architectures implementing dynamic
      ftrace via -fpatchable-function-entry.
      Signed-off-by: NSven Schnelle <svens@stackframe.org>
      Signed-off-by: NHelge Deller <deller@gmx.de>
      Signed-off-by: Zou Cao<zoucao@linux.alibaba.com>
      Acked-by: NBaoyou Xie <xie.baoyou@linux.alibaba.com>
      7f887650