1. 24 April 2020 (6 commits)
  2. 23 April 2020 (1 commit)
    • mm, compaction: capture a page under direct compaction · 35d915be
      Committed by Mel Gorman
      to #26255339
      
      commit 5e1f0f098b4649fad53011246bcaeff011ffdf5d upstream
      
      Compaction is inherently race-prone as a suitable page freed during
      compaction can be allocated by any parallel task.  This patch uses a
      capture_control structure to isolate a page immediately when it is freed
      by a direct compactor in the slow path of the page allocator.  The
      intent is to avoid redundant scanning.
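
      A simplified sketch of the mechanism (structure and flow abridged
      from compact_zone_order() in the upstream patch):

        struct capture_control {
                struct compact_control *cc;
                struct page *page;      /* set by the page free path */
        };

        /* direct compactor, in the allocator slow path (abridged) */
        current->capture_control = &capc;
        ret = compact_zone(&cc, &capc);
        current->capture_control = NULL;
        if (capc.page)
                page = capc.page;       /* no freelist round-trip */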
      
                                           5.0.0-rc1              5.0.0-rc1
                                     selective-v3r17          capture-v3r19
      Amean     fault-both-1         0.00 (   0.00%)        0.00 *   0.00%*
      Amean     fault-both-3      2582.11 (   0.00%)     2563.68 (   0.71%)
      Amean     fault-both-5      4500.26 (   0.00%)     4233.52 (   5.93%)
      Amean     fault-both-7      5819.53 (   0.00%)     6333.65 (  -8.83%)
      Amean     fault-both-12     9321.18 (   0.00%)     9759.38 (  -4.70%)
      Amean     fault-both-18     9782.76 (   0.00%)    10338.76 (  -5.68%)
      Amean     fault-both-24    15272.81 (   0.00%)    13379.55 *  12.40%*
      Amean     fault-both-30    15121.34 (   0.00%)    16158.25 (  -6.86%)
      Amean     fault-both-32    18466.67 (   0.00%)    18971.21 (  -2.73%)
      
      Latency is only moderately affected but the devil is in the details.  A
      closer examination indicates that base page fault latency is reduced but
      latency of huge pages is increased as it takes greater care to succeed.
      Part of the "problem" is that allocation success rates are close to 100%
      even when under pressure and compaction gets harder.
      
                                      5.0.0-rc1              5.0.0-rc1
                                selective-v3r17          capture-v3r19
      Percentage huge-3        96.70 (   0.00%)       98.23 (   1.58%)
      Percentage huge-5        96.99 (   0.00%)       95.30 (  -1.75%)
      Percentage huge-7        94.19 (   0.00%)       97.24 (   3.24%)
      Percentage huge-12       94.95 (   0.00%)       97.35 (   2.53%)
      Percentage huge-18       96.74 (   0.00%)       97.30 (   0.58%)
      Percentage huge-24       97.07 (   0.00%)       97.55 (   0.50%)
      Percentage huge-30       95.69 (   0.00%)       98.50 (   2.95%)
      Percentage huge-32       96.70 (   0.00%)       99.27 (   2.65%)
      
      And scan rates are reduced as expected by 6% for the migration scanner
      and 29% for the free scanner, indicating that there is less redundant
      work.
      
      Compaction migrate scanned    20815362    19573286
      Compaction free scanned       16352612    11510663
      
      [mgorman@techsingularity.net: remove redundant check]
        Link: http://lkml.kernel.org/r/20190201143853.GH9565@techsingularity.net
      Link: http://lkml.kernel.org/r/20190118175136.31341-23-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
  3. 22 April 2020 (4 commits)
    • sysctl: handle overflow in proc_get_long · 662ef34f
      Committed by Christian Brauner
      fix #27124689
      
      commit 7f2923c4f73f21cfd714d12a2d48de8c21f11cfe upstream.
      
      proc_get_long() is a funny function.  It uses simple_strtoul() and for a
      good reason.  proc_get_long() wants to always succeed the parse and
      return the maybe incorrect value and the trailing characters to check
      against a pre-defined list of acceptable trailing values.  However,
      simple_strtoul() explicitly ignores overflows which can cause funny
      things like the following to happen:
      
        echo 18446744073709551616 > /proc/sys/fs/file-max
        cat /proc/sys/fs/file-max
        0
      
      (Which will cause your system to silently die behind your back.)
      
      On the other hand kstrtoul() does do overflow detection but does not
      return the trailing characters, and also fails the parse when anything
      other than '\n' is a trailing character whereas proc_get_long() wants to
      be more lenient.
      
      Now, before adding another kstrtoul() function let's simply add a static
      parse strtoul_lenient() which:
       - fails on overflow with -ERANGE
       - returns the trailing characters to the caller
      
      The reason why we should fail on ERANGE is that we already do a partial
      fail on overflow right now.  Namely, when the TMPBUFLEN is exceeded.  So
      we already reject values such as 184467440737095516160 (21 chars) but
      accept values such as 18446744073709551616 (20 chars) but both are
      overflows.  So we should just always reject 64bit overflows and not
      special-case this based on the number of chars.
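
      A standalone userspace sketch of the proposed semantics (the
      in-kernel helper is built on _parse_integer() rather than
      strtoull(), so treat this as an approximation):

        #include <errno.h>
        #include <stdlib.h>

        /* fail on overflow with -ERANGE, but still hand the trailing
         * characters back to the caller via 'endp' */
        static int strtoul_lenient(const char *cp, char **endp,
                                   unsigned int base, unsigned long *res)
        {
                unsigned long long val;

                errno = 0;
                val = strtoull(cp, endp, base);
                if (errno == ERANGE || val != (unsigned long)val)
                        return -ERANGE; /* reject 64-bit overflow */

                *res = (unsigned long)val;
                return 0;
        }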
      
      Link: http://lkml.kernel.org/r/20190107222700.15954-2-christian@brauner.io
      Signed-off-by: Christian Brauner <christian@brauner.io>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Joe Lawrence <joe.lawrence@redhat.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • sched: Avoid scale real weight down to zero · 9b83fd88
      Committed by Michael Wang
      fix #26198889
      
      commit 26cf52229efc87e2effa9d788f9b33c40fb3358a linux-next
      
      During our testing, we found a case where shares no longer
      work correctly; the cgroup topology is like:
      
        /sys/fs/cgroup/cpu/A		(shares=102400)
        /sys/fs/cgroup/cpu/A/B	(shares=2)
        /sys/fs/cgroup/cpu/A/B/C	(shares=1024)
      
        /sys/fs/cgroup/cpu/D		(shares=1024)
        /sys/fs/cgroup/cpu/D/E	(shares=1024)
        /sys/fs/cgroup/cpu/D/E/F	(shares=1024)
      
      The same benchmark is running in groups C & F, no other tasks are
      running, and the benchmark is capable of consuming all the CPUs.
      
      We expect group C to win more CPU resources since it could
      enjoy all the shares of group A, but it is F who wins much more.
      
      The reason is that group B has shares of 2; since
      A->cfs_rq.load.weight == B->se.load.weight == B->shares/nr_cpus,
      A->cfs_rq.load.weight becomes very small.
      
      And in calc_group_shares() we calculate shares as:
      
        load = max(scale_load_down(cfs_rq->load.weight),
                   cfs_rq->avg.load_avg);
        shares = (tg_shares * load) / tg_weight;
      
      Since 'cfs_rq->load.weight' is too small, the load becomes 0
      after the scale-down; although 'tg_shares' is 102400, the shares
      of the se which stands for group A on the root cfs_rq become 2.

      While the se of D on the root cfs_rq is far bigger than 2, so it
      wins the battle.
      
      Thus when scale_load_down() scales the real weight down to 0, it
      no longer tells the real story: the caller gets the wrong
      information and the calculation becomes buggy.

      This patch adds a check in scale_load_down() so that the real
      weight is >= MIN_SHARES after scaling; with it applied, group C
      wins as expected.
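
      The fix, roughly as in the upstream commit (64-bit load-resolution
      case):

        /* never let a non-zero weight scale down below MIN_SHARES (2),
         * so calc_group_shares() still sees a meaningful load for
         * tiny-share groups such as B above */
        #define scale_load_down(w)                                      \
        ({                                                              \
                unsigned long __w = (w);                                \
                if (__w)                                                \
                        __w = max(2UL, __w >> SCHED_FIXEDPOINT_SHIFT);  \
                __w;                                                    \
        })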
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lkml.kernel.org/r/38e8e212-59a1-64b2-b247-b6d0b52d8dc1@linux.alibaba.com
      Acked-by: Shanpei Chen <shanpeic@linux.alibaba.com>
      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
    • sched/fair: Fix race between runtime distribution and assignment · 70a23044
      Committed by Huaixin Chang
      fix #25892693
      
      commit 26a8b12747c975b33b4a82d62e4a307e1c07f31b upstream
      
      Currently, there is a potential race between distribute_cfs_runtime()
      and assign_cfs_rq_runtime(). The race happens when cfs_b->runtime is
      read and distributed without holding the lock, and distribution then
      finds there is not enough runtime left to charge against, because
      assign_cfs_rq_runtime() may have been called during the distribution
      and consumed cfs_b->runtime at the same time.
      
      Fibtest is the tool used to test this race. Assume all gcfs_rqs are
      throttled and the cfs period timer runs: slow threads might run and
      sleep, returning unused cfs_rq runtime and keeping min_cfs_rq_runtime
      in their local pool. If all this happens sufficiently quickly,
      cfs_b->runtime will drop a lot. If the runtime distributed is large
      too, over-use of runtime happens.
      
      Runtime over-use of about 70 percent of the quota is seen when we
      test fibtest on a 96-core machine. We run fibtest with 1 fast
      thread and 95 slow threads in the test group, configure a 10ms
      quota for this group, and see the CPU usage of fibtest at 17.0%,
      which is far more than the expected 10%.
      
      On a smaller machine with 32 cores, we also run fibtest with 96
      threads. CPU usage is more than 12%, which is also more than the
      expected 10%. This shows that on similar workloads, this race does
      affect CPU bandwidth control.
      
      Solve this by holding lock inside distribute_cfs_runtime().
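
      Abridged from the upstream fix, the inner distribution loop now
      charges cfs_b->runtime only while holding cfs_b->lock:

        raw_spin_lock(&cfs_b->lock);
        runtime = -cfs_rq->runtime_remaining + 1;
        if (runtime > cfs_b->runtime)
                runtime = cfs_b->runtime;
        cfs_b->runtime -= runtime;      /* charged under the lock */
        remaining = cfs_b->runtime;
        raw_spin_unlock(&cfs_b->lock);

        cfs_rq->runtime_remaining += runtime;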
      
      Fixes: c06f04c7 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
      Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Link: https://lore.kernel.org/lkml/20200325092602.22471-1-changhuaixin@linux.alibaba.com/
      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
    • alinux: cgroup: Fix task_css_check rcu warnings · 798cfa76
      Committed by Xunlei Pang
      to #26424323
      
      task_css() should be called under the RCU read lock; fix several callers.
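
      The fix pattern at each call site is the usual RCU bracket:

        rcu_read_lock();
        css = task_css(task, cpuacct_cgrp_id);
        /* ... use css only while the RCU read lock is held ... */
        rcu_read_unlock();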
      
      Fixes: 1f49a738 ("alinux: psi: Support PSI under cgroup v1")
      Acked-by: Michael Wang <yun.wany@linux.alibaba.com>
      Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
  4. 17 April 2020 (1 commit)
  5. 18 March 2020 (10 commits)
  6. 17 January 2020 (10 commits)
    • io_uring: add support for pre-mapped user IO buffers · a078ed69
      Committed by Jens Axboe
      commit edafccee56ff31678a091ddb7219aba9b28bc3cb upstream.
      
      If we have fixed user buffers, we can map them into the kernel when we
      setup the io_uring. That avoids the need to do get_user_pages() for
      each and every IO.
      
      To utilize this feature, the application must call io_uring_register()
      after having setup an io_uring instance, passing in
      IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to
      an iovec array, and the nr_args should contain how many iovecs the
      application wishes to map.
      
      If successful, these buffers are now mapped into the kernel, eligible
      for IO. To use these fixed buffers, the application must use the
      IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
      set sqe->index to the desired buffer index. sqe->addr..sqe->addr+sqe->len
      must point to somewhere inside the indexed buffer.
      
      The application may register buffers throughout the lifetime of the
      io_uring instance. It can call io_uring_register() with
      IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
      buffers, and then register a new set. The application need not
      unregister buffers explicitly before shutting down the io_uring
      instance.
      
      It's perfectly valid to setup a larger buffer, and then sometimes only
      use parts of it for an IO. As long as the range is within the originally
      mapped region, it will work just fine.
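
      A hedged userspace sketch of this flow (assumes an
      io_uring_register() syscall wrapper; note that in the final uapi
      the buffer-index field is spelled sqe->buf_index):

        struct iovec iov = {
                .iov_base = buf,
                .iov_len  = buf_len,
        };

        /* register once, up front */
        io_uring_register(ring_fd, IORING_REGISTER_BUFFERS, &iov, 1);

        /* per IO: fixed-buffer opcode, pointing inside the iovec */
        sqe->opcode    = IORING_OP_READ_FIXED;
        sqe->fd        = file_fd;
        sqe->addr      = (unsigned long)buf;
        sqe->len       = buf_len;
        sqe->buf_index = 0;     /* index into the registered array */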
      
      For now, buffers must not be file backed. If file backed buffers are
      passed in, the registration will fail with -1/EOPNOTSUPP. This
      restriction may be relaxed in the future.
      
      RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat
      arbitrary 1G per buffer size is also imposed.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • Add io_uring IO interface · 209d771f
      Committed by Jens Axboe
      commit 2b188cc1bb857a9d4701ae59aa7768b5124e262e upstream.
      
      The submission queue (SQ) and completion queue (CQ) rings are shared
      between the application and the kernel. This eliminates the need to
      copy data back and forth to submit and complete IO.
      
      IO submissions use the io_uring_sqe data structure, and completions
      are generated in the form of io_uring_cqe data structures. The SQ
      ring is an index into the io_uring_sqe array, which makes it possible
      to submit a batch of IOs without them being contiguous in the ring.
      The CQ ring is always contiguous, as completion events are inherently
      unordered, and hence any io_uring_cqe entry can point back to an
      arbitrary submission.
      
      Two new system calls are added for this:
      
      io_uring_setup(entries, params)
      	Sets up an io_uring instance for doing async IO. On success,
      	returns a file descriptor that the application can mmap to
      	gain access to the SQ ring, CQ ring, and io_uring_sqes.
      
      io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
      	Initiates IO against the rings mapped to this fd, or waits for
      	them to complete, or both. The behavior is controlled by the
      	parameters passed in. If 'to_submit' is non-zero, then we'll
      	try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
      	kernel will wait for 'min_complete' events, if they aren't
      	already available. It's valid to set IORING_ENTER_GETEVENTS
      	and 'min_complete' == 0 at the same time, this allows the
      	kernel to return already completed events without waiting
      	for them. This is useful only for polling, as for IRQ
      	driven IO, the application can just check the CQ ring
      	without entering the kernel.
      
      With this setup, it's possible to do async IO with a single system
      call. Future developments will enable polled IO with this interface,
      and polled submission as well. The latter will enable an application
      to do IO without doing ANY system calls at all.
      
      For IRQ driven IO, an application only needs to enter the kernel for
      completions if it wants to wait for them to occur.
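
      A hedged sketch of the minimal userspace flow (assumes syscall
      wrappers for the two new calls; ring-pointer arithmetic and memory
      barriers omitted):

        struct io_uring_params p;
        memset(&p, 0, sizeof(p));
        int fd = io_uring_setup(8, &p);

        /* map the SQ ring; the CQ ring and sqe array are mapped the
         * same way with IORING_OFF_CQ_RING / IORING_OFF_SQES */
        void *sq_ring = mmap(NULL,
                        p.sq_off.array + p.sq_entries * sizeof(__u32),
                        PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
                        fd, IORING_OFF_SQ_RING);

        /* fill one sqe, publish it in the SQ ring, then: */
        io_uring_enter(fd, 1, 1, IORING_ENTER_GETEVENTS, NULL, _NSIG / 8);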
      
      Each io_uring is backed by a workqueue, to support buffered async IO
      as well. We will only punt to an async context if the command would
      need to wait for IO on the device side. Any data that can be accessed
      directly in the page cache is done inline. This avoids the slowness
      issue of usual threadpools, since cached data is accessed as quickly
      as a sync interface.
      
      Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • signal: Add restore_user_sigmask() · d2d8b4c2
      Committed by Deepa Dinamani
      commit 854a6ed56839a40f6b5d02a2962f48841482eec4 upstream.
      
      Refactor the logic to restore the sigmask before the syscall
      returns into an api.
      This is useful for versions of syscalls that pass in the
      sigmask and expect the current->sigmask to be changed during
      the execution and restored after the execution of the syscall.
      
      With the advent of new y2038 syscalls in the subsequent patches,
      we add two more new versions of the syscalls (for pselect, ppoll
      and io_pgetevents) in addition to the existing native and compat
      versions. Adding such an api reduces the logic that would need to
      be replicated otherwise.
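
      A hedged sketch of how a ppoll-style syscall uses the pair (helper
      names are from this commit and its set_user_sigmask() companion
      below; the syscall body is illustrative):

        sigset_t ksigmask, sigsaved;
        int ret;

        /* stash current->blocked and install the caller's mask */
        ret = set_user_sigmask(usig, &ksigmask, &sigsaved, sigsetsize);
        if (ret)
                return ret;

        ret = do_poll_wait();   /* illustrative syscall body */

        /* put the original mask back */
        restore_user_sigmask(usig, &sigsaved);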
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • signal: Add set_user_sigmask() · 2e132aa1
      Committed by Deepa Dinamani
      commit ded653ccbec0335a78fa7a7aff3ec9870349fafb upstream.
      
      Refactor reading sigset from userspace and updating sigmask
      into an api.
      
      This is useful for versions of syscalls that pass in the
      sigmask and expect the current->sigmask to be changed during,
      and restored after, the execution of the syscall.
      
      With the advent of new y2038 syscalls in the subsequent patches,
      we add two more new versions of the syscalls (for pselect, ppoll,
      and io_pgetevents) in addition to the existing native and compat
      versions. Adding such an api reduces the logic that would need to
      be replicated otherwise.
      
      Note that the calls to sigprocmask() ignored the return value
      from the api as the function only returns an error on an invalid
      first argument that is hardcoded at these call sites.
      The updated logic uses set_current_blocked() instead.
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • alinux: psi: using cpuacct_cgrp_id under CONFIG_CGROUP_CPUACCT · 1bd8a72b
      Committed by Joseph Qi
      This is to fix the build error if CONFIG_CGROUP_CPUACCT is not enabled.
        kernel/sched/psi.c: In function 'iterate_groups':
        kernel/sched/psi.c:732:31: error: 'cpuacct_cgrp_id' undeclared (first use in this function); did you mean 'cpuacct_charge'?
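
      A hedged sketch of the guard this implies in iterate_groups() (the
      exact shape of the alinux fix is assumed):

        #ifdef CONFIG_CGROUP_CPUACCT
                cgroup = task_cgroup(task, cpuacct_cgrp_id);
        #else
                cgroup = NULL;  /* cgroup v1 psi path compiled out */
        #endif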
      Reported-by: kbuild test robot <lkp@intel.com>
      Fixes: 1f49a738 ("alinux: psi: Support PSI under cgroup v1")
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
    • alinux: sched/fair: use static load in wake_affine_weight · d2440c99
      Committed by Huaixin Chang
      For a long time, runnable cpu load has been used when selecting the
      task rq for tasks being woken up. Recent tests have shown that for
      a workload with a large quantity of short-running tasks and almost
      full cpu utilization, static load is more helpful.
      
      In our e2e tests, the runnable load avg of java threads ranges from
      less than 10 to as large as 362, while these java threads are no
      different from each other and should be treated in the same way.
      After using static load, QPS improvement has been seen in multiple
      test cases.
      
      A new sched feature WA_STATIC_WEIGHT is introduced here to control
      this behavior, as sketched below. Echo WA_STATIC_WEIGHT to
      /sys/kernel/debug/sched_features to turn static load in
      wake_affine_weight on, and NO_WA_STATIC_WEIGHT to turn it off. The
      feature is kept off by default.
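
      A hypothetical sketch of the idea (the actual alinux patch may
      differ; the helper name is illustrative):

        /* illustrative: choose the load source by sched feature */
        static unsigned long wa_load(struct rq *rq)
        {
                if (sched_feat(WA_STATIC_WEIGHT))
                        return scale_load_down(rq->cfs.load.weight);
                return weighted_cpuload(rq);    /* runnable load avg */
        }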
      
      Test is done on the following hardware:
      
      4 threads Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
      
      In tests with 120 threads and sql loglevel configured to info:
      
      	NO_WA_STATIC_WEIGHT     WA_STATIC_WEIGHT
      	33170.63                34614.95 (+4.35%)
      
      In tests with 160 threads and sql loglevel configured to info:
      
      	NO_WA_STATIC_WEIGHT     WA_STATIC_WEIGHT
      	35888.71                38247.20 (+6.57%)
      
      In tests with 160 threads and sql loglevel configured to warn:
      
      	NO_WA_STATIC_WEIGHT     WA_STATIC_WEIGHT
      	39118.72                39698.72 (+1.48%)
      Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
      Acked-by: Shanpei Chen <shanpeic@linux.alibaba.com>
    • modsign: use all trusted keys to verify module signature · f7a573c3
      Committed by Ke Wu
      commit e84cd7ee630e44a2cc8ae49e85920a271b214cb3 upstream
      
      Make mod_verify_sig() use all trusted keys. This allows keys in
      the secondary_trusted_keys keyring to be used to verify the
      PKCS#7 signature on a kernel module.
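
      The change is essentially one argument to verify_pkcs7_signature()
      in mod_verify_sig(), selecting the secondary keyring as well:

        return verify_pkcs7_signature(mod, modlen, mod + modlen, sig_len,
                                      VERIFY_USE_SECONDARY_KEYRING,
                                      VERIFYING_MODULE_SIGNATURE,
                                      NULL, NULL);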
      Signed-off-by: Ke Wu <mikewu@google.com>
      Signed-off-by: Jessica Yu <jeyu@kernel.org>
      Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
      Reviewed-by: Jia Zhang <zhang.jia@linux.alibaba.com>
    • alinux: hotfix: Add Cloud Kernel hotfix enhancement · f94e5b1a
      Committed by Xunlei Pang
      We reserve some fields beforehand in core structures prone to
      change, so that we are not hurt when extra fields have to be added
      for a hotfix, thereby increasing the hotfix success rate; we can
      even hot-add features with this enhancement.
      
      After reserving, cache behavior normally does not matter, as the
      reserved fields (usually at the tail) are not accessed at all.
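
      A hedged illustration of the reservation pattern (the macro name
      is assumed, modeled on similar kABI-padding schemes):

        /* assumed helper: one unsigned long of padding per slot */
        #define CK_KABI_RESERVE(n)      unsigned long ck_reserved##n;

        struct gendisk {
                /* ... existing fields ... */
                CK_KABI_RESERVE(1)
                CK_KABI_RESERVE(2)
                CK_KABI_RESERVE(3)
                CK_KABI_RESERVE(4)
        };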
      
      The following structures are currently involved:
          MM:
          struct zone
          struct pglist_data
          struct mm_struct
          struct vm_area_struct
          struct mem_cgroup
          struct writeback_control
      
          Block:
          struct gendisk
          struct backing_dev_info
          struct bio
          struct queue_limits
          struct request_queue
          struct blkcg
          struct blkcg_policy
          struct blk_mq_hw_ctx
          struct blk_mq_tag_set
          struct blk_mq_queue_data
          struct blk_mq_ops
          struct elevator_mq_ops
          struct inode
          struct dentry
          struct address_space
          struct block_device
          struct hd_struct
          struct bio_set
      
          Network:
          struct sk_buff
          struct sock
          struct net_device_ops
          struct xt_target
          struct dst_entry
          struct dst_ops
          struct fib_rule
      
          Scheduler:
          struct task_struct
          struct cfs_rq
          struct rq
          struct sched_statistics
          struct sched_entity
          struct signal_struct
          struct task_group
          struct cpuacct
      
          cgroup:
          struct cgroup_root
          struct cgroup_subsys_state
          struct cgroup_subsys
          struct css_set
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
      [ caspar: use SPDX-License-Identifier ]
      Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
    • alinux: introduce psi_v1 boot parameter · dc159a61
      Committed by Joseph Qi
      Instead of using the static kconfig option CONFIG_PSI_CGROUP_V1,
      we introduce a boot parameter psi_v1 to enable psi cgroup v1
      support. It is disabled by default, which means that when only the
      psi=1 boot parameter is passed, we support cgroup v2 alone.
      This keeps it consistent with other cgroup v1 features such as
      cgroup writeback v1 (cgwb_v1).
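
      A minimal sketch of how such a boot parameter is typically wired
      up (the exact alinux implementation is assumed):

        static bool psi_v1_enable;      /* off unless psi_v1=1 given */

        static int __init setup_psi_v1(char *str)
        {
                return kstrtobool(str, &psi_v1_enable) == 0;
        }
        __setup("psi_v1=", setup_psi_v1);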
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
    • alinux: psi: Support PSI under cgroup v1 · 1f49a738
      Committed by Xunlei Pang
      Export "cpu|io|memory.pressure" under cgroup v1 "cpuacct" subsystem.
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
  7. 15 January 2020 (8 commits)
    • alinux: fs,ext4: remove projid limit when create hard link · 08e6d768
      Committed by zhangliguang
      This is a temporary workaround to avoid the limitation when
      creating hard links across two projids.
      Signed-off-by: zhangliguang <zhangliguang@linux.alibaba.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • resource/docs: Complete kernel-doc style function documentation · f1f8f4a4
      Committed by Borislav Petkov
      commit f26621e60b35369bca9228bc936dc723b3e421af upstream.
      
      Add the missing kernel-doc style function parameter documentation,
      as illustrated below.
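
      The kernel-doc shape being added, illustrated for
      find_next_iomem_res() (descriptions paraphrased):

        /**
         * find_next_iomem_res - find the lowest iomem resource that
         *                       covers part of [start..end]
         * @start: start address of the resource searched for
         * @end: end address of the same resource
         * @flags: flags which the resource must have
         * @desc: descriptor the resource must have
         * @first_lvl: walk only the first-level children, if set
         * @res: return pointer, if a resource is found
         */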
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: linux-tip-commits@vger.kernel.org
      Cc: rdunlap@infradead.org
      Fixes: b69c2e20f6e4 ("resource: Clean it up a bit")
      Link: http://lkml.kernel.org/r/20181105093307.GA12445@zn.tnic
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      [joseph: fix find_next_iomem_res() documentation]
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
    • resource/docs: Fix new kernel-doc warnings · 275561cd
      Committed by Randy Dunlap
      commit f75d651587f719a813ebbbfeee570e6570731d55 upstream.
      
      The first group of warnings is caused by a "/**" kernel-doc notation
      marker but the function comments are not in kernel-doc format.
      Also add another error return value here.
      
        ../kernel/resource.c:337: warning: Function parameter or member 'start' not described in 'find_next_iomem_res'
        ../kernel/resource.c:337: warning: Function parameter or member 'end' not described in 'find_next_iomem_res'
        ../kernel/resource.c:337: warning: Function parameter or member 'flags' not described in 'find_next_iomem_res'
        ../kernel/resource.c:337: warning: Function parameter or member 'desc' not described in 'find_next_iomem_res'
        ../kernel/resource.c:337: warning: Function parameter or member 'first_lvl' not described in 'find_next_iomem_res'
        ../kernel/resource.c:337: warning: Function parameter or member 'res' not described in 'find_next_iomem_res'
      
      Add the missing function parameter documentation for the other warnings:
      
        ../kernel/resource.c:409: warning: Function parameter or member 'arg' not described in 'walk_iomem_res_desc'
        ../kernel/resource.c:409: warning: Function parameter or member 'func' not described in 'walk_iomem_res_desc'
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: b69c2e20f6e4 ("resource: Clean it up a bit")
      Link: http://lkml.kernel.org/r/dda2e4d8-bedd-3167-20fe-8c7d2d35b354@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      [joseph: fix find_next_iomem_res() documentation]
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
    • mm/resource: Let walk_system_ram_range() search child resources · a9b17a5e
      Committed by Dave Hansen
      commit 2b539aefe9e48e3908cff02699aa63a8b9bd268e upstream
      
      In the process of onlining memory, we use walk_system_ram_range()
      to find the actual RAM areas inside of the area being onlined.
      
      However, it currently only finds memory resources which are
      "top-level" iomem_resources.  Children are not currently
      searched which causes it to skip System RAM in areas like this
      (in the format of /proc/iomem):
      
      a0000000-bfffffff : Persistent Memory (legacy)
        a0000000-afffffff : System RAM
      
      Changing the true->false here allows children to be searched
      as well.  We need this because we add a new "System RAM"
      resource underneath the "persistent memory" resource when
      we use persistent memory in a volatile mode.
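
      The "true->false" refers to the first_lvl argument of
      find_next_iomem_res(); a sketch of the call site in
      walk_system_ram_range():

        while (start < end &&
               !find_next_iomem_res(start, end, flags, IORES_DESC_NONE,
                                    false /* was: true */, &res)) {
                /* ... invoke func() on each System RAM chunk ... */
        }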
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: linux-nvdimm@lists.01.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
    • mm/resource: Move HMM pr_debug() deeper into resource code · 0297fb96
      Committed by Dave Hansen
      commit b926b7f3baecb2a855db629e6822e1a85212e91c upstream
      
      HMM consumes physical address space for its own use, even
      though nothing is mapped or accessible there.  It uses a
      special resource description (IORES_DESC_DEVICE_PRIVATE_MEMORY)
      to uniquely identify these areas.
      
      When HMM consumes address space, it makes a best guess about
      what to consume.  However, it is possible that a future memory
      or device hotplug can collide with the reserved area.  In the
      case of these conflicts, there is an error message in
      register_memory_resource().
      
      Later patches in this series move register_memory_resource()
      from using request_resource_conflict() to __request_region().
      Unfortunately, __request_region() does not return the conflict
      like the previous function did, which makes it impossible to
      check for IORES_DESC_DEVICE_PRIVATE_MEMORY in a conflicting
      resource.
      
      Instead of warning in register_memory_resource(), move the
      check into the core resource code itself (__request_region())
      where the conflicting resource _is_ available.  This has the
      added bonus of producing a warning in case of HMM conflicts
      with devices *or* RAM address space, as opposed to the RAM-
      only warnings that were there previously.
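
      A hedged sketch of the check added to __request_region() (the
      message text is approximate):

        if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY)
                pr_debug("HMM reservation %pR conflicts with %pR\n",
                         conflict, res);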
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: linux-nvdimm@lists.01.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
    • mm/resource: Return real error codes from walk failures · 88e75600
      Committed by Dave Hansen
      commit 5cd401ace914dc68556c6d2fcae0c349444d5f86 upstream
      
      walk_system_ram_range() can return an error code either because
      *it* failed, or because the 'func' that it calls returned an
      error.  The memory hotplug code does the following:
      
      	ret = walk_system_ram_range(..., func);
              if (ret)
      		return ret;
      
      and 'ret' makes it out to userspace, eventually.  The problem
      is, walk_system_ram_range() failures that result from *it* failing
      (as opposed to 'func') return -1.  That leads to a very odd
      -EPERM (-1) return code out to userspace.
      
      Make walk_system_ram_range() return -EINVAL for internal
      failures to keep userspace less confused.
      
      This return code is compatible with all the callers that I
      audited.
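
      The fix itself is a one-liner in walk_system_ram_range():

        /* was: int ret = -1;  (leaked to userspace as -EPERM) */
        int ret = -EINVAL;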
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: linux-nvdimm@lists.01.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
    • kernel, resource: check for IORESOURCE_SYSRAM in release_mem_region_adjustable · be4a8d62
      Committed by Oscar Salvador
      commit 65c78784135f847e49eb98e6b976e453e71100c3 upstream
      
      This is a preparation for the next patch.
      
      Currently, we only call release_mem_region_adjustable() in __remove_pages
      if the zone is not ZONE_DEVICE, because resources that belong to HMM/devm
      are being released by themselves with devm_release_mem_region.
      
      Since we do not want to touch any zone/page stuff during the removing of
      the memory (but during the offlining), we do not want to check for the
      zone here.  So we need another way to tell release_mem_region_adjustable()
      not to release the resource in case it belongs to HMM/devm.
      
      HMM/devm acquires/releases a resource through
      devm_request_mem_region/devm_release_mem_region.
      
      These resources have the flag IORESOURCE_MEM, while resources acquired by
      hot-add memory path (register_memory_resource()) contain
      IORESOURCE_SYSTEM_RAM.
      
      So, we can check for this flag in release_mem_region_adjustable(),
      and if the resource does not contain such a flag, we know that we
      are dealing with a HMM/devm resource and can back off.
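
      A sketch of the back-off check in release_mem_region_adjustable()
      (shape per the upstream commit):

        /* System RAM from hot-add carries IORESOURCE_SYSRAM; anything
         * else here came from devm_request_mem_region(), so back off
         * and let devm_release_mem_region() handle it */
        if (!(res->flags & IORESOURCE_SYSRAM)) {
                ret = 0;
                break;
        }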
      
      Link: http://lkml.kernel.org/r/20181127162005.15833-3-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
    • resource: Clean it up a bit · 87e09e0a
      Committed by Borislav Petkov
      commit b69c2e20f6e4046da84ce5b33ba1ef89cb087b40 upstream
      
      - Drop BUG_ON()s and do normal error handling instead, in
        find_next_iomem_res().
      
      - Align function arguments on opening braces.
      
      - Get rid of local var sibling_only in find_next_iomem_res().
      
      - Shorten unnecessarily long first_level_children_only arg name.
      Signed-off-by: Borislav Petkov <bp@suse.de>
      CC: Andrew Morton <akpm@linux-foundation.org>
      CC: Bjorn Helgaas <bhelgaas@google.com>
      CC: Brijesh Singh <brijesh.singh@amd.com>
      CC: Dan Williams <dan.j.williams@intel.com>
      CC: H. Peter Anvin <hpa@zytor.com>
      CC: Lianbo Jiang <lijiang@redhat.com>
      CC: Takashi Iwai <tiwai@suse.de>
      CC: Thomas Gleixner <tglx@linutronix.de>
      CC: Tom Lendacky <thomas.lendacky@amd.com>
      CC: Vivek Goyal <vgoyal@redhat.com>
      CC: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      CC: bhe@redhat.com
      CC: dan.j.williams@intel.com
      CC: dyoung@redhat.com
      CC: kexec@lists.infradead.org
      CC: mingo@redhat.com
      Link: <new submission>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>