1. 21 October 2019, 4 commits
  2. 18 October 2019, 1 commit
    • iomap: iomap that extends beyond EOF should be marked dirty · 7684e2c4
      Committed by Dave Chinner
      When doing a direct IO that spans the current EOF, and there are
      written blocks beyond EOF that extend beyond the current write, the
      only metadata update that needs to be done is a file size extension.
      
      However, we don't mark such iomaps as IOMAP_F_DIRTY to indicate that
      there are IO completion metadata updates required, and hence we may
      fail to correctly sync file size extensions made in IO completion
      when O_DSYNC writes are being used and the hardware supports FUA.
      
      Hence when setting IOMAP_F_DIRTY, we need to also take into account
      whether the iomap spans the current EOF. If it does, then we need to
      mark it dirty so that IO completion will call generic_write_sync()
      to flush the inode size update to stable storage correctly.
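
      A minimal sketch of the idea, assuming an iomap_begin-style context
      where 'offset', 'length' and 'inode' are in scope (names illustrative,
      not the exact upstream diff):

        /* an iomap reaching past EOF implies a size update at I/O
         * completion, so it must be treated as dirty */
        if (offset + length > i_size_read(inode))
                iomap->flags |= IOMAP_F_DIRTY;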
      
      Fixes: 3460cac1 ("iomap: Use FUA for pure data O_DSYNC DIO writes")
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      [darrick: removed the ext4 part; they'll handle it separately]
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      7684e2c4
  3. 15 October 2019, 1 commit
  4. 12 October 2019, 1 commit
  5. 11 October 2019, 1 commit
    • SUNRPC: fix race to sk_err after xs_error_report · af84537d
      Committed by Benjamin Coddington
      Since commit 4f8943f8 ("SUNRPC: Replace direct task wakeups from
      softirq context") there has been a race to the value of the sk_err if both
      XPRT_SOCK_WAKE_ERROR and XPRT_SOCK_WAKE_DISCONNECT are set.  In that case,
      we may end up losing the sk_err value that existed when xs_error_report was
      called.
      
      Fix this by reverting to the previous behavior: instead of using SO_ERROR
      to retrieve the value at a later time (which might also return sk_err_soft),
      copy the sk_err value onto struct sock_xprt, and use that value to wake
      pending tasks.
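
      A hedged sketch of the approach (the field name on struct sock_xprt is
      illustrative here):

        /* in the sk->sk_error_report callback: latch the error now ... */
        transport->xprt_err = -sk->sk_err;
        set_bit(XPRT_SOCK_WAKE_ERROR, &transport->sock_state);

        /* ... and when handling XPRT_SOCK_WAKE_ERROR, wake tasks with it
         * instead of re-reading SO_ERROR (which may have changed): */
        xprt_wake_pending_tasks(xprt, transport->xprt_err);
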
      Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
      Fixes: 4f8943f8 ("SUNRPC: Replace direct task wakeups from softirq context")
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      af84537d
  6. 09 October 2019, 1 commit
  7. 08 October 2019, 9 commits
    • lib/string: Make memzero_explicit() inline instead of external · bec50077
      Committed by Arvind Sankar
      With the use of the barrier implied by barrier_data(), there is no need
      for memzero_explicit() to be extern. Making it inline saves the overhead
      of a function call, and allows the code to be reused in arch/*/purgatory
      without having to duplicate the implementation.
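
      For reference, the inline form is essentially this (sketch of the
      include/linux/string.h definition):

        static inline void memzero_explicit(void *s, size_t count)
        {
                memset(s, 0, count);
                barrier_data(s);
        }
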
      Tested-by: Hans de Goede <hdegoede@redhat.com>
      Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
      Reviewed-by: Hans de Goede <hdegoede@redhat.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H . Peter Anvin <hpa@zytor.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephan Mueller <smueller@chronox.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-crypto@vger.kernel.org
      Cc: linux-s390@vger.kernel.org
      Fixes: 906a4bb9 ("crypto: sha256 - Use get/put_unaligned_be32 to get input, memzero_explicit")
      Link: https://lkml.kernel.org/r/20191007220000.GA408752@rani.riverdale.lan
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      bec50077
    • mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two) · 59bb4798
      Committed by Vlastimil Babka
      In most configurations, kmalloc() happens to return naturally aligned
      (i.e.  aligned to the block size itself) blocks for power of two sizes.
      
      That means some kmalloc() users might unknowingly rely on that
      alignment, until stuff breaks when the kernel is built with e.g.
      CONFIG_SLUB_DEBUG or CONFIG_SLOB, and blocks stop being aligned.  Then
      developers have to devise workarounds such as their own kmem caches with
      specified alignment [1], which is not always practical, as recently
      evidenced in [2].
      
      The topic has been discussed at LSF/MM 2019 [3].  Adding a
      'kmalloc_aligned()' variant would not help with code unknowingly relying
      on the implicit alignment.  For slab implementations it would either
      require creating more kmalloc caches, or allocating a larger size and
      only giving back part of it.  That would be wasteful, especially with a
      generic alignment parameter (in contrast with a fixed alignment to size).
      
      Ideally we should provide mm users with what they need without difficult
      workarounds or own reimplementations, so let's make the kmalloc()
      alignment to size explicitly guaranteed for power-of-two sizes under all
      configurations.  What does this mean for the three available allocators?
      
      * SLAB object layout happens to be mostly unchanged by the patch.  The
        implicitly provided alignment could be compromised with
        CONFIG_DEBUG_SLAB due to redzoning, however SLAB disables redzoning for
        caches with alignment larger than unsigned long long.  Practically on at
        least x86 this includes kmalloc caches as they use cache line alignment,
        which is larger than that.  Still, this patch ensures alignment on all
        arches and cache sizes.
      
      * SLUB layout is also unchanged unless redzoning is enabled through
        CONFIG_SLUB_DEBUG and boot parameter for the particular kmalloc cache.
        With this patch, explicit alignment is guaranteed with redzoning as
        well.  This will result in more memory being wasted, but that should be
        acceptable in a debugging scenario.
      
      * SLOB has no implicit alignment so this patch adds it explicitly for
        kmalloc().  The potential downside is increased fragmentation.  While
        pathological allocation scenarios are certainly possible, in my testing,
        after booting an x86_64 kernel+userspace with virtme, around 16MB memory
        was consumed by slab pages both before and after the patch, with the
        difference in the noise.
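
      As a usage illustration (hypothetical caller), code needing a naturally
      aligned power-of-two buffer can now rely on the guarantee under every
      allocator instead of creating a dedicated kmem cache:

        void *buf = kmalloc(4096, GFP_KERNEL);

        if (buf)  /* guaranteed for power-of-two sizes after this patch */
                WARN_ON(!IS_ALIGNED((unsigned long)buf, 4096));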
      
      [1] https://lore.kernel.org/linux-btrfs/c3157c8e8e0e7588312b40c853f65c02fe6c957a.1566399731.git.christophe.leroy@c-s.fr/
      [2] https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming.lei@redhat.com/
      [3] https://lwn.net/Articles/787740/
      
      [akpm@linux-foundation.org: documentation fixlet, per Matthew]
      Link: http://lkml.kernel.org/r/20190826111627.7505-3-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: "Darrick J . Wong" <darrick.wong@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      59bb4798
    • mm, memcg: make scan aggression always exclude protection · 1bc63fb1
      Committed by Chris Down
      This patch is an incremental improvement on the existing
      memory.{low,min} relative reclaim work to base its scan pressure
      calculations on how much protection is available compared to the current
      usage, rather than how much the current usage is over some protection
      threshold.
      
      In the normal case, this doesn't change the user experience much.  One
      benefit is that it replaces the (somewhat arbitrary) 100% cutoff with an
      indefinite slope, which makes it easier to ballpark a memory.low value.
      
      As well as this, the old methodology doesn't quite apply generically to
      machines with varying amounts of physical memory.  Let's say we have a
      top level cgroup, workload.slice, and another top level cgroup,
      system-management.slice.  We want to roughly give 12G to
      system-management.slice, so on a 32GB machine we set memory.low to 20GB
      in workload.slice, and on a 64GB machine we set memory.low to 52GB.
      However, because these are amounts relative to the total machine size,
      while the amount of memory we are generally willing to yield to
      system-management.slice is absolute (12G), we end up putting more
      pressure on system-management.slice just because we have a larger
      machine and a larger workload to fill it, which seems fairly
      unintuitive.  With this new behaviour, we don't end up with this
      unintended side effect.
      
      Previously, the way that memory.low protection worked was that if you
      were 50% over a certain baseline, you got 50% of your normal scan pressure.
      This is certainly better than the previous cliff-edge behaviour, but it
      can be improved even further by always considering memory under the
      currently enforced protection threshold to be out of bounds.  This means
      that we can set relatively low memory.low thresholds for variable or
      bursty workloads while still getting a reasonable level of protection,
      whereas with the previous version we may still trivially hit the 100%
      clamp.  The previous 100% clamp is also somewhat arbitrary, whereas this
      one is more concretely based on the currently enforced protection
      threshold, which is likely easier to reason about.
      
      There is also a subtle issue with the way that proportional reclaim
      worked previously -- it promotes having no memory.low, since it makes
      pressure higher during low reclaim.  This happens because we base our
      scan pressure modulation on how far memory.current is between memory.min
      and memory.low, but if memory.low is unset, we only use the overage
      method.  In most cromulent configurations, this then means that we end
      up with *more* pressure than with no memory.low at all when we're in low
      reclaim, which is not really very usable or expected.
      
      With this patch, memory.low and memory.min affect reclaim pressure in a
      more understandable and composable way.  For example, from a user
      standpoint, "protected" memory now remains untouchable from a reclaim
      aggression standpoint, and users can also have more confidence that
      bursty workloads will still receive some amount of guaranteed
      protection.
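
      A hedged sketch of the new scaling (variable names illustrative, not the
      literal get_scan_count() code): memory under the enforced protection
      contributes no scan pressure at all:

        /* protection = enforced memory.min/low, cgroup_size = current usage;
         * the real code additionally clamps the result to sane bounds */
        scan = lruvec_size - lruvec_size * protection / (cgroup_size + 1);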
      
      Link: http://lkml.kernel.org/r/20190322160307.GA3316@chrisdown.name
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1bc63fb1
    • mm, memcg: make memory.emin the baseline for utilisation determination · 9de7ca46
      Committed by Chris Down
      Roman points out that when we do the low reclaim pass, we scale the
      reclaim pressure relative to position between 0 and the maximum
      protection threshold.
      
      However, if the maximum protection is based on memory.elow, and
      memory.emin is above zero, this means we still may get binary behaviour
      on second-pass low reclaim.  This is because we scale starting at 0, not
      starting at memory.emin, and since we don't scan at all below emin, we
      end up with cliff behaviour.
      
      This should be a fairly uncommon case since usually we don't go into the
      second pass, but it makes sense to scale our low reclaim pressure
      starting at emin.
      
      You can test this by catting two large sparse files, one in a cgroup
      with emin set to some moderate size compared to physical RAM, and
      another cgroup without any emin.  In both cgroups, set an elow larger
      than 50% of physical RAM.  The one with emin will have less page
      scanning, as reclaim pressure is lower.
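
      In other words, conceptually (a sketch of the relationship described
      above, not the literal code):

        low-reclaim scan pressure ∝ (usage - emin) / (elow - emin),  rather than  usage / elow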
      
      Rebase on top of and apply the same idea as what was applied to handle
      cgroup_disable=memory properly for the original proportional patch
      http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name ("mm,
      memcg: Handle cgroup_disable=memory when getting memcg protection").
      
      Link: http://lkml.kernel.org/r/20190201051810.GA18895@chrisdown.name
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Suggested-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9de7ca46
    • mm, memcg: proportional memory.{low,min} reclaim · 9783aa99
      Committed by Chris Down
      cgroup v2 introduces two memory protection thresholds: memory.low
      (best-effort) and memory.min (hard protection).  While they generally do
      what they say on the tin, there is a limitation in their implementation
      that makes them difficult to use effectively: that cliff behaviour often
      manifests when they become eligible for reclaim.  This patch implements
      more intuitive and usable behaviour, where we gradually mount more
      reclaim pressure as cgroups further and further exceed their protection
      thresholds.
      
      This cliff edge behaviour happens because we only choose whether or not
      to reclaim based on whether the memcg is within its protection limits
      (see the use of mem_cgroup_protected in shrink_node), but we don't vary
      our reclaim behaviour based on this information.  Imagine the following
      timeline, with the numbers the lruvec size in this zone:
      
      1. memory.low=1000000, memory.current=999999. 0 pages may be scanned.
      2. memory.low=1000000, memory.current=1000000. 0 pages may be scanned.
      3. memory.low=1000000, memory.current=1000001. 1000001* pages may be
         scanned. (?!)
      
      * Of course, we won't usually scan all available pages in the zone even
        without this patch because of scan control priority, over-reclaim
        protection, etc.  However, as shown by the tests at the end, these
        techniques don't sufficiently throttle such an extreme change in input,
        so cliff-like behaviour isn't really averted by their existence alone.
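
      Conceptually, the proportional behaviour looks like this (a hedged
      sketch with illustrative names, not the exact get_scan_count() code):

        /* scale scanning with the overage instead of the 0 -> everything
         * jump shown in the timeline above */
        if (usage <= protection)
                scan = 0;
        else
                scan = min(lruvec_size,
                           lruvec_size * (usage - protection) / protection);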
      
      Here's an example of how this plays out in practice.  At Facebook, we are
      trying to protect various workloads from "system" software, like
      configuration management tools, metric collectors, etc (see this[0] case
      study).  In order to find a suitable memory.low value, we start by
      determining the expected memory range within which the workload will be
      comfortable operating.  This isn't an exact science -- memory usage deemed
      "comfortable" will vary over time due to user behaviour, differences in
      composition of work, etc, etc.  As such we need to ballpark memory.low,
      but doing this is currently problematic:
      
      1. If we end up setting it too low for the workload, it won't have
         *any* effect (see discussion above).  The group will receive the full
         weight of reclaim and won't have any priority while competing with the
         less important system software, as if we had no memory.low configured
         at all.
      
      2. Because of this behaviour, we end up erring on the side of setting
         it too high, such that the comfort range is reliably covered.  However,
         protected memory is completely unavailable to the rest of the system,
         so we might cause undue memory and IO pressure there when we *know* we
         have some elasticity in the workload.
      
      3. Even if we get the value totally right, smack in the middle of the
         comfort zone, we get extreme jumps between no pressure and full
         pressure that cause unpredictable pressure spikes in the workload due
         to the current binary reclaim behaviour.
      
      With this patch, we can set it to our ballpark estimation without too much
      worry.  Any undesirable behaviour, such as too much or too little reclaim
      pressure on the workload or system will be proportional to how far our
      estimation is off.  This means we can set memory.low much more
      conservatively and thus waste less resources *without* the risk of the
      workload falling off a cliff if we overshoot.
      
      As a more abstract technical description, this unintuitive behaviour
      results in having to give high-priority workloads a large protection
      buffer on top of their expected usage to function reliably, as otherwise
      we have abrupt periods of dramatically increased memory pressure which
      hamper performance.  Having to set these thresholds so high wastes
      resources and generally works against the principle of work conservation.
      In addition, having proportional memory reclaim behaviour has other
      benefits.  Most notably, before this patch it's basically mandatory to set
      memory.low to a higher than desirable value because otherwise as soon as
      you exceed memory.low, all protection is lost, and all pages are eligible
      to scan again.  By contrast, having a gradual ramp in reclaim pressure
      means that you now still get some protection when thresholds are exceeded,
      which means that one can now be more comfortable setting memory.low to
      lower values without worrying that all protection will be lost.  This is
      important because workingset size is really hard to know exactly,
      especially with variable workloads, so at least getting *some* protection
      if your workingset size grows larger than you expect increases user
      confidence in setting memory.low without a huge buffer on top being
      needed.
      
      Thanks a lot to Johannes Weiner and Tejun Heo for their advice and
      assistance in thinking about how to make this work better.
      
      In testing these changes, I intended to verify that:
      
      1. Changes in page scanning become gradual and proportional instead of
         binary.
      
         To test this, I experimented stepping further and further down
         memory.low protection on a workload that floats around 19G workingset
         when under memory.low protection, watching page scan rates for the
         workload cgroup:
      
         +------------+-----------------+--------------------+--------------+
         | memory.low | test (pgscan/s) | control (pgscan/s) | % of control |
         +------------+-----------------+--------------------+--------------+
         |        21G |               0 |                  0 | N/A          |
         |        17G |             867 |               3799 | 23%          |
         |        12G |            1203 |               3543 | 34%          |
         |         8G |            2534 |               3979 | 64%          |
         |         4G |            3980 |               4147 | 96%          |
         |          0 |            3799 |               3980 | 95%          |
         +------------+-----------------+--------------------+--------------+
      
         As you can see, the test kernel (with a kernel containing this
         patch) ramps up page scanning significantly more gradually than the
         control kernel (without this patch).
      
      2. More gradual ramp up in reclaim aggression doesn't result in
         premature OOMs.
      
         To test this, I wrote a script that slowly increments the number of
         pages held by stress(1)'s --vm-keep mode until a production system
         entered severe overall memory contention.  This script runs in a highly
         protected slice taking up the majority of available system memory.
         Watching vmstat revealed that page scanning continued essentially
         nominally between test and control, without causing forward reclaim
         progress to become arrested.
      
      [0]: https://facebookmicrosites.github.io/cgroup2/docs/overview.html#case-study-the-fbtax2-project
      
      [akpm@linux-foundation.org: reflow block comments to fit in 80 cols]
      [chris@chrisdown.name: handle cgroup_disable=memory when getting memcg protection]
        Link: http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name
      Link: http://lkml.kernel.org/r/20190124014455.GA6396@chrisdown.name
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9783aa99
    • memcg: only record foreign writebacks with dirty pages when memcg is not disabled · 08d1d0e6
      Committed by Baoquan He
      In the kdump kernel, memcg is usually disabled with 'cgroup_disable=memory'
      to save memory.  Now the kdump kernel will always panic when dumping
      vmcore to a local disk:
      
        BUG: kernel NULL pointer dereference, address: 0000000000000ab8
        Oops: 0000 [#1] SMP NOPTI
        CPU: 0 PID: 598 Comm: makedumpfile Not tainted 5.3.0+ #26
        Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 10/02/2018
        RIP: 0010:mem_cgroup_track_foreign_dirty_slowpath+0x38/0x140
        Call Trace:
         __set_page_dirty+0x52/0xc0
         iomap_set_page_dirty+0x50/0x90
         iomap_write_end+0x6e/0x270
         iomap_write_actor+0xce/0x170
         iomap_apply+0xba/0x11e
         iomap_file_buffered_write+0x62/0x90
         xfs_file_buffered_aio_write+0xca/0x320 [xfs]
         new_sync_write+0x12d/0x1d0
         vfs_write+0xa5/0x1a0
         ksys_write+0x59/0xd0
         do_syscall_64+0x59/0x1e0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This will corrupt the 1st kernel too if it boots with 'cgroup_disable=memory'.
      
      The trace and some debugging point to commit 97b27821
      ("writeback, memcg: Implement foreign dirty flushing"), which introduced
      this regression.  Disabling memcg causes the NULL pointer dereference on
      uninitialized data in mem_cgroup_track_foreign_dirty_slowpath().
      
      Fix it by returning immediately when memcg is disabled, instead of trying
      to record the foreign writebacks with dirty pages.
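
      The shape of the fix, roughly (hedged sketch of the inline fast path in
      memcontrol.h):

        static inline void mem_cgroup_track_foreign_dirty(struct page *page,
                                                          struct bdi_writeback *wb)
        {
                if (mem_cgroup_disabled())
                        return;         /* nothing to track, avoid NULL deref */

                if (unlikely(&page->mem_cgroup->css != wb->memcg_css))
                        mem_cgroup_track_foreign_dirty_slowpath(page, wb);
        }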
      
      Link: http://lkml.kernel.org/r/20190924141928.GD31919@MiWiFi-R3L-srv
      Fixes: 97b27821 ("writeback, memcg: Implement foreign dirty flushing")
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      08d1d0e6
    • uaccess: implement a proper unsafe_copy_to_user() and switch filldir over to it · c512c691
      Committed by Linus Torvalds
      In commit 9f79b78e ("Convert filldir[64]() from __put_user() to
      unsafe_put_user()") I made filldir() use unsafe_put_user(), which
      improves code generation on x86 enormously.
      
      But because we didn't have an "unsafe_copy_to_user()", the dirent name
      copy was also done by hand with unsafe_put_user() in a loop, and it
      turns out that a lot of other architectures didn't like that, because
      unlike x86, they have various alignment issues.
      
      Most non-x86 architectures trap and fix it up, and some (like xtensa)
      will just fail unaligned put_user() accesses unconditionally.  Which
      makes that "copy using put_user() in a loop" not work for them at all.
      
      I could make that code do explicit alignment etc, but the architectures
      that don't like unaligned accesses also don't really use the fancy
      "user_access_begin/end()" model, so they might just use the regular old
      __copy_to_user() interface.
      
      So this commit takes that looping implementation, turns it into the x86
      version of "unsafe_copy_to_user()", and makes other architectures
      implement the unsafe copy version as __copy_to_user() (the same way they
      do for the other unsafe_xyz() accessor functions).
      
      Note that it only does this for the copying _to_ user space, and we
      still don't have an unsafe version of copy_from_user().
      
      That's partly because we have no current users of it, but also partly
      because the copy_from_user() case is slightly different and cannot
      efficiently be implemented in terms of an unsafe_get_user() loop (because
      gcc can't do asm goto with outputs).
      
      It would be trivial to do this using "rep movsb", which would work
      really nicely on newer x86 cores, but really badly on some older ones.
      
      Al Viro is looking at cleaning up all our user copy routines to make
      this all a non-issue, but for now we have this simple-but-stupid version
      for x86 that works fine for the dirent name copy case because those
      names are short strings and we simply don't need anything fancier.
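
      For architectures that just use the regular interface, the generic
      fallback boils down to something like this (hedged sketch):

        #define unsafe_copy_to_user(d, s, l, e) \
                unsafe_op_wrap(__copy_to_user(d, s, l), e)

        /* where unsafe_op_wrap() jumps to the error label on failure:
         *   #define unsafe_op_wrap(op, label) \
         *           do { if (unlikely(op)) goto label; } while (0)
         */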
      
      Fixes: 9f79b78e ("Convert filldir[64]() from __put_user() to unsafe_put_user()")
      Reported-by: Guenter Roeck <linux@roeck-us.net>
      Reported-and-tested-by: Tony Luck <tony.luck@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c512c691
    • module: rename __kstrtab_ns_* to __kstrtabns_* to avoid symbol conflict · fa6643cd
      Committed by Masahiro Yamada
      The module namespace produces __kstrtab_ns_<sym> symbols to store
      namespace strings, but it does not guarantee name uniqueness.
      This is a potential problem because we have exported symbols starting
      with "ns_".
      
      For example, kernel/capability.c exports the following symbols:
      
        EXPORT_SYMBOL(ns_capable);
        EXPORT_SYMBOL(capable);
      
      Assume a situation where those are converted as follows:
      
        EXPORT_SYMBOL_NS(ns_capable, some_namespace);
        EXPORT_SYMBOL_NS(capable, some_namespace);
      
      The former expands to "__kstrtab_ns_capable" and "__kstrtab_ns_ns_capable",
      and the latter to "__kstrtab_capable" and "__kstrtab_ns_capable".
      Then, we have the duplicated "__kstrtab_ns_capable".
      
      To ensure the uniqueness, rename "__kstrtab_ns_*" to "__kstrtabns_*".
      Reviewed-by: Matthias Maennich <maennich@google.com>
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Signed-off-by: Jessica Yu <jeyu@kernel.org>
      fa6643cd
    • module: swap the order of symbol.namespace · bf70b050
      Committed by Masahiro Yamada
      Currently, EXPORT_SYMBOL_NS(_GPL) constructs the kernel symbol as
      follows:
      
        __ksymtab_SYMBOL.NAMESPACE
      
      The sym_extract_namespace() in modpost allocates memory for the
      SYMBOL.NAMESPACE part when it contains a '.'. One problem is that the
      pointer returned by strdup() is lost because the symbol name will be
      copied to malloc'ed memory by alloc_symbol(); nothing keeps track of
      the strdup'ed pointer.
      
      sym->namespace still points to the NAMESPACE part, so you can free it,
      but only with complicated code like this:
      
         free(sym->namespace - strlen(sym->name) - 1);
      
      This makes freeing the memory awkward.
      
      To fix it elegantly, I swapped the order of the symbol and the
      namespace as follows:
      
        __ksymtab_NAMESPACE.SYMBOL
      
      then, simplified sym_extract_namespace() so that it allocates memory
      only for the NAMESPACE part.
      
      I prefer this order because it is intuitive and also matches major
      languages. For example, NAMESPACE::NAME in C++, MODULE.NAME in Python.
      Reviewed-by: Matthias Maennich <maennich@google.com>
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Signed-off-by: Jessica Yu <jeyu@kernel.org>
      bf70b050
  8. 07 October 2019, 3 commits
  9. 05 October 2019, 3 commits
  10. 03 October 2019, 1 commit
    • net: dsa: sja1105: Fix sleeping while atomic in .port_hwtstamp_set · 3e8db7e5
      Committed by Vladimir Oltean
      Currently this stack trace can be seen with CONFIG_DEBUG_ATOMIC_SLEEP=y:
      
      [   41.568348] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:909
      [   41.576757] in_atomic(): 1, irqs_disabled(): 0, pid: 208, name: ptp4l
      [   41.583212] INFO: lockdep is turned off.
      [   41.587123] CPU: 1 PID: 208 Comm: ptp4l Not tainted 5.3.0-rc6-01445-ge950f2d4bc7f-dirty #1827
      [   41.599873] [<c0313d7c>] (unwind_backtrace) from [<c030e13c>] (show_stack+0x10/0x14)
      [   41.607584] [<c030e13c>] (show_stack) from [<c1212d50>] (dump_stack+0xd4/0x100)
      [   41.614863] [<c1212d50>] (dump_stack) from [<c037dfc8>] (___might_sleep+0x1c8/0x2b4)
      [   41.622574] [<c037dfc8>] (___might_sleep) from [<c122ea90>] (__mutex_lock+0x48/0xab8)
      [   41.630368] [<c122ea90>] (__mutex_lock) from [<c122f51c>] (mutex_lock_nested+0x1c/0x24)
      [   41.638340] [<c122f51c>] (mutex_lock_nested) from [<c0c6fe08>] (sja1105_static_config_reload+0x30/0x27c)
      [   41.647779] [<c0c6fe08>] (sja1105_static_config_reload) from [<c0c7015c>] (sja1105_hwtstamp_set+0x108/0x1cc)
      [   41.657562] [<c0c7015c>] (sja1105_hwtstamp_set) from [<c0feb650>] (dev_ifsioc+0x18c/0x330)
      [   41.665788] [<c0feb650>] (dev_ifsioc) from [<c0febbd8>] (dev_ioctl+0x320/0x6e8)
      [   41.673064] [<c0febbd8>] (dev_ioctl) from [<c0f8b1f4>] (sock_ioctl+0x334/0x5e8)
      [   41.680340] [<c0f8b1f4>] (sock_ioctl) from [<c05404a8>] (do_vfs_ioctl+0xb0/0xa10)
      [   41.687789] [<c05404a8>] (do_vfs_ioctl) from [<c0540e3c>] (ksys_ioctl+0x34/0x58)
      [   41.695151] [<c0540e3c>] (ksys_ioctl) from [<c0301000>] (ret_fast_syscall+0x0/0x28)
      [   41.702768] Exception stack(0xe8495fa8 to 0xe8495ff0)
      [   41.707796] 5fa0:                   beff4a8c 00000001 00000011 000089b0 beff4a8c beff4a80
      [   41.715933] 5fc0: beff4a8c 00000001 0000000c 00000036 b6fa98c8 004e19c1 00000001 00000000
      [   41.724069] 5fe0: 004dcedc beff4a6c 004c0738 b6e7af4c
      [   41.729860] BUG: scheduling while atomic: ptp4l/208/0x00000002
      [   41.735682] INFO: lockdep is turned off.
      
      Enabling RX timestamping will logically disturb the fastpath (processing
      of meta frames). Replace bool hwts_rx_en with a bit that is checked
      atomically from the fastpath and temporarily unset from the sleepable
      context during a change of the RX timestamping process (a destructive
      operation anyway, as it requires a switch reset).
      If the bit is found unset, the fastpath (net/dsa/tag_sja1105.c) will just
      drop any received meta frame and not take the meta_lock at all.
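
      A hedged sketch of the resulting fastpath check (bit and field names are
      illustrative):

        /* in the net/dsa/tag_sja1105.c receive path, conceptually: */
        if (!test_bit(SJA1105_HWTS_RX_EN, &tagger_data->state))
                return NULL;    /* drop the meta frame (caller frees it),
                                   never touching meta_lock */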
      
      Fixes: a602afd2 ("net: dsa: sja1105: Expose PTP timestamping ioctls to userspace")
      Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3e8db7e5
  11. 02 October 2019, 2 commits
  12. 01 October 2019, 2 commits
    • lib: introduce copy_struct_from_user() helper · f5a1a536
      Committed by Aleksa Sarai
      A common pattern for syscall extensions is increasing the size of a
      struct passed from userspace, such that the zero-value of the new fields
      results in the old kernel behaviour (allowing for a mix of userspace and
      kernel vintages to operate on one another in most cases).
      
      While this interface exists for communication in both directions, only
      one interface is straightforward to have reasonable semantics for
      (userspace passing a struct to the kernel). For kernel returns to
      userspace, what the correct semantics are (whether there should be an
      error if userspace is unaware of a new extension) is very
      syscall-dependent and thus probably cannot be unified between syscalls
      (a good example of this problem is [1]).
      
      Previously there was no common lib/ function that implemented
      the necessary extension-checking semantics (and different syscalls
      implemented them slightly differently or incompletely[2]). Future
      patches replace common uses of this pattern to make use of
      copy_struct_from_user().
      
      In-kernel selftests ensure that alignment and various byte patterns are
      all handled identically to memchr_inv() usage.
      
      [1]: commit 1251201c ("sched/core: Fix uclamp ABI bug, clean up and
           robustify sched_read_attr() ABI logic and code")
      
      [2]: For instance {sched_setattr,perf_event_open,clone3}(2) all do
           similar checks to copy_struct_from_user() while rt_sigprocmask(2)
           always rejects differently-sized struct arguments.
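
      A hedged usage sketch in a hypothetical extensible syscall (the struct
      and syscall names are made up; copy_struct_from_user() zero-fills fields
      the caller didn't pass and rejects unknown non-zero tail bytes with -E2BIG):

        struct my_args {                /* hypothetical, versioned by size */
                __u64 flags;
                __u64 newly_added_field;
        };

        SYSCALL_DEFINE2(my_syscall, struct my_args __user *, uargs,
                        size_t, usize)
        {
                struct my_args args;
                int err;

                err = copy_struct_from_user(&args, sizeof(args), uargs, usize);
                if (err)
                        return err;

                /* ... act on args; fields unknown to userspace are zero ... */
                return 0;
        }
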
      Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
      Link: https://lore.kernel.org/r/20191001011055.19283-2-cyphar@cyphar.com
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      f5a1a536
    • kvm: x86, powerpc: do not allow clearing largepages debugfs entry · 833b45de
      Committed by Paolo Bonzini
      The largepages debugfs entry is incremented/decremented as shadow
      pages are created or destroyed.  Clearing it will result in an
      underflow, which is harmless to KVM but ugly (and could be
      misinterpreted by tools that use debugfs information), so make
      this particular statistic read-only.
      
      Cc: kvm-ppc@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      833b45de
  13. 29 September 2019, 2 commits
    • Revert "Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"" · 19deb769
      Committed by David Rientjes
      This reverts commit 92717d42.
      
      Since commit a8282608 ("Revert "mm, thp: restore node-local hugepage
      allocations"") is reverted in this series, it is better to restore the
      previous 5.2 behavior between the thp allocation and the page allocator
      rather than to attempt any consolidation or cleanup for a policy that is
      now reverted.  It's less risky during an rc cycle and subsequent patches
      in this series further modify the same policy that the pre-5.3 behavior
      implements.
      
      Consolidation and cleanup can be done subsequent to a sane default page
      allocation strategy, so this patch reverts a cleanup done on a strategy
      that is now reverted and thus is the least risky option.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      19deb769
    • Revert "Revert "mm, thp: restore node-local hugepage allocations"" · ac79f78d
      Committed by David Rientjes
      This reverts commit a8282608.
      
      The commit references the original intended semantic for MADV_HUGEPAGE
      which has subsequently taken on three unique purposes:
      
       - enables or disables thp for a range of memory depending on the system's
         config (is thp "enabled" set to "always" or "madvise"),
      
       - determines the synchronous compaction behavior for thp allocations at
         fault (is thp "defrag" set to "always", "defer+madvise", or "madvise"),
         and
      
       - reverts a previous MADV_NOHUGEPAGE (there is no madvise mode to only
         clear previous hugepage advice).
      
      These are the three purposes that currently exist in 5.2 and over the
      past several years that userspace has been written around.  Adding a
      NUMA locality preference adds a fourth dimension to an already conflated
      advice mode.
      
      Based on the semantic that MADV_HUGEPAGE has provided over the past
      several years, there exist workloads that use the tunable based on these
      principles: specifically that the allocation should attempt to
      defragment a local node before falling back.  It is agreed that remote
      hugepages typically (but not always) have a better access latency than
      remote native pages, although on Naples this is at parity for
      intersocket.
      
      The revert commit that this patch reverts allows hugepage allocation to
      immediately allocate remotely when local memory is fragmented.  This is
      contrary to the semantic of MADV_HUGEPAGE over the past several years:
      that is, memory compaction should be attempted locally before falling
      back.
      
      The performance degradation of remote hugepages over local hugepages on
      Rome, for example, is 53.5% increased access latency.  For this reason,
      the goal is to revert back to the 5.2 and previous behavior that would
      attempt local defragmentation before falling back.  With the patch that
      is reverted by this patch, we see performance degradations at the tail
      because the allocator happily allocates the remote hugepage rather than
      even attempting to make a local hugepage available.
      
      zone_reclaim_mode is not a solution to this problem since it does not
      only impact hugepage allocations but rather changes the memory
      allocation strategy for *all* page allocations.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ac79f78d
  14. 28 September 2019, 1 commit
    • sk_buff: drop all skb extensions on free and skb scrubbing · 174e2381
      Committed by Florian Westphal
      Now that we have a 3rd extension, add a new helper that drops the
      extension space and use it when we need to scrub an sk_buff.
      
      At this time, scrubbing clears secpath and bridge netfilter data, but
      retains the tc skb extension, after this patch all three get cleared.
      
      NAPI reuse/free assumes we can only have a secpath attached to skb, but
      it seems better to clear all extensions there as well.
      
      v2: add unlikely hint (Eric Dumazet)
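
      The new helper is roughly the following (hedged sketch), and
      skb_scrub_packet() / napi_reuse_skb() call it instead of clearing only
      secpath and nf_bridge state:

        static inline void skb_ext_reset(struct sk_buff *skb)
        {
                if (unlikely(skb->active_extensions)) {
                        __skb_ext_put(skb->extensions);
                        skb->active_extensions = 0;
                }
        }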
      
      Fixes: 95a7233c ("net: openvswitch: Set OvS recirc_id from tc chain index")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      174e2381
  15. 27 September 2019, 1 commit
    • mm: treewide: clarify pgtable_page_{ctor,dtor}() naming · b4ed71f5
      Committed by Mark Rutland
      The naming of pgtable_page_{ctor,dtor}() seems to have confused a few
      people, and until recently arm64 used these erroneously/pointlessly for
      other levels of page table.
      
      To make it incredibly clear that these only apply to the PTE level, and to
      align with the naming of pgtable_pmd_page_{ctor,dtor}(), let's rename them
      to pgtable_pte_page_{ctor,dtor}().
      
      These changes were generated with the following shell script:
      
      ----
      git grep -lw 'pgtable_page_.tor' | while read FILE; do
          sed -i '{s/pgtable_page_ctor/pgtable_pte_page_ctor/}' $FILE;
          sed -i '{s/pgtable_page_dtor/pgtable_pte_page_dtor/}' $FILE;
      done
      ----
      
      ... with the documentation re-flowed to remain under 80 columns, and
      whitespace fixed up in macros to keep backslashes aligned.
      
      There should be no functional change as a result of this patch.
      
      Link: http://lkml.kernel.org/r/20190722141133.3116-1-mark.rutland@arm.com
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b4ed71f5
  16. 26 September 2019, 7 commits
    • mm: introduce MADV_PAGEOUT · 1a4e58cc
      Committed by Minchan Kim
      When a process expects no accesses to a certain memory range for a long
      time, it can hint to the kernel that the pages can be reclaimed instantly
      but that the data should be preserved for future use.  This can reduce
      workingset eviction and so ends up increasing performance.
      
      This patch introduces the new MADV_PAGEOUT hint to the madvise(2) syscall.
      MADV_PAGEOUT can be used by a process to mark a memory range as not
      expected to be used for a long time, so that the kernel reclaims *any LRU*
      pages instantly.  The hint can help the kernel decide which pages to
      evict proactively.
      
      A note: it intentionally doesn't apply the SWAP_CLUSTER_MAX LRU page
      isolation limit because it's automatically bounded by PMD size.  If the
      PMD size (e.g., 256) causes trouble, we could fix it later by limiting it
      to SWAP_CLUSTER_MAX [1].
      
      - man-page material
      
      MADV_PAGEOUT (since Linux x.x)
      
      Do not expect access in the near future, so pages in the specified
      regions can be reclaimed instantly regardless of memory pressure.
      Thus, an access in the range after a successful operation may cause a
      major page fault, but the up-to-date contents are never lost, unlike
      MADV_DONTNEED. Pages belonging to a shared mapping are only processed
      if a write access is allowed for the calling process.
      
      MADV_PAGEOUT cannot be applied to locked pages, Huge TLB pages, or
      VM_PFNMAP pages.
      
      [1] https://lore.kernel.org/lkml/20190710194719.GS29695@dhcp22.suse.cz/
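
      A hedged userspace usage sketch (requires headers that define MADV_PAGEOUT):

        #include <stddef.h>
        #include <stdio.h>
        #include <sys/mman.h>

        static void reclaim_now_keep_contents(void *buf, size_t len)
        {
                /* pages become reclaim candidates immediately; a later access
                 * faults them back in with contents intact, unlike MADV_DONTNEED */
                if (madvise(buf, len, MADV_PAGEOUT) != 0)
                        perror("madvise(MADV_PAGEOUT)");
        }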
      
      [minchan@kernel.org: clear PG_active on MADV_PAGEOUT]
        Link: http://lkml.kernel.org/r/20190802200643.GA181880@google.com
      [akpm@linux-foundation.org: resolve conflicts with hmm.git]
      Link: http://lkml.kernel.org/r/20190726023435.214162-5-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: kbuild test robot <lkp@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1a4e58cc
    • mm: introduce MADV_COLD · 9c276cc6
      Committed by Minchan Kim
      Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.
      
      - Background
      
      The Android terminology used for forking a new process and starting an app
      from scratch is a cold start, while resuming an existing app is a hot
      start.  While we continually try to improve the performance of cold
      starts, hot starts will always be significantly less power hungry as well
      as faster so we are trying to make hot start more likely than cold start.
      
      To increase hot start, Android userspace manages the order that apps
      should be killed in a process called ActivityManagerService.
      ActivityManagerService tracks every Android app or service that the user
      could be interacting with at any time and translates that into a ranked
      list for lmkd(low memory killer daemon).  They are likely to be killed by
      lmkd if the system has to reclaim memory.  In that sense they are similar
      to entries in any other cache.  Those apps are kept alive for
      opportunistic performance improvements but those performance improvements
      will vary based on the memory requirements of individual workloads.
      
      - Problem
      
      Naturally, cached apps were dominant consumers of memory on the system.
      However, they were not significant consumers of swap even though they are
      good candidates for swap.  Under investigation, swapping out only begins
      once the low zone watermark is hit and kswapd wakes up, but the overall
      allocation rate in the system might trip lmkd thresholds and cause a
      cached process to be killed (we measured performance swapping out vs.
      zapping the memory by killing a process; unsurprisingly, zapping is 10x
      faster even though we use zram, which is much faster than real
      storage), so a kill from lmkd will often satisfy the high zone watermark,
      resulting in very few pages actually being moved to swap.
      
      - Approach
      
      The approach we chose was to use a new interface to allow userspace to
      proactively reclaim entire processes by leveraging platform information.
      This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
      that are known to be cold from userspace and to avoid races with lmkd by
      reclaiming apps as soon as they entered the cached state.  Additionally,
      it gives the platform many more chances to use its own information to
      optimize memory efficiency.
      
      To achieve the goal, the patchset introduce two new options for madvise.
      One is MADV_COLD which will deactivate activated pages and the other is
      MADV_PAGEOUT which will reclaim private pages instantly.  These new
      options complement MADV_DONTNEED and MADV_FREE by adding non-destructive
      ways to gain some free memory space.  MADV_PAGEOUT is similar to
      MADV_DONTNEED in a way that it hints the kernel that memory region is not
      currently needed and should be reclaimed immediately; MADV_COLD is similar
      to MADV_FREE in a way that it hints the kernel that memory region is not
      currently needed and should be reclaimed when memory pressure rises.
      
      This patch (of 5):
      
      When a process expects no accesses to a certain memory range, it can hint
      to the kernel that the pages can be reclaimed when memory pressure happens
      but that the data should be preserved for future use.  This can reduce
      workingset eviction and so ends up increasing performance.
      
      This patch introduces the new MADV_COLD hint to the madvise(2) syscall.
      MADV_COLD can be used by a process to mark a memory range as not expected
      to be used in the near future.  The hint can help the kernel decide which
      pages to evict early during memory pressure.
      
      It works on every LRU page, like MADV_[DONTNEED|FREE]. IOW, it moves
      
      	active file page -> inactive file LRU
      	active anon page -> inactive anon LRU
      
      Unlike MADV_FREE, it doesn't move active anonymous pages to the inactive
      file LRU's head because MADV_COLD has slightly different semantics.
      MADV_FREE means it's okay to discard under memory pressure because the
      content of the page is *garbage*, so freeing such pages is almost zero
      overhead: we don't need to swap out, and an access afterward causes only
      a minor fault.  Thus, it makes sense to put those freeable pages on the
      inactive file LRU to compete with other used-once pages.  It also makes
      sense from an implementation point of view, because they are no longer
      swapbacked memory until they are re-dirtied.  It could even be a bonus,
      allowing them to be reclaimed on a swapless system.  However, MADV_COLD
      doesn't mean the contents are garbage, so reclaiming them eventually
      requires swap-out/in, which is a bigger cost.  Since VM LRU aging is
      designed around a cost model, anonymous cold pages are better positioned
      on the inactive anon LRU list, not the file LRU.  Furthermore, it helps
      avoid unnecessary scanning if the system doesn't have a swap device.
      Let's start with the simpler way, without adding complexity at this
      point.  However, keep in mind the caveat that workloads with a lot of
      page cache are likely to see MADV_COLD ignored on anonymous memory,
      because we rarely age anonymous LRU lists.
      
      * man-page material
      
      MADV_COLD (since Linux x.x)
      
      Pages in the specified regions will be treated as less-recently-accessed
      compared to pages in the system with similar access frequencies.  In
      contrast to MADV_FREE, the contents of the region are preserved regardless
      of subsequent writes to pages.
      
      MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
      pages.
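
      A hedged userspace usage sketch (requires headers that define MADV_COLD):

        #include <stddef.h>
        #include <stdio.h>
        #include <sys/mman.h>

        static void mark_range_cold(void *buf, size_t len)
        {
                /* deactivate the pages so they are reclaimed first under future
                 * memory pressure, without discarding their contents */
                if (madvise(buf, len, MADV_COLD) != 0)
                        perror("madvise(MADV_COLD)");
        }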
      
      [akpm@linux-foundation.org: resolve conflicts with hmm.git]
      Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: kbuild test robot <lkp@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9c276cc6
    • kgdb: don't use a notifier to enter kgdb at panic; call directly · 7d92bda2
      Committed by Douglas Anderson
      Right now kgdb/kdb hooks up to debug panics by registering for the panic
      notifier.  This works OK except that it means that kgdb/kdb gets called
      _after_ the CPUs in the system are taken offline.  That means that if
      anything important was happening on those CPUs (like something that might
      have contributed to the panic) you can't debug them.
      
      Specifically I ran into a case where I got a panic because a task was
      "blocked for more than 120 seconds" which was detected on CPU 2.  I nicely
      got shown stack traces in the kernel log for all CPUs including CPU 0,
      which was running 'PID: 111 Comm: kworker/0:1H' and was in the middle of
      __mmc_switch().
      
      I then ended up at the kdb prompt, where I switched over to kgdb to try to
      look at local variables of the process on CPU 0.  I found that I couldn't.
      Digging more, I found that I had no info on any tasks running on CPUs
      other than CPU 2 and that asking kdb for help showed me "Error: no saved
      data for this cpu".  This was because all the CPUs were offline.
      
      Let's move the entry of kdb/kgdb to a direct call from panic() and stop
      using the generic notifier.  Putting a direct call in allows us to order
      things more properly and it also doesn't seem like we're breaking any
      abstractions by calling into the debugger from the panic function.
      
      Daniel said:
      
      : This patch changes the way kdump and kgdb interact with each other.
      : However it would seem rather odd to have both tools simultaneously armed
      : and, even if they were, the user still has the option to use panic_timeout
      : to force a kdump to happen.  Thus I think the change of order is
      : acceptable.
      
      Link: http://lkml.kernel.org/r/20190703170354.217312-1-dianders@chromium.org
      Signed-off-by: Douglas Anderson <dianders@chromium.org>
      Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
      Cc: Jason Wessel <jason.wessel@windriver.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7d92bda2
    • uaccess: add missing __must_check attributes · 9dd819a1
      Committed by Kees Cook
      The usercopy implementation comments describe that callers of the
      copy_*_user() family of functions must always have their return values
      checked.  This can be enforced at compile time with __must_check, so add
      it where needed.
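
      For example, the prototypes end up annotated roughly like this (hedged
      sketch):

        static __always_inline unsigned long __must_check
        copy_from_user(void *to, const void __user *from, unsigned long n);

        static __always_inline unsigned long __must_check
        copy_to_user(void __user *to, const void *from, unsigned long n);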
      
      Link: http://lkml.kernel.org/r/201908251609.ADAD5CAAC1@keescook
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9dd819a1
    • kexec: restore arch_kexec_kernel_image_probe declaration · d5372c39
      Committed by Vasily Gorbik
      arch_kexec_kernel_image_probe function declaration has been removed by
      commit 9ec4ecef ("kexec_file,x86,powerpc: factor out kexec_file_ops
      functions").  Still this function is overridden by couple of architectures
      and proper prototype declaration is therefore important, so bring it back.
      This fixes the following sparse warning on s390:
      arch/s390/kernel/machine_kexec_file.c:333:5: warning: symbol
      'arch_kexec_kernel_image_probe' was not declared.  Should it be static?
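
      The restored prototype looks roughly like this (hedged sketch of the
      include/linux/kexec.h declaration):

        int arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
                                          unsigned long buf_len);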
      
      Link: http://lkml.kernel.org/r/patch.git-ff1c9045ebdc.your-ad-here.call-01564402297-ext-5690@work.hours
      Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
      Acked-by: Dave Young <dyoung@redhat.com>
      Reviewed-by: Bhupesh Sharma <bhsharma@redhat.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d5372c39
    • cpumask: nicer for_each_cpumask_and() signature · 2a4a4082
      Committed by Alexey Dobriyan
      Mask arguments can be swapped without changing anything.  Make the
      argument names reflect that:
      
      	#define for_each_cpu_and(cpu, mask1, mask2)
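
      Usage is unchanged either way (hedged example; my_mask is a hypothetical
      cpumask):

        int cpu;

        for_each_cpu_and(cpu, cpu_online_mask, &my_mask)
                pr_info("cpu %d is online and in my_mask\n", cpu);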
      
      Link: http://lkml.kernel.org/r/20190724183350.GA15041@avx2
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2a4a4082
    • fork: improve error message for corrupted page tables · 8495f7e6
      Committed by Sai Praneeth Prakhya
      When a user process exits, the kernel cleans up the mm_struct of the user
      process, and during cleanup check_mm() checks the page tables of the user
      process for corruption (e.g., unexpected page flags set/cleared).  For
      corrupted page tables, the error message printed by check_mm() isn't very
      clear, as it prints the loop index instead of the page table type (e.g.,
      resident file mapping pages vs resident shared memory pages).  The loop
      index in check_mm() is used to index rss_stat[], which represents
      individual memory type stats.  Hence, instead of printing the index, print
      the memory type, thereby improving the error message.
      
      Without patch:
      --------------
      [  204.836425] mm/pgtable-generic.c:29: bad p4d 0000000089eb4e92(800000025f941467)
      [  204.836544] BUG: Bad rss-counter state mm:00000000f75895ea idx:0 val:2
      [  204.836615] BUG: Bad rss-counter state mm:00000000f75895ea idx:1 val:5
      [  204.836685] BUG: non-zero pgtables_bytes on freeing mm: 20480
      
      With patch:
      -----------
      [   69.815453] mm/pgtable-generic.c:29: bad p4d 0000000084653642(800000025ca37467)
      [   69.815872] BUG: Bad rss-counter state mm:00000000014a6c03 type:MM_FILEPAGES val:2
      [   69.815962] BUG: Bad rss-counter state mm:00000000014a6c03 type:MM_ANONPAGES val:5
      [   69.816050] BUG: non-zero pgtables_bytes on freeing mm: 20480
      
      Also, change print function (from printk(KERN_ALERT, ..) to pr_alert()) so
      that it matches the other print statement.
      
      Link: http://lkml.kernel.org/r/da75b5153f617f4c5739c08ee6ebeb3d19db0fbc.1565123758.git.sai.praneeth.prakhya@intel.com
      Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Suggested-by: Dave Hansen <dave.hansen@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8495f7e6