1. 05 10月, 2019 1 次提交
  2. 04 10月, 2019 1 次提交
  3. 03 10月, 2019 1 次提交
    • V
      net: dsa: Remove unused __DSA_SKB_CB macro · 37048e94
      Vladimir Oltean 提交于
      The struct __dsa_skb_cb is supposed to span the entire 48-byte skb
      control block, while the struct dsa_skb_cb only the portion of it which
      is used by the DSA core (the rest is available as private data to
      drivers).
      
      The DSA_SKB_CB and __DSA_SKB_CB helpers are supposed to help retrieve
      this pointer based on a skb, but it turns out there is nobody directly
      interested in the struct __dsa_skb_cb in the kernel. So remove it.
      Signed-off-by: NVladimir Oltean <olteanv@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      37048e94
  4. 02 10月, 2019 5 次提交
  5. 29 9月, 2019 2 次提交
    • D
      Revert "Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"" · 19deb769
      David Rientjes 提交于
      This reverts commit 92717d42.
      
      Since commit a8282608 ("Revert "mm, thp: restore node-local hugepage
      allocations"") is reverted in this series, it is better to restore the
      previous 5.2 behavior between the thp allocation and the page allocator
      rather than to attempt any consolidation or cleanup for a policy that is
      now reverted.  It's less risky during an rc cycle and subsequent patches
      in this series further modify the same policy that the pre-5.3 behavior
      implements.
      
      Consolidation and cleanup can be done subsequent to a sane default page
      allocation strategy, so this patch reverts a cleanup done on a strategy
      that is now reverted and thus is the least risky option.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      19deb769
    • D
      Revert "Revert "mm, thp: restore node-local hugepage allocations"" · ac79f78d
      David Rientjes 提交于
      This reverts commit a8282608.
      
      The commit references the original intended semantic for MADV_HUGEPAGE
      which has subsequently taken on three unique purposes:
      
       - enables or disables thp for a range of memory depending on the system's
         config (is thp "enabled" set to "always" or "madvise"),
      
       - determines the synchronous compaction behavior for thp allocations at
         fault (is thp "defrag" set to "always", "defer+madvise", or "madvise"),
         and
      
       - reverts a previous MADV_NOHUGEPAGE (there is no madvise mode to only
         clear previous hugepage advice).
      
      These are the three purposes that currently exist in 5.2 and over the
      past several years that userspace has been written around.  Adding a
      NUMA locality preference adds a fourth dimension to an already conflated
      advice mode.
      
      Based on the semantic that MADV_HUGEPAGE has provided over the past
      several years, there exist workloads that use the tunable based on these
      principles: specifically that the allocation should attempt to
      defragment a local node before falling back.  It is agreed that remote
      hugepages typically (but not always) have a better access latency than
      remote native pages, although on Naples this is at parity for
      intersocket.
      
      The revert commit that this patch reverts allows hugepage allocation to
      immediately allocate remotely when local memory is fragmented.  This is
      contrary to the semantic of MADV_HUGEPAGE over the past several years:
      that is, memory compaction should be attempted locally before falling
      back.
      
      The performance degradation of remote hugepages over local hugepages on
      Rome, for example, is 53.5% increased access latency.  For this reason,
      the goal is to revert back to the 5.2 and previous behavior that would
      attempt local defragmentation before falling back.  With the patch that
      is reverted by this patch, we see performance degradations at the tail
      because the allocator happily allocates the remote hugepage rather than
      even attempting to make a local hugepage available.
      
      zone_reclaim_mode is not a solution to this problem since it does not
      only impact hugepage allocations but rather changes the memory
      allocation strategy for *all* page allocations.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ac79f78d
  6. 28 9月, 2019 2 次提交
    • F
      sk_buff: drop all skb extensions on free and skb scrubbing · 174e2381
      Florian Westphal 提交于
      Now that we have a 3rd extension, add a new helper that drops the
      extension space and use it when we need to scrub an sk_buff.
      
      At this time, scrubbing clears secpath and bridge netfilter data, but
      retains the tc skb extension, after this patch all three get cleared.
      
      NAPI reuse/free assumes we can only have a secpath attached to skb, but
      it seems better to clear all extensions there as well.
      
      v2: add unlikely hint (Eric Dumazet)
      
      Fixes: 95a7233c ("net: openvswitch: Set OvS recirc_id from tc chain index")
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      174e2381
    • J
      ptp: correctly disable flags on old ioctls · 2df4de16
      Jacob Keller 提交于
      Commit 41560658 ("PTP: introduce new versions of IOCTLs",
      2019-09-13) introduced new versions of the PTP ioctls which actually
      validate that the flags are acceptable values.
      
      As part of this, it cleared the flags value using a bitwise
      and+negation, in an attempt to prevent the old ioctl from accidentally
      enabling new features.
      
      This is incorrect for a couple of reasons. First, it results in
      accidentally preventing previously working flags on the request ioctl.
      By clearing the "valid" flags, we now no longer allow setting the
      enable, rising edge, or falling edge flags.
      
      Second, if we add new additional flags in the future, they must not be
      set by the old ioctl. (Since the flag wasn't checked before, we could
      potentially break userspace programs which sent garbage flag data.
      
      The correct way to resolve this is to check for and clear all but the
      originally valid flags.
      
      Create defines indicating which flags are correctly checked and
      interpreted by the original ioctls. Use these to clear any bits which
      will not be correctly interpreted by the original ioctls.
      
      In the future, new flags must be added to the VALID_FLAGS macros, but
      *not* to the V1_VALID_FLAGS macros. In this way, new features may be
      exposed over the v2 ioctls, but without breaking previous userspace
      which happened to not clear the flags value properly. The old ioctl will
      continue to behave the same way, while the new ioctl gains the benefit
      of using the flags fields.
      
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Felipe Balbi <felipe.balbi@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Christopher Hall <christopher.s.hall@intel.com>
      Signed-off-by: NJacob Keller <jacob.e.keller@intel.com>
      Acked-by: NRichard Cochran <richardcochran@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2df4de16
  7. 27 9月, 2019 4 次提交
    • E
      tcp: honor SO_PRIORITY in TIME_WAIT state · f6c0f5d2
      Eric Dumazet 提交于
      ctl packets sent on behalf of TIME_WAIT sockets currently
      have a zero skb->priority, which can cause various problems.
      
      In this patch we :
      
      - add a tw_priority field in struct inet_timewait_sock.
      
      - populate it from sk->sk_priority when a TIME_WAIT is created.
      
      - For IPv4, change ip_send_unicast_reply() and its two
        callers to propagate tw_priority correctly.
        ip_send_unicast_reply() no longer changes sk->sk_priority.
      
      - For IPv6, make sure TIME_WAIT sockets pass their tw_priority
        field to tcp_v6_send_response() and tcp_v6_send_ack().
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f6c0f5d2
    • E
      ipv6: add priority parameter to ip6_xmit() · 4f6570d7
      Eric Dumazet 提交于
      Currently, ip6_xmit() sets skb->priority based on sk->sk_priority
      
      This is not desirable for TCP since TCP shares the same ctl socket
      for a given netns. We want to be able to send RST or ACK packets
      with a non zero skb->priority.
      
      This patch has no functional change.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4f6570d7
    • E
      sch_netem: fix rcu splat in netem_enqueue() · 159d2c7d
      Eric Dumazet 提交于
      qdisc_root() use from netem_enqueue() triggers a lockdep warning.
      
      __dev_queue_xmit() uses rcu_read_lock_bh() which is
      not equivalent to rcu_read_lock() + local_bh_disable_bh as far
      as lockdep is concerned.
      
      WARNING: suspicious RCU usage
      5.3.0-rc7+ #0 Not tainted
      -----------------------------
      include/net/sch_generic.h:492 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 2, debug_locks = 1
      3 locks held by syz-executor427/8855:
       #0: 00000000b5525c01 (rcu_read_lock_bh){....}, at: lwtunnel_xmit_redirect include/net/lwtunnel.h:92 [inline]
       #0: 00000000b5525c01 (rcu_read_lock_bh){....}, at: ip_finish_output2+0x2dc/0x2570 net/ipv4/ip_output.c:214
       #1: 00000000b5525c01 (rcu_read_lock_bh){....}, at: __dev_queue_xmit+0x20a/0x3650 net/core/dev.c:3804
       #2: 00000000364bae92 (&(&sch->q.lock)->rlock){+.-.}, at: spin_lock include/linux/spinlock.h:338 [inline]
       #2: 00000000364bae92 (&(&sch->q.lock)->rlock){+.-.}, at: __dev_xmit_skb net/core/dev.c:3502 [inline]
       #2: 00000000364bae92 (&(&sch->q.lock)->rlock){+.-.}, at: __dev_queue_xmit+0x14b8/0x3650 net/core/dev.c:3838
      
      stack backtrace:
      CPU: 0 PID: 8855 Comm: syz-executor427 Not tainted 5.3.0-rc7+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       lockdep_rcu_suspicious+0x153/0x15d kernel/locking/lockdep.c:5357
       qdisc_root include/net/sch_generic.h:492 [inline]
       netem_enqueue+0x1cfb/0x2d80 net/sched/sch_netem.c:479
       __dev_xmit_skb net/core/dev.c:3527 [inline]
       __dev_queue_xmit+0x15d2/0x3650 net/core/dev.c:3838
       dev_queue_xmit+0x18/0x20 net/core/dev.c:3902
       neigh_hh_output include/net/neighbour.h:500 [inline]
       neigh_output include/net/neighbour.h:509 [inline]
       ip_finish_output2+0x1726/0x2570 net/ipv4/ip_output.c:228
       __ip_finish_output net/ipv4/ip_output.c:308 [inline]
       __ip_finish_output+0x5fc/0xb90 net/ipv4/ip_output.c:290
       ip_finish_output+0x38/0x1f0 net/ipv4/ip_output.c:318
       NF_HOOK_COND include/linux/netfilter.h:294 [inline]
       ip_mc_output+0x292/0xf40 net/ipv4/ip_output.c:417
       dst_output include/net/dst.h:436 [inline]
       ip_local_out+0xbb/0x190 net/ipv4/ip_output.c:125
       ip_send_skb+0x42/0xf0 net/ipv4/ip_output.c:1555
       udp_send_skb.isra.0+0x6b2/0x1160 net/ipv4/udp.c:887
       udp_sendmsg+0x1e96/0x2820 net/ipv4/udp.c:1174
       inet_sendmsg+0x9e/0xe0 net/ipv4/af_inet.c:807
       sock_sendmsg_nosec net/socket.c:637 [inline]
       sock_sendmsg+0xd7/0x130 net/socket.c:657
       ___sys_sendmsg+0x3e2/0x920 net/socket.c:2311
       __sys_sendmmsg+0x1bf/0x4d0 net/socket.c:2413
       __do_sys_sendmmsg net/socket.c:2442 [inline]
       __se_sys_sendmmsg net/socket.c:2439 [inline]
       __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2439
       do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      159d2c7d
    • M
      mm: treewide: clarify pgtable_page_{ctor,dtor}() naming · b4ed71f5
      Mark Rutland 提交于
      The naming of pgtable_page_{ctor,dtor}() seems to have confused a few
      people, and until recently arm64 used these erroneously/pointlessly for
      other levels of page table.
      
      To make it incredibly clear that these only apply to the PTE level, and to
      align with the naming of pgtable_pmd_page_{ctor,dtor}(), let's rename them
      to pgtable_pte_page_{ctor,dtor}().
      
      These changes were generated with the following shell script:
      
      ----
      git grep -lw 'pgtable_page_.tor' | while read FILE; do
          sed -i '{s/pgtable_page_ctor/pgtable_pte_page_ctor/}' $FILE;
          sed -i '{s/pgtable_page_dtor/pgtable_pte_page_dtor/}' $FILE;
      done
      ----
      
      ... with the documentation re-flowed to remain under 80 columns, and
      whitespace fixed up in macros to keep backslashes aligned.
      
      There should be no functional change as a result of this patch.
      
      Link: http://lkml.kernel.org/r/20190722141133.3116-1-mark.rutland@arm.comSigned-off-by: NMark Rutland <mark.rutland@arm.com>
      Reviewed-by: NMike Rapoport <rppt@linux.ibm.com>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b4ed71f5
  8. 26 9月, 2019 20 次提交
    • M
      mm: introduce MADV_PAGEOUT · 1a4e58cc
      Minchan Kim 提交于
      When a process expects no accesses to a certain memory range for a long
      time, it could hint kernel that the pages can be reclaimed instantly but
      data should be preserved for future use.  This could reduce workingset
      eviction so it ends up increasing performance.
      
      This patch introduces the new MADV_PAGEOUT hint to madvise(2) syscall.
      MADV_PAGEOUT can be used by a process to mark a memory range as not
      expected to be used for a long time so that kernel reclaims *any LRU*
      pages instantly.  The hint can help kernel in deciding which pages to
      evict proactively.
      
      A note: It doesn't apply SWAP_CLUSTER_MAX LRU page isolation limit
      intentionally because it's automatically bounded by PMD size.  If PMD
      size(e.g., 256) makes some trouble, we could fix it later by limit it to
      SWAP_CLUSTER_MAX[1].
      
      - man-page material
      
      MADV_PAGEOUT (since Linux x.x)
      
      Do not expect access in the near future so pages in the specified
      regions could be reclaimed instantly regardless of memory pressure.
      Thus, access in the range after successful operation could cause
      major page fault but never lose the up-to-date contents unlike
      MADV_DONTNEED. Pages belonging to a shared mapping are only processed
      if a write access is allowed for the calling process.
      
      MADV_PAGEOUT cannot be applied to locked pages, Huge TLB pages, or
      VM_PFNMAP pages.
      
      [1] https://lore.kernel.org/lkml/20190710194719.GS29695@dhcp22.suse.cz/
      
      [minchan@kernel.org: clear PG_active on MADV_PAGEOUT]
        Link: http://lkml.kernel.org/r/20190802200643.GA181880@google.com
      [akpm@linux-foundation.org: resolve conflicts with hmm.git]
      Link: http://lkml.kernel.org/r/20190726023435.214162-5-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1a4e58cc
    • M
      mm: introduce MADV_COLD · 9c276cc6
      Minchan Kim 提交于
      Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.
      
      - Background
      
      The Android terminology used for forking a new process and starting an app
      from scratch is a cold start, while resuming an existing app is a hot
      start.  While we continually try to improve the performance of cold
      starts, hot starts will always be significantly less power hungry as well
      as faster so we are trying to make hot start more likely than cold start.
      
      To increase hot start, Android userspace manages the order that apps
      should be killed in a process called ActivityManagerService.
      ActivityManagerService tracks every Android app or service that the user
      could be interacting with at any time and translates that into a ranked
      list for lmkd(low memory killer daemon).  They are likely to be killed by
      lmkd if the system has to reclaim memory.  In that sense they are similar
      to entries in any other cache.  Those apps are kept alive for
      opportunistic performance improvements but those performance improvements
      will vary based on the memory requirements of individual workloads.
      
      - Problem
      
      Naturally, cached apps were dominant consumers of memory on the system.
      However, they were not significant consumers of swap even though they are
      good candidate for swap.  Under investigation, swapping out only begins
      once the low zone watermark is hit and kswapd wakes up, but the overall
      allocation rate in the system might trip lmkd thresholds and cause a
      cached process to be killed(we measured performance swapping out vs.
      zapping the memory by killing a process.  Unsurprisingly, zapping is 10x
      times faster even though we use zram which is much faster than real
      storage) so kill from lmkd will often satisfy the high zone watermark,
      resulting in very few pages actually being moved to swap.
      
      - Approach
      
      The approach we chose was to use a new interface to allow userspace to
      proactively reclaim entire processes by leveraging platform information.
      This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
      that are known to be cold from userspace and to avoid races with lmkd by
      reclaiming apps as soon as they entered the cached state.  Additionally,
      it could provide many chances for platform to use much information to
      optimize memory efficiency.
      
      To achieve the goal, the patchset introduce two new options for madvise.
      One is MADV_COLD which will deactivate activated pages and the other is
      MADV_PAGEOUT which will reclaim private pages instantly.  These new
      options complement MADV_DONTNEED and MADV_FREE by adding non-destructive
      ways to gain some free memory space.  MADV_PAGEOUT is similar to
      MADV_DONTNEED in a way that it hints the kernel that memory region is not
      currently needed and should be reclaimed immediately; MADV_COLD is similar
      to MADV_FREE in a way that it hints the kernel that memory region is not
      currently needed and should be reclaimed when memory pressure rises.
      
      This patch (of 5):
      
      When a process expects no accesses to a certain memory range, it could
      give a hint to kernel that the pages can be reclaimed when memory pressure
      happens but data should be preserved for future use.  This could reduce
      workingset eviction so it ends up increasing performance.
      
      This patch introduces the new MADV_COLD hint to madvise(2) syscall.
      MADV_COLD can be used by a process to mark a memory range as not expected
      to be used in the near future.  The hint can help kernel in deciding which
      pages to evict early during memory pressure.
      
      It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves
      
      	active file page -> inactive file LRU
      	active anon page -> inacdtive anon LRU
      
      Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file
      LRU's head because MADV_COLD is a little bit different symantic.
      MADV_FREE means it's okay to discard when the memory pressure because the
      content of the page is *garbage* so freeing such pages is almost zero
      overhead since we don't need to swap out and access afterward causes just
      minor fault.  Thus, it would make sense to put those freeable pages in
      inactive file LRU to compete other used-once pages.  It makes sense for
      implmentaion point of view, too because it's not swapbacked memory any
      longer until it would be re-dirtied.  Even, it could give a bonus to make
      them be reclaimed on swapless system.  However, MADV_COLD doesn't mean
      garbage so reclaiming them requires swap-out/in in the end so it's bigger
      cost.  Since we have designed VM LRU aging based on cost-model, anonymous
      cold pages would be better to position inactive anon's LRU list, not file
      LRU.  Furthermore, it would help to avoid unnecessary scanning if system
      doesn't have a swap device.  Let's start simpler way without adding
      complexity at this moment.  However, keep in mind, too that it's a caveat
      that workloads with a lot of pages cache are likely to ignore MADV_COLD on
      anonymous memory because we rarely age anonymous LRU lists.
      
      * man-page material
      
      MADV_COLD (since Linux x.x)
      
      Pages in the specified regions will be treated as less-recently-accessed
      compared to pages in the system with similar access frequencies.  In
      contrast to MADV_FREE, the contents of the region are preserved regardless
      of subsequent writes to pages.
      
      MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
      pages.
      
      [akpm@linux-foundation.org: resolve conflicts with hmm.git]
      Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9c276cc6
    • K
      bug: move WARN_ON() "cut here" into exception handler · a44f71a9
      Kees Cook 提交于
      The original clean up of "cut here" missed the WARN_ON() case (that does
      not have a printk message), which was fixed recently by adding an explicit
      printk of "cut here".  This had the downside of adding a printk() to every
      WARN_ON() caller, which reduces the utility of using an instruction
      exception to streamline the resulting code.  By making this a new BUGFLAG,
      all of these can be removed and "cut here" can be handled by the exception
      handler.
      
      This was very pronounced on PowerPC, but the effect can be seen on x86 as
      well.  The resulting text size of a defconfig build shows some small
      savings from this patch:
      
         text    data     bss     dec     hex filename
      19691167        5134320 1646664 26472151        193eed7 vmlinux.before
      19676362        5134260 1663048 26473670        193f4c6 vmlinux.after
      
      This change also opens the door for creating something like BUG_MSG(),
      where a custom printk() before issuing BUG(), without confusing the "cut
      here" line.
      
      Link: http://lkml.kernel.org/r/201908200943.601DD59DCE@keescook
      Fixes: 6b15f678 ("include/asm-generic/bug.h: fix "cut here" for WARN_ON for __WARN_TAINT architectures")
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Reported-by: NChristophe Leroy <christophe.leroy@c-s.fr>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Drew Davenport <ddavenport@chromium.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a44f71a9
    • K
      bug: consolidate __WARN_FLAGS usage · 2da1ead4
      Kees Cook 提交于
      Instead of having separate tests for __WARN_FLAGS, merge the two #ifdef
      blocks and replace the synonym WANT_WARN_ON_SLOWPATH macro.
      
      Link: http://lkml.kernel.org/r/20190819234111.9019-7-keescook@chromium.orgSigned-off-by: NKees Cook <keescook@chromium.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Drew Davenport <ddavenport@chromium.org>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2da1ead4
    • K
      bug: clean up helper macros to remove __WARN_TAINT() · d4bce140
      Kees Cook 提交于
      In preparation for cleaning up "cut here" even more, this removes the
      __WARN_*TAINT() helpers, as they limit the ability to add new BUGFLAG_*
      flags to call sites.  They are removed by expanding them into full
      __WARN_FLAGS() calls.
      
      Link: http://lkml.kernel.org/r/20190819234111.9019-6-keescook@chromium.orgSigned-off-by: NKees Cook <keescook@chromium.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Drew Davenport <ddavenport@chromium.org>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d4bce140
    • K
      bug: consolidate warn_slowpath_fmt() usage · f2f84b05
      Kees Cook 提交于
      Instead of having a separate helper for no printk output, just consolidate
      the logic into warn_slowpath_fmt().
      
      Link: http://lkml.kernel.org/r/20190819234111.9019-4-keescook@chromium.orgSigned-off-by: NKees Cook <keescook@chromium.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Drew Davenport <ddavenport@chromium.org>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f2f84b05
    • K
      bug: rename __WARN_printf_taint() to __WARN_printf() · 89348fc3
      Kees Cook 提交于
      This just renames the helper to improve readability.
      
      Link: http://lkml.kernel.org/r/20190819234111.9019-3-keescook@chromium.orgSigned-off-by: NKees Cook <keescook@chromium.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Drew Davenport <ddavenport@chromium.org>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      89348fc3
    • K
      bug: refactor away warn_slowpath_fmt_taint() · ee871133
      Kees Cook 提交于
      Patch series "Clean up WARN() "cut here" handling", v2.
      
      Christophe Leroy noticed that the fix for missing "cut here" in the WARN()
      case was adding explicit printk() calls instead of teaching the exception
      handler to add it.  This refactors the bug/warn infrastructure to pass
      this information as a new BUGFLAG.
      
      Longer details repeated from the last patch in the series:
      
      bug: move WARN_ON() "cut here" into exception handler
      
      The original cleanup of "cut here" missed the WARN_ON() case (that does
      not have a printk message), which was fixed recently by adding an explicit
      printk of "cut here".  This had the downside of adding a printk() to every
      WARN_ON() caller, which reduces the utility of using an instruction
      exception to streamline the resulting code.  By making this a new BUGFLAG,
      all of these can be removed and "cut here" can be handled by the exception
      handler.
      
      This was very pronounced on PowerPC, but the effect can be seen on x86 as
      well.  The resulting text size of a defconfig build shows some small
      savings from this patch:
      
         text    data     bss     dec     hex filename
      19691167        5134320 1646664 26472151        193eed7 vmlinux.before
      19676362        5134260 1663048 26473670        193f4c6 vmlinux.after
      
      This change also opens the door for creating something like BUG_MSG(),
      where a custom printk() before issuing BUG(), without confusing the "cut
      here" line.
      
      This patch (of 7):
      
      There's no reason to have specialized helpers for passing the warn taint
      down to __warn().  Consolidate and refactor helper macros, removing
      __WARN_printf() and warn_slowpath_fmt_taint().
      
      Link: http://lkml.kernel.org/r/20190819234111.9019-2-keescook@chromium.orgSigned-off-by: NKees Cook <keescook@chromium.org>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Drew Davenport <ddavenport@chromium.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ee871133
    • D
      kgdb: don't use a notifier to enter kgdb at panic; call directly · 7d92bda2
      Douglas Anderson 提交于
      Right now kgdb/kdb hooks up to debug panics by registering for the panic
      notifier.  This works OK except that it means that kgdb/kdb gets called
      _after_ the CPUs in the system are taken offline.  That means that if
      anything important was happening on those CPUs (like something that might
      have contributed to the panic) you can't debug them.
      
      Specifically I ran into a case where I got a panic because a task was
      "blocked for more than 120 seconds" which was detected on CPU 2.  I nicely
      got shown stack traces in the kernel log for all CPUs including CPU 0,
      which was running 'PID: 111 Comm: kworker/0:1H' and was in the middle of
      __mmc_switch().
      
      I then ended up at the kdb prompt where switched over to kgdb to try to
      look at local variables of the process on CPU 0.  I found that I couldn't.
      Digging more, I found that I had no info on any tasks running on CPUs
      other than CPU 2 and that asking kdb for help showed me "Error: no saved
      data for this cpu".  This was because all the CPUs were offline.
      
      Let's move the entry of kdb/kgdb to a direct call from panic() and stop
      using the generic notifier.  Putting a direct call in allows us to order
      things more properly and it also doesn't seem like we're breaking any
      abstractions by calling into the debugger from the panic function.
      
      Daniel said:
      
      : This patch changes the way kdump and kgdb interact with each other.
      : However it would seem rather odd to have both tools simultaneously armed
      : and, even if they were, the user still has the option to use panic_timeout
      : to force a kdump to happen.  Thus I think the change of order is
      : acceptable.
      
      Link: http://lkml.kernel.org/r/20190703170354.217312-1-dianders@chromium.orgSigned-off-by: NDouglas Anderson <dianders@chromium.org>
      Reviewed-by: NDaniel Thompson <daniel.thompson@linaro.org>
      Cc: Jason Wessel <jason.wessel@windriver.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7d92bda2
    • K
      uaccess: add missing __must_check attributes · 9dd819a1
      Kees Cook 提交于
      The usercopy implementation comments describe that callers of the
      copy_*_user() family of functions must always have their return values
      checked.  This can be enforced at compile time with __must_check, so add
      it where needed.
      
      Link: http://lkml.kernel.org/r/201908251609.ADAD5CAAC1@keescookSigned-off-by: NKees Cook <keescook@chromium.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9dd819a1
    • V
      kexec: restore arch_kexec_kernel_image_probe declaration · d5372c39
      Vasily Gorbik 提交于
      arch_kexec_kernel_image_probe function declaration has been removed by
      commit 9ec4ecef ("kexec_file,x86,powerpc: factor out kexec_file_ops
      functions").  Still this function is overridden by couple of architectures
      and proper prototype declaration is therefore important, so bring it back.
      This fixes the following sparse warning on s390:
      arch/s390/kernel/machine_kexec_file.c:333:5: warning: symbol
      'arch_kexec_kernel_image_probe' was not declared.  Should it be static?
      
      Link: http://lkml.kernel.org/r/patch.git-ff1c9045ebdc.your-ad-here.call-01564402297-ext-5690@work.hoursSigned-off-by: NVasily Gorbik <gor@linux.ibm.com>
      Acked-by: NDave Young <dyoung@redhat.com>
      Reviewed-by: NBhupesh Sharma <bhsharma@redhat.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d5372c39
    • A
      cpumask: nicer for_each_cpumask_and() signature · 2a4a4082
      Alexey Dobriyan 提交于
      Mask arguments can be swapped without changing anything.  Make arguments
      names reflect that:
      
      	#define for_each_cpu_and(cpu, mask1, mask2)
      
      Link: http://lkml.kernel.org/r/20190724183350.GA15041@avx2Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2a4a4082
    • S
      fork: improve error message for corrupted page tables · 8495f7e6
      Sai Praneeth Prakhya 提交于
      When a user process exits, the kernel cleans up the mm_struct of the user
      process and during cleanup, check_mm() checks the page tables of the user
      process for corruption (E.g: unexpected page flags set/cleared).  For
      corrupted page tables, the error message printed by check_mm() isn't very
      clear as it prints the loop index instead of page table type (E.g:
      Resident file mapping pages vs Resident shared memory pages).  The loop
      index in check_mm() is used to index rss_stat[] which represents
      individual memory type stats.  Hence, instead of printing index, print
      memory type, thereby improving error message.
      
      Without patch:
      --------------
      [  204.836425] mm/pgtable-generic.c:29: bad p4d 0000000089eb4e92(800000025f941467)
      [  204.836544] BUG: Bad rss-counter state mm:00000000f75895ea idx:0 val:2
      [  204.836615] BUG: Bad rss-counter state mm:00000000f75895ea idx:1 val:5
      [  204.836685] BUG: non-zero pgtables_bytes on freeing mm: 20480
      
      With patch:
      -----------
      [   69.815453] mm/pgtable-generic.c:29: bad p4d 0000000084653642(800000025ca37467)
      [   69.815872] BUG: Bad rss-counter state mm:00000000014a6c03 type:MM_FILEPAGES val:2
      [   69.815962] BUG: Bad rss-counter state mm:00000000014a6c03 type:MM_ANONPAGES val:5
      [   69.816050] BUG: non-zero pgtables_bytes on freeing mm: 20480
      
      Also, change print function (from printk(KERN_ALERT, ..) to pr_alert()) so
      that it matches the other print statement.
      
      Link: http://lkml.kernel.org/r/da75b5153f617f4c5739c08ee6ebeb3d19db0fbc.1565123758.git.sai.praneeth.prakhya@intel.comSigned-off-by: NSai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
      Reviewed-by: NAnshuman Khandual <anshuman.khandual@arm.com>
      Suggested-by: NDave Hansen <dave.hansen@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NDave Hansen <dave.hansen@intel.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8495f7e6
    • S
      lib/hexdump: make print_hex_dump_bytes() a nop on !DEBUG builds · 091cb099
      Stephen Boyd 提交于
      I'm seeing a bunch of debug prints from a user of print_hex_dump_bytes()
      in my kernel logs, but I don't have CONFIG_DYNAMIC_DEBUG enabled nor do I
      have DEBUG defined in my build.  The problem is that
      print_hex_dump_bytes() calls a wrapper function in lib/hexdump.c that
      calls print_hex_dump() with KERN_DEBUG level.  There are three cases to
      consider here
      
        1. CONFIG_DYNAMIC_DEBUG=y  --> call dynamic_hex_dum()
        2. CONFIG_DYNAMIC_DEBUG=n && DEBUG --> call print_hex_dump()
        3. CONFIG_DYNAMIC_DEBUG=n && !DEBUG --> stub it out
      
      Right now, that last case isn't detected and we still call
      print_hex_dump() from the stub wrapper.
      
      Let's make print_hex_dump_bytes() only call print_hex_dump_debug() so that
      it works properly in all cases.
      
      Case #1, print_hex_dump_debug() calls dynamic_hex_dump() and we get same
      behavior.  Case #2, print_hex_dump_debug() calls print_hex_dump() with
      KERN_DEBUG and we get the same behavior.  Case #3, print_hex_dump_debug()
      is a nop, changing behavior to what we want, i.e.  print nothing.
      
      Link: http://lkml.kernel.org/r/20190816235624.115280-1-swboyd@chromium.orgSigned-off-by: NStephen Boyd <swboyd@chromium.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      091cb099
    • Q
      include/trace/events/writeback.h: fix -Wstringop-truncation warnings · d1a445d3
      Qian Cai 提交于
      There are many of those warnings.
      
      In file included from ./arch/powerpc/include/asm/paca.h:15,
                       from ./arch/powerpc/include/asm/current.h:13,
                       from ./include/linux/thread_info.h:21,
                       from ./include/asm-generic/preempt.h:5,
                       from ./arch/powerpc/include/generated/asm/preempt.h:1,
                       from ./include/linux/preempt.h:78,
                       from ./include/linux/spinlock.h:51,
                       from fs/fs-writeback.c:19:
      In function 'strncpy',
          inlined from 'perf_trace_writeback_page_template' at
      ./include/trace/events/writeback.h:56:1:
      ./include/linux/string.h:260:9: warning: '__builtin_strncpy' specified
      bound 32 equals destination size [-Wstringop-truncation]
        return __builtin_strncpy(p, q, size);
               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      Fix it by using the new strscpy_pad() which was introduced in "lib/string:
      Add strscpy_pad() function" and will always be NUL-terminated instead of
      strncpy().  Also, change strlcpy() to use strscpy_pad() in this file for
      consistency.
      
      Link: http://lkml.kernel.org/r/1564075099-27750-1-git-send-email-cai@lca.pw
      Fixes: 455b2864 ("writeback: Initial tracing support")
      Fixes: 028c2dd1 ("writeback: Add tracing to balance_dirty_pages")
      Fixes: e84d0a4f ("writeback: trace event writeback_queue_io")
      Fixes: b48c104d ("writeback: trace event bdi_dirty_ratelimit")
      Fixes: cc1676d9 ("writeback: Move requeueing when I_SYNC set to writeback_sb_inodes()")
      Fixes: 9fb0a7da ("writeback: add more tracepoints")
      Signed-off-by: NQian Cai <cai@lca.pw>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Tobin C. Harding <tobin@kernel.org>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Nitin Gote <nitin.r.gote@intel.com>
      Cc: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
      Cc: Stephen Kitt <steve@sk2.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d1a445d3
    • J
      kernel-doc: core-api: include string.h into core-api · 917cda27
      Joe Perches 提交于
      core-api should show all the various string functions including the newly
      added stracpy and stracpy_pad.
      
      Miscellanea:
      
      o Update the Returns: value for strscpy
      o fix a defect with %NUL)
      
      [joe@perches.com: correct return of -E2BIG descriptions]
        Link: http://lkml.kernel.org/r/29f998b4c1a9d69fbeae70500ba0daa4b340c546.1563889130.git.joe@perches.com
      Link: http://lkml.kernel.org/r/224a6ebf39955f4107c0c376d66155d970e46733.1563841972.git.joe@perches.comSigned-off-by: NJoe Perches <joe@perches.com>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Stephen Kitt <steve@sk2.org>
      Cc: Nitin Gote <nitin.r.gote@intel.com>
      Cc: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
      Cc: Jann Horn <jannh@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      917cda27
    • M
      augmented rbtree: rework the RB_DECLARE_CALLBACKS macro definition · 6d2052d1
      Michel Lespinasse 提交于
      Change the definition of the RBCOMPUTE function.  The propagate callback
      repeatedly calls RBCOMPUTE as it moves from leaf to root.  it wants to
      stop recomputing once the augmented subtree information doesn't change.
      This was previously checked using the == operator, but that only works
      when the augmented subtree information is a scalar field.  This commit
      modifies the RBCOMPUTE function so that it now sets the augmented subtree
      information instead of returning it, and returns a boolean value
      indicating if the propagate callback should stop.
      
      The motivation for this change is that I want to introduce augmented
      rbtree uses where the augmented data for the subtree is a struct instead
      of a scalar.
      
      Link: http://lkml.kernel.org/r/20190703040156.56953-4-walken@google.comSigned-off-by: NMichel Lespinasse <walken@google.com>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: Uladzislau Rezki <urezki@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6d2052d1
    • M
      augmented rbtree: add new RB_DECLARE_CALLBACKS_MAX macro · 315cc066
      Michel Lespinasse 提交于
      Add RB_DECLARE_CALLBACKS_MAX, which generates augmented rbtree callbacks
      for the case where the augmented value is a scalar whose definition
      follows a max(f(node)) pattern.  This actually covers all present uses of
      RB_DECLARE_CALLBACKS, and saves some (source) code duplication in the
      various RBCOMPUTE function definitions.
      
      [walken@google.com: fix mm/vmalloc.c]
        Link: http://lkml.kernel.org/r/CANN689FXgK13wDYNh1zKxdipeTuALG4eKvKpsdZqKFJ-rvtGiQ@mail.gmail.com
      [walken@google.com: re-add check to check_augmented()]
        Link: http://lkml.kernel.org/r/20190727022027.GA86863@google.com
      Link: http://lkml.kernel.org/r/20190703040156.56953-3-walken@google.comSigned-off-by: NMichel Lespinasse <walken@google.com>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: Uladzislau Rezki <urezki@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      315cc066
    • M
      augmented rbtree: add comments for RB_DECLARE_CALLBACKS macro · 444b8a83
      Michel Lespinasse 提交于
      Patch series "make RB_DECLARE_CALLBACKS more generic", v3.
      
      These changes are intended to make the RB_DECLARE_CALLBACKS macro more
      generic (allowing the aubmented subtree information to be a struct instead
      of a scalar).
      
      I have verified the compiled lib/interval_tree.o and mm/mmap.o files to
      check that they didn't change.  This held as expected for interval_tree.o;
      mmap.o did have some changes which could be reverted by marking
      __vma_link_rb as noinline.  I did not add such a change to the patchset; I
      felt it was reasonable enough to leave the inlining decision up to the
      compiler.
      
      This patch (of 3):
      
      Add a short comment summarizing the arguments to RB_DECLARE_CALLBACKS.
      The arguments are also now capitalized.  This copies the style of the
      INTERVAL_TREE_DEFINE macro.
      
      No functional changes in this commit, only comments and capitalization.
      
      Link: http://lkml.kernel.org/r/20190703040156.56953-2-walken@google.comSigned-off-by: NMichel Lespinasse <walken@google.com>
      Acked-by: NDavidlohr Bueso <dbueso@suse.de>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Uladzislau Rezki <urezki@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      444b8a83
    • M
      linux/coff.h: add include guard · 541be050
      Masahiro Yamada 提交于
      Add a header include guard just in case.
      
      My motivation is to allow Kbuild to detect missing include guard:
      
      https://patchwork.kernel.org/patch/11063011/
      
      Before I enable this checker I want to fix as many headers as possible.
      
      Link: http://lkml.kernel.org/r/20190728154728.11126-1-yamada.masahiro@socionext.comSigned-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      541be050
  9. 25 9月, 2019 4 次提交
    • M
      sched/membarrier: Fix p->mm->membarrier_state racy load · 227a4aad
      Mathieu Desnoyers 提交于
      The membarrier_state field is located within the mm_struct, which
      is not guaranteed to exist when used from runqueue-lock-free iteration
      on runqueues by the membarrier system call.
      
      Copy the membarrier_state from the mm_struct into the scheduler runqueue
      when the scheduler switches between mm.
      
      When registering membarrier for mm, after setting the registration bit
      in the mm membarrier state, issue a synchronize_rcu() to ensure the
      scheduler observes the change. In order to take care of the case
      where a runqueue keeps executing the target mm without swapping to
      other mm, iterate over each runqueue and issue an IPI to copy the
      membarrier_state from the mm_struct into each runqueue which have the
      same mm which state has just been modified.
      
      Move the mm membarrier_state field closer to pgd in mm_struct to use
      a cache line already touched by the scheduler switch_mm.
      
      The membarrier_execve() (now membarrier_exec_mmap) hook now needs to
      clear the runqueue's membarrier state in addition to clear the mm
      membarrier state, so move its implementation into the scheduler
      membarrier code so it can access the runqueue structure.
      
      Add memory barrier in membarrier_exec_mmap() prior to clearing
      the membarrier state, ensuring memory accesses executed prior to exec
      are not reordered with the stores clearing the membarrier state.
      
      As suggested by Linus, move all membarrier.c RCU read-side locks outside
      of the for each cpu loops.
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190919173705.2181-5-mathieu.desnoyers@efficios.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      227a4aad
    • M
      sched/membarrier: Call sync_core only before usermode for same mm · 2840cf02
      Mathieu Desnoyers 提交于
      When the prev and next task's mm change, switch_mm() provides the core
      serializing guarantees before returning to usermode. The only case
      where an explicit core serialization is needed is when the scheduler
      keeps the same mm for prev and next.
      Suggested-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190919173705.2181-4-mathieu.desnoyers@efficios.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      2840cf02
    • E
      tasks, sched/core: With a grace period after finish_task_switch(), remove unnecessary code · 154abafc
      Eric W. Biederman 提交于
      Remove work arounds that were written before there was a grace period
      after tasks left the runqueue in finish_task_switch().
      
      In particular now that there tasks exiting the runqueue exprience
      a RCU grace period none of the work performed by task_rcu_dereference()
      excpet the rcu_dereference() is necessary so replace task_rcu_dereference()
      with rcu_dereference().
      
      Remove the code in rcuwait_wait_event() that checks to ensure the current
      task has not exited.  It is no longer necessary as it is guaranteed
      that any running task will experience a RCU grace period after it
      leaves the run queueue.
      
      Remove the comment in rcuwait_wake_up() as it is no longer relevant.
      
      Ref: 8f95c90c ("sched/wait, RCU: Introduce rcuwait machinery")
      Ref: 150593bf ("sched/api: Introduce task_rcu_dereference() and try_get_task_struct()")
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/87lfurdpk9.fsf_-_@x220.int.ebiederm.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      154abafc
    • E
      tasks: Add a count of task RCU users · 3fbd7ee2
      Eric W. Biederman 提交于
      Add a count of the number of RCU users (currently 1) of the task
      struct so that we can later add the scheduler case and get rid of the
      very subtle task_rcu_dereference(), and just use rcu_dereference().
      
      As suggested by Oleg have the count overlap rcu_head so that no
      additional space in task_struct is required.
      Inspired-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Inspired-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/87woebdplt.fsf_-_@x220.int.ebiederm.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      3fbd7ee2