1. 08 Aug 2020, 13 commits
    • mm: memcg/slab: move memcg_kmem_bypass() to memcontrol.h · 0f876e4d
      Roman Gushchin authored
      To make the memcg_kmem_bypass() function available outside of
      memcontrol.c, let's move it to memcontrol.h.  The function is small and
      fits nicely as a static inline.
      
      It will be used from the slab code.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-12-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0f876e4d
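      A minimal sketch of the pattern applied in the entry above, with hypothetical
      names (this is not the actual memcg_kmem_bypass() body): a small predicate
      moves from a .c file into the shared header as a static inline, so that other
      translation units such as the slab code can call it without a cross-file call.

      /* memcontrol.h-style header (hypothetical example) */
      #include <stdbool.h>

      struct task_ctx {
              bool in_interrupt;
              bool has_mm;
      };

      /* small enough to live in the header as a static inline */
      static inline bool kmem_bypass(const struct task_ctx *ctx)
      {
              /* skip accounting in contexts that have no owning cgroup */
              return ctx->in_interrupt || !ctx->has_mm;
      }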
    • mm: memcg/slab: save obj_cgroup for non-root slab objects · 964d4bd3
      Roman Gushchin authored
      Store the obj_cgroup pointer in the corresponding place of
      page->obj_cgroups for each allocated non-root slab object.  Make sure that
      each allocated object holds a reference to obj_cgroup.
      
      The objcg pointer is obtained by dereferencing memcg->objcg in
      memcg_kmem_get_cache() and is passed from pre_alloc_hook to
      post_alloc_hook.  On a successful allocation it is then stored in the
      page->obj_cgroups vector.

      Obtaining the objcg looks a bit bulky now, but it will be simplified by
      the next commits in the series.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-9-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      964d4bd3
    • mm: memcg/slab: allocate obj_cgroups for non-root slab pages · 286e04b8
      Roman Gushchin authored
      Allocate and release memory to store obj_cgroup pointers for each non-root
      slab page. Reuse page->mem_cgroup pointer to store a pointer to the
      allocated space.
      
      This commit temporarily increases the memory footprint of kernel memory
      accounting: storing the obj_cgroup pointers requires space for one
      pointer per allocated object.  However, the following patches in the
      series will enable sharing of slab pages between memory cgroups, which
      will dramatically increase the total slab utilization, and the final
      memory footprint will be significantly smaller than before.

      To distinguish between obj_cgroups and memcg pointers in cases where it's
      not obvious which one is used (as in page_cgroup_ino()), let's always set
      the lowest bit in the obj_cgroup case.  The original (untagged) obj_cgroups
      pointer is marked to be ignored by kmemleak, which would otherwise
      report a memory leak for each allocated vector.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-8-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      286e04b8
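      The lowest-bit tagging described in the entry above can be illustrated with
      a small, self-contained userspace sketch (hypothetical struct and helper
      names; in the kernel the tagged word is the reused page->mem_cgroup field):

      #include <assert.h>
      #include <stdint.h>
      #include <stdio.h>

      struct mem_cgroup { int id; };
      struct obj_cgroup { int id; };

      /* one word that holds either a memcg pointer or an obj_cgroups vector;
       * the lowest bit says which, since pointers are word-aligned anyway */
      struct fake_page { uintptr_t memcg_data; };

      static void set_memcg(struct fake_page *p, struct mem_cgroup *memcg)
      {
              p->memcg_data = (uintptr_t)memcg;        /* bit 0 clear */
      }

      static void set_obj_cgroups(struct fake_page *p, struct obj_cgroup **vec)
      {
              p->memcg_data = (uintptr_t)vec | 1UL;    /* bit 0 set */
      }

      static int page_has_obj_cgroups(const struct fake_page *p)
      {
              return p->memcg_data & 1UL;
      }

      static struct obj_cgroup **page_obj_cgroups(const struct fake_page *p)
      {
              return (struct obj_cgroup **)(p->memcg_data & ~1UL);
      }

      int main(void)
      {
              static struct obj_cgroup objcg = { .id = 42 };
              static struct obj_cgroup *vec[4] = { &objcg };
              struct mem_cgroup memcg = { .id = 7 };
              struct fake_page page;

              set_obj_cgroups(&page, vec);
              assert(page_has_obj_cgroups(&page));
              printf("objcg id for object 0: %d\n", page_obj_cgroups(&page)[0]->id);

              set_memcg(&page, &memcg);
              assert(!page_has_obj_cgroups(&page));
              return 0;
      }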
    • mm: memcg/slab: obj_cgroup API · bf4f0599
      Roman Gushchin authored
      The obj_cgroup API provides the ability to account sub-page-sized kernel
      objects, which may potentially outlive the original memory cgroup.
      
      The top-level API consists of the following functions:
        bool obj_cgroup_tryget(struct obj_cgroup *objcg);
        void obj_cgroup_get(struct obj_cgroup *objcg);
        void obj_cgroup_put(struct obj_cgroup *objcg);
      
        int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
        void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
      
        struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg);
        struct obj_cgroup *get_obj_cgroup_from_current(void);
      
      An object cgroup is basically a pointer to a memory cgroup plus a per-cpu
      reference counter.  It substitutes for a memory cgroup in places where it's
      necessary to charge a custom number of bytes instead of pages.

      All charged memory, rounded down to pages, is charged to the corresponding
      memory cgroup using __memcg_kmem_charge().

      It implements reparenting: on memcg offlining, the object cgroup is
      reattached to the parent memory cgroup.  Each online memory cgroup has an
      associated active object cgroup to handle new allocations, and a list of
      all attached object cgroups.  On offlining of a cgroup this list is
      reparented, and for each object cgroup in the list the memcg pointer is
      swapped to the parent memory cgroup.  This prevents long-living objects
      from pinning the original memory cgroup in memory.

      The implementation is based on byte-sized per-cpu stocks.  A sub-page
      sized leftover is stored in an atomic field which is part of the
      obj_cgroup object, so on cgroup offlining the leftover is automatically
      reparented.

      memcg->objcg is RCU protected.  objcg->memcg is a raw pointer, which
      always points at a memory cgroup, but can be atomically swapped to the
      parent memory cgroup.  So a user must ensure the lifetime of the cgroup,
      e.g. by grabbing rcu_read_lock or css_set_lock.
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-7-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bf4f0599
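      A hedged sketch of how a caller might use the API listed above (kernel-style
      pseudocode, not taken from the actual slab code; `size` and the surrounding
      function are hypothetical):

      /* charge `size` bytes of a new kernel object to the current cgroup */
      struct obj_cgroup *objcg;

      objcg = get_obj_cgroup_from_current();      /* takes a reference */
      if (objcg) {
              if (obj_cgroup_charge(objcg, GFP_KERNEL, size)) {
                      obj_cgroup_put(objcg);      /* charge failed */
                      return NULL;
              }
              /* store objcg next to the object so the free path can find it */
      }

      /* ... later, on free ... */
      if (objcg) {
              obj_cgroup_uncharge(objcg, size);
              obj_cgroup_put(objcg);              /* drop the per-object reference */
      }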
    • mm: slub: implement SLUB version of obj_to_index() · 4138fdfc
      Roman Gushchin authored
      This commit implements the SLUB version of the obj_to_index() function,
      which will be required to calculate the offset of an obj_cgroup in the
      obj_cgroups vector in order to store/obtain the objcg ownership data.
      
      To make it faster, let's repeat the SLAB's trick introduced by commit
      6a2d7a95 ("SLAB: use a multiply instead of a divide in
      obj_to_index()") and avoid an expensive division.
      
      Vlastimil Babka noticed that SLUB already has a similar function called
      slab_index(), which is defined only if SLUB_DEBUG is enabled.  That
      function does similar math, but with a division, and it also takes a
      page address instead of a page pointer.
      
      Let's remove slab_index() and replace it with the new helper
      __obj_to_index(), which takes a page address.  obj_to_index() will be a
      simple wrapper taking a page pointer and passing page_address(page) into
      __obj_to_index().
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-5-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4138fdfc
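      The divide-avoidance trick referenced above (from commit 6a2d7a95) can be
      demonstrated in isolation.  This userspace sketch uses a simple shifted
      reciprocal, which conveys the general idea rather than the kernel's exact
      reciprocal_divide() implementation; all names are hypothetical:

      #include <assert.h>
      #include <stdint.h>
      #include <stdio.h>

      /* precompute a fixed-point reciprocal of the object size once per
       * cache, then turn each "offset / size" into a multiply and a shift */
      struct cache {
              unsigned int size;        /* object size in bytes */
              uint32_t reciprocal;      /* ceil(2^32 / size) */
      };

      static void cache_init(struct cache *c, unsigned int size)
      {
              c->size = size;
              c->reciprocal = (uint32_t)((((uint64_t)1 << 32) + size - 1) / size);
      }

      static unsigned int obj_to_index(const struct cache *c,
                                       uintptr_t page_addr, uintptr_t obj)
      {
              uint32_t offset = (uint32_t)(obj - page_addr);

              /* equivalent to offset / c->size for offsets within one page */
              return (unsigned int)(((uint64_t)offset * c->reciprocal) >> 32);
      }

      int main(void)
      {
              struct cache c;

              cache_init(&c, 192);      /* e.g. kmalloc-192-like objects */
              for (uint32_t off = 0; off < 4096; off++)
                      assert(obj_to_index(&c, 0x1000, 0x1000 + off) == off / c.size);
              printf("multiply-based index matches division for all offsets\n");
              return 0;
      }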
    • mm: memcg: convert vmstat slab counters to bytes · d42f3245
      Roman Gushchin authored
      In order to prepare for per-object slab memory accounting, convert
      NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes.
      
      To make it obvious, rename them to NR_SLAB_RECLAIMABLE_B and
      NR_SLAB_UNRECLAIMABLE_B (similar to NR_KERNEL_STACK_KB).
      
      Internally, global and per-node counters are stored in pages, while memcg
      and lruvec counters are stored in bytes.  This scheme may look odd, but
      only for now.  As soon as slab pages are shared between multiple cgroups,
      the global and node counters will reflect the total number of slab pages.
      The memcg and lruvec counters, however, will be used for per-memcg slab
      memory tracking, which accounts individual kernel objects.  Keeping the
      global and node counters in pages helps avoid additional overhead.

      The size of slab memory shouldn't exceed 4 GB on 32-bit machines, so it
      will fit into the atomic_long_t we use for vmstats.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-4-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d42f3245
    • mm: memcg: prepare for byte-sized vmstat items · ea426c2a
      Roman Gushchin authored
      To implement per-object slab memory accounting, we need to convert slab
      vmstat counters to bytes.  Out of the four levels of counters (global,
      per-node, per-memcg and per-lruvec), only the last two levels require
      byte-sized counters: global and per-node counters will count the number
      of slab pages, while per-memcg and per-lruvec counters will count the
      amount of memory taken by charged slab objects.

      Converting all vmstat counters to bytes, or even all slab counters to
      bytes, would introduce additional overhead.  So instead, let's store
      global and per-node counters in pages, and memcg and lruvec counters in
      bytes.

      To keep the API clean, all access helpers (on both the read and write
      sides) deal in bytes.

      To avoid back-and-forth conversions, a new flavor of read-side helpers is
      introduced which always returns values in pages: node_page_state_pages()
      and global_node_page_state_pages().

      The new helpers just read the raw values.  The old helpers become simple
      wrappers which complain on an attempt to read a byte-sized value, because
      at the moment no one actually needs bytes.
      
      Thanks to Johannes Weiner for the idea of having the byte-sized API on top
      of the page-sized internal storage.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-3-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ea426c2a
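      A compact userspace sketch of the split described above (hypothetical names,
      a single counter, no per-cpu batching): the page-granular item is stored in
      pages internally, the regular helpers accept and return bytes, and a separate
      *_pages() reader just returns the raw value.

      #include <stdio.h>

      #define PAGE_SHIFT 12
      #define PAGE_SIZE  (1L << PAGE_SHIFT)

      /* internal storage for a page-granular item: pages */
      static long nr_slab_reclaimable_pages;

      /* write side of the API deals in bytes and converts once */
      static void mod_state_bytes(long bytes)
      {
              nr_slab_reclaimable_pages += bytes >> PAGE_SHIFT;
      }

      /* ordinary read helper, also byte-sized ... */
      static long state_bytes(void)
      {
              return nr_slab_reclaimable_pages << PAGE_SHIFT;
      }

      /* ... and the new *_pages() flavour just returns the raw value */
      static long state_pages(void)
      {
              return nr_slab_reclaimable_pages;
      }

      int main(void)
      {
              mod_state_bytes(8 * PAGE_SIZE);
              printf("%ld bytes / %ld pages\n", state_bytes(), state_pages());
              return 0;
      }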
    • mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state() · eedc4e5a
      Roman Gushchin authored
      Patch series "The new cgroup slab memory controller", v7.
      
      The patchset moves the accounting from the page level to the object level.
      This allows slab pages to be shared between memory cgroups, which leads to
      a significant win in slab utilization (up to 45%) and a corresponding drop
      in the total kernel memory footprint.  The reduced number of unmovable
      slab pages should also have a positive effect on memory fragmentation.

      The patchset makes the slab accounting code simpler: there is no longer
      any need for the complicated dynamic creation and destruction of
      per-cgroup slab caches; all memory cgroups use a global set of shared slab
      caches.  The lifetime of slab caches is no longer tied to the lifetime of
      memory cgroups.
      
      The more precise accounting does require more CPU, but in practice the
      difference seems to be negligible.  We've been using the new slab
      controller in Facebook production for several months with different
      workloads and haven't seen any noticeable regressions.  What we have seen
      are memory savings on the order of 1 GB per host (varying heavily with
      the actual workload, size of RAM, number of CPUs, memory pressure, etc.).

      The third version of the patchset added yet another step towards the
      simplification of the code: sharing of slab caches between accounted and
      non-accounted allocations.  It comes with significant upsides (most
      noticeably, a complete elimination of dynamic slab cache creation) but
      not without some regression risks, so this change sits on top of the
      patchset and is not completely merged in.  In the unlikely event of a
      noticeable performance regression it can be reverted separately.

      Slab memory accounting works in exactly the same way for SLAB and SLUB.
      With both allocators the new controller shows significant memory savings,
      and with SLUB the difference is bigger.  On my 16-core desktop machine
      running Fedora 32, the amount of slab memory measured after system start
      was lower by 58% and 38% with SLUB and SLAB respectively.
      
      As an estimate of the potential CPU overhead, below are the results of the
      slab_bulk_test01 test, kindly provided by Jesper D. Brouer, who also
      helped with the evaluation of the results.

      The test can be found here: https://github.com/netoptimizer/prototype-kernel/
      The smallest number in each row should be picked for comparison.
      
      SLUB-patched - bulk-API
       - SLUB-patched : bulk_quick_reuse objects=1 : 187 -  90 - 224  cycles(tsc)
       - SLUB-patched : bulk_quick_reuse objects=2 : 110 -  53 - 133  cycles(tsc)
       - SLUB-patched : bulk_quick_reuse objects=3 :  88 -  95 -  42  cycles(tsc)
       - SLUB-patched : bulk_quick_reuse objects=4 :  91 -  85 -  36  cycles(tsc)
       - SLUB-patched : bulk_quick_reuse objects=8 :  32 -  66 -  32  cycles(tsc)
      
      SLUB-original -  bulk-API
       - SLUB-original: bulk_quick_reuse objects=1 :  87 -  87 - 142  cycles(tsc)
       - SLUB-original: bulk_quick_reuse objects=2 :  52 -  53 -  53  cycles(tsc)
       - SLUB-original: bulk_quick_reuse objects=3 :  42 -  42 -  91  cycles(tsc)
       - SLUB-original: bulk_quick_reuse objects=4 :  91 -  37 -  37  cycles(tsc)
       - SLUB-original: bulk_quick_reuse objects=8 :  31 -  79 -  76  cycles(tsc)
      
      SLAB-patched -  bulk-API
       - SLAB-patched : bulk_quick_reuse objects=1 :  67 -  67 - 140  cycles(tsc)
       - SLAB-patched : bulk_quick_reuse objects=2 :  55 -  46 -  46  cycles(tsc)
       - SLAB-patched : bulk_quick_reuse objects=3 :  93 -  94 -  39  cycles(tsc)
       - SLAB-patched : bulk_quick_reuse objects=4 :  35 -  88 -  85  cycles(tsc)
       - SLAB-patched : bulk_quick_reuse objects=8 :  30 -  30 -  30  cycles(tsc)
      
      SLAB-original-  bulk-API
       - SLAB-original: bulk_quick_reuse objects=1 : 143 - 136 -  67  cycles(tsc)
       - SLAB-original: bulk_quick_reuse objects=2 :  45 -  46 -  46  cycles(tsc)
       - SLAB-original: bulk_quick_reuse objects=3 :  38 -  39 -  39  cycles(tsc)
       - SLAB-original: bulk_quick_reuse objects=4 :  35 -  87 -  87  cycles(tsc)
       - SLAB-original: bulk_quick_reuse objects=8 :  29 -  66 -  30  cycles(tsc)
      
      This patch (of 19):
      
      To convert the memcg and lruvec slab counters to bytes there must be a way
      to change these counters without touching the node counters.  Factor
      __mod_memcg_lruvec_state() out of __mod_lruvec_state().
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-1-guro@fb.com
      Link: http://lkml.kernel.org/r/20200623174037.3951353-2-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eedc4e5a
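      A hedged sketch of the factoring performed by this first patch (kernel-style
      pseudocode with simplified signatures; the real functions take an
      enum node_stat_item and also handle per-cpu batching):

      /* memcg- and lruvec-level part, now callable on its own */
      static void __mod_memcg_lruvec_state(struct lruvec *lruvec, int idx, int val)
      {
              /* update only the memcg counter and the per-lruvec counter;
               * node-level counters are not touched here, which is what lets
               * later patches feed byte-sized deltas into this path */
      }

      void __mod_lruvec_state(struct lruvec *lruvec, int idx, int val)
      {
              /* node-level update stays where it was */
              __mod_node_page_state(lruvec_pgdat(lruvec), idx, val);

              /* memcg/lruvec part is delegated to the new helper */
              if (!mem_cgroup_disabled())
                      __mod_memcg_lruvec_state(lruvec, idx, val);
      }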
    • tmpfs: support 64-bit inums per-sb · ea3271f7
      Chris Down authored
      The default is still set to inode32 for backwards compatibility, but
      system administrators can opt in to the new 64-bit inode numbers by
      either:
      
      1. Passing inode64 on the command line when mounting, or
      2. Configuring the kernel with CONFIG_TMPFS_INODE64=y
      
      The inode64 and inode32 names are used based on existing precedent from
      XFS.
      
      [hughd@google.com: Kconfig fixes]
        Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008011928010.13320@eggly.anvils
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Amir Goldstein <amir73il@gmail.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/8b23758d0c66b5e2263e08baf9c4b6a7565cbd8f.1594661218.git.chris@chrisdown.name
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ea3271f7
    • tmpfs: per-superblock i_ino support · e809d5f0
      Chris Down authored
      Patch series "tmpfs: inode: Reduce risk of inum overflow", v7.
      
      In Facebook production we are seeing heavy i_ino wraparounds on tmpfs.  On
      affected tiers, in excess of 10% of hosts show multiple files with
      different content and the same inode number, with some servers even having
      as many as 150 duplicated inode numbers with differing file content.
      
      This causes actual, tangible problems in production.  For example, we have
      complaints from those working on remote caches that their application is
      reporting cache corruptions because it uses (device, inodenum) to
      establish the identity of a particular cache object, but because it's not
      unique any more, the application refuses to continue and reports cache
      corruption.  Even worse, sometimes applications may not even detect the
      corruption but may continue anyway, causing phantom, hard-to-debug
      behaviour.
      
      In general, userspace applications expect that (device, inodenum) should
      be enough to uniquely identify one inode, which seems fair enough.  One
      might also need to check the generation, but in this case:
      
      1. That's not currently exposed to userspace
         (ioctl(...FS_IOC_GETVERSION...) returns ENOTTY on tmpfs);
      2. Even with generation, there shouldn't be two live inodes with the
         same inode number on one device.
      
      In order to mitigate this, we take a two-pronged approach:
      
      1. Moving inum generation from being global to per-sb for tmpfs. This
         itself allows some reduction in i_ino churn. This works on both 64-
         and 32-bit machines.
      2. Adding inode{64,32} for tmpfs. This fix is supported on machines with
         64-bit ino_t only: we allow users to mount tmpfs with a new inode64
         option that uses the full width of ino_t, or CONFIG_TMPFS_INODE64.
      
      You can see how this compares to previous related patches which didn't
      implement this per-superblock:
      
      - https://patchwork.kernel.org/patch/11254001/
      - https://patchwork.kernel.org/patch/11023915/
      
      This patch (of 2):
      
      get_next_ino has a number of problems:
      
      - It uses and returns a uint, which is susceptible to overflow
        if a lot of volatile inodes that use get_next_ino are created.
      - It's global, with no specificity per-sb or even per-filesystem. This
        means it's not that difficult to cause inode number wraparounds on a
        single device, which can result in having multiple distinct inodes
        with the same inode number.
      
      This patch adds a per-superblock counter that mitigates the second case.
      This design also allows us to later have a specific i_ino size per-device,
      for example, allowing users to choose whether to use 32- or 64-bit inodes
      for each tmpfs mount.  This is implemented in the next commit.
      
      For internal shmem mounts which may be less tolerant to spinlock delays,
      we implement a percpu batching scheme which only takes the stat_lock at
      each batch boundary.
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/cover.1594661218.git.chris@chrisdown.name
      Link: http://lkml.kernel.org/r/1986b9d63b986f08ec07a4aa4b2275e718e47d8a.1594661218.git.chris@chrisdown.name
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e809d5f0
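      The batching scheme described in the entry above can be sketched as a
      self-contained, single-threaded stand-in (hypothetical names; the real code
      keeps the cache per-cpu and takes the superblock's stat_lock only when a
      batch is refilled):

      #include <stdio.h>

      #define SHMEM_INO_BATCH 1024UL

      /* per-superblock state (in the kernel this lives in the shmem
       * superblock info and is protected by its stat_lock) */
      struct fake_sb { unsigned long next_ino; };

      /* per-cpu cache: a range of inode numbers handed out lock-free */
      struct ino_cache { unsigned long next; unsigned long end; };

      static unsigned long get_ino(struct fake_sb *sb, struct ino_cache *cache)
      {
              if (cache->next == cache->end) {
                      /* batch boundary: this is the only place the per-sb
                       * lock would be taken, to grab a fresh range */
                      cache->next = sb->next_ino;
                      sb->next_ino += SHMEM_INO_BATCH;
                      cache->end = cache->next + SHMEM_INO_BATCH;
              }
              return cache->next++;
      }

      int main(void)
      {
              struct fake_sb sb = { .next_ino = 1 };
              struct ino_cache cpu0 = { 0, 0 }, cpu1 = { 0, 0 };

              printf("cpu0: %lu\n", get_ino(&sb, &cpu0));   /* 1    */
              printf("cpu0: %lu\n", get_ino(&sb, &cpu0));   /* 2    */
              printf("cpu1: %lu\n", get_ino(&sb, &cpu1));   /* 1025 */
              return 0;
      }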
    • mm, dump_page: do not crash with bad compound_mapcount() · 6dc5ea16
      John Hubbard authored
      If a compound page is being split while dump_page() is being run on that
      page, we can end up calling compound_mapcount() on a page that is no
      longer compound.  This leads to a crash (already seen at least once in the
      field), due to the VM_BUG_ON_PAGE() assertion inside compound_mapcount().
      
      (The above is from Matthew Wilcox's analysis of Qian Cai's bug report.)
      
      A similar problem is possible, via compound_pincount() instead of
      compound_mapcount().
      
      In order to avoid this kind of crash, make dump_page() slightly more
      robust, by providing a pair of simpler routines that don't contain
      assertions: head_mapcount() and head_pincount().
      
      For debug tools, we don't want to go *too* far in this direction, but this
      is a simple small fix, and the crash has already been seen, so it's a good
      trade-off.
      Reported-by: Qian Cai <cai@lca.pw>
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Link: http://lkml.kernel.org/r/20200804214807.169256-1-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6dc5ea16
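      The assertion-free helpers are roughly of this shape (a hedged
      reconstruction, not a verbatim copy of the patch; the point is only that the
      read happens without a VM_BUG_ON_PAGE(!PageHead) check):

      /* like compound_mapcount()/compound_pincount(), but without the
       * PageHead assertion, so dump_page() can use them on a page that may
       * be in the middle of being split */
      static inline int head_mapcount(struct page *head)
      {
              return atomic_read(compound_mapcount_ptr(head)) + 1;
      }

      static inline int head_pincount(struct page *head)
      {
              return atomic_read(compound_pincount_ptr(head));
      }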
    • mm, treewide: rename kzfree() to kfree_sensitive() · 453431a5
      Waiman Long authored
      As said by Linus:
      
        A symmetric naming is only helpful if it implies symmetries in use.
        Otherwise it's actively misleading.
      
        In "kzalloc()", the z is meaningful and an important part of what the
        caller wants.
      
        In "kzfree()", the z is actively detrimental, because maybe in the
        future we really _might_ want to use that "memfill(0xdeadbeef)" or
        something. The "zero" part of the interface isn't even _relevant_.
      
      The main reason that kzfree() exists is to clear sensitive information
      that should not be leaked to other future users of the same memory
      objects.
      
      Rename kzfree() to kfree_sensitive() to follow the example of the recently
      added kvfree_sensitive() and make the intention of the API more explicit.
      In addition, memzero_explicit() is used to clear the memory to make sure
      that it won't get optimized away by the compiler.
      
      The renaming is done by using the command sequence:
      
        git grep -w --name-only kzfree |\
        xargs sed -i 's/kzfree/kfree_sensitive/'
      
      followed by some editing of the kfree_sensitive() kerneldoc and adding
      a kzfree backward compatibility macro in slab.h.
      
      [akpm@linux-foundation.org: fs/crypto/inline_crypt.c needs linux/slab.h]
      [akpm@linux-foundation.org: fix fs/crypto/inline_crypt.c some more]
      Suggested-by: Joe Perches <joe@perches.com>
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: David Howells <dhowells@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Jason A . Donenfeld" <Jason@zx2c4.com>
      Link: http://lkml.kernel.org/r/20200616154311.12314-3-longman@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      453431a5
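      A userspace illustration of why the clearing has to be "explicit"
      (hypothetical helper names, not the kernel implementation): a plain memset()
      before free() is a dead store the compiler may delete, so the sensitive-free
      helper must clear memory in a way the optimizer cannot remove.

      #include <stdlib.h>
      #include <string.h>

      /* memset() called through a volatile function pointer: the compiler
       * must assume the call has side effects, so the zeroing store cannot
       * be removed as a dead store.  (The kernel's memzero_explicit() gets
       * the same guarantee with a compiler barrier.) */
      static void *(*const volatile memset_v)(void *, int, size_t) = memset;

      static void my_free_sensitive(void *p, size_t len)
      {
              if (!p)
                      return;
              memset_v(p, 0, len);    /* really happens, even with -O2 */
              free(p);
      }

      int main(void)
      {
              char *secret = malloc(64);

              if (!secret)
                      return 1;
              strcpy(secret, "hunter2");
              my_free_sensitive(secret, 64);  /* zeroed, then freed */
              return 0;
      }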
    • mm/migrate: fix migrate_pgmap_owner w/o CONFIG_MMU_NOTIFIER · c1a06df6
      Ralph Campbell authored
      On x86_64, when CONFIG_MMU_NOTIFIER is not set/enabled, there is a
      compiler error:
      
         mm/migrate.c: In function 'migrate_vma_collect':
         mm/migrate.c:2481:7: error: 'struct mmu_notifier_range' has no member named 'migrate_pgmap_owner'
           range.migrate_pgmap_owner = migrate->pgmap_owner;
                ^
      
      Fixes: 998427b3 ("mm/notifier: add migration invalidation type")
      Reported-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Randy Dunlap <rdunlap@infradead.org>
      Acked-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "Jason Gunthorpe" <jgg@mellanox.com>
      Link: http://lkml.kernel.org/r/20200806193353.7124-1-rcampbell@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c1a06df6
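      The class of build failure being fixed, and the usual shape of such a fix,
      can be shown with a minimal hypothetical example (mock definitions only, not
      the real mmu_notifier structures or the actual patch): a member that exists
      only under a config option must be touched only under the same guard, e.g.
      via a small helper with a stub variant.

      #include <stdio.h>

      #define CONFIG_MMU_NOTIFIER 1   /* mock option; comment out to mimic the bug */

      struct demo_range {             /* mock stand-in for mmu_notifier_range */
              unsigned long start;
              unsigned long end;
      #ifdef CONFIG_MMU_NOTIFIER
              void *migrate_pgmap_owner;   /* exists only when the option is on */
      #endif
      };

      /* callers go through a helper, so a !CONFIG_MMU_NOTIFIER build never
       * references the missing member and keeps compiling */
      static void demo_range_set_owner(struct demo_range *range, void *owner)
      {
      #ifdef CONFIG_MMU_NOTIFIER
              range->migrate_pgmap_owner = owner;
      #else
              (void)range;
              (void)owner;
      #endif
      }

      int main(void)
      {
              struct demo_range r = { .start = 0, .end = 4096 };

              demo_range_set_owner(&r, NULL);
              printf("range %lu-%lu initialised\n", r.start, r.end);
              return 0;
      }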
  2. 05 Aug 2020, 1 commit
    • ipv4: route: Ignore output interface in FIB lookup for PMTU route · df23bb18
      Stefano Brivio authored
      Currently, processes sending traffic to a local bridge with an
      encapsulation device as a port don't get ICMP errors if they exceed
      the PMTU of the encapsulated link.
      
      David Ahern suggested this as a hack, but it actually looks like
      the correct solution: when we update the PMTU for a given destination
      by means of updating or creating a route exception, the encapsulation
      might trigger this because of PMTU discovery happening either on the
      encapsulation device itself, or its lower layer. This happens on
      bridged encapsulations only.
      
      The output interface shouldn't matter, because we already have a
      valid destination. Drop the output interface restriction from the
      associated route lookup.
      
      For UDP tunnels, we will now have a route exception created for the
      encapsulation itself, with a MTU value reflecting its headroom, which
      allows a bridge forwarding IP packets originated locally to deliver
      errors back to the sending socket.
      
      The behaviour is now consistent with IPv6 and verified with selftests
      pmtu_ipv{4,6}_br_{geneve,vxlan}{4,6}_exception introduced later in
      this series.
      
      v2:
      - reset output interface only for bridge ports (David Ahern)
      - add and use netif_is_any_bridge_port() helper (David Ahern)
      Suggested-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      df23bb18
  3. 04 Aug 2020, 6 commits
  4. 03 Aug 2020, 2 commits
  5. 02 Aug 2020, 3 commits
  6. 01 Aug 2020, 3 commits
    • sched: Document arch_scale_*_capacity() · f4470cdf
      Valentin Schneider authored
      Rather than hide their purpose in some dark, damp corner of Documentation/,
      add some documentation to the default implementations.
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lore.kernel.org/r/20200731192016.7484-2-valentin.schneider@arm.com
      f4470cdf
    • rtnetlink: add support for protodown reason · 829eb208
      Roopa Prabhu authored
      netdev protodown is a mechanism that allows protocols to
      hold an interface down. It was initially introduced in
      the kernel so that a multihoming protocol could hold links down.
      There was also an attempt to introduce a protodown
      reason at the time, but it was rejected. protodown and protodown reason
      are supported by almost every switching and routing platform.
      It was OK for a while to live without a protodown reason,
      but it has become more critical now, given that more than
      one protocol may need to keep a link down on a system
      at the same time, e.g. a vrrp peer node, port security, or a
      multihoming protocol. It's common for network operators and
      protocol developers to look for such a reason on a networking
      box (it's also known as errDisable by most networking operators).
      
      This patch adds support for a link protodown reason
      attribute. There are two ways to maintain protodown
      reasons:
      (a) enumerate every possible reason code in the kernel
          - a protocol developer has to make a request and
            have it appear in a certain kernel version
      (b) provide the bits in the kernel, and allow user space
          (sysadmins or NOS distributions) to manage the
          bit-to-reason-name map
          - this makes extending reason codes easier (kind of like
            the iproute2 table-to-vrf-name map in /etc/iproute2/rt_tables.d/)

      This patch takes approach (b).
      
      A few things about the patch:
      - It treats the protodown reason bits as a counter to indicate
        active protodown users.
      - Since the protodown attribute is already an exposed UAPI,
        the reason is not enforced on a protodown set. It's a no-op
        if not used.
      The patch follows the algorithm below:
        - presence of set reason bits indicates that protodown
          is in use
        - the user can set protodown and protodown reason in a
          single setlink operation or in multiple operations
        - a setlink operation to clear protodown will return -EBUSY
          if there are active protodown reason bits
        - the reason is not included in link dumps if not used
      
      example with patched iproute2:
      $cat /etc/iproute2/protodown_reasons.d/r.conf
      0 mlag
      1 evpn
      2 vrrp
      3 psecurity
      
      $ip link set dev vxlan0 protodown on protodown_reason vrrp on
      $ip link set dev vxlan0 protodown_reason mlag on
      $ip link show
      14: vxlan0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
      DEFAULT group default qlen 1000
          link/ether f6:06:be:17:91:e7 brd ff:ff:ff:ff:ff:ff protodown on <mlag,vrrp>
      
      $ip link set dev vxlan0 protodown_reason mlag off
      $ip link set dev vxlan0 protodown off protodown_reason vrrp off
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      829eb208
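      The reason-bit bookkeeping from the algorithm above can be sketched as a
      small self-contained program (hypothetical structure and function names; the
      kernel keeps these bits in struct net_device and drives them from rtnetlink
      setlink handling):

      #include <errno.h>
      #include <stdbool.h>
      #include <stdio.h>

      struct fake_netdev {
              bool          protodown;
              unsigned long protodown_reason;   /* one bit per user-defined reason */
      };

      /* presence of any reason bit means protodown is in use */
      static bool protodown_in_use(const struct fake_netdev *dev)
      {
              return dev->protodown_reason != 0;
      }

      static void set_reason(struct fake_netdev *dev, unsigned int bit, bool on)
      {
              if (on)
                      dev->protodown_reason |= 1UL << bit;
              else
                      dev->protodown_reason &= ~(1UL << bit);
      }

      static int set_protodown(struct fake_netdev *dev, bool down)
      {
              /* clearing protodown is refused while reason bits are active */
              if (!down && protodown_in_use(dev))
                      return -EBUSY;
              dev->protodown = down;
              return 0;
      }

      int main(void)
      {
              struct fake_netdev dev = { false, 0 };

              set_protodown(&dev, true);
              set_reason(&dev, 2, true);                       /* e.g. bit 2 = vrrp */
              printf("clear while busy: %d\n", set_protodown(&dev, false)); /* -EBUSY */
              set_reason(&dev, 2, false);
              printf("clear when idle: %d\n", set_protodown(&dev, false));  /* 0 */
              return 0;
      }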
    • tcp: add earliest departure time to SCM_TIMESTAMPING_OPT_STATS · 48040793
      Yousuk Seung authored
      This change adds TCP_NLA_EDT to SCM_TIMESTAMPING_OPT_STATS, reporting
      the earliest departure time (EDT) of the timestamped skb. By tracking EDT
      values of the skb at different timestamps, we can observe when and by how
      much the value changed. This makes it possible to measure the precise
      delay injected on the sender host, e.g. by a BPF-based throttler.
      Signed-off-by: Yousuk Seung <ysseung@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      48040793
  7. 31 Jul 2020, 7 commits
  8. 30 Jul 2020, 5 commits
    • PM / devfreq: Add support delayed timer for polling mode · 4dc3bab8
      Chanwoo Choi authored
      Until now, devfreq drivers using polling mode (such as with the
      simple_ondemand governor) have used only a deferrable timer in order to
      reduce redundant power consumption: it avoids CPU wake-ups from idle that
      would otherwise be caused by polling the status of a non-CPU device.

      But this is a problem for a non-CPU device such as a DMC device performing
      DMA operations. Some non-CPU devices need to be monitored continuously,
      regardless of the CPU state, in order to decide the proper next state of
      the device.

      So, add support for a delayed timer in polling mode to allow such
      repetitive monitoring. The devfreq driver and the user can select either
      the deferrable or the delayed timer.

      For example, to change the timer type of the DMC device on an
      Exynos5422-based Odroid-XU3:

      - To use the deferrable timer:
      echo deferrable > /sys/class/devfreq/10c20000.memory-controller/timer

      - To use the delayed timer:
      echo delayed > /sys/class/devfreq/10c20000.memory-controller/timer
      Reviewed-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
      Signed-off-by: Chanwoo Choi <cw00.choi@samsung.com>
      4dc3bab8
    • driver core: add device probe log helper · a787e540
      Andrzej Hajda authored
      During probe, every time a driver gets a resource it usually has to check
      for an error, printk some message if the error is not -EPROBE_DEFER, and
      return the error.  This pattern is simple but requires adding a few lines
      after every resource acquisition; as a result it is often omitted or
      implemented only partially.
      dev_err_probe helps to replace such code sequences with a simple call,
      so code:
              if (err != -EPROBE_DEFER)
                      dev_err(dev, ...);
              return err;
      becomes:
              return dev_err_probe(dev, err, ...);
      Signed-off-by: Andrzej Hajda <a.hajda@samsung.com>
      Reviewed-by: Rafael J. Wysocki <rafael@kernel.org>
      Reviewed-by: Mark Brown <broonie@kernel.org>
      Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
      Link: https://lore.kernel.org/r/20200713144324.23654-2-a.hajda@samsung.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a787e540
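      A short usage sketch in a driver probe path (hedged: the driver, clock and
      message are hypothetical, but the call pattern is the one the commit
      describes, with dev_err_probe() returning the error it was given):

      /* hypothetical driver: grab a clock, staying quiet on -EPROBE_DEFER and
       * printing an error for any other failure */
      static int demo_probe(struct platform_device *pdev)
      {
              struct clk *clk;

              clk = devm_clk_get(&pdev->dev, NULL);
              if (IS_ERR(clk))
                      return dev_err_probe(&pdev->dev, PTR_ERR(clk),
                                           "failed to get clock\n");

              /* ... rest of probe ... */
              return 0;
      }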
    • random32: remove net_rand_state from the latent entropy gcc plugin · 83bdc727
      Linus Torvalds authored
      It turns out that the plugin right now ends up being really unhappy
      about the change from 'static' to 'extern' storage that happened in
      commit f227e3ec ("random32: update the net random state on interrupt
      and activity").
      
      This is probably a trivial fix for the latent_entropy plugin, but for
      now, just remove net_rand_state from the list of things the plugin
      worries about.
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Emese Revfy <re.emese@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Willy Tarreau <w@1wt.eu>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      83bdc727
    • random32: update the net random state on interrupt and activity · f227e3ec
      Willy Tarreau authored
      This modifies the first 32 bits out of the 128 bits of a random CPU's
      net_rand_state on interrupt or CPU activity to complicate remote
      observations that could lead to guessing the network RNG's internal
      state.
      
      Note that depending on some network devices' interrupt rate moderation
      or binding, this re-seeding might happen on every packet or even almost
      never.
      
      In addition, with NOHZ some CPUs might not even get timer interrupts,
      leaving their local state rarely updated, while they are running
      networked processes making use of the random state.  For this reason, we
      also perform this update in update_process_times() in order to at least
      update the state when there is user or system activity, since it's the
      only case we care about.
      Reported-by: Amit Klein <aksecurity@gmail.com>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f227e3ec
    • cpuidle: change enter_s2idle() prototype · efe97112
      Neal Liu authored
      Control Flow Integrity (CFI) is a security mechanism that disallows
      changes to the original control flow graph of a compiled binary,
      making control-flow hijacking attacks significantly harder to perform.

      init_state_node() assigns the same function callback to function
      pointers with different declarations.
      
      static int init_state_node(struct cpuidle_state *idle_state,
                                 const struct of_device_id *matches,
                                 struct device_node *state_node) { ...
              idle_state->enter = match_id->data; ...
              idle_state->enter_s2idle = match_id->data; }
      
      Function declarations:
      
      struct cpuidle_state { ...
              int (*enter) (struct cpuidle_device *dev,
                            struct cpuidle_driver *drv,
                            int index);
      
              void (*enter_s2idle) (struct cpuidle_device *dev,
                                    struct cpuidle_driver *drv,
                                    int index); };
      
      In this case, either enter() or enter_s2idle() would cause a CFI check
      failure, since they use the same callee.

      Align enter_s2idle() with the function prototype of enter(), since enter()
      needs a return value for some use cases. The return value of
      enter_s2idle() is not needed currently.
      Signed-off-by: Neal Liu <neal.liu@mediatek.com>
      Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      efe97112
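      The mismatch can be reproduced in plain C with hypothetical names: two
      function pointer types that differ only in return type end up pointing at
      the same callee, which indirect-call checking such as CFI flags as a type
      violation; giving both declarations the same prototype, as this commit does
      for enter_s2idle(), removes the mismatch.

      #include <stdio.h>

      /* before: the two callback declarations differ in return type */
      typedef int  (*enter_fn)(int index);
      typedef void (*enter_s2idle_fn_old)(int index);

      /* after: both declarations share one prototype */
      typedef int  (*enter_s2idle_fn_new)(int index);

      static int common_callback(int index)
      {
              printf("entering state %d\n", index);
              return index;
      }

      int main(void)
      {
              enter_fn enter = common_callback;

              /* old scheme: assigning the same callee needs a cast, and the
               * indirect call through the mismatched type trips CFI:
               *   enter_s2idle_fn_old s2idle = (enter_s2idle_fn_old)common_callback;
               */
              enter_s2idle_fn_new s2idle = common_callback;   /* types now match */

              enter(1);
              s2idle(2);
              return 0;
      }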