“182ce51c6d73d98420aa91d998a328503eac538d”上不存在“paddle/fluid/operators/optimizers/sgd_op.cc”
  1. 17 7月, 2019 20 次提交
  2. 16 7月, 2019 2 次提交
  3. 15 7月, 2019 1 次提交
  4. 13 7月, 2019 17 次提交
    • A
      perf/core: Fix exclusive events' grouping · 8a58ddae
      Alexander Shishkin 提交于
      So far, we tried to disallow grouping exclusive events for the fear of
      complications they would cause with moving between contexts. Specifically,
      moving a software group to a hardware context would violate the exclusivity
      rules if both groups contain matching exclusive events.
      
      This attempt was, however, unsuccessful: the check that we have in the
      perf_event_open() syscall is both wrong (looks at wrong PMU) and
      insufficient (group leader may still be exclusive), as can be illustrated
      by running:
      
        $ perf record -e '{intel_pt//,cycles}' uname
        $ perf record -e '{cycles,intel_pt//}' uname
      
      ultimately successfully.
      
      Furthermore, we are completely free to trigger the exclusivity violation
      by:
      
         perf -e '{cycles,intel_pt//}' -e '{intel_pt//,instructions}'
      
      even though the helpful perf record will not allow that, the ABI will.
      
      The warning later in the perf_event_open() path will also not trigger, because
      it's also wrong.
      
      Fix all this by validating the original group before moving, getting rid
      of broken safeguards and placing a useful one to perf_install_in_context().
      Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: mathieu.poirier@linaro.org
      Cc: will.deacon@arm.com
      Fixes: bed5b25a ("perf: Add a pmu capability for "exclusive" events")
      Link: https://lkml.kernel.org/r/20190701110755.24646-1-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      8a58ddae
    • D
      net: phy: make exported variables non-static · 54638c6e
      Denis Efremov 提交于
      The variables phy_basic_ports_array, phy_fibre_port_array and
      phy_all_ports_features_array are declared static and marked
      EXPORT_SYMBOL_GPL(), which is at best an odd combination.
      Because the variables were decided to be a part of API, this commit
      removes the static attributes and adds the declarations to the header.
      
      Fixes: 3c1bcc86 ("net: ethernet: Convert phydev advertize and supported from u32 to link mode")
      Signed-off-by: NDenis Efremov <efremov@linux.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      54638c6e
    • S
      oom: decouple mems_allowed from oom_unkillable_task · ac311a14
      Shakeel Butt 提交于
      Commit ef08e3b4 ("[PATCH] cpusets: confine oom_killer to
      mem_exclusive cpuset") introduces a heuristic where a potential
      oom-killer victim is skipped if the intersection of the potential victim
      and the current (the process triggered the oom) is empty based on the
      reason that killing such victim most probably will not help the current
      allocating process.
      
      However the commit 7887a3da ("[PATCH] oom: cpuset hint") changed the
      heuristic to just decrease the oom_badness scores of such potential
      victim based on the reason that the cpuset of such processes might have
      changed and previously they may have allocated memory on mems where the
      current allocating process can allocate from.
      
      Unintentionally 7887a3da ("[PATCH] oom: cpuset hint") introduced a
      side effect as the oom_badness is also exposed to the user space through
      /proc/[pid]/oom_score, so, readers with different cpusets can read
      different oom_score of the same process.
      
      Later, commit 6cf86ac6 ("oom: filter tasks not sharing the same
      cpuset") fixed the side effect introduced by 7887a3da by moving the
      cpuset intersection back to only oom-killer context and out of
      oom_badness.  However the combination of ab290adb ("oom: make
      oom_unkillable_task() helper function") and 26ebc984 ("oom:
      /proc/<pid>/oom_score treat kernel thread honestly") unintentionally
      brought back the cpuset intersection check into the oom_badness
      calculation function.
      
      Other than doing cpuset/mempolicy intersection from oom_badness, the memcg
      oom context is also doing cpuset/mempolicy intersection which is quite
      wrong and is caught by syzcaller with the following report:
      
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] PREEMPT SMP KASAN
      CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline]
      RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline]
      RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline]
      RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155
      Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00
      00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f
      85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff
      RSP: 0018:ffff888000127490 EFLAGS: 00010a03
      RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c
      RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001
      RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0
      R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007
      R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6
      FS:  00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
      Call Trace:
        oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321
        mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169
        select_bad_process mm/oom_kill.c:374 [inline]
        out_of_memory mm/oom_kill.c:1088 [inline]
        out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035
        mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573
        mem_cgroup_oom mm/memcontrol.c:1905 [inline]
        try_charge+0xfbe/0x1480 mm/memcontrol.c:2468
        mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073
        mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088
        do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201
        do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359
        wp_huge_pmd mm/memory.c:3793 [inline]
        __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006
        handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053
        do_user_addr_fault arch/x86/mm/fault.c:1455 [inline]
        __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521
        do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552
        page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156
      RIP: 0033:0x400590
      Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48
      8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01
      00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b
      RSP: 002b:00007fff7bc49780 EFLAGS: 00010206
      RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001
      RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008
      R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0
      Modules linked in:
      ---[ end trace a65689219582ffff ]---
      RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline]
      RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline]
      RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline]
      RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155
      Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00
      00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f
      85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff
      RSP: 0018:ffff888000127490 EFLAGS: 00010a03
      RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c
      RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001
      RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0
      R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007
      R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6
      FS:  00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
      
      The fix is to decouple the cpuset/mempolicy intersection check from
      oom_unkillable_task() and make sure cpuset/mempolicy intersection check is
      only done in the global oom context.
      
      [shakeelb@google.com: change function name and update comment]
        Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com>
      Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com
      Acked-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ac311a14
    • S
      mm, oom: remove redundant task_in_mem_cgroup() check · 6ba749ee
      Shakeel Butt 提交于
      oom_unkillable_task() can be called from three different contexts i.e.
      global OOM, memcg OOM and oom_score procfs interface.  At the moment
      oom_unkillable_task() does a task_in_mem_cgroup() check on the given
      process.  Since there is no reason to perform task_in_mem_cgroup()
      check for global OOM and oom_score procfs interface, those contexts
      provide NULL memcg and skips the task_in_mem_cgroup() check.  However
      for memcg OOM context, the oom_unkillable_task() is always called from
      mem_cgroup_scan_tasks() and thus task_in_mem_cgroup() check becomes
      redundant and effectively dead code.  So, just remove the
      task_in_mem_cgroup() check altogether.
      
      Link: http://lkml.kernel.org/r/20190624212631.87212-2-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com>
      Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6ba749ee
    • R
      mm: vmalloc: show number of vmalloc pages in /proc/meminfo · 97105f0a
      Roman Gushchin 提交于
      Vmalloc() is getting more and more used these days (kernel stacks, bpf and
      percpu allocator are new top users), and the total % of memory consumed by
      vmalloc() can be pretty significant and changes dynamically.
      
      /proc/meminfo is the best place to display this information: its top goal
      is to show top consumers of the memory.
      
      Since the VmallocUsed field in /proc/meminfo is not in use for quite a
      long time (it has been defined to 0 by a5ad88ce ("mm: get rid of
      'vmalloc_info' from /proc/meminfo")), let's reuse it for showing the
      actual physical memory consumption of vmalloc().
      
      Link: http://lkml.kernel.org/r/20190417194002.12369-3-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      97105f0a
    • A
      mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options · 6471384a
      Alexander Potapenko 提交于
      Patch series "add init_on_alloc/init_on_free boot options", v10.
      
      Provide init_on_alloc and init_on_free boot options.
      
      These are aimed at preventing possible information leaks and making the
      control-flow bugs that depend on uninitialized values more deterministic.
      
      Enabling either of the options guarantees that the memory returned by the
      page allocator and SL[AU]B is initialized with zeroes.  SLOB allocator
      isn't supported at the moment, as its emulation of kmem caches complicates
      handling of SLAB_TYPESAFE_BY_RCU caches correctly.
      
      Enabling init_on_free also guarantees that pages and heap objects are
      initialized right after they're freed, so it won't be possible to access
      stale data by using a dangling pointer.
      
      As suggested by Michal Hocko, right now we don't let the heap users to
      disable initialization for certain allocations.  There's not enough
      evidence that doing so can speed up real-life cases, and introducing ways
      to opt-out may result in things going out of control.
      
      This patch (of 2):
      
      The new options are needed to prevent possible information leaks and make
      control-flow bugs that depend on uninitialized values more deterministic.
      
      This is expected to be on-by-default on Android and Chrome OS.  And it
      gives the opportunity for anyone else to use it under distros too via the
      boot args.  (The init_on_free feature is regularly requested by folks
      where memory forensics is included in their threat models.)
      
      init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
      objects with zeroes.  Initialization is done at allocation time at the
      places where checks for __GFP_ZERO are performed.
      
      init_on_free=1 makes the kernel initialize freed pages and heap objects
      with zeroes upon their deletion.  This helps to ensure sensitive data
      doesn't leak via use-after-free accesses.
      
      Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator
      returns zeroed memory.  The two exceptions are slab caches with
      constructors and SLAB_TYPESAFE_BY_RCU flag.  Those are never
      zero-initialized to preserve their semantics.
      
      Both init_on_alloc and init_on_free default to zero, but those defaults
      can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and
      CONFIG_INIT_ON_FREE_DEFAULT_ON.
      
      If either SLUB poisoning or page poisoning is enabled, those options take
      precedence over init_on_alloc and init_on_free: initialization is only
      applied to unpoisoned allocations.
      
      Slowdown for the new features compared to init_on_free=0, init_on_alloc=0:
      
      hackbench, init_on_free=1:  +7.62% sys time (st.err 0.74%)
      hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%)
      
      Linux build with -j12, init_on_free=1:  +8.38% wall time (st.err 0.39%)
      Linux build with -j12, init_on_free=1:  +24.42% sys time (st.err 0.52%)
      Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%)
      Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%)
      
      The slowdown for init_on_free=0, init_on_alloc=0 compared to the baseline
      is within the standard error.
      
      The new features are also going to pave the way for hardware memory
      tagging (e.g.  arm64's MTE), which will require both on_alloc and on_free
      hooks to set the tags for heap objects.  With MTE, tagging will have the
      same cost as memory initialization.
      
      Although init_on_free is rather costly, there are paranoid use-cases where
      in-memory data lifetime is desired to be minimized.  There are various
      arguments for/against the realism of the associated threat models, but
      given that we'll need the infrastructure for MTE anyway, and there are
      people who want wipe-on-free behavior no matter what the performance cost,
      it seems reasonable to include it in this series.
      
      [glider@google.com: v8]
        Link: http://lkml.kernel.org/r/20190626121943.131390-2-glider@google.com
      [glider@google.com: v9]
        Link: http://lkml.kernel.org/r/20190627130316.254309-2-glider@google.com
      [glider@google.com: v10]
        Link: http://lkml.kernel.org/r/20190628093131.199499-2-glider@google.com
      Link: http://lkml.kernel.org/r/20190617151050.92663-2-glider@google.comSigned-off-by: NAlexander Potapenko <glider@google.com>
      Acked-by: NKees Cook <keescook@chromium.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>		[page and dmapool parts
      Acked-by: James Morris <jamorris@linux.microsoft.com>]
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Sandeep Patil <sspatil@android.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6471384a
    • A
      mm/pgtable: drop pgtable_t variable from pte_fn_t functions · 8b1e0f81
      Anshuman Khandual 提交于
      Drop the pgtable_t variable from all implementation for pte_fn_t as none
      of them use it.  apply_to_pte_range() should stop computing it as well.
      Should help us save some cycles.
      
      Link: http://lkml.kernel.org/r/1556803126-26596-1-git-send-email-anshuman.khandual@arm.comSigned-off-by: NAnshuman Khandual <anshuman.khandual@arm.com>
      Acked-by: NMatthew Wilcox <willy@infradead.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <jglisse@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8b1e0f81
    • C
      mm: move the powerpc hugepd code to mm/gup.c · cbd34da7
      Christoph Hellwig 提交于
      While only powerpc supports the hugepd case, the code is pretty generic
      and I'd like to keep all GUP internals in one place.
      
      Link: http://lkml.kernel.org/r/20190625143715.1689-15-hch@lst.deSigned-off-by: NChristoph Hellwig <hch@lst.de>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cbd34da7
    • W
      mm, memcg: add a memcg_slabinfo debugfs file · fcf8a1e4
      Waiman Long 提交于
      There are concerns about memory leaks from extensive use of memory cgroups
      as each memory cgroup creates its own set of kmem caches.  There is a
      possiblity that the memcg kmem caches may remain even after the memory
      cgroups have been offlined.  Therefore, it will be useful to show the
      status of each of memcg kmem caches.
      
      This patch introduces a new <debugfs>/memcg_slabinfo file which is
      somewhat similar to /proc/slabinfo in format, but lists only information
      about kmem caches that have child memcg kmem caches.  Information
      available in /proc/slabinfo are not repeated in memcg_slabinfo.
      
      A portion of a sample output of the file was:
      
        # <name> <css_id[:dead]> <active_objs> <num_objs> <active_slabs> <num_slabs>
        rpc_inode_cache   root          13     51      1      1
        rpc_inode_cache     48           0      0      0      0
        fat_inode_cache   root           1     45      1      1
        fat_inode_cache     41           2     45      1      1
        xfs_inode         root         770    816     24     24
        xfs_inode           92          22     34      1      1
        xfs_inode           88:dead      1     34      1      1
        xfs_inode           89:dead     23     34      1      1
        xfs_inode           85           4     34      1      1
        xfs_inode           84           9     34      1      1
      
      The css id of the memcg is also listed. If a memcg is not online,
      the tag ":dead" will be attached as shown above.
      
      [longman@redhat.com: memcg: add ":deact" tag for reparented kmem caches in memcg_slabinfo]
        Link: http://lkml.kernel.org/r/20190621173005.31514-1-longman@redhat.com
      [longman@redhat.com: set the flag in the common code as suggested by Roman]
        Link: http://lkml.kernel.org/r/20190627184324.5875-1-longman@redhat.com
      Link: http://lkml.kernel.org/r/20190619171621.26209-1-longman@redhat.comSigned-off-by: NWaiman Long <longman@redhat.com>
      Suggested-by: NShakeel Butt <shakeelb@google.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fcf8a1e4
    • R
      mm: memcg/slab: reparent memcg kmem_caches on cgroup removal · fb2f2b0a
      Roman Gushchin 提交于
      Let's reparent non-root kmem_caches on memcg offlining.  This allows us to
      release the memory cgroup without waiting for the last outstanding kernel
      object (e.g.  dentry used by another application).
      
      Since the parent cgroup is already charged, everything we need to do is to
      splice the list of kmem_caches to the parent's kmem_caches list, swap the
      memcg pointer, drop the css refcounter for each kmem_cache and adjust the
      parent's css refcounter.
      
      Please, note that kmem_cache->memcg_params.memcg isn't a stable pointer
      anymore.  It's safe to read it under rcu_read_lock(), cgroup_mutex held,
      or any other way that protects the memory cgroup from being released.
      
      We can race with the slab allocation and deallocation paths.  It's not a
      big problem: parent's charge and slab global stats are always correct, and
      we don't care anymore about the child usage and global stats.  The child
      cgroup is already offline, so we don't use or show it anywhere.
      
      Local slab stats (NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE) aren't
      used anywhere except count_shadow_nodes().  But even there it won't break
      anything: after reparenting "nodes" will be 0 on child level (because
      we're already reparenting shrinker lists), and on parent level page stats
      always were 0, and this patch won't change anything.
      
      [guro@fb.com: properly handle kmem_caches reparented to root_mem_cgroup]
        Link: http://lkml.kernel.org/r/20190620213427.1691847-1-guro@fb.com
      Link: http://lkml.kernel.org/r/20190611231813.3148843-11-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fb2f2b0a
    • R
      mm: memcg/slab: rework non-root kmem_cache lifecycle management · f0a3a24b
      Roman Gushchin 提交于
      Currently each charged slab page holds a reference to the cgroup to which
      it's charged.  Kmem_caches are held by the memcg and are released all
      together with the memory cgroup.  It means that none of kmem_caches are
      released unless at least one reference to the memcg exists, which is very
      far from optimal.
      
      Let's rework it in a way that allows releasing individual kmem_caches as
      soon as the cgroup is offline, the kmem_cache is empty and there are no
      pending allocations.
      
      To make it possible, let's introduce a new percpu refcounter for non-root
      kmem caches.  The counter is initialized to the percpu mode, and is
      switched to the atomic mode during kmem_cache deactivation.  The counter
      is bumped for every charged page and also for every running allocation.
      So the kmem_cache can't be released unless all allocations complete.
      
      To shutdown non-active empty kmem_caches, let's reuse the work queue,
      previously used for the kmem_cache deactivation.  Once the reference
      counter reaches 0, let's schedule an asynchronous kmem_cache release.
      
      * I used the following simple approach to test the performance
      (stolen from another patchset by T. Harding):
      
          time find / -name fname-no-exist
          echo 2 > /proc/sys/vm/drop_caches
          repeat 10 times
      
      Results:
      
              orig		patched
      
      real	0m1.455s	real	0m1.355s
      user	0m0.206s	user	0m0.219s
      sys	0m0.855s	sys	0m0.807s
      
      real	0m1.487s	real	0m1.699s
      user	0m0.221s	user	0m0.256s
      sys	0m0.806s	sys	0m0.948s
      
      real	0m1.515s	real	0m1.505s
      user	0m0.183s	user	0m0.215s
      sys	0m0.876s	sys	0m0.858s
      
      real	0m1.291s	real	0m1.380s
      user	0m0.193s	user	0m0.198s
      sys	0m0.843s	sys	0m0.786s
      
      real	0m1.364s	real	0m1.374s
      user	0m0.180s	user	0m0.182s
      sys	0m0.868s	sys	0m0.806s
      
      real	0m1.352s	real	0m1.312s
      user	0m0.201s	user	0m0.212s
      sys	0m0.820s	sys	0m0.761s
      
      real	0m1.302s	real	0m1.349s
      user	0m0.205s	user	0m0.203s
      sys	0m0.803s	sys	0m0.792s
      
      real	0m1.334s	real	0m1.301s
      user	0m0.194s	user	0m0.201s
      sys	0m0.806s	sys	0m0.779s
      
      real	0m1.426s	real	0m1.434s
      user	0m0.216s	user	0m0.181s
      sys	0m0.824s	sys	0m0.864s
      
      real	0m1.350s	real	0m1.295s
      user	0m0.200s	user	0m0.190s
      sys	0m0.842s	sys	0m0.811s
      
      So it looks like the difference is not noticeable in this test.
      
      [cai@lca.pw: fix an use-after-free in kmemcg_workfn()]
        Link: http://lkml.kernel.org/r/1560977573-10715-1-git-send-email-cai@lca.pw
      Link: http://lkml.kernel.org/r/20190611231813.3148843-9-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com>
      Signed-off-by: NQian Cai <cai@lca.pw>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f0a3a24b
    • R
      mm: memcg/slab: introduce __memcg_kmem_uncharge_memcg() · 49a18eae
      Roman Gushchin 提交于
      Let's separate the page counter modification code out of
      __memcg_kmem_uncharge() in a way similar to what
      __memcg_kmem_charge() and __memcg_kmem_charge_memcg() work.
      
      This will allow to reuse this code later using a new
      memcg_kmem_uncharge_memcg() wrapper, which calls
      __memcg_kmem_uncharge_memcg() if memcg_kmem_enabled()
      check is passed.
      
      Link: http://lkml.kernel.org/r/20190611231813.3148843-5-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      49a18eae
    • R
      mm: memcg/slab: rename slab delayed deactivation functions and fields · 0b14e8aa
      Roman Gushchin 提交于
      The delayed work/rcu deactivation infrastructure of non-root kmem_caches
      can be also used for asynchronous release of these objects.  Let's get rid
      of the word "deactivation" in corresponding names to make the code look
      better after generalization.
      
      It's easier to make the renaming first, so that the generalized code will
      look consistent from scratch.
      
      Let's rename struct memcg_cache_params fields:
        deact_fn -> work_fn
        deact_rcu_head -> rcu_head
        deact_work -> work
      
      And RCU/delayed work callbacks in slab common code:
        kmemcg_deactivate_rcufn -> kmemcg_rcufn
        kmemcg_deactivate_workfn -> kmemcg_workfn
      
      This patch contains no functional changes, only renamings.
      
      Link: http://lkml.kernel.org/r/20190611231813.3148843-3-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0b14e8aa
    • S
      mm, memcg: introduce memory.events.local · 1e577f97
      Shakeel Butt 提交于
      The memory controller in cgroup v2 exposes memory.events file for each
      memcg which shows the number of times events like low, high, max, oom
      and oom_kill have happened for the whole tree rooted at that memcg.
      Users can also poll or register notification to monitor the changes in
      that file.  Any event at any level of the tree rooted at memcg will
      notify all the listeners along the path till root_mem_cgroup.  There are
      existing users which depend on this behavior.
      
      However there are users which are only interested in the events
      happening at a specific level of the memcg tree and not in the events in
      the underlying tree rooted at that memcg.  One such use-case is a
      centralized resource monitor which can dynamically adjust the limits of
      the jobs running on a system.  The jobs can create their sub-hierarchy
      for their own sub-tasks.  The centralized monitor is only interested in
      the events at the top level memcgs of the jobs as it can then act and
      adjust the limits of the jobs.  Using the current memory.events for such
      centralized monitor is very inconvenient.  The monitor will keep
      receiving events which it is not interested and to find if the received
      event is interesting, it has to read memory.event files of the next
      level and compare it with the top level one.  So, let's introduce
      memory.events.local to the memcg which shows and notify for the events
      at the memcg level.
      
      Now, does memory.stat and memory.pressure need their local versions.  IMHO
      no due to the no internal process contraint of the cgroup v2.  The
      memory.stat file of the top level memcg of a job shows the stats and
      vmevents of the whole tree.  The local stats or vmevents of the top level
      memcg will only change if there is a process running in that memcg but v2
      does not allow that.  Similarly for memory.pressure there will not be any
      process in the internal nodes and thus no chance of local pressure.
      
      Link: http://lkml.kernel.org/r/20190527174643.209172-1-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com>
      Reviewed-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Chris Down <chris@chrisdown.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1e577f97
    • A
      mm, swap: use rbtree for swap_extent · 4efaceb1
      Aaron Lu 提交于
      swap_extent is used to map swap page offset to backing device's block
      offset.  For a continuous block range, one swap_extent is used and all
      these swap_extents are managed in a linked list.
      
      These swap_extents are used by map_swap_entry() during swap's read and
      write path.  To find out the backing device's block offset for a page
      offset, the swap_extent list will be traversed linearly, with
      curr_swap_extent being used as a cache to speed up the search.
      
      This works well as long as swap_extents are not huge or when the number
      of processes that access swap device are few, but when the swap device
      has many extents and there are a number of processes accessing the swap
      device concurrently, it can be a problem.  On one of our servers, the
      disk's remaining size is tight:
      
        $df -h
        Filesystem      Size  Used Avail Use% Mounted on
        ... ...
        /dev/nvme0n1p1  1.8T  1.3T  504G  72% /home/t4
      
      When creating a 80G swapfile there, there are as many as 84656 swap
      extents.  The end result is, kernel spends abou 30% time in
      map_swap_entry() and swap throughput is only 70MB/s.
      
      As a comparison, when I used smaller sized swapfile, like 4G whose
      swap_extent dropped to 2000, swap throughput is back to 400-500MB/s and
      map_swap_entry() is about 3%.
      
      One downside of using rbtree for swap_extent is, 'struct rbtree' takes
      24 bytes while 'struct list_head' takes 16 bytes, that's 8 bytes more
      for each swap_extent.  For a swapfile that has 80k swap_extents, that
      means 625KiB more memory consumed.
      
      Test:
      
      Since it's not possible to reboot that server, I can not test this patch
      diretly there.  Instead, I tested it on another server with NVMe disk.
      
      I created a 20G swapfile on an NVMe backed XFS fs.  By default, the
      filesystem is quite clean and the created swapfile has only 2 extents.
      Testing vanilla and this patch shows no obvious performance difference
      when swapfile is not fragmented.
      
      To see the patch's effects, I used some tweaks to manually fragment the
      swapfile by breaking the extent at 1M boundary.  This made the swapfile
      have 20K extents.
      
        nr_task=4
        kernel   swapout(KB/s) map_swap_entry(perf)  swapin(KB/s) map_swap_entry(perf)
        vanilla  165191           90.77%             171798          90.21%
        patched  858993 +420%      2.16%             715827 +317%     0.77%
      
        nr_task=8
        kernel   swapout(KB/s) map_swap_entry(perf)  swapin(KB/s) map_swap_entry(perf)
        vanilla  306783           92.19%             318145          87.76%
        patched  954437 +211%      2.35%            1073741 +237%     1.57%
      
      swapout: the throughput of swap out, in KB/s, higher is better 1st
      map_swap_entry: cpu cycles percent sampled by perf swapin: the
      throughput of swap in, in KB/s, higher is better.  2nd map_swap_entry:
      cpu cycles percent sampled by perf
      
      nr_task=1 doesn't show any difference, this is due to the curr_swap_extent
      can be effectively used to cache the correct swap extent for single task
      workload.
      
      [akpm@linux-foundation.org: s/BUG_ON(1)/BUG()/]
      Link: http://lkml.kernel.org/r/20190523142404.GA181@aaronluSigned-off-by: NAaron Lu <ziqian.lzq@antfin.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4efaceb1
    • H
      mm, swap: fix race between swapoff and some swap operations · eb085574
      Huang Ying 提交于
      When swapin is performed, after getting the swap entry information from
      the page table, system will swap in the swap entry, without any lock held
      to prevent the swap device from being swapoff.  This may cause the race
      like below,
      
      CPU 1				CPU 2
      -----				-----
      				do_swap_page
      				  swapin_readahead
      				    __read_swap_cache_async
      swapoff				      swapcache_prepare
        p->swap_map = NULL		        __swap_duplicate
      					  p->swap_map[?] /* !!! NULL pointer access */
      
      Because swapoff is usually done when system shutdown only, the race may
      not hit many people in practice.  But it is still a race need to be fixed.
      
      To fix the race, get_swap_device() is added to check whether the specified
      swap entry is valid in its swap device.  If so, it will keep the swap
      entry valid via preventing the swap device from being swapoff, until
      put_swap_device() is called.
      
      Because swapoff() is very rare code path, to make the normal path runs as
      fast as possible, rcu_read_lock/unlock() and synchronize_rcu() instead of
      reference count is used to implement get/put_swap_device().  >From
      get_swap_device() to put_swap_device(), RCU reader side is locked, so
      synchronize_rcu() in swapoff() will wait until put_swap_device() is
      called.
      
      In addition to swap_map, cluster_info, etc.  data structure in the struct
      swap_info_struct, the swap cache radix tree will be freed after swapoff,
      so this patch fixes the race between swap cache looking up and swapoff
      too.
      
      Races between some other swap cache usages and swapoff are fixed too via
      calling synchronize_rcu() between clearing PageSwapCache() and freeing
      swap cache data structure.
      
      Another possible method to fix this is to use preempt_off() +
      stop_machine() to prevent the swap device from being swapoff when its data
      structure is being accessed.  The overhead in hot-path of both methods is
      similar.  The advantages of RCU based method are,
      
      1. stop_machine() may disturb the normal execution code path on other
         CPUs.
      
      2. File cache uses RCU to protect its radix tree.  If the similar
         mechanism is used for swap cache too, it is easier to share code
         between them.
      
      3. RCU is used to protect swap cache in total_swapcache_pages() and
         exit_swap_address_space() already.  The two mechanisms can be
         merged to simplify the logic.
      
      Link: http://lkml.kernel.org/r/20190522015423.14418-1-ying.huang@intel.com
      Fixes: 235b6217 ("mm/swap: add cluster lock")
      Signed-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: NAndrea Parri <andrea.parri@amarulasolutions.com>
      Not-nacked-by: NHugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eb085574
    • C
      mm/filemap: don't cast ->readpage to filler_t for do_read_cache_page · 6c45b454
      Christoph Hellwig 提交于
      We can just pass a NULL filler and do the right thing inside of
      do_read_cache_page based on the NULL parameter.
      
      Link: http://lkml.kernel.org/r/20190520055731.24538-3-hch@lst.deSigned-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6c45b454