1. 19 October 2020 (1 commit)
    • mm/madvise: introduce process_madvise() syscall: an external memory hinting API · ecb8ac8b
      Minchan Kim committed
      There is a use case in which System Management Software (SMS) wants to
      give a memory hint like MADV_[COLD|PAGEOUT] to other processes; in the
      case of Android, that software is the ActivityManagerService.
      
      The information required to make the reclaim decision is not known to the
      app.  Instead, it is known to the centralized userspace
      daemon(ActivityManagerService), and that daemon must be able to initiate
      reclaim on its own without any app involvement.
      
      To solve the issue, this patch introduces a new syscall,
      process_madvise(2).  It uses the pidfd of an external process to give
      the hint.  It also supports vectored address ranges, because an Android
      app has thousands of VMAs due to zygote, so it would waste CPU and
      power to call the syscall once per VMA.  (A test of 2000 single-VMA
      syscalls vs. one vectored syscall showed a 15% performance improvement;
      the gain would likely be bigger in practice because the test ran in a
      very cache-friendly environment.)
      
      Another potential use case for the vectored range is to amortize the
      cost of TLB shootdowns for multiple ranges when using MADV_DONTNEED;
      this could benefit users like TCP receive zerocopy and malloc
      implementations.  In the future we may find more use cases for other
      advice values, so let's make this a general API while we are
      introducing a new syscall anyway.  With that, existing madvise(2) users
      can replace it with process_madvise(2) on their own pid if they want
      batched address-range support.
      
      Since it can affect another process's address range, only a process
      with the right to ptrace the target, whether privileged
      (PTRACE_MODE_ATTACH_FSCREDS) or otherwise entitled (e.g., having the
      same UID), can use it successfully.
      The flag argument is reserved for future use if we need to extend the API.
      
      Supporting in process_madvise every hint that madvise has or will have
      is rather risky: we are not sure all hints make sense coming from an
      external process, and the implementation of a hint may rely on the
      caller being in the current context, so it could be error-prone.  Thus,
      this patch limits the hints to MADV_[COLD|PAGEOUT].
      
      If someone wants to add other hints, we can hear the use case and
      review each hint individually.  That is safer for maintenance than
      introducing a syscall that turns out to be buggy and hard to fix later.
      
      So, finally, the API is as follows:
      
            ssize_t process_madvise(int pidfd, const struct iovec *iovec,
                      unsigned long vlen, int advice, unsigned int flags);
      
          DESCRIPTION
            The process_madvise() system call is used to give advice or
            directions to the kernel about the address ranges of an external
            process as well as of the local process.  It applies the advice to
            the address ranges of the process described by iovec and vlen.
            The goal of such advice is to improve system or application
            performance.
      
            The pidfd argument selects the process referred to by the PID file
            descriptor specified in pidfd.  (See pidfd_open(2) for further
            information.)
      
            The pointer iovec points to an array of iovec structures, defined in
            <sys/uio.h> as:
      
              struct iovec {
                  void *iov_base;         /* starting address */
                  size_t iov_len;         /* number of bytes to be advised */
              };
      
            Each iovec element describes an address range beginning at
            iov_base and spanning iov_len bytes.
      
            The vlen represents the number of elements in iovec.
      
            The advice is indicated in the advice argument, which at this
            moment must be one of the following if the target process
            specified by pidfd is external:
      
              MADV_COLD
              MADV_PAGEOUT
      
            Permission to provide a hint to an external process is governed by
            a ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see
            ptrace(2).
      
            process_madvise() supports every advice that madvise(2) has if the
            target process is in the same thread group as the calling process,
            so a user can treat process_madvise(2) as an extension of
            madvise(2) that supports vectored address ranges.
      
          RETURN VALUE
            On success, process_madvise() returns the number of bytes advised.
            This return value may be less than the total number of requested
            bytes if an error occurred.  The caller should check the return
            value to determine whether partial advice occurred.
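
      A minimal userspace sketch of calling the new API (hedged: the syscall
      numbers below are the generic ones, the target pid and the address
      ranges are placeholders, and error handling is trimmed):

            #define _GNU_SOURCE
            #include <stdio.h>
            #include <sys/mman.h>
            #include <sys/syscall.h>
            #include <sys/uio.h>
            #include <unistd.h>

            #ifndef __NR_pidfd_open
            #define __NR_pidfd_open 434             /* asm-generic number */
            #endif
            #ifndef __NR_process_madvise
            #define __NR_process_madvise 440        /* asm-generic number */
            #endif
            #ifndef MADV_COLD
            #define MADV_COLD 20
            #endif

            int main(void)
            {
                    pid_t target = 1234;    /* placeholder: pid of the app */
                    int pidfd = syscall(__NR_pidfd_open, target, 0);
                    struct iovec vec[2] = {
                            /* placeholder ranges in the target's address space */
                            { .iov_base = (void *)0x7f0000000000UL, .iov_len = 4096 },
                            { .iov_base = (void *)0x7f0000100000UL, .iov_len = 8192 },
                    };
                    ssize_t ret = syscall(__NR_process_madvise, pidfd, vec, 2,
                                          MADV_COLD, 0);

                    if (ret < 0)
                            perror("process_madvise");
                    else
                            printf("advised %zd bytes\n", ret);
                    return 0;
            }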
      
      FAQ:
      
      Q.1 - Why does any external entity have better knowledge?
      
      Quote from Sandeep
      
      "For Android, every application (including the special SystemServer)
      is forked from Zygote.  The reason of course is to share as many
      libraries and classes between the two as possible to benefit from the
      preloading during boot.
      
      After applications start, (almost) all of the APIs end up calling into
      this SystemServer process over IPC (binder) and back to the
      application.
      
      In a fully running system, the SystemServer monitors every single
      process periodically to calculate their PSS / RSS and also decides
      which process is "important" to the user for interactivity.
      
      So, because of how these processes start _and_ the fact that the
      SystemServer is looping to monitor each process, it does tend to *know*
      which address range of the application is not used / useful.
      
      Besides, we can never rely on applications to clean things up
      themselves.  We've had the "hey app1, the system is low on memory,
      please trim your memory usage down" notifications for a long time[1].
      They rely on applications honoring the broadcasts and very few do.
      
      So, if we want to avoid the inevitable killing of the application and
      restarting it, some way to be able to tell the OS about unimportant
      memory in these applications will be useful.
      
      - ssp
      
      Q.2 - How is the race (i.e., object validation) handled between the
      time an external process gives a hint and the time the target
      process's address space can change?
      
      process_madvise operates on the target process's address space as it
      exists at the instant that process_madvise is called.  If the target
      process can run between the time the calling process inspects the
      target process's address space and the time that process_madvise is
      actually called, process_madvise may operate on memory regions that
      the calling process does not expect.  It's the responsibility of the
      process calling process_madvise to close this race condition.  For
      example, the calling process can suspend the target process with
      ptrace, SIGSTOP, or the freezer cgroup so that it doesn't have an
      opportunity to change its own address space before process_madvise is
      called.  Another option is to operate on memory regions that the
      caller knows a priori will be unchanged in the target process.  Yet
      another option is to accept the race for certain process_madvise calls
      after reasoning that mistargeting will do no harm.  The suggested API
      itself does not provide synchronization; the same applies to other
      APIs such as move_pages(2) and process_vm_writev(2).
      
      The race isn't really a problem, though.  Why is it so wrong to
      require that callers do their own synchronization in some manner?
      Nobody objects to write(2) merely because it's possible for two
      processes to open the same file and clobber each other's writes ---
      instead, we tell people to use flock or something.  Think about mmap:
      it never guarantees that newly allocated address space is still valid
      when the user tries to access it, because other threads could unmap
      the memory right before.  That's where we need synchronization via
      another API or by design on the user side; it shouldn't be part of the
      API itself.  If someone needs synchronization more fine-grained than
      process level, two ideas were suggested - a cookie[2] and an anon
      fd[3].  Both are applicable via the last reserved argument of the API,
      but I don't think they are necessary right now, since we already have
      ways to prevent the race, and I don't want to add complexity for a
      more fine-grained optimization model.
      
      To keep the API extensible, the last argument is reserved, so we could
      support such a scheme in the future if someone really needs it.
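
      For instance, a sketch of the SIGSTOP approach mentioned above (the
      helper name is hypothetical; pidfd is assumed to have been obtained
      via pidfd_open(2) for target, and __NR_process_madvise as in the
      earlier sketch):

            #include <signal.h>
            #include <sys/syscall.h>
            #include <sys/uio.h>
            #include <unistd.h>

            #ifndef __NR_process_madvise
            #define __NR_process_madvise 440        /* asm-generic number */
            #endif

            /* Close the inspect/advise race by stopping the target around
             * the call. */
            static long advise_stopped(int pidfd, pid_t target,
                                       const struct iovec *vec, size_t vlen,
                                       int advice)
            {
                    long ret;

                    kill(target, SIGSTOP);  /* target cannot remap while stopped */
                    ret = syscall(__NR_process_madvise, pidfd, vec, vlen,
                                  advice, 0);
                    kill(target, SIGCONT);  /* let it run again */
                    return ret;
            }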
      
      Q.3 - Why doesn't ptrace work?
      
      Injecting an madvise call into the target process using ptrace would
      not work for us, because the injected madvise would have to be
      executed by the target process, which means that process would have to
      be runnable; that creates the risk of the above-mentioned race and of
      hinting a wrong VMA.  Furthermore, we want the hint to be acted on in
      the caller's context, not the callee's, because the callee is usually
      limited by cpuset/cgroups or even in a frozen state, so it cannot act
      by itself quickly enough, which causes more thrashing/killing.  Ptrace
      also does not work if the target process is already being ptraced
      (e.g., by strace, a debugger, or minidump), because a process can have
      at most one ptracer.
      
      [1] https://developer.android.com/topic/performance/memory"
      
      [2] process_getinfo for getting the cookie which is updated whenever
          vma of process address layout are changed - Daniel Colascione -
          https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224
      
      [3] anonymous fd which is used for the object(i.e., address range)
          validation - Michal Hocko -
          https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/
      
      [minchan@kernel.org: fix process_madvise build break for arm64]
        Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com
      [minchan@kernel.org: fix build error for mips of process_madvise]
        Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com
      [akpm@linux-foundation.org: fix patch ordering issue]
      [akpm@linux-foundation.org: fix arm64 whoops]
      [minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian]
      [akpm@linux-foundation.org: fix i386 build]
      [sfr@canb.auug.org.au: fix syscall numbering]
        Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au
      [sfr@canb.auug.org.au: madvise.c needs compat.h]
        Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au
      [minchan@kernel.org: fix mips build]
        Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com
      [yuehaibing@huawei.com: remove duplicate header which is included twice]
        Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com
      [minchan@kernel.org: do not use helper functions for process_madvise]
        Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com
      [akpm@linux-foundation.org: pidfd_get_pid() gained an argument]
      [sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"]
        Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.au
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Dias <joaodias@google.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Sandeep Patil <sspatil@google.com>
      Cc: SeongJae Park <sj38.park@gmail.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Florian Weimer <fw@deneb.enyo.de>
      Cc: <linux-man@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com
      Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org
      Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 15 October 2020 (1 commit)
  3. 14 October 2020 (9 commits)
    • memblock: use separate iterators for memory and reserved regions · cc6de168
      Mike Rapoport committed
      for_each_memblock() is used to iterate over memblock.memory in a few
      places that use data from memblock_region rather than the memory ranges.
      
      Introduce separate for_each_mem_region() and
      for_each_reserved_mem_region() to improve encapsulation of memblock
      internals from its users.
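
      A sketch of what a converted user looks like (kernel-internal,
      illustrative only; the hotplug check simply stands in for code that
      reads memblock_region attributes):

            struct memblock_region *r;

            /* walk memblock.memory without touching memblock internals */
            for_each_mem_region(r) {
                    if (memblock_is_hotpluggable(r))
                            pr_info("hotpluggable region at %pa\n", &r->base);
            }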
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>			[x86]
      Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>	[MIPS]
      Acked-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>	[.clang-format]
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Axtens <dja@axtens.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Emil Renner Berthing <kernel@esmil.dk>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: https://lkml.kernel.org/r/20200818151634.14343-18-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86/setup: simplify reserve_crashkernel() · 6120cdc0
      Mike Rapoport committed
      * Replace magic numbers with defines
      * Replace memblock_find_in_range() + memblock_reserve() with
        memblock_phys_alloc_range() (see the sketch below)
      * Stop checking for low memory size in reserve_crashkernel_low().  The
        allocation from a limited range will fail anyway if there is not
        enough memory, so there is no need for an extra traversal of
        memblock.memory
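
      The resulting allocation is roughly (a sketch; CRASH_ALIGN and
      CRASH_ADDR_LOW_MAX follow the existing x86 setup code, but treat the
      exact limits as assumptions):

            /* one call replaces memblock_find_in_range() + memblock_reserve() */
            crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN,
                                                   CRASH_ALIGN,
                                                   CRASH_ADDR_LOW_MAX);
            if (!crash_base) {
                    pr_info("crashkernel reservation failed - no suitable area found.\n");
                    return;
            }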
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Axtens <dja@axtens.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Emil Renner Berthing <kernel@esmil.dk>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: https://lkml.kernel.org/r/20200818151634.14343-15-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86/setup: simplify initrd relocation and reservation · 3c45ee6d
      Mike Rapoport committed
      Currently, the initrd image is reserved very early during setup and
      then it might be relocated and re-reserved after the initial physical
      memory mapping is created.  The "late" reservation verifies that the
      mapped memory size exceeds the size of the initrd, then checks whether
      relocation is required and, if so, relocates the initrd to new memory
      allocated from memblock and frees the old location.
      
      The check for memory size is excessive, as the memblock allocation
      will fail anyway if there is not enough memory.  Besides, there is no
      point in allocating from memblock with memblock_find_in_range() +
      memblock_reserve() when memblock_phys_alloc_range() already exists
      with the required functionality.
      
      Remove the redundant check and simplify memblock allocation.
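
      The relocation target then reduces to a single bounded allocation
      (sketch, mirroring the change described above):

            /* allocate the new initrd location directly from the mapped range */
            relocated_ramdisk = memblock_phys_alloc_range(area_size, PAGE_SIZE,
                                                          0,
                                                          PFN_PHYS(max_pfn_mapped));
            if (!relocated_ramdisk)
                    panic("Cannot find place for new RAMDISK of size %lld\n",
                          ramdisk_size);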
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Axtens <dja@axtens.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Emil Renner Berthing <kernel@esmil.dk>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: https://lkml.kernel.org/r/20200818151634.14343-14-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: introduce default phys_to_target_node() implementation · a035b6bf
      Dan Williams committed
      In preparation for setting a fallback value for dev_dax->target_node,
      introduce generic fallback helpers for phys_to_target_node().
      
      A generic implementation based on node-data or memblock was proposed, but
      as noted by Mike:
      
          "Here again, I would prefer to add a weak default for
           phys_to_target_node() because the "generic" implementation is not really
           generic.
      
           The fallback to reserved ranges is x86 specific, because on x86
           most of the reserved areas are not in memblock.memory. AFAIK, no
           other architecture does this."
      
      The info message in the generic memory_add_physaddr_to_nid()
      implementation is fixed up to properly reflect that
      memory_add_physaddr_to_nid() communicates "online" node info and
      phys_to_target_node() indicates "target / to-be-onlined" node info.
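
      The weak default presumably ends up with this shape (a sketch; the
      message text is an approximation of the fixed-up wording):

            /* mm/memory_hotplug.c - sketch, overridable per architecture */
            int __weak phys_to_target_node(u64 start)
            {
                    pr_info_once("Unknown target node for memory at 0x%llx, assuming node 0\n",
                                 start);
                    return 0;
            }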
      
      [akpm@linux-foundation.org: fix CONFIG_MEMORY_HOTPLUG=n build]
        Link: https://lkml.kernel.org/r/202008252130.7YrHIyMI%25lkp@intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Jia He <justin.he@arm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brice Goglin <Brice.Goglin@inria.fr>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Hulk Robot <hulkci@huawei.com>
      Cc: Jason Yan <yanaijie@huawei.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Link: https://lkml.kernel.org/r/159643097768.4062302.3135192588966888630.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • efi/fake_mem: arrange for a resource entry per efi_fake_mem instance · 88e9a5b7
      Dan Williams committed
      In preparation for attaching a platform device per iomem resource,
      teach the efi_fake_mem code to create an e820 entry per instance.
      Similar to E820_TYPE_PRAM, bypass resource merging when the e820 map
      is sanitized.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Brice Goglin <Brice.Goglin@inria.fr>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Jia He <justin.he@arm.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Hulk Robot <hulkci@huawei.com>
      Cc: Jason Yan <yanaijie@huawei.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Link: https://lkml.kernel.org/r/159643096068.4062302.11590041070221681669.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86/numa: add 'nohmat' option · 3b0d3101
      Dan Williams committed
      Disable parsing of the HMAT for debugging, to work around broken
      platform instances, or for cases where it is otherwise not wanted.
      
      [rdunlap@infradead.org: fix build when CONFIG_ACPI is not set]
        Link: https://lkml.kernel.org/r/70e5ee34-9809-a997-7b49-499e4be61307@infradead.org
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Brice Goglin <Brice.Goglin@inria.fr>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Jia He <justin.he@arm.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Hulk Robot <hulkci@huawei.com>
      Cc: Jason Yan <yanaijie@huawei.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Link: https://lkml.kernel.org/r/159643095540.4062302.732962081968036212.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86/numa: cleanup configuration dependent command-line options · 2dd57d34
      Dan Williams committed
      Patch series "device-dax: Support sub-dividing soft-reserved ranges", v5.
      
      The device-dax facility allows an address range to be directly mapped
      through a chardev, or optionally hotplugged to the core kernel page
      allocator as System-RAM.  It is the mechanism for converting
      persistent memory (pmem) into another volatile memory pool, i.e. the
      current Memory Tiering hot topic on linux-mm.
      
      In the case of pmem the nvdimm-namespace-label mechanism can sub-divide
      it, but that labeling mechanism is not available / applicable to
      soft-reserved ("EFI specific purpose") memory [3].  This series provides a
      sysfs-mechanism for the daxctl utility to enable provisioning of
      volatile-soft-reserved memory ranges.
      
      The motivations for this facility are:
      
      1/ Allow performance differentiated memory ranges to be split between
         kernel-managed and directly-accessed use cases.
      
      2/ Allow physical memory to be provisioned along performance relevant
         address boundaries. For example, divide a memory-side cache [4] along
         cache-color boundaries.
      
      3/ Parcel out soft-reserved memory to VMs using device-dax as a security
         / permissions boundary [5]. Specifically I have seen people (ab)using
         memmap=nn!ss (mark System-RAM as Persistent Memory) just to get the
         device-dax interface on custom address ranges. A follow-on for the VM
         use case is to teach device-dax to dynamically allocate 'struct page' at
         runtime to reduce the duplication of 'struct page' space in both the
         guest and the host kernel for the same physical pages.
      
      [2]: http://lore.kernel.org/r/20200713160837.13774-11-joao.m.martins@oracle.com
      [3]: http://lore.kernel.org/r/157309097008.1579826.12818463304589384434.stgit@dwillia2-desk3.amr.corp.intel.com
      [4]: http://lore.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com
      [5]: http://lore.kernel.org/r/20200110190313.17144-1-joao.m.martins@oracle.com
      
      This patch (of 23):
      
      In preparation for adding a new numa= option, clean up the existing
      ones to avoid ifdefs in numa_setup(), and provide feedback when the
      numa=fake= option is invalid due to the kernel configuration.  The
      same does not need to be done for numa=noacpi, since that capability
      is already hard-disabled at compile time.  (A sketch of the pattern
      follows.)
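
      One way the ifdef-free shape could look (assumed helper shape, not the
      verbatim diff; the real numa_setup() handles more options):

            /* config-dependent helper gets an always-available stub */
            #ifdef CONFIG_NUMA_EMU
            void numa_emu_cmdline(char *str);
            #else
            static inline void numa_emu_cmdline(char *str)
            {
                    pr_warn("numa=fake= requires CONFIG_NUMA_EMU\n");
            }
            #endif

            /* numa_setup() can now call it unconditionally */
            static __init int numa_setup(char *opt)
            {
                    if (!strncmp(opt, "fake=", 5))
                            numa_emu_cmdline(opt + 5);
                    return 0;
            }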
      Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brice Goglin <Brice.Goglin@inria.fr>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Jia He <justin.he@arm.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Hulk Robot <hulkci@huawei.com>
      Cc: Jason Yan <yanaijie@huawei.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Link: https://lkml.kernel.org/r/160106109960.30709.7379926726669669398.stgit@dwillia2-desk3.amr.corp.intel.com
      Link: https://lkml.kernel.org/r/159643094279.4062302.17779410714418721328.stgit@dwillia2-desk3.amr.corp.intel.com
      Link: https://lkml.kernel.org/r/159643094925.4062302.14979872973043772305.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86/traps: Fix #DE Oops message regression · 5f1ec1fd
      Thomas Gleixner committed
      The conversion of #DE to the idtentry mechanism introduced a change in
      the Oops message which confuses tools that parse crash information in
      dmesg.
      
      Remove the underscore from 'divide_error' to restore previous behaviour.
      
      Fixes: 9d06c402 ("x86/entry: Convert Divide Error to IDTENTRY")
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/CACT4Y+bTZFkuZd7+bPArowOv-7Die+WZpfOWnEO_Wgs3U59+oA@mail.gmail.com
  4. 13 October 2020 (3 commits)
    • x86/uaccess: utilize CONFIG_CC_HAS_ASM_GOTO_OUTPUT · 865c50e1
      Nick Desaulniers committed
      Clang 11 shipped support for outputs to asm goto statements along the
      fallthrough path.  Double up some of the get_user() and related macros
      to be able to take advantage of this GNU C extension.  This should
      help improve the generated code's performance for these accesses.
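
      An illustrative (userspace) shape of the extension, not the kernel
      macro itself: with a compiler that can set
      CONFIG_CC_HAS_ASM_GOTO_OUTPUT, an asm goto may produce an output on
      the fallthrough path and divert to a label on failure:

            /* needs Clang >= 11 (or GCC >= 11) for "asm goto" with outputs */
            static inline int try_load_int(const int *ptr, int *res)
            {
                    int val;

                    asm goto("movl (%[src]), %[val]"
                             /* a real get_user() attaches an exception-table
                              * entry so a fault lands on the efault label */
                             : [val] "=r" (val)
                             : [src] "r" (ptr)
                             : /* no clobbers */
                             : efault);
                    *res = val;
                    return 0;       /* fallthrough: output is valid */
            efault:
                    return -14;     /* -EFAULT */
            }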
      
      Cc: Bill Wendling <morbo@google.com>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86: Make __put_user() generate an out-of-line call · d55564cf
      Linus Torvalds committed
      Instead of inlining the stac/mov/clac sequence (which also requires
      individual exception table entries and several asm instruction
      alternatives entries), just generate "call __put_user_nocheck_X" for the
      __put_user() cases, the same way we changed __get_user earlier.
      
      Unlike the get_user() case, we didn't have the same nice infrastructure
      to just generate the call with a single case, so this actually has to
      change some of the infrastructure in order to do this.  But that only
      cleans up the code further.
      
      So now, instead of using a case statement for the sizes, we just do the
      same thing we've done on the get_user() side for a long time: use the
      size as an immediate constant to the asm, and generate the asm that way
      directly.
      
      In order to handle the special case of 64-bit data on a 32-bit kernel, I
      needed to change the calling convention slightly: the data is passed in
      %eax[:%edx], the pointer in %ecx, and the return value is also returned
      in %ecx.  It used to be returned in %eax, but because of how %eax can
      now be a double register input, we don't want mix that with a
      single-register output.
      
      The actual low-level asm is easier to handle: we'll just share the code
      between the checking and non-checking case, with the non-checking case
      jumping into the middle of the function.  That may sound a bit too
      special, but this code is all very very special anyway, so...
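
      A sketch of the call generation under that convention (simplified from
      the real x86 macros; the register name assumes 32-bit, and constraint
      "0" ties the pointer to the %ecx output so the same register carries
      the error code back):

            #define do_put_user_call(fn, x, ptr)                            \
            ({                                                              \
                    int __ret_pu;                                           \
                    register __typeof__(*(ptr)) __val_pu asm("%eax");       \
                    __val_pu = (x);                                         \
                    /* size picks the callee via an immediate operand */    \
                    asm volatile("call __" #fn "_%P[size]"                  \
                                 : "=c" (__ret_pu)                          \
                                 : "0" (ptr), "r" (__val_pu),               \
                                   [size] "i" (sizeof(*(ptr)))              \
                                 : "ebx");                                  \
                    __builtin_expect(__ret_pu, 0);                          \
            })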
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86: Make __get_user() generate an out-of-line call · ea6f043f
      Linus Torvalds committed
      Instead of inlining the whole stac/lfence/mov/clac sequence (which also
      requires individual exception table entries and several asm instruction
      alternatives entries), just generate "call __get_user_nocheck_X" for the
      __get_user() cases.
      
      We can use all the same infrastructure that we already do for the
      regular "get_user()", and the end result is simpler source code, and
      much simpler code generation.
      
      It also means that when I introduce asm goto with input for
      "unsafe_get_user()", there are no nasty interactions with the
      __get_user() code.
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 09 October 2020 (1 commit)
  6. 08 October 2020 (1 commit)
  7. 07 October 2020 (20 commits)
  8. 06 October 2020 (4 commits)
    • perf/x86: Fix n_metric for cancelled txn · 3dbde695
      Peter Zijlstra committed
      When a group that has TopDown members fails to be scheduled, any later
      TopDown groups will not return valid values.
      
      Here is an example.
      
      A background perf that occupies all the GP counters and the fixed
      counter 1.
       $perf stat -e "{cycles,cycles,cycles,cycles,cycles,cycles,cycles,
                       cycles,cycles}:D" -a
      
      A user monitors a TopDown group. It works well, because the fixed
      counter 3 and the PERF_METRICS are available.
       $perf stat -x, --topdown -- ./workload
         retiring,bad speculation,frontend bound,backend bound,
         18.0,16.1,40.4,25.5,
      
      Then the user tries to monitor a group that has TopDown members.
      Because of the cycles event, the group fails to be scheduled.
       $perf stat -x, -e '{slots,topdown-retiring,topdown-be-bound,
                           topdown-fe-bound,topdown-bad-spec,cycles}'
                           -- ./workload
          <not counted>,,slots,0,0.00,,
          <not counted>,,topdown-retiring,0,0.00,,
          <not counted>,,topdown-be-bound,0,0.00,,
          <not counted>,,topdown-fe-bound,0,0.00,,
          <not counted>,,topdown-bad-spec,0,0.00,,
          <not counted>,,cycles,0,0.00,,
      
      The user tries to monitor a TopDown group again. It doesn't work anymore.
       $perf stat -x, --topdown -- ./workload
      
          ,,,,,
      
      In a transaction, cancel_txn() truncates the event_list for the
      cancelled group and updates the number of events added in the
      transaction.  However, the number of TopDown events added in the
      transaction was not updated, so the kernel would then fail to add new
      TopDown events.
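
      A self-contained model of the bookkeeping and the fix (illustrative
      only; the field names follow the description, not the exact kernel
      diff):

            struct hw_events {
                    int n_events;           /* events currently collected */
                    int n_added;            /* events added since last commit */
                    int n_metric;           /* active TopDown (metric) events */
                    int n_txn;              /* events added in the open txn */
                    int n_txn_metric;       /* metric events added in the txn */
            };

            static void cancel_txn(struct hw_events *c)
            {
                    /* roll back everything the failed transaction added ... */
                    c->n_added  -= c->n_txn;
                    c->n_events -= c->n_txn;
                    /* ... including the TopDown count the bug left stale */
                    c->n_metric -= c->n_txn_metric;
                    c->n_txn = 0;
                    c->n_txn_metric = 0;
            }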
      
      Fixes: 7b2c05a1 ("perf/x86/intel: Generic support for hardware TopDown metrics")
      Reported-by: Andi Kleen <ak@linux.intel.com>
      Reported-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Kan Liang <kan.liang@linux.intel.com>
      Link: https://lkml.kernel.org/r/20201005082611.GH2628@hirez.programming.kicks-ass.net
    • perf/x86: Fix n_pair for cancelled txn · 871a93b0
      Peter Zijlstra committed
      Kan reported that n_metric gets corrupted for cancelled transactions;
      a similar issue exists for n_pair for AMD's Large Increment thing.
      
      The problem was confirmed and confirmed fixed by Kim using:
      
        sudo perf stat -e "{cycles,cycles,cycles,cycles}:D" -a sleep 10 &
      
        # should succeed:
        sudo perf stat -e "{fp_ret_sse_avx_ops.all}:D" -a workload
      
        # should fail:
        sudo perf stat -e "{fp_ret_sse_avx_ops.all,fp_ret_sse_avx_ops.all,cycles}:D" -a workload
      
        # previously failed, now succeeds with this patch:
        sudo perf stat -e "{fp_ret_sse_avx_ops.all}:D" -a workload
      
      Fixes: 57388912 ("perf/x86/amd: Add support for Large Increment per Cycle Events")
      Reported-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Kim Phillips <kim.phillips@amd.com>
      Link: https://lkml.kernel.org/r/20201005082516.GG2628@hirez.programming.kicks-ass.net
    • x86/copy_mc: Introduce copy_mc_enhanced_fast_string() · 5da8e4a6
      Dan Williams committed
      The motivations for reworking memcpy_mcsafe() are that the benefit of
      doing slow and careful copies is obviated on newer CPUs, and that the
      current opt-in list of CPUs to instrument recovery is broken relative
      to those CPUs.  There is no need to keep an opt-in list up to date on
      an ongoing basis if pmem/dax operations are instrumented for recovery
      by default.  With recovery enabled by default, the old "mcsafe_key"
      opt-in to careful copying can be made a "fragile" opt-out, where the
      "fragile" list takes steps not to consume poison across cachelines.
      
      The discussion with Linus made clear that the current "_mcsafe" suffix
      was imprecise to a fault. The operations that are needed by pmem/dax are
      to copy from a source address that might throw #MC to a destination that
      may write-fault, if it is a user page.
      
      So copy_to_user_mcsafe() becomes copy_mc_to_user() to indicate
      the separate precautions taken on source and destination.
      copy_mc_to_kernel() is introduced as a non-SMAP version that does not
      expect write-faults on the destination, but is still prepared to abort
      with an error code upon taking #MC.
      
      The original copy_mc_fragile() implementation had negative performance
      implications since it did not use the fast-string instruction sequence
      to perform copies. For this reason copy_mc_to_kernel() fell back to
      plain memcpy() to preserve performance on platforms that did not indicate
      the capability to recover from machine check exceptions. However, that
      capability detection was not architectural and now that some platforms
      can recover from fast-string consumption of memory errors the memcpy()
      fallback now causes these more capable platforms to fail.
      
      Introduce copy_mc_enhanced_fast_string() as the fast default
      implementation of copy_mc_to_kernel() and finalize the transition of
      copy_mc_fragile() to be a platform quirk to indicate 'copy-carefully'.
      With this in place, copy_mc_to_kernel() is fast and recovery-ready by
      default regardless of hardware capability.
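
      The resulting dispatch is roughly (a sketch of the policy described;
      the flag and helper names follow the text above):

            unsigned long copy_mc_to_kernel(void *dst, const void *src,
                                            unsigned len)
            {
                    /* opt-out quirk list: CPUs that must copy carefully */
                    if (copy_mc_fragile_enabled)
                            return copy_mc_fragile(dst, src, len);
                    /* fast default: REP MOVSB with #MC recovery where capable */
                    if (static_cpu_has(X86_FEATURE_ERMS))
                            return copy_mc_enhanced_fast_string(dst, src, len);
                    memcpy(dst, src, len);
                    return 0;
            }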
      
      Thanks to Vivek for identifying that copy_user_generic() is not suitable
      as the copy_mc_to_user() backend since the #MC handler explicitly checks
      ex_has_fault_handler(). Thanks to the 0day robot for catching a
      performance bug in the x86/copy_mc_to_user implementation.
      
       [ bp: Add the "why" for this change from the 0/2th message, massage. ]
      
      Fixes: 92b0729c ("x86/mm, x86/mce: Add memcpy_mcsafe()")
      Reported-by: Erwin Tsaur <erwin.tsaur@intel.com>
      Reported-by: 0day robot <lkp@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Tony Luck <tony.luck@intel.com>
      Tested-by: Erwin Tsaur <erwin.tsaur@intel.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/160195562556.2163339.18063423034951948973.stgit@dwillia2-desk3.amr.corp.intel.com
    • x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}() · ec6347bb
      Dan Williams committed
      In reaction to a proposal to introduce a memcpy_mcsafe_fast()
      implementation Linus points out that memcpy_mcsafe() is poorly named
      relative to communicating the scope of the interface. Specifically what
      addresses are valid to pass as source, destination, and what faults /
      exceptions are handled.
      
      Of particular concern is that even though x86 might be able to handle
      the semantics of copy_mc_to_user() with its common copy_user_generic()
      implementation other archs likely need / want an explicit path for this
      case:
      
        On Fri, May 1, 2020 at 11:28 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
        >
        > On Thu, Apr 30, 2020 at 6:21 PM Dan Williams <dan.j.williams@intel.com> wrote:
        > >
        > > However now I see that copy_user_generic() works for the wrong reason.
        > > It works because the exception on the source address due to poison
        > > looks no different than a write fault on the user address to the
        > > caller, it's still just a short copy. So it makes copy_to_user() work
        > > for the wrong reason relative to the name.
        >
        > Right.
        >
        > And it won't work that way on other architectures. On x86, we have a
        > generic function that can take faults on either side, and we use it
        > for both cases (and for the "in_user" case too), but that's an
        > artifact of the architecture oddity.
        >
        > In fact, it's probably wrong even on x86 - because it can hide bugs -
        > but writing those things is painful enough that everybody prefers
        > having just one function.
      
      Replace a single top-level memcpy_mcsafe() with either
      copy_mc_to_user(), or copy_mc_to_kernel().
      
      Introduce an x86 copy_mc_fragile() name as the rename for the
      low-level x86 implementation formerly named memcpy_mcsafe(). It is used
      as the slow / careful backend that is supplanted by a fast
      copy_mc_generic() in a follow-on patch.
      
      One side-effect of this reorganization is that separating copy_mc_64.S
      to its own file means that perf no longer needs to track dependencies
      for its memcpy_64.S benchmarks.
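
      The renamed interface then looks roughly like (signatures
      reconstructed from the description, so treat details as assumptions):

            /* source may #MC; destination is a kernel address, no write
             * faults expected; returns bytes not copied, like copy_to_user() */
            unsigned long __must_check
            copy_mc_to_kernel(void *dst, const void *src, unsigned int len);

            /* source may #MC; destination is a user address and may fault */
            unsigned long __must_check
            copy_mc_to_user(void __user *dst, const void *src, unsigned int len);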
      
       [ bp: Massage a bit. ]
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Tony Luck <tony.luck@intel.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Cc: <stable@vger.kernel.org>
      Link: http://lore.kernel.org/r/CAHk-=wjSqtXAqfUJxFtWNwmguFASTgB0dz1dT3V-78Quiezqbg@mail.gmail.com
      Link: https://lkml.kernel.org/r/160195561680.2163339.11574962055305783722.stgit@dwillia2-desk3.amr.corp.intel.com