1. 21 Feb, 2019 1 commit
    • KVM: PPC: Book3S HV: Simplify machine check handling · 884dfb72
      Committed by Paul Mackerras
      This makes the handling of machine check interrupts that occur inside
      a guest simpler and more robust, with less done in assembler code and
      in real mode.
      
      Now, when a machine check occurs inside a guest, we always get the
      machine check event struct and put a copy in the vcpu struct for the
      vcpu where the machine check occurred.  We no longer call
      machine_check_queue_event() from kvmppc_realmode_mc_power7(), because
      on POWER8, when a vcpu is running on an offline secondary thread and
      we call machine_check_queue_event(), that calls irq_work_queue(),
      which doesn't work because the CPU is offline, but instead triggers
      the WARN_ON(lazy_irq_pending()) in pnv_smp_cpu_kill_self() (which
      fires again and again because nothing clears the condition).
      
      All that machine_check_queue_event() actually does is to cause the
      event to be printed to the console.  For a machine check occurring in
      the guest, we now print the event in kvmppc_handle_exit_hv()
      instead.
      
      The assembly code at label machine_check_realmode now just calls C
      code and then continues exiting the guest.  We no longer either
      synthesize a machine check for the guest in assembly code or return
      to the guest without a machine check.
      
      The code in kvmppc_handle_exit_hv() is extended to handle the case
      where the guest is not FWNMI-capable.  In that case we now always
      synthesize a machine check interrupt for the guest.  Previously, if
      the host thought it had recovered the machine check fully, it would
      return to the guest without any notification that the machine check
      had occurred.  If the machine check was caused by some action of the
      guest (such as creating duplicate SLB entries), it is much better to
      tell the guest that it has caused a problem.  Therefore we now always
      generate a machine check interrupt for guests that are not
      FWNMI-capable.
      Reviewed-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
      Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  2. 06 Jan, 2019 1 commit
    • jump_label: move 'asm goto' support test to Kconfig · e9666d10
      Committed by Masahiro Yamada
      Currently, CONFIG_JUMP_LABEL just means "I _want_ to use jump label".
      
      The jump label is controlled by HAVE_JUMP_LABEL, which is defined
      like this:
      
        #if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
        # define HAVE_JUMP_LABEL
        #endif
      
      We can improve this by testing 'asm goto' support in Kconfig, then
      make JUMP_LABEL depend on CC_HAS_ASM_GOTO.
      
      The ugly #ifdef HAVE_JUMP_LABEL will go away, and CONFIG_JUMP_LABEL
      will match the real kernel capability.
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
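      The old arrangement described in this commit message can be sketched
      as follows. This is only an illustration: the two #define lines at
      the top are stand-ins for what the compiler probe and .config would
      normally provide, and sketch_jump_label_enabled() is a hypothetical
      helper that does not exist in the kernel.

```c
/* Stand-ins for this sketch only: in a real build these would come
 * from the compiler probe and from .config, not from a source file. */
#define CC_HAVE_ASM_GOTO 1
#define CONFIG_JUMP_LABEL 1

/* The derived macro quoted in the commit message. */
#if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
# define HAVE_JUMP_LABEL
#endif

/* Hypothetical helper: returns 1 when jump labels are really usable. */
static int sketch_jump_label_enabled(void)
{
#ifdef HAVE_JUMP_LABEL
	return 1;
#else
	return 0;
#endif
}
```

      With the compiler test moved into Kconfig, code of this kind could
      test CONFIG_JUMP_LABEL directly and the derived macro would no longer
      be needed.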
  3. 05 Jan, 2019 2 commits
    • mm: treewide: remove unused address argument from pte_alloc functions · 4cf58924
      Committed by Joel Fernandes (Google)
      Patch series "Add support for fast mremap".
      
      This series speeds up the mremap(2) syscall by copying page tables at
      the PMD level even for non-THP systems.  There is concern that the extra
      'address' argument that mremap passes to pte_alloc may do something
      subtle and architecture-specific in the future that would make the
      scheme not work.  Also, there is no point in passing the 'address' to
      pte_alloc since it is unused.  This patch therefore removes the argument
      tree-wide, resulting in a nice negative diff as well.  Along the way it
      ensures that the enabled architectures do not do anything funky with
      the 'address' argument that would go unnoticed by the optimization.
      
      Build and boot tested on x86-64.  Build tested on arm64.  The config
      enablement patch for arm64 will be posted in the future after more
      testing.
      
      The changes were obtained by applying the following Coccinelle script.
      (thanks Julia for answering all Coccinelle questions!).
      Following fix ups were done manually:
      * Removal of address argument from  pte_fragment_alloc
      * Removal of pte_alloc_one_fast definitions from m68k and microblaze.
      
      // Options: --include-headers --no-includes
      // Note: I split the 'identifier fn' line, so if you are manually
      // running it, please unsplit it so it runs for you.
      
      virtual patch
      
      @pte_alloc_func_def depends on patch exists@
      identifier E2;
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      type T2;
      @@
      
       fn(...
      - , T2 E2
       )
       { ... }
      
      @pte_alloc_func_proto_noarg depends on patch exists@
      type T1, T2, T3, T4;
      identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      @@
      
      (
      - T3 fn(T1, T2);
      + T3 fn(T1);
      |
      - T3 fn(T1, T2, T4);
      + T3 fn(T1, T2);
      )
      
      @pte_alloc_func_proto depends on patch exists@
      identifier E1, E2, E4;
      type T1, T2, T3, T4;
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      @@
      
      (
      - T3 fn(T1 E1, T2 E2);
      + T3 fn(T1 E1);
      |
      - T3 fn(T1 E1, T2 E2, T4 E4);
      + T3 fn(T1 E1, T2 E2);
      )
      
      @pte_alloc_func_call depends on patch exists@
      expression E2;
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      @@
      
       fn(...
      -,  E2
       )
      
      @pte_alloc_macro depends on patch exists@
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      identifier a, b, c;
      expression e;
      position p;
      @@
      
      (
      - #define fn(a, b, c) e
      + #define fn(a, b) e
      |
      - #define fn(a, b) e
      + #define fn(a) e
      )
      
      Link: http://lkml.kernel.org/r/20181108181201.88826-2-joelaf@google.com
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
      Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Julia Lawall <Julia.Lawall@lip6.fr>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Fix access_ok() fallout for sparc32 and powerpc · 4caf4ebf
      Committed by Linus Torvalds
      These two architectures actually had an intentional use of the 'type'
      argument to access_ok() just to avoid warnings.
      
      I had actually noticed the powerpc one, but forgot to then fix it up.
      And I missed the sparc32 case entirely.
      
      This is hopefully all of it.
      Reported-by: Mathieu Malaterre <malat@debian.org>
      Reported-by: Guenter Roeck <linux@roeck-us.net>
      Fixes: 96d4f267 ("Remove 'type' argument from access_ok() function")
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 04 Jan, 2019 2 commits
    • powerpc: Drop use of 'type' from access_ok() · 074400a7
      Committed by Mathieu Malaterre
      In commit 05a4ab82 ("powerpc/uaccess: fix warning/error with
      access_ok()") an attempt was made to remove a warning by referencing
      the variable `type`. However in commit 96d4f267 ("Remove 'type'
      argument from access_ok() function") the variable `type` has been
      removed, breaking the build:
      
        arch/powerpc/include/asm/uaccess.h:66:32: error: ‘type’ undeclared (first use in this function)
      
      This essentially reverts commit 05a4ab82 ("powerpc/uaccess: fix
      warning/error with access_ok()") to fix the error.
      
      Fixes: 96d4f267 ("Remove 'type' argument from access_ok() function")
      Signed-off-by: Mathieu Malaterre <malat@debian.org>
      [mpe: Reword change log slightly.]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • Remove 'type' argument from access_ok() function · 96d4f267
      Committed by Linus Torvalds
      Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
      of the user address range verification function since we got rid of the
      old racy i386-only code to walk page tables by hand.
      
      It existed because the original 80386 would not honor the write protect
      bit when in kernel mode, so you had to do COW by hand before doing any
      user access.  But we haven't supported that in a long time, and these
      days the 'type' argument is a purely historical artifact.
      
      A discussion about extending 'user_access_begin()' to do the range
      checking resulted in this patch, because there is no way we're going to
      move the old VERIFY_xyz interface to that model.  And it's best done at
      the end of the merge window when I've done most of my merges, so let's
      just get this done once and for all.
      
      This patch was mostly done with a sed-script, with manual fix-ups for
      the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
      
      There were a couple of notable cases:
      
       - csky still had the old "verify_area()" name as an alias.
      
       - the iter_iov code had magical hardcoded knowledge of the actual
         values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
         really used it)
      
       - microblaze used the type argument for a debug printout
      
      but other than those oddities this should be a total no-op patch.
      
      I tried to fix up all architectures, did fairly extensive grepping for
      access_ok() uses, and the changes are trivial, but I may have missed
      something.  Any missed conversion should be trivially fixable, though.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
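      A minimal userspace sketch of a range check in the two-argument
      form, with no VERIFY_READ/VERIFY_WRITE parameter. The limit
      SKETCH_USER_LIMIT and the function sketch_access_ok() are
      assumptions made for this example; the real access_ok() is
      architecture-specific and is not reproduced here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical address-space limit, chosen only for this sketch. */
#define SKETCH_USER_LIMIT UINT64_C(0x0000800000000000)

/* Two-argument form: there is no VERIFY_READ/VERIFY_WRITE parameter. */
static bool sketch_access_ok(uint64_t addr, uint64_t size)
{
	/* Written so that addr + size is never computed and cannot overflow. */
	return size <= SKETCH_USER_LIMIT && addr <= SKETCH_USER_LIMIT - size;
}
```

      A caller that used to write a check with a leading type argument
      would, in this sketch, simply call sketch_access_ok(addr, size).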
  5. 22 Dec, 2018 2 commits
  6. 21 Dec, 2018 16 commits
    • KVM: Make kvm_set_spte_hva() return int · 748c0e31
      Committed by Lan Tianyu
      Make kvm_set_spte_hva() return int so that the caller can check the
      return value to determine whether to flush the TLB.
      Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
      Acked-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
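      The calling pattern this change enables might look like the sketch
      below. Everything here is a hypothetical stand-in: struct sketch_kvm
      is not the kernel's struct kvm, the functions take simplified
      arguments, and the flush is modelled as a counter.

```c
/* Hypothetical stand-in for the VM state; not the kernel's struct kvm. */
struct sketch_kvm {
	unsigned long spte;   /* a single shadow PTE, for illustration */
	int tlb_flushes;      /* counts how often a flush was requested */
};

/* Returns nonzero when the mapping changed and a flush is needed. */
static int sketch_set_spte_hva(struct sketch_kvm *kvm, unsigned long new_spte)
{
	if (kvm->spte == new_spte)
		return 0;
	kvm->spte = new_spte;
	return 1;
}

/* Caller: flush only when the callee reports that it is necessary. */
static void sketch_change_pte(struct sketch_kvm *kvm, unsigned long new_spte)
{
	if (sketch_set_spte_hva(kvm, new_spte))
		kvm->tlb_flushes++;
}
```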
    • powerpc/powernv/npu: Add compound IOMMU groups · 0bd97167
      Committed by Alexey Kardashevskiy
      At the moment the powernv platform registers an IOMMU group for each
      PE. There is an exception though: an NVLink bridge is attached to the
      corresponding GPU's IOMMU group, making the GPU a master.

      Now we have POWER9 systems with GPUs connected to each other directly,
      bypassing PCI. At the moment we do not control the state of these links,
      so we have to put such interconnected GPUs into one IOMMU group, which
      means that the old scheme with one GPU as a master won't work: there
      will be up to 3 GPUs in such a group.

      This introduces an npu_comp struct which represents a compound IOMMU
      group made of multiple PEs: PCI PEs (for GPUs) and NPU PEs (for
      NVLink bridges). This converts the existing NVLink1 code to use the
      new scheme. From now on, each PE must have a valid
      iommu_table_group_ops which will be called either directly (for a
      single PE group) or indirectly from the compound group handlers.

      This moves IOMMU group registration for NVLink-connected GPUs to
      npu-dma.c. For POWER8, this stores a new compound group pointer in the
      PE (so a GPU is still a master); for POWER9 the new group pointer is
      stored in an NPU (which is allocated per PCI host controller).
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [mpe: Initialise npdev to NULL in pnv_try_setup_npu_table_group()]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv/pseries: Rework device adding to IOMMU groups · c4e9d3c1
      Committed by Alexey Kardashevskiy
      The powernv platform registers IOMMU groups and adds devices to them
      from the pci_controller_ops::setup_bridge() hook except one case when
      virtual functions (SRIOV VFs) are added from a bus notifier.
      
      The pseries platform registers IOMMU groups from
      the pci_controller_ops::dma_bus_setup() hook and adds devices from
      the pci_controller_ops::dma_dev_setup() hook. The very same bus notifier
      used for powernv does not add devices for pseries though as
      __of_scan_bus() adds devices first, then it does the bus/dev DMA setup.
      
      Both platforms use iommu_add_device() which takes a device and expects
      it to have a valid IOMMU table struct with an iommu_table_group pointer
      which in turn points to the iommu_group struct (which represents
      an IOMMU group). Although the helper seems easy to use, it relies on
      some pre-existing device configuration and associated data structures
      which it does not really need.

      This simplifies iommu_add_device() to take the table_group pointer
      directly. Pseries already has a table_group pointer handy and the bus
      notifier is not used anyway. For powernv, this copies the existing bus
      notifier and makes it work for powernv only, which gives an easy way of
      getting to the table_group pointer. This was tested on VFs but should
      also support physical PCI hotplug.

      Since iommu_add_device() now receives the table_group pointer directly,
      and pseries neither does TCE cache invalidation (the hypervisor does)
      nor allows multiple groups per VFIO container (in other words, sharing
      an IOMMU table between partitionable endpoints), this removes
      iommu_table_group_link from pseries.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv/npu: Move OPAL calls away from context manipulation · 0e759bd7
      Committed by Alexey Kardashevskiy
      When introduced, the NPU context init/destroy helpers called OPAL, which
      enabled/disabled PID (a userspace memory context ID) filtering in an NPU
      per GPU; this was a requirement for P9 DD1.0. However, newer chip
      revisions added PID wildcard support, so there is no longer any need to
      call OPAL every time a new context is initialized. Also, since the PID
      wildcard support was added, skiboot does not clear wildcard entries
      in the NPU, so these remain in the hardware until the system reboots.
      
      This moves LPID and wildcard programming to the PE setup code which
      executes once during the booting process so NPU2 context init/destroy
      won't need to do additional configuration.
      
      This replaces the check for FW_FEATURE_OPAL with a check for npu!=NULL as
      this is the way to tell if the NPU support is present and configured.
      
      This moves pnv_npu2_init() declaration as pseries should be able to use it.
      This keeps pnv_npu2_map_lpar() in powernv as pseries is not allowed to
      call that. This exports pnv_npu2_map_lpar_dev() as following patches
      will use it from the VFIO driver.
      
      While at it, replace redundant list_for_each_entry_safe() with
      a simpler list_for_each_entry().
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv: Move npu struct from pnv_phb to pci_controller · 46a1449d
      Committed by Alexey Kardashevskiy
      The powernv PCI code stores NPU data in the pnv_phb struct. The latter
      is referenced by pci_controller::private_data. We are going to have NPU2
      support in the pseries platform as well but it does not store any
      private_data in the pci_controller struct; and even if it did,
      it would be a different data structure.
      
      This makes npu a pointer and stores it one level higher in
      the pci_controller struct.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/vfio/iommu/kvm: Do not pin device memory · c10c21ef
      Committed by Alexey Kardashevskiy
      This new memory does not have page structs as it is not plugged to
      the host so gup() will fail anyway.
      
      This adds 2 helpers:
      - mm_iommu_newdev() to preregister the "memory device" memory so
      the rest of the API can still be used;
      - mm_iommu_is_devmem() to know if the physical address is one of these
      new regions which we must avoid unpinning.
      
      This adds @mm to tce_page_is_contained() and iommu_tce_xchg() to test
      if the memory is device memory to avoid pfn_to_page().
      
      This adds a check for device memory in mm_iommu_ua_mark_dirty_rm() which
      does delayed pages dirtying.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/mm/iommu/vfio_spapr_tce: Change mm_iommu_get to reference a region · e0bf78b0
      Committed by Alexey Kardashevskiy
      Normally mm_iommu_get() should add a reference and mm_iommu_put() should
      remove it. However historically mm_iommu_find() does the referencing and
      mm_iommu_get() is doing allocation and referencing.
      
      We are going to add another helper to preregister device memory so
      instead of having mm_iommu_new() (which pre-registers the normal memory
      and references the region), we need separate helpers for pre-registering
      and referencing.
      
      This renames:
      - mm_iommu_get to mm_iommu_new;
      - mm_iommu_find to mm_iommu_get.
      
      This changes mm_iommu_get() to reference the region so the name now
      reflects what it does.
      
      This removes the check for exact match from mm_iommu_new() as we want it
      to fail on existing regions; mm_iommu_get() should be used instead.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
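      The naming scheme after this change ("new" pre-registers and fails
      on existing regions, "get" references, "put" drops the reference)
      can be sketched with a small userspace model. All names and the
      fixed-size table below are hypothetical, and the choice that a
      freshly registered region starts with one reference is an
      assumption of this sketch rather than something stated in the
      commit message.

```c
#include <stddef.h>

#define SKETCH_MAX_REGIONS 4

/* Hypothetical region record; 'used' is the reference count and a
 * value of 0 marks the slot as free. */
struct sketch_region {
	unsigned long first_page;
	unsigned long entries;
	long used;
};

static struct sketch_region sketch_regions[SKETCH_MAX_REGIONS];

/* "new": pre-register a region; fails if it overlaps an existing one.
 * Assumption: the new region starts with one reference. */
static int sketch_iommu_new(unsigned long first_page, unsigned long entries)
{
	struct sketch_region *slot = NULL;

	for (size_t i = 0; i < SKETCH_MAX_REGIONS; i++) {
		struct sketch_region *r = &sketch_regions[i];

		if (r->used == 0) {
			if (!slot)
				slot = r;
			continue;
		}
		if (first_page < r->first_page + r->entries &&
		    r->first_page < first_page + entries)
			return -1;
	}
	if (!slot)
		return -1;
	slot->first_page = first_page;
	slot->entries = entries;
	slot->used = 1;
	return 0;
}

/* "get": find an exactly matching region and take a reference. */
static struct sketch_region *sketch_iommu_get(unsigned long first_page,
					      unsigned long entries)
{
	for (size_t i = 0; i < SKETCH_MAX_REGIONS; i++) {
		struct sketch_region *r = &sketch_regions[i];

		if (r->used && r->first_page == first_page &&
		    r->entries == entries) {
			r->used++;
			return r;
		}
	}
	return NULL;
}

/* "put": drop one reference; the slot becomes free at zero. */
static void sketch_iommu_put(struct sketch_region *r)
{
	if (r->used > 0)
		r->used--;
}
```

      In this model a second sketch_iommu_new() on the same range fails,
      so a caller that wants to share an existing region has to use
      sketch_iommu_get() instead.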
    • powerpc: generate uapi header and system call table files · ab66dcc7
      Committed by Firoz Khan
      The system call table generation script must be run to generate the
      unistd_32/64.h and syscall_table_32/64/c32/spu.h files. This patch
      contains the changes which invoke the script.

      This patch generates the unistd_32/64.h and
      syscall_table_32/64/c32/spu.h files using the syscall table generation
      script invoked by parisc/Makefile; the generated files must be
      identical to the removed files.

      The generated uapi header file will be included in
      uapi/asm/unistd.h, and the generated system call table header file
      will be included by the kernel/systbl.S file.
      Signed-off-by: Firoz Khan <firoz.khan@linaro.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc: split compat syscall table out from native table · fbf508da
      Committed by Firoz Khan
      PowerPC uses a syscall table with native and compat calls
      interleaved, which is a slightly simpler way to define two
      matching tables.

      As we move to having the tables generated, that advantage
      is no longer important, but the interleaved table gets in
      the way of using the same scripts as on the other
      architectures.

      Split out a new compat_sys_call_table symbol that contains
      all the compat calls, and leave the main table for the
      native calls, to more closely match the method we use
      everywhere else.
      Suggested-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Firoz Khan <firoz.khan@linaro.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
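      The difference between an interleaved table and two separate tables
      can be illustrated with function-pointer arrays. This is a sketch
      only: the handlers, table names, and dispatch helpers are
      hypothetical, and the real powerpc tables are built in assembly
      rather than in C like this.

```c
typedef long (*sketch_syscall_fn)(void);

/* Hypothetical handlers; the return values only identify which ran. */
static long sketch_sys_a(void)        { return 10; }
static long sketch_compat_sys_a(void) { return 11; }
static long sketch_sys_b(void)        { return 20; }
static long sketch_compat_sys_b(void) { return 21; }

/* Old layout: native and compat entries interleaved in one table. */
static const sketch_syscall_fn sketch_interleaved_table[] = {
	sketch_sys_a, sketch_compat_sys_a,
	sketch_sys_b, sketch_compat_sys_b,
};

/* New layout: two tables indexed directly by syscall number. */
static const sketch_syscall_fn sketch_sys_call_table[] = {
	sketch_sys_a, sketch_sys_b,
};
static const sketch_syscall_fn sketch_compat_sys_call_table[] = {
	sketch_compat_sys_a, sketch_compat_sys_b,
};

static long sketch_dispatch_interleaved(unsigned int nr, int is_compat)
{
	return sketch_interleaved_table[2 * nr + (is_compat ? 1 : 0)]();
}

static long sketch_dispatch_split(unsigned int nr, int is_compat)
{
	const sketch_syscall_fn *table =
		is_compat ? sketch_compat_sys_call_table : sketch_sys_call_table;

	return table[nr]();
}
```

      Both dispatchers reach the same handlers; the split form lets each
      table be emitted by a generator that knows nothing about the other.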
    • powerpc: move macro definition from asm/systbl.h · a11b763d
      Committed by Firoz Khan
      Move the macro definition for compat_sys_sigsuspend from
      asm/systbl.h to the file in which it is included.

      One of the patches in this series generates the uapi
      header and syscall table files. In order to come up with
      a common implementation across all architectures, we need
      to make this change.

      This change will simplify the implementation of the system
      call table generation script and help to come up with a common
      implementation across all architectures.
      Signed-off-by: Firoz Khan <firoz.khan@linaro.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc: add __NR_syscalls along with NR_syscalls · 8a19eeea
      Committed by Firoz Khan
      The NR_syscalls macro holds the number of system calls that exist
      in the powerpc architecture. We have to change the value of
      NR_syscalls if we add or delete a system call.

      One of the patches in this series has a script which
      will generate a uapi header based on the syscall.tbl file.
      The syscall.tbl file contains the system call number
      information. So we have two options for updating the NR_syscalls
      value.

      1. Update NR_syscalls in asm/unistd.h manually by counting
         the number of system calls. There is no need to update
         NR_syscalls until we either add a new system call or delete
         an existing system call.

      2. Keep this feature in the above-mentioned script,
         which will count the number of syscalls and keep it in
         a generated file. In this case we don't need to explicitly
         update NR_syscalls in the asm/unistd.h file.

      The 2nd option is the recommended one. For that, I
      added the __NR_syscalls macro in uapi/asm/unistd.h along
      with NR_syscalls in asm/unistd.h. The __NR_syscalls macro
      was also added to make the naming convention the same across all
      architectures. While __NR_syscalls isn't strictly part of
      the uapi, having it as part of the generated header
      simplifies the implementation. We also need to enclose
      this macro with #ifdef __KERNEL__ to avoid side effects.
      Signed-off-by: Firoz Khan <firoz.khan@linaro.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
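      The second option might look roughly like the sketch below, with the
      generated header and the hand-written header folded into one file
      for illustration. The syscall names and numbers are hypothetical,
      and __KERNEL__ is defined here only so that the fragment compiles on
      its own; a kernel build provides it.

```c
/* Defined here only so the sketch compiles standalone. */
#define __KERNEL__ 1

/* What a generated uapi header might contain (numbers hypothetical). */
#define __NR_sketch_first  0
#define __NR_sketch_second 1
#define __NR_sketch_third  2

#ifdef __KERNEL__
#define __NR_syscalls 3   /* count written by the generation script */
#endif

/* asm/unistd.h side: no hand-maintained count any more. */
#define NR_syscalls __NR_syscalls

/* Hypothetical helper so the value can be inspected at run time. */
static int sketch_nr_syscalls(void)
{
	return NR_syscalls;
}
```

      Adding a fourth entry to the table would then change only the
      generated part; the NR_syscalls definition stays untouched.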
    • powerpc/pkeys: Fix handling of pkey state across fork() · 2cd4bd19
      Committed by Ram Pai
      Protection key tracking information is not copied over to the
      mm_struct of the child during fork(). This can cause the child to
      erroneously allocate keys that were already allocated. Any allocated
      execute-only key is lost as well.

      Add code, called by dup_mmap(), to copy the pkey state from parent to
      child explicitly.

      This problem was originally found by Dave Hansen on x86, and turns
      out to be a problem on powerpc as well.
      
      Fixes: cf43d3b2 ("powerpc: Enable pkey subsystem")
      Cc: stable@vger.kernel.org # v4.16+
      Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
      Signed-off-by: Ram Pai <linuxram@us.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
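      The bug and the fix can be modelled in a few lines of userspace C.
      The struct layout, the 32-key limit, and all function names below
      are assumptions made for this sketch; they are not the kernel's
      definitions.

```c
/* Hypothetical, simplified per-mm pkey state. */
struct sketch_mm {
	unsigned int pkey_allocation_map; /* bit n set: key n is allocated */
	int execute_only_pkey;            /* -1 when none is allocated */
};

#define SKETCH_NR_PKEYS 32

/* Allocate the lowest free key, or return -1 if none is left. */
static int sketch_pkey_alloc(struct sketch_mm *mm)
{
	for (int key = 0; key < SKETCH_NR_PKEYS; key++) {
		if (!(mm->pkey_allocation_map & (1u << key))) {
			mm->pkey_allocation_map |= 1u << key;
			return key;
		}
	}
	return -1;
}

/* Fork-time hook: copy the parent's pkey state into the child. */
static void sketch_dup_pkeys(const struct sketch_mm *oldmm,
			     struct sketch_mm *mm)
{
	mm->pkey_allocation_map = oldmm->pkey_allocation_map;
	mm->execute_only_pkey = oldmm->execute_only_pkey;
}
```

      Without the copy step, the child in this model would start from an
      empty map and hand out a key the parent had already allocated.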
    • KVM: PPC: Book3S HV: Introduce kvmhv_update_nest_rmap_rc_list() · 90165d3d
      Committed by Suraj Jitindar Singh
      Introduce a function kvmhv_update_nest_rmap_rc_list() which traverses
      a given nest_rmap list, finds the corresponding pte in the shadow
      page tables, and, if it still maps the same host page, updates the rc
      bits accordingly.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • powerpc/fadump: Do not allow hot-remove memory from fadump reserved area. · 0db6896f
      Committed by Mahesh Salgaonkar
      For fadump to work successfully there should not be any holes in the
      reserved memory ranges where the kernel has asked firmware to move the
      content of old kernel memory in the event of a crash. Now that fadump
      uses CMA for the reserved area, this memory area is not protected from
      hot-remove operations unless it is CMA allocated. Hence, the fadump
      service can fail to re-register after a hot-remove operation if the
      hot-removed memory belongs to the fadump reserved region. To avoid
      this, make sure that memory from the fadump reserved area is not
      hot-removable while fadump is registered.

      However, a user who still wants to remove that memory can do so by
      manually stopping the fadump service before the hot-remove operation.
      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/fadump: Reservationless firmware assisted dump · a4e92ce8
      Committed by Mahesh Salgaonkar
      One of the primary issues with Firmware Assisted Dump (fadump) on Power
      is that it needs a large amount of memory to be reserved. On large
      systems with terabytes of memory, this reservation can be quite
      significant.
      
      In some cases, fadump fails if the memory reserved is insufficient, or
      if the reserved memory was DLPAR hot-removed.
      
      In the normal case, post reboot, the preserved memory is filtered to
      extract only relevant areas of interest using the makedumpfile tool.
      While the tool provides flexibility to determine what needs to be part
      of the dump and what memory to filter out, all supported distributions
      default this to "Capture only kernel data and nothing else".
      
      We take advantage of this default and the Linux kernel's Contiguous
      Memory Allocator (CMA) to fundamentally change the memory reservation
      model for fadump.
      
      Instead of setting aside a significant chunk of memory nobody can use,
      this patch uses CMA to reserve a significant chunk of memory
      that the kernel is prevented from using (due to MIGRATE_CMA), but
      that applications are free to use. With this, fadump will still be able
      to capture all of the kernel memory and most of the user space memory,
      except the user pages that were present in the CMA region.
      
      Essentially, on a P9 LPAR with 2 cores, 8GB RAM and current upstream:
      [root@zzxx-yy10 ~]# free -m
                    total        used        free      shared  buff/cache   available
      Mem:           7557         193        6822          12         541        6725
      Swap:          4095           0        4095
      
      With this patch:
      [root@zzxx-yy10 ~]# free -m
                    total        used        free      shared  buff/cache   available
      Mem:           8133         194        7464          12         475        7338
      Swap:          4095           0        4095
      
      Changes made here are completely transparent to how fadump has
      traditionally worked.
      
      Thanks to Aneesh Kumar and Anshuman Khandual for helping us understand
      CMA and its usage.
      
      TODO:
      - Handle case where CMA reservation spans nodes.
      Signed-off-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv: Move opal_power_control_init() call in opal_init(). · 08fb726d
      Committed by Mahesh Salgaonkar
      opal_power_control_init() depends on the opal message notifier being
      initialized, which is done in opal_init()->opal_message_init(). But both
      of these initializations are called through machine initcalls, so it all
      depends on the order in which they are called. So far they have been
      called in the correct order (maybe we got lucky) and we never saw any
      issue. But it is clearer to control the initialization order explicitly
      by moving opal_power_control_init() into opal_init().
      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  7. 20 Dec, 2018 8 commits
  8. 19 Dec, 2018 8 commits