1. 04 10月, 2014 10 次提交
    • L
      x86: efi: Format EFI memory type & attrs with efi_md_typeattr_format() · ace1d121
      Laszlo Ersek 提交于
      An example log excerpt demonstrating the change:
      
      Before the patch:
      
      > efi: mem00: type=7, attr=0xf, range=[0x0000000000000000-0x000000000009f000) (0MB)
      > efi: mem01: type=2, attr=0xf, range=[0x000000000009f000-0x00000000000a0000) (0MB)
      > efi: mem02: type=7, attr=0xf, range=[0x0000000000100000-0x0000000000400000) (3MB)
      > efi: mem03: type=2, attr=0xf, range=[0x0000000000400000-0x0000000000800000) (4MB)
      > efi: mem04: type=10, attr=0xf, range=[0x0000000000800000-0x0000000000808000) (0MB)
      > efi: mem05: type=7, attr=0xf, range=[0x0000000000808000-0x0000000000810000) (0MB)
      > efi: mem06: type=10, attr=0xf, range=[0x0000000000810000-0x0000000000900000) (0MB)
      > efi: mem07: type=4, attr=0xf, range=[0x0000000000900000-0x0000000001100000) (8MB)
      > efi: mem08: type=7, attr=0xf, range=[0x0000000001100000-0x0000000001400000) (3MB)
      > efi: mem09: type=2, attr=0xf, range=[0x0000000001400000-0x0000000002613000) (18MB)
      > efi: mem10: type=7, attr=0xf, range=[0x0000000002613000-0x0000000004000000) (25MB)
      > efi: mem11: type=4, attr=0xf, range=[0x0000000004000000-0x0000000004020000) (0MB)
      > efi: mem12: type=7, attr=0xf, range=[0x0000000004020000-0x00000000068ea000) (40MB)
      > efi: mem13: type=2, attr=0xf, range=[0x00000000068ea000-0x00000000068f0000) (0MB)
      > efi: mem14: type=3, attr=0xf, range=[0x00000000068f0000-0x0000000006c7b000) (3MB)
      > efi: mem15: type=6, attr=0x800000000000000f, range=[0x0000000006c7b000-0x0000000006c7d000) (0MB)
      > efi: mem16: type=5, attr=0x800000000000000f, range=[0x0000000006c7d000-0x0000000006c85000) (0MB)
      > efi: mem17: type=6, attr=0x800000000000000f, range=[0x0000000006c85000-0x0000000006c87000) (0MB)
      > efi: mem18: type=3, attr=0xf, range=[0x0000000006c87000-0x0000000006ca3000) (0MB)
      > efi: mem19: type=6, attr=0x800000000000000f, range=[0x0000000006ca3000-0x0000000006ca6000) (0MB)
      > efi: mem20: type=10, attr=0xf, range=[0x0000000006ca6000-0x0000000006cc6000) (0MB)
      > efi: mem21: type=6, attr=0x800000000000000f, range=[0x0000000006cc6000-0x0000000006d95000) (0MB)
      > efi: mem22: type=5, attr=0x800000000000000f, range=[0x0000000006d95000-0x0000000006e22000) (0MB)
      > efi: mem23: type=7, attr=0xf, range=[0x0000000006e22000-0x0000000007165000) (3MB)
      > efi: mem24: type=4, attr=0xf, range=[0x0000000007165000-0x0000000007d22000) (11MB)
      > efi: mem25: type=7, attr=0xf, range=[0x0000000007d22000-0x0000000007d25000) (0MB)
      > efi: mem26: type=3, attr=0xf, range=[0x0000000007d25000-0x0000000007ea2000) (1MB)
      > efi: mem27: type=5, attr=0x800000000000000f, range=[0x0000000007ea2000-0x0000000007ed2000) (0MB)
      > efi: mem28: type=6, attr=0x800000000000000f, range=[0x0000000007ed2000-0x0000000007ef6000) (0MB)
      > efi: mem29: type=7, attr=0xf, range=[0x0000000007ef6000-0x0000000007f00000) (0MB)
      > efi: mem30: type=9, attr=0xf, range=[0x0000000007f00000-0x0000000007f02000) (0MB)
      > efi: mem31: type=10, attr=0xf, range=[0x0000000007f02000-0x0000000007f06000) (0MB)
      > efi: mem32: type=4, attr=0xf, range=[0x0000000007f06000-0x0000000007fd0000) (0MB)
      > efi: mem33: type=6, attr=0x800000000000000f, range=[0x0000000007fd0000-0x0000000007ff0000) (0MB)
      > efi: mem34: type=7, attr=0xf, range=[0x0000000007ff0000-0x0000000008000000) (0MB)
      
      After the patch:
      
      > efi: mem00: [Conventional Memory|   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000000000-0x000000000009f000) (0MB)
      > efi: mem01: [Loader Data        |   |  |  |  |   |WB|WT|WC|UC] range=[0x000000000009f000-0x00000000000a0000) (0MB)
      > efi: mem02: [Conventional Memory|   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000100000-0x0000000000400000) (3MB)
      > efi: mem03: [Loader Data        |   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000400000-0x0000000000800000) (4MB)
      > efi: mem04: [ACPI Memory NVS    |   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000800000-0x0000000000808000) (0MB)
      > efi: mem05: [Conventional Memory|   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000808000-0x0000000000810000) (0MB)
      > efi: mem06: [ACPI Memory NVS    |   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000810000-0x0000000000900000) (0MB)
      > efi: mem07: [Boot Data          |   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000900000-0x0000000001100000) (8MB)
      > efi: mem08: [Conventional Memory|   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000001100000-0x0000000001400000) (3MB)
      > efi: mem09: [Loader Data        |   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000001400000-0x0000000002613000) (18MB)
      > efi: mem10: [Conventional Memory|   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000002613000-0x0000000004000000) (25MB)
      > efi: mem11: [Boot Data          |   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000004000000-0x0000000004020000) (0MB)
      > efi: mem12: [Conventional Memory|   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000004020000-0x00000000068ea000) (40MB)
      > efi: mem13: [Loader Data        |   |  |  |  |   |WB|WT|WC|UC] range=[0x00000000068ea000-0x00000000068f0000) (0MB)
      > efi: mem14: [Boot Code          |   |  |  |  |   |WB|WT|WC|UC] range=[0x00000000068f0000-0x0000000006c7b000) (3MB)
      > efi: mem15: [Runtime Data       |RUN|  |  |  |   |WB|WT|WC|UC] range=[0x0000000006c7b000-0x0000000006c7d000) (0MB)
      > efi: mem16: [Runtime Code       |RUN|  |  |  |   |WB|WT|WC|UC] range=[0x0000000006c7d000-0x0000000006c85000) (0MB)
      > efi: mem17: [Runtime Data       |RUN|  |  |  |   |WB|WT|WC|UC] range=[0x0000000006c85000-0x0000000006c87000) (0MB)
      > efi: mem18: [Boot Code          |   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000006c87000-0x0000000006ca3000) (0MB)
      > efi: mem19: [Runtime Data       |RUN|  |  |  |   |WB|WT|WC|UC] range=[0x0000000006ca3000-0x0000000006ca6000) (0MB)
      > efi: mem20: [ACPI Memory NVS    |   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000006ca6000-0x0000000006cc6000) (0MB)
      > efi: mem21: [Runtime Data       |RUN|  |  |  |   |WB|WT|WC|UC] range=[0x0000000006cc6000-0x0000000006d95000) (0MB)
      > efi: mem22: [Runtime Code       |RUN|  |  |  |   |WB|WT|WC|UC] range=[0x0000000006d95000-0x0000000006e22000) (0MB)
      > efi: mem23: [Conventional Memory|   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000006e22000-0x0000000007165000) (3MB)
      > efi: mem24: [Boot Data          |   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000007165000-0x0000000007d22000) (11MB)
      > efi: mem25: [Conventional Memory|   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000007d22000-0x0000000007d25000) (0MB)
      > efi: mem26: [Boot Code          |   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000007d25000-0x0000000007ea2000) (1MB)
      > efi: mem27: [Runtime Code       |RUN|  |  |  |   |WB|WT|WC|UC] range=[0x0000000007ea2000-0x0000000007ed2000) (0MB)
      > efi: mem28: [Runtime Data       |RUN|  |  |  |   |WB|WT|WC|UC] range=[0x0000000007ed2000-0x0000000007ef6000) (0MB)
      > efi: mem29: [Conventional Memory|   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000007ef6000-0x0000000007f00000) (0MB)
      > efi: mem30: [ACPI Reclaim Memory|   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000007f00000-0x0000000007f02000) (0MB)
      > efi: mem31: [ACPI Memory NVS    |   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000007f02000-0x0000000007f06000) (0MB)
      > efi: mem32: [Boot Data          |   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000007f06000-0x0000000007fd0000) (0MB)
      > efi: mem33: [Runtime Data       |RUN|  |  |  |   |WB|WT|WC|UC] range=[0x0000000007fd0000-0x0000000007ff0000) (0MB)
      > efi: mem34: [Conventional Memory|   |  |  |  |   |WB|WT|WC|UC] range=[0x0000000007ff0000-0x0000000008000000) (0MB)
      
      Both the type enum and the attribute bitmap are decoded, with the
      additional benefit that the memory ranges line up as well.
      Signed-off-by: NLaszlo Ersek <lersek@redhat.com>
      Acked-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: NMatt Fleming <matt.fleming@intel.com>
      ace1d121
    • D
      x86/efi: Clear EFI_RUNTIME_SERVICES if failing to enter virtual mode · a5a750a9
      Dave Young 提交于
      If enter virtual mode failed due to some reason other than the efi call
      the EFI_RUNTIME_SERVICES bit in efi.flags should be cleared thus users
      of efi runtime services can check the bit and handle the case instead of
      assume efi runtime is ok.
      
      Per Matt, if efi call SetVirtualAddressMap fails we will be not sure
      it's safe to make any assumptions about the state of the system. So
      kernel panics instead of clears EFI_RUNTIME_SERVICES bit.
      Signed-off-by: NDave Young <dyoung@redhat.com>
      Signed-off-by: NMatt Fleming <matt.fleming@intel.com>
      a5a750a9
    • D
      arm64/efi: Do not enter virtual mode if booting with efi=noruntime or noefi · 6632210f
      Dave Young 提交于
      In case efi runtime disabled via noefi kernel cmdline
      arm64_enter_virtual_mode should error out.
      
      At the same time move early_memunmap(memmap.map, mapsize) to the
      beginning of the function or it will leak early mem.
      Signed-off-by: NDave Young <dyoung@redhat.com>
      Reviewed-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: NMatt Fleming <matt.fleming@intel.com>
      6632210f
    • D
      arm64/efi: uefi_init error handling fix · 88f8abd5
      Dave Young 提交于
      There's one early memmap leak in uefi_init error path, fix it and
      slightly tune the error handling code.
      Signed-off-by: NDave Young <dyoung@redhat.com>
      Acked-by: NMark Salter <msalter@redhat.com>
      Reported-by: NWill Deacon <will.deacon@arm.com>
      Acked-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: NMatt Fleming <matt.fleming@intel.com>
      88f8abd5
    • D
      efi: Add kernel param efi=noruntime · 5ae3683c
      Dave Young 提交于
      noefi kernel param means actually disabling efi runtime, Per suggestion
      from Leif Lindholm efi=noruntime should be better. But since noefi is
      already used in X86 thus just adding another param efi=noruntime for
      same purpose.
      Signed-off-by: NDave Young <dyoung@redhat.com>
      Signed-off-by: NMatt Fleming <matt.fleming@intel.com>
      5ae3683c
    • D
      lib: Add a generic cmdline parse function parse_option_str · 6ccc72b8
      Dave Young 提交于
      There should be a generic function to parse params like a=b,c
      Adding parse_option_str in lib/cmdline.c which will return true
      if there's specified option set in the params.
      
      Also updated efi=old_map parsing code to use the new function
      Signed-off-by: NDave Young <dyoung@redhat.com>
      Signed-off-by: NMatt Fleming <matt.fleming@intel.com>
      6ccc72b8
    • D
      efi: Move noefi early param code out of x86 arch code · b2e0a54a
      Dave Young 提交于
      noefi param can be used for arches other than X86 later, thus move it
      out of x86 platform code.
      Signed-off-by: NDave Young <dyoung@redhat.com>
      Signed-off-by: NMatt Fleming <matt.fleming@intel.com>
      b2e0a54a
    • J
      efi-bgrt: Add error handling; inform the user when ignoring the BGRT · 1282278e
      Josh Triplett 提交于
      Gracefully handle failures to allocate memory for the image, which might
      be arbitrarily large.
      
      efi_bgrt_init can fail in various ways as well, usually because the
      BIOS-provided BGRT structure does not match expectations.  Add
      appropriate error messages rather than failing silently.
      Reported-by: NSrihari Vijayaraghavan <linux.bug.reporting@gmail.com>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=81321Signed-off-by: NJosh Triplett <josh@joshtriplett.org>
      Signed-off-by: NMatt Fleming <matt.fleming@intel.com>
      1282278e
    • M
      efi: Add efi= parameter parsing to the EFI boot stub · 5a17dae4
      Matt Fleming 提交于
      We need a way to customize the behaviour of the EFI boot stub, in
      particular, we need a way to disable the "chunking" workaround, used
      when reading files from the EFI System Partition.
      
      One of my machines doesn't cope well when reading files in 1MB chunks to
      a buffer above the 4GB mark - it appears that the "chunking" bug
      workaround triggers another firmware bug. This was only discovered with
      commit 4bf7111f ("x86/efi: Support initrd loaded above 4G"), and
      that commit is perfectly valid. The symptom I observed was a corrupt
      initrd rather than any kind of crash.
      
      efi= is now used to specify EFI parameters in two very different
      execution environments, the EFI boot stub and during kernel boot.
      
      There is also a slight performance optimization by enabling efi=nochunk,
      but that's offset by the fact that you're more likely to run into
      firmware issues, at least on x86. This is the rationale behind leaving
      the workaround enabled by default.
      
      Also provide some documentation for EFI_READ_CHUNK_SIZE and why we're
      using the current value of 1MB.
      Tested-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Roy Franz <roy.franz@linaro.org>
      Cc: Maarten Lankhorst <m.b.lankhorst@gmail.com>
      Cc: Leif Lindholm <leif.lindholm@linaro.org>
      Cc: Borislav Petkov <bp@suse.de>
      Signed-off-by: NMatt Fleming <matt.fleming@intel.com>
      5a17dae4
    • A
      efi: Implement mandatory locking for UEFI Runtime Services · 161485e8
      Ard Biesheuvel 提交于
      According to section 7.1 of the UEFI spec, Runtime Services are not fully
      reentrant, and there are particular combinations of calls that need to be
      serialized. Use a spinlock to serialize all Runtime Services with respect
      to all others, even if this is more than strictly needed.
      
      We've managed to get away without requiring a runtime services lock
      until now because most of the interactions with EFI involve EFI
      variables, and those operations are already serialised with
      __efivars->lock.
      
      Some of the assumptions underlying the decision whether locks are
      needed or not (e.g., SetVariable() against ResetSystem()) may not
      apply universally to all [new] architectures that implement UEFI.
      Rather than try to reason our way out of this, let's just implement at
      least what the spec requires in terms of locking.
      Signed-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: NMatt Fleming <matt.fleming@intel.com>
      161485e8
  2. 16 8月, 2014 1 次提交
  3. 14 8月, 2014 4 次提交
  4. 13 8月, 2014 25 次提交
    • A
      powerpc/thp: Add tracepoints to track hugepage invalidate · 9e813308
      Aneesh Kumar K.V 提交于
      Add tracepoint to track hugepage invalidate. This help us
      in debugging difficult to track bugs.
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      9e813308
    • A
      powerpc/mm: Use read barrier when creating real_pte · 85c1fafd
      Aneesh Kumar K.V 提交于
      On ppc64 we support 4K hash pte with 64K page size. That requires
      us to track the hash pte slot information on a per 4k basis. We do that
      by storing the slot details in the second half of pte page. The pte bit
      _PAGE_COMBO is used to indicate whether the second half need to be
      looked while building real_pte. We need to use read memory barrier while
      doing that so that load of hidx is not reordered w.r.t _PAGE_COMBO
      check. On the store side we already do a lwsync in __hash_page_4K
      
      CC: <stable@vger.kernel.org>
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      85c1fafd
    • A
      powerpc/thp: Use ACCESS_ONCE when loading pmdp · 7e467245
      Aneesh Kumar K.V 提交于
      We would get wrong results in compiler recomputed old_pmd. Avoid
      that by using ACCESS_ONCE
      
      CC: <stable@vger.kernel.org>
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      7e467245
    • A
      powerpc/thp: Invalidate with vpn in loop · 969b7b20
      Aneesh Kumar K.V 提交于
      As per ISA, for 4k base page size we compare 14..65 bits of VA specified
      with the entry_VA in tlb. That implies we need to make sure we do a
      tlbie with all the possible 4k va we used to access the 16MB hugepage.
      With 64k base page size we compare 14..57 bits of VA. Hence we cannot
      ignore the lower 24 bits of va while tlbie .We also cannot tlb
      invalidate a 16MB entry with just one tlbie instruction because
      we don't track which va was used to instantiate the tlb entry.
      
      CC: <stable@vger.kernel.org>
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      969b7b20
    • A
      powerpc/thp: Handle combo pages in invalidate · fc047955
      Aneesh Kumar K.V 提交于
      If we changed base page size of the segment, either via sub_page_protect
      or via remap_4k_pfn, we do a demote_segment which doesn't flush the hash
      table entries. We do a lazy hash page table flush for all mapped pages
      in the demoted segment. This happens when we handle hash page fault for
      these pages.
      
      We use _PAGE_COMBO bit along with _PAGE_HASHPTE to indicate whether a
      pte is backed by 4K hash pte. If we find _PAGE_COMBO not set on the pte,
      that implies that we could possibly have older 64K hash pte entries in
      the hash page table and we need to invalidate those entries.
      
      Use _PAGE_COMBO to determine the page size with which we should
      invalidate the hash table entries on unmap.
      
      CC: <stable@vger.kernel.org>
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      fc047955
    • A
      powerpc/thp: Invalidate old 64K based hash page mapping before insert of 4k pte · 629149fa
      Aneesh Kumar K.V 提交于
      If we changed base page size of the segment, either via sub_page_protect
      or via remap_4k_pfn, we do a demote_segment which doesn't flush the hash
      table entries. We do a lazy hash page table flush for all mapped pages
      in the demoted segment. This happens when we handle hash page fault
      for these pages.
      
      We use _PAGE_COMBO bit along with _PAGE_HASHPTE to indicate whether a
      pte is backed by 4K hash pte. If we find _PAGE_COMBO not set on the pte,
      that implies that we could possibly have older 64K hash pte entries in
      the hash page table and we need to invalidate those entries.
      
      Handle this correctly for 16M pages
      
      CC: <stable@vger.kernel.org>
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      629149fa
    • A
      powerpc/thp: Don't recompute vsid and ssize in loop on invalidate · fa1f8ae8
      Aneesh Kumar K.V 提交于
      The segment identifier and segment size will remain the same in
      the loop, So we can compute it outside. We also change the
      hugepage_invalidate interface so that we can use it the later patch
      
      CC: <stable@vger.kernel.org>
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      fa1f8ae8
    • A
      powerpc/thp: Add write barrier after updating the valid bit · b0aa44a3
      Aneesh Kumar K.V 提交于
      With hugepages, we store the hpte valid information in the pte page
      whose address is stored in the second half of the PMD. Use a
      write barrier to make sure clearing pmd busy bit and updating
      hpte valid info are ordered properly.
      
      CC: <stable@vger.kernel.org>
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      b0aa44a3
    • N
      powerpc: reorder per-cpu NUMA information's initialization · 2fabf084
      Nishanth Aravamudan 提交于
      There is an issue currently where NUMA information is used on powerpc
      (and possibly ia64) before it has been read from the device-tree, which
      leads to large slab consumption with CONFIG_SLUB and memoryless nodes.
      
      NUMA powerpc non-boot CPU's cpu_to_node/cpu_to_mem is only accurate
      after start_secondary(), similar to ia64, which is invoked via
      smp_init().
      
      Commit 6ee0578b ("workqueue: mark init_workqueues() as
      early_initcall()") made init_workqueues() be invoked via
      do_pre_smp_initcalls(), which is obviously before the secondary
      processors are online.
      
      Additionally, the following commits changed init_workqueues() to use
      cpu_to_node to determine the node to use for kthread_create_on_node:
      
      bce90380 ("workqueue: add wq_numa_tbl_len and
      wq_numa_possible_cpumask[]")
      f3f90ad4 ("workqueue: determine NUMA node of workers accourding to
      the allowed cpumask")
      
      Therefore, when init_workqueues() runs, it sees all CPUs as being on
      Node 0. On LPARs or KVM guests where Node 0 is memoryless, this leads to
      a high number of slab deactivations
      (http://www.spinics.net/lists/linux-mm/msg67489.html).
      
      Fix this by initializing the powerpc-specific CPU<->node/local memory
      node mapping as early as possible, which on powerpc is
      do_init_bootmem(). Currently that function initializes the mapping for
      the boot CPU, but we extend it to setup the mapping for all possible
      CPUs. Then, in smp_prepare_cpus(), we can correspondingly set the
      per-cpu values for all possible CPUs. That ensures that before the
      early_initcalls run (and really as early as possible), the per-cpu NUMA
      mapping is accurate.
      
      While testing memoryless nodes on PowerKVM guests with a fix to the
      workqueue logic to use cpu_to_mem() instead of cpu_to_node(), with a
      guest topology of:
      
      available: 2 nodes (0-1)
      node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
      node 0 size: 0 MB
      node 0 free: 0 MB
      node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
      node 1 size: 16336 MB
      node 1 free: 15329 MB
      node distances:
      node   0   1
        0:  10  40
        1:  40  10
      
      the slab consumption decreases from
      
      Slab:             932416 kB
      SUnreclaim:       902336 kB
      
      to
      
      Slab:             395264 kB
      SUnreclaim:       359424 kB
      
      And we a corresponding increase in the slab efficiency from
      
      slab                                   mem     objs    slabs
                                            used   active   active
      ------------------------------------------------------------
      kmalloc-16384                       337 MB   11.28%  100.00%
      task_struct                         288 MB    9.93%  100.00%
      
      to
      
      slab                                   mem     objs    slabs
                                            used   active   active
      ------------------------------------------------------------
      kmalloc-16384                        37 MB  100.00%  100.00%
      task_struct                          31 MB  100.00%  100.00%
      
      Powerpc didn't support memoryless nodes until recently (64bb80d8
      "powerpc/numa: Enable CONFIG_HAVE_MEMORYLESS_NODES" and 8c272261
      "powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"). Those commits also
      helped improve memory consumption with these kind of environments.
      Signed-off-by: NNishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      2fabf084
    • H
      powerpc/perf/hv-24x7: Use kmem_cache_free · d6589722
      Himangi Saraogi 提交于
      Free memory allocated using kmem_cache_zalloc using kmem_cache_free
      rather than kfree.
      
      The Coccinelle semantic patch that makes this change is as follows:
      
      // <smpl>
      @@
      expression x,E,c;
      @@
      
       x = \(kmem_cache_alloc\|kmem_cache_zalloc\|kmem_cache_alloc_node\)(c,...)
       ... when != x = E
           when != &x
      ?-kfree(x)
      +kmem_cache_free(c,x)
      // </smpl>
      Signed-off-by: NHimangi Saraogi <himangi774@gmail.com>
      Acked-by: NJulia Lawall <julia.lawall@lip6.fr>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      d6589722
    • T
      powerpc/pseries/hvcserver: Fix endian issue in hvcs_get_partner_info · 587870e8
      Thomas Falcon 提交于
      A buffer returned by H_VTERM_PARTNER_INFO contains device information
      in big endian format, causing problems for little endian architectures.
      This patch ensures that they are in cpu endian.
      Signed-off-by: NThomas Falcon <tlfalcon@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      587870e8
    • A
      powerpc: Hard disable interrupts in xmon · a71d64b4
      Anton Blanchard 提交于
      xmon only soft disables interrupts. This seems like a bad idea - we
      certainly don't want decrementer and PMU exceptions going off when
      we are debugging something inside xmon.
      
      This issue was uncovered when the hard lockup detector went off
      inside xmon. To ensure we wont get a spurious hard lockup warning,
      I also call touch_nmi_watchdog() when exiting xmon.
      Signed-off-by: NAnton Blanchard <anton@samba.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      a71d64b4
    • N
      powerpc: remove duplicate definition of TEXASR_FS · 56758e3c
      Nishanth Aravamudan 提交于
      It appears that commits 7f06f21d ("powerpc/tm: Add checking to
      treclaim/trechkpt") and e4e38121 ("KVM: PPC: Book3S HV: Add
      transactional memory support") both added definitions of TEXASR_FS.
      Remove one of them. At the same time, fix the alignment of the remaining
      definition (should be tab-separated like the rest of the #defines).
      Signed-off-by: NNishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      56758e3c
    • G
      powerpc/pseries: Avoid deadlock on removing ddw · 5efbabe0
      Gavin Shan 提交于
      Function remove_ddw() could be called in of_reconfig_notifier and
      we potentially remove the dynamic DMA window property, which invokes
      of_reconfig_notifier again. Eventually, it leads to the deadlock as
      following backtrace shows.
      
      The patch fixes the above issue by deferring releasing the dynamic
      DMA window property while releasing the device node.
      
      =============================================
      [ INFO: possible recursive locking detected ]
      3.16.0+ #428 Tainted: G        W
      ---------------------------------------------
      drmgr/2273 is trying to acquire lock:
       ((of_reconfig_chain).rwsem){.+.+..}, at: [<c000000000091890>] \
       .__blocking_notifier_call_chain+0x40/0x78
      
      but task is already holding lock:
       ((of_reconfig_chain).rwsem){.+.+..}, at: [<c000000000091890>] \
       .__blocking_notifier_call_chain+0x40/0x78
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock((of_reconfig_chain).rwsem);
        lock((of_reconfig_chain).rwsem);
       *** DEADLOCK ***
      
       May be due to missing lock nesting notation
      
      2 locks held by drmgr/2273:
       #0:  (sb_writers#4){.+.+.+}, at: [<c0000000001cbe70>] \
            .vfs_write+0xb0/0x1f8
       #1:  ((of_reconfig_chain).rwsem){.+.+..}, at: [<c000000000091890>] \
            .__blocking_notifier_call_chain+0x40/0x78
      
      stack backtrace:
      CPU: 17 PID: 2273 Comm: drmgr Tainted: G        W     3.16.0+ #428
      Call Trace:
      [c0000000137e7000] [c000000000013d9c] .show_stack+0x88/0x148 (unreliable)
      [c0000000137e70b0] [c00000000083cd34] .dump_stack+0x7c/0x9c
      [c0000000137e7130] [c0000000000b8afc] .__lock_acquire+0x128c/0x1c68
      [c0000000137e7280] [c0000000000b9a4c] .lock_acquire+0xe8/0x104
      [c0000000137e7350] [c00000000083588c] .down_read+0x4c/0x90
      [c0000000137e73e0] [c000000000091890] .__blocking_notifier_call_chain+0x40/0x78
      [c0000000137e7490] [c000000000091900] .blocking_notifier_call_chain+0x38/0x48
      [c0000000137e7520] [c000000000682a28] .of_reconfig_notify+0x34/0x5c
      [c0000000137e75b0] [c000000000682a9c] .of_property_notify+0x4c/0x54
      [c0000000137e7650] [c000000000682bf0] .of_remove_property+0x30/0xd4
      [c0000000137e76f0] [c000000000052a44] .remove_ddw+0x144/0x168
      [c0000000137e7790] [c000000000053204] .iommu_reconfig_notifier+0x30/0xe0
      [c0000000137e7820] [c00000000009137c] .notifier_call_chain+0x6c/0xb4
      [c0000000137e78c0] [c0000000000918ac] .__blocking_notifier_call_chain+0x5c/0x78
      [c0000000137e7970] [c000000000091900] .blocking_notifier_call_chain+0x38/0x48
      [c0000000137e7a00] [c000000000682a28] .of_reconfig_notify+0x34/0x5c
      [c0000000137e7a90] [c000000000682e14] .of_detach_node+0x44/0x1fc
      [c0000000137e7b40] [c0000000000518e4] .ofdt_write+0x3ac/0x688
      [c0000000137e7c20] [c000000000238430] .proc_reg_write+0xb8/0xd4
      [c0000000137e7cd0] [c0000000001cbeac] .vfs_write+0xec/0x1f8
      [c0000000137e7d70] [c0000000001cc3b0] .SyS_write+0x58/0xa0
      [c0000000137e7e30] [c00000000000a064] syscall_exit+0x0/0x98
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      5efbabe0
    • G
      powerpc/pseries: Failure on removing device node · f1b3929c
      Gavin Shan 提交于
      While running command "drmgr -c phb -r -s 'PHB 528'", following
      backtrace jumped out because the target device node isn't marked
      with OF_DETACHED by of_detach_node(), which caused by error
      returned from memory hotplug related reconfig notifier when
      disabling CONFIG_MEMORY_HOTREMOVE. The patch fixes it.
      
      ERROR: Bad of_node_put() on /pci@800000020000210/ethernet@0
      CPU: 14 PID: 2252 Comm: drmgr Tainted: G        W     3.16.0+ #427
      Call Trace:
      [c000000012a776a0] [c000000000013d9c] .show_stack+0x88/0x148 (unreliable)
      [c000000012a77750] [c00000000083cd34] .dump_stack+0x7c/0x9c
      [c000000012a777d0] [c0000000006807c4] .of_node_release+0x58/0xe0
      [c000000012a77860] [c00000000038a7d0] .kobject_release+0x174/0x1b8
      [c000000012a77900] [c00000000038a884] .kobject_put+0x70/0x78
      [c000000012a77980] [c000000000681680] .of_node_put+0x28/0x34
      [c000000012a77a00] [c000000000681ea8] .__of_get_next_child+0x64/0x70
      [c000000012a77a90] [c000000000682138] .of_find_node_by_path+0x1b8/0x20c
      [c000000012a77b40] [c000000000051840] .ofdt_write+0x308/0x688
      [c000000012a77c20] [c000000000238430] .proc_reg_write+0xb8/0xd4
      [c000000012a77cd0] [c0000000001cbeac] .vfs_write+0xec/0x1f8
      [c000000012a77d70] [c0000000001cc3b0] .SyS_write+0x58/0xa0
      [c000000012a77e30] [c00000000000a064] syscall_exit+0x0/0x98
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      f1b3929c
    • B
      powerpc/boot: Use correct zlib types for comparison · ce8f150a
      Benjamin Herrenschmidt 提交于
      Avoids this warning:
      
      arch/powerpc/boot/gunzip_util.c:118:9: warning: comparison of distinct pointer types lacks a cast
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      ce8f150a
    • V
      powerpc/powernv: Interface to register/unregister opal dump region · b09c2ec4
      Vasant Hegde 提交于
      PowerNV platform is capable of capturing host memory region when system
      crashes (because of host/firmware). We have new OPAL API to register/
      unregister memory region to be captured when system crashes.
      
      This patch adds support for new API. Also during boot time we register
      kernel log buffer and unregister before doing kexec.
      Signed-off-by: NVasant Hegde <hegdevasant@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      b09c2ec4
    • M
      powerpc: Add POWER8 features to CPU_FTRS_POSSIBLE/ALWAYS · 3609e09f
      Michael Ellerman 提交于
      We have been a bit slack about updating the CPU_FTRS_POSSIBLE and
      CPU_FTRS_ALWAYS masks. When we added POWER8, and also POWER8E we forgot
      to update the ALWAYS mask. And when we added POWER8_DD1 we forgot to
      update both the POSSIBLE and ALWAYS masks.
      
      Luckily this hasn't caused any actual bugs AFAICS. Failing to update the
      ALWAYS mask just forgoes a potential optimisation opportunity. Failing
      to update the POSSIBLE mask for POWER8_DD1 is also OK because it only
      removes a bit rather than adding any.
      
      Regardless they should all be in both masks so as to avoid any future
      bugs when the set of ALWAYS/POSSIBLE bits changes, or the masks
      themselves change.
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Acked-by: NMichael Neuling <mikey@neuling.org>
      Acked-by: NJoel Stanley <joel@jms.id.au>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      3609e09f
    • A
      powerpc/ppc476: Disable BTAC · 97b3be1e
      Alistair Popple 提交于
      This patch disables the branch target address CAM which under specific
      circumstances may cause the processor to skip execution of 1-4
      instructions. This fixes IBM Erratum #47.
      Signed-off-by: NAlistair Popple <alistair@popple.id.au>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      97b3be1e
    • G
      powerpc/powernv: Fix IOMMU group lost · 763fe0ad
      Gavin Shan 提交于
      When we take full hotplug to recover from EEH errors, PCI buses
      could be involved. For the case, the child devices of involved
      PCI buses can't be attached to IOMMU group properly, which is
      caused by commit 3f28c5af ("powerpc/powernv: Reduce multi-hit of
      iommu_add_device()").
      
      When adding the PCI devices of the newly created PCI buses to
      the system, the IOMMU group is expected to be added in (C).
      (A) fails to bind the IOMMU group because bus->is_added is
      false. (B) fails because the device doesn't have binding IOMMU
      table yet. bus->is_added is set to true at end of (C) and
      pdev->is_added is set to true at (D).
      
         pcibios_add_pci_devices()
            pci_scan_bridge()
               pci_scan_child_bus()
                  pci_scan_slot()
                     pci_scan_single_device()
                        pci_scan_device()
                        pci_device_add()
                           pcibios_add_device()           A: Ignore
                           device_add()                   B: Ignore
                        pcibios_fixup_bus()
                           pcibios_setup_bus_devices()
                              pcibios_setup_device()      C: Hit
            pcibios_finish_adding_to_bus()
               pci_bus_add_devices()
                  pci_bus_add_device()                    D: Add device
      
      If the parent PCI bus isn't involved in hotplug, the IOMMU
      group is expected to be bound in (B). (A) should fail as the
      sysfs entries aren't populated.
      
      The patch fixes the issue by reverting commit 3f28c5af and remove
      WARN_ON() in iommu_add_device() to allow calling the function
      even the specified device already has associated IOMMU group.
      
      Cc: <stable@vger.kernel.org>  # 3.16+
      Reported-by: NThadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
      Signed-off-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
      Acked-by: NWei Yang <weiyang@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      763fe0ad
    • M
      powerpc: Add smp_mb()s to arch_spin_unlock_wait() · 78e05b14
      Michael Ellerman 提交于
      Similar to the previous commit which described why we need to add a
      barrier to arch_spin_is_locked(), we have a similar problem with
      spin_unlock_wait().
      
      We need a barrier on entry to ensure any spinlock we have previously
      taken is visibly locked prior to the load of lock->slock.
      
      It's also not clear if spin_unlock_wait() is intended to have ACQUIRE
      semantics. For now be conservative and add a barrier on exit to give it
      ACQUIRE semantics.
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      78e05b14
    • M
      powerpc: Add smp_mb() to arch_spin_is_locked() · 51d7d520
      Michael Ellerman 提交于
      The kernel defines the function spin_is_locked(), which can be used to
      check if a spinlock is currently locked.
      
      Using spin_is_locked() on a lock you don't hold is obviously racy. That
      is, even though you may observe that the lock is unlocked, it may become
      locked at any time.
      
      There is (at least) one exception to that, which is if two locks are
      used as a pair, and the holder of each checks the status of the other
      before doing any update.
      
      Assuming *A and *B are two locks, and *COUNTER is a shared non-atomic
      value:
      
      The first CPU does:
      
      	spin_lock(*A)
      
      	if spin_is_locked(*B)
      		# nothing
      	else
      		smp_mb()
      		LOAD r = *COUNTER
      		r++
      		STORE *COUNTER = r
      
      	spin_unlock(*A)
      
      And the second CPU does:
      
      	spin_lock(*B)
      
      	if spin_is_locked(*A)
      		# nothing
      	else
      		smp_mb()
      		LOAD r = *COUNTER
      		r++
      		STORE *COUNTER = r
      
      	spin_unlock(*B)
      
      Although this is a strange locking construct, it should work.
      
      It seems to be understood, but not documented, that spin_is_locked() is
      not a memory barrier, so in the examples above and below the caller
      inserts its own memory barrier before acting on the result of
      spin_is_locked().
      
      For now we assume spin_is_locked() is implemented as below, and we break
      it out in our examples:
      
      	bool spin_is_locked(*LOCK) {
      		LOAD l = *LOCK
      		return l.locked
      	}
      
      Our intuition is that there should be no problem even if the two code
      sequences run simultaneously such as:
      
      	CPU 0			CPU 1
      	==================================================
      	spin_lock(*A)		spin_lock(*B)
      	LOAD b = *B		LOAD a = *A
      	if b.locked # true	if a.locked # true
      	# nothing		# nothing
      	spin_unlock(*A)		spin_unlock(*B)
      
      If one CPU gets the lock before the other then it will do the update and
      the other CPU will back off:
      
      	CPU 0			CPU 1
      	==================================================
      	spin_lock(*A)
      	LOAD b = *B
      				spin_lock(*B)
      	if b.locked # false	LOAD a = *A
      	else			if a.locked # true
      	smp_mb()		# nothing
      	LOAD r1 = *COUNTER	spin_unlock(*B)
      	r1++
      	STORE *COUNTER = r1
      	spin_unlock(*A)
      
      However in reality spin_lock() itself is not indivisible. On powerpc we
      implement it as a load-and-reserve and store-conditional.
      
      Ignoring the retry logic for the lost reservation case, it boils down to:
      	spin_lock(*LOCK) {
      		LOAD l = *LOCK
      		l.locked = true
      		STORE *LOCK = l
      		ACQUIRE_BARRIER
      	}
      
      The ACQUIRE_BARRIER is required to give spin_lock() ACQUIRE semantics as
      defined in memory-barriers.txt:
      
           This acts as a one-way permeable barrier.  It guarantees that all
           memory operations after the ACQUIRE operation will appear to happen
           after the ACQUIRE operation with respect to the other components of
           the system.
      
      On modern powerpc systems we use lwsync for ACQUIRE_BARRIER. lwsync is
      also know as "lightweight sync", or "sync 1".
      
      As described in Power ISA v2.07 section B.2.1.1, in this scenario the
      lwsync is not the barrier itself. It instead causes the LOAD of *LOCK to
      act as the barrier, preventing any loads or stores in the locked region
      from occurring prior to the load of *LOCK.
      
      Whether this behaviour is in accordance with the definition of ACQUIRE
      semantics in memory-barriers.txt is open to discussion, we may switch to
      a different barrier in future.
      
      What this means in practice is that the following can occur:
      
      	CPU 0			CPU 1
      	==================================================
      	LOAD a = *A 		LOAD b = *B
      	a.locked = true		b.locked = true
      	LOAD b = *B		LOAD a = *A
      	STORE *A = a		STORE *B = b
      	if b.locked # false	if a.locked # false
      	else			else
      	smp_mb()		smp_mb()
      	LOAD r1 = *COUNTER	LOAD r2 = *COUNTER
      	r1++			r2++
      	STORE *COUNTER = r1
      				STORE *COUNTER = r2	# Lost update
      	spin_unlock(*A)		spin_unlock(*B)
      
      That is, the load of *B can occur prior to the store that makes *A
      visibly locked. And similarly for CPU 1. The result is both CPUs hold
      their lock and believe the other lock is unlocked.
      
      The easiest fix for this is to add a full memory barrier to the start of
      spin_is_locked(), so adding to our previous definition would give us:
      
      	bool spin_is_locked(*LOCK) {
      		smp_mb()
      		LOAD l = *LOCK
      		return l.locked
      	}
      
      The new barrier orders the store to the lock we are locking vs the load
      of the other lock:
      
      	CPU 0			CPU 1
      	==================================================
      	LOAD a = *A 		LOAD b = *B
      	a.locked = true		b.locked = true
      	STORE *A = a		STORE *B = b
      	smp_mb()		smp_mb()
      	LOAD b = *B		LOAD a = *A
      	if b.locked # true	if a.locked # true
      	# nothing		# nothing
      	spin_unlock(*A)		spin_unlock(*B)
      
      Although the above example is theoretical, there is code similar to this
      example in sem_lock() in ipc/sem.c. This commit in addition to the next
      commit appears to be a fix for crashes we are seeing in that code where
      we believe this race happens in practice.
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      51d7d520
    • G
      powerpc: Fix "attempt to move .org backwards" error · 11d54904
      Guenter Roeck 提交于
      Once again, we see
      
      arch/powerpc/kernel/exceptions-64s.S: Assembler messages:
      arch/powerpc/kernel/exceptions-64s.S:865: Error: attempt to move .org backwards
      arch/powerpc/kernel/exceptions-64s.S:866: Error: attempt to move .org backwards
      arch/powerpc/kernel/exceptions-64s.S:890: Error: attempt to move .org backwards
      
      when compiling ppc:allmodconfig.
      
      This time the problem has been caused by to commit 0869b6fd
      ("powerpc/book3s: Add basic infrastructure to handle HMI in Linux"),
      which adds functions hmi_exception_early and hmi_exception_after_realmode
      into a critical (size-limited) code area, even though that does not appear
      to be necessary.
      
      Move those functions to a non-critical area of the file.
      Signed-off-by: NGuenter Roeck <linux@roeck-us.net>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      11d54904
    • S
      powerpc/nohash: Split __early_init_mmu() into boot and secondary · 5d61a217
      Scott Wood 提交于
      __early_init_mmu() does some things that are really only needed by the
      boot cpu.  On FSL booke, This includes calling
      memblock_enforce_memory_limit(), which is labelled __init.  Secondary
      cpu init code can't be __init as that would break CPU hotplug.
      
      While it's probably a bug that memblock_enforce_memory_limit() isn't
      __init_memblock instead, there's no reason why we should be doing this
      stuff for secondary cpus in the first place.
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      5d61a217
    • B
      PCI: Remove DEFINE_PCI_DEVICE_TABLE macro use · 9baa3c34
      Benoit Taine 提交于
      We should prefer `struct pci_device_id` over `DEFINE_PCI_DEVICE_TABLE` to
      meet kernel coding style guidelines.  This issue was reported by checkpatch.
      
      A simplified version of the semantic patch that makes this change is as
      follows (http://coccinelle.lip6.fr/):
      
      // <smpl>
      
      @@
      identifier i;
      declarer name DEFINE_PCI_DEVICE_TABLE;
      initializer z;
      @@
      
      - DEFINE_PCI_DEVICE_TABLE(i)
      + const struct pci_device_id i[]
      = z;
      
      // </smpl>
      
      [bhelgaas: add semantic patch]
      Signed-off-by: NBenoit Taine <benoit.taine@lip6.fr>
      Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
      9baa3c34