1. 09 3月, 2012 2 次提交
    • B
      powerpc: Call do_page_fault() with interrupts off · a546498f
      Benjamin Herrenschmidt 提交于
      We currently turn interrupts back to their previous state before
      calling do_page_fault(). This can be annoying when debugging as
      a bad fault will potentially have lost some processor state before
      getting into the debugger.
      
      We also end up calling some generic code with interrupts enabled
      such as notify_page_fault() with interrupts enabled, which could
      be unexpected.
      
      This changes our code to behave more like other architectures,
      and make the assembly entry code call into do_page_faults() with
      interrupts disabled. They are conditionally re-enabled from
      within do_page_fault() in the same spot x86 does it.
      
      While there, add the might_sleep() test in the case of a successful
      trylock of the mmap semaphore, again like x86.
      
      Also fix a bug in the existing assembly where r12 (_MSR) could get
      clobbered by C calls (the DTL accounting in the exception common
      macro and DISABLE_INTS) in some cases.
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      ---
      
      v2. Add the r12 clobber fix
      a546498f
    • B
      powerpc: Remove legacy iSeries bits from assembly files · 4f8cf36f
      Benjamin Herrenschmidt 提交于
      This removes the various bits of assembly in the kernel entry,
      exception handling and SLB management code that were specific
      to running under the legacy iSeries hypervisor which is no
      longer supported.
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      4f8cf36f
  2. 07 3月, 2012 2 次提交
  3. 23 2月, 2012 1 次提交
    • M
      fadump: Register for firmware assisted dump. · 3ccc00a7
      Mahesh Salgaonkar 提交于
      On 2012-02-20 11:02:51 Mon, Paul Mackerras wrote:
      > On Thu, Feb 16, 2012 at 04:44:30PM +0530, Mahesh J Salgaonkar wrote:
      >
      > If I have read the code correctly, we are going to get this printk on
      > non-pSeries machines or on older pSeries machines, even if the user
      > has not put the fadump=on option on the kernel command line.  The
      > printk will be annoying since there is no actual error condition.  It
      > seems to me that the condition for the printk should include
      > fw_dump.fadump_enabled.  In other words you should probably add
      >
      > 	if (!fw_dump.fadump_enabled)
      > 		return 0;
      >
      > at the beginning of the function.
      
      Hi Paul,
      
      Thanks for pointing it out. Please find the updated patch below.
      
      The existing patches above this (4/10 through 10/10) cleanly applies
      on this update.
      
      Thanks,
      -Mahesh.
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      3ccc00a7
  4. 13 1月, 2012 1 次提交
  5. 22 12月, 2011 1 次提交
    • K
      cpu: convert 'cpu' and 'machinecheck' sysdev_class to a regular subsystem · 8a25a2fd
      Kay Sievers 提交于
      This moves the 'cpu sysdev_class' over to a regular 'cpu' subsystem
      and converts the devices to regular devices. The sysdev drivers are
      implemented as subsystem interfaces now.
      
      After all sysdev classes are ported to regular driver core entities, the
      sysdev implementation will be entirely removed from the kernel.
      
      Userspace relies on events and generic sysfs subsystem infrastructure
      from sysdev devices, which are made available with this conversion.
      
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Borislav Petkov <bp@amd64.org>
      Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Zhang Rui <rui.zhang@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: NKay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>
      8a25a2fd
  6. 20 12月, 2011 2 次提交
    • S
      powerpc: Define virtual-physical translations for RELOCATABLE · 368ff8f1
      Suzuki Poulose 提交于
      We find the runtime address of _stext and relocate ourselves based
      on the following calculation.
      
      	virtual_base = ALIGN(KERNELBASE,KERNEL_TLB_PIN_SIZE) +
      			MODULO(_stext.run,KERNEL_TLB_PIN_SIZE)
      
      relocate() is called with the Effective Virtual Base Address (as
      shown below)
      
                  | Phys. Addr| Virt. Addr |
      Page        |------------------------|
      Boundary    |           |            |
                  |           |            |
                  |           |            |
      Kernel Load |___________|_ __ _ _ _ _|<- Effective
      Addr(_stext)|           |      ^     |Virt. Base Addr
                  |           |      |     |
                  |           |      |     |
                  |           |reloc_offset|
                  |           |      |     |
                  |           |      |     |
                  |           |______v_____|<-(KERNELBASE)%TLB_SIZE
                  |           |            |
                  |           |            |
                  |           |            |
      Page        |-----------|------------|
      Boundary    |           |            |
      
      On BookE, we need __va() & __pa() early in the boot process to access
      the device tree.
      
      Currently this has been defined as :
      
      #define __va(x) ((void *)(unsigned long)((phys_addr_t)(x) -
      						PHYSICAL_START + KERNELBASE)
      where:
       PHYSICAL_START is kernstart_addr - a variable updated at runtime.
       KERNELBASE	is the compile time Virtual base address of kernel.
      
      This won't work for us, as kernstart_addr is dynamic and will yield different
      results for __va()/__pa() for same mapping.
      
      e.g.,
      
      Let the kernel be loaded at 64MB and KERNELBASE be 0xc0000000 (same as
      PAGE_OFFSET).
      
      In this case, we would be mapping 0 to 0xc0000000, and kernstart_addr = 64M
      
      Now __va(1MB) = (0x100000) - (0x4000000) + 0xc0000000
      		= 0xbc100000 , which is wrong.
      
      it should be : 0xc0000000 + 0x100000 = 0xc0100000
      
      On platforms which support AMP, like PPC_47x (based on 44x), the kernel
      could be loaded at highmem. Hence we cannot always depend on the compile
      time constants for mapping.
      
      Here are the possible solutions:
      
      1) Update kernstart_addr(PHSYICAL_START) to match the Physical address of
      compile time KERNELBASE value, instead of the actual Physical_Address(_stext).
      
      The disadvantage is that we may break other users of PHYSICAL_START. They
      could be replaced with __pa(_stext).
      
      2) Redefine __va() & __pa() with relocation offset
      
      #ifdef	CONFIG_RELOCATABLE_PPC32
      #define __va(x) ((void *)(unsigned long)((phys_addr_t)(x) - PHYSICAL_START + (KERNELBASE + RELOC_OFFSET)))
      #define __pa(x) ((unsigned long)(x) + PHYSICAL_START - (KERNELBASE + RELOC_OFFSET))
      #endif
      
      where, RELOC_OFFSET could be
      
        a) A variable, say relocation_offset (like kernstart_addr), updated
           at boot time. This impacts performance, as we have to load an additional
           variable from memory.
      
      		OR
      
        b) #define RELOC_OFFSET ((PHYSICAL_START & PPC_PIN_SIZE_OFFSET_MASK) - \
                            (KERNELBASE & PPC_PIN_SIZE_OFFSET_MASK))
      
         This introduces more calculations for doing the translation.
      
      3) Redefine __va() & __pa() with a new variable
      
      i.e,
      
      #define __va(x) ((void *)(unsigned long)((phys_addr_t)(x) + VIRT_PHYS_OFFSET))
      
      where VIRT_PHYS_OFFSET :
      
      #ifdef CONFIG_RELOCATABLE_PPC32
      #define VIRT_PHYS_OFFSET virt_phys_offset
      #else
      #define VIRT_PHYS_OFFSET (KERNELBASE - PHYSICAL_START)
      #endif /* CONFIG_RELOCATABLE_PPC32 */
      
      where virt_phy_offset is updated at runtime to :
      
      	Effective KERNELBASE - kernstart_addr.
      
      Taking our example, above:
      
      virt_phys_offset = effective_kernelstart_vaddr - kernstart_addr
      		 = 0xc0400000 - 0x400000
      		 = 0xc0000000
      	and
      
      	__va(0x100000) = 0xc0000000 + 0x100000 = 0xc0100000
      	 which is what we want.
      
      I have implemented (3) in the following patch which has same cost of
      operation as the existing one.
      
      I have tested the patches on 440x platforms only. However this should
      work fine for PPC_47x also, as we only depend on the runtime address
      and the current TLB XLAT entry for the startup code, which is available
      in r25. I don't have access to a 47x board yet. So, it would be great if
      somebody could test this on 47x.
      Signed-off-by: NSuzuki K. Poulose <suzuki@in.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Kumar Gala <galak@kernel.crashing.org>
      Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
      Signed-off-by: NJosh Boyer <jwboyer@gmail.com>
      368ff8f1
    • S
      powerpc: Rename mapping based RELOCATABLE to DYNAMIC_MEMSTART for BookE · 0f890c8d
      Suzuki Poulose 提交于
      The current implementation of CONFIG_RELOCATABLE in BookE is based
      on mapping the page aligned kernel load address to KERNELBASE. This
      approach however is not enough for platforms, where the TLB page size
      is large (e.g, 256M on 44x). So we are renaming the RELOCATABLE used
      currently in BookE to DYNAMIC_MEMSTART to reflect the actual method.
      
      The CONFIG_RELOCATABLE for PPC32(BookE) based on processing of the
      dynamic relocations will be introduced in the later in the patch series.
      
      This change would allow the use of the old method of RELOCATABLE for
      platforms which can afford to enforce the page alignment (platforms with
      smaller TLB size).
      
      Changes since v3:
      
      * Introduced a new config, NONSTATIC_KERNEL, to denote a kernel which is
        either a RELOCATABLE or DYNAMIC_MEMSTART(Suggested by: Josh Boyer)
      Suggested-by: NScott Wood <scottwood@freescale.com>
      Tested-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NSuzuki K. Poulose <suzuki@in.ibm.com>
      Cc: Scott Wood <scottwood@freescale.com>
      Cc: Kumar Gala <galak@kernel.crashing.org>
      Cc: Josh Boyer <jwboyer@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: linux ppc dev <linuxppc-dev@lists.ozlabs.org>
      Signed-off-by: NJosh Boyer <jwboyer@gmail.com>
      0f890c8d
  7. 19 12月, 2011 2 次提交
  8. 09 12月, 2011 4 次提交
    • C
      powerpc/44x: Removing dead CONFIG_PPC47x · ca899859
      Christoph Egger 提交于
      CONFIG_PPC47x doesn't exist in Kconfig and no 476 processor calls this
      function ppc44x_pin_tlb() as it has it's own ppc47x_pin_tlb().
      
      This code is probably an artifact of the original 476 code that
      shouldn't have made it upstream.
      Signed-off-by: NChristoph Egger <siccegge@cs.fau.de>
      Signed-off-by: NTony Breeds <tony@bakeyournoodle.com>
      Signed-off-by: NJosh Boyer <jwboyer@gmail.com>
      ca899859
    • T
      powerpc: Use HAVE_MEMBLOCK_NODE_MAP · 1d7cfe18
      Tejun Heo 提交于
      powerpc doesn't access early_node_map[] directly and enabling
      HAVE_MEMBLOCK_NODE_MAP is trivial - replacing add_active_range() calls
      with memblock_set_node() and selecting HAVE_MEMBLOCK_NODE_MAP is
      enough.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      1d7cfe18
    • T
      memblock: s/memblock_analyze()/memblock_allow_resize()/ and update users · 1aadc056
      Tejun Heo 提交于
      The only function of memblock_analyze() is now allowing resize of
      memblock region arrays.  Rename it to memblock_allow_resize() and
      update its users.
      
      * The following users remain the same other than renaming.
      
        arm/mm/init.c::arm_memblock_init()
        microblaze/kernel/prom.c::early_init_devtree()
        powerpc/kernel/prom.c::early_init_devtree()
        openrisc/kernel/prom.c::early_init_devtree()
        sh/mm/init.c::paging_init()
        sparc/mm/init_64.c::paging_init()
        unicore32/mm/init.c::uc32_memblock_init()
      
      * In the following users, analyze was used to update total size which
        is no longer necessary.
      
        powerpc/kernel/machine_kexec.c::reserve_crashkernel()
        powerpc/kernel/prom.c::early_init_devtree()
        powerpc/mm/init_32.c::MMU_init()
        powerpc/mm/tlb_nohash.c::__early_init_mmu()  
        powerpc/platforms/ps3/mm.c::ps3_mm_add_memory()
        powerpc/platforms/embedded6xx/wii.c::wii_memory_fixups()
        sh/kernel/machine_kexec.c::reserve_crashkernel()
      
      * x86/kernel/e820.c::memblock_x86_fill() was directly setting
        memblock_can_resize before populating memblock and calling analyze
        afterwards.  Call memblock_allow_resize() before start populating.
      
      memblock_can_resize is now static inside memblock.c.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      1aadc056
    • T
      powerpc: Cleanup memblock usage · 6fbef13c
      Tejun Heo 提交于
      * early_init_devtree(): Total memory size is aligned to PAGE_SIZE;
        however, alignment isn't enforced if memory_limit is explicitly
        specified.  Simplify the logic and always apply PAGE_SIZE alignment.
      
      * MMU_init(): memblock regions is truncated by directly modifying
        memblock.memory.cnt.  This is incomplete (reserved array is not
        truncated) and unnecessarily low level hindering further memblock
        improvments.  Use memblock_enforce_memory_limit() instead.
      
      * wii_memory_fixups(): Unnecessarily low level direct manipulation of
        memblock regions.  The same result can be achieved using properly
        abstracted operations.  Reimplement using memblock API.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      6fbef13c
  9. 08 12月, 2011 1 次提交
    • S
      powerpc: Punch a hole in /dev/mem for librtas · 8a3e3d31
      sukadev@linux.vnet.ibm.com 提交于
      With CONFIG_STRICT_DEVMEM=y, user space cannot read any part of /dev/mem.
      Since this breaks librtas, punch a hole in /dev/mem to allow access to the
      rmo_buffer that librtas needs.
      
      Anton Blanchard reported the problem and helped with the fix.
      
      A quick test for this patch:
      
             # cat /proc/rtas/rmo_buffer
             000000000f190000 10000
      
             # python -c "print 0x000000000f190000 / 0x10000"
             3865
      
             # dd if=/dev/mem of=/tmp/foo count=1 bs=64k skip=3865
             1+0 records in
             1+0 records out
             65536 bytes (66 kB) copied, 0.000205235 s, 319 MB/s
      
             # dd if=/dev/mem of=/tmp/foo
             dd: reading `/dev/mem': Operation not permitted
             0+0 records in
             0+0 records out
             0 bytes (0 B) copied, 0.00022519 s, 0.0 kB/s
      Signed-off-by: NSukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      8a3e3d31
  10. 07 12月, 2011 8 次提交
  11. 02 12月, 2011 1 次提交
  12. 28 11月, 2011 2 次提交
  13. 25 11月, 2011 3 次提交
  14. 08 11月, 2011 2 次提交
  15. 03 11月, 2011 7 次提交
    • A
      thp: share get_huge_page_tail() · b35a35b5
      Andrea Arcangeli 提交于
      This avoids duplicating the function in every arch gup_fast.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b35a35b5
    • A
      powerpc: gup_huge_pmd() return 0 if pte changes · cf592bf7
      Andrea Arcangeli 提交于
      powerpc didn't return 0 in that case, if it's rolling back the *nr pointer
      it should also return zero to avoid adding pages to the array at the wrong
      offset.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Acked-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cf592bf7
    • A
      powerpc: gup_hugepte() support THP based tail recounting · 3526741f
      Andrea Arcangeli 提交于
      Up to this point the code assumed old refcounting for hugepages (pre-thp).
      This updates the code directly to the thp mapcount tail page refcounting.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3526741f
    • A
      powerpc: gup_hugepte() avoid freeing the head page too many times · 85964684
      Andrea Arcangeli 提交于
      We only taken "refs" pins on the head page not "*nr" pins.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Acked-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      85964684
    • A
      powerpc: get_hugepte() don't put_page() the wrong page · 405e44f2
      Andrea Arcangeli 提交于
      "page" may have changed to point to the next hugepage after the loop
      completed, The references have been taken on the head page, so the
      put_page must happen there too.
      
      This is a longstanding issue pre-thp inclusion.
      
      It's totally unclear how these page_cache_add_speculative and
      pte_val(pte) != pte_val(*ptep) checks are necessary across all the
      powerpc gup_fast code, when x86 doesn't need any of that: there's no way
      the page can be freed with irq disabled so we're guaranteed the
      atomic_inc will happen on a page with page_count > 0 (so not needing the
      speculative check).
      
      The pte check is also meaningless on x86: no need to rollback on x86 if
      the pte changed, because the pte can still change a CPU tick after the
      check succeeded and it won't be rolled back in that case.  The important
      thing is we got a reference on a valid page that was mapped there a CPU
      tick ago.  So not knowing the soft tlb refill code of ppc64 in great
      detail I'm not removing the "speculative" page_count increase and the
      pte checks across all the code, but unless there's a strong reason for
      it they should be later cleaned up too.
      
      If a pte can change from huge to non-huge (like it could happen with
      THP) passing a pte_t *ptep to gup_hugepte() would also require to repeat
      the is_hugepd in gup_hugepte(), but that shouldn't happen with hugetlbfs
      only so I'm not altering that.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Acked-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      405e44f2
    • A
      powerpc: remove superfluous PageTail checks on the pte gup_fast · 2839bdc1
      Andrea Arcangeli 提交于
      This part of gup_fast doesn't seem capable of handling hugetlbfs ptes,
      those should be handled by gup_hugepd only, so these checks are
      superfluous.
      
      Plus if this wasn't a noop, it would have oopsed because, the insistence
      of using the speculative refcounting would trigger a VM_BUG_ON if a tail
      page was encountered in the page_cache_get_speculative().
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Acked-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2839bdc1
    • A
      mm: thp: tail page refcounting fix · 70b50f94
      Andrea Arcangeli 提交于
      Michel while working on the working set estimation code, noticed that
      calling get_page_unless_zero() on a random pfn_to_page(random_pfn)
      wasn't safe, if the pfn ended up being a tail page of a transparent
      hugepage under splitting by __split_huge_page_refcount().
      
      He then found the problem could also theoretically materialize with
      page_cache_get_speculative() during the speculative radix tree lookups
      that uses get_page_unless_zero() in SMP if the radix tree page is freed
      and reallocated and get_user_pages is called on it before
      page_cache_get_speculative has a chance to call get_page_unless_zero().
      
      So the best way to fix the problem is to keep page_tail->_count zero at
      all times.  This will guarantee that get_page_unless_zero() can never
      succeed on any tail page.  page_tail->_mapcount is guaranteed zero and
      is unused for all tail pages of a compound page, so we can simply
      account the tail page references there and transfer them to
      tail_page->_count in __split_huge_page_refcount() (in addition to the
      head_page->_mapcount).
      
      While debugging this s/_count/_mapcount/ change I also noticed get_page is
      called by direct-io.c on pages returned by get_user_pages.  That wasn't
      entirely safe because the two atomic_inc in get_page weren't atomic.  As
      opposed to other get_user_page users like secondary-MMU page fault to
      establish the shadow pagetables would never call any superflous get_page
      after get_user_page returns.  It's safer to make get_page universally safe
      for tail pages and to use get_page_foll() within follow_page (inside
      get_user_pages()).  get_page_foll() is safe to do the refcounting for tail
      pages without taking any locks because it is run within PT lock protected
      critical sections (PT lock for pte and page_table_lock for
      pmd_trans_huge).
      
      The standard get_page() as invoked by direct-io instead will now take
      the compound_lock but still only for tail pages.  The direct-io paths
      are usually I/O bound and the compound_lock is per THP so very
      finegrined, so there's no risk of scalability issues with it.  A simple
      direct-io benchmarks with all lockdep prove locking and spinlock
      debugging infrastructure enabled shows identical performance and no
      overhead.  So it's worth it.  Ideally direct-io should stop calling
      get_page() on pages returned by get_user_pages().  The spinlock in
      get_page() is already optimized away for no-THP builds but doing
      get_page() on tail pages returned by GUP is generally a rare operation
      and usually only run in I/O paths.
      
      This new refcounting on page_tail->_mapcount in addition to avoiding new
      RCU critical sections will also allow the working set estimation code to
      work without any further complexity associated to the tail page
      refcounting with THP.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: NMichel Lespinasse <walken@google.com>
      Reviewed-by: NMichel Lespinasse <walken@google.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: <stable@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      70b50f94
  16. 01 11月, 2011 1 次提交