1. 30 Apr, 2013 21 commits
    • powerpc: Decode the pte-lp-encoding bits correctly. · b1022fbd
      Aneesh Kumar K.V committed
      We look at both the segment base page size and actual page size and store
      the pte-lp-encodings in an array per base page size.
      
      We also update all relevant functions to take an actual page size argument
      so that we can use the correct PTE LP encoding in the HPTE. This should also
      provide basic Multiple Page Size per Segment (MPSS) support, which is needed
      to enable THP on ppc64.
      
      [Fixed PR KVM build --BenH]
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Use encode avpn where we need only avpn values · 74f227b2
      Aneesh Kumar K.V committed
      In all these cases we are doing something similar to
      
      HPTE_V_COMPARE(hpte_v, want_v) which ignores the HPTE_V_LARGE bit
      
      With MPSS support we would need the actual page size to set the HPTE_V_LARGE
      bit, and that won't be available in most of these cases. Since we are ignoring
      the HPTE_V_LARGE bit, use the avpn value instead. There should not be any change
      in behaviour after this patch.
      Acked-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
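The point that a comparison can ignore the HPTE_V_LARGE bit may be easier to see in a small sketch. The mask and bit position below are assumptions chosen for illustration (they are not stated in the commit message), and `hpte_v_compare` is a hypothetical stand-in for the macro named above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration only; not taken from the commit message. */
#define HPTE_V_LARGE  0x0000000000000004ULL  /* assumed position of the large-page bit */
#define COMPARE_MASK  0xffffffffffffff80ULL  /* assumed: the low 7 bits are not compared */

/* Hypothetical stand-in for HPTE_V_COMPARE: true when the compared bits match. */
static bool hpte_v_compare(uint64_t have, uint64_t want)
{
    return ((have ^ want) & COMPARE_MASK) == 0;
}
```

Under these assumptions, `hpte_v_compare(v | HPTE_V_LARGE, v)` is true for any `v`, which is why a value built without knowing the actual page size compares the same way.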
    • powerpc: Reduce PTE table memory wastage · 5c1f6ee9
      Aneesh Kumar K.V committed
      We allocate one page for the last level of the Linux page table. With THP and
      a large page size of 16MB, that would mean we are wasting a large part
      of that page. To map a 16MB area, we only need 2K of PTE space with a 64K
      page size. This patch reduces the wastage by sharing the page
      allocated for the last level of the Linux page table among multiple pmd
      entries. We call these smaller chunks PTE page fragments, and the allocated
      page the PTE page.

      In order to support systems which don't have 64K HPTE support, we also
      add another 2K to the PTE page fragment. The second half of the PTE fragment
      is used for storing slot and secondary bit information of an HPTE. With this
      we now have a 4K PTE fragment.

      We use a simple approach to share the PTE page. On allocation, we bump the
      PTE page refcount to 16 and share the PTE page with the next 16 pte alloc
      requests. This should help the node locality of the PTE page fragments,
      assuming that the immediately following pte alloc requests will mostly come
      from the same NUMA node. We don't try to reuse a freed PTE page fragment. Hence
      we could be wasting some space.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
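The sizes quoted in this message can be checked with a little arithmetic. This is a minimal sketch, assuming 8-byte PTEs and a PTE page of one 64K base page; the function names are hypothetical and only reproduce the numbers in the text.

```c
#include <stddef.h>

enum {
    BASE_PAGE_SIZE = 64 * 1024,        /* 64K base page, as in the message */
    HUGE_PAGE_SIZE = 16 * 1024 * 1024, /* 16MB huge page */
    PTE_ENTRY_SIZE = 8                 /* assumption: one PTE is 8 bytes */
};

/* PTE space needed to map one 16MB area with 64K pages. */
static size_t pte_space_bytes(void)
{
    return (size_t)(HUGE_PAGE_SIZE / BASE_PAGE_SIZE) * PTE_ENTRY_SIZE;
}

/* Fragment size once the second half (slot/secondary info) is added. */
static size_t pte_fragment_bytes(void)
{
    return 2 * pte_space_bytes();
}

/* How many fragments fit in one PTE page (assumed to be one base page). */
static size_t fragments_per_pte_page(void)
{
    return BASE_PAGE_SIZE / pte_fragment_bytes();
}
```

With these assumptions the three functions give 2048, 4096 and 16, matching the 2K PTE space, the 4K fragment and the refcount of 16 described above.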
    • powerpc: Move the pte free routines from common header · d614bb04
      Aneesh Kumar K.V committed
      This patch moves the common code to 32/64 bit headers and also duplicates
      the 4K_PAGES and 64K_PAGES sections. We will later change the 64 bit 64K_PAGES
      version to support smaller PTE fragments. The patch doesn't introduce
      any functional changes.
      Acked-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Reduce the PTE_INDEX_SIZE · 419df06e
      Aneesh Kumar K.V committed
      This makes one PMD cover a 16MB range. That helps in easier implementation of THP
      on power. The THP core code makes use of one pmd entry to track the hugepage, and
      the range mapped by a single pmd entry should be equal to the hugepage size
      supported by the hardware.

      This also switches PGD to cover 16GB. That is needed so that we can simplify the
      hugetlb page walking code so that we have the same pte format for explicit hugepage
      and THP hugepage.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
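The 16MB and 16GB ranges follow from the index sizes. The sketch below assumes 64K base pages, a PTE index of 8 bits, a PMD index of 10 bits and no further level between PMD and PGD; these values are assumptions picked to reproduce the ranges in the message, and the function names are hypothetical.

```c
#include <stdint.h>

#define PAGE_SHIFT      16  /* 64K base page */
#define PTE_INDEX_SIZE   8  /* assumption */
#define PMD_INDEX_SIZE  10  /* assumption */

/* Address range covered by one PMD entry. */
static uint64_t pmd_range_bytes(void)
{
    return 1ULL << (PAGE_SHIFT + PTE_INDEX_SIZE);
}

/* Address range covered by one PGD entry (assuming no level in between). */
static uint64_t pgd_range_bytes(void)
{
    return 1ULL << (PAGE_SHIFT + PTE_INDEX_SIZE + PMD_INDEX_SIZE);
}
```

Here `pmd_range_bytes()` is 2^24 bytes (16MB) and `pgd_range_bytes()` is 2^34 bytes (16GB).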
    • powerpc: Switch 16GB and 16MB explicit hugepages to a different page table format · e2b3d202
      Aneesh Kumar K.V committed
      We will be switching PMD_SHIFT to 24 bits to facilitate the THP implementation.
      With PMD_SHIFT set to 24, we now have 16MB huge pages allocated at PGD level.
      That means that with a 32-bit process we cannot allocate normal pages at
      all, because we cover the entire address space with one pgd entry. Fix this
      by switching to a new page table format for hugepages. With the new page table
      format for 16GB and 16MB hugepages we won't allocate a hugepage directory. Instead
      we encode the PTE information directly at the directory level. This forces 16MB
      hugepages to the PMD level. This will also make the page table walk much simpler
      later when we add THP support.
      
      With the new table format we have 4 cases for pgds and pmds:
      (1) invalid (all zeroes)
      (2) pointer to next table, as normal; bottom 6 bits == 0
      (3) leaf pte for huge page, bottom two bits != 00
      (4) hugepd pointer, bottom two bits == 00, next 4 bits indicate size of table
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
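The four cases listed above can be read as a decision procedure on the low bits of a directory entry. The following is a rough sketch of that reading only; the enum, the function names and the order of the checks are hypothetical, and the kernel's own helpers may differ.

```c
#include <stdint.h>

enum entry_kind {
    ENTRY_INVALID,    /* (1) all zeroes */
    ENTRY_TABLE,      /* (2) pointer to next table, bottom 6 bits == 0 */
    ENTRY_HUGE_LEAF,  /* (3) leaf pte for huge page, bottom two bits != 00 */
    ENTRY_HUGEPD      /* (4) hugepd pointer, bottom two bits == 00 */
};

/* Classify a pgd/pmd entry by its low bits, following the four cases. */
static enum entry_kind classify_entry(uint64_t entry)
{
    if (entry == 0)
        return ENTRY_INVALID;
    if ((entry & 0x3) != 0)
        return ENTRY_HUGE_LEAF;
    if ((entry & 0x3f) == 0)
        return ENTRY_TABLE;
    return ENTRY_HUGEPD;  /* bottom two bits 00, size field non-zero */
}

/* The 4 bits above the bottom two, which case (4) uses for the table size. */
static unsigned hugepd_size_field(uint64_t entry)
{
    return (unsigned)((entry >> 2) & 0xf);
}
```

For example, an entry whose low six bits are `010100` falls into case (4) with a size field of 5.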
    • powerpc: New hugepage directory format · cf9427b8
      Aneesh Kumar K.V committed
      Change the hugepage directory format so that we can have leaf ptes directly
      at page directory avoiding the allocation of hugepage directory.
      
      With the new table format we have 3 cases for pgds and pmds:
      (1) invalid (all zeroes)
      (2) pointer to next table, as normal; bottom 6 bits == 0
      (3) hugepd pointer, bottom two bits == 00, next 4 bits indicate size of table

      Instead of storing the shift value in the hugepd pointer we use the mmu_psize_def
      index, so that we can fit all the supported hugepage sizes in 4 bits.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Don't truncate pgd_index wrongly · 0e5f35d0
      Aneesh Kumar K.V committed
      With PGD_INDEX_SIZE set to 12, the existing macro doesn't work. Fix it to
      use PTRS_PER_PGD.
      
      The idea originally was to have one more bit in the result of
      pgd_index() than PGD_INDEX_SIZE, so that if one had an address
      corresponding to the last PGD entry, and then incremented that address
      by PGD_SIZE, and took pgd_index() of that, you wouldn't end up with
      zero.  The commit that introduced that dates back to 2002, and the
      code that was sensitive to that edge case has long since been
      refactored (several times), so there is no need for it these days.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
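To see why a hard-coded mask stops working at PGD_INDEX_SIZE 12, compare it with a mask derived from PTRS_PER_PGD. This is an illustrative sketch: the `0x1ff` mask and the PGDIR_SHIFT value are assumptions, and both function names are hypothetical.

```c
#include <stdint.h>

#define PGD_INDEX_SIZE 12
#define PTRS_PER_PGD   (1UL << PGD_INDEX_SIZE)
#define PGDIR_SHIFT    34  /* assumption for illustration */

/* Index computed with a fixed 9-bit mask (assumed old behaviour). */
static unsigned long pgd_index_fixed_mask(uint64_t addr)
{
    return (unsigned long)((addr >> PGDIR_SHIFT) & 0x1ff);
}

/* Index computed with a mask that follows PTRS_PER_PGD. */
static unsigned long pgd_index_ptrs(uint64_t addr)
{
    return (unsigned long)((addr >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1));
}
```

For an address whose PGD index is `0x800`, the fixed mask yields 0 while the PTRS_PER_PGD-based mask yields `0x800`, so any index above the fixed mask's width silently aliases a lower entry.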
    • powerpc: Don't hard code the size of pte page · cc3665a6
      Aneesh Kumar K.V committed
      Use PTRS_PER_PTE to indicate the size of the pte page. To support THP,
      later patches will be changing the PTRS_PER_PTE value.
      Acked-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Save DAR and DSISR in pt_regs on MCE · ce54152f
      Aneesh Kumar K.V committed
      We were not saving DAR and DSISR on MCE. Save them and also print the values
      along with exception details in xmon.
      Acked-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Use signed formatting when printing error · 4b8f63d9
      Aneesh Kumar K.V committed
      PAPR defines these errors as negative values. So print them accordingly
      for easy debugging.
      Acked-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
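The difference signed formatting makes for a negative return code can be shown in plain C. This is a userspace sketch with hypothetical helper names; the actual format strings changed by the patch are not shown in the message.

```c
#include <stddef.h>
#include <stdio.h>

/* Format a return code as unsigned hex, which hides the sign. */
static int format_rc_unsigned(char *buf, size_t len, long rc)
{
    return snprintf(buf, len, "%lx", (unsigned long)rc);
}

/* Format the same return code as a signed decimal. */
static int format_rc_signed(char *buf, size_t len, long rc)
{
    return snprintf(buf, len, "%ld", rc);
}
```

A hypothetical code of -4 comes out as `-4` with the signed variant, while the unsigned variant prints a long run of `f` digits that has to be decoded by hand.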
    • powerpc/pseries: Correct builds break when CONFIG_SMP not defined · 601abdc3
      Nathan Fontenot committed
      Correct build failure for powerpc/pseries builds with CONFIG_SMP not defined.
      
      The function cpu_sibling_mask has no meaning (or definition) when CONFIG_SMP
      is not defined. Additionally, the updating of NUMA affinity for a CPU in a UP
      system doesn't really make sense.
      
      This patch ifdef's out the code making the affinity updates for PRRN events to
      fix the following build break.
      
      arch/powerpc/mm/numa.c: In function ‘stage_topology_update’:
      arch/powerpc/mm/numa.c:1535: error: implicit declaration of function ‘cpu_sibling_mask’
      arch/powerpc/mm/numa.c:1535: warning: passing argument 3 of ‘cpumask_or’ makes pointer from integer without a cast
      make[1]: *** [arch/powerpc/mm/numa.o] Error 1
      Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc/booke: Remove obsolete macro FINISH_EXCEPTION · 177c1923
      Kevin Hao committed
      This is stale and not used by anyone now.
      Signed-off-by: Kevin Hao <haokexin@gmail.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc/rtas_flash: Fix bad memory access · fb4696c3
      Vasant Hegde committed
      We use kmem_cache_alloc() to allocate memory to hold the new firmware
      which will be flashed. kmem_cache_alloc() calls rtas_block_ctor() to
      set the memory to NULL. But this constructor is called only for newly
      allocated slabs.

      If we run the command below multiple times without rebooting, the allocator may
      allocate memory from an area which was freed by kmem_cache_free(), and
      it will not call the constructor. In this situation we may hit a kernel oops.
      
      dd if=<fw image> of=/proc/ppc64/rtas/firmware_flash bs=4096
      
      oops message:
      -------------
      [ 1602.399755] Oops: Kernel access of bad area, sig: 11 [#1]
      [ 1602.399772] SMP NR_CPUS=1024 NUMA pSeries
      [ 1602.399779] Modules linked in: rtas_flash nfsd lockd auth_rpcgss nfs_acl sunrpc fuse loop dm_mod sg ipv6 ses enclosure ehea ehci_pci ohci_hcd ehci_hcd usbcore sd_mod usb_common crc_t10dif scsi_dh_alua scsi_dh_emc scsi_dh_hp_sw scsi_dh_rdac scsi_dh ipr libata scsi_mod
      [ 1602.399817] NIP: d00000000a170b9c LR: d00000000a170b64 CTR: c00000000079cd58
      [ 1602.399823] REGS: c0000003b9937930 TRAP: 0300   Not tainted  (3.9.0-rc4-0.27-ppc64)
      [ 1602.399828] MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI>  CR: 22000428  XER: 20000000
      [ 1602.399841] SOFTE: 1
      [ 1602.399844] CFAR: c000000000005f24
      [ 1602.399848] DAR: 8c2625a820631fef, DSISR: 40000000
      [ 1602.399852] TASK = c0000003b4520760[3655] 'dd' THREAD: c0000003b9934000 CPU: 3
      GPR00: 8c2625a820631fe7 c0000003b9937bb0 d00000000a179f28 d00000000a171f08
      GPR04: 0000000010040000 0000000000001000 c0000003b9937df0 c0000003b5fb2080
      GPR08: c0000003b58f7200 d00000000a179f28 c0000003b40058d4 c00000000079cd58
      GPR12: d00000000a171450 c000000007f40900 0000000000000005 0000000010178d20
      GPR16: 00000000100cb9d8 000000000000001d 0000000000000000 000000001003ffff
      GPR20: 0000000000000001 0000000000000000 00003fffa0b50d30 000000001001f010
      GPR24: 0000000010020888 0000000010040000 d00000000a171f08 d00000000a172808
      GPR28: 0000000000001000 0000000010040000 c0000003b4005880 8c2625a820631fe7
      [ 1602.399924] NIP [d00000000a170b9c] .rtas_flash_write+0x7c/0x1e8 [rtas_flash]
      [ 1602.399930] LR [d00000000a170b64] .rtas_flash_write+0x44/0x1e8 [rtas_flash]
      [ 1602.399934] Call Trace:
      [ 1602.399939] [c0000003b9937bb0] [d00000000a170b64] .rtas_flash_write+0x44/0x1e8 [rtas_flash] (unreliable)
      [ 1602.399948] [c0000003b9937c60] [c000000000282830] .proc_reg_write+0x90/0xe0
      [ 1602.399955] [c0000003b9937ce0] [c0000000001ff374] .vfs_write+0x114/0x238
      [ 1602.399961] [c0000003b9937d80] [c0000000001ff5d8] .SyS_write+0x70/0xe8
      [ 1602.399968] [c0000003b9937e30] [c000000000009cdc] syscall_exit+0x0/0xa0
      [ 1602.399973] Instruction dump:
      [ 1602.399977] eb698010 801b0028 2f80dcd6 419e00a4 2fbc0000 419e009c ebfb0030 2fbf0000
      [ 1602.399989] 409e0010 480000d8 60000000 7c1f0378 <e81f0008> 2fa00000 409efff4 e81f0000
      [ 1602.400012] ---[ end trace b4136d115dc31dac ]---
      [ 1602.402178]
      [ 1602.402185] Sending IPI to other CPUs
      [ 1602.403329] IPI complete
      
      This patch uses kmem_cache_zalloc() instead of kmem_cache_alloc() to
      allocate memory, which makes sure memory is set to 0 before using.
      Also removes rtas_block_ctor(), which is no longer required.
      Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
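The failure mode described here, a constructor that runs only for fresh memory and not for recycled objects, can be imitated in userspace. The toy cache below is not the kernel slab allocator; every name in it is hypothetical, and it only mimics the one behaviour the message relies on.

```c
#include <stdlib.h>
#include <string.h>

struct block {
    struct block *next;
    unsigned long payload;
};

/* Toy cache holding at most one recycled object. */
struct toy_cache {
    struct block *free_obj;
    void (*ctor)(struct block *);
};

static void block_ctor(struct block *b)
{
    memset(b, 0, sizeof(*b));
}

/* Like the behaviour described above: the ctor runs only for fresh memory. */
static struct block *toy_alloc(struct toy_cache *c)
{
    struct block *b = c->free_obj;

    if (b) {
        c->free_obj = NULL;  /* recycled object: constructor is skipped */
        return b;
    }
    b = malloc(sizeof(*b));
    if (b && c->ctor)
        c->ctor(b);
    return b;
}

/* Keep the object for reuse without clearing it. */
static void toy_free(struct toy_cache *c, struct block *b)
{
    free(c->free_obj);
    c->free_obj = b;
}

/* Zeroing variant: contents are cleared on every allocation. */
static struct block *toy_zalloc(struct toy_cache *c)
{
    struct block *b = toy_alloc(c);

    if (b)
        memset(b, 0, sizeof(*b));
    return b;
}
```

In this toy, an object returned by `toy_alloc()` after a `toy_free()` still carries its old `next` pointer, whereas `toy_zalloc()` always returns cleared memory, which is the property the patch gets from kmem_cache_zalloc().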
    • powerpc: Fix build failure after merge of the cgroup tree · 9f3a90e8
      Stephen Rothwell committed
      After merging the cgroup tree, today's linux-next build (powerpc
      ppc64_defconfig) failed like this:
      
      arch/powerpc/mm/numa.c: In function 'arch_update_cpu_topology':
      arch/powerpc/mm/numa.c:1465:2: error: implicit declaration of function 'kzalloc' [-Werror=implicit-function-declaration]
      arch/powerpc/mm/numa.c:1465:10: error: assignment makes pointer from integer without a cast [-Werror]
      arch/powerpc/mm/numa.c:1497:2: error: implicit declaration of function 'kfree' [-Werror=implicit-function-declaration]
      
      Caused by commit 30c05350 ("powerpc/pseries: Use stop machine to
      update cpu maps") from the powerpc tree interacting with (probably)
      commit ff794dea ("cpuset: remove include of cgroup.h from cpuset.h")
      from the cgroup tree.  Removing includes from header files is fraught
      with danger ...
      
      The former should have added an include of linux/slab.h to
      arch/powerpc/mm/numa.c.
      
      I have added the following merge fix patch for today (but it should be
      applied to the powerpc tree ASAP).
      
      From: Stephen Rothwell <sfr@canb.auug.org.au>
      Date: Mon, 29 Apr 2013 14:01:44 +1000
      Subject: [PATCH] powerpc: numa.c: using kzalloc/kfree requires including
       slab.h
      
      fixes these build errors:
      
      arch/powerpc/mm/numa.c: In function 'arch_update_cpu_topology':
      arch/powerpc/mm/numa.c:1465:2: error: implicit declaration of function 'kzalloc' [-Werror=implicit-function-declaration]
      arch/powerpc/mm/numa.c:1465:10: error: assignment makes pointer from integer without a cast [-Werror]
      arch/powerpc/mm/numa.c:1497:2: error: implicit declaration of function 'kfree' [-Werror=implicit-function-declaration]
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Fix usage of setup_pci_atmu() · d5bbe659
      Michael Neuling committed
      Linux next is currently failing to compile mpc85xx_defconfig with:
        arch/powerpc/sysdev/fsl_pci.c:944:2: error: too many arguments to function 'setup_pci_atmu'
      
      This is caused by (from Kumar's next branch):
        commit 34642bbb
        Author: Kumar Gala <galak@kernel.crashing.org>
        powerpc/fsl-pci: Keep PCI SoC controller registers in pci_controller
      
      which changed the definition of setup_pci_atmu() but didn't update one of
      the callers. The patch below fixes this.
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Reviewed-by: Kim Phillips <kim.phillips@freescale.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • mm: use vm_unmapped_area() on powerpc architecture · fba2369e
      Michel Lespinasse committed
      Update the powerpc slice_get_unmapped_area function to make use of
      vm_unmapped_area() instead of implementing a brute force search.
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Tested-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • mm: remove free_area_cache use in powerpc architecture · 34d07177
      Michel Lespinasse committed
      As all other architectures have been converted to use vm_unmapped_area(),
      we are about to retire the free_area_cache.
      
      This change simply removes the use of that cache in
      slice_get_unmapped_area(), which will most certainly have a
      performance cost. The next change will convert that function to use the
      vm_unmapped_area() infrastructure and regain the performance.
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc/fsl-booke: add the reg prop for pci bridge device node for T4/B4 · 9e2ecdbb
      Kevin Hao committed
      The reg property in the pci bridge device node is used to bind this
      device node to the pci bridge device. All the pci devices under
      this bridge can then use the interrupt maps defined in this device node
      to do the irq translation. So if this property is missing, the traditional
      pci irq mechanism will not work.
      Signed-off-by: Kevin Hao <haokexin@gmail.com>
      Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
    • powerpc/fsl-pci: don't unmap the PCI SoC controller registers in setup_pci_atmu · 04aa99cd
      Kevin Hao committed
      In patch 34642bbb (powerpc/fsl-pci: Keep PCI SoC controller registers in
      pci_controller) we chose to keep the mapping of the PCI SoC controller
      registers. But we forgot to delete the unmap in the setup_pci_atmu
      function. This causes the following call trace once we access
      the PCI SoC controller registers later.
      
      Unable to handle kernel paging request for data at address 0x8000080080040f14
      Faulting instruction address: 0xc00000000002ea58
      Oops: Kernel access of bad area, sig: 11 [#1]
      SMP NR_CPUS=24 T4240 QDS
      Modules linked in:
      NIP: c00000000002ea58 LR: c00000000002eaf4 CTR: c00000000002eac0
      REGS: c00000017e10b4a0 TRAP: 0300   Not tainted  (3.9.0-rc1-00052-gfa3529f-dirty)
      MSR: 0000000080029000 <CE,EE,ME>  CR: 28adbe22  XER: 00000000
      SOFTE: 0
      DEAR: 8000080080040f14, ESR: 0000000000000000
      TASK = c00000017e100000[1] 'swapper/0' THREAD: c00000017e108000 CPU: 2
      GPR00: 0000000000000000 c00000017e10b720 c0000000009928d8 c00000017e578e00
      GPR04: 0000000000000000 000000000000000c 0000000000000001 c00000017e10bb40
      GPR08: 0000000000000000 8000080080040000 0000000000000000 0000000000000016
      GPR12: 0000000088adbe22 c00000000fffa800 c000000000001ba0 0000000000000000
      GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
      GPR20: 0000000000000000 0000000000000000 0000000000000000 c0000000008a5b70
      GPR24: c0000000008af938 c0000000009a28d8 c0000000009bb5dc c00000017e10bb40
      GPR28: c00000017e32a400 c00000017e10bc00 c00000017e32a400 c00000017e578e00
      NIP [c00000000002ea58] .fsl_pcie_check_link+0x88/0xf0
      LR [c00000000002eaf4] .fsl_indirect_read_config+0x34/0xb0
      Call Trace:
      [c00000017e10b720] [c00000017e10b7a0] 0xc00000017e10b7a0 (unreliable)
      [c00000017e10ba30] [c00000000002eaf4] .fsl_indirect_read_config+0x34/0xb0
      [c00000017e10bad0] [c00000000033aa08] .pci_bus_read_config_byte+0x88/0xd0
      [c00000017e10bb90] [c00000000088d708] .pci_apply_final_quirks+0x9c/0x18c
      [c00000017e10bc40] [c0000000000013dc] .do_one_initcall+0x5c/0x1f0
      [c00000017e10bcf0] [c00000000086ebac] .kernel_init_freeable+0x180/0x26c
      [c00000017e10bdb0] [c000000000001bbc] .kernel_init+0x1c/0x460
      [c00000017e10be30] [c000000000000880] .ret_from_kernel_thread+0x64/0xe4
      Instruction dump:
      38210310 2b800015 4fdde842 7c600026 5463fffe e8010010 7c0803a6 4e800020
      60000000 60000000 e92301d0 7c0004ac <80690f14> 0c030000 4c00012c 38210310
      ---[ end trace 7a8fe0cbccb7d992 ]---
      
      Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
      Signed-off-by: Kevin Hao <haokexin@gmail.com>
      Acked-by: Roy Zang <tie-fei.zang@freescale.com>
      Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
    • powerpc/dts: Fix the dts for p1025rdb 36bit · 9aa171fb
      Zhicheng Fan committed
      Fix the following errors:
      
      Error: p1025rdb.dtsi:326.2-3 label or path, 'qe', not found
      Error: p1021si-post.dtsi:242.2-3 label or path, 'qe', not found
      FATAL ERROR: Syntax error parsing input tree
      Signed-off-by: Zhicheng Fan <B32736@freescale.com>
      Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
  2. 26 Apr, 2013 19 commits