1. 09 10月, 2012 12 次提交
    • D
      sparc64: Support transparent huge pages. · 9e695d2e
      David Miller 提交于
      This is relatively easy since PMD's now cover exactly 4MB of memory.
      
      Our PMD entries are 32-bits each, so we use a special encoding.  The
      lowest bit, PMD_ISHUGE, determines the interpretation.  This is possible
      because sparc64's page tables are purely software entities so we can use
      whatever encoding scheme we want.  We just have to make the TLB miss
      assembler page table walkers aware of the layout.
      
      set_pmd_at() works much like set_pte_at() but it has to operate in two
      page from a table of non-huge PTEs, so we have to queue up TLB flushes
      based upon what mappings are valid in the PTE table.  In the second regime
      we are going from huge-page to non-huge-page, and in that case we need
      only queue up a single TLB flush to push out the huge page mapping.
      
      We still have 5 bits remaining in the huge PMD encoding so we can very
      likely support any new pieces of THP state tracking that might get added
      in the future.
      
      With lots of help from Johannes Weiner.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9e695d2e
    • D
      sparc64: Document PGD and PMD layout. · dbc9fdf0
      David Miller 提交于
      We're going to be messing around with the PMD interpretation and layout
      for the sake of transparent huge pages, so we better clearly document what
      we're starting with.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dbc9fdf0
    • D
      sparc64: Eliminate PTE table memory wastage. · c460bec7
      David Miller 提交于
      We've split up the PTE tables so that they take up half a page instead of
      a full page.  This is in order to facilitate transparent huge page
      support, which works much better if our PMDs cover 4MB instead of 8MB.
      
      What we do is have a one-behind cache for PTE table allocations in the
      mm struct.
      
      This logic triggers only on allocations.  For example, we don't try to
      keep track of free'd up page table blocks in the style that the s390 port
      does.
      
      There were only two slightly annoying aspects to this change:
      
      1) Changing pgtable_t to be a "pte_t *".  There's all of this special
         logic in the TLB free paths that needed adjustments, as did the
         PMD populate interfaces.
      
      2) init_new_context() needs to zap the pointer, since the mm struct
         just gets copied from the parent on fork.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c460bec7
    • D
      sparc64: Halve the size of PTE tables · 56a70b8c
      David Miller 提交于
      The reason we want to do this is to facilitate transparent huge page
      support.
      
      Right now PMD's cover 8MB of address space, and our huge page size is 4MB.
       The current transparent hugepage support is not able to handle HPAGE_SIZE
      != PMD_SIZE.
      
      So make PTE tables be sized to half of a page instead of a full page.
      
      We can still map properly the whole supported virtual address range which
      on sparc64 requires 44 bits.  Add a compile time CPP test which ensures
      that this requirement is always met.
      
      There is a minor inefficiency added by this change.  We only use half of
      the page for PTE tables.  It's not trivial to use only half of the page
      yet still get all of the pgtable_page_{ctor,dtor}() stuff working
      properly.  It is doable, and that will come in a subsequent change.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      56a70b8c
    • D
      sparc64: Only support 4MB huge pages and 8KB base pages. · 15b9350a
      David Miller 提交于
      Narrowing the scope of the page size configurations will make the
      transparent hugepage changes much simpler.
      
      In the end what we really want to do is have the kernel support multiple
      huge page sizes and use whatever is appropriate as the context dictactes.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      15b9350a
    • S
      readahead: fault retry breaks mmap file read random detection · 45cac65b
      Shaohua Li 提交于
      .fault now can retry.  The retry can break state machine of .fault.  In
      filemap_fault, if page is miss, ra->mmap_miss is increased.  In the second
      try, since the page is in page cache now, ra->mmap_miss is decreased.  And
      these are done in one fault, so we can't detect random mmap file access.
      
      Add a new flag to indicate .fault is tried once.  In the second try, skip
      ra->mmap_miss decreasing.  The filemap_fault state machine is ok with it.
      
      I only tested x86, didn't test other archs, but looks the change for other
      archs is obvious, but who knows :)
      Signed-off-by: NShaohua Li <shaohua.li@fusionio.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      45cac65b
    • W
      mm: hugetlb: add arch hook for clearing page flags before entering pool · 5d3a551c
      Will Deacon 提交于
      The core page allocator ensures that page flags are zeroed when freeing
      pages via free_pages_check.  A number of architectures (ARM, PPC, MIPS)
      rely on this property to treat new pages as dirty with respect to the data
      cache and perform the appropriate flushing before mapping the pages into
      userspace.
      
      This can lead to cache synchronisation problems when using hugepages,
      since the allocator keeps its own pool of pages above the usual page
      allocator and does not reset the page flags when freeing a page into the
      pool.
      
      This patch adds a new architecture hook, arch_clear_hugepage_flags, so
      that architectures which rely on the page flags being in a particular
      state for fresh allocations can adjust the flags accordingly when a page
      is freed into the pool.
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5d3a551c
    • K
      mm: kill vma flag VM_RESERVED and mm->reserved_vm counter · 314e51b9
      Konstantin Khlebnikov 提交于
      A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA,
      currently it lost original meaning but still has some effects:
      
       | effect                 | alternative flags
      -+------------------------+---------------------------------------------
      1| account as reserved_vm | VM_IO
      2| skip in core dump      | VM_IO, VM_DONTDUMP
      3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
      4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
      
      This patch removes reserved_vm counter from mm_struct.  Seems like nobody
      cares about it, it does not exported into userspace directly, it only
      reduces total_vm showed in proc.
      
      Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP.
      
      remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
      remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP.
      
      [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      314e51b9
    • C
      Kconfig: clean up the "#if defined(arch)" list for exception-trace sysctl entry · 7ac57a89
      Catalin Marinas 提交于
      Introduce SYSCTL_EXCEPTION_TRACE config option and selec it in the
      architectures requiring support for the "exception-trace" debug_table
      entry in kernel/sysctl.c.
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7ac57a89
    • C
      Kconfig: clean up the long arch list for the DEBUG_BUGVERBOSE config option · 9b2a60c4
      Catalin Marinas 提交于
      Introduce HAVE_DEBUG_BUGVERBOSE config option and select it in
      corresponding architecture Kconfig files.  Architectures that already
      select GENERIC_BUG don't need to select HAVE_DEBUG_BUGVERBOSE.
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Acked-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9b2a60c4
    • C
      Kconfig: clean up the long arch list for the DEBUG_KMEMLEAK config option · b69ec42b
      Catalin Marinas 提交于
      Introduce HAVE_DEBUG_KMEMLEAK config option and select it in corresponding
      architecture Kconfig files.  DEBUG_KMEMLEAK now only depends on
      HAVE_DEBUG_KMEMLEAK.
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b69ec42b
    • C
      Kconfig: clean up the long arch list for the UID16 config option · af1839eb
      Catalin Marinas 提交于
      Introduce HAVE_UID16 config option and select it in corresponding
      architecture Kconfig files.  UID16 now only depends on HAVE_UID16.
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Acked-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      af1839eb
  2. 06 10月, 2012 4 次提交
    • D
      Revert strace hiccups fix. · 2863bc54
      David S. Miller 提交于
      This reverts commit 40138249 and
      ffa9009c.
      
      There are problems with how the flag bytes were rearranged, in
      particular we really can't move values down into the lowest
      16 bits since those are used for individual state bits.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2863bc54
    • D
      sparc64: Niagara-4 bzero/memset, plus use MRU stores in page copy. · 9f825962
      David S. Miller 提交于
      This adds optimized memset/bzero/page-clear routines for Niagara-4.
      
      We basically can do what powerpc has been able to do for a decade (via
      the "dcbz" instruction), which is use cache line clearing stores for
      bzero and memsets with a 'c' argument of zero.
      
      As long as we make the cache initializing store to each 32-byte
      subblock of the L2 cache line, it works.
      
      As with other Niagara-4 optimized routines, the key is to make sure to
      avoid any usage of the %asi register, as reads and writes to it cost
      at least 50 cycles.
      
      For the user clear cases, we don't use these new routines, we use the
      Niagara-1 variants instead.  Those have to use %asi in an unavoidable
      way.
      
      A Niagara-4 8K page clear costs just under 600 cycles.
      
      Add definitions of the MRU variants of the cache initializing store
      ASIs.  By default, cache initializing stores install the line as Least
      Recently Used.  If we know we're going to use the data immediately
      (which is true for page copies and clears) we can use the Most
      Recently Used variant, to decrease the likelyhood of the lines being
      evicted before they get used.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9f825962
    • D
      compat: move compat_siginfo_t definition to asm/compat.h · 751f409d
      Denys Vlasenko 提交于
      This is a preparatory patch for the introduction of NT_SIGINFO elf note.
      
      Make the location of compat_siginfo_t uniform across eight architectures
      which have it.  Now it can be pulled in by including asm/compat.h or
      linux/compat.h.
      
      Most of the copies are verbatim.  compat_uid[32]_t had to be replaced by
      __compat_uid[32]_t.  compat_uptr_t had to be moved up before
      compat_siginfo_t in asm/compat.h on a several architectures (tile already
      had it moved up).  compat_sigval_t had to be relocated from linux/compat.h
      to asm/compat.h.
      Signed-off-by: NDenys Vlasenko <vda.linux@googlemail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Cc: "Jonathan M. Foote" <jmfoote@cert.org>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Pedro Alves <palves@redhat.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      751f409d
    • J
      cross-arch: don't corrupt personality flags upon exec() · 16f3e95b
      Jiri Kosina 提交于
      Historically, the top three bytes of personality have been used for
      things such as ADDR_NO_RANDOMIZE, which made sense only for specific
      architectures.
      
      We now however have a flag there that is general no matter the
      architecture (UNAME26); generally we have to be careful to preserve the
      personality flags across exec().
      
      This patch tries to fix all architectures that forcefully overwrite
      personality flags during exec() (ppc32 and s390 have been fixed recently
      by commits f9783ec8 ("[S390] Do not clobber personality flags on
      exec") and 59e4c3a2 ("powerpc/32: Don't clobber personality flags on
      exec") in a similar way already).
      Signed-off-by: NJiri Kosina <jkosina@suse.cz>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Chen Liqin <liqin.chen@sunplusct.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Zankel <chris@zankel.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      16f3e95b
  3. 05 10月, 2012 2 次提交
  4. 03 10月, 2012 3 次提交
  5. 29 9月, 2012 1 次提交
  6. 28 9月, 2012 2 次提交
  7. 27 9月, 2012 2 次提交
  8. 22 9月, 2012 4 次提交
  9. 19 9月, 2012 2 次提交
  10. 18 9月, 2012 1 次提交
  11. 16 9月, 2012 3 次提交
  12. 07 9月, 2012 3 次提交
    • D
      sparc64: Use cpu_pgsz_mask for linear kernel mapping config. · c69ad0a3
      David S. Miller 提交于
      This required a little bit of reordering of how we set up the memory
      management early on.
      
      We now only know the final values of kern_linear_pte_xor[] after we
      take over the trap table and start processing TLB misses ourselves.
      
      So once we fill those values in we re-clear the kernel's 4M TSB and
      flush the TLBs.  That way if we find we support larger than 4M pages
      we won't have any stale smaller page size entries in the TSB.
      
      SUN4U Panther support for larger page sizes should now be extremely
      trivial but I have no hardware on which to test it and I believe
      that some of the sun4u TLB miss assembler needs to be audited first
      to make sure it really can handle larger than 4M PTEs properly.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c69ad0a3
    • D
      sparc64: Probe cpu page size support more portably. · ce33fdc5
      David S. Miller 提交于
      On sun4v, interrogate the machine description.  This code is extremely
      defensive in nature, and a lot of the checks can probably be removed.
      
      On sun4u things are a lot simpler.  There are the page sizes all chips
      support, and then Panther adds 32MB and 256MB pages.
      
      Report the probed value in /proc/cpuinfo
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ce33fdc5
    • D
      sparc64: Support 2GB and 16GB page sizes for kernel linear mappings. · 4f93d21d
      David S. Miller 提交于
      SPARC-T4 supports 2GB pages.
      
      So convert kpte_linear_bitmap into an array of 2-bit values which
      index into kern_linear_pte_xor.
      
      Now kern_linear_pte_xor is used for 4 page size aligned regions,
      4MB, 256MB, 2GB, and 16GB respectively.
      
      Enabling 2GB pages is currently hardcoded using a check against
      sun4v_chip_type.  In the future this will be done more cleanly
      by interrogating the machine description which is the correct
      way to determine this kind of thing.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4f93d21d
  13. 02 9月, 2012 1 次提交
    • D
      sparc64: Fix bugs in unrolled 256-bit loops. · 699871bc
      David S. Miller 提交于
      Some dm-crypt testing revealed several bugs in the 256-bit unrolled
      loops.
      
      The DECRYPT_256_2() macro had two errors:
      
      1) Missing reload of KEY registers %f60 and %f62
      
      2) Missing "\" in penultimate line of definition.
      
      In aes_sparc64_ecb_decrypt_256, we were storing the second half of the
      encryption result from the wrong source registers.
      
      In aes_sparc64_ctr_crypt_256 we have to be careful when we fall out of
      the 32-byte-at-a-time loop and handle a trailing 16-byte chunk.  In
      that case we've clobbered the final key holding registers and have to
      restore them before executing the ENCRYPT_256() macro.  Inside of the
      32-byte-at-a-time loop things are OK, because we do this key register
      restoring during the first few rounds of the ENCRYPT_256_2() macro.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      699871bc