1. 27 5月, 2015 4 次提交
    • L
      x86/mm/pat: Wrap pat_enabled into a function API · cb32edf6
      Luis R. Rodriguez 提交于
      We use pat_enabled in x86-specific code to see if PAT is enabled
      or not but we're granting full access to it even though readers
      do not need to set it. If, for instance, we granted access to it
      to modules later they then could override the variable
      setting... no bueno.
      
      This renames pat_enabled to a new static variable __pat_enabled.
      Folks are redirected to use pat_enabled() now.
      
      Code that sets this can only be internal to pat.c. Apart from
      the early kernel parameter "nopat" to disable PAT, we also have
      a few cases that disable it later and make use of a helper
      pat_disable(). It is wrapped under an ifdef but since that code
      cannot run unless PAT was enabled its not required to wrap it
      with ifdefs, unwrap that. Likewise, since "nopat" doesn't really
      change non-PAT systems just remove that ifdef as well.
      
      Although we could add and use an early_param_off(), these
      helpers don't use __read_mostly but we want to keep
      __read_mostly for __pat_enabled as this is a hot path -- upon
      boot, for instance, a simple guest may see ~4k accesses to
      pat_enabled(). Since __read_mostly early boot params are not
      that common we don't add a helper for them just yet.
      Signed-off-by: NLuis R. Rodriguez <mcgrof@suse.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Walls <awalls@md.metrocast.net>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dave Airlie <airlied@redhat.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kyle McMartin <kyle@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1430425520-22275-3-git-send-email-mcgrof@do-not-panic.com
      Link: http://lkml.kernel.org/r/1432628901-18044-13-git-send-email-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      cb32edf6
    • L
      x86/mm/pat: Convert to pr_*() usage · 9e76561f
      Luis R. Rodriguez 提交于
      Use pr_info() instead of the old printk to prefix the component
      where things are coming from. With this readers will know
      exactly where the message is coming from. We use pr_* helpers
      but define pr_fmt to the empty string for easier grepping for
      those error messages.
      
      We leave the users of dprintk() in place, this will print only
      when the debugpat kernel parameter is enabled. We want to leave
      those enabled as a debug feature, but also make them use the
      same prefix.
      Signed-off-by: NLuis R. Rodriguez <mcgrof@suse.com>
      [ Kill pr_fmt. ]
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Walls <awalls@md.metrocast.net>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dave Airlie <airlied@redhat.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: cocci@systeme.lip6.fr
      Cc: plagnioj@jcrosoft.com
      Cc: tomi.valkeinen@ti.com
      Link: http://lkml.kernel.org/r/1430425520-22275-2-git-send-email-mcgrof@do-not-panic.com
      Link: http://lkml.kernel.org/r/1432628901-18044-9-git-send-email-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      9e76561f
    • T
      x86/mm/mtrr: Enhance MTRR checks in kernel mapping helpers · b73522e0
      Toshi Kani 提交于
      This patch adds the argument 'uniform' to mtrr_type_lookup(),
      which gets set to 1 when a given range is covered uniformly by
      MTRRs, i.e. the range is fully covered by a single MTRR entry or
      the default type.
      
      Change pud_set_huge() and pmd_set_huge() to honor the 'uniform'
      flag to see if it is safe to create a huge page mapping in the
      range.
      
      This allows them to create a huge page mapping in a range
      covered by a single MTRR entry of any memory type. It also
      detects a non-optimal request properly. They continue to check
      with the WB type since it does not effectively change the
      uniform mapping even if a request spans multiple MTRR entries.
      
      pmd_set_huge() logs a warning message to a non-optimal request
      so that driver writers will be aware of such a case. Drivers
      should make a mapping request aligned to a single MTRR entry
      when the range is covered by MTRRs.
      Signed-off-by: NToshi Kani <toshi.kani@hp.com>
      [ Realign, flesh out comments, improve warning message. ]
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Elliott@hp.com
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dave.hansen@intel.com
      Cc: linux-mm <linux-mm@kvack.org>
      Cc: pebolle@tiscali.nl
      Link: http://lkml.kernel.org/r/1431714237-880-7-git-send-email-toshi.kani@hp.com
      Link: http://lkml.kernel.org/r/1432628901-18044-8-git-send-email-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      b73522e0
    • T
      x86/mm/mtrr: Use symbolic define as a retval for disabled MTRRs · 3d3ca416
      Toshi Kani 提交于
      mtrr_type_lookup() returns verbatim 0xFF when MTRRs are
      disabled. This patch defines MTRR_TYPE_INVALID to clarify the
      meaning of this value, and documents its usage.
      
      Document the return values of the kernel virtual address mapping
      helpers pud_set_huge(), pmd_set_huge, pud_clear_huge() and
      pmd_clear_huge().
      
      There is no functional change in this patch.
      Signed-off-by: NToshi Kani <toshi.kani@hp.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Elliott@hp.com
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dave.hansen@intel.com
      Cc: linux-mm <linux-mm@kvack.org>
      Cc: pebolle@tiscali.nl
      Link: http://lkml.kernel.org/r/1431714237-880-5-git-send-email-toshi.kani@hp.com
      Link: http://lkml.kernel.org/r/1432628901-18044-5-git-send-email-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      3d3ca416
  2. 11 5月, 2015 3 次提交
    • D
      x86/mm/pageattr: Remove an unused variable in slow_virt_to_phys() · 1fcb61c5
      Dexuan Cui 提交于
      The patch doesn't change any logic.
      Signed-off-by: NDexuan Cui <decui@microsoft.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1429776428-4475-1-git-send-email-decui@microsoft.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      1fcb61c5
    • L
      x86/mm: Add ioremap_uc() helper to map memory uncacheable (not UC-) · e4b6be33
      Luis R. Rodriguez 提交于
      ioremap_nocache() currently uses UC- by default. Our goal is to
      eventually make UC the default. Linux maps UC- to PCD=1, PWT=0
      page attributes on non-PAT systems. Linux maps UC to PCD=1,
      PWT=1 page attributes on non-PAT systems. On non-PAT and PAT
      systems a WC MTRR has different effects on pages with either of
      these attributes. In order to help with a smooth transition its
      best to enable use of UC (PCD,1, PWT=1) on a region as that
      ensures a WC MTRR will have no effect on a region, this however
      requires us to have an way to declare a region as UC and we
      currently do not have a way to do this.
      
        WC MTRR on non-PAT system with PCD=1, PWT=0 (UC-) yields WC.
        WC MTRR on non-PAT system with PCD=1, PWT=1 (UC)  yields UC.
      
        WC MTRR on PAT system with PCD=1, PWT=0 (UC-) yields WC.
        WC MTRR on PAT system with PCD=1, PWT=1 (UC)  yields UC.
      
      A flip of the default ioremap_nocache() behaviour from UC- to UC
      can therefore regress a memory region from effective memory type
      WC to UC if MTRRs are used. Use of MTRRs should be phased out
      and in the best case only arch_phys_wc_add() use will remain,
      even if this happens arch_phys_wc_add() will have an effect on
      non-PAT systems and changes to default ioremap_nocache()
      behaviour could regress drivers.
      
      Now, ideally we'd use ioremap_nocache() on the regions in which
      we'd need uncachable memory types and avoid any MTRRs on those
      regions. There are however some restrictions on MTRRs use, such
      as the requirement of having the base and size of variable sized
      MTRRs to be powers of two, which could mean having to use a WC
      MTRR over a large area which includes a region in which
      write-combining effects are undesirable.
      
      Add ioremap_uc() to help with the both phasing out of MTRR use
      and also provide a way to blacklist small WC undesirable regions
      in devices with mixed regions which are size-implicated to use
      large WC MTRRs. Use of ioremap_uc() helps phase out MTRR use by
      avoiding regressions with an eventual flip of default behaviour
      or ioremap_nocache() from UC- to UC.
      
      Drivers working with WC MTRRs can use the below table to review
      and consider the use of ioremap*() and similar helpers to ensure
      appropriate behaviour long term even if default
      ioremap_nocache() behaviour changes from UC- to UC.
      
      Although ioremap_uc() is being added we leave set_memory_uc() to
      use UC- as only initial memory type setup is required to be able
      to accommodate existing device drivers and phase out MTRR use.
      It should also be clarified that set_memory_uc() cannot be used
      with IO memory, even though its use will not return any errors,
      it really has no effect.
      
        ----------------------------------------------------------------------
        MTRR Non-PAT   PAT    Linux ioremap value        Effective memory type
        ----------------------------------------------------------------------
                                                          Non-PAT |  PAT
             PAT
             |PCD
             ||PWT
             |||
        WC   000      WB      _PAGE_CACHE_MODE_WB            WC   |   WC
        WC   001      WC      _PAGE_CACHE_MODE_WC            WC*  |   WC
        WC   010      UC-     _PAGE_CACHE_MODE_UC_MINUS      WC*  |   WC
        WC   011      UC      _PAGE_CACHE_MODE_UC            UC   |   UC
        ----------------------------------------------------------------------
      Signed-off-by: NLuis R. Rodriguez <mcgrof@suse.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Acked-by: NH. Peter Anvin <hpa@zytor.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Antonino Daplas <adaplas@gmail.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dave Airlie <airlied@redhat.com>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Travis <travis@sgi.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Suresh Siddha <sbsiddha@gmail.com>
      Cc: Thierry Reding <treding@nvidia.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Ville Syrjälä <syrjala@sci.fi>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-fbdev@vger.kernel.org
      Link: http://lkml.kernel.org/r/1430343851-967-2-git-send-email-mcgrof@do-not-panic.com
      Link: http://lkml.kernel.org/r/1431332153-18566-9-git-send-email-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      e4b6be33
    • R
      x86/mm: Do not flush last cacheline twice in clflush_cache_range() · 6c434d61
      Ross Zwisler 提交于
      The current algorithm used in clflush_cache_range() can cause
      the last cache line of the buffer to be flushed twice. Fix that
      algorithm so that each cache line will only be flushed once.
      Reported-by: NH. Peter Anvin <hpa@zytor.com>
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Link: http://lkml.kernel.org/r/1430259192-18802-1-git-send-email-ross.zwisler@linux.intel.com
      Link: http://lkml.kernel.org/r/1431332153-18566-5-git-send-email-bp@alien8.de
      [ Changed it to 'void *' to simplify the type conversions. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      6c434d61
  3. 20 4月, 2015 1 次提交
  4. 15 4月, 2015 6 次提交
    • V
      mm: move memtest under mm · 4a20799d
      Vladimir Murzin 提交于
      Memtest is a simple feature which fills the memory with a given set of
      patterns and validates memory contents, if bad memory regions is detected
      it reserves them via memblock API.  Since memblock API is widely used by
      other architectures this feature can be enabled outside of x86 world.
      
      This patch set promotes memtest to live under generic mm umbrella and
      enables memtest feature for arm/arm64.
      
      It was reported that this patch set was useful for tracking down an issue
      with some errant DMA on an arm64 platform.
      
      This patch (of 6):
      
      There is nothing platform dependent in the core memtest code, so other
      platforms might benefit from this feature too.
      
      [linux@roeck-us.net: MEMTEST depends on MEMBLOCK]
      Signed-off-by: NVladimir Murzin <vladimir.murzin@arm.com>
      Acked-by: NWill Deacon <will.deacon@arm.com>
      Tested-by: NMark Rutland <mark.rutland@arm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Paul Bolle <pebolle@tiscali.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4a20799d
    • K
      mm: expose arch_mmap_rnd when available · 2b68f6ca
      Kees Cook 提交于
      When an architecture fully supports randomizing the ELF load location,
      a per-arch mmap_rnd() function is used to find a randomized mmap base.
      In preparation for randomizing the location of ET_DYN binaries
      separately from mmap, this renames and exports these functions as
      arch_mmap_rnd(). Additionally introduces CONFIG_ARCH_HAS_ELF_RANDOMIZE
      for describing this feature on architectures that support it
      (which is a superset of ARCH_BINFMT_ELF_RANDOMIZE_PIE, since s390
      already supports a separated ET_DYN ASLR from mmap ASLR without the
      ARCH_BINFMT_ELF_RANDOMIZE_PIE logic).
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Cc: Hector Marco-Gisbert <hecmargi@upv.es>
      Cc: Russell King <linux@arm.linux.org.uk>
      Reviewed-by: NIngo Molnar <mingo@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: "David A. Long" <dave.long@linaro.org>
      Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Arun Chandran <achandran@mvista.com>
      Cc: Yann Droneaud <ydroneaud@opteya.com>
      Cc: Min-Hua Chen <orca.chen@gmail.com>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: Alex Smith <alex@alex-smith.me.uk>
      Cc: Markos Chandras <markos.chandras@imgtec.com>
      Cc: Vineeth Vijayan <vvijayan@mvista.com>
      Cc: Jeff Bailey <jeffbailey@google.com>
      Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Behan Webster <behanw@converseincode.com>
      Cc: Ismael Ripoll <iripoll@upv.es>
      Cc: Jan-Simon Mller <dl9pf@gmx.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2b68f6ca
    • K
      x86: standardize mmap_rnd() usage · 82168140
      Kees Cook 提交于
      In preparation for splitting out ET_DYN ASLR, this refactors the use of
      mmap_rnd() to be used similarly to arm, and extracts the checking of
      PF_RANDOMIZE.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Reviewed-by: NIngo Molnar <mingo@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      82168140
    • T
      x86, mm: support huge KVA mappings on x86 · 6b637835
      Toshi Kani 提交于
      Implement huge KVA mapping interfaces on x86.
      
      On x86, MTRRs can override PAT memory types with a 4KB granularity.  When
      using a huge page, MTRRs can override the memory type of the huge page,
      which may lead a performance penalty.  The processor can also behave in an
      undefined manner if a huge page is mapped to a memory range that MTRRs
      have mapped with multiple different memory types.  Therefore, the mapping
      code falls back to use a smaller page size toward 4KB when a mapping range
      is covered by non-WB type of MTRRs.  The WB type of MTRRs has no affect on
      the PAT memory types.
      
      pud_set_huge() and pmd_set_huge() call mtrr_type_lookup() to see if a
      given range is covered by MTRRs.  MTRR_TYPE_WRBACK indicates that the
      range is either covered by WB or not covered and the MTRR default value is
      set to WB.  0xFF indicates that MTRRs are disabled.
      
      HAVE_ARCH_HUGE_VMAP is selected when X86_64 or X86_32 with X86_PAE is set.
       X86_32 without X86_PAE is not supported since such config can unlikey be
      benefited from this feature, and there was an issue found in testing.
      
      [fengguang.wu@intel.com: ioremap_pud_capable can be static]
      Signed-off-by: NToshi Kani <toshi.kani@hp.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Robert Elliott <Elliott@hp.com>
      Signed-off-by: NFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6b637835
    • T
      x86, mm: support huge I/O mapping capability I/F · 5d72b4fb
      Toshi Kani 提交于
      Implement huge I/O mapping capability interfaces for ioremap() on x86.
      
      IOREMAP_MAX_ORDER is defined to PUD_SHIFT on x86/64 and PMD_SHIFT on
      x86/32, which overrides the default value defined in <linux/vmalloc.h>.
      Signed-off-by: NToshi Kani <toshi.kani@hp.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Robert Elliott <Elliott@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5d72b4fb
    • K
      x86: expose number of page table levels on Kconfig level · 98233368
      Kirill A. Shutemov 提交于
      We would want to use number of page table level to define mm_struct.
      Let's expose it as CONFIG_PGTABLE_LEVELS.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Tested-by: NGuenter Roeck <linux@roeck-us.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      98233368
  5. 07 4月, 2015 1 次提交
    • D
      x86/mm/numa: Fix kernel stack corruption in numa_init()->numa_clear_kernel_node_hotplug() · 22ef882e
      Dave Young 提交于
      I got below kernel panic during kdump test on Thinkpad T420
      laptop:
      
      [    0.000000] No NUMA configuration found
      [    0.000000] Faking a node at [mem 0x0000000000000000-0x0000000037ba4fff]
      [    0.000000] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffff81d21910
       ...
      [    0.000000] Call Trace:
      [    0.000000]  [<ffffffff817c2a26>] dump_stack+0x45/0x57
      [    0.000000]  [<ffffffff817bc8d2>] panic+0xd0/0x204
      [    0.000000]  [<ffffffff81d21910>] ? numa_clear_kernel_node_hotplug+0xe6/0xf2
      [    0.000000]  [<ffffffff8107741b>] __stack_chk_fail+0x1b/0x20
      [    0.000000]  [<ffffffff81d21910>] numa_clear_kernel_node_hotplug+0xe6/0xf2
      [    0.000000]  [<ffffffff81d21e5d>] numa_init+0x1a5/0x520
      [    0.000000]  [<ffffffff81d222b1>] x86_numa_init+0x19/0x3d
      [    0.000000]  [<ffffffff81d22460>] initmem_init+0x9/0xb
      [    0.000000]  [<ffffffff81d0d00c>] setup_arch+0x94f/0xc82
      [    0.000000]  [<ffffffff81d05120>] ? early_idt_handlers+0x120/0x120
      [    0.000000]  [<ffffffff817bd0bb>] ? printk+0x55/0x6b
      [    0.000000]  [<ffffffff81d05120>] ? early_idt_handlers+0x120/0x120
      [    0.000000]  [<ffffffff81d05d9b>] start_kernel+0xe8/0x4d6
      [    0.000000]  [<ffffffff81d05120>] ? early_idt_handlers+0x120/0x120
      [    0.000000]  [<ffffffff81d05120>] ? early_idt_handlers+0x120/0x120
      [    0.000000]  [<ffffffff81d055ee>] x86_64_start_reservations+0x2a/0x2c
      [    0.000000]  [<ffffffff81d05751>] x86_64_start_kernel+0x161/0x184
      [    0.000000] ---[ end Kernel panic - not syncing: stack-protector: Kernel sta
      
      This is caused by writing over the end of numa mask bitmap
      in numa_clear_kernel_node().
      
      numa_clear_kernel_node() tries to set the node id in a mask bitmap,
      by iterating all reserved regions and assuming that every region
      has a valid nid.
      
      This assumption is not true because there's an exception for some
      graphic memory quirks. See trim_snb_memory() in arch/x86/kernel/setup.c
      
      It is easily to reproduce the bug in the kdump kernel because kdump
      kernel use pre-reserved memory instead of the whole memory, but
      kexec pass other reserved memory ranges to 2nd kernel as well.
      like below in my test:
      
      kdump kernel ram 0x2d000000 - 0x37bfffff
      One of the reserved regions: 0x40000000 - 0x40100000 which
      includes 0x40004000, a page excluded in trim_snb_memory(). For
      this memblock reserved region the nid is not set, it is still
      default value MAX_NUMNODES. later node_set will set bit
      MAX_NUMNODES thus stack corruption happen.
      
      This also happens when booting with mem= kernel commandline
      during my test.
      
      Fixing it by adding a check, do not call node_set in case nid is
      MAX_NUMNODES.
      Signed-off-by: NDave Young <dyoung@redhat.com>
      Reviewed-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bhe@redhat.com
      Cc: qiuxishi@huawei.com
      Link: http://lkml.kernel.org/r/20150407134132.GA23522@dhcp-16-198.nay.redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      22ef882e
  6. 23 3月, 2015 2 次提交
  7. 05 3月, 2015 6 次提交
    • I
      x86/mm/pat: Initialize __cachemode2pte_tbl[] and __pte2cachemode_tbl[] in a... · c709feda
      Ingo Molnar 提交于
      x86/mm/pat: Initialize __cachemode2pte_tbl[] and __pte2cachemode_tbl[] in a bit more readable fashion
      
      The initialization of these two arrays is a bit difficult to follow:
      restructure it optically so that a 2D structure shows which bit in
      the PTE is set and which not.
      
      Also improve on comments a bit.
      
      No code or data changed:
      
        # arch/x86/mm/init.o:
      
         text    data     bss     dec     hex filename
         4585     424   29776   34785    87e1 init.o.before
         4585     424   29776   34785    87e1 init.o.after
      
      md5:
         a82e11ff58bcfd0af3a94662a701f65d  init.o.before.asm
         a82e11ff58bcfd0af3a94662a701f65d  init.o.after.asm
      Reviewed-by: NJuergen Gross <jgross@suse.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Jan Beulich <JBeulich@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/20150305082135.GB5969@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      c709feda
    • I
      x86/mm: Simplify probe_page_size_mask() · e61980a7
      Ingo Molnar 提交于
      Now that we've simplified the gbpages config space, move the
      'page_size_mask' initialization into probe_page_size_mask(),
      right next to the PSE and PGE enablement lines.
      
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Dexuan Cui <decui@microsoft.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: JBeulich@suse.com
      Cc: Jan Beulich <JBeulich@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Lindgren <tony@atomide.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: julia.lawall@lip6.fr
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      e61980a7
    • I
      x86/mm: Further simplify 1 GB kernel linear mappings handling · 10971ab2
      Ingo Molnar 提交于
      It's a bit pointless to allow Kconfig configuration for 1GB kernel
      mappings, it's already hidden behind a 'default y' and CONFIG_EXPERT.
      
      Remove this complication and simplify the code by renaming
      CONFIG_ENABLE_DIRECT_GBPAGES to CONFIG_X86_DIRECT_GBPAGES and
      document the DEBUG_PAGE_ALLOC and KMEMCHECK quirks.
      
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Dexuan Cui <decui@microsoft.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: JBeulich@suse.com
      Cc: Jan Beulich <JBeulich@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Lindgren <tony@atomide.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: julia.lawall@lip6.fr
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      10971ab2
    • L
      x86/mm: Use early_param_on_off() for direct_gbpages · 73c8c861
      Luis R. Rodriguez 提交于
      The enabler / disabler is pretty simple, just use the
      provided wrappers, this lets us easily relate the variable
      to the associated Kconfig entry.
      Signed-off-by: NLuis R. Rodriguez <mcgrof@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Dexuan Cui <decui@microsoft.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: JBeulich@suse.com
      Cc: Jan Beulich <JBeulich@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Lindgren <tony@atomide.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: julia.lawall@lip6.fr
      Link: http://lkml.kernel.org/r/1425518654-3403-5-git-send-email-mcgrof@do-not-panic.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      73c8c861
    • L
      x86/mm: Simplify enabling direct_gbpages · e5008abe
      Luis R. Rodriguez 提交于
      direct_gbpages can be force enabled as an early parameter
      but not really have taken effect when DEBUG_PAGEALLOC
      or KMEMCHECK is enabled. You can also enable direct_gbpages
      right now if you have an x86_64 architecture but your CPU
      doesn't really have support for this feature. In both cases
      PG_LEVEL_1G won't actually be enabled but direct_gbpages is used
      in other areas under the assumptions that PG_LEVEL_1G
      was set. Fix this by putting together all requirements
      which make this feature sensible to enable under, and only
      enable both finally flipping on PG_LEVEL_1G and leaving
      PG_LEVEL_1G set when this is true.
      
      We only enable this feature then to be possible on sensible
      builds defined by the new ENABLE_DIRECT_GBPAGES. If the
      CPU has support for it you can either enable this by using
      the DIRECT_GBPAGES option or using the early kernel parameter.
      If a platform had support for this you can always force disable
      it as well.
      Signed-off-by: NLuis R. Rodriguez <mcgrof@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Dexuan Cui <decui@microsoft.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: JBeulich@suse.com
      Cc: Jan Beulich <JBeulich@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Lindgren <tony@atomide.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: julia.lawall@lip6.fr
      Link: http://lkml.kernel.org/r/1425518654-3403-3-git-send-email-mcgrof@do-not-panic.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      e5008abe
    • L
      x86/mm: Use IS_ENABLED() for direct_gbpages · d9fd579c
      Luis R. Rodriguez 提交于
      Replace #ifdef eyesore with IS_ENABLED() use.
      Signed-off-by: NLuis R. Rodriguez <mcgrof@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Dexuan Cui <decui@microsoft.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: JBeulich@suse.com
      Cc: Jan Beulich <JBeulich@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Lindgren <tony@atomide.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: julia.lawall@lip6.fr
      Link: http://lkml.kernel.org/r/1425518654-3403-2-git-send-email-mcgrof@do-not-panic.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d9fd579c
  8. 28 2月, 2015 1 次提交
  9. 19 2月, 2015 5 次提交
    • H
      x86, mm/ASLR: Fix stack randomization on 64-bit systems · 4e7c22d4
      Hector Marco-Gisbert 提交于
      The issue is that the stack for processes is not properly randomized on
      64 bit architectures due to an integer overflow.
      
      The affected function is randomize_stack_top() in file
      "fs/binfmt_elf.c":
      
        static unsigned long randomize_stack_top(unsigned long stack_top)
        {
                 unsigned int random_variable = 0;
      
                 if ((current->flags & PF_RANDOMIZE) &&
                         !(current->personality & ADDR_NO_RANDOMIZE)) {
                         random_variable = get_random_int() & STACK_RND_MASK;
                         random_variable <<= PAGE_SHIFT;
                 }
                 return PAGE_ALIGN(stack_top) + random_variable;
                 return PAGE_ALIGN(stack_top) - random_variable;
        }
      
      Note that, it declares the "random_variable" variable as "unsigned int".
      Since the result of the shifting operation between STACK_RND_MASK (which
      is 0x3fffff on x86_64, 22 bits) and PAGE_SHIFT (which is 12 on x86_64):
      
      	  random_variable <<= PAGE_SHIFT;
      
      then the two leftmost bits are dropped when storing the result in the
      "random_variable". This variable shall be at least 34 bits long to hold
      the (22+12) result.
      
      These two dropped bits have an impact on the entropy of process stack.
      Concretely, the total stack entropy is reduced by four: from 2^28 to
      2^30 (One fourth of expected entropy).
      
      This patch restores back the entropy by correcting the types involved
      in the operations in the functions randomize_stack_top() and
      stack_maxrandom_size().
      
      The successful fix can be tested with:
      
        $ for i in `seq 1 10`; do cat /proc/self/maps | grep stack; done
        7ffeda566000-7ffeda587000 rw-p 00000000 00:00 0                          [stack]
        7fff5a332000-7fff5a353000 rw-p 00000000 00:00 0                          [stack]
        7ffcdb7a1000-7ffcdb7c2000 rw-p 00000000 00:00 0                          [stack]
        7ffd5e2c4000-7ffd5e2e5000 rw-p 00000000 00:00 0                          [stack]
        ...
      
      Once corrected, the leading bytes should be between 7ffc and 7fff,
      rather than always being 7fff.
      Signed-off-by: NHector Marco-Gisbert <hecmargi@upv.es>
      Signed-off-by: NIsmael Ripoll <iripoll@upv.es>
      [ Rebased, fixed 80 char bugs, cleaned up commit message, added test example and CVE ]
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Cc: <stable@vger.kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Fixes: CVE-2015-1593
      Link: http://lkml.kernel.org/r/20150214173350.GA18393@www.outflux.netSigned-off-by: NBorislav Petkov <bp@suse.de>
      4e7c22d4
    • D
      x86/mm/init: Fix incorrect page size in init_memory_mapping() printks · f15e0518
      Dave Hansen 提交于
      With 32-bit non-PAE kernels, we have 2 page sizes available
      (at most): 4k and 4M.
      
      Enabling PAE replaces that 4M size with a 2M one (which 64-bit
      systems use too).
      
      But, when booting a 32-bit non-PAE kernel, in one of our
      early-boot printouts, we say:
      
        init_memory_mapping: [mem 0x00000000-0x000fffff]
         [mem 0x00000000-0x000fffff] page 4k
        init_memory_mapping: [mem 0x37000000-0x373fffff]
         [mem 0x37000000-0x373fffff] page 2M
        init_memory_mapping: [mem 0x00100000-0x36ffffff]
         [mem 0x00100000-0x003fffff] page 4k
         [mem 0x00400000-0x36ffffff] page 2M
        init_memory_mapping: [mem 0x37400000-0x377fdfff]
         [mem 0x37400000-0x377fdfff] page 4k
      
      Which is obviously wrong.  There is no 2M page available.  This
      is probably because of a badly-named variable: in the map_range
      code: PG_LEVEL_2M.
      
      Instead of renaming all the PG_LEVEL_2M's.  This patch just
      fixes the printout:
      
        init_memory_mapping: [mem 0x00000000-0x000fffff]
         [mem 0x00000000-0x000fffff] page 4k
        init_memory_mapping: [mem 0x37000000-0x373fffff]
         [mem 0x37000000-0x373fffff] page 4M
        init_memory_mapping: [mem 0x00100000-0x36ffffff]
         [mem 0x00100000-0x003fffff] page 4k
         [mem 0x00400000-0x36ffffff] page 4M
        init_memory_mapping: [mem 0x37400000-0x377fdfff]
         [mem 0x37400000-0x377fdfff] page 4k
        BRK [0x03206000, 0x03206fff] PGTABLE
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Link: http://lkml.kernel.org/r/20150210212030.665EC267@viggo.jf.intel.comSigned-off-by: NBorislav Petkov <bp@suse.de>
      f15e0518
    • J
      x86-64: Also clear _PAGE_GLOBAL from __supported_pte_mask if !cpu_has_pge · 0cdb81be
      Jan Beulich 提交于
      Not just setting it when the feature is available is for
      consistency, and may allow Xen to drop its custom clearing of
      the flag (unless it needs it cleared earlier than this code
      executes). Note that the change is benign to ix86, as the flag
      starts out clear there.
      Signed-off-by: NJan Beulich <jbeulich@suse.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/54C215D10200007800058912@mail.emea.novell.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      0cdb81be
    • P
      x86/mm/pat: Ensure different messages in STRICT_DEVMEM and PAT cases · 1f40a8bf
      Pavel Machek 提交于
      STRICT_DEVMEM and PAT produce same failure accessing /dev/mem,
      which is quite confusing to the user. Make printk messages
      different to lessen confusion.
      Signed-off-by: NPavel Machek <pavel@ucw.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      1f40a8bf
    • F
      x86/mm: Reduce PAE-mode per task pgd allocation overhead from 4K to 32 bytes · 1db491f7
      Fenghua Yu 提交于
      With more embedded systems emerging using Quark, among other
      things, 32-bit kernel matters again. 32-bit machine and kernel
      uses PAE paging, which currently wastes at least 4K of memory
      per process on Linux where we have to reserve an entire page to
      support a single 32-byte PGD structure. It would be a very good
      thing if we could eliminate that wastage.
      
      PAE paging is used to access more than 4GB memory on x86-32. And
      it is required for NX.
      
      In this patch, we still allocate one page for pgd for a Xen
      domain and 64-bit kernel because one page pgd is assumed in
      these cases. But we can save memory space by only allocating
      32-byte pgd for 32-bit PAE kernel when it is not running as a
      Xen domain.
      Signed-off-by: NFenghua Yu <fenghua.yu@intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Glenn Williamson <glenn.p.williamson@intel.com>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1421382601-46912-1-git-send-email-fenghua.yu@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      1db491f7
  10. 14 2月, 2015 4 次提交
    • A
      kasan: enable instrumentation of global variables · bebf56a1
      Andrey Ryabinin 提交于
      This feature let us to detect accesses out of bounds of global variables.
      This will work as for globals in kernel image, so for globals in modules.
      Currently this won't work for symbols in user-specified sections (e.g.
      __init, __read_mostly, ...)
      
      The idea of this is simple.  Compiler increases each global variable by
      redzone size and add constructors invoking __asan_register_globals()
      function.  Information about global variable (address, size, size with
      redzone ...) passed to __asan_register_globals() so we could poison
      variable's redzone.
      
      This patch also forces module_alloc() to return 8*PAGE_SIZE aligned
      address making shadow memory handling (
      kasan_module_alloc()/kasan_module_free() ) more simple.  Such alignment
      guarantees that each shadow page backing modules address space correspond
      to only one module_alloc() allocation.
      Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: NAndrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bebf56a1
    • A
      kasan: enable stack instrumentation · c420f167
      Andrey Ryabinin 提交于
      Stack instrumentation allows to detect out of bounds memory accesses for
      variables allocated on stack.  Compiler adds redzones around every
      variable on stack and poisons redzones in function's prologue.
      
      Such approach significantly increases stack usage, so all in-kernel stacks
      size were doubled.
      Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: NAndrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c420f167
    • A
      x86_64: add KASan support · ef7f0d6a
      Andrey Ryabinin 提交于
      This patch adds arch specific code for kernel address sanitizer.
      
      16TB of virtual addressed used for shadow memory.  It's located in range
      [ffffec0000000000 - fffffc0000000000] between vmemmap and %esp fixup
      stacks.
      
      At early stage we map whole shadow region with zero page.  Latter, after
      pages mapped to direct mapping address range we unmap zero pages from
      corresponding shadow (see kasan_map_shadow()) and allocate and map a real
      shadow memory reusing vmemmap_populate() function.
      
      Also replace __pa with __pa_nodebug before shadow initialized.  __pa with
      CONFIG_DEBUG_VIRTUAL=y make external function call (__phys_addr)
      __phys_addr is instrumented, so __asan_load could be called before shadow
      area initialized.
      Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: NAndrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jim Davis <jim.epost@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ef7f0d6a
    • T
      x86: use %*pb[l] to print bitmaps including cpumasks and nodemasks · bf58b487
      Tejun Heo 提交于
      printk and friends can now format bitmaps using '%*pb[l]'.  cpumask
      and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
      respectively which can be used to generate the two printf arguments
      necessary to format the specified cpu/nodemask.
      
      * Unnecessary buffer size calculation and condition on the lenght
        removed from intel_cacheinfo.c::show_shared_cpu_map_func().
      
      * uv_nmi_nr_cpus_pr() got overly smart and implemented "..."
        abbreviation if the output stretched over the predefined 1024 byte
        buffer.  Replaced with plain printk.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Mike Travis <travis@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bf58b487
  11. 13 2月, 2015 1 次提交
  12. 12 2月, 2015 4 次提交
    • A
      mm: gup: use get_user_pages_unlocked within get_user_pages_fast · a7b78075
      Andrea Arcangeli 提交于
      This allows the get_user_pages_fast slow path to release the mmap_sem
      before blocking.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a7b78075
    • K
      mm: account pmd page tables to the process · dc6c9a35
      Kirill A. Shutemov 提交于
      Dave noticed that unprivileged process can allocate significant amount of
      memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and
      memory cgroup.  The trick is to allocate a lot of PMD page tables.  Linux
      kernel doesn't account PMD tables to the process, only PTE.
      
      The use-cases below use few tricks to allocate a lot of PMD page tables
      while keeping VmRSS and VmPTE low.  oom_score for the process will be 0.
      
      	#include <errno.h>
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <sys/mman.h>
      	#include <sys/prctl.h>
      
      	#define PUD_SIZE (1UL << 30)
      	#define PMD_SIZE (1UL << 21)
      
      	#define NR_PUD 130000
      
      	int main(void)
      	{
      		char *addr = NULL;
      		unsigned long i;
      
      		prctl(PR_SET_THP_DISABLE);
      		for (i = 0; i < NR_PUD ; i++) {
      			addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
      					MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
      			if (addr == MAP_FAILED) {
      				perror("mmap");
      				break;
      			}
      			*addr = 'x';
      			munmap(addr, PMD_SIZE);
      			mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
      					MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
      			if (addr == MAP_FAILED)
      				perror("re-mmap"), exit(1);
      		}
      		printf("PID %d consumed %lu KiB in PMD page tables\n",
      				getpid(), i * 4096 >> 10);
      		return pause();
      	}
      
      The patch addresses the issue by account PMD tables to the process the
      same way we account PTE.
      
      The main place where PMD tables is accounted is __pmd_alloc() and
      free_pmd_range(). But there're few corner cases:
      
       - HugeTLB can share PMD page tables. The patch handles by accounting
         the table to all processes who share it.
      
       - x86 PAE pre-allocates few PMD tables on fork.
      
       - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity
         check on exit(2).
      
      Accounting only happens on configuration where PMD page table's level is
      present (PMD is not folded).  As with nr_ptes we use per-mm counter.  The
      counter value is used to calculate baseline for badness score by
      oom-killer.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: David Rientjes <rientjes@google.com>
      Tested-by: NSedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dc6c9a35
    • N
      mm/hugetlb: pmd_huge() returns true for non-present hugepage · cbef8478
      Naoya Horiguchi 提交于
      Migrating hugepages and hwpoisoned hugepages are considered as non-present
      hugepages, and they are referenced via migration entries and hwpoison
      entries in their page table slots.
      
      This behavior causes race condition because pmd_huge() doesn't tell
      non-huge pages from migrating/hwpoisoned hugepages.  follow_page_mask() is
      one example where the kernel would call follow_page_pte() for such
      hugepage while this function is supposed to handle only normal pages.
      
      To avoid this, this patch makes pmd_huge() return true when pmd_none() is
      true *and* pmd_present() is false.  We don't have to worry about mixing up
      non-present pmd entry with normal pmd (pointing to leaf level pte entry)
      because pmd_present() is true in normal pmd.
      
      The same race condition could happen in (x86-specific) gup_pmd_range(),
      where this patch simply adds pmd_present() check instead of pmd_huge().
      This is because gup_pmd_range() is fast path.  If we have non-present
      hugepage in this function, we will go into gup_huge_pmd(), then return 0
      at flag mask check, and finally fall back to the slow path.
      
      Fixes: 290408d4 ("hugetlb: hugepage migration core")
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: <stable@vger.kernel.org>	[2.6.36+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cbef8478
    • N
      mm/hugetlb: reduce arch dependent code around follow_huge_* · 61f77eda
      Naoya Horiguchi 提交于
      Currently we have many duplicates in definitions around
      follow_huge_addr(), follow_huge_pmd(), and follow_huge_pud(), so this
      patch tries to remove the m.  The basic idea is to put the default
      implementation for these functions in mm/hugetlb.c as weak symbols
      (regardless of CONFIG_ARCH_WANT_GENERAL_HUGETL B), and to implement
      arch-specific code only when the arch needs it.
      
      For follow_huge_addr(), only powerpc and ia64 have their own
      implementation, and in all other architectures this function just returns
      ERR_PTR(-EINVAL).  So this patch sets returning ERR_PTR(-EINVAL) as
      default.
      
      As for follow_huge_(pmd|pud)(), if (pmd|pud)_huge() is implemented to
      always return 0 in your architecture (like in ia64 or sparc,) it's never
      called (the callsite is optimized away) no matter how implemented it is.
      So in such architectures, we don't need arch-specific implementation.
      
      In some architecture (like mips, s390 and tile,) their current
      arch-specific follow_huge_(pmd|pud)() are effectively identical with the
      common code, so this patch lets these architecture use the common code.
      
      One exception is metag, where pmd_huge() could return non-zero but it
      expects follow_huge_pmd() to always return NULL.  This means that we need
      arch-specific implementation which returns NULL.  This behavior looks
      strange to me (because non-zero pmd_huge() implies that the architecture
      supports PMD-based hugepage, so follow_huge_pmd() can/should return some
      relevant value,) but that's beyond this cleanup patch, so let's keep it.
      
      Justification of non-trivial changes:
      - in s390, follow_huge_pmd() checks !MACHINE_HAS_HPAGE at first, and this
        patch removes the check. This is OK because we can assume MACHINE_HAS_HPAGE
        is true when follow_huge_pmd() can be called (note that pmd_huge() has
        the same check and always returns 0 for !MACHINE_HAS_HPAGE.)
      - in s390 and mips, we use HPAGE_MASK instead of PMD_MASK as done in common
        code. This patch forces these archs use PMD_MASK, but it's OK because
        they are identical in both archs.
        In s390, both of HPAGE_SHIFT and PMD_SHIFT are 20.
        In mips, HPAGE_SHIFT is defined as (PAGE_SHIFT + PAGE_SHIFT - 3) and
        PMD_SHIFT is define as (PAGE_SHIFT + PAGE_SHIFT + PTE_ORDER - 3), but
        PTE_ORDER is always 0, so these are identical.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      61f77eda
  13. 11 2月, 2015 1 次提交
  14. 04 2月, 2015 1 次提交