  1. 24 Feb 2013, 1 commit
  2. 18 Feb 2013, 1 commit
  3. 13 Feb 2013, 1 commit
    • x86/mm: Check if PUD is large when validating a kernel address · 0ee364eb
      Authored by Mel Gorman
      A user reported the following oops when a backup process reads
      /proc/kcore:
      
       BUG: unable to handle kernel paging request at ffffbb00ff33b000
       IP: [<ffffffff8103157e>] kern_addr_valid+0xbe/0x110
       [...]
      
       Call Trace:
        [<ffffffff811b8aaa>] read_kcore+0x17a/0x370
        [<ffffffff811ad847>] proc_reg_read+0x77/0xc0
        [<ffffffff81151687>] vfs_read+0xc7/0x130
        [<ffffffff811517f3>] sys_read+0x53/0xa0
        [<ffffffff81449692>] system_call_fastpath+0x16/0x1b
      
      Investigation determined that the bug triggered when reading
      system RAM at the 4G mark. On this system, that was the first
      address using 1G pages for the virt->phys direct mapping so the
      PUD is pointing to a physical address, not a PMD page.
      
      The problem is that the page table walker in kern_addr_valid() does
      not check pud_large() and treats the physical address as if it were
      a PMD.  If it happens to look like pmd_none then it will silently
      fail, probably returning zeros instead of real data.  If the data
      happens to look like a present PMD, though, it will be walked,
      resulting in the oops above.
      
      This patch adds the necessary pud_large() check (a sketch of the
      check follows this entry).
      
      Unfortunately, the problem was not readily reproducible, and the user
      is now running the backup program without accessing /proc/kcore, so
      the patch has not been validated, but I think it makes sense.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: stable@vger.kernel.org
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20130211145236.GX21389@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0ee364eb
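
      Below is a minimal sketch of the check the commit above describes,
      inside the kern_addr_valid() walker (simplified; pud_large() and
      pud_pfn() are the usual x86 page-table helpers, and the exact hunk
      in the patch may differ slightly):

        pud = pud_offset(pgd, addr);
        if (pud_none(*pud))
                return 0;

        /*
         * A large (1G) PUD maps physical memory directly; there is no
         * PMD page to descend into, so validate the pfn here instead of
         * dereferencing the physical address as if it were a PMD table.
         */
        if (pud_large(*pud))
                return pfn_valid(pud_pfn(*pud));

        pmd = pmd_offset(pud, addr);
        if (pmd_none(*pmd))
                return 0;
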
  4. 08 Feb 2013, 1 commit
  5. 01 Feb 2013, 2 commits
    • x86-32, mm: Remove reference to alloc_remap() · 07f4207a
      Authored by H. Peter Anvin
      We have removed the remap allocator for x86-32, and x86-64 never had
      it (and doesn't need it).  Remove residual reference to it.
      Reported-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/CAE9FiQVn6_QZi3fNQ-JHYiR-7jeDJ5hT0SyT_%2BzVvfOj=PzF3w@mail.gmail.com
      07f4207a
    • x86-32, mm: Rip out x86_32 NUMA remapping code · f03574f2
      Authored by Dave Hansen
      This code was an optimization for 32-bit NUMA systems.
      
      It has probably been the cause of a number of subtle bugs over
      the years, although the conditions to excite them would have
      been hard to trigger.  Essentially, we remap part of the kernel
      linear mapping area, and then sometimes part of that area gets
      freed back in to the bootmem allocator.  If those pages get
      used by kernel data structures (say mem_map[] or a dentry),
      there's no big deal.  But, if anyone ever tried to use the
      linear mapping for these pages _and_ cared about their physical
      address, bad things happen.
      
      For instance, say you passed __GFP_ZERO to the page allocator and
      then happened to get handed one of these pages: the allocator would
      zero the remapped page, but any PTE created for it would point to
      the _old_ page.
      There are probably a hundred other ways that it could screw
      with things.
      
      We don't need to hang on to performance optimizations for
      these old boxes any more.  All my 32-bit NUMA systems are long
      dead and buried, and I probably had access to more than most
      people.
      
      This code is causing real things to break today:
      
      	https://lkml.org/lkml/2013/1/9/376
      
      I looked into actually fixing this, but it requires surgery on way
      too much brittle code, as well as on things like
      per_cpu_ptr_to_phys().
      
      [ hpa: Cc: this for -stable, since it is a memory corruption issue.
        However, an alternative is to simply mark NUMA as depends BROKEN
        rather than EXPERIMENTAL in the X86_32 subclause... ]
      
      Link: http://lkml.kernel.org/r/20130131005616.1C79F411@kernel.stglabs.ibm.com
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      f03574f2
  6. 31 Jan 2013, 1 commit
  7. 30 Jan 2013, 7 commits
    • x86, 64bit, mm: Mark data/bss/brk to nx · 72212675
      Authored by Yinghai Lu
      HPA said we should not have RW and +x set at the same time.

      For this kernel layout:
      [    0.000000] Kernel Layout:
      [    0.000000]   .text: [0x01000000-0x021434f8]
      [    0.000000] .rodata: [0x02200000-0x02a13fff]
      [    0.000000]   .data: [0x02c00000-0x02dc763f]
      [    0.000000]   .init: [0x02dc9000-0x0312cfff]
      [    0.000000]    .bss: [0x0313b000-0x03dd6fff]
      [    0.000000]    .brk: [0x03dd7000-0x03dfffff]
      
      Before the patch, we have:
      ---[ High Kernel Mapping ]---
      0xffffffff80000000-0xffffffff81000000          16M                           pmd
      0xffffffff81000000-0xffffffff82200000          18M     ro         PSE GLB x  pmd
      0xffffffff82200000-0xffffffff82c00000          10M     ro         PSE GLB NX pmd
      0xffffffff82c00000-0xffffffff82dc9000        1828K     RW             GLB x  pte
      0xffffffff82dc9000-0xffffffff82e00000         220K     RW             GLB NX pte
      0xffffffff82e00000-0xffffffff83000000           2M     RW         PSE GLB NX pmd
      0xffffffff83000000-0xffffffff8313a000        1256K     RW             GLB NX pte
      0xffffffff8313a000-0xffffffff83200000         792K     RW             GLB x  pte
      0xffffffff83200000-0xffffffff83e00000          12M     RW         PSE GLB x  pmd
      0xffffffff83e00000-0xffffffffa0000000         450M                           pmd
      
      After the patch, we get:
      ---[ High Kernel Mapping ]---
      0xffffffff80000000-0xffffffff81000000          16M                           pmd
      0xffffffff81000000-0xffffffff82200000          18M     ro         PSE GLB x  pmd
      0xffffffff82200000-0xffffffff82c00000          10M     ro         PSE GLB NX pmd
      0xffffffff82c00000-0xffffffff82e00000           2M     RW             GLB NX pte
      0xffffffff82e00000-0xffffffff83000000           2M     RW         PSE GLB NX pmd
      0xffffffff83000000-0xffffffff83200000           2M     RW             GLB NX pte
      0xffffffff83200000-0xffffffff83e00000          12M     RW         PSE GLB NX pmd
      0xffffffff83e00000-0xffffffffa0000000         450M                           pmd
      
      So .data, .bss and .brk now get NX (a sketch of the change follows
      this entry).
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Link: http://lkml.kernel.org/r/1359058816-7615-33-git-send-email-yinghai@kernel.org
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      72212675
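
      A sketch of the assumed shape of the change in mark_rodata_ro():
      extend the NX range so it runs from the start of .rodata through the
      end of .brk instead of stopping after the rodata region (symbol names
      follow the usual kernel linker script; the exact code may differ):

        unsigned long rodata_start = PFN_ALIGN(__start_rodata);
        unsigned long all_end = PFN_ALIGN(&_end);       /* end of .brk */

        /* Previously the NX range ended after the rodata region, leaving
         * pieces of .data/.bss/.brk mapped RW and executable; now the
         * whole span through .brk is marked NX. */
        set_memory_nx(rodata_start, (all_end - rodata_start) >> PAGE_SHIFT);
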
    • x86, kexec, 64bit: Only set ident mapping for ram. · 0e691cf8
      Authored by Yinghai Lu
      We should set mappings only for usable memory ranges under max_pfn,
      otherwise we cause the same problem that is fixed by

      	x86, mm: Only direct map addresses that are marked as E820_RAM

      This patch exposes the pfn_mapped array and only sets the identity
      mapping for ranges in that array.

      This patch relies on the new kernel_ident_mapping_init(), which can
      handle existing pgd/pud entries across different calls (a sketch of
      the resulting setup follows this entry).
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Link: http://lkml.kernel.org/r/1359058816-7615-25-git-send-email-yinghai@kernel.org
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      0e691cf8
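
      A sketch of the resulting kexec page-table setup, assuming the shape
      of the exposed pfn_mapped array and of kernel_ident_mapping_init()
      (simplified; the allocator and flag names are illustrative):

        struct x86_mapping_info info = {
                .alloc_pgt_page = alloc_pgt_page,       /* kexec's allocator */
                .context        = image,
                .pmd_flag       = __PAGE_KERNEL_LARGE_EXEC,
        };
        pgd_t *level4p = (pgd_t *)__va(start_pgtable);
        int i, result;

        /* Identity-map only the ranges the kernel itself direct-mapped. */
        for (i = 0; i < nr_pfn_mapped; i++) {
                unsigned long mstart = PFN_PHYS(pfn_mapped[i].start);
                unsigned long mend   = PFN_PHYS(pfn_mapped[i].end);

                result = kernel_ident_mapping_init(&info, level4p,
                                                   mstart, mend);
                if (result)
                        return result;
        }
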
    • x86, 64bit: Don't set max_pfn_mapped wrong value early on native path · 10054230
      Authored by Yinghai Lu
      max_pfn_mapped is not set correctly until init_memory_mapping(), so
      don't print its initial value for 64-bit.

      We also need to use KERNEL_IMAGE_SIZE directly for the highmap
      cleanup.
      
      -v2: update comments about max_pfn_mapped according to Stefano Stabellini.
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Link: http://lkml.kernel.org/r/1359058816-7615-14-git-send-email-yinghai@kernel.org
      Acked-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      10054230
    • x86, 64bit: Use a #PF handler to materialize early mappings on demand · 8170e6be
      Authored by H. Peter Anvin
      Linear mode (CR0.PG = 0) is mutually exclusive with 64-bit mode; all
      64-bit code has to use page tables.  This makes it awkward before we
      have first set up properly all-covering page tables to access objects
      that are outside the static kernel range.
      
      So far we have dealt with that simply by mapping a fixed amount of
      low memory, but that fails in at least two upcoming use cases:
      
      1. We will support loading and running the kernel, struct
         boot_params, ramdisk, command line, etc. above the 4 GiB mark.
      2. We need to access the ramdisk early to get microcode updates
         applied as early as possible.

      We could use early_iomap to access them too, but it would make the
      code messy and hard to unify with 32-bit.
      
      Hence, set up a #PF handler and use a fixed number of buffers to set
      up page tables on demand.  If the buffers fill up then we simply
      flush them and start over.  These buffers are all in __initdata, so
      it does not increase RAM usage at runtime.
      
      Thus, with the help of the #PF handler, we can set the final kernel
      mapping from blank, and switch to init_level4_pgt later.
      
      During the switchover in head_64.S, before the #PF handler is
      available, we use three pages to handle the kernel crossing the 1G
      and 512G boundaries with a shared page, by playing games with page
      aliasing: the same page is mapped twice in the higher-level tables
      with appropriate wraparound.  The kernel region itself will be
      properly mapped; other mappings may be spurious.

      early_make_pgtable() uses the kernel high-mapping address to access
      the pages it uses to set up the page tables (a conceptual sketch
      follows this entry).
      
      -v4: Add phys_base offset to make kexec happy, and add
      	init_mapping_kernel()   - Yinghai
      -v5: Fix compiling with Xen, and add back ident level3 and level2 for
           Xen; also move init_level4_pgt back from BSS to DATA again,
           because we have to clear it anyway.  - Yinghai
      -v6: Switch to init_level4_pgt in init_mem_mapping. - Yinghai
      -v7: Remove the unneeded clear_page for init_level4_page;
           it is already filled with "fill 512,8,0" in head_64.S.  - Yinghai
      -v8: We need to keep that handler alive until init_mem_mapping and
           not let early_trap_init trash that early #PF handler.
           So split early_trap_pf_init out and move it down. - Yinghai
      -v9: The switchover only covers kernel space instead of 1G, so we
           can avoid touching possible memory holes. - Yinghai
      -v11: Change the far jmp back to a far return to initial_code; that
           is needed to fix a failure reported by Konrad on AMD systems.
           - Yinghai
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Link: http://lkml.kernel.org/r/1359058816-7615-12-git-send-email-yinghai@kernel.org
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      8170e6be
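
      A conceptual sketch of the mechanism (heavily simplified; the real
      code lives in head_64.S/head64.c and the identifiers below are only
      indicative): a small __initdata pool of page-table pages is consumed
      on demand by the early #PF handler and recycled when it runs out.

        /* Fixed pool of early page-table pages, all in __initdata. */
        static pud_t early_pgt_pool[NR_EARLY_PGT_PAGES][512] __initdata;
        static int next_early_pgt __initdata;

        int __init early_make_pgtable(unsigned long address)
        {
                /* Walk the early pgd for the faulting address; whenever an
                 * entry is missing, take the next page from the pool. */
                if (next_early_pgt >= NR_EARLY_PGT_PAGES) {
                        /* Pool exhausted: flush the on-demand entries and
                         * start over; the static kernel mapping survives. */
                        reset_early_page_tables();
                        next_early_pgt = 0;
                }
                /* ... install pud/pmd entries that point into the pool,
                 * ending with a 2M mapping that covers the fault ... */
                return 0;       /* retry the faulting access */
        }
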
    • x86, 64bit, mm: Add generic kernel/ident mapping helper · aece2785
      Authored by Yinghai Lu
      This is a simple version of kernel_physical_mapping_init(); it builds
      one page table that will be used later.

      Use mapping_info to control:
              1. the alloc_pgt_page method,
              2. whether the PMD is EXEC,
              3. whether the pgd is the kernel low mapping or the ident
                 mapping.

      It will be used to replace some local versions in kexec, hibernation,
      etc. (a sketch of the interface follows this entry).
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Link: http://lkml.kernel.org/r/1359058816-7615-8-git-send-email-yinghai@kernel.org
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      aece2785
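
      A sketch of the helper's assumed interface (field names as used by
      later callers such as kexec; treat the exact layout as illustrative):

        struct x86_mapping_info {
                void *(*alloc_pgt_page)(void *); /* get a page-table page */
                void *context;          /* passed to alloc_pgt_page */
                unsigned long pmd_flag; /* flags for PMD entries, e.g. EXEC */
                bool kernel_mapping;    /* kernel low mapping vs. ident */
        };

        int kernel_ident_mapping_init(struct x86_mapping_info *info,
                                      pgd_t *pgd_page,
                                      unsigned long addr, unsigned long end);
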
    • x86, 64bit, mm: Make pgd next calculation consistent with pud/pmd · c2bdee59
      Authored by Yinghai Lu
      Calculate 'next' for the pgd just like we do for the pud and pmd:
      round down and add the size.

      Also, do not do boundary checking with 'next'; just pass 'end' down
      to phys_pud_init() instead, because the loop in phys_pud_init() stops
      at PTRS_PER_PUD and thus can handle a possibly bigger 'end' properly
      (a sketch follows this entry).
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Link: http://lkml.kernel.org/r/1359058816-7615-6-git-send-email-yinghai@kernel.org
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      c2bdee59
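
      A sketch of the consistent 'next' calculation in the pgd loop of
      kernel_physical_mapping_init() (simplified):

        for (; start < end; start = next) {
                pgd_t *pgd = pgd_offset_k(start);
                pud_t *pud;     /* existing or freshly allocated pud page */

                /* Round down to the start of this pgd slot, then add one
                 * slot, exactly as the pud/pmd levels compute their next. */
                next = (start & PGDIR_MASK) + PGDIR_SIZE;

                /* No clamping of 'next' against 'end': pass 'end' straight
                 * down, since phys_pud_init() stops at PTRS_PER_PUD. */
                last_map_addr = phys_pud_init(pud, __pa(start), __pa(end),
                                              page_size_mask);
                /* ... write the pgd entry and continue ... */
        }
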
    • x86, mm: Fix page table early allocation offset checking · c9b3234a
      Authored by Yinghai Lu
      While debugging loading the kernel above 4G, we found that one page
      in the pre-allocated BRK area for early page allocation was not being
      used.  pgt_buf_top is the first address that cannot be used, so we
      should check whether the new end is above that top; otherwise the
      last page will not be used.

      Fix that check and also add a printout for allocations from the
      pre-allocated BRK area, to catch possible bugs later.

      But after we get that page back for page tables, it triggers a bug in
      page-table allocation with Xen: we need to avoid using a page as a
      page table to map a range that overlaps with that page-table page
      itself.

      Add a check for that overlap and, when it happens, use memblock
      allocation instead.  That fixes the crash on a Xen PV guest with 2G
      that Stefan found (a sketch of the check follows this entry).
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Link: http://lkml.kernel.org/r/1359058816-7615-2-git-send-email-yinghai@kernel.org
      Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Tested-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      c9b3234a
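
      A sketch of the assumed shape of the fixed check in the early
      allocator: use the pre-allocated BRK area unless it would overflow or
      the page-table page would sit inside the range being mapped, and fall
      back to memblock otherwise (pgt_buf_end/pgt_buf_top come from the
      existing early-allocation code; the memblock helper and the exact
      condition are illustrative):

        /* Note '>' rather than '>=': pgt_buf_top itself is the first
         * unusable address, so an end equal to it still fits and the
         * last BRK page is no longer wasted. */
        if ((pgt_buf_end + num) > pgt_buf_top || !can_use_brk_pgt) {
                /* Overlap with the range being mapped, or no room left:
                 * take the page-table pages from memblock instead. */
                pfn = alloc_low_pages_from_memblock(num); /* hypothetical */
        } else {
                pfn = pgt_buf_end;
                pgt_buf_end += num;
                printk(KERN_DEBUG "BRK [%#010lx, %#010lx] PGTABLE\n",
                       pfn << PAGE_SHIFT, (pgt_buf_end << PAGE_SHIFT) - 1);
        }
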
  8. 26 Jan 2013, 3 commits
  9. 25 Jan 2013, 1 commit
  10. 24 Jan 2013, 2 commits
  11. 16 Dec 2012, 2 commits
  12. 13 Dec 2012, 2 commits
  13. 12 Dec 2012, 1 commit
  14. 11 Dec 2012, 2 commits
    • x86: mm: drop TLB flush from ptep_set_access_flags · e4a1cc56
      Authored by Rik van Riel
      Intel has an architectural guarantee that the TLB entry causing
      a page fault gets invalidated automatically. This means
      we should be able to drop the local TLB invalidation.
      
      Because of the way other areas of the page fault code work,
      chances are good that all x86 CPUs do this.  However, if
      someone somewhere has an x86 CPU that does not invalidate
      the TLB entry causing a page fault, this one-liner should
      be easy to revert.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: Linus Torvalds <torvalds@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      e4a1cc56
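
      The assumed shape of the x86 ptep_set_access_flags() after the change
      described above: write the more permissive PTE and rely on the CPU
      invalidating the TLB entry that caused the fault, with no explicit
      flush (a sketch, not the exact hunk):

        int ptep_set_access_flags(struct vm_area_struct *vma,
                                  unsigned long address, pte_t *ptep,
                                  pte_t entry, int dirty)
        {
                int changed = !pte_same(*ptep, entry);

                if (changed && dirty) {
                        *ptep = entry;
                        pte_update_defer(vma->vm_mm, address, ptep);
                        /* The (local) TLB flush that used to follow here
                         * has been dropped; see the reasoning above. */
                }

                return changed;
        }
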
    • x86: mm: only do a local tlb flush in ptep_set_access_flags() · 0f9a921c
      Authored by Rik van Riel
      The function ptep_set_access_flags() is only ever invoked to set access
      flags or add write permission on a PTE.  The write bit is only ever set
      together with the dirty bit.
      
      Because we only ever upgrade a PTE, it is safe to skip flushing entries on
      remote TLBs. The worst that can happen is a spurious page fault on other
      CPUs, which would flush that TLB entry.
      
      Lazily letting another CPU incur a spurious page fault occasionally is
      (much!) cheaper than aggressively flushing everybody else's TLB.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      0f9a921c
  15. 06 Dec 2012, 1 commit
  16. 01 Dec 2012, 1 commit
    • context_tracking: New context tracking subsystem · 91d1aa43
      Authored by Frederic Weisbecker
      Create a new subsystem that probes on kernel boundaries to keep track
      of the transitions between context levels, with two basic initial
      contexts: user and kernel.

      This is an abstraction of some RCU code that uses such tracking to
      implement its userspace extended quiescent state.

      We need to pull this up from RCU into this new level of indirection
      because this tracking is also going to be used to implement "on
      demand" generic virtual cputime accounting, a necessary step toward
      shutting down the tick while still accounting cputime (a sketch of
      the per-CPU tracking follows this entry).
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      [ paulmck: fix whitespace error and email address. ]
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      91d1aa43
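
      A sketch of the state the subsystem keeps, assuming per-CPU tracking
      toggled by the boundary probes (simplified; identifiers indicative):

        /* Per-CPU record of which context this CPU is currently in. */
        struct context_tracking {
                bool active;                      /* tracking enabled here? */
                enum { IN_KERNEL = 0, IN_USER } state;
        };
        static DEFINE_PER_CPU(struct context_tracking, context_tracking);

        /* Probe on the kernel->user boundary: tell interested subsystems,
         * RCU first of all, that this CPU now runs user code. */
        void user_enter(void)
        {
                unsigned long flags;

                local_irq_save(flags);
                if (__this_cpu_read(context_tracking.active) &&
                    __this_cpu_read(context_tracking.state) != IN_USER) {
                        __this_cpu_write(context_tracking.state, IN_USER);
                        rcu_user_enter();       /* extended quiescent state */
                }
                local_irq_restore(flags);
        }
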
  17. 30 Nov 2012, 2 commits
  18. 23 Nov 2012, 1 commit
    • x86/mm: Don't flush the TLB on #WP pmd fixups · 5e4bf1a5
      Authored by Ingo Molnar
      If we have a write-protection #PF and fix up the pmd, then the
      hugetlb code [the only user of pmdp_set_access_flags], in its
      do_huge_pmd_wp_page() page-fault resolution function, calls
      pmdp_set_access_flags() to mark the pmd permissive again, and
      flushes the TLB.
      
      This TLB flush is unnecessary: a flush on #PF is guaranteed on
      most (all?) x86 CPUs, and even in the worst-case we'll generate
      a spurious fault.
      
      So remove it.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Turner <pjt@google.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/20121120120251.GA15742@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5e4bf1a5
  19. 18 Nov 2012, 8 commits