1. 08 12月, 2014 1 次提交
  2. 19 11月, 2014 2 次提交
    • D
      x86 mpx: Change return type of get_reg_offset() · 68c009c4
      Dave Hansen 提交于
      get_reg_offset() used to return the register contents themselves
      instead of the register offset.  When it did that, it was an
      unsigned long.  I changed it to return an integer _offset_
      instead of the register.  But, I neglected to change the return
      type of the function or the variables in which we store the
      result of the call.
      
      This fixes up the code to clear up the warnings from the smatch
      bot:
      
      New smatch warnings:
      arch/x86/mm/mpx.c:178 mpx_get_addr_ref() warn: unsigned 'addr_offset' is never less than zero.
      arch/x86/mm/mpx.c:184 mpx_get_addr_ref() warn: unsigned 'base_offset' is never less than zero.
      arch/x86/mm/mpx.c:188 mpx_get_addr_ref() warn: unsigned 'indx_offset' is never less than zero.
      arch/x86/mm/mpx.c:196 mpx_get_addr_ref() warn: unsigned 'addr_offset' is never less than zero.
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: x86@kernel.org
      Link: http://lkml.kernel.org/r/20141118182343.C3E0C629@viggo.jf.intel.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      68c009c4
    • K
      x86, mm: Set NX across entire PMD at boot · 45e2a9d4
      Kees Cook 提交于
      When setting up permissions on kernel memory at boot, the end of the
      PMD that was split from bss remained executable. It should be NX like
      the rest. This performs a PMD alignment instead of a PAGE alignment to
      get the correct span of memory.
      
      Before:
      ---[ High Kernel Mapping ]---
      ...
      0xffffffff8202d000-0xffffffff82200000  1868K     RW       GLB NX pte
      0xffffffff82200000-0xffffffff82c00000    10M     RW   PSE GLB NX pmd
      0xffffffff82c00000-0xffffffff82df5000  2004K     RW       GLB NX pte
      0xffffffff82df5000-0xffffffff82e00000    44K     RW       GLB x  pte
      0xffffffff82e00000-0xffffffffc0000000   978M                     pmd
      
      After:
      ---[ High Kernel Mapping ]---
      ...
      0xffffffff8202d000-0xffffffff82200000  1868K     RW       GLB NX pte
      0xffffffff82200000-0xffffffff82e00000    12M     RW   PSE GLB NX pmd
      0xffffffff82e00000-0xffffffffc0000000   978M                     pmd
      
      [ tglx: Changed it to roundup(_brk_end, PMD_SIZE) and added a comment.
              We really should unmap the reminder along with the holes
              caused by init,initdata etc. but thats a different issue ]
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Wang Nan <wangnan0@huawei.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20141114194737.GA3091@www.outflux.netSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      45e2a9d4
  3. 18 11月, 2014 4 次提交
    • D
      x86, mpx: Cleanup unused bound tables · 1de4fa14
      Dave Hansen 提交于
      The previous patch allocates bounds tables on-demand.  As noted in
      an earlier description, these can add up to *HUGE* amounts of
      memory.  This has caused OOMs in practice when running tests.
      
      This patch adds support for freeing bounds tables when they are no
      longer in use.
      
      There are two types of mappings in play when unmapping tables:
       1. The mapping with the actual data, which userspace is
          munmap()ing or brk()ing away, etc...
       2. The mapping for the bounds table *backing* the data
          (is tagged with VM_MPX, see the patch "add MPX specific
          mmap interface").
      
      If userspace use the prctl() indroduced earlier in this patchset
      to enable the management of bounds tables in kernel, when it
      unmaps the first type of mapping with the actual data, the kernel
      needs to free the mapping for the bounds table backing the data.
      This patch hooks in at the very end of do_unmap() to do so.
      We look at the addresses being unmapped and find the bounds
      directory entries and tables which cover those addresses.  If
      an entire table is unused, we clear associated directory entry
      and free the table.
      
      Once we unmap the bounds table, we would have a bounds directory
      entry pointing at empty address space. That address space might
      now be allocated for some other (random) use, and the MPX
      hardware might now try to walk it as if it were a bounds table.
      That would be bad.  So any unmapping of an enture bounds table
      has to be accompanied by a corresponding write to the bounds
      directory entry to invalidate it.  That write to the bounds
      directory can fault, which causes the following problem:
      
      Since we are doing the freeing from munmap() (and other paths
      like it), we hold mmap_sem for write. If we fault, the page
      fault handler will attempt to acquire mmap_sem for read and
      we will deadlock.  To avoid the deadlock, we pagefault_disable()
      when touching the bounds directory entry and use a
      get_user_pages() to resolve the fault.
      
      The unmapping of bounds tables happends under vm_munmap().  We
      also (indirectly) call vm_munmap() to _do_ the unmapping of the
      bounds tables.  We avoid unbounded recursion by disallowing
      freeing of bounds tables *for* bounds tables.  This would not
      occur normally, so should not have any practical impact.  Being
      strict about it here helps ensure that we do not have an
      exploitable stack overflow.
      Based-on-patch-by: NQiaowei Ren <qiaowei.ren@intel.com>
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: linux-mm@kvack.org
      Cc: linux-mips@linux-mips.org
      Cc: Dave Hansen <dave@sr71.net>
      Link: http://lkml.kernel.org/r/20141114151831.E4531C4A@viggo.jf.intel.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      1de4fa14
    • D
      x86, mpx: On-demand kernel allocation of bounds tables · fe3d197f
      Dave Hansen 提交于
      This is really the meat of the MPX patch set.  If there is one patch to
      review in the entire series, this is the one.  There is a new ABI here
      and this kernel code also interacts with userspace memory in a
      relatively unusual manner.  (small FAQ below).
      
      Long Description:
      
      This patch adds two prctl() commands to provide enable or disable the
      management of bounds tables in kernel, including on-demand kernel
      allocation (See the patch "on-demand kernel allocation of bounds tables")
      and cleanup (See the patch "cleanup unused bound tables"). Applications
      do not strictly need the kernel to manage bounds tables and we expect
      some applications to use MPX without taking advantage of this kernel
      support. This means the kernel can not simply infer whether an application
      needs bounds table management from the MPX registers.  The prctl() is an
      explicit signal from userspace.
      
      PR_MPX_ENABLE_MANAGEMENT is meant to be a signal from userspace to
      require kernel's help in managing bounds tables.
      
      PR_MPX_DISABLE_MANAGEMENT is the opposite, meaning that userspace don't
      want kernel's help any more. With PR_MPX_DISABLE_MANAGEMENT, the kernel
      won't allocate and free bounds tables even if the CPU supports MPX.
      
      PR_MPX_ENABLE_MANAGEMENT will fetch the base address of the bounds
      directory out of a userspace register (bndcfgu) and then cache it into
      a new field (->bd_addr) in  the 'mm_struct'.  PR_MPX_DISABLE_MANAGEMENT
      will set "bd_addr" to an invalid address.  Using this scheme, we can
      use "bd_addr" to determine whether the management of bounds tables in
      kernel is enabled.
      
      Also, the only way to access that bndcfgu register is via an xsaves,
      which can be expensive.  Caching "bd_addr" like this also helps reduce
      the cost of those xsaves when doing table cleanup at munmap() time.
      Unfortunately, we can not apply this optimization to #BR fault time
      because we need an xsave to get the value of BNDSTATUS.
      
      ==== Why does the hardware even have these Bounds Tables? ====
      
      MPX only has 4 hardware registers for storing bounds information.
      If MPX-enabled code needs more than these 4 registers, it needs to
      spill them somewhere. It has two special instructions for this
      which allow the bounds to be moved between the bounds registers
      and some new "bounds tables".
      
      They are similar conceptually to a page fault and will be raised by
      the MPX hardware during both bounds violations or when the tables
      are not present. This patch handles those #BR exceptions for
      not-present tables by carving the space out of the normal processes
      address space (essentially calling the new mmap() interface indroduced
      earlier in this patch set.) and then pointing the bounds-directory
      over to it.
      
      The tables *need* to be accessed and controlled by userspace because
      the instructions for moving bounds in and out of them are extremely
      frequent. They potentially happen every time a register pointing to
      memory is dereferenced. Any direct kernel involvement (like a syscall)
      to access the tables would obviously destroy performance.
      
      ==== Why not do this in userspace? ====
      
      This patch is obviously doing this allocation in the kernel.
      However, MPX does not strictly *require* anything in the kernel.
      It can theoretically be done completely from userspace. Here are
      a few ways this *could* be done. I don't think any of them are
      practical in the real-world, but here they are.
      
      Q: Can virtual space simply be reserved for the bounds tables so
         that we never have to allocate them?
      A: As noted earlier, these tables are *HUGE*. An X-GB virtual
         area needs 4*X GB of virtual space, plus 2GB for the bounds
         directory. If we were to preallocate them for the 128TB of
         user virtual address space, we would need to reserve 512TB+2GB,
         which is larger than the entire virtual address space today.
         This means they can not be reserved ahead of time. Also, a
         single process's pre-popualated bounds directory consumes 2GB
         of virtual *AND* physical memory. IOW, it's completely
         infeasible to prepopulate bounds directories.
      
      Q: Can we preallocate bounds table space at the same time memory
         is allocated which might contain pointers that might eventually
         need bounds tables?
      A: This would work if we could hook the site of each and every
         memory allocation syscall. This can be done for small,
         constrained applications. But, it isn't practical at a larger
         scale since a given app has no way of controlling how all the
         parts of the app might allocate memory (think libraries). The
         kernel is really the only place to intercept these calls.
      
      Q: Could a bounds fault be handed to userspace and the tables
         allocated there in a signal handler instead of in the kernel?
      A: (thanks to tglx) mmap() is not on the list of safe async
         handler functions and even if mmap() would work it still
         requires locking or nasty tricks to keep track of the
         allocation state there.
      
      Having ruled out all of the userspace-only approaches for managing
      bounds tables that we could think of, we create them on demand in
      the kernel.
      Based-on-patch-by: NQiaowei Ren <qiaowei.ren@intel.com>
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: linux-mm@kvack.org
      Cc: linux-mips@linux-mips.org
      Cc: Dave Hansen <dave@sr71.net>
      Link: http://lkml.kernel.org/r/20141114151829.AD4310DE@viggo.jf.intel.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      fe3d197f
    • D
      x86, mpx: Decode MPX instruction to get bound violation information · fcc7ffd6
      Dave Hansen 提交于
      This patch sets bound violation fields of siginfo struct in #BR
      exception handler by decoding the user instruction and constructing
      the faulting pointer.
      
      We have to be very careful when decoding these instructions.  They
      are completely controlled by userspace and may be changed at any
      time up to and including the point where we try to copy them in to
      the kernel.  They may or may not be MPX instructions and could be
      completely invalid for all we know.
      
      Note: This code is based on Qiaowei Ren's specialized MPX
      decoder, but uses the generic decoder whenever possible.  It was
      tested for robustness by generating a completely random data
      stream and trying to decode that stream.  I also unmapped random
      pages inside the stream to test the "partial instruction" short
      read code.
      
      We kzalloc() the siginfo instead of stack allocating it because
      we need to memset() it anyway, and doing this makes it much more
      clear when it got initialized by the MPX instruction decoder.
      
      Changes from the old decoder:
       * Use the generic decoder instead of custom functions.  Saved
         ~70 lines of code overall.
       * Remove insn->addr_bytes code (never used??)
       * Make sure never to possibly overflow the regoff[] array, plus
         check the register range correctly in 32 and 64-bit modes.
       * Allow get_reg() to return an error and have mpx_get_addr_ref()
         handle when it sees errors.
       * Only call insn_get_*() near where we actually use the values
         instead if trying to call them all at once.
       * Handle short reads from copy_from_user() and check the actual
         number of read bytes against what we expect from
         insn_get_length().  If a read stops in the middle of an
         instruction, we error out.
       * Actually check the opcodes intead of ignoring them.
       * Dynamically kzalloc() siginfo_t so we don't leak any stack
         data.
       * Detect and handle decoder failures instead of ignoring them.
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Based-on-patch-by: NQiaowei Ren <qiaowei.ren@intel.com>
      Cc: linux-mm@kvack.org
      Cc: linux-mips@linux-mips.org
      Cc: Dave Hansen <dave@sr71.net>
      Link: http://lkml.kernel.org/r/20141114151828.5BDD0915@viggo.jf.intel.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      fcc7ffd6
    • Q
      x86, mpx: Add MPX-specific mmap interface · 57319d80
      Qiaowei Ren 提交于
      We have chosen to perform the allocation of bounds tables in
      kernel (See the patch "on-demand kernel allocation of bounds
      tables") and to mark these VMAs with VM_MPX.
      
      However, there is currently no suitable interface to actually do
      this.  Existing interfaces, like do_mmap_pgoff(), have no way to
      set a modified ->vm_ops or ->vm_flags and don't hold mmap_sem
      long enough to let a caller do it.
      
      This patch wraps mmap_region() and hold mmap_sem long enough to
      make the modifications to the VMA which we need.
      
      Also note the 32/64-bit #ifdef in the header.  We actually need
      to do this at runtime eventually.  But, for now, we don't support
      running 32-bit binaries on 64-bit kernels.  Support for this will
      come in later patches.
      Signed-off-by: NQiaowei Ren <qiaowei.ren@intel.com>
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: linux-mm@kvack.org
      Cc: linux-mips@linux-mips.org
      Cc: Dave Hansen <dave@sr71.net>
      Link: http://lkml.kernel.org/r/20141114151827.CE440F67@viggo.jf.intel.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      57319d80
  4. 17 11月, 2014 1 次提交
    • T
      x86: mm: Move PAT only functions to mm/pat.c · 0dbcae88
      Thomas Gleixner 提交于
      Commit e00c8cc9 "x86: Use new cache mode type in memtype related
      functions" broke the ARCH=um build.
      
       arch/x86/include/asm/cacheflush.h:67:36: error: return type is an incomplete type
       static inline enum page_cache_mode get_page_memtype(struct page *pg)
      
      The reason is simple. get_page_memtype() and set_page_memtype()
      require enum page_cache_mode now, which is defined in
      asm/pgtable_types.h. UM does not include that file for obvious reasons.
      
      The simple solution is to move that functions to arch/x86/mm/pat.c
      where the only callsites of this are located. They should have been
      there in the first place.
      
      Fixes: e00c8cc9 "x86: Use new cache mode type in memtype related functions"
      Reported-by: NFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Richard Weinberger <richard@nod.at>
      0dbcae88
  5. 16 11月, 2014 12 次提交
  6. 12 11月, 2014 1 次提交
  7. 10 11月, 2014 1 次提交
    • T
      /dev/mem: Use more consistent data types · 4707a341
      Thierry Reding 提交于
      The xlate_dev_{kmem,mem}_ptr() functions take either a physical address
      or a kernel virtual address, so data types should be phys_addr_t and
      void *. They both return a kernel virtual address which is only ever
      used in calls to copy_{from,to}_user(), so make variables that store it
      void * rather than char * for consistency.
      
      Also only define a weak unxlate_dev_mem_ptr() function if architectures
      haven't overridden them in the asm/io.h header file.
      Signed-off-by: NThierry Reding <treding@nvidia.com>
      4707a341
  8. 05 11月, 2014 1 次提交
  9. 29 10月, 2014 1 次提交
  10. 14 10月, 2014 2 次提交
    • X
      arch/x86/mm/numa.c: fix boot failure when all nodes are hotpluggable · bd5cfb89
      Xishi Qiu 提交于
      If all the nodes are marked hotpluggable, alloc node data will fail.
      Because __next_mem_range_rev() will skip the hotpluggable memory
      regions.  numa_clear_kernel_node_hotplug() is called after alloc node
      data.
      
      numa_init()
          ...
          ret = init_func();  // this will mark hotpluggable flag from SRAT
          ...
          memblock_set_bottom_up(false);
          ...
          ret = numa_register_memblks(&numa_meminfo);  // this will alloc node data(pglist_data)
          ...
          numa_clear_kernel_node_hotplug();  // in case all the nodes are hotpluggable
          ...
      
      numa_register_memblks()
          setup_node_data()
              memblock_find_in_range_node()
                  __memblock_find_range_top_down()
                      for_each_mem_range_rev()
                          __next_mem_range_rev()
      
      This patch moves numa_clear_kernel_node_hotplug() into
      numa_register_memblks(), clear kernel node hotpluggable flag before
      alloc node data, then alloc node data won't fail even all the nodes
      are hotpluggable.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NXishi Qiu <qiuxishi@huawei.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bd5cfb89
    • M
      x86: use optimized ioresource lookup in ioremap function · 906e36c5
      Mike Travis 提交于
      Use the optimized ioresource lookup, "region_is_ram", for the ioremap
      function.  If the region is not found, it falls back to the
      "page_is_ram" function.  If it is found and it is RAM, then the usual
      warning message is issued, and the ioremap operation is aborted.
      Otherwise, the ioremap operation continues.
      Signed-off-by: NMike Travis <travis@sgi.com>
      Acked-by: NAlex Thorlton <athorlton@sgi.com>
      Reviewed-by: NCliff Wickman <cpw@sgi.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      906e36c5
  11. 23 9月, 2014 2 次提交
    • D
      x86: remove the Xen-specific _PAGE_IOMAP PTE flag · f955371c
      David Vrabel 提交于
      The _PAGE_IO_MAP PTE flag was only used by Xen PV guests to mark PTEs
      that were used to map I/O regions that are 1:1 in the p2m.  This
      allowed Xen to obtain the correct PFN when converting the MFNs read
      from a PTE back to their PFN.
      
      Xen guests no longer use _PAGE_IOMAP for this. Instead mfn_to_pfn()
      returns the correct PFN by using a combination of the m2p and p2m to
      determine if an MFN corresponds to a 1:1 mapping in the the p2m.
      
      Remove _PAGE_IOMAP, replacing it with _PAGE_UNUSED2 to allow for
      future uses of the PTE flag.
      Signed-off-by: NDavid Vrabel <david.vrabel@citrix.com>
      Acked-by: N"H. Peter Anvin" <hpa@zytor.com>
      f955371c
    • D
      x86: skip check for spurious faults for non-present faults · 31668511
      David Vrabel 提交于
      If a fault on a kernel address is due to a non-present page, then it
      cannot be the result of stale TLB entry from a protection change (RO
      to RW or NX to X).  Thus the pagetable walk in spurious_fault() can be
      skipped.
      
      See the initial if in spurious_fault() and the tests in
      spurious_fault_check()) for the set of possible error codes checked
      for spurious faults.  These are:
      
               IRUWP
      Before   x00xx && ( 1xxxx || xxx1x )
      After  ( 10001 || 00011 ) && ( 1xxxx || xxx1x )
      
      Thus the new condition is a subset of the previous one, excluding only
      non-present faults (I == 1 and W == 1 are mutually exclusive).
      
      This avoids spurious_fault() oopsing in some cases if the pagetables
      it attempts to walk are not accessible.  This obscures the location of
      the original fault.
      
      This also fixes a crash with Xen PV guests when they access entries in
      the M2P corresponding to device MMIO regions.  The M2P is mapped
      (read-only) by Xen into the kernel address space of the guest and this
      mapping may contains holes for non-RAM regions.  Read faults will
      result in calls to spurious_fault(), but because the page tables for
      the M2P mappings are not accessible by the guest the pagetable walk
      would fault.
      
      This was not normally a problem as MMIO mappings would not normally
      result in a M2P lookup because of the use of the _PAGE_IOMAP bit the
      PTE.  However, removing the _PAGE_IOMAP bit requires M2P lookups for
      MMIO mappings as well.
      Signed-off-by: NDavid Vrabel <david.vrabel@citrix.com>
      Reported-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Tested-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: NDave Hansen <dave.hansen@intel.com>
      31668511
  12. 19 9月, 2014 2 次提交
    • A
      sched: Add helper for task stack page overrun checking · a70857e4
      Aaron Tomlin 提交于
      This facility is used in a few places so let's introduce
      a helper function to improve code readability.
      Signed-off-by: NAaron Tomlin <atomlin@redhat.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: aneesh.kumar@linux.vnet.ibm.com
      Cc: dzickus@redhat.com
      Cc: bmr@redhat.com
      Cc: jcastillo@redhat.com
      Cc: oleg@redhat.com
      Cc: riel@redhat.com
      Cc: prarit@redhat.com
      Cc: jgh@redhat.com
      Cc: minchan@kernel.org
      Cc: mpe@ellerman.id.au
      Cc: tglx@linutronix.de
      Cc: hannes@cmpxchg.org
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Seiji Aguchi <seiji.aguchi@hds.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lkml.kernel.org/r/1410527779-8133-3-git-send-email-atomlin@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      a70857e4
    • A
      init/main.c: Give init_task a canary · d4311ff1
      Aaron Tomlin 提交于
      Tasks get their end of stack set to STACK_END_MAGIC with the
      aim to catch stack overruns. Currently this feature does not
      apply to init_task. This patch removes this restriction.
      
      Note that a similar patch was posted by Prarit Bhargava
      some time ago but was never merged:
      
        http://marc.info/?l=linux-kernel&m=127144305403241&w=2Signed-off-by: NAaron Tomlin <atomlin@redhat.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NMichael Ellerman <mpe@ellerman.id.au>
      Cc: aneesh.kumar@linux.vnet.ibm.com
      Cc: dzickus@redhat.com
      Cc: bmr@redhat.com
      Cc: jcastillo@redhat.com
      Cc: jgh@redhat.com
      Cc: minchan@kernel.org
      Cc: tglx@linutronix.de
      Cc: hannes@cmpxchg.org
      Cc: Alex Thorlton <athorlton@sgi.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Daeseok Youn <daeseok.youn@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Michael Opdenacker <michael.opdenacker@free-electrons.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Seiji Aguchi <seiji.aguchi@hds.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lkml.kernel.org/r/1410527779-8133-2-git-send-email-atomlin@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d4311ff1
  13. 16 9月, 2014 3 次提交
    • L
      x86/mm/numa: Drop dead code and rename setup_node_data() to setup_alloc_data() · 8b375f64
      Luiz Capitulino 提交于
      The setup_node_data() function allocates a pg_data_t object,
      inserts it into the node_data[] array and initializes the
      following fields: node_id, node_start_pfn and
      node_spanned_pages.
      
      However, a few function calls later during the kernel boot,
      free_area_init_node() re-initializes those fields, possibly with
      setup_node_data() is not used.
      
      This causes a small glitch when running Linux as a hyperv numa
      guest:
      
        SRAT: PXM 0 -> APIC 0x00 -> Node 0
        SRAT: PXM 0 -> APIC 0x01 -> Node 0
        SRAT: PXM 1 -> APIC 0x02 -> Node 1
        SRAT: PXM 1 -> APIC 0x03 -> Node 1
        SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
        SRAT: Node 1 PXM 1 [mem 0x80200000-0xf7ffffff]
        SRAT: Node 1 PXM 1 [mem 0x100000000-0x1081fffff]
        NUMA: Node 1 [mem 0x80200000-0xf7ffffff] + [mem 0x100000000-0x1081fffff] -> [mem 0x80200000-0x1081fffff]
        Initmem setup node 0 [mem 0x00000000-0x7fffffff]
          NODE_DATA [mem 0x7ffdc000-0x7ffeffff]
        Initmem setup node 1 [mem 0x80800000-0x1081fffff]
          NODE_DATA [mem 0x1081ea000-0x1081fdfff]
        crashkernel: memory value expected
         [ffffea0000000000-ffffea0001ffffff] PMD -> [ffff88007de00000-ffff88007fdfffff] on node 0
         [ffffea0002000000-ffffea00043fffff] PMD -> [ffff880105600000-ffff8801077fffff] on node 1
        Zone ranges:
          DMA      [mem 0x00001000-0x00ffffff]
          DMA32    [mem 0x01000000-0xffffffff]
          Normal   [mem 0x100000000-0x1081fffff]
        Movable zone start for each node
        Early memory node ranges
          node   0: [mem 0x00001000-0x0009efff]
          node   0: [mem 0x00100000-0x7ffeffff]
          node   1: [mem 0x80200000-0xf7ffffff]
          node   1: [mem 0x100000000-0x1081fffff]
        On node 0 totalpages: 524174
          DMA zone: 64 pages used for memmap
          DMA zone: 21 pages reserved
          DMA zone: 3998 pages, LIFO batch:0
          DMA32 zone: 8128 pages used for memmap
          DMA32 zone: 520176 pages, LIFO batch:31
        On node 1 totalpages: 524288
          DMA32 zone: 7672 pages used for memmap
          DMA32 zone: 491008 pages, LIFO batch:31
          Normal zone: 520 pages used for memmap
          Normal zone: 33280 pages, LIFO batch:7
      
      In this dmesg, the SRAT table reports that the memory range for
      node 1 starts at 0x80200000.  However, the line starting with
      "Initmem" reports that node 1 memory range starts at 0x80800000.
       The "Initmem" line is reported by setup_node_data() and is
      wrong, because the kernel ends up using the range as reported in
      the SRAT table.
      
      This commit drops all that dead code from setup_node_data(),
      renames it to alloc_node_data() and adds a printk() to
      free_area_init_node() so that we report a node's memory range
      accurately.
      
      Here's the same dmesg section with this patch applied:
      
         SRAT: PXM 0 -> APIC 0x00 -> Node 0
         SRAT: PXM 0 -> APIC 0x01 -> Node 0
         SRAT: PXM 1 -> APIC 0x02 -> Node 1
         SRAT: PXM 1 -> APIC 0x03 -> Node 1
         SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
         SRAT: Node 1 PXM 1 [mem 0x80200000-0xf7ffffff]
         SRAT: Node 1 PXM 1 [mem 0x100000000-0x1081fffff]
         NUMA: Node 1 [mem 0x80200000-0xf7ffffff] + [mem 0x100000000-0x1081fffff] -> [mem 0x80200000-0x1081fffff]
         NODE_DATA(0) allocated [mem 0x7ffdc000-0x7ffeffff]
         NODE_DATA(1) allocated [mem 0x1081ea000-0x1081fdfff]
         crashkernel: memory value expected
          [ffffea0000000000-ffffea0001ffffff] PMD -> [ffff88007de00000-ffff88007fdfffff] on node 0
          [ffffea0002000000-ffffea00043fffff] PMD -> [ffff880105600000-ffff8801077fffff] on node 1
         Zone ranges:
           DMA      [mem 0x00001000-0x00ffffff]
           DMA32    [mem 0x01000000-0xffffffff]
           Normal   [mem 0x100000000-0x1081fffff]
         Movable zone start for each node
         Early memory node ranges
           node   0: [mem 0x00001000-0x0009efff]
           node   0: [mem 0x00100000-0x7ffeffff]
           node   1: [mem 0x80200000-0xf7ffffff]
           node   1: [mem 0x100000000-0x1081fffff]
         Initmem setup node 0 [mem 0x00001000-0x7ffeffff]
         On node 0 totalpages: 524174
           DMA zone: 64 pages used for memmap
           DMA zone: 21 pages reserved
           DMA zone: 3998 pages, LIFO batch:0
           DMA32 zone: 8128 pages used for memmap
           DMA32 zone: 520176 pages, LIFO batch:31
         Initmem setup node 1 [mem 0x80200000-0x1081fffff]
         On node 1 totalpages: 524288
           DMA32 zone: 7672 pages used for memmap
           DMA32 zone: 491008 pages, LIFO batch:31
           Normal zone: 520 pages used for memmap
           Normal zone: 33280 pages, LIFO batch:7
      
      This commit was tested on a two node bare-metal NUMA machine and
      Linux as a numa guest on hyperv and qemu/kvm.
      
      PS: The wrong memory range reported by setup_node_data() seems to be
          harmless in the current kernel because it's just not used.  However,
          that bad range is used in kernel 2.6.32 to initialize the old boot
          memory allocator, which causes a crash during boot.
      Signed-off-by: NLuiz Capitulino <lcapitulino@redhat.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      8b375f64
    • Y
      x86/mm/hotplug: Modify PGD entry when removing memory · 9661d5bc
      Yasuaki Ishimatsu 提交于
      When hot-adding/removing memory, sync_global_pgds() is called
      for synchronizing PGD to PGD entries of all processes MM.  But
      when hot-removing memory, sync_global_pgds() does not work
      correctly.
      
      At first, sync_global_pgds() checks whether target PGD is none
      or not.  And if PGD is none, the PGD is skipped.  But when
      hot-removing memory, PGD may be none since PGD may be cleared by
      free_pud_table().  So when sync_global_pgds() is called after
      hot-removing memory, sync_global_pgds() should not skip PGD even
      if the PGD is none.  And sync_global_pgds() must clear PGD
      entries of all processes MM.
      
      Currently sync_global_pgds() does not clear PGD entries of all
      processes MM when hot-removing memory.  So when hot adding
      memory which is same memory range as removed memory after
      hot-removing memory, following call traces are shown:
      
       kernel BUG at arch/x86/mm/init_64.c:206!
       ...
       [<ffffffff815e0c80>] kernel_physical_mapping_init+0x1b2/0x1d2
       [<ffffffff815ced94>] init_memory_mapping+0x1d4/0x380
       [<ffffffff8104aebd>] arch_add_memory+0x3d/0xd0
       [<ffffffff815d03d9>] add_memory+0xb9/0x1b0
       [<ffffffff81352415>] acpi_memory_device_add+0x1af/0x28e
       [<ffffffff81325dc4>] acpi_bus_device_attach+0x8c/0xf0
       [<ffffffff813413b9>] acpi_ns_walk_namespace+0xc8/0x17f
       [<ffffffff81325d38>] ? acpi_bus_type_and_status+0xb7/0xb7
       [<ffffffff81325d38>] ? acpi_bus_type_and_status+0xb7/0xb7
       [<ffffffff813418ed>] acpi_walk_namespace+0x95/0xc5
       [<ffffffff81326b4c>] acpi_bus_scan+0x9a/0xc2
       [<ffffffff81326bff>] acpi_scan_bus_device_check+0x8b/0x12e
       [<ffffffff81326cb5>] acpi_scan_device_check+0x13/0x15
       [<ffffffff81320122>] acpi_os_execute_deferred+0x25/0x32
       [<ffffffff8107e02b>] process_one_work+0x17b/0x460
       [<ffffffff8107edfb>] worker_thread+0x11b/0x400
       [<ffffffff8107ece0>] ? rescuer_thread+0x400/0x400
       [<ffffffff81085aef>] kthread+0xcf/0xe0
       [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
       [<ffffffff815fc76c>] ret_from_fork+0x7c/0xb0
       [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
      
      This patch clears PGD entries of all processes MM when
      sync_global_pgds() is called after hot-removing memory
      Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Acked-by: NToshi Kani <toshi.kani@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      9661d5bc
    • Y
      x86/mm/hotplug: Pass sync_global_pgds() a correct argument in remove_pagetable() · 5255e0a7
      Yasuaki Ishimatsu 提交于
      When hot-adding memory after hot-removing memory, following call
      traces are shown:
      
        kernel BUG at arch/x86/mm/init_64.c:206!
        ...
       [<ffffffff815e0c80>] kernel_physical_mapping_init+0x1b2/0x1d2
       [<ffffffff815ced94>] init_memory_mapping+0x1d4/0x380
       [<ffffffff8104aebd>] arch_add_memory+0x3d/0xd0
       [<ffffffff815d03d9>] add_memory+0xb9/0x1b0
       [<ffffffff81352415>] acpi_memory_device_add+0x1af/0x28e
       [<ffffffff81325dc4>] acpi_bus_device_attach+0x8c/0xf0
       [<ffffffff813413b9>] acpi_ns_walk_namespace+0xc8/0x17f
       [<ffffffff81325d38>] ? acpi_bus_type_and_status+0xb7/0xb7
       [<ffffffff81325d38>] ? acpi_bus_type_and_status+0xb7/0xb7
       [<ffffffff813418ed>] acpi_walk_namespace+0x95/0xc5
       [<ffffffff81326b4c>] acpi_bus_scan+0x9a/0xc2
       [<ffffffff81326bff>] acpi_scan_bus_device_check+0x8b/0x12e
       [<ffffffff81326cb5>] acpi_scan_device_check+0x13/0x15
       [<ffffffff81320122>] acpi_os_execute_deferred+0x25/0x32
       [<ffffffff8107e02b>] process_one_work+0x17b/0x460
       [<ffffffff8107edfb>] worker_thread+0x11b/0x400
       [<ffffffff8107ece0>] ? rescuer_thread+0x400/0x400
       [<ffffffff81085aef>] kthread+0xcf/0xe0
       [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
       [<ffffffff815fc76c>] ret_from_fork+0x7c/0xb0
       [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
      
      The patch-set fixes the issue.
      
      This patch (of 2):
      
      remove_pagetable() gets start argument and passes the argument
      to sync_global_pgds().  In this case, the argument must not be
      modified.  If the argument is modified and passed to
      sync_global_pgds(), sync_global_pgds() does not correctly
      synchronize PGD to PGD entries of all processes MM since
      synchronized range of memory [start, end] is wrong.
      
      Unfortunately the start argument is modified in
      remove_pagetable().  So this patch fixes the issue.
      Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Acked-by: NToshi Kani <toshi.kani@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      5255e0a7
  14. 09 9月, 2014 2 次提交
  15. 01 9月, 2014 1 次提交
  16. 27 8月, 2014 1 次提交
    • C
      x86: Replace __get_cpu_var uses · 89cbc767
      Christoph Lameter 提交于
      __get_cpu_var() is used for multiple purposes in the kernel source. One of
      them is address calculation via the form &__get_cpu_var(x).  This calculates
      the address for the instance of the percpu variable of the current processor
      based on an offset.
      
      Other use cases are for storing and retrieving data from the current
      processors percpu area.  __get_cpu_var() can be used as an lvalue when
      writing data or on the right side of an assignment.
      
      __get_cpu_var() is defined as :
      
      #define __get_cpu_var(var) (*this_cpu_ptr(&(var)))
      
      __get_cpu_var() always only does an address determination. However, store
      and retrieve operations could use a segment prefix (or global register on
      other platforms) to avoid the address calculation.
      
      this_cpu_write() and this_cpu_read() can directly take an offset into a
      percpu area and use optimized assembly code to read and write per cpu
      variables.
      
      This patch converts __get_cpu_var into either an explicit address
      calculation using this_cpu_ptr() or into a use of this_cpu operations that
      use the offset.  Thereby address calculations are avoided and less registers
      are used when code is generated.
      
      Transformations done to __get_cpu_var()
      
      1. Determine the address of the percpu instance of the current processor.
      
      	DEFINE_PER_CPU(int, y);
      	int *x = &__get_cpu_var(y);
      
          Converts to
      
      	int *x = this_cpu_ptr(&y);
      
      2. Same as #1 but this time an array structure is involved.
      
      	DEFINE_PER_CPU(int, y[20]);
      	int *x = __get_cpu_var(y);
      
          Converts to
      
      	int *x = this_cpu_ptr(y);
      
      3. Retrieve the content of the current processors instance of a per cpu
      variable.
      
      	DEFINE_PER_CPU(int, y);
      	int x = __get_cpu_var(y)
      
         Converts to
      
      	int x = __this_cpu_read(y);
      
      4. Retrieve the content of a percpu struct
      
      	DEFINE_PER_CPU(struct mystruct, y);
      	struct mystruct x = __get_cpu_var(y);
      
         Converts to
      
      	memcpy(&x, this_cpu_ptr(&y), sizeof(x));
      
      5. Assignment to a per cpu variable
      
      	DEFINE_PER_CPU(int, y)
      	__get_cpu_var(y) = x;
      
         Converts to
      
      	__this_cpu_write(y, x);
      
      6. Increment/Decrement etc of a per cpu variable
      
      	DEFINE_PER_CPU(int, y);
      	__get_cpu_var(y)++
      
         Converts to
      
      	__this_cpu_inc(y)
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86@kernel.org
      Acked-by: NH. Peter Anvin <hpa@linux.intel.com>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      89cbc767
  17. 10 8月, 2014 1 次提交
  18. 08 8月, 2014 1 次提交
  19. 07 8月, 2014 1 次提交