1. 23 Jan, 2015 1 commit
    • D
      x86, mpx: Explicitly disable 32-bit MPX support on 64-bit kernels · 814564a0
      Dave Hansen committed
      We had originally planned on submitting MPX support in one patch
      set.  We eventually broke it up in to two pieces for easier
      review.  One of the features that didn't make the first round
      was supporting 32-bit binaries on 64-bit kernels.
      
      Once we split the set up, we never added code to restrict 32-bit
      binaries from _using_ MPX on 64-bit kernels.
      
      The 32-bit bounds tables are a different format than the 64-bit
      ones.  Without this patch, the kernel will try to read a 32-bit
      binary's tables as if they were the 64-bit version.  They will
      likely be noticed as being invalid rather quickly and the app
      will get killed, but that's kinda mean.
      
      This patch adds an explicit check, and will make a 64-bit kernel
      essentially behave as if it has no MPX support when called from
      a 32-bit binary.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Hansen <dave@sr71.net>
      Link: http://lkml.kernel.org/r/20150108223020.9E9AA511@viggo.jf.intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  2. 20 Jan, 2015 1 commit
  3. 23 Dec, 2014 1 commit
    • J
      x86: Fix step size adjustment during initial memory mapping · 132978b9
      Jan Beulich committed
      The old scheme can lead to failure in certain cases - the
      problem is that after bumping step_size the next (non-final)
      iteration is only guaranteed to make available a memory block
      the size of what step_size was before. E.g. for a memory block
      [0,3004600000) we'd have:
      
       iter	start		end		step		amount
       1	3004400000	30045fffff	 2M		  2M
       2	3004000000	30043fffff	64M		  4M
       3	3000000000	3003ffffff	 2G		 64M
       4	2000000000	2fffffffff	64G		 64G
      
      Yet to map 64G with 4k pages (as happens e.g. under PV Xen) we
      need slightly over 128M, but the first three iterations made
      only about 70M available.
      
      The condition (new_mapped_ram_size > mapped_ram_size) for
      bumping step_size is just not suitable. Instead we want to bump
      it when we know we have enough memory available to cover a block
      of the new step_size. And rather than making that condition more
      complicated than needed, simply adjust step_size by the largest
      possible factor we know we can cover at that point - which is
      shifting it left by one less than the difference between page
      table level shifts. (Interestingly the original STEP_SIZE_SHIFT
      definition had a comment hinting at that having been the
      intention, just that it should have been PUD_SHIFT-PMD_SHIFT-1
      instead of (PUD_SHIFT-PMD_SHIFT)/2, and of course for non-PAE
      32-bit we can't really use these two constants as they're equal
      there.)
      
      Furthermore the comment in get_new_step_size() didn't get
      updated when the bottom-up mapping logic got added. Yet while
      an overflow (flushing step_size to zero) of the shift doesn't
      matter for the top-down method, it does for bottom-up because
      round_up(x, 0) = 0, and an upper range boundary of zero can't
      really work well.
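
The arithmetic in this commit message can be checked with a small sketch. It assumes x86-64 shift values with 4k pages and 8-byte page table entries; the `SKETCH_*` names, `pte_bytes_needed()`, `next_step_size()`, and the overflow guard are hypothetical illustrations, not the kernel's code.

```c
#include <stdint.h>

/* x86-64 shifts with 4k pages; assumed here for illustration. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PMD_SHIFT  21
#define SKETCH_PUD_SHIFT  30
#define SKETCH_PTE_BYTES  8   /* one 64-bit page table entry */

/* Same shape as a power-of-two round_up(); y == 0 yields 0. */
#define SKETCH_ROUND_UP(x, y) \
    ((((uint64_t)(x) - 1) | ((uint64_t)(y) - 1)) + 1)

/* Bytes of last-level page tables needed to map `bytes` with 4k pages. */
static uint64_t pte_bytes_needed(uint64_t bytes)
{
    return (bytes >> SKETCH_PAGE_SHIFT) * SKETCH_PTE_BYTES;
}

/* Grow step_size by one less than the difference between level shifts. */
static uint64_t next_step_size(uint64_t step_size)
{
    unsigned int shift = SKETCH_PUD_SHIFT - SKETCH_PMD_SHIFT - 1;
    uint64_t next = step_size << shift;

    /* Hypothetical guard: keep the old value if the shift overflowed. */
    return (next >> shift) == step_size ? next : step_size;
}
```

Under these assumptions, `pte_bytes_needed(64ULL << 30)` gives 128M, which is more than the 2M + 4M + 64M = 70M made available by the first three iterations in the table, and `SKETCH_ROUND_UP(x, 0)` collapses to 0 as the message describes.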
      Signed-off-by: Jan Beulich <jbeulich@suse.com>
      Acked-by: Yinghai Lu <yinghai@kernel.org>
      Link: http://lkml.kernel.org/r/54945C1E020000780005114E@mail.emea.novell.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. 18 Dec, 2014 2 commits
  5. 16 Dec, 2014 2 commits
    • L
      x86: mm: consolidate VM_FAULT_RETRY handling · 26178ec1
      Linus Torvalds committed
      The VM_FAULT_RETRY handling was confusing and incorrect for the case of
      returning to kernel mode.  We need to handle the exception table fixup
      if we return to kernel mode due to a fatal signal - it will basically
      look to the kernel user mode access like the access failed due to the VM
      going away from under it.  Which is correct - the process is dying - and
      avoids the whole "repeat endless kernel page faults" case.
      
      Handling the VM_FAULT_RETRY early and in just one place also simplifies
      the mmap_sem handling, since once we've taken care of VM_FAULT_RETRY we
      know that we can just drop the lock.  The remaining accounting and
      possible error handling is thread-local and does not need the mmap_sem.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • L
      x86: mm: move mmap_sem unlock from mm_fault_error() to caller · 7fb08eca
      Linus Torvalds committed
      This replaces four copies in various stages of mm_fault_error() handling
      with just a single one.  It will also allow for more natural placement
      of the unlocking after some further cleanup.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 14 Dec, 2014 1 commit
    • J
      mm/debug-pagealloc: make debug-pagealloc boottime configurable · 031bc574
      Joonsoo Kim committed
      The groundwork for avoiding debug-pagealloc at boot time is now in
      place.  So introduce a new kernel parameter to disable debug-pagealloc
      at boot time, and disable the related functions in that case.
      
      The only non-intuitive part is the change to the guard page functions.
      Because guard pages are effective only if debug-pagealloc is enabled,
      turning them off along with debug-pagealloc is the reasonable thing to do.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 11 Dec, 2014 1 commit
    • X
      x86/mm: Fix zone ranges boot printout · c072b90c
      Xishi Qiu committed
      This is the usual physical memory layout boot printout:
      	...
      	[    0.000000] Zone ranges:
      	[    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
      	[    0.000000]   DMA32    [mem 0x01000000-0xffffffff]
      	[    0.000000]   Normal   [mem 0x100000000-0xc3fffffff]
      	[    0.000000] Movable zone start for each node
      	[    0.000000] Early memory node ranges
      	[    0.000000]   node   0: [mem 0x00001000-0x00099fff]
      	[    0.000000]   node   0: [mem 0x00100000-0xbf78ffff]
      	[    0.000000]   node   0: [mem 0x100000000-0x63fffffff]
      	[    0.000000]   node   1: [mem 0x640000000-0xc3fffffff]
      	...
      
      This is the log when we set "mem=2G" on the boot cmdline:
      	...
      	[    0.000000] Zone ranges:
      	[    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
      	[    0.000000]   DMA32    [mem 0x01000000-0xffffffff]  // should be 0x7fffffff, right?
      	[    0.000000]   Normal   empty
      	[    0.000000] Movable zone start for each node
      	[    0.000000] Early memory node ranges
      	[    0.000000]   node   0: [mem 0x00001000-0x00099fff]
      	[    0.000000]   node   0: [mem 0x00100000-0x7fffffff]
      	...
      
      This patch fixes the printout; the following log shows the right
      ranges:
      	...
      	[    0.000000] Zone ranges:
      	[    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
      	[    0.000000]   DMA32    [mem 0x01000000-0x7fffffff]
      	[    0.000000]   Normal   empty
      	[    0.000000] Movable zone start for each node
      	[    0.000000] Early memory node ranges
      	[    0.000000]   node   0: [mem 0x00001000-0x00099fff]
      	[    0.000000]   node   0: [mem 0x00100000-0x7fffffff]
      	...
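
The corrected DMA32 line is consistent with clamping the zone's upper limit to the end of usable memory before printing. A minimal sketch of that idea follows; `zone_print_limit()` is a hypothetical helper and the clamp itself is an assumption about the fix, since the message only shows the resulting log.

```c
#include <stdint.h>

/*
 * Hypothetical helper: the exclusive upper limit to report for a zone
 * is the smaller of the zone's architectural maximum and the end of
 * memory that is actually usable (for example after "mem=2G").
 */
static uint64_t zone_print_limit(uint64_t zone_max, uint64_t mem_end)
{
    return zone_max < mem_end ? zone_max : mem_end;
}
```

With a 4G DMA32 ceiling and `mem=2G`, the reported inclusive end becomes `zone_print_limit(1ULL << 32, 1ULL << 31) - 1`, which matches the 0x7fffffff shown in the fixed log.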
      Suggested-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Linux MM <linux-mm@kvack.org>
      Cc: <dave@sr71.net>
      Cc: Rik van Riel <riel@redhat.com>
      Link: http://lkml.kernel.org/r/5487AB3D.6070306@huawei.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  8. 08 Dec, 2014 1 commit
  9. 04 Dec, 2014 1 commit
  10. 19 Nov, 2014 2 commits
    • D
      x86 mpx: Change return type of get_reg_offset() · 68c009c4
      Dave Hansen committed
      get_reg_offset() used to return the register contents themselves
      instead of the register offset.  When it did that, it was an
      unsigned long.  I changed it to return an integer _offset_
      instead of the register.  But, I neglected to change the return
      type of the function or the variables in which we store the
      result of the call.
      
      This fixes up the code to clear up the warnings from the smatch
      bot:
      
      New smatch warnings:
      arch/x86/mm/mpx.c:178 mpx_get_addr_ref() warn: unsigned 'addr_offset' is never less than zero.
      arch/x86/mm/mpx.c:184 mpx_get_addr_ref() warn: unsigned 'base_offset' is never less than zero.
      arch/x86/mm/mpx.c:188 mpx_get_addr_ref() warn: unsigned 'indx_offset' is never less than zero.
      arch/x86/mm/mpx.c:196 mpx_get_addr_ref() warn: unsigned 'addr_offset' is never less than zero.
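
The class of bug behind these warnings can be reproduced outside the kernel. The sketch below uses a hypothetical lookup, `reg_offset_or_error()`, with made-up offsets; it only shows why storing a negative error code in an unsigned variable defeats a later `< 0` check.

```c
#include <errno.h>
#include <limits.h>

/* Hypothetical lookup: returns a byte offset, or -EINVAL for a bad register. */
static int reg_offset_or_error(int regno)
{
    static const int offsets[] = { 0, 8, 16, 24 };

    if (regno < 0 || regno >= 4)
        return -EINVAL;
    return offsets[regno];
}

/*
 * Storing the result in an unsigned long hides the error: a check such
 * as `off < 0` can never be true, and -EINVAL turns into a huge value.
 */
static int unsigned_hides_error(int regno)
{
    unsigned long off = (unsigned long)reg_offset_or_error(regno);

    return off > (unsigned long)LONG_MAX;
}

/* Storing the result in an int keeps the sign, so the check works. */
static int int_sees_error(int regno)
{
    int off = reg_offset_or_error(regno);

    return off < 0;
}
```

For an out-of-range register, `int_sees_error()` reports the failure directly, while in the unsigned variant the same failure is only visible as an implausibly large offset.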
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: x86@kernel.org
      Link: http://lkml.kernel.org/r/20141118182343.C3E0C629@viggo.jf.intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • K
      x86, mm: Set NX across entire PMD at boot · 45e2a9d4
      Kees Cook committed
      When setting up permissions on kernel memory at boot, the end of the
      PMD that was split from bss remained executable. It should be NX like
      the rest. This performs a PMD alignment instead of a PAGE alignment to
      get the correct span of memory.
      
      Before:
      ---[ High Kernel Mapping ]---
      ...
      0xffffffff8202d000-0xffffffff82200000  1868K     RW       GLB NX pte
      0xffffffff82200000-0xffffffff82c00000    10M     RW   PSE GLB NX pmd
      0xffffffff82c00000-0xffffffff82df5000  2004K     RW       GLB NX pte
      0xffffffff82df5000-0xffffffff82e00000    44K     RW       GLB x  pte
      0xffffffff82e00000-0xffffffffc0000000   978M                     pmd
      
      After:
      ---[ High Kernel Mapping ]---
      ...
      0xffffffff8202d000-0xffffffff82200000  1868K     RW       GLB NX pte
      0xffffffff82200000-0xffffffff82e00000    12M     RW   PSE GLB NX pmd
      0xffffffff82e00000-0xffffffffc0000000   978M                     pmd
      
      [ tglx: Changed it to roundup(_brk_end, PMD_SIZE) and added a comment.
              We really should unmap the remainder along with the holes
              caused by init,initdata etc. but that's a different issue ]
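
The difference between the two alignments can be seen with the addresses from the "Before" dump. This is a sketch with assumed x86-64 sizes (4k pages, 2M PMDs); `align_up()` and the `SKETCH_*` constants are illustrative names, not kernel identifiers.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE (1ULL << 12)  /* 4k page, assumed */
#define SKETCH_PMD_SIZE  (1ULL << 21)  /* 2M PMD, assumed */

/* Round addr up to the next multiple of size (size is a power of two). */
static uint64_t align_up(uint64_t addr, uint64_t size)
{
    return (addr + size - 1) & ~(size - 1);
}
```

Taking 0xffffffff82df5000 as the end of the region, page alignment leaves it unchanged, whereas PMD alignment yields 0xffffffff82e00000. The 44K in between is exactly the range that was still executable in the "Before" output.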
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Wang Nan <wangnan0@huawei.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20141114194737.GA3091@www.outflux.net
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  11. 18 Nov, 2014 4 commits
    • D
      x86, mpx: Cleanup unused bound tables · 1de4fa14
      Dave Hansen committed
      The previous patch allocates bounds tables on-demand.  As noted in
      an earlier description, these can add up to *HUGE* amounts of
      memory.  This has caused OOMs in practice when running tests.
      
      This patch adds support for freeing bounds tables when they are no
      longer in use.
      
      There are two types of mappings in play when unmapping tables:
       1. The mapping with the actual data, which userspace is
          munmap()ing or brk()ing away, etc...
       2. The mapping for the bounds table *backing* the data
          (is tagged with VM_MPX, see the patch "add MPX specific
          mmap interface").
      
      If userspace uses the prctl() introduced earlier in this patchset
      to enable the management of bounds tables in kernel, when it
      unmaps the first type of mapping with the actual data, the kernel
      needs to free the mapping for the bounds table backing the data.
      This patch hooks in at the very end of do_unmap() to do so.
      We look at the addresses being unmapped and find the bounds
      directory entries and tables which cover those addresses.  If
      an entire table is unused, we clear the associated directory entry
      and free the table.
      
      Once we unmap the bounds table, we would have a bounds directory
      entry pointing at empty address space. That address space might
      now be allocated for some other (random) use, and the MPX
      hardware might now try to walk it as if it were a bounds table.
      That would be bad.  So any unmapping of an entire bounds table
      has to be accompanied by a corresponding write to the bounds
      directory entry to invalidate it.  That write to the bounds
      directory can fault, which causes the following problem:
      
      Since we are doing the freeing from munmap() (and other paths
      like it), we hold mmap_sem for write. If we fault, the page
      fault handler will attempt to acquire mmap_sem for read and
      we will deadlock.  To avoid the deadlock, we pagefault_disable()
      when touching the bounds directory entry and use a
      get_user_pages() to resolve the fault.
      
      The unmapping of bounds tables happens under vm_munmap().  We
      also (indirectly) call vm_munmap() to _do_ the unmapping of the
      bounds tables.  We avoid unbounded recursion by disallowing
      freeing of bounds tables *for* bounds tables.  This would not
      occur normally, so should not have any practical impact.  Being
      strict about it here helps ensure that we do not have an
      exploitable stack overflow.
      Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: linux-mm@kvack.org
      Cc: linux-mips@linux-mips.org
      Cc: Dave Hansen <dave@sr71.net>
      Link: http://lkml.kernel.org/r/20141114151831.E4531C4A@viggo.jf.intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • D
      x86, mpx: On-demand kernel allocation of bounds tables · fe3d197f
      Dave Hansen committed
      This is really the meat of the MPX patch set.  If there is one patch to
      review in the entire series, this is the one.  There is a new ABI here
      and this kernel code also interacts with userspace memory in a
      relatively unusual manner.  (small FAQ below).
      
      Long Description:
      
      This patch adds two prctl() commands to enable or disable the
      management of bounds tables in the kernel, including on-demand kernel
      allocation (See the patch "on-demand kernel allocation of bounds tables")
      and cleanup (See the patch "cleanup unused bound tables"). Applications
      do not strictly need the kernel to manage bounds tables and we expect
      some applications to use MPX without taking advantage of this kernel
      support. This means the kernel can not simply infer whether an application
      needs bounds table management from the MPX registers.  The prctl() is an
      explicit signal from userspace.
      
      PR_MPX_ENABLE_MANAGEMENT is meant to be a signal from userspace to
      require kernel's help in managing bounds tables.
      
      PR_MPX_DISABLE_MANAGEMENT is the opposite, meaning that userspace doesn't
      want kernel's help any more. With PR_MPX_DISABLE_MANAGEMENT, the kernel
      won't allocate and free bounds tables even if the CPU supports MPX.
      
      PR_MPX_ENABLE_MANAGEMENT will fetch the base address of the bounds
      directory out of a userspace register (bndcfgu) and then cache it into
      a new field (->bd_addr) in  the 'mm_struct'.  PR_MPX_DISABLE_MANAGEMENT
      will set "bd_addr" to an invalid address.  Using this scheme, we can
      use "bd_addr" to determine whether the management of bounds tables in
      kernel is enabled.
      
      Also, the only way to access that bndcfgu register is via an xsaves,
      which can be expensive.  Caching "bd_addr" like this also helps reduce
      the cost of those xsaves when doing table cleanup at munmap() time.
      Unfortunately, we can not apply this optimization to #BR fault time
      because we need an xsave to get the value of BNDSTATUS.
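
From userspace, the two commands would be issued through prctl(2). The sketch below is only an illustration: the fallback values 43 and 44 are taken from the kernel's uapi header as an assumption, the wrapper names are hypothetical, and on kernels or CPUs without MPX support the calls are expected to fail with -1.

```c
#include <sys/prctl.h>

/* Fallback values; assumed to match <linux/prctl.h>. */
#ifndef PR_MPX_ENABLE_MANAGEMENT
#define PR_MPX_ENABLE_MANAGEMENT  43
#endif
#ifndef PR_MPX_DISABLE_MANAGEMENT
#define PR_MPX_DISABLE_MANAGEMENT 44
#endif

/* Ask the kernel to start managing bounds tables for this process. */
static int mpx_enable_management(void)
{
    return prctl(PR_MPX_ENABLE_MANAGEMENT, 0, 0, 0, 0);
}

/* Tell the kernel that userspace no longer wants its help. */
static int mpx_disable_management(void)
{
    return prctl(PR_MPX_DISABLE_MANAGEMENT, 0, 0, 0, 0);
}
```

Both wrappers return 0 on success and -1 on failure, so a caller can fall back to managing bounds tables itself when the kernel declines.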
      
      ==== Why does the hardware even have these Bounds Tables? ====
      
      MPX only has 4 hardware registers for storing bounds information.
      If MPX-enabled code needs more than these 4 registers, it needs to
      spill them somewhere. It has two special instructions for this
      which allow the bounds to be moved between the bounds registers
      and some new "bounds tables".
      
      #BR exceptions are similar conceptually to a page fault and will be
      raised by the MPX hardware both on bounds violations and when the
      tables are not present. This patch handles those #BR exceptions for
      not-present tables by carving the space out of the normal process's
      address space (essentially calling the new mmap() interface introduced
      earlier in this patch set) and then pointing the bounds-directory
      over to it.
      
      The tables *need* to be accessed and controlled by userspace because
      the instructions for moving bounds in and out of them are extremely
      frequent. They potentially happen every time a register pointing to
      memory is dereferenced. Any direct kernel involvement (like a syscall)
      to access the tables would obviously destroy performance.
      
      ==== Why not do this in userspace? ====
      
      This patch is obviously doing this allocation in the kernel.
      However, MPX does not strictly *require* anything in the kernel.
      It can theoretically be done completely from userspace. Here are
      a few ways this *could* be done. I don't think any of them are
      practical in the real-world, but here they are.
      
      Q: Can virtual space simply be reserved for the bounds tables so
         that we never have to allocate them?
      A: As noted earlier, these tables are *HUGE*. An X-GB virtual
         area needs 4*X GB of virtual space, plus 2GB for the bounds
         directory. If we were to preallocate them for the 128TB of
         user virtual address space, we would need to reserve 512TB+2GB,
         which is larger than the entire virtual address space today.
         This means they can not be reserved ahead of time. Also, a
         single process's pre-populated bounds directory consumes 2GB
         of virtual *AND* physical memory. IOW, it's completely
         infeasible to prepopulate bounds directories.
      
      Q: Can we preallocate bounds table space at the same time memory
         is allocated which might contain pointers that might eventually
         need bounds tables?
      A: This would work if we could hook the site of each and every
         memory allocation syscall. This can be done for small,
         constrained applications. But, it isn't practical at a larger
         scale since a given app has no way of controlling how all the
         parts of the app might allocate memory (think libraries). The
         kernel is really the only place to intercept these calls.
      
      Q: Could a bounds fault be handed to userspace and the tables
         allocated there in a signal handler instead of in the kernel?
      A: (thanks to tglx) mmap() is not on the list of safe async
         handler functions and even if mmap() would work it still
         requires locking or nasty tricks to keep track of the
         allocation state there.
      
      Having ruled out all of the userspace-only approaches for managing
      bounds tables that we could think of, we create them on demand in
      the kernel.
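
The sizing argument in the first FAQ answer can be written out as a short calculation. `bounds_reservation_bytes()` is a hypothetical helper, and the 256TB total assumes a 48-bit x86-64 virtual address space, which the message does not state explicitly.

```c
#include <stdint.h>

#define SKETCH_GB (1ULL << 30)
#define SKETCH_TB (1ULL << 40)

/* 48-bit virtual address space; an assumption for illustration. */
#define SKETCH_VA_SPACE_BYTES (1ULL << 48)

/*
 * Virtual space needed to pre-reserve bounds tables for a user area:
 * four times the area for the tables, plus 2GB for the bounds directory.
 */
static uint64_t bounds_reservation_bytes(uint64_t user_bytes)
{
    return 4 * user_bytes + 2 * SKETCH_GB;
}
```

For the 128TB of user address space mentioned above this gives 512TB + 2GB, which exceeds the assumed 256TB of total virtual address space, so up-front reservation is not an option.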
      Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: linux-mm@kvack.org
      Cc: linux-mips@linux-mips.org
      Cc: Dave Hansen <dave@sr71.net>
      Link: http://lkml.kernel.org/r/20141114151829.AD4310DE@viggo.jf.intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • D
      x86, mpx: Decode MPX instruction to get bound violation information · fcc7ffd6
      Dave Hansen committed
      This patch sets bound violation fields of siginfo struct in #BR
      exception handler by decoding the user instruction and constructing
      the faulting pointer.
      
      We have to be very careful when decoding these instructions.  They
      are completely controlled by userspace and may be changed at any
      time up to and including the point where we try to copy them in to
      the kernel.  They may or may not be MPX instructions and could be
      completely invalid for all we know.
      
      Note: This code is based on Qiaowei Ren's specialized MPX
      decoder, but uses the generic decoder whenever possible.  It was
      tested for robustness by generating a completely random data
      stream and trying to decode that stream.  I also unmapped random
      pages inside the stream to test the "partial instruction" short
      read code.
      
      We kzalloc() the siginfo instead of stack allocating it because
      we need to memset() it anyway, and doing this makes it much more
      clear when it got initialized by the MPX instruction decoder.
      
      Changes from the old decoder:
       * Use the generic decoder instead of custom functions.  Saved
         ~70 lines of code overall.
       * Remove insn->addr_bytes code (never used??)
       * Make sure never to possibly overflow the regoff[] array, plus
         check the register range correctly in 32 and 64-bit modes.
       * Allow get_reg() to return an error and have mpx_get_addr_ref()
         handle when it sees errors.
       * Only call insn_get_*() near where we actually use the values
         instead of trying to call them all at once.
       * Handle short reads from copy_from_user() and check the actual
         number of read bytes against what we expect from
         insn_get_length().  If a read stops in the middle of an
         instruction, we error out.
       * Actually check the opcodes instead of ignoring them.
       * Dynamically kzalloc() siginfo_t so we don't leak any stack
         data.
       * Detect and handle decoder failures instead of ignoring them.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com>
      Cc: linux-mm@kvack.org
      Cc: linux-mips@linux-mips.org
      Cc: Dave Hansen <dave@sr71.net>
      Link: http://lkml.kernel.org/r/20141114151828.5BDD0915@viggo.jf.intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • Q
      x86, mpx: Add MPX-specific mmap interface · 57319d80
      Qiaowei Ren committed
      We have chosen to perform the allocation of bounds tables in
      kernel (See the patch "on-demand kernel allocation of bounds
      tables") and to mark these VMAs with VM_MPX.
      
      However, there is currently no suitable interface to actually do
      this.  Existing interfaces, like do_mmap_pgoff(), have no way to
      set a modified ->vm_ops or ->vm_flags and don't hold mmap_sem
      long enough to let a caller do it.
      
      This patch wraps mmap_region() and holds mmap_sem long enough to
      make the modifications to the VMA which we need.
      
      Also note the 32/64-bit #ifdef in the header.  We actually need
      to do this at runtime eventually.  But, for now, we don't support
      running 32-bit binaries on 64-bit kernels.  Support for this will
      come in later patches.
      Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: linux-mm@kvack.org
      Cc: linux-mips@linux-mips.org
      Cc: Dave Hansen <dave@sr71.net>
      Link: http://lkml.kernel.org/r/20141114151827.CE440F67@viggo.jf.intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  12. 17 Nov, 2014 1 commit
    • T
      x86: mm: Move PAT only functions to mm/pat.c · 0dbcae88
      Thomas Gleixner committed
      Commit e00c8cc9 "x86: Use new cache mode type in memtype related
      functions" broke the ARCH=um build.
      
       arch/x86/include/asm/cacheflush.h:67:36: error: return type is an incomplete type
       static inline enum page_cache_mode get_page_memtype(struct page *pg)
      
      The reason is simple. get_page_memtype() and set_page_memtype()
      require enum page_cache_mode now, which is defined in
      asm/pgtable_types.h. UM does not include that file for obvious reasons.
      
      The simple solution is to move those functions to arch/x86/mm/pat.c
      where the only callsites of this are located. They should have been
      there in the first place.
      
      Fixes: e00c8cc9 "x86: Use new cache mode type in memtype related functions"
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Richard Weinberger <richard@nod.at>
  13. 16 Nov, 2014 12 commits
  14. 12 Nov, 2014 1 commit
  15. 10 Nov, 2014 1 commit
    • T
      /dev/mem: Use more consistent data types · 4707a341
      Thierry Reding committed
      The xlate_dev_{kmem,mem}_ptr() functions take either a physical address
      or a kernel virtual address, so data types should be phys_addr_t and
      void *. They both return a kernel virtual address which is only ever
      used in calls to copy_{from,to}_user(), so make variables that store it
      void * rather than char * for consistency.
      
      Also only define a weak unxlate_dev_mem_ptr() function if architectures
      haven't overridden them in the asm/io.h header file.
      Signed-off-by: Thierry Reding <treding@nvidia.com>
  16. 05 Nov, 2014 1 commit
  17. 29 Oct, 2014 1 commit
  18. 28 Oct, 2014 1 commit
  19. 14 Oct, 2014 2 commits
    • X
      arch/x86/mm/numa.c: fix boot failure when all nodes are hotpluggable · bd5cfb89
      Xishi Qiu committed
      If all the nodes are marked hotpluggable, allocating node data will
      fail, because __next_mem_range_rev() skips the hotpluggable memory
      regions and numa_clear_kernel_node_hotplug() is only called after node
      data has been allocated.
      
      numa_init()
          ...
          ret = init_func();  // this will mark hotpluggable flag from SRAT
          ...
          memblock_set_bottom_up(false);
          ...
          ret = numa_register_memblks(&numa_meminfo);  // this will alloc node data(pglist_data)
          ...
          numa_clear_kernel_node_hotplug();  // in case all the nodes are hotpluggable
          ...
      
      numa_register_memblks()
          setup_node_data()
              memblock_find_in_range_node()
                  __memblock_find_range_top_down()
                      for_each_mem_range_rev()
                          __next_mem_range_rev()
      
      This patch moves numa_clear_kernel_node_hotplug() into
      numa_register_memblks() so that the kernel nodes' hotpluggable flag is
      cleared before node data is allocated.  The allocation then no longer
      fails even if all the nodes are hotpluggable.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      x86: use optimized ioresource lookup in ioremap function · 906e36c5
      Mike Travis committed
      Use the optimized ioresource lookup, "region_is_ram", for the ioremap
      function.  If the region is not found, it falls back to the
      "page_is_ram" function.  If it is found and it is RAM, then the usual
      warning message is issued, and the ioremap operation is aborted.
      Otherwise, the ioremap operation continues.
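
The decision flow described here can be sketched as a small pure function. The three-valued `enum region_lookup` and `ioremap_may_proceed()` are hypothetical stand-ins; the sketch assumes the fast lookup distinguishes "not found" from "found, not RAM", and it takes the result of the slower per-page check as a parameter.

```c
#include <stdbool.h>

/* Hypothetical result of the fast "region_is_ram"-style lookup. */
enum region_lookup {
    REGION_NOT_FOUND = -1,  /* no matching resource: use the slow path */
    REGION_NOT_RAM   = 0,   /* found, and it is not RAM */
    REGION_IS_RAM    = 1    /* found, and it is RAM */
};

/*
 * Returns true if the ioremap operation should continue.
 * slow_path_found_ram is the outcome of the per-page fallback check
 * and is only consulted when the fast lookup found nothing.
 */
static bool ioremap_may_proceed(enum region_lookup lookup,
                                bool slow_path_found_ram)
{
    if (lookup == REGION_NOT_FOUND)
        return !slow_path_found_ram;

    /* Found: abort (after the usual warning) only if it is RAM. */
    return lookup != REGION_IS_RAM;
}
```

The point of the ordering is that the per-page fallback is needed only when the fast lookup cannot answer at all.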
      Signed-off-by: Mike Travis <travis@sgi.com>
      Acked-by: Alex Thorlton <athorlton@sgi.com>
      Reviewed-by: Cliff Wickman <cpw@sgi.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 23 Sep, 2014 2 commits
    • D
      x86: remove the Xen-specific _PAGE_IOMAP PTE flag · f955371c
      David Vrabel committed
      The _PAGE_IOMAP PTE flag was only used by Xen PV guests to mark PTEs
      that were used to map I/O regions that are 1:1 in the p2m.  This
      allowed Xen to obtain the correct PFN when converting the MFNs read
      from a PTE back to their PFN.
      
      Xen guests no longer use _PAGE_IOMAP for this. Instead mfn_to_pfn()
      returns the correct PFN by using a combination of the m2p and p2m to
      determine if an MFN corresponds to a 1:1 mapping in the p2m.
      
      Remove _PAGE_IOMAP, replacing it with _PAGE_UNUSED2 to allow for
      future uses of the PTE flag.
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
      Acked-by: "H. Peter Anvin" <hpa@zytor.com>
    • D
      x86: skip check for spurious faults for non-present faults · 31668511
      David Vrabel committed
      If a fault on a kernel address is due to a non-present page, then it
      cannot be the result of a stale TLB entry from a protection change (RO
      to RW or NX to X).  Thus the pagetable walk in spurious_fault() can be
      skipped.
      
      See the initial if in spurious_fault() and the tests in
      spurious_fault_check() for the set of possible error codes checked
      for spurious faults.  These are:
      
               IRUWP
      Before   x00xx && ( 1xxxx || xxx1x )
      After  ( 10001 || 00011 ) && ( 1xxxx || xxx1x )
      
      Thus the new condition is a subset of the previous one, excluding only
      non-present faults (I == 1 and W == 1 are mutually exclusive).
      
      This avoids spurious_fault() oopsing in some cases if the pagetables
      it attempts to walk are not accessible.  Such an oops obscures the
      location of the original fault.
      
      This also fixes a crash with Xen PV guests when they access entries in
      the M2P corresponding to device MMIO regions.  The M2P is mapped
      (read-only) by Xen into the kernel address space of the guest and this
      mapping may contain holes for non-RAM regions.  Read faults will
      result in calls to spurious_fault(), but because the page tables for
      the M2P mappings are not accessible by the guest the pagetable walk
      would fault.
      
      This was not normally a problem as MMIO mappings would not normally
      result in a M2P lookup because of the use of the _PAGE_IOMAP bit in
      the PTE.  However, removing the _PAGE_IOMAP bit requires M2P lookups
      for MMIO mappings as well.
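
The subset claim in the IRUWP table of this message can be checked by brute force over all 32 error codes. The sketch below uses the bit layout from that table (P is bit 0, I is bit 4); the `PF_*` names and the helper functions are assumptions chosen for illustration rather than a copy of the kernel source.

```c
#include <stdbool.h>

/* Error-code bits in IRUWP order, least significant bit last. */
enum {
    PF_PROT  = 1 << 0,  /* P: protection fault on a present page */
    PF_WRITE = 1 << 1,  /* W: write access */
    PF_USER  = 1 << 2,  /* U: user-mode access */
    PF_RSVD  = 1 << 3,  /* R: reserved bit set */
    PF_INSTR = 1 << 4   /* I: instruction fetch */
};

/* "Before": x00xx && (1xxxx || xxx1x) */
static bool old_check(unsigned int ec)
{
    if (ec & (PF_RSVD | PF_USER))
        return false;
    return (ec & PF_INSTR) || (ec & PF_WRITE);
}

/* "After": (10001 || 00011) && (1xxxx || xxx1x) */
static bool new_check(unsigned int ec)
{
    if (ec != (PF_INSTR | PF_PROT) && ec != (PF_WRITE | PF_PROT))
        return false;
    return (ec & PF_INSTR) || (ec & PF_WRITE);
}

/* Codes accepted by the new check but rejected by the old one. */
static int newly_accepted_count(void)
{
    int n = 0;

    for (unsigned int ec = 0; ec < 32; ec++)
        if (new_check(ec) && !old_check(ec))
            n++;
    return n;
}

/*
 * Codes dropped by the new check that are present-page faults and do
 * not set I and W together (a combination that cannot occur).
 */
static int dropped_possible_present_count(void)
{
    int n = 0;

    for (unsigned int ec = 0; ec < 32; ec++) {
        bool both = (ec & PF_INSTR) && (ec & PF_WRITE);

        if (old_check(ec) && !new_check(ec) && (ec & PF_PROT) && !both)
            n++;
    }
    return n;
}
```

Both counters come out as zero, which is the claim in the message: the new condition accepts nothing new, and the only possible codes it drops are non-present faults.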
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
      Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
  21. 19 Sep, 2014 1 commit
    • A
      sched: Add helper for task stack page overrun checking · a70857e4
      Aaron Tomlin committed
      This facility is used in a few places so let's introduce
      a helper function to improve code readability.
      Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: aneesh.kumar@linux.vnet.ibm.com
      Cc: dzickus@redhat.com
      Cc: bmr@redhat.com
      Cc: jcastillo@redhat.com
      Cc: oleg@redhat.com
      Cc: riel@redhat.com
      Cc: prarit@redhat.com
      Cc: jgh@redhat.com
      Cc: minchan@kernel.org
      Cc: mpe@ellerman.id.au
      Cc: tglx@linutronix.de
      Cc: hannes@cmpxchg.org
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Seiji Aguchi <seiji.aguchi@hds.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lkml.kernel.org/r/1410527779-8133-3-git-send-email-atomlin@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>