1. 19 May 2020, 1 commit
  2. 08 April 2020, 1 commit
  3. 03 April 2020, 3 commits
    • mm: allow VM_FAULT_RETRY for multiple times · 4064b982
      Peter Xu committed
      The idea comes from a discussion between Linus and Andrea [1].
      
      Before this patch we only allowed a page fault to be retried once.  We
      achieved this by clearing the FAULT_FLAG_ALLOW_RETRY flag when calling
      handle_mm_fault() the second time.  This was mainly done to avoid
      unexpected starvation of the system by looping forever on the page
      fault for a single page.  However, that should hardly happen: every
      code path that returns VM_FAULT_RETRY first waits for a condition to
      happen (during which time we should possibly yield the CPU) before
      VM_FAULT_RETRY is actually returned.
      
      This patch removes the restriction by keeping the
      FAULT_FLAG_ALLOW_RETRY flag when we receive VM_FAULT_RETRY.  It means
      that the page fault handler can now retry the fault multiple times if
      necessary without the need to generate another page fault event.
      Meanwhile we still keep the FAULT_FLAG_TRIED flag, so the page fault
      handler can still identify whether a page fault is the first attempt
      or not.
      
      Then we'll have these combinations of fault flags (only considering
      ALLOW_RETRY flag and TRIED flag):
      
        - ALLOW_RETRY and !TRIED:  the page fault is allowed to retry, and
                                   this is the first try

        - ALLOW_RETRY and TRIED:   the page fault is allowed to retry, and
                                   this is not the first try

        - !ALLOW_RETRY and !TRIED: the page fault is not allowed to retry
                                   at all

        - !ALLOW_RETRY and TRIED:  this is forbidden and should never be
                                   used
      
      The existing code has multiple places that take special care of the
      first condition above by checking against (fault_flags &
      FAULT_FLAG_ALLOW_RETRY).  This patch introduces a simple helper to
      detect the first attempt of a page fault by checking against both
      (fault_flags & FAULT_FLAG_ALLOW_RETRY) and !(fault_flags &
      FAULT_FLAG_TRIED), because now even the second try will have
      ALLOW_RETRY set, then uses that helper in all the existing special
      paths.  One example is in __lock_page_or_retry(): we now drop the
      mmap_sem only on the first attempt of the page fault and keep it in
      follow-up retries, so the old locking behavior is retained.
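
      A minimal sketch of such a helper, matching the flag semantics
      described above (the helper name follows the patch; its exact home in
      the headers is an assumption):

        /*
         * Sketch: true only on the first attempt of a retryable fault.
         * Both tests are needed because, after this patch, retries keep
         * FAULT_FLAG_ALLOW_RETRY set and additionally carry
         * FAULT_FLAG_TRIED.
         */
        static inline bool fault_flag_allow_retry_first(unsigned int flags)
        {
                return (flags & FAULT_FLAG_ALLOW_RETRY) &&
                       !(flags & FAULT_FLAG_TRIED);
        }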
      
      This will be a nice enhancement for the current code [2] and at the
      same time supporting material for the future userfaultfd-writeprotect
      work, since in that work there will always be an explicit userfault
      writeprotect retry for protected pages, and if that cannot resolve the
      page fault (e.g., when userfaultfd-writeprotect is used in conjunction
      with swapped pages) then we'll possibly need a third retry of the page
      fault.  It might also benefit other potential users who have a similar
      requirement, like userfault write-protection.

      The GUP code is not touched yet and will be covered in a follow-up
      patch.
      
      Please read the thread below for more information.
      
      [1] https://lore.kernel.org/lkml/20171102193644.GB22686@redhat.com/
      [2] https://lore.kernel.org/lkml/20181230154648.GB9832@redhat.com/

      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Brian Geffon <bgeffon@google.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220160246.9790-1-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: introduce FAULT_FLAG_DEFAULT · dde16072
      Peter Xu committed
      Although there are tons of arch-specific page fault handlers, most of
      them still share the same initial value for the page fault flags.
      Namely, nearly all of the page fault handlers allow the fault to be
      retried, and they also allow the fault to respond to SIGKILL.

      Let's define a default value for the fault flags to replace those
      copied-over initial page fault flags.  With this, it'll be far easier
      to introduce a new fault flag that can be used by all the
      architectures, instead of touching all the archs.
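
      A sketch of what such a default could look like, combining the two
      behaviors named above (retry allowed, killable); the exact flag set is
      an assumption based on this description:

        /* Sketch: one shared starting point for arch fault handlers. */
        #define FAULT_FLAG_DEFAULT      (FAULT_FLAG_ALLOW_RETRY | \
                                         FAULT_FLAG_KILLABLE)

        /* An arch handler would then start with: */
        unsigned int flags = FAULT_FLAG_DEFAULT;
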
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Brian Geffon <bgeffon@google.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220160238.9694-1-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86/mm: use helper fault_signal_pending() · 39678191
      Peter Xu committed
      Let's move the fatal signal check even earlier so that we can directly use
      the new fault_signal_pending() in x86 mm code.
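
      A hedged sketch of the resulting pattern in the x86 handler;
      fault_signal_pending() is the new helper, while the surrounding names
      (no_context(), hw_error_code) are assumptions taken from the existing
      fault path:

        fault = handle_mm_fault(vma, address, flags);

        /*
         * Sketch: quit the fault early if a signal interrupted it.
         * Kernel-mode faults still need the no_context() fixup.
         */
        if (fault_signal_pending(fault, regs)) {
                if (!user_mode(regs))
                        no_context(regs, hw_error_code, address,
                                   SIGBUS, BUS_ADRERR);
                return;
        }
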
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Brian Geffon <bgeffon@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220155353.8676-5-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 22 March 2020, 1 commit
  5. 07 January 2020, 1 commit
    • x86/context-tracking: Remove exception_enter/exit() from do_page_fault() · ee6352b2
      Frederic Weisbecker committed
      do_page_fault(), like other exceptions, is already covered by
      user_enter() and user_exit() when the exception triggers in userspace.
      
      As explained in:
      
        8c84014f ("x86/entry: Remove exception_enter() from most trap handlers")
      
      exception_enter/exit() only remained to handle a possible page fault
      from kernel mode while context tracking is in CONTEXT_USER mode, i.e.
      on kernel entry before we manage to call user_exit().  The only known
      offender was do_fast_syscall_32() fetching the EBP register from where
      the vDSO stashed it.
      
      Meanwhile this got fixed in:
      
        9999c8c0 ("x86/entry: Call enter_from_user_mode() with IRQs off")
      
      that moved enter_from_user_mode() before the call to get_user().
      
      So we can safely remove it now.
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Link: https://lkml.kernel.org/r/20191227163612.10039-2-frederic@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  6. 10 December 2019, 1 commit
    • mm, x86/mm: Untangle address space layout definitions from basic pgtable type definitions · 186525bd
      Ingo Molnar committed
      - Untangle the somewhat incestuous way in which VMALLOC_START is used
        all across the kernel, yet is, on x86, defined deep inside one of the
        lowest-level page table headers.  It doesn't help that vmalloc.h only
        includes a single asm header:
      
           #include <asm/page.h>           /* pgprot_t */
      
        So there was no existing cross-arch way to decouple address layout
        definitions from page.h details. I used this:
      
         #ifndef VMALLOC_START
         # include <asm/vmalloc.h>
         #endif
      
        This way every architecture that wants to simplify page.h can do so.
      
      - Also on x86 we had a couple of LDT related inline functions that used
        the late-stage address space layout positions - but these could be
        uninlined without real trouble - the end result is cleaner this way as
        well.
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  7. 27 November 2019, 1 commit
    • x86/mm/32: Sync only to VMALLOC_END in vmalloc_sync_all() · 9a62d200
      Joerg Roedel committed
      The job of vmalloc_sync_all() is to help the lazy freeing of vmalloc()
      ranges: before such vmap ranges are reused we make sure that they are
      unmapped from every task's page tables.
      
      This is really easy on pagetable setups where the kernel page tables
      are shared between all tasks - this is the case on 32-bit kernels
      with SHARED_KERNEL_PMD = 1.
      
      But on !SHARED_KERNEL_PMD 32-bit kernels this involves iterating
      over the pgd_list and clearing all pmd entries in the pgds that
      are cleared in the init_mm.pgd, which is the reference pagetable
      that the vmalloc() code uses.
      
      In that context the current practice of vmalloc_sync_all() iterating
      until FIXADDR_TOP is buggy:
      
              for (address = VMALLOC_START & PMD_MASK;
                   address >= TASK_SIZE_MAX && address < FIXADDR_TOP;
                   address += PMD_SIZE) {
                      struct page *page;
      
      Because iterating up to FIXADDR_TOP will involve a lot of non-vmalloc
      address ranges:
      
      	VMALLOC -> PKMAP -> LDT -> CPU_ENTRY_AREA -> FIXADDR
      
      This is mostly harmless for the FIXADDR and CPU_ENTRY_AREA ranges
      that don't clear their pmds, but it's lethal for the LDT range,
      which relies on having different mappings in different processes,
      and 'synchronizing' them in the vmalloc sense corrupts those
      pagetable entries (clearing them).
      
      This got particularly prominent with PTI, which turns SHARED_KERNEL_PMD
      off and makes this the dominant mapping mode on 32-bit.
      
      To make LDT working again vmalloc_sync_all() must only iterate over
      the volatile parts of the kernel address range that are identical
      between all processes.
      
      So the correct check in vmalloc_sync_all() is "address < VMALLOC_END"
      to make sure the VMALLOC areas are synchronized and the LDT
      mapping is not falsely overwritten.
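
      A sketch of the corrected loop bound; only the termination condition
      changes relative to the buggy loop quoted above:

        for (address = VMALLOC_START & PMD_MASK;
             address >= TASK_SIZE_MAX && address < VMALLOC_END;
             address += PMD_SIZE) {
                struct page *page;
                /* ... sync this PMD into every pagetable on pgd_list ... */
        }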
      
      The CPU_ENTRY_AREA and the FIXMAP area are no longer synced either,
      but this is not really a problem since their PMDs get established
      during bootup and never change.
      
      This change fixes the ldt_gdt selftest in my setup.
      
      [ mingo: Fixed up the changelog to explain the logic and modified the
               copying to only happen up until VMALLOC_END. ]
      Reported-by: Borislav Petkov <bp@suse.de>
      Tested-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Cc: <stable@vger.kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: hpa@zytor.com
      Fixes: 7757d607 ("x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32")
      Link: https://lkml.kernel.org/r/20191126111119.GA110513@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  8. 22 July 2019, 2 commits
  9. 18 July 2019, 1 commit
  10. 17 July 2019, 1 commit
    • mm, kprobes: generalize and rename notify_page_fault() as kprobe_page_fault() · b98cca44
      Anshuman Khandual committed
      Architectures which support kprobes have very similar boilerplate around
      calling kprobe_fault_handler().  Use a helper function in kprobes.h to
      unify them, based on the x86 code.
      
      This changes the behaviour for other architectures when preemption is
      enabled.  Previously, they would have disabled preemption while calling
      the kprobe handler.  However, preemption would already have been
      disabled if the fault were due to a kprobe, so if we are preemptible we
      know the fault was not due to a kprobe handler and can simply return
      failure.
      
      This behaviour was introduced in commit a980c0ef ("x86/kprobes:
      Refactor kprobes_fault() like kprobe_exceptions_notify()")
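
      A sketch of the unified helper, modeled on the x86 notify_page_fault()
      code it generalizes (treat the exact guard ordering as an assumption):

        static nokprobe_inline bool kprobe_page_fault(struct pt_regs *regs,
                                                      unsigned int trap)
        {
                if (!kprobes_built_in() || user_mode(regs))
                        return false;
                /*
                 * To be potentially processing a kprobe fault and to be
                 * allowed to call kprobe_running(), we have to be
                 * non-preemptible.
                 */
                if (preemptible())
                        return false;
                if (!kprobe_running())
                        return false;
                return kprobe_fault_handler(regs, trap);
        }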
      
      [anshuman.khandual@arm.com: export kprobe_fault_handler()]
        Link: http://lkml.kernel.org/r/1561133358-8876-1-git-send-email-anshuman.khandual@arm.com
      Link: http://lkml.kernel.org/r/1560420444-25737-1-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 28 June 2019, 2 commits
  12. 03 June 2019, 1 commit
    • signal/x86: Move tsk inside of CONFIG_MEMORY_FAILURE in do_sigbus · 318759b4
      Eric W. Biederman committed
      Stephen Rothwell <sfr@canb.auug.org.au> reported:
      > After merging the userns tree, today's linux-next build (i386 defconfig)
      > produced this warning:
      >
      > arch/x86/mm/fault.c: In function 'do_sigbus':
      > arch/x86/mm/fault.c:1017:22: warning: unused variable 'tsk' [-Wunused-variable]
      >   struct task_struct *tsk = current;
      >                       ^~~
      >
      > Introduced by commit
      >
      >   351b6825 ("signal: Explicitly call force_sig_fault on current")
      >
      > The remaining uses of "tsk" are protected by CONFIG_MEMORY_FAILURE.
      
      So do the obvious thing and move tsk inside of CONFIG_MEMORY_FAILURE
      to prevent introducing new warnings into the build.
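
      A sketch of the resulting shape of do_sigbus(); the MEMORY_FAILURE
      details and the final force_sig_fault() call are assumptions from the
      surrounding code:

        static void
        do_sigbus(struct pt_regs *regs, unsigned long error_code,
                  unsigned long address, vm_fault_t fault)
        {
                /* ... */
        #ifdef CONFIG_MEMORY_FAILURE
                if (fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE)) {
                        /* Declared here: the only remaining user of tsk. */
                        struct task_struct *tsk = current;
                        /* ... report the poisoned address ... */
                }
        #endif
                force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *)address);
        }
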
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      318759b4
  13. 29 May 2019, 2 commits
    • signal: Remove the task parameter from force_sig_fault · 2e1661d2
      Eric W. Biederman committed
      As synchronous exceptions really only make sense against the current
      task (otherwise how are you synchronous?), remove the task parameter
      from force_sig_fault to make it explicit that that is what is going
      on.
      
      The two known exceptions that deliver a synchronous exception to a
      stopped ptraced task have already been changed to
      force_sig_fault_to_task.
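
      A before/after sketch of the signature change (return type and the
      extra arch-specific siginfo arguments mentioned below are elided or
      assumed):

        /* Before: callers had to pass a task, almost always current. */
        int force_sig_fault(int sig, int code, void __user *addr,
                            struct task_struct *t);

        /*
         * After: current is implicit; the rare stopped-ptracee cases use
         * the _to_task variant instead.
         */
        int force_sig_fault(int sig, int code, void __user *addr);
        int force_sig_fault_to_task(int sig, int code, void __user *addr,
                                    struct task_struct *t);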
      
      The callers have been changed with the following emacs regular expression
      (with obvious variations on the architectures that take more arguments)
      to avoid typos:
      
      force_sig_fault[(]\([^,]+\)[,]\([^,]+\)[,]\([^,]+\)[,]\W+current[)]
      ->
      force_sig_fault(\1,\2,\3)
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      2e1661d2
    • signal: Explicitly call force_sig_fault on current · 351b6825
      Eric W. Biederman committed
      Update the calls of force_sig_fault that pass in a variable that is
      set to current earlier to explicitly use current.
      
      This is to make the next change that removes the task parameter
      from force_sig_fault easier to verify.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      351b6825
  14. 27 May 2019, 1 commit
  15. 24 April 2019, 1 commit
  16. 22 April 2019, 1 commit
    • x86/fault: Make fault messages more succinct · ea2f8d60
      Borislav Petkov committed
      Since we are going to be staring at these in the coming years, let's
      make them more succinct.  In particular:
      
       - change "address = " to "address: "
      
       - "-privileged" reads funny. It should be simply "kernel" or "user"
      
       - "from kernel code" reads funny too. "kernel mode" or "user mode" is
         more natural.
      
      An actual example says more than 1000 words, of course:
      
        [    0.248370] BUG: kernel NULL pointer dereference, address: 00000000000005b8
        [    0.249120] #PF: supervisor write access in kernel mode
        [    0.249717] #PF: error_code(0x0002) - not-present page
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dave.hansen@linux.intel.com
      Cc: luto@kernel.org
      Cc: riel@surriel.com
      Cc: sean.j.christopherson@intel.com
      Cc: yu-cheng.yu@intel.com
      Link: http://lkml.kernel.org/r/20190421183524.GC6048@zn.tnic
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  17. 20 April 2019, 2 commits
    • x86/fault: Decode and print #PF oops in human readable form · 18ea35c5
      Sean Christopherson committed
      Linus pointed out that deciphering the raw #PF error code and printing
      a more human-readable message are two different things, and also that
      printing the negative cases is mostly just noise [1].  For example,
      the USER bit doesn't mean the fault originated in user code, and
      stating that an oops wasn't due to a protection keys violation isn't
      interesting, since an oops on a keys violation is a one-in-a-million
      scenario.
      
      Remove the per-bit decoding of the error code and instead print:
        - the raw error code
        - why the fault occurred
        - the effective privilege level of the access
        - the type of access
        - whether the fault originated in user code or kernel code
      
      This provides the user with the information needed to triage 99.9% of
      oopses without polluting the log with useless information or conflating
      the error_code with the CPL.
      
      Sample output:
      
          BUG: kernel NULL pointer dereference, address = 0000000000000008
          #PF: supervisor-privileged instruction fetch from kernel code
          #PF: error_code(0x0010) - not-present page
      
          BUG: unable to handle page fault for address = ffffbeef00000000
          #PF: supervisor-privileged instruction fetch from kernel code
          #PF: error_code(0x0010) - not-present page
      
          BUG: unable to handle page fault for address = ffffc90000230000
          #PF: supervisor-privileged write access from kernel code
          #PF: error_code(0x000b) - reserved bit violation
      
      [1] https://lkml.kernel.org/r/CAHk-=whk_fsnxVMvF1T2fFCaP2WrvSybABrLQCWLJyCvHw6NKA@mail.gmail.com

      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
      Link: http://lkml.kernel.org/r/20181221213657.27628-3-sean.j.christopherson@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/fault: Reword initial BUG message for unhandled page faults · f28b11a2
      Sean Christopherson committed
      Reword the NULL pointer dereference case to simply state that a NULL
      pointer was dereferenced, i.e. drop "unable to handle" as that implies
      that there are instances where the kernel actually does handle NULL
      pointer dereferences, which is not true barring funky exception fixup.
      
      For the non-NULL case, replace "kernel paging request" with "page fault"
      as the kernel can technically oops on faults that originated in user
      code.  Dropping "kernel" also allows future patches to provide detailed
      information on where the fault occurred, e.g. user vs. kernel, without
      conflicting with the initial BUG message.
      
      In both cases, replace "at address=" with wording more appropriate to
      the oops, as "at" may be interpreted as stating that the address is the
      RIP of the instruction that faulted.
      
      Last, and probably least, further qualify the NULL-pointer path by
      checking that the fault actually originated in kernel code.  It's
      technically possible for userspace to map address 0, and not printing
      a super specific message is the least of our worries if the kernel does
      manage to oops on an actual NULL pointer dereference from userspace.
      
      Before:
          BUG: unable to handle kernel NULL pointer dereference at ffffbeef00000000
          BUG: unable to handle kernel paging request at ffffbeef00000000
      
      After:
          BUG: kernel NULL pointer dereference, address = 0000000000000008
          BUG: unable to handle page fault for address = ffffbeef00000000
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
      Link: http://lkml.kernel.org/r/20181221213657.27628-2-sean.j.christopherson@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  18. 17 April 2019, 2 commits
    • x86/traps: Use cpu_entry_area instead of orig_ist · d876b673
      Thomas Gleixner committed
      The orig_ist[] array is a shadow copy of the IST array in the TSS. The
      reason why it exists is that older kernels used two TSS variants with
      different pointers into the debug stack. orig_ist[] contains the real
      starting points.
      
      There is no longer any point in doing so because the same information
      can be retrieved using the base address of the cpu entry area mapping
      and the offsets of the various exception stacks.
      
      No functional change. Preparation for removing orig_ist.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190414160144.784487230@linutronix.de
    • x86/exceptions: Make IST index zero based · 8f34c5b5
      Thomas Gleixner committed
      The defines for the exception stack (IST) array in the TSS are using the
      SDM convention IST1 - IST7. That causes all sorts of code to subtract 1 for
      array indices related to IST. That's confusing at best and does not provide
      any value.
      
      Make the indices zero based and fix up the usage sites.  The only code
      which needs to adjust the zero-based index is the interrupt descriptor
      setup, which now needs to add 1.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: "Chang S. Bae" <chang.seok.bae@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: linux-doc@vger.kernel.org
      Cc: Nicolai Stange <nstange@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190414160144.331772825@linutronix.de
  19. 08 March 2019, 1 commit
  20. 30 January 2019, 1 commit
    • x86/fault: Fix sign-extend unintended sign extension · 5ccd3528
      Colin Ian King committed
      show_ldttss() shifts desc.base2 by 24 bits, but base2 is an 8-bit
      bitfield in a u16.

      Due to the really great idea of integer promotion in C99, base2 is
      promoted to an int, because that's the standard-defined behaviour when
      all values representable by base2 fit into an int.

      Now, if bit 7 is set in desc.base2, the result of the shift left by 24
      makes the resulting integer negative, and the following conversion to
      unsigned long legitimately sign-extends first, causing the upper 32
      bits to be set in the result.

      Fix this by casting desc.base2 to unsigned long before the shift.
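
      A self-contained user-space demonstration of the promotion hazard
      (showing what typical compilers do; the bitfield mirrors desc.base2,
      it is not the kernel code itself):

        #include <stdio.h>

        struct desc {
                unsigned short base2 : 8;  /* 8 bits of a u16 */
        };

        int main(void)
        {
                struct desc d = { .base2 = 0x80 };  /* bit 7 set */

                /*
                 * base2 is promoted to (signed) int; the shift sets bit
                 * 31, and the conversion to unsigned long sign-extends.
                 */
                unsigned long bad  = d.base2 << 24;
                /* Casting first keeps the value unsigned: no extension. */
                unsigned long good = (unsigned long)d.base2 << 24;

                printf("bad:  %#lx\n", bad);   /* 0xffffffff80000000 */
                printf("good: %#lx\n", good);  /* 0x80000000 */
                return 0;
        }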
      
      Detected by CoverityScan, CID#1475635 ("Unintended sign extension")
      
      [ tglx: Reworded the changelog a bit as I actually had to look up
        	the standard (again) to decode the original one. ]
      
      Fixes: a1a371c4 ("x86/fault: Decode page fault OOPSes better")
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: kernel-janitors@vger.kernel.org
      Link: https://lkml.kernel.org/r/20181222191116.21831-1-colin.king@canonical.com
  21. 22 November 2018, 4 commits
    • x86/fault: Clean up the page fault oops decoder a bit · a2aa52ab
      Ingo Molnar committed
       - Make the oops messages a bit less scary (don't mention 'HW errors')
      
       - Turn 'PROT USER' (which is visually easily confused with PROT_USER)
         into individual bit descriptors: "[PROT] [USER]".
         This also makes "[normal kernel read fault]" more apparent.
      
       - De-abbreviate variables to make the code easier to read
      
       - Use vertical alignment where appropriate.
      
       - Add comment about string size limits and the helper function.
      
       - Remove unnecessary line breaks.
      
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/fault: Decode page fault OOPSes better · a1a371c4
      Andy Lutomirski committed
      One of Linus' favorite hobbies seems to be looking at OOPSes and
      decoding the error code in his head.  This is not one of my favorite
      hobbies :)
      
      Teach the page fault OOPS handler to decode the error code.  If it's
      a !USER fault from user mode, print an explicit note to that effect
      and print out the addresses of various tables that might cause such
      an error.
      
      With this patch applied, if I intentionally point the LDT at 0x0 and
      run the x86 selftests, I get:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
        HW error: normal kernel read fault
        This was a system access from user code
        IDT: 0xfffffe0000000000 (limit=0xfff) GDT: 0xfffffe0000001000 (limit=0x7f)
        LDTR: 0x50 -- base=0x0 limit=0xfff7
        TR: 0x40 -- base=0xfffffe0000003000 limit=0x206f
        PGD 800000000456e067 P4D 800000000456e067 PUD 4623067 PMD 0
        SMP PTI
        CPU: 0 PID: 153 Comm: ldt_gdt_64 Not tainted 4.19.0+ #1317
        Hardware name: ...
        RIP: 0033:0x401454
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
      Link: http://lkml.kernel.org/r/11212acb25980cd1b3030875cd9502414fbb214d.1542841400.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/fault: Don't try to recover from an implicit supervisor access · ebb53e25
      Andy Lutomirski committed
      This avoids a situation in which we attempt to apply various fixups
      that are not intended to handle implicit supervisor accesses from
      user mode if we screw up in a way that causes this type of fault.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
      Link: http://lkml.kernel.org/r/9999f151d72ff352265f3274c5ab3a4105090f49.1542841400.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/fault: Remove sw_error_code · 0ed32f1a
      Andy Lutomirski committed
      All of the fault handling code now correctly checks user_mode(regs)
      as needed, and nothing depends on the X86_PF_USER bit being munged.
      Get rid of sw_error_code and use hw_error_code everywhere.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
      Link: http://lkml.kernel.org/r/078f5b8ae6e8c79ff8ee7345b5c476c45003e5ac.1542841400.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  22. 20 November 2018, 7 commits
    • x86/fault: Don't set thread.cr2, etc before OOPSing · 1ad33f5a
      Andy Lutomirski committed
      The fault handling code sets the cr2, trap_nr, and error_code fields
      in thread_struct before OOPSing.  No one reads those fields during
      an OOPS, so remove the code to set them.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
      Link: http://lkml.kernel.org/r/d418022aa0fad9cb40467aa7acaf4e95be50ee96.1542667307.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/fault: Make error_code sanitization more robust · e49d3cbe
      Andy Lutomirski committed
      The error code in a page fault on a kernel address indicates
      whether that address is mapped, which should not be revealed in a signal.
      
      The normal code path for a page fault on a kernel address sanitizes the bit,
      but the paths for vsyscall emulation and SIGBUS do not.  Both are
      harmless, but for subtle reasons.  SIGBUS is never sent for a kernel
      address, and vsyscall emulation will never fault on a kernel address
      per se because it will fail an access_ok() check instead.
      
      Make the code more robust by adding a helper that sets the relevant
      fields and sanitizes the error code in one place.  This also cleans up
      the code -- we had three copies of roughly the same thing.
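
      A sketch of such a helper, following the description above (the name
      set_signal_archinfo() and the exact sanitization are assumptions drawn
      from the patch description):

        static void set_signal_archinfo(unsigned long address,
                                        unsigned long error_code)
        {
                struct task_struct *tsk = current;

                /*
                 * Avoid leaking whether a kernel address is mapped:
                 * report user-visible faults on kernel addresses as
                 * protection faults.
                 */
                if (address >= TASK_SIZE_MAX)
                        error_code |= X86_PF_PROT;

                tsk->thread.trap_nr = X86_TRAP_PF;
                tsk->thread.error_code = error_code | X86_PF_USER;
                tsk->thread.cr2 = address;
        }
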
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
      Link: http://lkml.kernel.org/r/b31159bd55bd0c4fa061a20dfd6c429c094bebaa.1542667307.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/fault: Improve the condition for signalling vs OOPSing · 6ea59b07
      Andy Lutomirski committed
      __bad_area_nosemaphore() currently checks the X86_PF_USER bit in the
      error code to decide whether to send a signal or to treat the fault
      as a kernel error.  This can cause somewhat erratic behavior.  The
      straightforward cases where the CPL agrees with the hardware USER
      bit are all correct, but the other cases are confusing.
      
       - A user instruction accessing a kernel address with supervisor
         privilege (e.g. a descriptor table access failed).  The USER bit
         will be clear, and we OOPS.  This is correct, because it indicates
         a kernel bug, not a user error.
      
       - A user instruction accessing a user address with supervisor
         privilege (e.g. a descriptor table was incorrectly pointing at
         user memory).  __bad_area_nosemaphore() will be passed a modified
         error code with the user bit set, and we will send a signal.
         Sending the signal will work (because the regs and the entry
         frame genuinely come from user mode), but we really ought to
         OOPS, as this event indicates a severe kernel bug.
      
       - A kernel instruction with user privilege (i.e. WRUSS).  This
         should OOPS or get fixed up.  The current code would instead try
         to send a signal and malfunction.
      
      Change the logic: a signal should be sent if the faulting context is
      user mode *and* the access has user privilege.  Otherwise it's
      either a kernel mode fault or a failed implicit access, either of
      which should end up in no_context().
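
      A sketch of the reworked check in __bad_area_nosemaphore() (the
      surrounding calls are assumed from the existing code):

        /*
         * Signal only if the faulting context was user mode AND the
         * access itself had user privilege; anything else is a kernel
         * bug or a failed implicit access and belongs in no_context().
         */
        if (user_mode(regs) && (error_code & X86_PF_USER)) {
                /* ... deliver SIGSEGV ... */
        } else {
                no_context(regs, error_code, address, SIGSEGV, si_code);
        }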
      
      Note to -stable maintainers: don't backport this unless you backport
      CET.  The bug it fixes is unobservable in current kernels unless
      something is extremely wrong.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
      Link: http://lkml.kernel.org/r/10e509c43893170e262e82027ea399130ae81159.1542667307.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/fault: Fix SMAP #PF handling buglet for implicit supervisor accesses · e50928d7
      Andy Lutomirski committed
      Currently, if a user program somehow triggers an implicit supervisor
      access to a user address (e.g. if the kernel somehow sets LDTR to a
      user address), it will be incorrectly detected as a SMAP violation
      if AC is clear and SMAP is enabled.  This is incorrect -- the error
      has nothing to do with SMAP.  Fix the condition so that only
      accesses with the hardware USER bit set are diagnosed as SMAP
      violations.
      
      With the logic fixed, an implicit supervisor access to a user address
      will hit the code lower in the function that is intended to handle it
      even if SMAP is enabled.  That logic is still a bit buggy, and later
      patches will clean it up.
      
      I *think* this code is still correct for WRUSS, and I've added a
      comment to that effect.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
      Link: http://lkml.kernel.org/r/d1d1b2e66ef31f884dba172084486ea9423ddcdb.1542667307.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/fault: Fold smap_violation() into do_user_addr_fault() · a15781b5
      Andy Lutomirski committed
      smap_violation() has a single caller, and the contents are a bit
      nonsensical.  I'm going to fix it, but first let's fold it into its
      caller for ease of comprehension.
      
      In this particular case, the user_mode(regs) check is incorrect --
      it will cause false positives in the case of a user-initiated
      kernel-privileged access.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
      Link: http://lkml.kernel.org/r/806c366f6ca861152398ce2c01744d59d9aceb6d.1542667307.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/cpufeatures, x86/fault: Mark SMAP as disabled when configured out · dae0a105
      Andy Lutomirski committed
      Add X86_FEATURE_SMAP to the disabled features mask as appropriate
      and use cpu_feature_enabled() in the fault code.  This lets us get
      rid of a redundant IS_ENABLED(CONFIG_X86_SMAP).
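
      A sketch of the two pieces (the disabled-features.h mask slot depends
      on where X86_FEATURE_SMAP lives, so treat the exact bit math as an
      assumption):

        /* arch/x86/include/asm/disabled-features.h (sketch) */
        #ifdef CONFIG_X86_SMAP
        # define DISABLE_SMAP   0
        #else
        # define DISABLE_SMAP   (1 << (X86_FEATURE_SMAP & 31))
        #endif

        /* Fault-path usage: compiles away entirely when SMAP=n. */
        if (cpu_feature_enabled(X86_FEATURE_SMAP)) {
                /* ... SMAP-specific checks ... */
        }
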
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
      Link: http://lkml.kernel.org/r/fe93332eded3d702f0b0b4cf83928d6830739ba3.1542667307.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/fault: Check user_mode(regs) when avoiding an mmap_sem deadlock · 6344be60
      Andy Lutomirski committed
      The fault-handling code that takes mmap_sem needs to avoid a
      deadlock that could occur if the kernel took a bad (OOPS-worthy)
      page fault on a user address while holding mmap_sem.  This can only
      happen if the faulting instruction was in the kernel
      (i.e. !user_mode(regs)).  Rather than checking the sw_error_code
      (which will have the USER bit set if the fault was a USER-permission
      access *or* if user_mode(regs)), just check user_mode(regs)
      directly.
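
      A sketch of the affected trylock path (the shape is assumed from the
      existing fault handler):

        if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
                /*
                 * A kernel-mode fault with no exception fixup must not
                 * take mmap_sem: OOPS now rather than deadlocking.
                 */
                if (!user_mode(regs) &&
                    !search_exception_tables(regs->ip)) {
                        bad_area_nosemaphore(regs, hw_error_code, address);
                        return;
                }
                down_read(&mm->mmap_sem);
        }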
      
      The old code would have malfunctioned if the kernel executed a bogus
      WRUSS instruction while holding mmap_sem.  Fortunately, that is
      extremely unlikely in current kernels, which don't use WRUSS.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
      Link: http://lkml.kernel.org/r/4b89b542e8ceba9bd6abde2f386afed6d99244a9.1542667307.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  23. 12 November 2018, 1 commit
    • x86/mm/fault: Allow stack access below %rsp · 1d8ca3be
      Waiman Long committed
      The current x86 page fault handler allows stack access below the stack
      pointer if it is no more than 64k+256 bytes below it.  Any access
      beyond that limit will cause a segmentation fault.

      The gcc -fstack-check option generates code that probes the stack for
      large stack allocations to see if the stack is accessible.  Newer gcc
      versions do that while updating %rsp simultaneously; older ones like
      gcc 4 don't.  As a result, an application compiled with an old gcc and
      the -fstack-check option may fail to start at all:
      
        $ cat test.c
        #include <stdio.h>

        int main() {
                char tmp[1024*128];
                printf("### ok\n");
                return 0;
        }

        $ gcc -fstack-check -g -o test test.c
      
        $ ./test
        Segmentation fault
      
      The old binary was working on older kernels where expand_stack() was
      somehow called before the check.  But it is not working on newer
      kernels.  Besides, the 64k+ limit check is rather crude and will not
      catch many of the ways a userspace application may misbehave anyway.
      I think the kernel isn't the right place for this kind of test; we
      should leave it to userspace instrumentation tools to perform it.
      
      The 64k+ limit check is now removed, to just let expand_stack() decide
      whether a segmentation fault should happen -- when the RLIMIT_STACK
      limit is exceeded, for example.
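
      For reference, a sketch of the heuristic being removed (64k plus 32
      words of slack below %rsp; the exact expression is a reconstruction
      from the description):

        /* Removed: reject accesses too far below the stack pointer. */
        if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
                bad_area(regs, error_code, address);
                return;
        }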
      Signed-off-by: Waiman Long <longman@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1541535149-31963-1-git-send-email-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  24. 31 October 2018, 1 commit