1. 18 Mar, 2017 4 commits
  2. 16 Mar, 2017 1 commit
    • T
      x86/mm: Adapt MODULES_END based on fixmap section size · f06bdd40
      Thomas Garnier committed
      This patch aligns MODULES_END to the beginning of the fixmap section.
      It optimizes the space available for both sections. The address is
      pre-computed based on the number of pages required by the fixmap
      section.
      
      It will allow GDT remapping in the fixmap section. The current
      MODULES_END static address does not provide enough space for the kernel
      to support a large number of processors.
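
As a rough illustration of the pre-computation described above, the sketch below derives the end of the module area from the number of pages the fixmap needs. The helper names, the 4 KiB page size and the sample addresses are assumptions made for this example; they are not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Start of a fixmap that occupies nr_pages pages ending at fixmap_end (exclusive). */
static uint64_t sketch_fixmap_start(uint64_t fixmap_end, uint64_t nr_pages)
{
	assert(nr_pages <= (fixmap_end >> SKETCH_PAGE_SHIFT));
	return fixmap_end - (nr_pages << SKETCH_PAGE_SHIFT);
}

/* The module area ends exactly where the fixmap begins, so no space is wasted. */
static uint64_t sketch_modules_end(uint64_t fixmap_end, uint64_t nr_pages)
{
	return sketch_fixmap_start(fixmap_end, nr_pages);
}
```

With this arrangement, a fixmap that grows by one page moves the computed end of the module area down by one page, instead of colliding with a fixed address.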
      Signed-off-by: Thomas Garnier <thgarnie@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Luis R . Rodriguez <mcgrof@kernel.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Rafael J . Wysocki <rjw@rjwysocki.net>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: kasan-dev@googlegroups.com
      Cc: kernel-hardening@lists.openwall.com
      Cc: kvm@vger.kernel.org
      Cc: lguest@lists.ozlabs.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-efi@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: linux-pm@vger.kernel.org
      Cc: xen-devel@lists.xenproject.org
      Cc: zijun_hu <zijun_hu@htc.com>
      Link: http://lkml.kernel.org/r/20170314170508.100882-1-thgarnie@google.com
      [ Small build fix. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  3. 13 Mar, 2017 1 commit
  4. 10 Mar, 2017 11 commits
    • D
      kasan: fix races in quarantine_remove_cache() · ce5bec54
      Dmitry Vyukov committed
      quarantine_remove_cache() frees all pending objects that belong to the
      cache before we destroy the cache itself.  However, there are currently
      two ways in which it can fail to do so.
      
      First, another thread can hold some of the objects from the cache in
      the temp list in quarantine_put().  quarantine_put() has a window of
      enabled interrupts, and on_each_cpu() in quarantine_remove_cache() can
      finish right in that window.  These objects will later be freed into the
      destroyed cache.
      
      Then, quarantine_reduce() has the same problem.  It grabs a batch of
      objects from the global quarantine, then unlocks quarantine_lock and
      then frees the batch.  quarantine_remove_cache() can finish while some
      objects from the cache are still in the local to_free list in
      quarantine_reduce().
      
      Fix the race with quarantine_put() by disabling interrupts for the whole
      duration of quarantine_put().  In combination with on_each_cpu() in
      quarantine_remove_cache() it ensures that quarantine_remove_cache()
      either sees the objects in the per-cpu list or in the global list.
      
      Fix the race with quarantine_reduce() by protecting quarantine_reduce()
      with srcu critical section and then doing synchronize_srcu() at the end
      of quarantine_remove_cache().
      
      I've done some assessment of how well synchronize_srcu() works in this
      case.  On a 4 CPU VM I see that it blocks waiting for pending read
      critical sections in about 2-3% of cases, which looks good to me.
      
      I suspect that these races are the root cause of some GPFs that I
      episodically hit.  Previously I did not have any explanation for them.
      
        BUG: unable to handle kernel NULL pointer dereference at 00000000000000c8
        IP: qlist_free_all+0x2e/0xc0 mm/kasan/quarantine.c:155
        PGD 6aeea067
        PUD 60ed7067
        PMD 0
        Oops: 0000 [#1] SMP KASAN
        Dumping ftrace buffer:
           (ftrace buffer empty)
        Modules linked in:
        CPU: 0 PID: 13667 Comm: syz-executor2 Not tainted 4.10.0+ #60
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        task: ffff88005f948040 task.stack: ffff880069818000
        RIP: 0010:qlist_free_all+0x2e/0xc0 mm/kasan/quarantine.c:155
        RSP: 0018:ffff88006981f298 EFLAGS: 00010246
        RAX: ffffea0000ffff00 RBX: 0000000000000000 RCX: ffffea0000ffff1f
        RDX: 0000000000000000 RSI: ffff88003fffc3e0 RDI: 0000000000000000
        RBP: ffff88006981f2c0 R08: ffff88002fed7bd8 R09: 00000001001f000d
        R10: 00000000001f000d R11: ffff88006981f000 R12: ffff88003fffc3e0
        R13: ffff88006981f2d0 R14: ffffffff81877fae R15: 0000000080000000
        FS:  00007fb911a2d700(0000) GS:ffff88003ec00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000000000c8 CR3: 0000000060ed6000 CR4: 00000000000006f0
        Call Trace:
         quarantine_reduce+0x10e/0x120 mm/kasan/quarantine.c:239
         kasan_kmalloc+0xca/0xe0 mm/kasan/kasan.c:590
         kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:544
         slab_post_alloc_hook mm/slab.h:456 [inline]
         slab_alloc_node mm/slub.c:2718 [inline]
         kmem_cache_alloc_node+0x1d3/0x280 mm/slub.c:2754
         __alloc_skb+0x10f/0x770 net/core/skbuff.c:219
         alloc_skb include/linux/skbuff.h:932 [inline]
         _sctp_make_chunk+0x3b/0x260 net/sctp/sm_make_chunk.c:1388
         sctp_make_data net/sctp/sm_make_chunk.c:1420 [inline]
         sctp_make_datafrag_empty+0x208/0x360 net/sctp/sm_make_chunk.c:746
         sctp_datamsg_from_user+0x7e8/0x11d0 net/sctp/chunk.c:266
         sctp_sendmsg+0x2611/0x3970 net/sctp/socket.c:1962
         inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:761
         sock_sendmsg_nosec net/socket.c:633 [inline]
         sock_sendmsg+0xca/0x110 net/socket.c:643
         SYSC_sendto+0x660/0x810 net/socket.c:1685
         SyS_sendto+0x40/0x50 net/socket.c:1653
      
      I am not sure about backporting.  The bug is quite hard to trigger; I've
      seen it a few times during our massive continuous testing (however, it
      could be the cause of some other episodic stray crashes, as it leads to
      memory corruption...).  If it is triggered, the consequences are very
      bad -- almost certain memory corruption.  The fix is non-trivial
      and has a chance of introducing new bugs.  I am also not sure how
      actively people use KASAN on older releases.
      
      [dvyukov@google.com: - sorted includes]
        Link: http://lkml.kernel.org/r/20170309094028.51088-1-dvyukov@google.com
      Link: http://lkml.kernel.org/r/20170308151532.5070-1-dvyukov@google.com
      Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      kasan: resched in quarantine_remove_cache() · 68fd814a
      Dmitry Vyukov committed
      We see reported stalls/lockups in quarantine_remove_cache() on machines
      with large amounts of RAM.  quarantine_remove_cache() needs to scan
      the whole quarantine in order to take out all objects belonging to the
      cache.  The quarantine is currently 1/32 of RAM, e.g. on a machine with
      256GB of memory that will be 8GB.  Moreover, quarantine scanning is a
      walk over an uncached linked list, which is slow.
      
      Add cond_resched() after scanning of each non-empty batch of objects.
      Batches are specifically kept of reasonable size for quarantine_put().
      On a machine with 256GB of RAM we should have ~512 non-empty batches,
      each with 16MB of objects.
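
The sizing arithmetic quoted above can be checked with a small sketch. The 1/32 fraction and the 16MB batch size come from the text; the function names are made up for this example and do not correspond to kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_GIB (UINT64_C(1) << 30)
#define SKETCH_MIB (UINT64_C(1) << 20)

/* Quarantine size as described in the text: 1/32 of total RAM. */
static uint64_t sketch_quarantine_bytes(uint64_t ram_bytes)
{
	return ram_bytes / 32;
}

/* Number of batches needed to hold the quarantine, rounding up. */
static uint64_t sketch_batch_count(uint64_t quarantine_bytes, uint64_t batch_bytes)
{
	assert(batch_bytes != 0);
	return (quarantine_bytes + batch_bytes - 1) / batch_bytes;
}
```

For 256GB of RAM this gives an 8GB quarantine and 512 batches of 16MB each, so one cond_resched() per non-empty batch yields a few hundred scheduling points during a full scan.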
      
      Link: http://lkml.kernel.org/r/20170308154239.25440-1-dvyukov@google.com
      Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • T
      mm: do not call mem_cgroup_free() from within mem_cgroup_alloc() · 40e952f9
      Tahsin Erdogan committed
      mem_cgroup_free() indirectly calls wb_domain_exit() which is not
      prepared to deal with a struct wb_domain object that hasn't executed
      wb_domain_init().  For instance, the following warning message is
      printed by lockdep if alloc_percpu() fails in mem_cgroup_alloc():
      
        INFO: trying to register non-static key.
        the code is fine but needs lockdep annotation.
        turning off the locking correctness validator.
        CPU: 1 PID: 1950 Comm: mkdir Not tainted 4.10.0+ #151
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        Call Trace:
         dump_stack+0x67/0x99
         register_lock_class+0x36d/0x540
         __lock_acquire+0x7f/0x1a30
         lock_acquire+0xcc/0x200
         del_timer_sync+0x3c/0xc0
         wb_domain_exit+0x14/0x20
         mem_cgroup_free+0x14/0x40
         mem_cgroup_css_alloc+0x3f9/0x620
         cgroup_apply_control_enable+0x190/0x390
         cgroup_mkdir+0x290/0x3d0
         kernfs_iop_mkdir+0x58/0x80
         vfs_mkdir+0x10e/0x1a0
         SyS_mkdirat+0xa8/0xd0
         SyS_mkdir+0x14/0x20
         entry_SYSCALL_64_fastpath+0x18/0xad
      
      Add __mem_cgroup_free() which skips wb_domain_exit().  This is used by
      both mem_cgroup_free() and mem_cgroup_alloc() clean up.
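
The shape of this fix can be sketched in user-space C. This is only an analogy under stated assumptions: `struct sketch_group`, `sketch_domain_init()`, `sketch_domain_exit()` and the `fail_stats` switch are hypothetical stand-ins and are not the memcg code.

```c
#include <assert.h>
#include <stdlib.h>

struct sketch_domain { int initialized; };	/* stand-in for a sub-object needing init/exit */
struct sketch_group { int *stats; struct sketch_domain dom; };

static int sketch_domain_exit_calls;	/* counts teardown calls, for demonstration only */

static void sketch_domain_init(struct sketch_domain *d) { d->initialized = 1; }

static void sketch_domain_exit(struct sketch_domain *d)
{
	sketch_domain_exit_calls++;
	d->initialized = 0;
}

/* Releases memory only; safe on an object whose domain was never initialised. */
static void sketch_group_free_raw(struct sketch_group *g)
{
	assert(g != NULL);
	free(g->stats);
	free(g);
}

/* Full teardown for a completely constructed object. */
static void sketch_group_free(struct sketch_group *g)
{
	sketch_domain_exit(&g->dom);
	sketch_group_free_raw(g);
}

/* fail_stats simulates a failing allocation in the middle of construction. */
static struct sketch_group *sketch_group_alloc(int fail_stats)
{
	struct sketch_group *g = calloc(1, sizeof(*g));

	if (!g)
		return NULL;
	g->stats = fail_stats ? NULL : calloc(4, sizeof(int));
	if (!g->stats)
		goto fail;
	sketch_domain_init(&g->dom);
	return g;
fail:
	sketch_group_free_raw(g);	/* the domain was never set up, so skip its exit hook */
	return NULL;
}
```

The design point is that the error path of the allocator only undoes what was actually done, while the public free function adds the teardown step that assumes full construction.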
      
      Fixes: 0b8f73e1 ("mm: memcontrol: clean up alloc, online, offline, free functions")
      Link: http://lkml.kernel.org/r/20170306192122.24262-1-tahsin@google.com
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      thp: fix another corner case of munlock() vs. THPs · 6ebb4a1b
      Kirill A. Shutemov committed
      The following test case triggers BUG() in munlock_vma_pages_range():
      
      	#include <fcntl.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <sys/mman.h>
      
      	int main(int argc, char *argv[])
      	{
      		int fd;
      
      		system("mount -t tmpfs -o huge=always none /mnt");
      		/* O_CREAT requires an explicit mode argument */
      		fd = open("/mnt/test", O_CREAT | O_RDWR, 0600);
      		ftruncate(fd, 4UL << 20);
      		mmap(NULL, 4UL << 20, PROT_READ | PROT_WRITE,
      				MAP_SHARED | MAP_FIXED | MAP_LOCKED, fd, 0);
      		mmap(NULL, 4096, PROT_READ | PROT_WRITE,
      				MAP_SHARED | MAP_LOCKED, fd, 0);
      		munlockall();
      		return 0;
      	}
      
      The second mmap() creates a PTE mapping of the first huge page in the
      file.  It makes the kernel munlock the page, as we never keep a
      PTE-mapped page mlocked.
      
      On munlockall(), when we handle the vma created by the first mmap(),
      munlock_vma_page() returns page_mask == 0, as the page is not mlocked
      anymore.  On the next iteration follow_page_mask() returns a tail page, but
      page_mask is HPAGE_NR_PAGES - 1.  It makes us skip to the first tail
      page of the next huge page and step on
      VM_BUG_ON_PAGE(PageMlocked(page)).
      
      The fix is to not use the page_mask from follow_page_mask() at all.  It has
      no use for us.
      
      Link: http://lkml.kernel.org/r/20170302150252.34120-1-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>    [4.5+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      rmap: fix NULL-pointer dereference on THP munlocking · 8346242a
      Kirill A. Shutemov committed
      The following test case triggers a NULL-pointer dereference in
      try_to_unmap_one():
      
      	#include <fcntl.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <sys/mman.h>
      
      	int main(int argc, char *argv[])
      	{
      		int fd;
      
      		system("mount -t tmpfs -o huge=always none /mnt");
      		fd = open("/mnt/test", O_CREAT | O_RDWR, 0600);
      		ftruncate(fd, 2UL << 20);
      		mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
      				MAP_SHARED | MAP_FIXED | MAP_LOCKED, fd, 0);
      		mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
      				MAP_SHARED | MAP_LOCKED, fd, 0);
      		munlockall();
      		return 0;
      	}
      
      Apparently, there's a case when we call try_to_unmap() on huge PMDs:
      it's TTU_MUNLOCK.
      
      Let's handle this case correctly.
      
      Fixes: c7ab0d2f ("mm: convert try_to_unmap_one() to use page_vma_mapped_walk()")
      Link: http://lkml.kernel.org/r/20170302151159.30592-1-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      mm/memblock.c: fix memblock_next_valid_pfn() · c9a1b80d
      AKASHI Takahiro committed
      Obviously, we should not access memblock.memory.regions[right] if
      'right' is outside of [0, memblock.memory.cnt).
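
The bounds issue can be illustrated with a user-space binary search over sorted, non-overlapping ranges. This is a sketch of the general technique, not the kernel function: `struct sketch_pfn_range` and `sketch_next_valid_pfn()` are hypothetical names, and the ranges are expressed directly in page frame numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_pfn_range { uint64_t start; uint64_t end; };	/* half-open: [start, end) */

static uint64_t sketch_min_u64(uint64_t a, uint64_t b) { return a < b ? a : b; }

/*
 * Return the next valid pfn after 'pfn', capped at max_pfn.
 * 'regions' must be sorted by start and must not overlap.
 */
static uint64_t sketch_next_valid_pfn(const struct sketch_pfn_range *regions,
				      size_t cnt, uint64_t pfn, uint64_t max_pfn)
{
	uint64_t cand = pfn + 1;
	size_t left = 0, right = cnt;

	assert(regions != NULL || cnt == 0);
	while (left < right) {
		size_t mid = left + (right - left) / 2;

		if (cand < regions[mid].start)
			right = mid;
		else if (cand >= regions[mid].end)
			left = mid + 1;
		else
			return sketch_min_u64(cand, max_pfn);	/* cand lies inside a region */
	}
	/* 'right' may equal cnt here; indexing regions[right] would then be out of bounds. */
	if (right == cnt)
		return max_pfn;
	return sketch_min_u64(regions[right].start, max_pfn);
}
```

When the search ends past the last region there is no later valid range, so the sketch returns max_pfn rather than reading one element beyond the array.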
      
      Fixes: b92df1de ("mm: page_alloc: skip over regions of invalid pfns where possible")
      Link: http://lkml.kernel.org/r/20170303023745.9104-1-takahiro.akashi@linaro.org
      Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      userfaultfd: non-cooperative: userfaultfd_remove revalidate vma in MADV_DONTNEED · 70ccb92f
      Andrea Arcangeli committed
      userfaultfd_remove() has to be executed before zapping the pagetables, or
      UFFDIO_COPY could keep filling pages after zap_page_range() returned,
      which would result in non-zero data after a MADV_DONTNEED.
      
      However userfaultfd_remove() may have to release the mmap_sem.  This was
      handled correctly in MADV_REMOVE, but MADV_DONTNEED accessed a
      potentially stale vma (the very vma passed to zap_page_range(vma, ...)).
      
      The fix consists in revalidating the vma in case userfaultfd_remove()
      had to release the mmap_sem.
      
      This also optimizes away an unnecessary down_read/up_read in the
      MADV_REMOVE case if UFFD_EVENT_FORK had to be delivered.
      
      It all remains zero runtime cost in case CONFIG_USERFAULTFD=n as
      userfaultfd_remove() will be defined as "true" at build time.
      
      Link: http://lkml.kernel.org/r/20170302173738.18994-3-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • L
      mm/cgroup: avoid panic when init with low memory · bfc7228b
      Laurent Dufour committed
      The system may panic during initialisation when almost all the
      memory is assigned to huge pages using the kernel command line
      parameter hugepages=xxxx.  The panic may look like this:
      
        Unable to handle kernel paging request for data at address 0x00000000
        Faulting instruction address: 0xc000000000302b88
        Oops: Kernel access of bad area, sig: 11 [#1]
        SMP NR_CPUS=2048 [    0.082424] NUMA
        pSeries
        Modules linked in:
        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.9.0-15-generic #16-Ubuntu
        task: c00000021ed01600 task.stack: c00000010d108000
        NIP: c000000000302b88 LR: c000000000270e04 CTR: c00000000016cfd0
        REGS: c00000010d10b2c0 TRAP: 0300   Not tainted (4.9.0-15-generic)
        MSR: 8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>[ 0.082770]   CR: 28424422  XER: 00000000
        CFAR: c0000000003d28b8 DAR: 0000000000000000 DSISR: 40000000 SOFTE: 1
        GPR00: c000000000270e04 c00000010d10b540 c00000000141a300 c00000010fff6300
        GPR04: 0000000000000000 00000000026012c0 c00000010d10b630 0000000487ab0000
        GPR08: 000000010ee90000 c000000001454fd8 0000000000000000 0000000000000000
        GPR12: 0000000000004400 c00000000fb80000 00000000026012c0 00000000026012c0
        GPR16: 00000000026012c0 0000000000000000 0000000000000000 0000000000000002
        GPR20: 000000000000000c 0000000000000000 0000000000000000 00000000024200c0
        GPR24: c0000000016eef48 0000000000000000 c00000010fff7d00 00000000026012c0
        GPR28: 0000000000000000 c00000010fff7d00 c00000010fff6300 c00000010d10b6d0
        NIP mem_cgroup_soft_limit_reclaim+0xf8/0x4f0
        LR do_try_to_free_pages+0x1b4/0x450
        Call Trace:
          do_try_to_free_pages+0x1b4/0x450
          try_to_free_pages+0xf8/0x270
          __alloc_pages_nodemask+0x7a8/0xff0
          new_slab+0x104/0x8e0
          ___slab_alloc+0x620/0x700
          __slab_alloc+0x34/0x60
          kmem_cache_alloc_node_trace+0xdc/0x310
          mem_cgroup_init+0x158/0x1c8
          do_one_initcall+0x68/0x1d0
          kernel_init_freeable+0x278/0x360
          kernel_init+0x24/0x170
          ret_from_kernel_thread+0x5c/0x74
        Instruction dump:
        eb81ffe0 eba1ffe8 ebc1fff0 ebe1fff8 4e800020 3d230001 e9499a42 3d220004
        3929acd8 794a1f24 7d295214 eac90100 <e9360000> 2fa90000 419eff74 3b200000
        ---[ end trace 342f5208b00d01b6 ]---
      
      This is a chicken-and-egg issue where the kernel tries to free memory
      when allocating per-node data in mem_cgroup_init(), but in that path
      mem_cgroup_soft_limit_reclaim() is called, which assumes that these data
      are already allocated.
      
      As mem_cgroup_soft_limit_reclaim() is best effort, it should return when
      these data are not yet allocated.
      
      This patch also fixes potential null pointer access in
      mem_cgroup_remove_from_trees() and mem_cgroup_update_tree().
      
      Link: http://lkml.kernel.org/r/1487856999-16581-2-git-send-email-ldufour@linux.vnet.ibm.com
      Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Y
      mm/vmstats: add thp_split_pud event for clarity · ce9311cf
      Yisheng Xie committed
      We added support for PUD-sized transparent hugepages, however we count
      the event "thp split pud" into thp_split_pmd event.
      
      To separate the event count of thp split pud from pmd, add a new event
      named thp_split_pud.
      
      Link: http://lkml.kernel.org/r/1488282380-5076-1-git-send-email-xieyisheng1@huawei.com
      Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sebastian Siewior <bigeasy@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      mm: introduce __p4d_alloc() · 90eceff1
      Kirill A. Shutemov committed
      For full 5-level paging we need a helper to allocate p4d page table.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      mm: convert generic code to 5-level paging · c2febafc
      Kirill A. Shutemov committed
      Convert all non-architecture-specific code to 5-level paging.
      
      It is mostly a mechanical change: handling for one more page table
      level is added in places where we deal with pud_t.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 09 Mar, 2017 3 commits
    • T
      mm, page_alloc: Add missing check for memory holes · b4fb8f66
      Tony Luck committed
      Commit 13ad59df ("mm, page_alloc: avoid page_to_pfn() when merging
      buddies") moved the check for memory holes out of page_is_buddy() and
      had the callers do the check.
      
      But this wasn't done correctly in one place which caused ia64 to crash
      very early in boot.
      
      Update to fix that and make ia64 boot again.
      
      [ v2: Vlastimil pointed out we don't need to call page_to_pfn()
            since we already have the result of that in "buddy_pfn" ]
      
      Fixes: 13ad59df ("avoid page_to_pfn() when merging buddies")
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      bdi: Fix use-after-free in wb_congested_put() · df23de55
      Jan Kara committed
      bdi_writeback_congested structures get created for each blkcg and bdi
      regardless of whether the bdi is registered or not. When they are created
      in an unregistered bdi and the request queue (and thus the bdi) is then
      destroyed while a blkg still holds a reference to the
      bdi_writeback_congested structure, this structure will be referencing the
      freed bdi and the last wb_congested_put() will try to remove the structure
      from the already freed bdi.
      
      With commit 165a5e22 "block: Move bdi_unregister() to
      del_gendisk()", SCSI started to destroy bdis without calling
      bdi_unregister() first (previously it was calling bdi_unregister() even
      for unregistered bdis) and thus the code detaching
      bdi_writeback_congested in cgwb_bdi_destroy() was not triggered and we
      started hitting this use-after-free bug. It is enough to boot a KVM
      instance with virtio-scsi device to trigger this behavior.
      
      Fix the problem by detaching bdi_writeback_congested structures in
      bdi_exit() instead of bdi_unregister(). This is also more logical as
      they can get attached to bdi regardless whether it ever got registered
      or not.
      
      Fixes: 165a5e22
      Signed-off-by: Jan Kara <jack@suse.cz>
      Tested-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • J
      block: Allow bdi re-registration · b6f8fec4
      Jan Kara committed
      SCSI can call device_add_disk() several times for one request queue when
      a device is unbound and bound, creating a new gendisk each time. This will
      lead to bdi being repeatedly registered and unregistered. This was not a
      big problem until commit 165a5e22 "block: Move bdi_unregister() to
      del_gendisk()" since bdi was only registered repeatedly (bdi_register()
      handles repeated calls fine, only we ended up leaking reference to
      gendisk due to overwriting bdi->owner) but unregistered only in
      blk_cleanup_queue() which didn't get called repeatedly. After
      165a5e22 we were doing correct bdi_register() - bdi_unregister()
      cycles however bdi_unregister() is not prepared for it. So make sure
      bdi_unregister() cleans up bdi in such a way that it is prepared for
      a possible following bdi_register() call.
      
      An easy way to provoke this behavior is to enable
      CONFIG_DEBUG_TEST_DRIVER_REMOVE and use scsi_debug driver to create a
      scsi disk which immediately hangs without this fix.
      
      Fixes: 165a5e22
      Signed-off-by: Jan Kara <jack@suse.cz>
      Tested-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  6. 07 Mar, 2017 2 commits
  7. 03 Mar, 2017 2 commits
    • D
      statx: Add a system call to make enhanced file info available · a528d35e
      David Howells committed
      Add a system call to make extended file information available, including
      file creation and some attribute flags where available through the
      underlying filesystem.
      
      The getattr inode operation is altered to take two additional arguments: a
      u32 request_mask and an unsigned int flags that indicate the
      synchronisation mode.  This change is propagated to the vfs_getattr*()
      function.
      
      Functions like vfs_stat() are now inline wrappers around new functions
      vfs_statx() and vfs_statx_fd() to reduce stack usage.
      
      ========
      OVERVIEW
      ========
      
      The idea was initially proposed as a set of xattrs that could be retrieved
      with getxattr(), but the general preference proved to be for a new syscall
      with an extended stat structure.
      
      A number of requests were gathered for features to be included.  The
      following have been included:
      
       (1) Make the fields a consistent size on all arches and make them large.
      
       (2) Spare space, request flags and information flags are provided for
           future expansion.
      
       (3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
           __s64).
      
       (4) Creation time: The SMB protocol carries the creation time, which could
           be exported by Samba, which will in turn help CIFS make use of
           FS-Cache as that can be used for coherency data (stx_btime).
      
           This is also specified in NFSv4 as a recommended attribute and could
           be exported by NFSD [Steve French].
      
       (5) Lightweight stat: Ask for just those details of interest, and allow a
           netfs (such as NFS) to approximate anything not of interest, possibly
           without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
           Dilger] (AT_STATX_DONT_SYNC).
      
       (6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
           its cached attributes are up to date [Trond Myklebust]
           (AT_STATX_FORCE_SYNC).
      
      And the following have been left out for future extension:
      
       (7) Data version number: Could be used by userspace NFS servers [Aneesh
           Kumar].
      
           Can also be used to modify fill_post_wcc() in NFSD which retrieves
           i_version directly, but has just called vfs_getattr().  It could get
           it from the kstat struct if it used vfs_xgetattr() instead.
      
           (There's disagreement on the exact semantics of a single field, since
           not all filesystems do this the same way).
      
       (8) BSD stat compatibility: Including more fields from the BSD stat such
           as creation time (st_btime) and inode generation number (st_gen)
           [Jeremy Allison, Bernd Schubert].
      
       (9) Inode generation number: Useful for FUSE and userspace NFS servers
           [Bernd Schubert].
      
           (This was asked for but later deemed unnecessary with the
           open-by-handle capability available and caused disagreement as to
           whether it's a security hole or not).
      
      (10) Extra coherency data may be useful in making backups [Andreas Dilger].
      
           (No particular data were offered, but things like last backup
           timestamp, the data version number and the DOS archive bit would come
           into this category).
      
      (11) Allow the filesystem to indicate what it can/cannot provide: A
           filesystem can now say it doesn't support a standard stat feature if
           that isn't available, so if, for instance, inode numbers or UIDs don't
           exist or are fabricated locally...
      
           (This requires a separate system call - I have an fsinfo() call idea
           for this).
      
      (12) Store a 16-byte volume ID in the superblock that can be returned in
           struct xstat [Steve French].
      
           (Deferred to fsinfo).
      
      (13) Include granularity fields in the time data to indicate the
           granularity of each of the times (NFSv4 time_delta) [Steve French].
      
           (Deferred to fsinfo).
      
      (14) FS_IOC_GETFLAGS value.  These could be translated to BSD's st_flags.
           Note that the Linux IOC flags are a mess and filesystems such as Ext4
           define flags that aren't in linux/fs.h, so translation in the kernel
           may be a necessity (or, possibly, we provide the filesystem type too).
      
           (Some attributes are made available in stx_attributes, but the general
           feeling was that the IOC flags were too ext[234]-specific and shouldn't
           be exposed through statx this way).
      
      (15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
           Michael Kerrisk].
      
           (Deferred, probably to fsinfo.  Finding out if there's an ACL or
           seclabel might require extra filesystem operations).
      
      (16) Femtosecond-resolution timestamps [Dave Chinner].
      
           (A __reserved field has been left in the statx_timestamp struct for
           this - if there proves to be a need).
      
      (17) A set multiple attributes syscall to go with this.
      
      ===============
      NEW SYSTEM CALL
      ===============
      
      The new system call is:
      
      	int ret = statx(int dfd,
      			const char *filename,
      			unsigned int flags,
      			unsigned int mask,
      			struct statx *buffer);
      
      The dfd, filename and flags parameters indicate the file to query, in a
      similar way to fstatat().  There is no equivalent of lstat() as that can be
      emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags.  There is
      also no equivalent of fstat() as that can be emulated by passing a NULL
      filename to statx() with the fd of interest in dfd.
      
      Whether or not statx() synchronises the attributes with the backing store
      can be controlled by OR'ing a value into the flags argument (this typically
      only affects network filesystems):
      
       (1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
           respect.
      
       (2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
           its attributes with the server - which might require data writeback to
           occur to get the timestamps correct.
      
       (3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
           network filesystem.  The resulting values should be considered
           approximate.
      
      mask is a bitmask indicating the fields in struct statx that are of
      interest to the caller.  The user should set this to STATX_BASIC_STATS to
      get the basic set returned by stat().  It should be noted that asking for
      more information may entail extra I/O operations.
      
      buffer points to the destination for the data.  This must be 256 bytes in
      size.
      
      ======================
      MAIN ATTRIBUTES RECORD
      ======================
      
      The following structures are defined in which to return the main attribute
      set:
      
      	struct statx_timestamp {
      		__s64	tv_sec;
      		__s32	tv_nsec;
      		__s32	__reserved;
      	};
      
      	struct statx {
      		__u32	stx_mask;
      		__u32	stx_blksize;
      		__u64	stx_attributes;
      		__u32	stx_nlink;
      		__u32	stx_uid;
      		__u32	stx_gid;
      		__u16	stx_mode;
      		__u16	__spare0[1];
      		__u64	stx_ino;
      		__u64	stx_size;
      		__u64	stx_blocks;
      		__u64	__spare1[1];
      		struct statx_timestamp	stx_atime;
      		struct statx_timestamp	stx_btime;
      		struct statx_timestamp	stx_ctime;
      		struct statx_timestamp	stx_mtime;
      		__u32	stx_rdev_major;
      		__u32	stx_rdev_minor;
      		__u32	stx_dev_major;
      		__u32	stx_dev_minor;
      		__u64	__spare2[14];
      	};
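
      The 256-byte buffer size can be checked against the layout quoted above.
      The sketch below mirrors the two structures with fixed-width types under
      hypothetical *_sketch names (it is not the kernel header itself) and
      lets the compiler verify the offsets and total size.

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the layout quoted above; *_sketch names are hypothetical. */
struct statx_timestamp_sketch {
	int64_t  tv_sec;
	int32_t  tv_nsec;
	int32_t  reserved;
};

struct statx_sketch {
	uint32_t stx_mask;
	uint32_t stx_blksize;
	uint64_t stx_attributes;
	uint32_t stx_nlink;
	uint32_t stx_uid;
	uint32_t stx_gid;
	uint16_t stx_mode;
	uint16_t spare0[1];
	uint64_t stx_ino;
	uint64_t stx_size;
	uint64_t stx_blocks;
	uint64_t spare1[1];
	struct statx_timestamp_sketch stx_atime;
	struct statx_timestamp_sketch stx_btime;
	struct statx_timestamp_sketch stx_ctime;
	struct statx_timestamp_sketch stx_mtime;
	uint32_t stx_rdev_major;
	uint32_t stx_rdev_minor;
	uint32_t stx_dev_major;
	uint32_t stx_dev_minor;
	uint64_t spare2[14];
};

/* Compile-time checks: no implicit padding, 256 bytes in total. */
_Static_assert(sizeof(struct statx_timestamp_sketch) == 16, "timestamp size");
_Static_assert(offsetof(struct statx_sketch, stx_ino) == 32, "stx_ino offset");
_Static_assert(offsetof(struct statx_sketch, stx_atime) == 64, "stx_atime offset");
_Static_assert(offsetof(struct statx_sketch, stx_rdev_major) == 128, "rdev offset");
_Static_assert(offsetof(struct statx_sketch, spare2) == 144, "spare2 offset");
_Static_assert(sizeof(struct statx_sketch) == 256, "total size");
```

      The arithmetic is 64 bytes of scalar fields, 4 x 16 bytes of timestamps,
      16 bytes of device numbers and 14 x 8 bytes of spare space.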
      
      The defined bits in mask and stx_mask are:
      
      	STATX_TYPE		Want/got stx_mode & S_IFMT
      	STATX_MODE		Want/got stx_mode & ~S_IFMT
      	STATX_NLINK		Want/got stx_nlink
      	STATX_UID		Want/got stx_uid
      	STATX_GID		Want/got stx_gid
      	STATX_ATIME		Want/got stx_atime{,_ns}
      	STATX_MTIME		Want/got stx_mtime{,_ns}
      	STATX_CTIME		Want/got stx_ctime{,_ns}
      	STATX_INO		Want/got stx_ino
      	STATX_SIZE		Want/got stx_size
      	STATX_BLOCKS		Want/got stx_blocks
      	STATX_BASIC_STATS	[The stuff in the normal stat struct]
      	STATX_BTIME		Want/got stx_btime{,_ns}
      	STATX_ALL		[All currently available stuff]
      
      stx_btime is the file creation time, stx_mask is a bitmask indicating the
      data provided, and __spare*[] are where as-yet undefined fields can be
      placed.
      
      Time fields are structures with separate seconds and nanoseconds fields
      plus a reserved field in case we want to add even finer resolution.  Note
      that times will be negative if before 1970; in such a case, the nanosecond
      fields will also be negative if not zero.
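
      The sign convention above matters when converting a timestamp.  The
      sketch below assumes exactly the convention stated in this description
      (a negative nanosecond field for pre-1970 times) and uses a hypothetical
      local struct rather than the kernel header.  It shows both a flat
      nanosecond count and a conversion to the POSIX style, where the
      nanosecond part is always in the range 0..999999999.

```c
#include <stdint.h>

/* Hypothetical mirror of struct statx_timestamp as described above. */
struct ts_sketch {
	int64_t tv_sec;
	int32_t tv_nsec;	/* negative for pre-1970 times, per the text */
	int32_t reserved;
};

/* Signed nanoseconds since the epoch.  Overflows for times more than
 * roughly 292 years away from 1970. */
static int64_t ts_to_ns(const struct ts_sketch *ts)
{
	return ts->tv_sec * INT64_C(1000000000) + ts->tv_nsec;
}

/* Convert to POSIX style: floor the seconds, keep nanoseconds >= 0. */
static void ts_to_posix(const struct ts_sketch *ts, int64_t *sec, int32_t *nsec)
{
	*sec = ts->tv_sec;
	*nsec = ts->tv_nsec;
	if (*nsec < 0) {
		*sec -= 1;
		*nsec += 1000000000;
	}
}
```

      For example, 1.5 seconds before the epoch is stored as tv_sec = -1 and
      tv_nsec = -500000000 under this convention, which the second helper
      turns into -2 seconds plus 500000000 nanoseconds.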
      
      The bits defined in the stx_attributes field convey information about a
      file, how it is accessed, where it is and what it does.  The following
      attributes map to FS_*_FL flags and are the same numerical value:
      
      	STATX_ATTR_COMPRESSED		File is compressed by the fs
      	STATX_ATTR_IMMUTABLE		File is marked immutable
      	STATX_ATTR_APPEND		File is append-only
      	STATX_ATTR_NODUMP		File is not to be dumped
      	STATX_ATTR_ENCRYPTED		File requires key to decrypt in fs
      
      Within the kernel, the supported flags are listed by:
      
      	KSTAT_ATTR_FS_IOC_FLAGS
      
      [Are any other IOC flags of sufficient general interest to be exposed
      through this interface?]
      
      New flags include:
      
      	STATX_ATTR_AUTOMOUNT		Object is an automount trigger
      
      These are for the use of GUI tools that might want to mark files specially,
      depending on what they are.
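
      A tool of that kind might decode stx_attributes along the following
      lines.  This is only a sketch: describe_attrs() and the name table are
      hypothetical, and the STATX_ATTR_* constants are assumed to come from
      the installed <linux/stat.h>.

```c
#include <stddef.h>
#include <string.h>
#include <linux/stat.h>   /* STATX_ATTR_* */

/* Hypothetical display names for the attribute bits listed above. */
static const struct {
	unsigned long long bit;
	const char *name;
} attr_names[] = {
	{ STATX_ATTR_COMPRESSED, "compressed" },
	{ STATX_ATTR_IMMUTABLE,  "immutable"  },
	{ STATX_ATTR_APPEND,     "append"     },
	{ STATX_ATTR_NODUMP,     "nodump"     },
	{ STATX_ATTR_ENCRYPTED,  "encrypted"  },
	{ STATX_ATTR_AUTOMOUNT,  "automount"  },
};

/* Write a comma-separated list of set attributes into out.
 * Returns the number of names written; stops early if out is too small. */
static int describe_attrs(unsigned long long attrs, char *out, size_t outlen)
{
	size_t used = 0;
	int n = 0;

	if (outlen == 0)
		return 0;
	out[0] = '\0';

	for (size_t i = 0; i < sizeof(attr_names) / sizeof(attr_names[0]); i++) {
		const char *name = attr_names[i].name;
		size_t len = strlen(name);

		if (!(attrs & attr_names[i].bit))
			continue;
		/* Need room for an optional comma, the name and the NUL. */
		if (used + len + (n ? 1 : 0) >= outlen)
			break;
		if (n)
			out[used++] = ',';
		memcpy(out + used, name, len + 1);
		used += len;
		n++;
	}
	return n;
}
```

      Bits that are not in the table are silently ignored, so the helper
      keeps working if further attributes are defined later.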
      
      Fields in struct statx come in a number of classes:
      
       (0) stx_dev_*, stx_blksize.
      
           These are local system information and are always available.
      
       (1) stx_mode, stx_nlink, stx_uid, stx_gid, stx_[amc]time, stx_ino,
           stx_size, stx_blocks.
      
           These will be returned whether the caller asks for them or not.  The
           corresponding bits in stx_mask will be set to indicate whether they
           actually have valid values.
      
           If the caller didn't ask for them, then they may be approximated.  For
           example, NFS won't waste any time updating them from the server,
           unless as a byproduct of updating something requested.
      
           If the values don't actually exist for the underlying object (such as
           UID or GID on a DOS file), then the bit won't be set in the stx_mask,
           even if the caller asked for the value.  In such a case, the returned
           value will be a fabrication.
      
           Note that there are instances where the type might not be valid, for
           instance Windows reparse points.
      
       (2) stx_rdev_*.
      
           This will be set only if stx_mode indicates we're looking at a
           blockdev or a chardev, otherwise will be 0.
      
       (3) stx_btime.
      
           Similar to (1), except this will be set to 0 if it doesn't exist.
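
      Because class (1) and (3) fields may be fabricated or zeroed, a caller
      would normally test stx_mask before using them.  The sketch below shows
      one way to do that; statx_has() and btime_or() are hypothetical helpers,
      and struct statx and the STATX_* bits are assumed to come from the
      installed <linux/stat.h>.

```c
#include <stdbool.h>
#include <linux/stat.h>   /* struct statx, STATX_* */

/* True only if every bit in 'want' was reported as valid by the kernel. */
static bool statx_has(const struct statx *sx, unsigned int want)
{
	return (sx->stx_mask & want) == want;
}

/* Creation time in seconds, or 'fallback' if the filesystem has none. */
static long long btime_or(const struct statx *sx, long long fallback)
{
	if (!statx_has(sx, STATX_BTIME))
		return fallback;
	return (long long)sx->stx_btime.tv_sec;
}
```

      With a result mask of 0x7ff, as in the example output later in this
      description, all of STATX_BASIC_STATS is valid but STATX_BTIME is not,
      so btime_or() would return the fallback.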
      
      =======
      TESTING
      =======
      
      The following test program can be used to test the statx system call:
      
      	samples/statx/test-statx.c
      
      Just compile and run, passing it paths to the files you want to examine.
      The file is built automatically if CONFIG_SAMPLES is enabled.
      
      Here's some example output.  Firstly, an NFS directory that crosses to
      another FSID.  Note that the AUTOMOUNT attribute is set because transiting
      this directory will cause d_automount to be invoked by the VFS.
      
      	[root@andromeda ~]# /tmp/test-statx -A /warthog/data
      	statx(/warthog/data) = 0
      	results=7ff
      	  Size: 4096            Blocks: 8          IO Block: 1048576  directory
      	Device: 00:26           Inode: 1703937     Links: 125
      	Access: (3777/drwxrwxrwx)  Uid:     0   Gid:  4041
      	Access: 2016-11-24 09:02:12.219699527+0000
      	Modify: 2016-11-17 10:44:36.225653653+0000
      	Change: 2016-11-17 10:44:36.225653653+0000
      	Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)
      
      Secondly, the result of automounting on that directory.
      
      	[root@andromeda ~]# /tmp/test-statx /warthog/data
      	statx(/warthog/data) = 0
      	results=7ff
      	  Size: 4096            Blocks: 8          IO Block: 1048576  directory
      	Device: 00:27           Inode: 2           Links: 125
      	Access: (3777/drwxrwxrwx)  Uid:     0   Gid:  4041
      	Access: 2016-11-24 09:02:12.219699527+0000
      	Modify: 2016-11-17 10:44:36.225653653+0000
      	Change: 2016-11-17 10:44:36.225653653+0000
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • I
      sched/headers: Move task_struct::signal and task_struct::sighand types and accessors into <linux/sched/signal.h> · c3edc401
      Ingo Molnar committed
      
      task_struct::signal and task_struct::sighand are pointers, which would normally make it
      straightforward to not define those types in sched.h.
      
      That is not so, because the types are accompanied by a myriad of APIs (macros and inline
      functions) that dereference them.
      
      Split the types and the APIs out of sched.h and move them into a new header, <linux/sched/signal.h>.
      
      With this change, sched.h no longer knows about 'struct signal_struct' and 'struct sighand_struct';
      trying to put accessors into sched.h as a test fails as follows:
      
        ./include/linux/sched.h: In function ‘test_signal_types’:
        ./include/linux/sched.h:2461:18: error: dereferencing pointer to incomplete type ‘struct signal_struct’
                          ^
      
      This reduces the size and complexity of sched.h significantly.
      
      Update all headers and .c code that relied on getting the signal handling
      functionality from <linux/sched.h> to include <linux/sched/signal.h>.
      
      The list of affected files in the preparatory patch was partly generated by
      grepping for the APIs, and partly by doing coverage build testing, both
      all[yes|mod|def|no]config builds on 64-bit and 32-bit x86, and an array of
      cross-architecture builds.
      
      Nevertheless some (trivial) build breakage is still expected related to rare
      Kconfig combinations and in-flight patches to various kernel code, but most
      of it should be handled by this patch.
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  8. 02 Mar, 2017 16 commits