1. 22 February 2019 (4 commits)
  2. 18 February 2019 (1 commit)
  3. 16 February 2019 (1 commit)
    • arm64, mm, efi: Account for GICv3 LPI tables in static memblock reserve table · 8a5b403d
      Committed by Ard Biesheuvel
      In the irqchip and EFI code, we have what basically amounts to a quirk
      to work around a peculiarity in the GICv3 architecture, which permits
      the system memory address of LPI tables to be programmable only once
      after a CPU reset. This means kexec kernels must use the same memory
      as the first kernel, and thus ensure that this memory has not been
      given out for other purposes by the time the ITS init code runs, which
      is not very early for secondary CPUs.
      
      On systems with many CPUs, these reservations could overflow the
      memblock reservation table, and this was addressed in commit:
      
        eff89628 ("efi/arm: Defer persistent reservations until after paging_init()")
      
      However, this turns out to have made things worse, since the allocation
      of page tables and heap space for the resized memblock reservation table
      itself may overwrite the regions we are attempting to reserve, which may
      cause all kinds of corruption, also considering that the ITS will still
      be poking bits into that memory in response to incoming MSIs.
      
      So instead, let's grow the static memblock reservation table on such
      systems so it can accommodate these reservations at an earlier time.
      This will permit us to revert the above commit in a subsequent patch.
      
      [ mingo: Minor cleanups. ]
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-efi@vger.kernel.org
      Link: http://lkml.kernel.org/r/20190215123333.21209-2-ard.biesheuvel@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8a5b403d
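      The mechanism, as far as the patch description goes, is to let an
      architecture size the static reserved-regions array at build time.
      A simplified sketch of that idea (the macro names follow the upstream
      patch, but the guards and file locations are paraphrased, not quoted):

        /* mm/memblock.c: allow an arch override of the table size */
        #ifndef INIT_MEMBLOCK_RESERVED_REGIONS
        # define INIT_MEMBLOCK_RESERVED_REGIONS  INIT_MEMBLOCK_REGIONS
        #endif

        static struct memblock_region
                memblock_reserved_init_regions[INIT_MEMBLOCK_RESERVED_REGIONS]
                __initdata_memblock;

        /* arm64 with EFI (sketch): one LPI pending table per CPU plus one
         * property table may need reserving before the table can be
         * dynamically resized. */
        #define INIT_MEMBLOCK_RESERVED_REGIONS  (INIT_MEMBLOCK_REGIONS + NR_CPUS + 1)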
  4. 15 February 2019 (1 commit)
    • mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs · 2c2ade81
      Committed by Jann Horn
      The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
      number of references that we might need to create in the fastpath later,
      the bump-allocation fastpath only has to modify the non-atomic bias value
      that tracks the number of extra references we hold instead of the atomic
      refcount. The maximum number of allocations we can serve (under the
      assumption that no allocation is made with size 0) is nc->size, so that's
      the bias used.
      
      However, even when all memory in the allocation has been given away, a
      reference to the page is still held; and in the `offset < 0` slowpath, the
      page may be reused if everyone else has dropped their references.
      This means that the necessary number of references is actually
      `nc->size+1`.
      
      Luckily, from a quick grep, it looks like the only path that can call
      page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
      requires CAP_NET_ADMIN in the init namespace and is only intended to be
      used for kernel testing and fuzzing.
      
      To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
      `offset < 0` path, below the virt_to_page() call, and then repeatedly call
      writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
      with a vector consisting of 15 elements containing 1 byte each.
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2c2ade81
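      A toy userspace model of the accounting above (hypothetical names, not
      the kernel code): with nc->size one-byte fragments per page plus the
      cache's own long-lived page reference, the refcount must be
      pre-charged to nc->size + 1, or the page can be recycled while the
      cache still points at it.

        #include <assert.h>
        #include <stdio.h>

        #define CACHE_SIZE 8            /* stands in for nc->size */

        /* refs models the atomic page refcount; bias is the non-atomic
         * count of references the cache still "owns". */
        struct frag_cache { int refs, bias, offset; };

        static void refill(struct frag_cache *nc)
        {
                nc->refs = CACHE_SIZE + 1;  /* the fix: +1 for the cache's own ref */
                nc->bias = CACHE_SIZE + 1;
                nc->offset = CACHE_SIZE;
        }

        static void frag_alloc(struct frag_cache *nc, int sz)
        {
                if (nc->offset - sz < 0)    /* the `offset < 0` slowpath */
                        refill(nc);
                nc->offset -= sz;
                nc->bias--;                 /* fastpath: no atomic op */
        }

        static void frag_free(struct frag_cache *nc) { nc->refs--; }

        int main(void)
        {
                struct frag_cache nc;
                int i;

                refill(&nc);
                for (i = 0; i < CACHE_SIZE; i++) {  /* CACHE_SIZE 1-byte allocs */
                        frag_alloc(&nc, 1);
                        frag_free(&nc);
                }
                /* Had the pre-charge been only CACHE_SIZE, refs would now
                 * be 0 with the page still owned by the cache. */
                assert(nc.refs == 1);
                puts("cache reference survives");
                return 0;
        }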
  5. 13 February 2019 (3 commits)
    • Revert "mm: use early_pfn_to_nid in page_ext_init" · 2f1ee091
      Committed by Qian Cai
      This reverts commit fe53ca54 ("mm: use early_pfn_to_nid in
      page_ext_init").
      
      When booting a system with "page_owner=on",
      
      start_kernel
        page_ext_init
          invoke_init_callbacks
            init_section_page_ext
              init_page_owner
                init_early_allocated_pages
                  init_zones_in_node
                    init_pages_in_zone
                      lookup_page_ext
                        page_to_nid
      
      The issue here is that page_to_nid() will not work since some page flags
      have no node information until later in page_alloc_init_late() due to
      DEFERRED_STRUCT_PAGE_INIT.  Hence, it could trigger an out-of-bounds
      access with an invalid nid.
      
        UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50
        index 7 is out of range for type 'zone [5]'
      
      Also, the kernel will panic since the flags were poisoned earlier
      with,
      
      CONFIG_DEBUG_VM_PGFLAGS=y
      CONFIG_NODE_NOT_IN_PAGE_FLAGS=n
      
      start_kernel
        setup_arch
          pagetable_init
            paging_init
              sparse_init
                sparse_init_nid
                  memblock_alloc_try_nid_raw
      
      init_pages_in_zone() does not handle this well, since it ends up
      calling page_to_nid() on such poisoned pages.
      
        page:ffffea0004200000 is uninitialized and poisoned
        raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
        raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
        page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
        page_owner info is not active (free page?)
        kernel BUG at include/linux/mm.h:990!
        RIP: 0010:init_page_owner+0x486/0x520
      
      This means that assumptions behind commit fe53ca54 ("mm: use
      early_pfn_to_nid in page_ext_init") are incomplete.  Therefore, revert
      the commit for now.  A proper way to move the page_owner initialization
      to sooner is to hook into memmap initialization.
      
      Link: http://lkml.kernel.org/r/20190115202812.75820-1-cai@lca.pw
      Signed-off-by: Qian Cai <cai@lca.pw>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2f1ee091
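      The ordering constraint behind the revert: with
      DEFERRED_STRUCT_PAGE_INIT, struct pages only carry valid node
      information once page_alloc_init_late() has run. A rough sketch of
      the boot sequence the revert restores (paraphrased from init/main.c,
      not the exact code):

        static noinline void __init kernel_init_freeable(void)
        {
                /* ... */
                page_alloc_init_late();  /* completes deferred struct page
                                          * init; page_to_nid() is valid
                                          * for every page from here on */
                page_ext_init();         /* page_owner can now walk zones */
                /* ... */
        }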
    • mm/gup: fix gup_pmd_range() for dax · 414fd080
      Committed by Yu Zhao
      For a dax pmd, pmd_trans_huge() returns false but pmd_huge() returns
      true on x86, so gup_pmd_range() works as long as hugetlb is
      configured.  However, dax doesn't depend on hugetlb: with
      CONFIG_HUGETLB_PAGE disabled, pmd_huge() is a stub returning false,
      and dax pmds are not recognized at all.
      
      Link: http://lkml.kernel.org/r/20190111034033.601-1-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: "Michael S . Tsirkin" <mst@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      414fd080
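      The message above describes the bug rather than the fix; as far as I
      can tell from the upstream commit, the repair is to test pmd_devmap()
      explicitly in gup_pmd_range() instead of relying on pmd_huge() being
      true for dax. Roughly (paraphrased, not the exact hunk):

        /* mm/gup.c, gup_pmd_range() (sketch): a dax pmd must be recognized
         * even when CONFIG_HUGETLB_PAGE is disabled and pmd_huge() is a
         * stub returning false. */
        if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd) ||
                     pmd_devmap(pmd))) {
                /* take the huge/devmap path ... */
        }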
    • Revert "mm: slowly shrink slabs with a relatively small number of objects" · a9a238e8
      Committed by Dave Chinner
      This reverts commit 172b06c3 ("mm: slowly shrink slabs with a
      relatively small number of objects").
      
      The reverted commit changed the aggressiveness of shrinker reclaim,
      causing small-cache and low-priority reclaim to greatly increase
      scanning pressure on small caches.  As a result, light memory
      pressure has a disproportionate effect on small caches, and causes
      large caches to be reclaimed much faster than previously.
      
      This greatly perturbs the delicate balance of the VFS caches
      (dentry/inode vs file page cache), such that the inode/dentry caches
      are reclaimed much, much faster than the page cache, and this drives
      us into several other caching-imbalance-related problems.
      
      As such, this is a bad change and needs to be reverted.
      
      [ Needs some massaging to retain the later seekless shrinker
        modifications. ]
      
      Link: http://lkml.kernel.org/r/20190130041707.27750-3-david@fromorbit.com
      Fixes: 172b06c3 ("mm: slowly shrink slabs with a relatively small number of objects")
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Cc: Wolfgang Walter <linux@stwm.de>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Spock <dairinin@gmail.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9a238e8
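      For context, the reverted commit's mechanism was a floor on the scan
      target in do_shrink_slab(): every cache was scanned by at least
      min(freeable, batch_size) objects regardless of reclaim priority, and
      for a small cache that floor is essentially the whole cache. A sketch
      of the delta computation with the now-removed clamp (paraphrased from
      mm/vmscan.c, not the exact hunks):

        /* Scan target: proportional to object count, scaled down by
         * reclaim priority and the shrinker's seek cost. */
        delta = freeable >> priority;
        delta *= 4;
        do_div(delta, shrinker->seeks);

        /* The reverted commit added this floor so small (cgroup) caches
         * were scanned even under light pressure; the revert drops it:
         *
         *      delta = max(delta, min(freeable, batch_size));
         */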
  6. 02 February 2019 (12 commits)
  7. 29 January 2019 (1 commit)
    • Revert "mm, memory_hotplug: initialize struct pages for the full memory section" · 4aa9fc2a
      Committed by Michal Hocko
      This reverts commit 2830bf6f.
      
      The underlying assumption that one sparse section belongs to a single
      NUMA node doesn't really hold.  Robert Shteynfeld has reported a boot
      failure.  The boot log was not captured, but his memory layout is as
      follows:
      
        Early memory node ranges
          node   1: [mem 0x0000000000001000-0x0000000000090fff]
          node   1: [mem 0x0000000000100000-0x00000000dbdf8fff]
          node   1: [mem 0x0000000100000000-0x0000001423ffffff]
          node   0: [mem 0x0000001424000000-0x0000002023ffffff]
      
      This means that node0 starts in the middle of a memory section which
      also belongs to node1.  memmap_init_zone() tries to initialize the
      padding of a section even when it is outside of the given pfn range,
      because there are code paths (e.g. memory hotplug) which assume that
      the full memory section is always initialized.
      
      In this particular case, though, such a range is already initialized and
      most likely already managed by the page allocator.  Scribbling over
      those pages corrupts the internal state and likely blows up when any of
      those pages gets used.
      Reported-by: Robert Shteynfeld <robert.shteynfeld@gmail.com>
      Fixes: 2830bf6f ("mm, memory_hotplug: initialize struct pages for the full memory section")
      Cc: stable@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4aa9fc2a
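      The dangerous part of the reverted change was rounding the init range
      out to section boundaries, which is only safe if a section never
      straddles nodes. In pfn terms (illustrative arithmetic, not the exact
      kernel code):

        /* Sketch: the reverted commit effectively widened the range that
         * memmap_init_zone() touches to whole sparse sections. */
        unsigned long start = round_down(start_pfn, PAGES_PER_SECTION);
        unsigned long end   = round_up(end_pfn, PAGES_PER_SECTION);

        /* With 128M sections (PAGES_PER_SECTION == 0x8000 on x86_64),
         * node0's first pfn 0x1424000 sits mid-section: rounding its
         * start down re-initializes the tail of node1's last section,
         * whose pages the allocator may already be handing out. */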
  8. 24 January 2019 (1 commit)
    • Revert "Change mincore() to count "mapped" pages rather than "cached" pages" · 30bac164
      Committed by Linus Torvalds
      This reverts commit 574823bf.
      
      It turns out that my hope that we could just remove the code that
      exposes the cache residency status from mincore() was too optimistic.
      
      There are various random users that want it, and one example would be
      the Netflix database cluster maintenance. To quote Josh Snyder:
      
       "For Netflix, losing accurate information from the mincore syscall
        would lengthen database cluster maintenance operations from days to
        months. We rely on cross-process mincore to migrate the contents of a
        page cache from machine to machine, and across reboots.
      
        To do this, I wrote and maintain happycache [1], a page cache
        dumper/loader tool. It is quite similar in architecture to pgfincore,
        except that it is agnostic to workload. The gist of happycache's
        operation is "produce a dump of residence status for each page, do
        some operation, then reload exactly the same pages which were present
        before." happycache is entirely dependent on accurate reporting of the
        in-core status of file-backed pages, as accessed by another process.
      
        We primarily use happycache with Cassandra, which (like Postgres +
        pgfincore) relies heavily on OS page cache to reduce disk accesses.
        Because our workloads never experience a cold page cache, we are able
        to provision hardware for a peak utilization level that is far lower
        than the hypothetical "every query is a cache miss" peak.
      
        A database warmed by happycache can be ready for service in seconds
        (bounded only by the performance of the drives and the I/O subsystem),
        with no period of in-service degradation. By contrast, putting a
        database in service without a page cache entails a potentially
        unbounded period of degradation (at Netflix, the time to populate a
        single node's cache via natural cache misses varies by workload from
        hours to weeks). If a single node upgrade were to take weeks, then
        upgrading an entire cluster would take months. Since we want to apply
        security upgrades (and other things) on a somewhat tighter schedule,
        we would have to develop more complex solutions to provide the same
        functionality already provided by mincore.
      
        At the bottom line, happycache is designed to benignly exploit the
        same information leak documented in the paper [2]. I think it makes
        perfect sense to remove cross-process mincore functionality from
        unprivileged users, but not to remove it entirely"
      
      We do have an alternate approach that limits the cache residency
      reporting only to processes that have write permissions to the file, so
      we can fix the original information leak issue that way.  It involves
      _adding_ code rather than removing it, which is sad, but hey, at least
      we haven't found any users that would find the restrictions
      unacceptable.
      
      So revert the optimistic first approach to make room for that alternate
      fix instead.
      Reported-by: Josh Snyder <joshs@netflix.com>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Dominique Martinet <asmadeus@codewreck.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Kevin Easton <kevin@guarana.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Cyril Hrubis <chrubis@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Daniel Gruss <daniel@gruss.cc>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      30bac164
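      The "alternate approach" mentioned above amounts to a permission
      gate: report true cache residency only when the caller could learn it
      anyway, and degrade to "mapped" semantics otherwise. A sketch of such
      a check (modeled on the eventual upstream fix; the details here are
      paraphrased, not quoted):

        /* Sketch: residency may be reported for anonymous memory, for
         * mappings whose backing inode the caller owns, or when the
         * caller can write the backing file (and could thus observe
         * residency via timing anyway). */
        static inline bool can_do_mincore(struct vm_area_struct *vma)
        {
                if (vma_is_anonymous(vma))
                        return true;
                if (!vma->vm_file)
                        return false;
                return inode_owner_or_capable(file_inode(vma->vm_file)) ||
                       inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0;
        }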
  9. 23 January 2019 (1 commit)
  10. 10 January 2019 (1 commit)
  11. 09 January 2019 (11 commits)
  12. 07 January 2019 (1 commit)
    • Change mincore() to count "mapped" pages rather than "cached" pages · 574823bf
      Committed by Linus Torvalds
      The semantics of what "in core" means for the mincore() system call are
      somewhat unclear, but Linux has always (since 2.3.52, which is when
      mincore() was initially done) treated it as "page is available in page
      cache" rather than "page is mapped in the mapping".
      
      The problem with that traditional semantic is that it exposes a lot of
      system cache state that it really probably shouldn't, and that users
      shouldn't really even care about.
      
      So let's try to avoid that information leak by simply changing the
      semantics to be that mincore() counts actual mapped pages, not pages
      that might be cheaply mapped if they were faulted (note the "might be"
      part of the old semantics: being in the cache doesn't actually guarantee
      that you can access them without IO anyway, since things like network
      filesystems may have to revalidate the cache before use).
      
      In many ways the old semantics were somewhat insane even aside from the
      information leak issue.  From the very beginning (and that beginning is
      a long time ago: 2.3.52 was released in March 2000, I think), the code
      had a comment saying
      
        Later we can get more picky about what "in core" means precisely.
      
      and this is that "later".  Admittedly it is much later than is really
      comfortable.
      
      NOTE! This is a real semantic change, and it is for example known to
      change the output of "fincore", since that program literally does an
      mmap without populating it, and then does "mincore()" on that
      mapping, which doesn't actually have any pages in it.
      
      I'm hoping that nobody actually has any workflow that cares, because
      the info leak is real.
      
      We may have to do something different if it turns out that people have
      valid reasons to want the old semantics, and if we can limit the
      information leak sanely.
      
      Cc: Kevin Easton <kevin@guarana.org>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Masatake YAMATO <yamato@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      574823bf
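      Concretely, the old behaviour consulted the page cache for ptes that
      are not present, while the new behaviour trusts only the page table.
      A schematic of the difference (paraphrased, not the actual
      mm/mincore.c hunks):

        /* Old semantics (sketch): a non-present pte still counts as "in
         * core" if the backing page sits in the page cache. */
        vec[i] = pte_present(pte) ? 1 : mincore_page(mapping, index);

        /* New semantics (sketch): only pages actually mapped by a present
         * pte are reported as resident. */
        vec[i] = pte_present(pte);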
  13. 05 January 2019 (2 commits)