1. 07 7月, 2017 17 次提交
    • J
      mm: memcontrol: per-lruvec stats infrastructure · 00f3ca2c
      Johannes Weiner 提交于
      lruvecs are at the intersection of the NUMA node and memcg, which is the
      scope for most paging activity.
      
      Introduce a convenient accounting infrastructure that maintains
      statistics per node, per memcg, and the lruvec itself.
      
      Then convert over accounting sites for statistics that are already
      tracked in both nodes and memcgs and can be easily switched.
      
      [hannes@cmpxchg.org: fix crash in the new cgroup stat keeping code]
        Link: http://lkml.kernel.org/r/20170531171450.GA10481@cmpxchg.org
      [hannes@cmpxchg.org: don't track uncharged pages at all
        Link: http://lkml.kernel.org/r/20170605175254.GA8547@cmpxchg.org
      [hannes@cmpxchg.org: add missing free_percpu()]
        Link: http://lkml.kernel.org/r/20170605175354.GB8547@cmpxchg.org
      [linux@roeck-us.net: hexagon: fix build error caused by include file order]
        Link: http://lkml.kernel.org/r/20170617153721.GA4382@roeck-us.net
      Link: http://lkml.kernel.org/r/20170530181724.27197-6-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NGuenter Roeck <linux@roeck-us.net>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      00f3ca2c
    • P
      mm/hugetlb: allow architectures to override huge_pte_clear() · 9386fac3
      Punit Agrawal 提交于
      When unmapping a hugepage range, huge_pte_clear() is used to clear the
      page table entries that are marked as not present.  huge_pte_clear()
      internally just ends up calling pte_clear() which does not correctly
      deal with hugepages consisting of contiguous page table entries.
      
      Add a size argument to address this issue and allow architectures to
      override huge_pte_clear() by wrapping it in a #ifndef block.
      
      Update s390 implementation with the size parameter as well.
      
      Note that the change only affects huge_pte_clear() - the other generic
      hugetlb functions don't need any change.
      
      Link: http://lkml.kernel.org/r/20170522162555.4313-1-punit.agrawal@arm.comSigned-off-by: NPunit Agrawal <punit.agrawal@arm.com>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>	[s390 bits]
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9386fac3
    • P
      mm/hugetlb: add size parameter to huge_pte_offset() · 7868a208
      Punit Agrawal 提交于
      A poisoned or migrated hugepage is stored as a swap entry in the page
      tables.  On architectures that support hugepages consisting of
      contiguous page table entries (such as on arm64) this leads to ambiguity
      in determining the page table entry to return in huge_pte_offset() when
      a poisoned entry is encountered.
      
      Let's remove the ambiguity by adding a size parameter to convey
      additional information about the requested address.  Also fixup the
      definition/usage of huge_pte_offset() throughout the tree.
      
      Link: http://lkml.kernel.org/r/20170522133604.11392-4-punit.agrawal@arm.comSigned-off-by: NPunit Agrawal <punit.agrawal@arm.com>
      Acked-by: NSteve Capper <steve.capper@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: James Hogan <james.hogan@imgtec.com> (odd fixer:METAG ARCHITECTURE)
      Cc: Ralf Baechle <ralf@linux-mips.org> (supporter:MIPS)
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7868a208
    • S
      arm64: hugetlb: remove spurious calls to huge_ptep_offset() · f0b38d65
      Steve Capper 提交于
      We don't need to call huge_ptep_offset as our accessors are already
      supplied with the pte_t *.  This patch removes those spurious calls.
      
      [punit.agrawal@arm.com: resolve rebase conflicts due to patch re-ordering]
      Link: http://lkml.kernel.org/r/20170524115409.31309-3-punit.agrawal@arm.comSigned-off-by: NSteve Capper <steve.capper@arm.com>
      Signed-off-by: NPunit Agrawal <punit.agrawal@arm.com>
      Cc: David Woods <dwoods@mellanox.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f0b38d65
    • S
      arm64: hugetlb: refactor find_num_contig() · bb9dd3df
      Steve Capper 提交于
      Patch series "Support for contiguous pte hugepages", v4.
      
      This patchset updates the hugetlb code to fix issues arising from
      contiguous pte hugepages (such as on arm64).  Compared to v3, This
      version addresses a build failure on arm64 by including two cleanup
      patches.  Other than the arm64 cleanups, the rest are generic code
      changes.  The remaining arm64 support based on these patches will be
      posted separately.  The patches are based on v4.12-rc2.  Previous
      related postings can be found at [0], [1], [2], and [3].
      
      The patches fall into three categories -
      
      * Patch 1-2 - arm64 cleanups required to greatly simplify changing
        huge_pte_offset() prototype in Patch 5.
      
        Catalin, Will - are you happy for these patches to go via mm?
      
      * Patches 3-4 address issues with gup
      
      * Patches 5-8 relate to passing a size argument to hugepage helpers to
        disambiguate the size of the referred page. These changes are
        required to enable arch code to properly handle swap entries for
        contiguous pte hugepages.
      
        The changes to huge_pte_offset() (patch 5) touch multiple
        architectures but I've managed to minimise these changes for the
        other affected functions - huge_pte_clear() and set_huge_pte_at().
      
      These patches gate the enabling of contiguous hugepages support on arm64
      which has been requested for systems using !4k page granule.
      
      The ARM64 architecture supports two flavours of hugepages -
      
      * Block mappings at the pud/pmd level
      
        These are regular hugepages where a pmd or a pud page table entry
        points to a block of memory. Depending on the PAGE_SIZE in use the
        following size of block mappings are supported -
      
                PMD	PUD
                ---	---
        4K:      2M	 1G
        16K:    32M
        64K:   512M
      
        For certain applications/usecases such as HPC and large enterprise
        workloads, folks are using 64k page size but the minimum hugepage size
        of 512MB isn't very practical.
      
      To overcome this ...
      
      * Using the Contiguous bit
      
        The architecture provides a contiguous bit in the translation table
        entry which acts as a hint to the mmu to indicate that it is one of a
        contiguous set of entries that can be cached in a single TLB entry.
      
        We use the contiguous bit in Linux to increase the mapping size at the
        pmd and pte (last) level.
      
        The number of supported contiguous entries varies by page size and
        level of the page table.
      
        Using the contiguous bit allows additional hugepage sizes -
      
                 CONT PTE    PMD    CONT PMD    PUD
                 --------    ---    --------    ---
          4K:         64K     2M         32M     1G
          16K:         2M    32M          1G
          64K:         2M   512M         16G
      
        Of these, 64K with 4K and 2M with 64K pages have been explicitly
        requested by a few different users.
      
      Entries with the contiguous bit set are required to be modified all
      together - which makes things like memory poisoning and migration
      impossible to do correctly without knowing the size of hugepage being
      dealt with - the reason for adding size parameter to a few of the
      hugepage helpers in this series.
      
      This patch (of 8):
      
      As we regularly check for contiguous pte's in the huge accessors, remove
      this extra check from find_num_contig.
      
      [punit.agrawal@arm.com: resolve rebase conflicts due to patch re-ordering]
      Link: http://lkml.kernel.org/r/20170524115409.31309-2-punit.agrawal@arm.comSigned-off-by: NSteve Capper <steve.capper@arm.com>
      Signed-off-by: NPunit Agrawal <punit.agrawal@arm.com>
      Cc: David Woods <dwoods@mellanox.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bb9dd3df
    • A
      powerpc/mm/hugetlb: add support for 1G huge pages · 40692eb5
      Aneesh Kumar K.V 提交于
      POWER9 supports hugepages of size 2M and 1G in radix MMU mode.  This
      patch enables the usage of 1G page size for hugetlbfs.  This also update
      the helper such we can do 1G page allocation at runtime.
      
      We still don't enable 1G page size on DD1 version.  This is to avoid
      doing workaround mentioned in commit 6d3a0379 ("powerpc/mm: Add
      radix__tlb_flush_pte_p9_dd1()").
      
      Link: http://lkml.kernel.org/r/1494995292-4443-2-git-send-email-aneesh.kumar@linux.vnet.ibm.comSigned-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      40692eb5
    • A
      mm/hugetlb: clean up ARCH_HAS_GIGANTIC_PAGE · e1073d1e
      Aneesh Kumar K.V 提交于
      This moves the #ifdef in C code to a Kconfig dependency.  Also we move
      the gigantic_page_supported() function to be arch specific.
      
      This allows architectures to conditionally enable runtime allocation of
      gigantic huge page.  Architectures like ppc64 supports different
      gigantic huge page size (16G and 1G) based on the translation mode
      selected.  This provides an opportunity for ppc64 to enable runtime
      allocation only w.r.t 1G hugepage.
      
      No functional change in this patch.
      
      Link: http://lkml.kernel.org/r/1494995292-4443-1-git-send-email-aneesh.kumar@linux.vnet.ibm.comSigned-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e1073d1e
    • A
      powerpc/hugetlb: enable hugetlb migration for ppc64 · f7fb506f
      Aneesh Kumar K.V 提交于
      Link: http://lkml.kernel.org/r/1494926612-23928-10-git-send-email-aneesh.kumar@linux.vnet.ibm.comSigned-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Mike Kravetz <kravetz@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f7fb506f
    • A
      powerpc/mm/hugetlb: remove follow_huge_addr for powerpc · 28c05716
      Aneesh Kumar K.V 提交于
      With generic code now handling hugetlb entries at pgd level and also
      supporting hugepage directory format, we can now remove the powerpc
      sepcific follow_huge_addr implementation.
      
      Link: http://lkml.kernel.org/r/1494926612-23928-9-git-send-email-aneesh.kumar@linux.vnet.ibm.comSigned-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Mike Kravetz <kravetz@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      28c05716
    • A
      powerpc/hugetlb: add follow_huge_pd implementation for ppc64 · 50791e6d
      Aneesh Kumar K.V 提交于
      Link: http://lkml.kernel.org/r/1494926612-23928-8-git-send-email-aneesh.kumar@linux.vnet.ibm.comSigned-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Mike Kravetz <kravetz@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      50791e6d
    • M
      mm, memory_hotplug: replace for_device by want_memblock in arch_add_memory · 3d79a728
      Michal Hocko 提交于
      arch_add_memory gets for_device argument which then controls whether we
      want to create memblocks for created memory sections.  Simplify the
      logic by telling whether we want memblocks directly rather than going
      through pointless negation.  This also makes the api easier to
      understand because it is clear what we want rather than nothing telling
      for_device which can mean anything.
      
      This shouldn't introduce any functional change.
      
      Link: http://lkml.kernel.org/r/20170515085827.16474-13-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Tested-by: NDan Williams <dan.j.williams@intel.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3d79a728
    • M
      mm, memory_hotplug: do not associate hotadded memory to zones until online · f1dd2cd1
      Michal Hocko 提交于
      The current memory hotplug implementation relies on having all the
      struct pages associate with a zone/node during the physical hotplug
      phase (arch_add_memory->__add_pages->__add_section->__add_zone).  In the
      vast majority of cases this means that they are added to ZONE_NORMAL.
      This has been so since 9d99aaa3 ("[PATCH] x86_64: Support memory
      hotadd without sparsemem") and it wasn't a big deal back then because
      movable onlining didn't exist yet.
      
      Much later memory hotplug wanted to (ab)use ZONE_MOVABLE for movable
      onlining 511c2aba ("mm, memory-hotplug: dynamic configure movable
      memory and portion memory") and then things got more complicated.
      Rather than reconsidering the zone association which was no longer
      needed (because the memory hotplug already depended on SPARSEMEM) a
      convoluted semantic of zone shifting has been developed.  Only the
      currently last memblock or the one adjacent to the zone_movable can be
      onlined movable.  This essentially means that the online type changes as
      the new memblocks are added.
      
      Let's simulate memory hot online manually
        $ echo 0x100000000 > /sys/devices/system/memory/probe
        $ grep . /sys/devices/system/memory/memory32/valid_zones
        Normal Movable
      
        $ echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe
        $ grep . /sys/devices/system/memory/memory3?/valid_zones
        /sys/devices/system/memory/memory32/valid_zones:Normal
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
      
        $ echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe
        $ grep . /sys/devices/system/memory/memory3?/valid_zones
        /sys/devices/system/memory/memory32/valid_zones:Normal
        /sys/devices/system/memory/memory33/valid_zones:Normal
        /sys/devices/system/memory/memory34/valid_zones:Normal Movable
      
        $ echo online_movable > /sys/devices/system/memory/memory34/state
        $ grep . /sys/devices/system/memory/memory3?/valid_zones
        /sys/devices/system/memory/memory32/valid_zones:Normal
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Movable Normal
      
      This is an awkward semantic because an udev event is sent as soon as the
      block is onlined and an udev handler might want to online it based on
      some policy (e.g.  association with a node) but it will inherently race
      with new blocks showing up.
      
      This patch changes the physical online phase to not associate pages with
      any zone at all.  All the pages are just marked reserved and wait for
      the onlining phase to be associated with the zone as per the online
      request.  There are only two requirements
      
      	- existing ZONE_NORMAL and ZONE_MOVABLE cannot overlap
      
      	- ZONE_NORMAL precedes ZONE_MOVABLE in physical addresses
      
      the latter one is not an inherent requirement and can be changed in the
      future.  It preserves the current behavior and made the code slightly
      simpler.  This is subject to change in future.
      
      This means that the same physical online steps as above will lead to the
      following state: Normal Movable
      
        /sys/devices/system/memory/memory32/valid_zones:Normal Movable
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
      
        /sys/devices/system/memory/memory32/valid_zones:Normal Movable
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Normal Movable
      
        /sys/devices/system/memory/memory32/valid_zones:Normal Movable
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Movable
      
      Implementation:
      The current move_pfn_range is reimplemented to check the above
      requirements (allow_online_pfn_range) and then updates the respective
      zone (move_pfn_range_to_zone), the pgdat and links all the pages in the
      pfn range with the zone/node.  __add_pages is updated to not require the
      zone and only initializes sections in the range.  This allowed to
      simplify the arch_add_memory code (s390 could get rid of quite some of
      code).
      
      devm_memremap_pages is the only user of arch_add_memory which relies on
      the zone association because it only hooks into the memory hotplug only
      half way.  It uses it to associate the new memory with ZONE_DEVICE but
      doesn't allow it to be {on,off}lined via sysfs.  This means that this
      particular code path has to call move_pfn_range_to_zone explicitly.
      
      The original zone shifting code is kept in place and will be removed in
      the follow up patch for an easier review.
      
      Please note that this patch also changes the original behavior when
      offlining a memory block adjacent to another zone (Normal vs.  Movable)
      used to allow to change its movable type.  This will be handled later.
      
      [richard.weiyang@gmail.com: simplify zone_intersects()]
        Link: http://lkml.kernel.org/r/20170616092335.5177-1-richard.weiyang@gmail.com
      [richard.weiyang@gmail.com: remove duplicate call for set_page_links]
        Link: http://lkml.kernel.org/r/20170616092335.5177-2-richard.weiyang@gmail.com
      [akpm@linux-foundation.org: remove unused local `i']
      Link: http://lkml.kernel.org/r/20170515085827.16474-12-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NWei Yang <richard.weiyang@gmail.com>
      Tested-by: NDan Williams <dan.j.williams@intel.com>
      Tested-by: NReza Arbab <arbab@linux.vnet.ibm.com>
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # For s390 bits
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f1dd2cd1
    • M
      mm, memory_hotplug: get rid of is_zone_device_section · 1b862aec
      Michal Hocko 提交于
      Device memory hotplug hooks into regular memory hotplug only half way.
      It needs memory sections to track struct pages but there is no
      need/desire to associate those sections with memory blocks and export
      them to the userspace via sysfs because they cannot be onlined anyway.
      
      This is currently expressed by for_device argument to arch_add_memory
      which then makes sure to associate the given memory range with
      ZONE_DEVICE.  register_new_memory then relies on is_zone_device_section
      to distinguish special memory hotplug from the regular one.  While this
      works now, later patches in this series want to move __add_zone outside
      of arch_add_memory path so we have to come up with something else.
      
      Add want_memblock down the __add_pages path and use it to control
      whether the section->memblock association should be done.
      arch_add_memory then just trivially want memblock for everything but
      for_device hotplug.
      
      remove_memory_section doesn't need is_zone_device_section either.  We
      can simply skip all the memblock specific cleanup if there is no
      memblock for the given section.
      
      This shouldn't introduce any functional change.
      
      Link: http://lkml.kernel.org/r/20170515085827.16474-5-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Tested-by: NDan Williams <dan.j.williams@intel.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1b862aec
    • H
      mm, THP, swap: delay splitting THP during swap out · 38d8b4e6
      Huang Ying 提交于
      Patch series "THP swap: Delay splitting THP during swapping out", v11.
      
      This patchset is to optimize the performance of Transparent Huge Page
      (THP) swap.
      
      Recently, the performance of the storage devices improved so fast that
      we cannot saturate the disk bandwidth with single logical CPU when do
      page swap out even on a high-end server machine.  Because the
      performance of the storage device improved faster than that of single
      logical CPU.  And it seems that the trend will not change in the near
      future.  On the other hand, the THP becomes more and more popular
      because of increased memory size.  So it becomes necessary to optimize
      THP swap performance.
      
      The advantages of the THP swap support include:
      
       - Batch the swap operations for the THP to reduce lock
         acquiring/releasing, including allocating/freeing the swap space,
         adding/deleting to/from the swap cache, and writing/reading the swap
         space, etc. This will help improve the performance of the THP swap.
      
       - The THP swap space read/write will be 2M sequential IO. It is
         particularly helpful for the swap read, which are usually 4k random
         IO. This will improve the performance of the THP swap too.
      
       - It will help the memory fragmentation, especially when the THP is
         heavily used by the applications. The 2M continuous pages will be
         free up after THP swapping out.
      
       - It will improve the THP utilization on the system with the swap
         turned on. Because the speed for khugepaged to collapse the normal
         pages into the THP is quite slow. After the THP is split during the
         swapping out, it will take quite long time for the normal pages to
         collapse back into the THP after being swapped in. The high THP
         utilization helps the efficiency of the page based memory management
         too.
      
      There are some concerns regarding THP swap in, mainly because possible
      enlarged read/write IO size (for swap in/out) may put more overhead on
      the storage device.  To deal with that, the THP swap in should be turned
      on only when necessary.  For example, it can be selected via
      "always/never/madvise" logic, to be turned on globally, turned off
      globally, or turned on only for VMA with MADV_HUGEPAGE, etc.
      
      This patchset is the first step for the THP swap support.  The plan is
      to delay splitting THP step by step, finally avoid splitting THP during
      the THP swapping out and swap out/in the THP as a whole.
      
      As the first step, in this patchset, the splitting huge page is delayed
      from almost the first step of swapping out to after allocating the swap
      space for the THP and adding the THP into the swap cache.  This will
      reduce lock acquiring/releasing for the locks used for the swap cache
      management.
      
      With the patchset, the swap out throughput improves 15.5% (from about
      3.73GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case
      with 8 processes.  The test is done on a Xeon E5 v3 system.  The swap
      device used is a RAM simulated PMEM (persistent memory) device.  To test
      the sequential swapping out, the test case creates 8 processes, which
      sequentially allocate and write to the anonymous pages until the RAM and
      part of the swap device is used up.
      
      This patch (of 5):
      
      In this patch, splitting huge page is delayed from almost the first step
      of swapping out to after allocating the swap space for the THP
      (Transparent Huge Page) and adding the THP into the swap cache.  This
      will batch the corresponding operation, thus improve THP swap out
      throughput.
      
      This is the first step for the THP swap optimization.  The plan is to
      delay splitting the THP step by step and avoid splitting the THP
      finally.
      
      In this patch, one swap cluster is used to hold the contents of each THP
      swapped out.  So, the size of the swap cluster is changed to that of the
      THP (Transparent Huge Page) on x86_64 architecture (512).  For other
      architectures which want such THP swap optimization,
      ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file for
      the architecture.  In effect, this will enlarge swap cluster size by 2
      times on x86_64.  Which may make it harder to find a free cluster when
      the swap space becomes fragmented.  So that, this may reduce the
      continuous swap space allocation and sequential write in theory.  The
      performance test in 0day shows no regressions caused by this.
      
      In the future of THP swap optimization, some information of the swapped
      out THP (such as compound map count) will be recorded in the
      swap_cluster_info data structure.
      
      The mem cgroup swap accounting functions are enhanced to support charge
      or uncharge a swap cluster backing a THP as a whole.
      
      The swap cluster allocate/free functions are added to allocate/free a
      swap cluster for a THP.  A fair simple algorithm is used for swap
      cluster allocation, that is, only the first swap device in priority list
      will be tried to allocate the swap cluster.  The function will fail if
      the trying is not successful, and the caller will fallback to allocate a
      single swap slot instead.  This works good enough for normal cases.  If
      the difference of the number of the free swap clusters among multiple
      swap devices is significant, it is possible that some THPs are split
      earlier than necessary.  For example, this could be caused by big size
      difference among multiple swap devices.
      
      The swap cache functions is enhanced to support add/delete THP to/from
      the swap cache as a set of (HPAGE_PMD_NR) sub-pages.  This may be
      enhanced in the future with multi-order radix tree.  But because we will
      split the THP soon during swapping out, that optimization doesn't make
      much sense for this first step.
      
      The THP splitting functions are enhanced to support to split THP in swap
      cache during swapping out.  The page lock will be held during allocating
      the swap cluster, adding the THP into the swap cache and splitting the
      THP.  So in the code path other than swapping out, if the THP need to be
      split, the PageSwapCache(THP) will be always false.
      
      The swap cluster is only available for SSD, so the THP swap optimization
      in this patchset has no effect for HDD.
      
      [ying.huang@intel.com: fix two issues in THP optimize patch]
        Link: http://lkml.kernel.org/r/87k25ed8zo.fsf@yhuang-dev.intel.com
      [hannes@cmpxchg.org: extensive cleanups and simplifications, reduce code size]
      Link: http://lkml.kernel.org/r/20170515112522.32457-2-ying.huang@intel.comSigned-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Suggested-by: Andrew Morton <akpm@linux-foundation.org> [for config option]
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> [for changes in huge_memory.c and huge_mm.h]
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      38d8b4e6
    • L
      tile: provide default ioremap declaration · 39229200
      Logan Gunthorpe 提交于
      Add a default ioremap function which was not provided in all
      circumstances.  (Only when CONFIG_PCI and CONFIG_TILEGX was set).
      
      I have designs to use them in scatterlist.c where they'd likely never be
      called with this architecture, but it is needed to compile.  Thus, if
      the function is ever hit it returns NULL.
      
      Link: http://lkml.kernel.org/r/1495726904-27380-1-git-send-email-logang@deltatee.comSigned-off-by: NLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: NStephen Bates <sbates@raithlin.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      39229200
    • T
      mn10300: use generic fb.h · 9cfc5e04
      Tobias Klauser 提交于
      The mn10300 arch uses a verbatim copy of the asm-generic version and
      does not add any own implementations to the header, so use
      asm-generic/fb.h instead of duplicating code.
      
      Link: http://lkml.kernel.org/r/20170517083348.1815-1-tklauser@distanz.chSigned-off-by: NTobias Klauser <tklauser@distanz.ch>
      Reviewed-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9cfc5e04
    • T
      mn10300: remove wrapper header for asm/device.h · dc513164
      Tobias Klauser 提交于
      mn10300's asm/device.h is merely including asm-generic/device.h.  Thus,
      the arch specific header can be omitted and the generic header can be
      used directly.
      
      Link: http://lkml.kernel.org/r/20170517124857.26834-1-tklauser@distanz.chSigned-off-by: NTobias Klauser <tklauser@distanz.ch>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dc513164
  2. 06 7月, 2017 1 次提交
  3. 05 7月, 2017 2 次提交
  4. 04 7月, 2017 1 次提交
  5. 03 7月, 2017 13 次提交
    • T
      parisc: DMA API: return error instead of BUG_ON for dma ops on non dma devs · 33f9e024
      Thomas Bogendoerfer 提交于
      Enabling parport pc driver on a B2600 (and probably other 64bit PARISC
      systems) produced following BUG:
      
      CPU: 0 PID: 1 Comm: swapper Not tainted 4.12.0-rc5-30198-g1132d5e7 #156
      task: 000000009e050000 task.stack: 000000009e04c000
      
           YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
      PSW: 00001000000001101111111100001111 Not tainted
      r00-03  000000ff0806ff0f 000000009e04c990 0000000040871b78 000000009e04cac0
      r04-07  0000000040c14de0 ffffffffffffffff 000000009e07f098 000000009d82d200
      r08-11  000000009d82d210 0000000000000378 0000000000000000 0000000040c345e0
      r12-15  0000000000000005 0000000040c345e0 0000000000000000 0000000040c9d5e0
      r16-19  0000000040c345e0 00000000f00001c4 00000000f00001bc 0000000000000061
      r20-23  000000009e04ce28 0000000000000010 0000000000000010 0000000040b89e40
      r24-27  0000000000000003 0000000000ffffff 000000009d82d210 0000000040c14de0
      r28-31  0000000000000000 000000009e04ca90 000000009e04cb40 0000000000000000
      sr00-03  0000000000000000 0000000000000000 0000000000000000 0000000000000000
      sr04-07  0000000000000000 0000000000000000 0000000000000000 0000000000000000
      
      IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000404aece0 00000000404aece4
       IIR: 03ffe01f    ISR: 0000000010340000  IOR: 000001781304cac8
       CPU:        0   CR30: 000000009e04c000 CR31: 00000000e2976de2
       ORIG_R28: 0000000000000200
       IAOQ[0]: sba_dma_supported+0x80/0xd0
       IAOQ[1]: sba_dma_supported+0x84/0xd0
       RP(r2): parport_pc_probe_port+0x178/0x1200
      
      Cause is a call to dma_coerce_mask_and_coherenet in parport_pc_probe_port,
      which PARISC DMA API doesn't handle very nicely. This commit gives back
      DMA_ERROR_CODE for DMA API calls, if device isn't capable of DMA
      transaction.
      
      Cc: <stable@vger.kernel.org> # v3.13+
      Signed-off-by: NThomas Bogendoerfer <tsbogend@alpha.franken.de>
      Signed-off-by: NHelge Deller <deller@gmx.de>
      33f9e024
    • M
      ARM64: dts: marvell: armada37xx: Fix timer interrupt specifiers · 88cda007
      Marc Zyngier 提交于
      Contrary to popular belief, PPIs connected to a GICv3 to not have
      an affinity field similar to that of GICv2. That is consistent
      with the fact that GICv3 is designed to accomodate thousands of
      CPUs, and fitting them as a bitmap in a byte is... difficult.
      
      Fixes: adbc3695 ("arm64: dts: add the Marvell Armada 3700 family and
      a development board")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: NGregory CLEMENT <gregory.clement@free-electrons.com>
      88cda007
    • P
      x86: kvm: mmu: use ept a/d in vmcs02 iff used in vmcs12 · 995f00a6
      Peter Feiner 提交于
      EPT A/D was enabled in the vmcs02 EPTP regardless of the vmcs12's EPTP
      value. The problem is that enabling A/D changes the behavior of L2's
      x86 page table walks as seen by L1. With A/D enabled, x86 page table
      walks are always treated as EPT writes.
      
      Commit ae1e2d10 ("kvm: nVMX: support EPT accessed/dirty bits",
      2017-03-30) tried to work around this problem by clearing the write
      bit in the exit qualification for EPT violations triggered by page
      walks.  However, that fixup introduced the opposite bug: page-table walks
      that actually set x86 A/D bits were *missing* the write bit in the exit
      qualification.
      
      This patch fixes the problem by disabling EPT A/D in the shadow MMU
      when EPT A/D is disabled in vmcs12's EPTP.
      Signed-off-by: NPeter Feiner <pfeiner@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      995f00a6
    • A
      ARM: owl: smp: Drop bogus holding pen · 18cfd942
      Andreas Färber 提交于
      The S500 SoC can start secondary CPUs without busy-looping for pen_release,
      so simplify the SMP code compared to the LeMaker kernel tree.
      
      Fixes: 172067e0 ("ARM: owl: Implement CPU enable-method for S500")
      Suggested-by: NArnd Bergmann <arnd@arndb.de>
      Cc: David Liu <liuwei@actions-semi.com>
      Signed-off-by: NAndreas Färber <afaerber@suse.de>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      18cfd942
    • A
      ARM: owl: Drop custom machine · eb382745
      Andreas Färber 提交于
      Rely on the fallback to "Generic DT based system".
      This change is visible in /proc/cpuinfo.
      
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: NAndreas Färber <afaerber@suse.de>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      eb382745
    • M
      x86/xen: allow userspace access during hypercalls · c54590ca
      Marek Marczykowski-Górecki 提交于
      Userspace application can do a hypercall through /dev/xen/privcmd, and
      some for some hypercalls argument is a pointers to user-provided
      structure. When SMAP is supported and enabled, hypervisor can't access.
      So, lets allow it.
      
      The same applies to HYPERVISOR_dm_op, where additionally privcmd driver
      carefully verify buffer addresses.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NMarek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
      Reviewed-by: NJuergen Gross <jgross@suse.com>
      Signed-off-by: NJuergen Gross <jgross@suse.com>
      c54590ca
    • G
      x86: xen: remove unnecessary variable in xen_foreach_remap_area() · bf1b9ddf
      Gustavo A. R. Silva 提交于
      Remove unnecessary variable mfn in function xen_foreach_remap_area() and,
      refactor the code.
      
      Variable mfn at line 518:mfn = xen_remap_buf.mfns[i];
      is only being used to store a value to be passed as
      an argument to the xen_update_mem_tables() function.
      This value can be passed directly, which makes variable
      mfn unnecessary. Also, value assigned to variable mfn
      at line 534:mfn = xen_remap_mfn; is never used.
      
      Addresses-Coverity-ID: 1260110
      Signed-off-by: NGustavo A. R. Silva <garsilva@embeddedor.com>
      Reviewed-by: NJuergen Gross <jgross@suse.com>
      Signed-off-by: NJuergen Gross <jgross@suse.com>
      bf1b9ddf
    • P
      kvm: x86: mmu: allow A/D bits to be disabled in an mmu · ac8d57e5
      Peter Feiner 提交于
      Adds the plumbing to disable A/D bits in the MMU based on a new role
      bit, ad_disabled. When A/D is disabled, the MMU operates as though A/D
      aren't available (i.e., using access tracking faults instead).
      
      To avoid SP -> kvm_mmu_page.role.ad_disabled lookups all over the
      place, A/D disablement is now stored in the SPTE. This state is stored
      in the SPTE by tweaking the use of SPTE_SPECIAL_MASK for access
      tracking. Rather than just setting SPTE_SPECIAL_MASK when an
      access-tracking SPTE is non-present, we now always set
      SPTE_SPECIAL_MASK for access-tracking SPTEs.
      Signed-off-by: NPeter Feiner <pfeiner@google.com>
      [Use role.ad_disabled even for direct (non-shadow) EPT page tables.  Add
       documentation and a few MMU_WARN_ONs. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ac8d57e5
    • P
      x86: kvm: mmu: make spte mmio mask more explicit · dcdca5fe
      Peter Feiner 提交于
      Specify both a mask (i.e., bits to consider) and a value (i.e.,
      pattern of bits that indicates a special PTE) for mmio SPTEs. On
      Intel, this lets us pack even more information into the
      (SPTE_SPECIAL_MASK | EPT_VMX_RWX_MASK) mask we use for access
      tracking liberating all (SPTE_SPECIAL_MASK | (non-misconfigured-RWX))
      values.
      Signed-off-by: NPeter Feiner <pfeiner@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      dcdca5fe
    • P
      x86: kvm: mmu: dead code thanks to access tracking · ce00053b
      Peter Feiner 提交于
      The MMU always has hardware A bits or access tracking support, thus
      it's unnecessary to handle the scenario where we have neither.
      Signed-off-by: NPeter Feiner <pfeiner@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ce00053b
    • P
      KVM: PPC: Book3S: Fix typo in XICS-on-XIVE state saving code · 00c14757
      Paul Mackerras 提交于
      This fixes a typo where the wrong loop index was used to index
      the kvmppc_xive_vcpu.queues[] array in xive_pre_save_scan().
      The variable i contains the vcpu number; we need to index queues[]
      using j, which iterates from 0 to KVMPPC_XIVE_Q_COUNT-1.
      
      The effect of this bug is that things that save the interrupt
      controller state, such as "virsh dump", on a VM with more than
      8 vCPUs, result in xive_pre_save_queue() getting called on a
      bogus queue structure, usually resulting in a crash like this:
      
      [  501.821107] Unable to handle kernel paging request for data at address 0x00000084
      [  501.821212] Faulting instruction address: 0xc008000004c7c6f8
      [  501.821234] Oops: Kernel access of bad area, sig: 11 [#1]
      [  501.821305] SMP NR_CPUS=1024
      [  501.821307] NUMA
      [  501.821376] PowerNV
      [  501.821470] Modules linked in: vhost_net vhost tap xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables ses enclosure scsi_transport_sas ipmi_powernv ipmi_devintf ipmi_msghandler powernv_op_panel kvm_hv nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc kvm tg3 ptp pps_core
      [  501.822477] CPU: 3 PID: 3934 Comm: live_migration Not tainted 4.11.0-4.git8caa70f.el7.centos.ppc64le #1
      [  501.822633] task: c0000003f9e3ae80 task.stack: c0000003f9ed4000
      [  501.822745] NIP: c008000004c7c6f8 LR: c008000004c7c628 CTR: 0000000030058018
      [  501.822877] REGS: c0000003f9ed7980 TRAP: 0300   Not tainted  (4.11.0-4.git8caa70f.el7.centos.ppc64le)
      [  501.823030] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>
      [  501.823047]   CR: 28022244  XER: 00000000
      [  501.823203] CFAR: c008000004c7c77c DAR: 0000000000000084 DSISR: 40000000 SOFTE: 1
      [  501.823203] GPR00: c008000004c7c628 c0000003f9ed7c00 c008000004c91450 00000000000000ff
      [  501.823203] GPR04: c0000003f5580000 c0000003f559bf98 9000000000009033 0000000000000000
      [  501.823203] GPR08: 0000000000000084 0000000000000000 00000000000001e0 9000000000001003
      [  501.823203] GPR12: c00000000008a7d0 c00000000fdc1b00 000000000a9a0000 0000000000000000
      [  501.823203] GPR16: 00000000402954e8 000000000a9a0000 0000000000000004 0000000000000000
      [  501.823203] GPR20: 0000000000000008 c000000002e8f180 c000000002e8f1e0 0000000000000001
      [  501.823203] GPR24: 0000000000000008 c0000003f5580008 c0000003f4564018 c000000002e8f1e8
      [  501.823203] GPR28: 00003ff6e58bdc28 c0000003f4564000 0000000000000000 0000000000000000
      [  501.825441] NIP [c008000004c7c6f8] xive_get_attr+0x3b8/0x5b0 [kvm]
      [  501.825671] LR [c008000004c7c628] xive_get_attr+0x2e8/0x5b0 [kvm]
      [  501.825887] Call Trace:
      [  501.825991] [c0000003f9ed7c00] [c008000004c7c628] xive_get_attr+0x2e8/0x5b0 [kvm] (unreliable)
      [  501.826312] [c0000003f9ed7cd0] [c008000004c62ec4] kvm_device_ioctl_attr+0x64/0xa0 [kvm]
      [  501.826581] [c0000003f9ed7d20] [c008000004c62fcc] kvm_device_ioctl+0xcc/0xf0 [kvm]
      [  501.826843] [c0000003f9ed7d40] [c000000000350c70] do_vfs_ioctl+0xd0/0x8c0
      [  501.827060] [c0000003f9ed7de0] [c000000000351534] SyS_ioctl+0xd4/0xf0
      [  501.827282] [c0000003f9ed7e30] [c00000000000b8e0] system_call+0x38/0xfc
      [  501.827496] Instruction dump:
      [  501.827632] 419e0078 3b760008 e9160008 83fb000c 83db0010 80fb0008 2f280000 60000000
      [  501.827901] 60000000 60420000 419a0050 7be91764 <7d284c2c> 552a0ffe 7f8af040 419e003c
      [  501.828176] ---[ end trace 2d0529a5bbbbafed ]---
      
      Cc: stable@vger.kernel.org
      Fixes: 5af50993 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
      Acked-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      00c14757
    • H
      parisc: Report SIGSEGV instead of SIGBUS when running out of stack · 24746231
      Helge Deller 提交于
      When a process runs out of stack the parisc kernel wrongly faults with SIGBUS
      instead of the expected SIGSEGV signal.
      
      This example shows how the kernel faults:
      do_page_fault() command='a.out' type=15 address=0xfaac2000 in libc-2.24.so[f8308000+16c000]
      trap #15: Data TLB miss fault, vm_start = 0xfa2c2000, vm_end = 0xfaac2000
      
      The vma->vm_end value is the first address which does not belong to the vma, so
      adjust the check to include vma->vm_end to the range for which to send the
      SIGSEGV signal.
      
      This patch unbreaks building the debian libsigsegv package.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NHelge Deller <deller@gmx.de>
      24746231
    • E
      parisc: use compat_sys_keyctl() · b0f94efd
      Eric Biggers 提交于
      Architectures with a compat syscall table must put compat_sys_keyctl()
      in it, not sys_keyctl().  The parisc architecture was not doing this;
      fix it.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Acked-by: NHelge Deller <deller@gmx.de>
      Signed-off-by: NHelge Deller <deller@gmx.de>
      b0f94efd
  6. 02 7月, 2017 1 次提交
  7. 01 7月, 2017 5 次提交
    • P
      KVM: PPC: Book3S HV: Close race with testing for signals on guest entry · 8b24e69f
      Paul Mackerras 提交于
      At present, interrupts are hard-disabled fairly late in the guest
      entry path, in the assembly code.  Since we check for pending signals
      for the vCPU(s) task(s) earlier in the guest entry path, it is
      possible for a signal to be delivered before we enter the guest but
      not be noticed until after we exit the guest for some other reason.
      
      Similarly, it is possible for the scheduler to request a reschedule
      while we are in the guest entry path, and we won't notice until after
      we have run the guest, potentially for a whole timeslice.
      
      Furthermore, with a radix guest on POWER9, we can take the interrupt
      with the MMU on.  In this case we end up leaving interrupts
      hard-disabled after the guest exit, and they are likely to stay
      hard-disabled until we exit to userspace or context-switch to
      another process.  This was masking the fact that we were also not
      setting the RI (recoverable interrupt) bit in the MSR, meaning
      that if we had taken an interrupt, it would have crashed the host
      kernel with an unrecoverable interrupt message.
      
      To close these races, we need to check for signals and reschedule
      requests after hard-disabling interrupts, and then keep interrupts
      hard-disabled until we enter the guest.  If there is a signal or a
      reschedule request from another CPU, it will send an IPI, which will
      cause a guest exit.
      
      This puts the interrupt disabling before we call kvmppc_start_thread()
      for all the secondary threads of this core that are going to run vCPUs.
      The reason for that is that once we have started the secondary threads
      there is no easy way to back out without going through at least part
      of the guest entry path.  However, kvmppc_start_thread() includes some
      code for radix guests which needs to call smp_call_function(), which
      must be called with interrupts enabled.  To solve this problem, this
      patch moves that code into a separate function that is called earlier.
      
      When the guest exit is caused by an external interrupt, a hypervisor
      doorbell or a hypervisor maintenance interrupt, we now handle these
      using the replay facility.  __kvmppc_vcore_entry() now returns the
      trap number that caused the exit on this thread, and instead of the
      assembly code jumping to the handler entry, we return to C code with
      interrupts still hard-disabled and set the irq_happened flag in the
      PACA, so that when we do local_irq_enable() the appropriate handler
      gets called.
      
      With all this, we now have the interrupt soft-enable flag clear while
      we are in the guest.  This is useful because code in the real-mode
      hypercall handlers that checks whether interrupts are enabled will
      now see that they are disabled, which is correct, since interrupts
      are hard-disabled in the real-mode code.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      8b24e69f
    • P
      KVM: PPC: Book3S HV: Simplify dynamic micro-threading code · 898b25b2
      Paul Mackerras 提交于
      Since commit b009031f ("KVM: PPC: Book3S HV: Take out virtual
      core piggybacking code", 2016-09-15), we only have at most one
      vcore per subcore.  Previously, the fact that there might be more
      than one vcore per subcore meant that we had the notion of a
      "master vcore", which was the vcore that controlled thread 0 of
      the subcore.  We also needed a list per subcore in the core_info
      struct to record which vcores belonged to each subcore.  Now that
      there can only be one vcore in the subcore, we can replace the
      list with a simple pointer and get rid of the notion of the
      master vcore (and in fact treat every vcore as a master vcore).
      
      We can also get rid of the subcore_vm[] field in the core_info
      struct since it is never read.
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      898b25b2
    • V
      x86/intel_rdt: Fix memory leak on mount failure · 79298acc
      Vikas Shivappa 提交于
      If mount fails, the kn_info directory is not freed causing memory leak.
      
      Add the missing error handling path.
      
      Fixes: 4e978d06 ("x86/intel_rdt: Add "info" files to resctrl file system")
      Signed-off-by: NVikas Shivappa <vikas.shivappa@linux.intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: ravi.v.shankar@intel.com
      Cc: tony.luck@intel.com
      Cc: fenghua.yu@intel.com
      Cc: peterz@infradead.org
      Cc: vikas.shivappa@intel.com
      Cc: andi.kleen@intel.com
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/1498503368-20173-3-git-send-email-vikas.shivappa@linux.intel.com
      79298acc
    • A
      ARM: Prepare for randomized task_struct · ffa47aa6
      Arnd Bergmann 提交于
      With the new task struct randomization, we can run into a build
      failure for certain random seeds, which will place fields beyond
      the allow immediate size in the assembly:
      
      arch/arm/kernel/entry-armv.S: Assembler messages:
      arch/arm/kernel/entry-armv.S:803: Error: bad immediate value for offset (4096)
      
      Only two constants in asm-offset.h are affected, and I'm changing
      both of them here to work correctly in all configurations.
      
      One more macro has the problem, but is currently unused, so this
      removes it instead of adding complexity.
      Suggested-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      [kees: Adjust commit log slightly]
      Signed-off-by: NKees Cook <keescook@chromium.org>
      ffa47aa6
    • V
      ARM: dma-mapping: Remove traces of NOMMU code · 1655cf88
      Vladimir Murzin 提交于
      DMA operations for NOMMU case have been just factored out into
      separate compilation unit, so don't keep dead code.
      Tested-by: NBenjamin Gaignard <benjamin.gaignard@linaro.org>
      Tested-by: NAndras Szemzo <sza@esh.hu>
      Tested-by: NAlexandre TORGUE <alexandre.torgue@st.com>
      Signed-off-by: NVladimir Murzin <vladimir.murzin@arm.com>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NRussell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      1655cf88