1. 06 Mar, 2019 12 commits
    • A
      mm/hugetlb: enable arch specific huge page size support for migration · e693de18
      Anshuman Khandual committed
      Architectures like arm64 have HugeTLB page sizes which are different
      than generic sizes at PMD, PUD, PGD level and implemented via contiguous
      bits.  At present these special size HugeTLB pages cannot be identified
      through macros like (PMD|PUD|PGDIR)_SHIFT and hence are chosen not to be
      migrated.
      
      Enabling migration support for these special HugeTLB page sizes along
      with the generic ones (PMD|PUD|PGD) would require identifying all of
      them on a given platform.  A platform specific hook can precisely
      enumerate all huge page sizes supported for migration.  Instead of
      comparing against standard huge page orders let
      hugetlb_migration_support() function call a platform hook
      arch_hugetlb_migration_support().  Default definition for the platform
      hook maintains existing semantics which checks standard huge page order.
      But an architecture can choose to override the default and provide
      support for a comprehensive set of huge page sizes.
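
The hook-with-default pattern described above can be sketched in plain user-space C. This is only an illustrative model, not the kernel code: the shift values assume a 4K base page configuration, `struct hstate` is reduced to a single field, and the `toy_*` function names are hypothetical.

```c
#include <stdbool.h>

/* Assumed shift values for a 4K base page configuration. */
#define PAGE_SHIFT	12
#define PMD_SHIFT	21
#define PUD_SHIFT	30
#define PGDIR_SHIFT	39

/* Assumed contiguous-bit huge page sizes for the same configuration. */
#define CONT_PTE_SIZE	(1ULL << 16)	/* 64K */
#define CONT_PMD_SIZE	(1ULL << 25)	/* 32M */

/* Simplified stand-in for the kernel's struct hstate. */
struct hstate {
	unsigned int order;	/* huge page order relative to PAGE_SHIFT */
};

static unsigned int huge_page_shift(const struct hstate *h)
{
	return h->order + PAGE_SHIFT;
}

/* Default hook: only the generic page table levels are accepted. */
static bool toy_default_migration_supported(const struct hstate *h)
{
	unsigned int shift = huge_page_shift(h);

	return shift == PMD_SHIFT || shift == PUD_SHIFT ||
	       shift == PGDIR_SHIFT;
}

/* Hypothetical arch override: enumerate every supported size explicitly. */
static bool toy_arch_migration_supported(const struct hstate *h)
{
	unsigned long long size = 1ULL << huge_page_shift(h);

	return size == CONT_PTE_SIZE || size == (1ULL << PMD_SHIFT) ||
	       size == CONT_PMD_SIZE || size == (1ULL << PUD_SHIFT);
}
```

With these assumptions, a 64K huge page (order 4) is rejected by the default check but accepted by the override, while a 2M huge page (order 9) is accepted by both.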
      
      Link: http://lkml.kernel.org/r/1545121450-1663-4-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Steve Capper <steve.capper@arm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e693de18
    • A
      mm/hugetlb: enable PUD level huge page migration · 9b553bf5
      Anshuman Khandual committed
      Architectures like arm64 have PUD level HugeTLB pages for certain configs
      (1GB huge page is PUD based on ARM64_4K_PAGES base page size) that can
      be enabled for migration.  It can be achieved through checking for
      PUD_SHIFT order based HugeTLB pages during migration.
      
      Link: http://lkml.kernel.org/r/1545121450-1663-3-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Steve Capper <steve.capper@arm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9b553bf5
    • A
      mm/hugetlb: distinguish between migratability and movability · 7ed2c31d
      Anshuman Khandual committed
      Patch series "arm64/mm: Enable HugeTLB migration", v4.
      
      This patch series enables HugeTLB migration support for all supported
      huge page sizes at all levels including contiguous bit implementation.
      Following HugeTLB migration support matrix has been enabled with this
      patch series.  All permutations have been tested except for the 16GB.
      
                 CONT PTE    PMD    CONT PMD    PUD
                 --------    ---    --------    ---
        4K:         64K     2M         32M     1G
        16K:         2M    32M          1G
        64K:         2M   512M         16G
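
The sizes in the matrix above follow from simple arithmetic on the base page size. The sketch below reproduces that arithmetic under two assumptions that are not stated in the commit message: page table entries are 8 bytes wide, and the contiguous-entry counts are the ones implied by the table itself (16/16 for 4K, 128/32 for 16K, 32/32 for 64K). The struct and function names are hypothetical.

```c
#include <stdint.h>

struct granule {
	uint64_t page_size;	/* base page size in bytes */
	unsigned int cont_ptes;	/* PTEs grouped into one CONT PTE page */
	unsigned int cont_pmds;	/* PMDs grouped into one CONT PMD page */
};

/* Counts below are inferred from the table, not taken from the source. */
static const struct granule granule_4k  = {  4096,  16, 16 };
static const struct granule granule_16k = { 16384, 128, 32 };
static const struct granule granule_64k = { 65536,  32, 32 };

/* One table occupies one page and holds 8-byte entries (assumption). */
static uint64_t entries_per_table(const struct granule *g)
{
	return g->page_size / 8;
}

static uint64_t cont_pte_size(const struct granule *g)
{
	return g->cont_ptes * g->page_size;
}

static uint64_t pmd_size(const struct granule *g)
{
	return g->page_size * entries_per_table(g);
}

static uint64_t cont_pmd_size(const struct granule *g)
{
	return g->cont_pmds * pmd_size(g);
}

static uint64_t pud_size(const struct granule *g)
{
	return pmd_size(g) * entries_per_table(g);
}
```

For the 4K granule this yields 64K, 2M, 32M and 1G, matching the first row of the matrix.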
      
      First the series adds migration support for PUD based huge pages.  It
      then adds a platform specific hook to query an architecture if a given
      huge page size is supported for migration while also providing a default
      fallback option preserving the existing semantics which just checks for
      (PMD|PUD|PGDIR)_SHIFT macros.  The last two patches enable HugeTLB
      migration on arm64 and subscribe to this new platform specific hook by
      defining an override.
      
      The second patch differentiates between movability and migratability
      aspects of huge pages and implements hugepage_movable_supported() which
      can then be used during allocation to decide whether to place the huge
      page in movable zone or not.
      
      This patch (of 5):
      
      During huge page allocation its migratability is checked to determine
      if it should be placed under movable zones with GFP_HIGHUSER_MOVABLE.
      But the movability aspect of the huge page could depend on other factors
      than just migratability.  Movability in itself is a distinct property
      which should not be tied with migratability alone.
      
      This differentiates these two and implements an enhanced movability check
      which also considers huge page size to determine if it is feasible to be
      placed under a movable zone.  At present it just checks for gigantic pages
      but going forward it can incorporate other enhanced checks.
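
The distinction between the two properties can be modelled as a small predicate. This is a user-space sketch only: `struct hstate` is reduced to the fields the example needs, the `migration_supported` flag stands in for the real migration check, and `MAX_ORDER` is an assumed value.

```c
#include <stdbool.h>

#define MAX_ORDER 11	/* assumed buddy allocator limit */

/* Simplified stand-in for the kernel's struct hstate. */
struct hstate {
	unsigned int order;
	bool migration_supported;	/* stand-in for the migration check */
};

/* A gigantic page is one the buddy allocator cannot hand out directly. */
static bool hstate_is_gigantic(const struct hstate *h)
{
	return h->order >= MAX_ORDER;
}

static bool hugepage_migration_supported(const struct hstate *h)
{
	return h->migration_supported;
}

/*
 * Movability requires migratability, but adds a size check on top:
 * a migratable gigantic page is still not placed in a movable zone.
 */
static bool hugepage_movable_supported(const struct hstate *h)
{
	if (!hugepage_migration_supported(h))
		return false;
	if (hstate_is_gigantic(h))
		return false;
	return true;
}
```

In this model a migratable order-9 page is movable, while a migratable order-18 page is not, which is the case the commit message singles out.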
      
      Link: http://lkml.kernel.org/r/1545121450-1663-2-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: Steve Capper <steve.capper@arm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Suggested-by: Michal Hocko <mhocko@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7ed2c31d
    • M
      mm: remove sysctl_extfrag_handler() · 6b7e5cad
      Matthew Wilcox committed
      sysctl_extfrag_handler() neglects to propagate the return value from
      proc_dointvec_minmax() to its caller.  It's a wrapper that doesn't need
      to exist, so just use proc_dointvec_minmax() directly.
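
The class of bug described here, a wrapper that discards the result of the function it wraps, can be shown with a toy model. Both functions below are hypothetical stand-ins written for illustration; they do not reproduce the kernel's signatures.

```c
#include <errno.h>

/* Toy stand-in for a range-checked integer store. */
static int toy_dointvec_minmax(int *dst, int val, int min, int max)
{
	if (val < min || val > max)
		return -EINVAL;
	*dst = val;
	return 0;
}

/*
 * Wrapper with the reported flaw: the inner result is dropped,
 * so the caller always sees success even when the write was rejected.
 */
static int toy_extfrag_handler(int *dst, int val, int min, int max)
{
	toy_dointvec_minmax(dst, val, min, max);
	return 0;
}
```

Calling the inner function directly returns `-EINVAL` for an out-of-range value, whereas the wrapper reports 0 for the same input. Removing the wrapper removes the discrepancy.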
      
      Link: http://lkml.kernel.org/r/20190104032557.3056-1-willy@infradead.org
      Signed-off-by: Matthew Wilcox <willy@infradead.org>
      Reported-by: Aditya Pakki <pakki001@umn.edu>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6b7e5cad
    • S
      memcg: localize memcg_kmem_enabled() check · 60cd4bcd
      Shakeel Butt committed
      Move the memcg_kmem_enabled() checks into memcg kmem charge/uncharge
      functions, so that users don't have to explicitly check that condition.
      
      This is purely a code cleanup patch without any functional change.  Only
      the order of checks in memcg_charge_slab() can potentially change,
      but functionally it will be the same.  This should not matter as
      memcg_charge_slab() is not in the hot path.
      
      Link: http://lkml.kernel.org/r/20190103161203.162375-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      60cd4bcd
    • K
      mm: reuse only-pte-mapped KSM page in do_wp_page() · 52d1e606
      Kirill Tkhai committed
      Add an optimization for KSM pages, almost in the same way that we have
      for ordinary anonymous pages.  If a write fault hits a page that is
      mapped by only one pte and is not related to the swap cache, the
      page may be reused without copying its content.
      
      [ Note that we do not consider PageSwapCache() pages at least for now,
        since we don't want to complicate __get_ksm_page(), which has a nice
        optimization based on this (for the migration case). Currently it is
        spinning on PageSwapCache() pages, waiting until their counters are
        unfrozen (i.e., until the migration finishes). But we don't want
        to make it also spin on swap cache pages, which we try to reuse,
        since the probability of reusing them is not very high. So, for now
        we do not consider PageSwapCache() pages at all. ]
      
      So in reuse_ksm_page() we check for 1) PageSwapCache() and 2)
      page_stable_node(), to skip a page which KSM is currently trying to
      link to the stable tree.  Then we do page_ref_freeze() to prevent KSM
      from merging one more page into the page we are reusing.  After that,
      nobody can refer to the page being reused: KSM skips !PageSwapCache()
      pages with zero refcount; and the protection against all other
      participants is the same as for reused ordinary anon pages: pte lock,
      page lock and mmap_sem.
      
      [akpm@linux-foundation.org: replace BUG_ON()s with WARN_ON()s]
      Link: http://lkml.kernel.org/r/154471491016.31352.1168978849911555609.stgit@localhost.localdomain
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      52d1e606
    • A
      mm: replace all open encodings for NUMA_NO_NODE · 98fa15f3
      Anshuman Khandual committed
      Patch series "Replace all open encodings for NUMA_NO_NODE", v3.
      
      All these places for replacement were found by running the following
      grep patterns on the entire kernel code.  Please let me know if this
      might have missed some instances.  This might also have replaced some
      false positives.  I will appreciate suggestions, inputs and review.
      
      1. git grep "nid == -1"
      2. git grep "node == -1"
      3. git grep "nid = -1"
      4. git grep "node = -1"
      
      This patch (of 2):
      
      At present there are multiple places where invalid node number is
      encoded as -1.  Even though implicitly understood it is always better to
      have macros in there.  Replace these open encodings for an invalid node
      number with the global macro NUMA_NO_NODE.  This helps remove NUMA
      related assumptions like 'invalid node' from various places redirecting
      them to a common definition.
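
As a small illustration of the replacement, the two helpers below compare an open-coded check with one that uses the macro. The helper names are hypothetical; the macro value matches the -1 encoding the commit message describes.

```c
#include <stdbool.h>

#define NUMA_NO_NODE	(-1)

/* Before: the "invalid node" convention is open-coded as -1. */
static bool toy_node_valid_open_coded(int nid)
{
	return nid != -1;
}

/* After: the same check, with the intent carried by the macro name. */
static bool toy_node_valid(int nid)
{
	return nid != NUMA_NO_NODE;
}
```

Both helpers behave identically; the change is purely about making the convention visible at each use.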
      
      Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	[ixgbe]
      Acked-by: Jens Axboe <axboe@kernel.dk>			[mtip32xx]
      Acked-by: Vinod Koul <vkoul@kernel.org>			[dmaengine.c]
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>		[powerpc]
      Acked-by: Doug Ledford <dledford@redhat.com>		[drivers/infiniband]
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Hans Verkuil <hverkuil@xs4all.nl>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      98fa15f3
    • D
      mm: convert PG_balloon to PG_offline · ca215086
      David Hildenbrand committed
      PG_balloon was introduced to implement page migration/compaction for
      pages inflated in virtio-balloon.  Nowadays, it is only a marker that a
      page is part of virtio-balloon and therefore logically offline.
      
      We also want to make use of this flag in other balloon drivers - for
      inflated pages or when onlining a section but keeping some pages offline
      (e.g.  used right now by XEN and Hyper-V via set_online_page_callback()).
      
      We are going to expose this flag to dump tools like makedumpfile.  But
      instead of exposing PG_balloon, let's generalize the concept of marking
      pages as logically offline, so it can be reused for other purposes later
      on.
      
      Rename PG_balloon to PG_offline.  This is an indicator that the page is
      logically offline, the content stale and that it should not be touched
      (e.g.  a hypervisor would have to allocate backing storage in order for
      the guest to dump an unused page).  We can then e.g.  exclude such pages
      from dumps.
      
      We replace and reuse KPF_BALLOON (23), as this shouldn't really harm
      (and for now the semantics stay the same).  In following patches, we
      will make use of this bit also in other balloon drivers.  While at it,
      document PGTABLE.
      
      [akpm@linux-foundation.org: fix comment text, per David]
      Link: http://lkml.kernel.org/r/20181119101616.8901-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Pankaj gupta <pagupta@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Christian Hansen <chansen3@cisco.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Julien Freche <jfreche@vmware.com>
      Cc: Kairui Song <kasong@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Lianbo Jiang <lijiang@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xavier Deguillard <xdeguillard@vmware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ca215086
    • D
      mm: balloon: update comment about isolation/migration/compaction · 4d3467e1
      David Hildenbrand committed
      Patch series "mm/kdump: allow to exclude pages that are logically
      offline"
      
      Right now, pages inflated as part of a balloon driver will be dumped by
      dump tools like makedumpfile.  While XEN is able to check in the crash
      kernel whether a certain pfn is actually backed by memory in the
      hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps of
      virtio-balloon, hv-balloon and VMWare balloon inflated memory will
      essentially result in zero pages getting allocated by the hypervisor and
      the dump getting filled with this data.
      
      The allocation and reading of zero pages can directly be avoided if a
      dumping tool could know which pages only contain stale information not
      to be dumped.
      
      Also for XEN, calling into the kernel and asking the hypervisor if a pfn
      is backed can be avoided if the dumping tool would skip such pages right
      from the beginning.
      
      Dumping tools have no idea whether a given page is part of a balloon
      driver and shall not be dumped.  Esp.  PG_reserved cannot be used for
      that purpose as all memory allocated during early boot is also
      PG_reserved, see discussion at [1].  So some other way of indication is
      required and a new page flag is frowned upon.
      
      We have PG_balloon (MAPCOUNT value), which is essentially unused now.  I
      suggest renaming it to something more generic (PG_offline) to mark pages
      as logically offline.  This flag can then e.g.  also be used by
      virtio-mem in the future to mark subsections as offline.  Or by other
      code that wants to put pages logically offline (e.g.  later maybe
      poisoned pages that shall no longer be used).
      
      This series converts PG_balloon to PG_offline, allows dumping tools to
      query the value to detect such pages and marks pages in the hv-balloon
      and XEN balloon properly as PG_offline.  Note that virtio-balloon
      already set pages to PG_balloon (and now PG_offline).
      
      Please note that this is also helpful for a problem we were seeing under
      Hyper-V: Dumping logically offline memory (pages kept fake offline while
      onlining a section via online_page_callback) would under some conditions
      result in a kernel panic when dumping them.
      
      As I have access to neither XEN nor Hyper-V nor VMWare
      installations, this was only tested with the virtio-balloon and pages
      were properly skipped when dumping.  I'll also attach the makedumpfile
      patch to this series.
      
      [1] https://lkml.org/lkml/2018/7/20/566
      
      This patch (of 8):
      
      Commit b1123ea6 ("mm: balloon: use general non-lru movable page
      feature") reworked balloon handling to make use of the general non-lru
      movable page feature.  The big comment block in balloon_compaction.h
      contains quite some outdated information.  Let's fix this.
      
      Link: http://lkml.kernel.org/r/20181119101616.8901-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Christian Hansen <chansen3@cisco.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Julien Freche <jfreche@vmware.com>
      Cc: Kairui Song <kasong@redhat.com>
      Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Lianbo Jiang <lijiang@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Pankaj gupta <pagupta@redhat.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xavier Deguillard <xdeguillard@vmware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4d3467e1
    • A
      mm/page_alloc.c: memory hotplug: free pages as higher order · a9cd410a
      Arun KS committed
      When pages are freed at a higher order, the time the buddy allocator
      spends coalescing pages can be reduced.  With section size of 256MB,
      hot add latency of a single section shows improvement from 50-60 ms to
      less than 1 ms, hence improving the hot add latency by 60 times.  Modify
      external providers of online callback to align with the change.
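
A rough way to see where the saving comes from is to count how many free operations a section needs at a given order. The helper below is a hypothetical back-of-the-envelope calculation, not the kernel's logic, and the 4K page size and order 10 used in the note after it are assumptions rather than values from the commit message.

```c
#include <stdint.h>

/*
 * Number of free operations needed to hand a section to the allocator
 * when each operation releases a block of 2^order pages.
 */
static uint64_t toy_free_calls(uint64_t section_bytes, uint64_t page_bytes,
			       unsigned int order)
{
	uint64_t pages = section_bytes / page_bytes;

	return pages >> order;
}
```

Assuming 4K pages, a 256MB section is 65536 order-0 frees, each of which may trigger coalescing, versus 64 frees at order 10.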
      
      [arunks@codeaurora.org: v11]
        Link: http://lkml.kernel.org/r/1547792588-18032-1-git-send-email-arunks@codeaurora.org
      [akpm@linux-foundation.org: remove unused local, per Arun]
      [akpm@linux-foundation.org: avoid return of void-returning __free_pages_core(), per Oscar]
      [akpm@linux-foundation.org: fix it for mm-convert-totalram_pages-and-totalhigh_pages-variables-to-atomic.patch]
      [arunks@codeaurora.org: v8]
        Link: http://lkml.kernel.org/r/1547032395-24582-1-git-send-email-arunks@codeaurora.org
      [arunks@codeaurora.org: v9]
        Link: http://lkml.kernel.org/r/1547098543-26452-1-git-send-email-arunks@codeaurora.org
      Link: http://lkml.kernel.org/r/1538727006-5727-1-git-send-email-arunks@codeaurora.org
      Signed-off-by: Arun KS <arunks@codeaurora.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Srivatsa Vaddagiri <vatsa@codeaurora.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9cd410a
    • T
      include/linux/slub_def.h: comment fixes · de810f49
      Tobin C. Harding committed
      Capitalize comment string, use C89 comment style, correct
      grammar/punctuation in comments.
      
      Link: http://lkml.kernel.org/r/20190204005713.9463-2-tobin@kernel.org
      Link: http://lkml.kernel.org/r/20190204005713.9463-3-tobin@kernel.org
      Link: http://lkml.kernel.org/r/20190204005713.9463-4-tobin@kernel.org
      Signed-off-by: Tobin C. Harding <tobin@kernel.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      de810f49
    • A
      kasan: fix kasan_check_read/write definitions · bcf6f55a
      Arnd Bergmann committed
      Building little-endian allmodconfig kernels on arm64 started failing
      with the generated atomic.h implementation, since we now try to call
      kasan helpers from the EFI stub:
      
        aarch64-linux-gnu-ld: drivers/firmware/efi/libstub/arm-stub.stub.o: in function `atomic_set':
        include/generated/atomic-instrumented.h:44: undefined reference to `__efistub_kasan_check_write'
      
      I suspect that we get similar problems in other files that explicitly
      disable KASAN for some reason but call atomic_t based helper functions.
      
      We can fix this by checking the predefined __SANITIZE_ADDRESS__ macro
      that the compiler sets instead of checking CONFIG_KASAN, but this in
      turn requires a small hack in mm/kasan/common.c so we do see the extern
      declaration there instead of the inline function.
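
The idea of keying on the compiler-provided macro rather than on a build configuration option can be sketched as follows. This is a simplified user-space model: `TOY_INSTRUMENTED`, `toy_checks_done` and `toy_kasan_check_write()` are hypothetical names, and the counter merely stands in for a real instrumentation call.

```c
/*
 * __SANITIZE_ADDRESS__ is set by the compiler for translation units
 * that are actually built with address sanitizing, so it reflects
 * per-file reality rather than a global configuration switch.
 */
#ifdef __SANITIZE_ADDRESS__
#define TOY_INSTRUMENTED 1
#else
#define TOY_INSTRUMENTED 0
#endif

static int toy_checks_done;

/* Performs a check only in instrumented units, otherwise a no-op. */
static void toy_kasan_check_write(const volatile void *p, unsigned int size)
{
	(void)p;
	(void)size;
	if (TOY_INSTRUMENTED)
		toy_checks_done++;
}
```

A file compiled without sanitizer flags gets the no-op path even if sanitizing is enabled elsewhere in the build, which is the property the EFI stub case needs.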
      
      Link: http://lkml.kernel.org/r/20181211133453.2835077-1-arnd@arndb.de
      Fixes: b1864b828644 ("locking/atomics: build atomic headers as required")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reported-by: Anders Roxell <anders.roxell@linaro.org>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bcf6f55a
  2. 05 Mar, 2019 1 commit
    • L
      aio: simplify - and fix - fget/fput for io_submit() · 84c4e1f8
      Linus Torvalds committed
      Al Viro root-caused a race where the IOCB_CMD_POLL handling of
      fget/fput() could cause us to access the file pointer after it had
      already been freed:
      
       "In more details - normally IOCB_CMD_POLL handling looks so:
      
         1) io_submit(2) allocates aio_kiocb instance and passes it to
            aio_poll()
      
         2) aio_poll() resolves the descriptor to struct file by req->file =
            fget(iocb->aio_fildes)
      
         3) aio_poll() sets ->woken to false and raises ->ki_refcnt of that
            aio_kiocb to 2 (bumps by 1, that is).
      
         4) aio_poll() calls vfs_poll(). After sanity checks (basically,
            "poll_wait() had been called and only once") it locks the queue.
            That's what the extra reference to iocb had been for - we know we
            can safely access it.
      
         5) With queue locked, we check if ->woken has already been set to
            true (by aio_poll_wake()) and, if it had been, we unlock the
            queue, drop a reference to aio_kiocb and bugger off - at that
            point it's a responsibility to aio_poll_wake() and the stuff
            called/scheduled by it. That code will drop the reference to file
            in req->file, along with the other reference to our aio_kiocb.
      
         6) otherwise, we see whether we need to wait. If we do, we unlock the
            queue, drop one reference to aio_kiocb and go away - eventual
            wakeup (or cancel) will deal with the reference to file and with
            the other reference to aio_kiocb
      
         7) otherwise we remove ourselves from waitqueue (still under the
            queue lock), so that wakeup won't get us. No async activity will
            be happening, so we can safely drop req->file and iocb ourselves.
      
        If wakeup happens while we are in vfs_poll(), we are fine - aio_kiocb
        won't get freed under us, so we can do all the checks and locking
        safely. And we don't touch ->file if we detect that case.
      
        However, vfs_poll() most certainly *does* touch the file it had been
        given. So wakeup coming while we are still in ->poll() might end up
        doing fput() on that file. That case is not too rare, and usually we
        are saved by the still present reference from descriptor table - that
        fput() is not the final one.
      
        But if another thread closes that descriptor right after our fget()
        and wakeup does happen before ->poll() returns, we are in trouble -
        final fput() done while we are in the middle of a method:
      
      Al also wrote a patch to take an extra reference to the file descriptor
      to fix this, but I instead suggested we just streamline the whole file
      pointer handling by submit_io() so that the generic aio submission code
      simply keeps the file pointer around until the aio has completed.
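
The lifetime rule behind the fix can be illustrated with a toy reference-count model. Everything here is hypothetical and heavily simplified: there is no real concurrency, the interleaving from the race description is replayed sequentially, and the `toy_*` names do not correspond to kernel functions.

```c
struct toy_file {
	int refs;
	int freed;
};

struct toy_req {
	int refs;		/* one for the submitter, one for the wakeup */
	struct toy_file *file;
};

static void toy_fput(struct toy_file *f)
{
	if (--f->refs == 0)
		f->freed = 1;
}

/* Streamlined rule: the file goes away only with the last request ref. */
static void toy_req_put(struct toy_req *r)
{
	if (--r->refs == 0)
		toy_fput(r->file);
}

/* Returns 1 if the file was freed while ->poll() was still running. */
static int toy_file_freed_during_poll(int streamlined)
{
	struct toy_file f = { 1, 0 };	/* descriptor table reference */
	struct toy_req r = { 2, &f };
	int freed;

	f.refs++;			/* submission: fget() */
	toy_fput(&f);			/* another thread closes the fd */
	if (streamlined) {
		toy_req_put(&r);	/* wakeup drops only its request ref */
	} else {
		toy_fput(&f);		/* old scheme: wakeup drops the file */
		r.refs--;
	}
	freed = f.freed;		/* observed while still in ->poll() */
	if (streamlined)
		toy_req_put(&r);	/* submitter finishes, file released */
	return freed;
}
```

In the old scheme the replay ends with the file freed in the middle of the poll method. With the file tied to the request's lifetime, it stays valid until the submitter has dropped its reference as well.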
      
      Fixes: bfe4037e ("aio: implement IOCB_CMD_POLL")
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Reported-by: syzbot+503d4cc169fcec1cb18c@syzkaller.appspotmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      84c4e1f8
  3. 04 Mar, 2019 8 commits
    • H
      net: phy: remove gen10g_no_soft_reset · 7be3ad84
      Heiner Kallweit committed
      genphy_no_soft_reset and gen10g_no_soft_reset are both the same no-ops,
      one is enough.
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7be3ad84
    • H
      net: phy: don't export gen10g_read_status · d81210c2
      Heiner Kallweit committed
      gen10g_read_status is deprecated, therefore stop exporting it.
      We don't want to encourage anybody to use it.
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d81210c2
    • H
      net: phy: remove gen10g_config_init · c5e91d39
      Heiner Kallweit committed
      ETHTOOL_LINK_MODE_10000baseT_Full_BIT is set anyway in the supported
      and advertising bitmap because it's part of PHY_10GBIT_FEATURES.
      And all users of gen10g_config_init use PHY_10GBIT_FEATURES.
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c5e91d39
    • H
      net: phy: remove gen10g_suspend and gen10g_resume · a6d0aa97
      Heiner Kallweit committed
      phy_suspend() and phy_resume() are no-ops anyway if no callback is
      defined. Therefore we don't need these stubs.
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a6d0aa97
    • F
      net: ipv6: add socket option IPV6_ROUTER_ALERT_ISOLATE · 9036b2fe
      Francesco Ruggeri committed
      By default, an IPv6 socket with the IPV6_ROUTER_ALERT socket option set
      will receive all IPv6 RA packets from all namespaces.
      The IPV6_ROUTER_ALERT_ISOLATE socket option restricts packets received by
      the socket to be only from the socket's namespace.
      Signed-off-by: Maxim Martynov <maxim@arista.com>
      Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9036b2fe
    • A
      regulator: core: Add set/get_current_limit helpers for regmap users · a32e0c77
      Axel Lin committed
      By setting curr_table, n_current_limits, csel_reg and csel_mask, the
      regmap users can use regulator_set_current_limit_regmap and
      regulator_get_current_limit_regmap for set/get_current_limit callbacks.
      Signed-off-by: Axel Lin <axel.lin@ingics.com>
      Signed-off-by: Mark Brown <broonie@kernel.org>
      a32e0c77
    • A
      regulator: Fix comment for csel_reg and csel_mask · 35d838ff
      Axel Lin committed
      The csel_reg and csel_mask fields in struct regulator_desc need to
      be generic for drivers, not just for TPS65218.
      Signed-off-by: Axel Lin <axel.lin@ingics.com>
      Signed-off-by: Mark Brown <broonie@kernel.org>
    • Y
      appletalk: Fix use-after-free in atalk_proc_exit · 6377f787
      YueHaibing committed
      KASAN reports this:
      
      BUG: KASAN: use-after-free in pde_subdir_find+0x12d/0x150 fs/proc/generic.c:71
      Read of size 8 at addr ffff8881f41fe5b0 by task syz-executor.0/2806
      
      CPU: 0 PID: 2806 Comm: syz-executor.0 Not tainted 5.0.0-rc7+ #45
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0xfa/0x1ce lib/dump_stack.c:113
       print_address_description+0x65/0x270 mm/kasan/report.c:187
       kasan_report+0x149/0x18d mm/kasan/report.c:317
       pde_subdir_find+0x12d/0x150 fs/proc/generic.c:71
       remove_proc_entry+0xe8/0x420 fs/proc/generic.c:667
       atalk_proc_exit+0x18/0x820 [appletalk]
       atalk_exit+0xf/0x5a [appletalk]
       __do_sys_delete_module kernel/module.c:1018 [inline]
       __se_sys_delete_module kernel/module.c:961 [inline]
       __x64_sys_delete_module+0x3dc/0x5e0 kernel/module.c:961
       do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x462e99
      Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007fb2de6b9c58 EFLAGS: 00000246 ORIG_RAX: 00000000000000b0
      RAX: ffffffffffffffda RBX: 000000000073bf00 RCX: 0000000000462e99
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000200001c0
      RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007fb2de6ba6bc
      R13: 00000000004bccaa R14: 00000000006f6bc8 R15: 00000000ffffffff
      
      Allocated by task 2806:
       set_track mm/kasan/common.c:85 [inline]
       __kasan_kmalloc.constprop.3+0xa0/0xd0 mm/kasan/common.c:496
       slab_post_alloc_hook mm/slab.h:444 [inline]
       slab_alloc_node mm/slub.c:2739 [inline]
       slab_alloc mm/slub.c:2747 [inline]
       kmem_cache_alloc+0xcf/0x250 mm/slub.c:2752
       kmem_cache_zalloc include/linux/slab.h:730 [inline]
       __proc_create+0x30f/0xa20 fs/proc/generic.c:408
       proc_mkdir_data+0x47/0x190 fs/proc/generic.c:469
       0xffffffffc10c01bb
       0xffffffffc10c0166
       do_one_initcall+0xfa/0x5ca init/main.c:887
       do_init_module+0x204/0x5f6 kernel/module.c:3460
       load_module+0x66b2/0x8570 kernel/module.c:3808
       __do_sys_finit_module+0x238/0x2a0 kernel/module.c:3902
       do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Freed by task 2806:
       set_track mm/kasan/common.c:85 [inline]
       __kasan_slab_free+0x130/0x180 mm/kasan/common.c:458
       slab_free_hook mm/slub.c:1409 [inline]
       slab_free_freelist_hook mm/slub.c:1436 [inline]
       slab_free mm/slub.c:2986 [inline]
       kmem_cache_free+0xa6/0x2a0 mm/slub.c:3002
       pde_put+0x6e/0x80 fs/proc/generic.c:647
       remove_proc_entry+0x1d3/0x420 fs/proc/generic.c:684
       0xffffffffc10c031c
       0xffffffffc10c0166
       do_one_initcall+0xfa/0x5ca init/main.c:887
       do_init_module+0x204/0x5f6 kernel/module.c:3460
       load_module+0x66b2/0x8570 kernel/module.c:3808
       __do_sys_finit_module+0x238/0x2a0 kernel/module.c:3902
       do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The buggy address belongs to the object at ffff8881f41fe500
       which belongs to the cache proc_dir_entry of size 256
      The buggy address is located 176 bytes inside of
       256-byte region [ffff8881f41fe500, ffff8881f41fe600)
      The buggy address belongs to the page:
      page:ffffea0007d07f80 count:1 mapcount:0 mapping:ffff8881f6e69a00 index:0x0
      flags: 0x2fffc0000000200(slab)
      raw: 02fffc0000000200 dead000000000100 dead000000000200 ffff8881f6e69a00
      raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff8881f41fe480: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
       ffff8881f41fe500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      >ffff8881f41fe580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                           ^
       ffff8881f41fe600: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
       ffff8881f41fe680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      
      The return value of atalk_proc_init should be checked for failure;
      otherwise atalk_exit will trigger a use-after-free in pde_subdir_find
      while unloading the module. This patch fixes the error cleanup path of
      atalk_init.
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
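The kind of error cleanup path the commit above refers to can be illustrated with a generic goto-unwind pattern. This is a userspace sketch, not the appletalk code: the three resources, the `fail_step` knob and every function name are hypothetical. The point is that each setup step is checked, and a failure undoes only the steps that already succeeded, so the exit path never touches something that was never set up.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical resources standing in for what a module init registers. */
static bool proto_up, proc_up, sysctl_up;
static int fail_step;   /* test knob: 0 = no failure, 1..3 = step that fails */

static int setup_proto(void)  { if (fail_step == 1) return -1; proto_up = true;  return 0; }
static int setup_proc(void)   { if (fail_step == 2) return -1; proc_up = true;   return 0; }
static int setup_sysctl(void) { if (fail_step == 3) return -1; sysctl_up = true; return 0; }

static void teardown_proto(void)  { proto_up = false; }
static void teardown_proc(void)   { proc_up = false; }
static void teardown_sysctl(void) { sysctl_up = false; }

/* Check every step; on failure, unwind the earlier steps in reverse order. */
static int model_init(void)
{
    int rc;

    rc = setup_proto();
    if (rc)
        return rc;
    rc = setup_proc();
    if (rc)
        goto out_proto;
    rc = setup_sysctl();
    if (rc)
        goto out_proc;
    return 0;

out_proc:
    teardown_proc();
out_proto:
    teardown_proto();
    return rc;
}

/* Only valid after a successful model_init(). */
static void model_exit(void)
{
    assert(proto_up && proc_up && sysctl_up);
    teardown_sysctl();
    teardown_proc();
    teardown_proto();
}
```

Because a failed `model_init()` leaves nothing registered, the module would simply not load in that case, and the exit function would never run against half-initialised state.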
  4. 02 Mar, 2019 2 commits
  5. 01 Mar, 2019 1 commit
  6. 28 Feb, 2019 4 commits
    • A
      mmc: core: Add discard support to sd · bc47e2f6
      Avri Altman committed
      SD spec v5.1 adds discard support. The flows and commands are similar to
      mmc, so just set the discard arg in CMD38.
      
      A host which supports DISCARD shall check if the DISCARD_SUPPORT (b313)
      is set in the SD_STATUS register.  If the card does not support discard,
      the host shall not issue DISCARD command, but ERASE command instead.
      
      After the DISCARD operation, the card may de-allocate the discarded
      blocks partially or completely, so the host mustn't make any assumptions
      concerning the content of the discarded region. This is unlike the ERASE
      command, after which the region is guaranteed to contain either '0's or
      '1's, depending on the content of DATA_STAT_AFTER_ERASE (b55) in the SCR
      register.

      One more important difference compared to ERASE is the busy timeout,
      which we will address in the next patch.
      Signed-off-by: Avri Altman <avri.altman@wdc.com>
      Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
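The decision described in the commit above, checking DISCARD_SUPPORT (b313) and falling back to ERASE, might be sketched as follows. Several things here are assumptions and not taken from the commit: the 512-bit SD_STATUS is assumed to be stored as 64 bytes with bit 511 in the most significant bit of byte 0, the CMD38 argument values 0 (erase) and 1 (discard) are assumed, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SD_STATUS_BYTES      64           /* 512-bit SD_STATUS register */
#define DISCARD_SUPPORT_BIT  313          /* bit number given in the commit */
#define CMD38_ARG_ERASE      0x00000000u  /* assumed argument value */
#define CMD38_ARG_DISCARD    0x00000001u  /* assumed argument value */

/*
 * Read bit n of SD_STATUS, assuming byte 0 carries bits 511..504
 * and byte 63 carries bits 7..0.
 */
static int sd_status_bit(const uint8_t status[SD_STATUS_BYTES], unsigned int n)
{
    assert(n < 512);
    return (status[(511 - n) / 8] >> (n % 8)) & 1;
}

/* Use DISCARD only when the card advertises it; otherwise use ERASE. */
static uint32_t cmd38_arg(const uint8_t status[SD_STATUS_BYTES])
{
    return sd_status_bit(status, DISCARD_SUPPORT_BIT) ? CMD38_ARG_DISCARD
                                                      : CMD38_ARG_ERASE;
}
```

Under the assumed byte layout, bit 313 lands in byte 24 at bit position 1, since byte 24 covers bits 319 down to 312.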
    • F
      net: Remove switchdev_ops · 3d705f07
      Florian Fainelli committed
      Now that we have converted all possible callers to using a switchdev
      notifier for attributes, we no longer need to implement
      switchdev_ops, and it can be removed from all drivers and the
      net_device structure.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      net: dev: Use unsigned integer as an argument to left-shift · f4d7b3e2
      Andy Shevchenko committed
      1 << 31 is Undefined Behaviour according to the C standard.
      Use the U suffix to avoid theoretical overflow.
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
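The point of the commit above can be shown in isolation. With a 32-bit `int`, the literal `1` is signed, so `1 << 31` shifts a bit into the sign position, which the C standard leaves undefined. Making the operand unsigned keeps the shift well defined. The helper name `bit_mask` below is hypothetical and only illustrates the idiom.

```c
#include <assert.h>
#include <stdint.h>

/* Build a single-bit mask from an unsigned operand, so bit 31 is well defined. */
static uint32_t bit_mask(unsigned int n)
{
    assert(n < 32);            /* shifting by 32 or more is also undefined */
    return UINT32_C(1) << n;   /* unsigned counterpart of the signed 1 << n */
}
```

For example, `bit_mask(31)` yields 0x80000000 without relying on signed overflow behaviour.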
    • A
      bpf: enable program stats · 492ecee8
      Alexei Starovoitov committed
      JITed BPF programs are indistinguishable from kernel functions, but unlike
      kernel code, BPF code can be changed often.
      The typical approach of "perf record" + "perf report" profiling and tuning of
      kernel code works just as well for BPF programs, but kernel code doesn't
      need to be monitored whereas BPF programs do.
      Users load and run large numbers of BPF programs.
      These BPF stats allow tools to monitor the usage of BPF on the server.
      The monitoring tools will turn sysctl kernel.bpf_stats_enabled
      on and off for a few seconds to sample the average cost of the programs.
      Data aggregated over hours and days will provide insight into the cost of BPF,
      and alarms can trigger if a given program suddenly gets more expensive.
      
      The cost of two sched_clock() per program invocation adds ~20 nsec.
      Fast BPF progs (like selftests/bpf/progs/test_pkt_access.c) will slow down
      from ~10 nsec to ~30 nsec.
      static_key minimizes the cost of the stats collection.
      There is no measurable difference before/after this patch
      with kernel.bpf_stats_enabled=0
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
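The accounting idea in the commit above, two clock reads around each program invocation, gated by a switch, can be modelled in userspace. This is a rough sketch under stated assumptions: a plain boolean stands in for the sysctl and the static_key, `timespec_get()` stands in for sched_clock(), and `struct prog_stats`, `run_prog` and `sample_prog` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Per-program counters: number of runs and total time spent. */
struct prog_stats {
    uint64_t cnt;
    uint64_t nsecs;
};

/* Stands in for sysctl kernel.bpf_stats_enabled (a static_key in the kernel). */
static bool stats_enabled;

/* Wall-clock nanoseconds; a userspace substitute for sched_clock(). */
static uint64_t now_ns(void)
{
    struct timespec ts;

    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

typedef int (*prog_fn)(int ctx);

/* Hypothetical "program" used only to exercise the wrapper. */
static int sample_prog(int ctx)
{
    return ctx * 2;
}

/* Run the program; pay for two clock reads only when stats are enabled. */
static int run_prog(prog_fn prog, int ctx, struct prog_stats *st)
{
    uint64_t start;
    int ret;

    assert(prog && st);
    if (!stats_enabled)
        return prog(ctx);

    start = now_ns();
    ret = prog(ctx);
    st->nsecs += now_ns() - start;
    st->cnt++;
    return ret;
}
```

A monitoring tool in this model would set `stats_enabled` for a short window, then divide `nsecs` by `cnt` to estimate the average cost per run.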
  7. 27 Feb, 2019 1 commit
  8. 26 Feb, 2019 1 commit
    • L
      Revert "x86/fault: BUG() when uaccess helpers fault on kernel addresses" · 53a41cb7
      Linus Torvalds committed
      This reverts commit 9da3f2b7.
      
      It was well-intentioned, but wrong.  Overriding the exception tables for
      instructions for random reasons is just wrong, and that is what the new
      code did.
      
      It caused problems for tracing, and it caused problems for strncpy_from_user(),
      because the new checks made perfectly valid use cases break, rather than
      catch things that did bad things.
      
      Unchecked user space accesses are a problem, but that's not a reason to
      add invalid checks that then people have to work around with silly flags
      (in this case, that 'kernel_uaccess_faults_ok' flag, which is just an
      odd way to say "this commit was wrong" and was sprinkled into random
      places to hide the wrongness).
      
      The real fix to unchecked user space accesses is to get rid of the
      special "let's not check __get_user() and __put_user() at all" logic.
      Make __{get|put}_user() be just aliases to the regular {get|put}_user()
      functions, and make it impossible to access user space without having
      the proper checks in places.
      
      The raison d'être of the special double-underscore versions used to be
      that the range check was expensive, and if you did multiple user
      accesses, you'd do the range check up front (like the signal frame
      handling code, for example).  But SMAP (on x86) and PAN (on ARM) have
      made that optimization pointless, because the _real_ expense is the "set
      CPU flag to allow user space access".
      
      So let's not break the valid cases to catch invalid cases that shouldn't
      even exist.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tobin C. Harding <tobin@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Jann Horn <jannh@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 25 Feb, 2019 10 commits