1. 06 Mar 2019, 26 commits
    • mm: swap: use mem_cgroup_is_root() instead of dereferencing css->parent · 59118c42
      Committed by Yang Shi
      mem_cgroup_is_root() is the preferred API to check whether a memcg is
      root or not.  Use it instead of dereferencing css->parent.
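
      An illustrative before/after sketch, under the assumption that the
      caller holds a valid struct mem_cgroup pointer:

              /* Before: open-coded check via the css parent pointer */
              if (!memcg->css.parent)
                      return;         /* root memcg */

              /* After: the dedicated helper */
              if (mem_cgroup_is_root(memcg))
                      return;         /* root memcg */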
      
      Link: http://lkml.kernel.org/r/1547232913-118148-1-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      59118c42
    • mm: update get_user_pages_longterm to migrate pages allocated from CMA region · 9a4e9f3b
      Committed by Aneesh Kumar K.V
      This patch updates get_user_pages_longterm to migrate pages allocated
      out of CMA region.  This makes sure that we don't keep non-movable pages
      (due to page reference count) in the CMA area.
      
      This will be used by ppc64 in a later patch to avoid pinning pages in
      the CMA region.  ppc64 uses the CMA region for allocation of the
      hardware page table (hash page table), and being unable to migrate
      pages out of the CMA region results in page table allocation failures.
      
      One case where we hit this easily is when a guest uses a VFIO
      passthrough device.  VFIO locks all the guest's memory, and if the
      guest memory is backed by the CMA region, it becomes unmovable,
      resulting in fragmenting the CMA and possibly preventing other guests
      from allocating a large enough hash page table.
      
      NOTE: We allocate the new page without using __GFP_THISNODE
      
      Link: http://lkml.kernel.org/r/20190114095438.32470-3-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a4e9f3b
    • mm/cma: add PF flag to force non cma alloc · d7fefcc8
      Committed by Aneesh Kumar K.V
      Patch series "mm/kvm/vfio/ppc64: Migrate compound pages out of CMA
      region", v8.
      
      ppc64 uses the CMA area for the allocation of the guest page table
      (hash page table).  We won't be able to start the guest if we fail to
      allocate the hash page table.  We have observed hash table allocation
      failures because we failed to migrate pages out of the CMA region
      because they were pinned.  This happens when we are using VFIO.  VFIO
      on ppc64 pins the entire guest RAM.  If the guest RAM pages get
      allocated out of the CMA region, we won't be able to migrate those
      pages.  The pages are also pinned for the lifetime of the guest.
      
      Currently we support migration of non-compound pages.  With THP and
      with the addition of hugetlb migration we can end up allocating
      compound pages from the CMA region.  This patch series adds support
      for migrating compound pages.
      
      This patch (of 4):
      
      Add PF_MEMALLOC_NOCMA, which makes sure any allocation in that context
      is marked non-movable and hence cannot be satisfied by the CMA region.
      
      This is useful with get_user_pages_longterm where we want to take a page
      pin by migrating pages from CMA region.  Marking the section
      PF_MEMALLOC_NOCMA ensures that we avoid unnecessary page migration
      later.
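
      A minimal usage sketch; memalloc_nocma_save()/memalloc_nocma_restore()
      follow the usual memalloc_no*_save() pattern and are assumed here to
      be the helpers this patch introduces:

              struct page *page;
              unsigned int flags;

              /* Allocations in this section will not be satisfied from CMA */
              flags = memalloc_nocma_save();
              page = alloc_pages(GFP_HIGHUSER_MOVABLE, 0);
              memalloc_nocma_restore(flags);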
      
      Link: http://lkml.kernel.org/r/20190114095438.32470-2-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7fefcc8
    • mm: better document PG_reserved · 6e2e07cd
      Committed by David Hildenbrand
      The usage of PG_reserved and how PG_reserved pages are to be treated is
      buried deep down in different parts of the kernel.  Let's shine some
      light onto these details by documenting current users and expected
      behavior.
      
      In particular, clarify the "Some of them might not even exist" case.
      These are physical memory gaps that will never be dumped, as they are
      not marked as IORESOURCE_SYSRAM.  PG_reserved does not, in general,
      hinder anybody from dumping or swapping.  In some cases, these pages
      will not be stored in the hibernation image.
      
      Link: http://lkml.kernel.org/r/20190114125903.24845-10-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Cc: <yi.z.zhang@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6e2e07cd
    • mm: rid swapoff of quadratic complexity · b56a2d8a
      Committed by Vineeth Remanan Pillai
      This patch was initially posted by Kelley Nielsen.  Reposting the patch
      with all review comments addressed and with minor modifications and
      optimizations.  Also, folding in the fixes offered by Hugh Dickins and
      Huang Ying.  Tests were rerun and commit message updated with new
      results.
      
      try_to_unuse() is of quadratic complexity, with a lot of wasted effort.
      It unuses swap entries one by one, potentially iterating over all the
      page tables for all the processes in the system for each one.
      
      This new proposed implementation of try_to_unuse reduces its
      complexity to linear.  It iterates over the system's mms once, unusing
      all the affected entries as it walks each set of page tables.  It also
      makes similar changes to shmem_unuse.
      
      Improvement
      
      swapoff was called on a swap partition containing about 6G of data, in a
      VM(8cpu, 16G RAM), and calls to unuse_pte_range() were counted.
      
      Present implementation....about 1200M calls (8min, avg 80% cpu util).
      Prototype.................about  9.0K calls (3min, avg 5% cpu util).
      
      Details
      
      In shmem_unuse(), iterate over the shmem_swaplist and, for each
      shmem_inode_info that contains a swap entry, pass it to
      shmem_unuse_inode(), along with the swap type.  In shmem_unuse_inode(),
      iterate over its associated xarray, and store the index and value of
      each swap entry in an array for passing to shmem_swapin_page() outside
      of the RCU critical section.
      
      In try_to_unuse(), instead of iterating over the entries in the type and
      unusing them one by one, perhaps walking all the page tables for all the
      processes for each one, iterate over the mmlist, making one pass.  Pass
      each mm to unuse_mm() to begin its page table walk, and during the walk,
      unuse all the ptes that have backing store in the swap type received by
      try_to_unuse().  After the walk, check the type for orphaned swap
      entries with find_next_to_unuse(), and remove them from the swap cache.
      If find_next_to_unuse() starts over at the beginning of the type, repeat
      the check of the shmem_swaplist and the walk a maximum of three times.
      
      Change unuse_mm() and the intervening walk functions down to
      unuse_pte_range() to take the type as a parameter, and to iterate over
      their entire range, calling the next function down on every iteration.
      In unuse_pte_range(), make a swap entry from each pte in the range using
      the passed in type.  If it has backing store in the type, call
      swapin_readahead() to retrieve the page and pass it to unuse_pte().
      
      Pass the count of pages_to_unuse down the page table walks in
      try_to_unuse(), and return from the walk when the desired number of
      pages has been swapped back in.
      
      Link: http://lkml.kernel.org/r/20190114153129.4852-2-vpillai@digitalocean.com
      Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
      Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
      Signed-off-by: Huang Ying <ying.huang@intel.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b56a2d8a
    • mm/hugetlb: add prot_modify_start/commit sequence for hugetlb update · 023bdd00
      Committed by Aneesh Kumar K.V
      Architectures like ppc64 need to do a conditional TLB flush based on
      the old and new values of the pte.  Follow the regular pte change
      protection sequence for hugetlb too.  This allows the architectures to
      override the update sequence.
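
      A sketch of the resulting update sequence, assuming the new helpers
      mirror the existing ptep_modify_prot_start()/ptep_modify_prot_commit()
      pair (the exact signatures are an assumption):

              pte_t old_pte, pte;

              /* Start: the arch may stash the old value for a later flush */
              old_pte = huge_ptep_modify_prot_start(vma, address, ptep);
              pte = pte_mkhuge(huge_pte_modify(old_pte, newprot));
              /* Commit: ppc64 can flush the TLB conditionally on old vs new */
              huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);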
      
      Link: http://lkml.kernel.org/r/20190116085035.29729-5-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      023bdd00
    • mm, memcg: create mem_cgroup_from_seq · aa9694bb
      Committed by Chris Down
      This is the start of a series of patches similar to my earlier
      DEFINE_MEMCG_MAX_OR_VAL work, but with less Macro Magic(tm).
      
      There are a bunch of places we go from seq_file to mem_cgroup, which
      currently requires manually getting the css, then getting the mem_cgroup
      from the css.  It's in enough places now that having mem_cgroup_from_seq
      makes sense (and also makes the next patch a bit nicer).
      
      Link: http://lkml.kernel.org/r/20190124194050.GA31341@chrisdown.name
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aa9694bb
    • kernel: cgroup: add poll file operation · dc50537b
      Committed by Johannes Weiner
      Cgroup has a standardized poll/notification mechanism for waking all
      pollers on all fds when a filesystem node changes.  To allow polling for
      custom events, add a .poll callback that can override the default.
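
      A sketch of the new hook as it would appear in struct cftype; the
      prototype is assumed to match the kernfs counterpart added in the
      companion patch:

              struct cftype {
                      /* ... */
                      /* optional: override the default kernfs poll behaviour */
                      __poll_t (*poll)(struct kernfs_open_file *of,
                                       struct poll_table_struct *pt);
                      /* ... */
              };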
      
      This is in preparation for pollable cgroup pressure files which have
      per-fd trigger configurations.
      
      Link: http://lkml.kernel.org/r/20190124211518.244221-3-surenb@google.com
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc50537b
    • fs: kernfs: add poll file operation · 147e1a97
      Committed by Johannes Weiner
      Patch series "psi: pressure stall monitors", v3.
      
      Android is adopting psi to detect and remedy memory pressure that
      results in stuttering and decreased responsiveness on mobile devices.
      
      Psi gives us the stall information, but because we're dealing with
      latencies in the millisecond range, periodically reading the pressure
      files to detect stalls in a timely fashion is not feasible.  Psi also
      doesn't aggregate its averages at a high enough frequency right now.
      
      This patch series extends the psi interface such that users can
      configure sensitive latency thresholds and use poll() and friends to be
      notified when these are breached.
      
      As high-frequency aggregation is costly, it implements an aggregation
      method that is optimized for fast, short-interval averaging, and makes
      the aggregation frequency adaptive, such that high-frequency updates
      only happen while monitored stall events are actively occurring.
      
      With these patches applied, Android can monitor for, and ward off,
      mounting memory shortages before they cause problems for the user.
      For example, using memory stall monitors in the userspace low memory
      killer daemon (lmkd), we can detect mounting pressure and kill less
      important processes before the device becomes visibly sluggish.
      
      In our memory stress testing, psi memory monitors produce roughly 10x
      fewer false positives compared to vmpressure signals.  Having the
      ability to specify multiple triggers for the same psi metric allows
      other parts of the Android framework to monitor the memory state of
      the device and act accordingly.
      
      The new interface is straightforward.  The user opens one of the
      pressure files for writing and writes a trigger description into the
      file descriptor that defines the stall state - some or full, and the
      maximum stall time over a given window of time.  E.g.:
      
              /* Signal when stall time exceeds 100ms of a 1s window */
              char trigger[] = "full 100000 1000000";
              struct pollfd fds;

              fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
              fds.events = POLLPRI;
              write(fds.fd, trigger, sizeof(trigger));
              while (poll(&fds, 1, -1) >= 0) {
                      ...
              }
              close(fds.fd);
      
      When the monitored stall state is entered, psi adapts its aggregation
      frequency according to what the configured time window requires in order
      to emit event signals in a timely fashion.  Once the stalling subsides,
      aggregation reverts back to normal.
      
      The trigger is associated with the open file descriptor.  To stop
      monitoring, the user only needs to close the file descriptor and the
      trigger is discarded.
      
      Patches 1-4 prepare the psi code for polling support.  Patch 5
      implements the adaptive polling logic, the pressure growth detection
      optimized for short intervals, and hooks up write() and poll() on the
      pressure files.
      
      The patches were developed in collaboration with Johannes Weiner.
      
      This patch (of 5):
      
      Kernfs has a standardized poll/notification mechanism for waking all
      pollers on all fds when a filesystem node changes.  To allow polling for
      custom events, add a .poll callback that can override the default.
      
      This is in preparation for pollable cgroup pressure files which have
      per-fd trigger configurations.
      
      Link: http://lkml.kernel.org/r/20190124211518.244221-2-surenb@google.com
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      147e1a97
    • mm, compaction: capture a page under direct compaction · 5e1f0f09
      Committed by Mel Gorman
      Compaction is inherently race-prone as a suitable page freed during
      compaction can be allocated by any parallel task.  This patch uses a
      capture_control structure to isolate a page immediately when it is freed
      by a direct compactor in the slow path of the page allocator.  The
      intent is to avoid redundant scanning.
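
      A sketch of the capture structure implied by the description (the
      field set is minimal and assumed from the commit text):

              struct capture_control {
                      struct compact_control *cc;
                      struct page *page;      /* set when a suitable page is freed */
              };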
      
                                           5.0.0-rc1              5.0.0-rc1
                                     selective-v3r17          capture-v3r19
      Amean     fault-both-1         0.00 (   0.00%)        0.00 *   0.00%*
      Amean     fault-both-3      2582.11 (   0.00%)     2563.68 (   0.71%)
      Amean     fault-both-5      4500.26 (   0.00%)     4233.52 (   5.93%)
      Amean     fault-both-7      5819.53 (   0.00%)     6333.65 (  -8.83%)
      Amean     fault-both-12     9321.18 (   0.00%)     9759.38 (  -4.70%)
      Amean     fault-both-18     9782.76 (   0.00%)    10338.76 (  -5.68%)
      Amean     fault-both-24    15272.81 (   0.00%)    13379.55 *  12.40%*
      Amean     fault-both-30    15121.34 (   0.00%)    16158.25 (  -6.86%)
      Amean     fault-both-32    18466.67 (   0.00%)    18971.21 (  -2.73%)
      
      Latency is only moderately affected, but the devil is in the details.
      A closer examination indicates that base page fault latency is reduced
      but latency of huge pages is increased as it takes greater care to
      succeed.  Part of the "problem" is that allocation success rates are
      close to 100% even when under pressure, and compaction gets harder.
      
                                      5.0.0-rc1              5.0.0-rc1
                                selective-v3r17          capture-v3r19
      Percentage huge-3        96.70 (   0.00%)       98.23 (   1.58%)
      Percentage huge-5        96.99 (   0.00%)       95.30 (  -1.75%)
      Percentage huge-7        94.19 (   0.00%)       97.24 (   3.24%)
      Percentage huge-12       94.95 (   0.00%)       97.35 (   2.53%)
      Percentage huge-18       96.74 (   0.00%)       97.30 (   0.58%)
      Percentage huge-24       97.07 (   0.00%)       97.55 (   0.50%)
      Percentage huge-30       95.69 (   0.00%)       98.50 (   2.95%)
      Percentage huge-32       96.70 (   0.00%)       99.27 (   2.65%)
      
      And scan rates are reduced as expected by 6% for the migration scanner
      and 29% for the free scanner indicating that there is less redundant
      work.
      
      Compaction migrate scanned    20815362    19573286
      Compaction free scanned       16352612    11510663
      
      [mgorman@techsingularity.net: remove redundant check]
        Link: http://lkml.kernel.org/r/20190201143853.GH9565@techsingularity.net
      Link: http://lkml.kernel.org/r/20190118175136.31341-23-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5e1f0f09
    • mm, compaction: be selective about what pageblocks to clear skip hints · e332f741
      Committed by Mel Gorman
      Pageblock hints are cleared when compaction restarts or kswapd makes
      enough progress that it can sleep, but this is over-eager in that the
      bit is cleared for migration sources with no LRU pages and migration
      targets with no free pages.  As pageblock skip hint flushes are
      relatively rare and out-of-band with respect to kswapd, this patch
      makes a few more expensive checks to see if it's appropriate to even
      clear the bit.  Every pageblock that is not cleared will avoid 512
      pages being scanned unnecessarily on x86-64.
      
      The impact is variable with different workloads showing small
      differences in latency, success rates and scan rates.  This is expected
      as clearing the hints is not that common but doing a small amount of
      work out-of-band to avoid a large amount of work in-band later is
      generally a good thing.
      
      Link: http://lkml.kernel.org/r/20190118175136.31341-22-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Qian Cai <cai@lca.pw>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      [cai@lca.pw: no stuck in __reset_isolation_pfn()]
        Link: http://lkml.kernel.org/r/20190206034732.75687-1-cai@lca.pw
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e332f741
    • mm, compaction: use free lists to quickly locate a migration source · 70b44595
      Committed by Mel Gorman
      The migration scanner is a linear scan of a zone with a potentially
      large search space.  Furthermore, many pageblocks are unusable, such
      as those filled with reserved pages or partially filled with pages
      that cannot migrate.  These still get scanned in the common case of
      allocating a THP, and the cost accumulates.
      
      The patch uses a partial search of the free lists to locate a migration
      source candidate that is marked as MOVABLE when allocating a THP.  It
      prefers picking a block with a larger number of free pages already on
      the basis that there are fewer pages to migrate to free the entire
      block.  The lowest PFN found during searches is tracked as the basis of
      the start for the linear search after the first search of the free list
      fails.  After the search, the free list is shuffled so that the next
      search will not encounter the same page.  If the search fails then the
      subsequent searches will be shorter and the linear scanner is used.
      
      If this search fails, or if the request is for a small or
      unmovable/reclaimable allocation, then the linear scanner is still
      used.  It is somewhat pointless to use the list search in those cases.
      Small free pages must be used for the search, and there is no
      guarantee that movable pages located within that block are contiguous.
      
                                           5.0.0-rc1              5.0.0-rc1
                                       noboost-v3r10          findmig-v3r15
      Amean     fault-both-3      3771.41 (   0.00%)     3390.40 (  10.10%)
      Amean     fault-both-5      5409.05 (   0.00%)     5082.28 (   6.04%)
      Amean     fault-both-7      7040.74 (   0.00%)     7012.51 (   0.40%)
      Amean     fault-both-12    11887.35 (   0.00%)    11346.63 (   4.55%)
      Amean     fault-both-18    16718.19 (   0.00%)    15324.19 (   8.34%)
      Amean     fault-both-24    21157.19 (   0.00%)    16088.50 *  23.96%*
      Amean     fault-both-30    21175.92 (   0.00%)    18723.42 *  11.58%*
      Amean     fault-both-32    21339.03 (   0.00%)    18612.01 *  12.78%*
      
                                      5.0.0-rc1              5.0.0-rc1
                                  noboost-v3r10          findmig-v3r15
      Percentage huge-3        86.50 (   0.00%)       89.83 (   3.85%)
      Percentage huge-5        92.52 (   0.00%)       91.96 (  -0.61%)
      Percentage huge-7        92.44 (   0.00%)       92.85 (   0.44%)
      Percentage huge-12       92.98 (   0.00%)       92.74 (  -0.25%)
      Percentage huge-18       91.70 (   0.00%)       91.71 (   0.02%)
      Percentage huge-24       91.59 (   0.00%)       92.13 (   0.60%)
      Percentage huge-30       90.14 (   0.00%)       93.79 (   4.04%)
      Percentage huge-32       90.03 (   0.00%)       91.27 (   1.37%)
      
      This shows an improvement in allocation latencies with similar
      allocation success rates.  While not presented, there was a 31%
      reduction in migration scanning and an 8% reduction in system CPU
      usage.
      A 2-socket machine showed similar benefits.
      
      [mgorman@techsingularity.net: several fixes]
        Link: http://lkml.kernel.org/r/20190204120111.GL9565@techsingularity.net
      [vbabka@suse.cz: migrate block that was found-fast, some optimisations]
      Link: http://lkml.kernel.org/r/20190118175136.31341-10-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      70b44595
    • mm: shuffle GFP_* flags · d71e53ce
      Committed by Alexey Dobriyan
      GFP_KERNEL is one of the most used constants, but on archs like arm
      with fixed-length instructions some constants are more equal than
      others.  Constants with tightly packed bits can be injected directly
      into the instruction stream:
      
      	   0:   e3a00d33        mov     r0, #3264       ; 0xcc0
      
      Others require multiple instructions or even loading out of instruction
      stream:
      
      	   0:   e3a000c0        mov     r0, #192        ; 0xc0
      	   4:   e3400060        movt    r0, #96		; 0x60
      
      Shuffle GFP_* flags so that GFP_KERNEL/GFP_ATOMIC + __GFP_ZERO bits are
      close to each other.
      
      Savings on arm configs are ~0.1%.
      
      Link: http://lkml.kernel.org/r/20190109201838.GA9140@avx2
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d71e53ce
    • mm/hugetlb: enable arch specific huge page size support for migration · e693de18
      Committed by Anshuman Khandual
      Architectures like arm64 have HugeTLB page sizes which are different
      from the generic sizes at PMD, PUD and PGD level and are implemented
      via contiguous bits.  At present these special-size HugeTLB pages
      cannot be identified through macros like (PMD|PUD|PGDIR)_SHIFT and
      hence are not chosen for migration.
      
      Enabling migration support for these special HugeTLB page sizes along
      with the generic ones (PMD|PUD|PGD) would require identifying all of
      them on a given platform.  A platform-specific hook can precisely
      enumerate all huge page sizes supported for migration.  Instead of
      comparing against standard huge page orders, let the
      hugepage_migration_supported() function call a platform hook,
      arch_hugetlb_migration_supported().  The default definition of the
      platform hook maintains the existing semantics, which check the
      standard huge page orders.  But an architecture can choose to override
      the default and provide support for a comprehensive set of huge page
      sizes.
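
      A sketch of the default hook keeping the existing (PMD|PUD|PGDIR)_SHIFT
      semantics; an illustration of the fallback rather than the exact
      patch:

              #ifndef arch_hugetlb_migration_supported
              static inline bool arch_hugetlb_migration_supported(struct hstate *h)
              {
                      /* Default: only the standard huge page orders migrate */
                      return huge_page_shift(h) == PMD_SHIFT ||
                             huge_page_shift(h) == PUD_SHIFT ||
                             huge_page_shift(h) == PGDIR_SHIFT;
              }
              #endif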
      
      Link: http://lkml.kernel.org/r/1545121450-1663-4-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Steve Capper <steve.capper@arm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e693de18
    • mm/hugetlb: enable PUD level huge page migration · 9b553bf5
      Committed by Anshuman Khandual
      Architectures like arm64 have PUD-level HugeTLB pages for certain
      configs (a 1GB huge page is PUD-based with the ARM64_4K_PAGES base
      page size) that can be enabled for migration.  This can be achieved by
      checking for PUD_SHIFT order based HugeTLB pages during migration.
      
      Link: http://lkml.kernel.org/r/1545121450-1663-3-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Steve Capper <steve.capper@arm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9b553bf5
    • mm/hugetlb: distinguish between migratability and movability · 7ed2c31d
      Committed by Anshuman Khandual
      Patch series "arm64/mm: Enable HugeTLB migration", v4.
      
      This patch series enables HugeTLB migration support for all supported
      huge page sizes at all levels, including the contiguous bit
      implementation.  The following HugeTLB migration support matrix has
      been enabled with this patch series.  All permutations have been
      tested except for 16GB.
      
                 CONT PTE    PMD    CONT PMD    PUD
                 --------    ---    --------    ---
        4K:         64K     2M         32M     1G
        16K:         2M    32M          1G
        64K:         2M   512M         16G
      
      First the series adds migration support for PUD based huge pages.  It
      then adds a platform specific hook to query an architecture if a given
      huge page size is supported for migration, while also providing a
      default fallback option preserving the existing semantics which just
      check the (PMD|PUD|PGDIR)_SHIFT macros.  The last two patches enable
      HugeTLB migration on arm64 and subscribe to this new platform specific
      hook by defining an override.
      
      The second patch differentiates between movability and migratability
      aspects of huge pages and implements hugepage_movable_supported() which
      can then be used during allocation to decide whether to place the huge
      page in movable zone or not.
      
      This patch (of 5):
      
      During huge page allocation, its migratability is checked to determine
      if it should be placed under movable zones with GFP_HIGHUSER_MOVABLE.
      But the movability aspect of the huge page could depend on factors
      other than just migratability.  Movability in itself is a distinct
      property which should not be tied to migratability alone.
      
      This differentiates the two and implements an enhanced movability
      check which also considers huge page size to determine if it is
      feasible to be placed under a movable zone.  At present it just checks
      for gigantic pages, but going forward it can incorporate other
      enhanced checks.
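
      A sketch of the enhanced check, assuming the helper is named
      hugepage_movable_supported() as the cover letter says and that
      gigantic pages are detected via hstate_is_gigantic():

              static inline bool hugepage_movable_supported(struct hstate *h)
              {
                      if (!hugepage_migration_supported(h))
                              return false;

                      /* Gigantic pages are migratable but not movable */
                      if (hstate_is_gigantic(h))
                              return false;

                      return true;
              }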
      
      Link: http://lkml.kernel.org/r/1545121450-1663-2-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: Steve Capper <steve.capper@arm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Suggested-by: Michal Hocko <mhocko@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7ed2c31d
    • mm: remove sysctl_extfrag_handler() · 6b7e5cad
      Committed by Matthew Wilcox
      sysctl_extfrag_handler() neglects to propagate the return value from
      proc_dointvec_minmax() to its caller.  It's a wrapper that doesn't need
      to exist, so just use proc_dointvec_minmax() directly.
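
      The fix amounts to pointing the sysctl table entry straight at the
      generic handler; a sketch of the affected ctl_table entry (field
      values approximate kernel/sysctl.c at the time):

              {
                      .procname     = "extfrag_threshold",
                      .data         = &sysctl_extfrag_threshold,
                      .maxlen       = sizeof(int),
                      .mode         = 0644,
                      /* was sysctl_extfrag_handler, which swallowed errors */
                      .proc_handler = proc_dointvec_minmax,
                      .extra1       = &min_extfrag_threshold,
                      .extra2       = &max_extfrag_threshold,
              },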
      
      Link: http://lkml.kernel.org/r/20190104032557.3056-1-willy@infradead.org
      Signed-off-by: Matthew Wilcox <willy@infradead.org>
      Reported-by: Aditya Pakki <pakki001@umn.edu>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6b7e5cad
    • memcg: localize memcg_kmem_enabled() check · 60cd4bcd
      Committed by Shakeel Butt
      Move the memcg_kmem_enabled() checks into the memcg kmem
      charge/uncharge functions, so the users don't have to explicitly check
      that condition.

      This is purely a code cleanup patch without any functional change.
      Only the order of checks in memcg_charge_slab() can potentially be
      changed, but functionally it will be the same.  This should not
      matter, as memcg_charge_slab() is not in the hot path.
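
      A sketch of the pattern: the static-key check moves into an inline
      wrapper so callers no longer repeat it (signatures modeled on the
      memcg kmem API of that era and assumed here):

              static inline int memcg_kmem_charge(struct page *page,
                                                  gfp_t gfp, int order)
              {
                      /* The static key keeps this a near-free check */
                      if (memcg_kmem_enabled())
                              return __memcg_kmem_charge(page, gfp, order);
                      return 0;
              }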
      
      Link: http://lkml.kernel.org/r/20190103161203.162375-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      60cd4bcd
    • mm: reuse only-pte-mapped KSM page in do_wp_page() · 52d1e606
      Committed by Kirill Tkhai
      Add an optimization for KSM pages almost in the same way that we have
      for ordinary anonymous pages.  If there is a write fault in a page
      which is mapped by only a single pte and is not related to the swap
      cache, the page may be reused without copying its content.
      
      [ Note that we do not consider PageSwapCache() pages, at least for now,
        since we don't want to complicate __get_ksm_page(), which has a nice
        optimization based on this (for the migration case). Currently it
        spins on PageSwapCache() pages, waiting for their counters to be
        unfrozen (i.e., for the migration to finish). But we don't want to
        make it also spin on swap cache pages which we try to reuse, since
        there is not a very high probability of reusing them. So, for now
        we do not consider PageSwapCache() pages at all. ]
      
      So in reuse_ksm_page() we check for 1) PageSwapCache() and 2)
      page_stable_node(), to skip a page which KSM is currently trying to
      link to the stable tree.  Then we do page_ref_freeze() to prohibit KSM
      from merging one more page into the page we are reusing.  After that,
      nobody can refer to the reused page: KSM skips !PageSwapCache() pages
      with zero refcount, and the protection against all other participants
      is the same as for reused ordinary anon pages: pte lock, page lock and
      mmap_sem.
      
      [akpm@linux-foundation.org: replace BUG_ON()s with WARN_ON()s]
      Link: http://lkml.kernel.org/r/154471491016.31352.1168978849911555609.stgit@localhost.localdomain
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      52d1e606
    • mm: replace all open encodings for NUMA_NO_NODE · 98fa15f3
      Committed by Anshuman Khandual
      Patch series "Replace all open encodings for NUMA_NO_NODE", v3.
      
      All these places for replacement were found by running the following
      grep patterns on the entire kernel code.  Please let me know if this
      might have missed some instances.  This might also have replaced some
      false positives.  I will appreciate suggestions, inputs and review.
      
      1. git grep "nid == -1"
      2. git grep "node == -1"
      3. git grep "nid = -1"
      4. git grep "node = -1"
      
      This patch (of 2):
      
      At present there are multiple places where an invalid node number is
      encoded as -1.  Even though implicitly understood, it is always better
      to have macros there.  Replace these open encodings of an invalid node
      number with the global macro NUMA_NO_NODE.  This helps remove NUMA
      related assumptions like 'invalid node' from various places,
      redirecting them to a common definition.
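
      The transformation itself is mechanical; an illustrative before/after
      (NUMA_NO_NODE is defined as -1 in include/linux/numa.h):

              /* Before: open-coded invalid node */
              if (nid == -1)
                      nid = numa_mem_id();

              /* After: named constant */
              if (nid == NUMA_NO_NODE)
                      nid = numa_mem_id();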
      
      Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	[ixgbe]
      Acked-by: Jens Axboe <axboe@kernel.dk>			[mtip32xx]
      Acked-by: Vinod Koul <vkoul@kernel.org>			[dmaengine.c]
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>		[powerpc]
      Acked-by: Doug Ledford <dledford@redhat.com>		[drivers/infiniband]
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Hans Verkuil <hverkuil@xs4all.nl>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      98fa15f3
    • mm: convert PG_balloon to PG_offline · ca215086
      Committed by David Hildenbrand
      PG_balloon was introduced to implement page migration/compaction for
      pages inflated in virtio-balloon.  Nowadays, it is only a marker that a
      page is part of virtio-balloon and therefore logically offline.
      
      We also want to make use of this flag in other balloon drivers - for
      inflated pages or when onlining a section but keeping some pages offline
      (e.g.  used right now by XEN and Hyper-V via set_online_page_callback()).
      
      We are going to expose this flag to dump tools like makedumpfile.  But
      instead of exposing PG_balloon, let's generalize the concept of marking
      pages as logically offline, so it can be reused for other purposes later
      on.
      
      Rename PG_balloon to PG_offline.  This is an indicator that the page
      is logically offline, its content stale, and that it should not be
      touched (e.g.  a hypervisor would have to allocate backing storage in
      order for the guest to dump an unused page).  We can then e.g.
      exclude such pages from dumps.
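
      Since PG_balloon is a page-type bit encoded in the mapcount field, the
      rename plausibly reads as a page-type conversion; a sketch under that
      assumption:

              /* include/linux/page-flags.h (sketch) */
              #define PG_offline      0x00000100
              PAGE_TYPE_OPS(Offline, offline)

              /* a balloon driver marking an inflated page */
              __SetPageOffline(page);
              /* ... and clearing it again on deflate */
              __ClearPageOffline(page);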
      
      We replace and reuse KPF_BALLOON (23), as this shouldn't really harm
      (and for now the semantics stay the same).  In following patches, we
      will make use of this bit also in other balloon drivers.  While at it,
      document PGTABLE.
      
      [akpm@linux-foundation.org: fix comment text, per David]
      Link: http://lkml.kernel.org/r/20181119101616.8901-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Pankaj gupta <pagupta@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Christian Hansen <chansen3@cisco.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Julien Freche <jfreche@vmware.com>
      Cc: Kairui Song <kasong@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Lianbo Jiang <lijiang@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xavier Deguillard <xdeguillard@vmware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ca215086
    • mm: balloon: update comment about isolation/migration/compaction · 4d3467e1
      Committed by David Hildenbrand
      Patch series "mm/kdump: allow to exclude pages that are logically
      offline"
      
      Right now, pages inflated as part of a balloon driver will be dumped
      by dump tools like makedumpfile.  While XEN is able to check in the
      crash kernel whether a certain pfn is actually backed by memory in the
      hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps
      of virtio-balloon, hv-balloon and VMWare balloon inflated memory will
      essentially result in zero pages getting allocated by the hypervisor
      and the dump getting filled with this data.
      
      The allocation and reading of zero pages can directly be avoided if a
      dumping tool could know which pages only contain stale information not
      to be dumped.
      
      Also for XEN, calling into the kernel and asking the hypervisor if a
      pfn is backed can be avoided if the dumping tool would skip such pages
      right from the beginning.
      
      Dumping tools have no idea whether a given page is part of a balloon
      driver and shall not be dumped.  Esp.  PG_reserved cannot be used for
      that purpose as all memory allocated during early boot is also
      PG_reserved, see discussion at [1].  So some other way of indication is
      required and a new page flag is frowned upon.
      
      We have PG_balloon (a MAPCOUNT value), which is essentially unused
      now.  I suggest renaming it to something more generic (PG_offline) to
      mark pages as logically offline.  This flag can then e.g.  also be
      used by virtio-mem in the future to mark subsections as offline.  Or
      by other code that wants to put pages logically offline (e.g.  later
      maybe poisoned pages that shall no longer be used).
      
      This series converts PG_balloon to PG_offline, allows dumping tools to
      query the value to detect such pages and marks pages in the hv-balloon
      and XEN balloon properly as PG_offline.  Note that virtio-balloon
      already set pages to PG_balloon (and now PG_offline).
      
      Please note that this is also helpful for a problem we were seeing
      under Hyper-V: dumping logically offline memory (pages kept fake
      offline while onlining a section via online_page_callback) would
      under some conditions result in a kernel panic when dumping them.
      
      As I don't have access to XEN, Hyper-V or VMWare installations, this
      was only tested with the virtio-balloon, and pages were properly
      skipped when dumping.  I'll also attach the makedumpfile patch to
      this series.
      
      [1] https://lkml.org/lkml/2018/7/20/566
      
      This patch (of 8):
      
      Commit b1123ea6 ("mm: balloon: use general non-lru movable page
      feature") reworked balloon handling to make use of the general non-lru
      movable page feature.  The big comment block in balloon_compaction.h
      contains quite some outdated information.  Let's fix this.
      
      Link: http://lkml.kernel.org/r/20181119101616.8901-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Christian Hansen <chansen3@cisco.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Julien Freche <jfreche@vmware.com>
      Cc: Kairui Song <kasong@redhat.com>
      Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Lianbo Jiang <lijiang@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Pankaj gupta <pagupta@redhat.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xavier Deguillard <xdeguillard@vmware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4d3467e1
    • mm/page_alloc.c: memory hotplug: free pages as higher order · a9cd410a
      Committed by Arun KS
      When pages are freed at a higher order, the time spent coalescing
      pages in the buddy allocator can be reduced.  With a section size of
      256MB, the hot-add latency of a single section improves from 50-60 ms
      to less than 1 ms, hence improving the hot-add latency by 60 times.
      Modify external providers of the online callback to align with the
      change.
      
      [arunks@codeaurora.org: v11]
        Link: http://lkml.kernel.org/r/1547792588-18032-1-git-send-email-arunks@codeaurora.org
      [akpm@linux-foundation.org: remove unused local, per Arun]
      [akpm@linux-foundation.org: avoid return of void-returning __free_pages_core(), per Oscar]
      [akpm@linux-foundation.org: fix it for mm-convert-totalram_pages-and-totalhigh_pages-variables-to-atomic.patch]
      [arunks@codeaurora.org: v8]
        Link: http://lkml.kernel.org/r/1547032395-24582-1-git-send-email-arunks@codeaurora.org
      [arunks@codeaurora.org: v9]
        Link: http://lkml.kernel.org/r/1547098543-26452-1-git-send-email-arunks@codeaurora.org
      Link: http://lkml.kernel.org/r/1538727006-5727-1-git-send-email-arunks@codeaurora.org
      Signed-off-by: Arun KS <arunks@codeaurora.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Srivatsa Vaddagiri <vatsa@codeaurora.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9cd410a
    • include/linux/slub_def.h: comment fixes · de810f49
      Committed by Tobin C. Harding
      Capitalize comment strings, use C89 comment style, and correct
      grammar/punctuation in comments.
      
      Link: http://lkml.kernel.org/r/20190204005713.9463-2-tobin@kernel.org
      Link: http://lkml.kernel.org/r/20190204005713.9463-3-tobin@kernel.org
      Link: http://lkml.kernel.org/r/20190204005713.9463-4-tobin@kernel.org
      Signed-off-by: Tobin C. Harding <tobin@kernel.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      de810f49
    • kasan: fix kasan_check_read/write definitions · bcf6f55a
      Committed by Arnd Bergmann
      Building little-endian allmodconfig kernels on arm64 started failing
      with the generated atomic.h implementation, since we now try to call
      kasan helpers from the EFI stub:
      
        aarch64-linux-gnu-ld: drivers/firmware/efi/libstub/arm-stub.stub.o: in function `atomic_set':
        include/generated/atomic-instrumented.h:44: undefined reference to `__efistub_kasan_check_write'
      
      I suspect that we get similar problems in other files that explicitly
      disable KASAN for some reason but call atomic_t based helper functions.
      
      We can fix this by checking the predefined __SANITIZE_ADDRESS__ macro
      that the compiler sets instead of checking CONFIG_KASAN, but this in
      turn requires a small hack in mm/kasan/common.c so we do see the extern
      declaration there instead of the inline function.
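
      A sketch of the guard change in kasan-checks.h; the prototypes match
      the kasan_check_read/write helpers, but treat the exact layout as an
      approximation:

              #ifdef __SANITIZE_ADDRESS__
              /* real checks, only where this translation unit is instrumented */
              void kasan_check_read(const volatile void *p, unsigned int size);
              void kasan_check_write(const volatile void *p, unsigned int size);
              #else
              static inline void kasan_check_read(const volatile void *p,
                                                  unsigned int size) { }
              static inline void kasan_check_write(const volatile void *p,
                                                   unsigned int size) { }
              #endif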
      
      Link: http://lkml.kernel.org/r/20181211133453.2835077-1-arnd@arndb.de
      Fixes: b1864b828644 ("locking/atomics: build atomic headers as required")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reported-by: Anders Roxell <anders.roxell@linaro.org>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bcf6f55a
2. 05 Mar 2019, 1 commit
    • aio: simplify - and fix - fget/fput for io_submit() · 84c4e1f8
      Committed by Linus Torvalds
      Al Viro root-caused a race where the IOCB_CMD_POLL handling of
      fget/fput() could cause us to access the file pointer after it had
      already been freed:
      
       "In more details - normally IOCB_CMD_POLL handling looks so:
      
         1) io_submit(2) allocates aio_kiocb instance and passes it to
            aio_poll()
      
         2) aio_poll() resolves the descriptor to struct file by req->file =
            fget(iocb->aio_fildes)
      
         3) aio_poll() sets ->woken to false and raises ->ki_refcnt of that
            aio_kiocb to 2 (bumps by 1, that is).
      
         4) aio_poll() calls vfs_poll(). After sanity checks (basically,
            "poll_wait() had been called and only once") it locks the queue.
            That's what the extra reference to iocb had been for - we know we
            can safely access it.
      
         5) With queue locked, we check if ->woken has already been set to
            true (by aio_poll_wake()) and, if it had been, we unlock the
            queue, drop a reference to aio_kiocb and bugger off - at that
            point it's a responsibility to aio_poll_wake() and the stuff
            called/scheduled by it. That code will drop the reference to file
            in req->file, along with the other reference to our aio_kiocb.
      
         6) otherwise, we see whether we need to wait. If we do, we unlock the
            queue, drop one reference to aio_kiocb and go away - eventual
            wakeup (or cancel) will deal with the reference to file and with
            the other reference to aio_kiocb
      
         7) otherwise we remove ourselves from waitqueue (still under the
            queue lock), so that wakeup won't get us. No async activity will
            be happening, so we can safely drop req->file and iocb ourselves.
      
        If wakeup happens while we are in vfs_poll(), we are fine - aio_kiocb
        won't get freed under us, so we can do all the checks and locking
        safely. And we don't touch ->file if we detect that case.
      
        However, vfs_poll() most certainly *does* touch the file it had been
        given. So wakeup coming while we are still in ->poll() might end up
        doing fput() on that file. That case is not too rare, and usually we
        are saved by the still present reference from descriptor table - that
        fput() is not the final one.
      
        But if another thread closes that descriptor right after our fget()
        and wakeup does happen before ->poll() returns, we are in trouble -
        final fput() done while we are in the middle of a method:
      
      Al also wrote a patch to take an extra reference to the file descriptor
      to fix this, but I instead suggested we just streamline the whole file
      pointer handling in io_submit() so that the generic aio submission code
      simply keeps the file pointer around until the aio has completed.
      
      Fixes: bfe4037e ("aio: implement IOCB_CMD_POLL")
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Reported-by: syzbot+503d4cc169fcec1cb18c@syzkaller.appspotmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      84c4e1f8
3. 04 Mar 2019, 8 commits
    • net: phy: remove gen10g_no_soft_reset · 7be3ad84
      Committed by Heiner Kallweit
      genphy_no_soft_reset and gen10g_no_soft_reset are the same no-op; one
      is enough.
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7be3ad84
    • net: phy: don't export gen10g_read_status · d81210c2
      Committed by Heiner Kallweit
      gen10g_read_status is deprecated, so stop exporting it.
      We don't want to encourage anybody to use it.
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d81210c2
    • H
      net: phy: remove gen10g_config_init · c5e91d39
      Heiner Kallweit committed
      ETHTOOL_LINK_MODE_10000baseT_Full_BIT is set anyway in the supported
      and advertising bitmaps because it's part of PHY_10GBIT_FEATURES,
      and all users of gen10g_config_init use PHY_10GBIT_FEATURES.
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c5e91d39
    • H
      net: phy: remove gen10g_suspend and gen10g_resume · a6d0aa97
      Heiner Kallweit committed
      phy_suspend() and phy_resume() are no-ops anyway if no callback is
      defined, so we don't need these stubs (the NULL-callback dispatch
      pattern is sketched after this entry).
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a6d0aa97
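      The pattern this commit relies on is worth spelling out: the core
      dispatcher treats a missing callback as a successful no-op, which is
      what makes explicit empty stubs redundant. A hedged sketch of that
      shape, with phy_ops_model and the function names invented for
      illustration:

          #include <stdio.h>
          #include <stddef.h>

          struct phy_ops_model {
                  int (*suspend)(void *dev);  /* may be NULL */
          };

          /* a NULL callback means "nothing to do, report success", so
           * drivers never need to install an empty stub */
          static int phy_suspend_model(const struct phy_ops_model *ops, void *dev)
          {
                  if (!ops->suspend)
                          return 0;
                  return ops->suspend(dev);
          }

          int main(void)
          {
                  struct phy_ops_model ops = { .suspend = NULL };
                  printf("suspend rc = %d\n", phy_suspend_model(&ops, NULL));
                  return 0;
          }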
    • F
      net: ipv6: add socket option IPV6_ROUTER_ALERT_ISOLATE · 9036b2fe
      Francesco Ruggeri committed
      By default, an IPv6 socket with the IPV6_ROUTER_ALERT socket option
      set will receive all IPv6 RA packets from all namespaces.
      The IPV6_ROUTER_ALERT_ISOLATE socket option restricts the packets
      received by the socket to those from the socket's own namespace
      (a usage sketch follows this entry).
      Signed-off-by: Maxim Martynov <maxim@arista.com>
      Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9036b2fe
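      A hedged usage sketch of the option pair on a raw IPPROTO_RAW socket
      (RA packets are delivered only to such sockets, and opening one
      requires CAP_NET_RAW). The fallback #define mirrors the value this
      commit adds to the uapi header; treat it as an assumption if your
      headers differ:

          #include <stdio.h>
          #include <sys/socket.h>
          #include <netinet/in.h>

          #ifndef IPV6_ROUTER_ALERT_ISOLATE
          #define IPV6_ROUTER_ALERT_ISOLATE 30  /* assumed uapi value */
          #endif

          int main(void)
          {
                  int ra_val = 0;  /* router-alert value to match (0 = MLD) */
                  int on = 1;
                  int s = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);

                  if (s < 0) { perror("socket"); return 1; }

                  /* subscribe to RA-tagged packets ... */
                  if (setsockopt(s, IPPROTO_IPV6, IPV6_ROUTER_ALERT,
                                 &ra_val, sizeof(ra_val)) < 0)
                          perror("IPV6_ROUTER_ALERT");

                  /* ... but only those from this socket's own namespace */
                  if (setsockopt(s, IPPROTO_IPV6, IPV6_ROUTER_ALERT_ISOLATE,
                                 &on, sizeof(on)) < 0)
                          perror("IPV6_ROUTER_ALERT_ISOLATE");
                  return 0;
          }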
    • A
      regulator: core: Add set/get_current_limit helpers for regmap users · a32e0c77
      Axel Lin committed
      By setting curr_table, n_current_limits, csel_reg and csel_mask,
      regmap users can use regulator_set_current_limit_regmap and
      regulator_get_current_limit_regmap as their set/get_current_limit
      callbacks (a modelled example follows this entry).
      Signed-off-by: Axel Lin <axel.lin@ingics.com>
      Signed-off-by: Mark Brown <broonie@kernel.org>
      a32e0c77
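      A modelled example of what a driver would fill in. The field names
      mirror the commit (curr_table, n_current_limits, csel_reg,
      csel_mask), but the struct and register values below are stand-ins,
      not a real chip description:

          #include <stdio.h>

          struct regulator_desc_model {
                  const unsigned int *curr_table;  /* selector -> microamps */
                  int n_current_limits;
                  unsigned int csel_reg;           /* register holding the selector */
                  unsigned int csel_mask;          /* selector bits within it */
          };

          /* e.g. two CSEL bits choosing among four current limits */
          static const unsigned int limits_uA[] = {
                  500000, 1000000, 1500000, 2000000,
          };

          int main(void)
          {
                  struct regulator_desc_model desc = {
                          .curr_table       = limits_uA,
                          .n_current_limits = 4,
                          .csel_reg         = 0x02,
                          .csel_mask        = 0x03,
                  };

                  /* the generic helpers scan curr_table for the requested
                   * limit and read/write the index via csel_reg/csel_mask */
                  for (int i = 0; i < desc.n_current_limits; i++)
                          printf("sel %d -> %u uA\n", i, desc.curr_table[i]);
                  return 0;
          }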
    • A
      regulator: Fix comment for csel_reg and csel_mask · 35d838ff
      Axel Lin committed
      The csel_reg and csel_mask fields in struct regulator_desc need to
      be generic for drivers, not just for TPS65218.
      Signed-off-by: Axel Lin <axel.lin@ingics.com>
      Signed-off-by: Mark Brown <broonie@kernel.org>
      35d838ff
    • Y
      appletalk: Fix use-after-free in atalk_proc_exit · 6377f787
      YueHaibing committed
      KASAN reported this:
      
      BUG: KASAN: use-after-free in pde_subdir_find+0x12d/0x150 fs/proc/generic.c:71
      Read of size 8 at addr ffff8881f41fe5b0 by task syz-executor.0/2806
      
      CPU: 0 PID: 2806 Comm: syz-executor.0 Not tainted 5.0.0-rc7+ #45
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0xfa/0x1ce lib/dump_stack.c:113
       print_address_description+0x65/0x270 mm/kasan/report.c:187
       kasan_report+0x149/0x18d mm/kasan/report.c:317
       pde_subdir_find+0x12d/0x150 fs/proc/generic.c:71
       remove_proc_entry+0xe8/0x420 fs/proc/generic.c:667
       atalk_proc_exit+0x18/0x820 [appletalk]
       atalk_exit+0xf/0x5a [appletalk]
       __do_sys_delete_module kernel/module.c:1018 [inline]
       __se_sys_delete_module kernel/module.c:961 [inline]
       __x64_sys_delete_module+0x3dc/0x5e0 kernel/module.c:961
       do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x462e99
      Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007fb2de6b9c58 EFLAGS: 00000246 ORIG_RAX: 00000000000000b0
      RAX: ffffffffffffffda RBX: 000000000073bf00 RCX: 0000000000462e99
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000200001c0
      RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007fb2de6ba6bc
      R13: 00000000004bccaa R14: 00000000006f6bc8 R15: 00000000ffffffff
      
      Allocated by task 2806:
       set_track mm/kasan/common.c:85 [inline]
       __kasan_kmalloc.constprop.3+0xa0/0xd0 mm/kasan/common.c:496
       slab_post_alloc_hook mm/slab.h:444 [inline]
       slab_alloc_node mm/slub.c:2739 [inline]
       slab_alloc mm/slub.c:2747 [inline]
       kmem_cache_alloc+0xcf/0x250 mm/slub.c:2752
       kmem_cache_zalloc include/linux/slab.h:730 [inline]
       __proc_create+0x30f/0xa20 fs/proc/generic.c:408
       proc_mkdir_data+0x47/0x190 fs/proc/generic.c:469
       0xffffffffc10c01bb
       0xffffffffc10c0166
       do_one_initcall+0xfa/0x5ca init/main.c:887
       do_init_module+0x204/0x5f6 kernel/module.c:3460
       load_module+0x66b2/0x8570 kernel/module.c:3808
       __do_sys_finit_module+0x238/0x2a0 kernel/module.c:3902
       do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Freed by task 2806:
       set_track mm/kasan/common.c:85 [inline]
       __kasan_slab_free+0x130/0x180 mm/kasan/common.c:458
       slab_free_hook mm/slub.c:1409 [inline]
       slab_free_freelist_hook mm/slub.c:1436 [inline]
       slab_free mm/slub.c:2986 [inline]
       kmem_cache_free+0xa6/0x2a0 mm/slub.c:3002
       pde_put+0x6e/0x80 fs/proc/generic.c:647
       remove_proc_entry+0x1d3/0x420 fs/proc/generic.c:684
       0xffffffffc10c031c
       0xffffffffc10c0166
       do_one_initcall+0xfa/0x5ca init/main.c:887
       do_init_module+0x204/0x5f6 kernel/module.c:3460
       load_module+0x66b2/0x8570 kernel/module.c:3808
       __do_sys_finit_module+0x238/0x2a0 kernel/module.c:3902
       do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The buggy address belongs to the object at ffff8881f41fe500
       which belongs to the cache proc_dir_entry of size 256
      The buggy address is located 176 bytes inside of
       256-byte region [ffff8881f41fe500, ffff8881f41fe600)
      The buggy address belongs to the page:
      page:ffffea0007d07f80 count:1 mapcount:0 mapping:ffff8881f6e69a00 index:0x0
      flags: 0x2fffc0000000200(slab)
      raw: 02fffc0000000200 dead000000000100 dead000000000200 ffff8881f6e69a00
      raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff8881f41fe480: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
       ffff8881f41fe500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      >ffff8881f41fe580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                           ^
       ffff8881f41fe600: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
       ffff8881f41fe680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      
      atalk_init() should check whether atalk_proc_init() fails; otherwise
      atalk_exit() will trigger a use-after-free in pde_subdir_find() when
      the module is unloaded. This patch fixes the error cleanup path of
      atalk_init() (the unwind pattern is sketched after this entry).
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6377f787
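      The shape of the fix, as a hedged sketch: each init step is checked
      and earlier steps are unwound on failure, so module exit never tears
      down state that was never set up. Function names are invented; this
      is not the actual atalk_init():

          #include <stdio.h>

          static int  proc_init_model(void)       { return -1; /* simulate failure */ }
          static int  sock_register_model(void)   { return 0; }
          static void sock_unregister_model(void) { printf("sock teardown\n"); }

          static int atalk_init_model(void)
          {
                  int rc;

                  rc = sock_register_model();
                  if (rc)
                          goto out;

                  rc = proc_init_model();
                  if (rc)              /* the missing check in the bug */
                          goto out_sock;

                  return 0;

          out_sock:
                  sock_unregister_model();
          out:
                  return rc;
          }

          int main(void)
          {
                  printf("init rc = %d\n", atalk_init_model());
                  return 0;
          }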
  4. 02 Mar 2019, 2 commits
  5. 01 Mar 2019, 1 commit
  6. 28 Feb 2019, 2 commits
    • A
      mmc: core: Add discard support to sd · bc47e2f6
      Avri Altman committed
      SD spec v5.1 adds discard support. The flows and commands are similar to
      mmc, so just set the discard arg in CMD38.
      
      A host which supports DISCARD shall check if the DISCARD_SUPPORT bit
      (b313) is set in the SD_STATUS register. If the card does not support
      discard, the host shall not issue the DISCARD command, but the ERASE
      command instead.
      
      After the DISCARD operation, the card may de-allocate the discarded
      blocks partially or completely, so the host mustn't make any
      assumptions about the content of the discarded region. This is unlike
      the ERASE command, where the region is guaranteed to contain either
      '0's or '1's, depending on the content of DATA_STAT_AFTER_ERASE (b55)
      in the SCR register.
      
      One more important difference compared to ERASE is the busy timeout,
      which we will address in the next patch (the DISCARD/ERASE fallback
      decision is sketched after this entry).
      Signed-off-by: Avri Altman <avri.altman@wdc.com>
      Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
      bc47e2f6
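      The decision described above, as a sketch. SD_ERASE_ARG and
      SD_DISCARD_ARG mirror the CMD38 argument values this patch
      introduces (0x0 and 0x1), and discard_support stands for the
      DISCARD_SUPPORT (b313) bit read from SD_STATUS; treat the exact
      values as assumptions:

          #include <stdio.h>

          #define SD_ERASE_ARG   0x00000000u  /* CMD38: plain ERASE */
          #define SD_DISCARD_ARG 0x00000001u  /* CMD38: DISCARD (SD spec v5.1) */

          /* fall back to ERASE when the card cannot discard */
          static unsigned int sd_erase_arg_model(int discard_support, int want_discard)
          {
                  if (want_discard && discard_support)
                          return SD_DISCARD_ARG;
                  return SD_ERASE_ARG;
          }

          int main(void)
          {
                  printf("card with discard:    arg=0x%x\n", sd_erase_arg_model(1, 1));
                  printf("card without discard: arg=0x%x\n", sd_erase_arg_model(0, 1));
                  return 0;
          }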
    • F
      net: Remove switchdev_ops · 3d705f07
      Florian Fainelli committed
      Now that we have converted all possible callers to using a switchdev
      notifier for attributes, we no longer need to implement
      switchdev_ops, and it can be removed from all drivers and from the
      net_device structure (the notifier pattern is modelled after this
      entry).
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3d705f07
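      A userspace model of the direction this commit completes: instead of
      a per-netdev ops struct, interested drivers subscribe to a shared
      notifier chain and react to attribute events. Everything below (the
      chain, the event number) is a simplified stand-in for the kernel's
      notifier infrastructure, not its API:

          #include <stdio.h>

          typedef int (*notifier_fn)(unsigned long event, void *ptr);

          #define MAX_SUBSCRIBERS 8
          static notifier_fn chain[MAX_SUBSCRIBERS];
          static int nsubs;

          static void register_notifier_model(notifier_fn fn)
          {
                  if (nsubs < MAX_SUBSCRIBERS)
                          chain[nsubs++] = fn;
          }

          static void call_chain_model(unsigned long event, void *ptr)
          {
                  for (int i = 0; i < nsubs; i++)
                          chain[i](event, ptr);
          }

          /* a driver's handler for e.g. a port-attribute-set event */
          static int port_attr_handler(unsigned long event, void *ptr)
          {
                  (void)ptr;
                  printf("driver handles attr event %lu\n", event);
                  return 0;
          }

          int main(void)
          {
                  register_notifier_model(port_attr_handler);
                  call_chain_model(1 /* stand-in for SWITCHDEV_PORT_ATTR_SET */, NULL);
                  return 0;
          }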