1. 04 May 2017 (10 commits)
    • mm: introduce memalloc_nofs_{save,restore} API · 7dea19f9
      Authored by Michal Hocko
      GFP_NOFS context is currently used for the following five reasons:
      
       - to prevent deadlocks when a lock held by the allocation
         context would be needed during memory reclaim
      
       - to prevent stack overflows during reclaim because the
         allocation is already performed from a deep call context
      
       - to prevent lockups when the allocation context indirectly
         depends on other reclaimers to make forward progress
      
       - just in case, because it would be safe from the FS point of view
      
       - to silence lockdep false positives
      
      Unfortunately, overuse of this allocation context brings problems
      to the MM: memory reclaim becomes much weaker (especially during
      heavy FS metadata workloads), and the OOM killer cannot be invoked
      because the MM layer doesn't have enough information about how
      much memory is freeable by the FS layer.
      
      In many cases it is far from clear why the weaker context is even
      used, and so it might be used unnecessarily.  We would like to get
      rid of those cases as much as possible.  One way to do that is to
      use the flag in scopes rather than for isolated allocations.  Such
      a scope is declared only when really necessary, is tracked per
      task, and all allocation requests from within the context simply
      inherit the GFP_NOFS semantic.
      
      Not only is this easier to understand and maintain, because there
      are far fewer problematic contexts than specific allocation
      requests; it also helps code paths where the FS layer interacts
      with other layers (e.g. crypto, security modules, MM, etc.) and
      there is no easy way to convey the allocation context between the
      layers.
      
      Introduce the memalloc_nofs_{save,restore} API to control the
      scope of the GFP_NOFS allocation context.  This basically copies
      the memalloc_noio_{save,restore} API we have for the other
      restricted allocation context, GFP_NOIO.  The PF_MEMALLOC_NOFS
      flag already exists and is just an alias for PF_FSTRANS, which
      until recently was xfs-specific.  There are no PF_FSTRANS users
      anymore, so let's just drop it.
      
      PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
      implicitly, the same way PF_MEMALLOC_NOIO drops __GFP_IO.
      memalloc_noio_flags is renamed to current_gfp_context because it
      now cares about both PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO
      contexts.  XFS code paths preserve their semantics.
      kmem_flags_convert() doesn't need to evaluate the flag anymore.
      
      This patch shouldn't introduce any functional changes.
      
      Let's hope that filesystems will drop direct GFP_NOFS (resp.
      ~__GFP_FS) usage as much as possible and only use properly
      documented memalloc_nofs_{save,restore} checkpoints where
      appropriate.
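
      A minimal sketch of such a scope, based on the description above
      (illustrative only, not code from the patch):

        #include <linux/sched/mm.h>
        #include <linux/slab.h>

        static void *fs_critical_alloc(size_t size)
        {
                unsigned int nofs_flags;
                void *p;

                /* enter the scope, e.g. while holding a lock that
                 * reclaim could need */
                nofs_flags = memalloc_nofs_save();

                /* GFP_KERNEL is implicitly degraded to GFP_NOFS here */
                p = kmalloc(size, GFP_KERNEL);

                memalloc_nofs_restore(nofs_flags);
                return p;
        }

      Everything called between save and restore inherits the NOFS
      semantic without passing GFP_NOFS explicitly.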
      
      [akpm@linux-foundation.org: fix comment typo, reflow comment]
      Link: http://lkml.kernel.org/r/20170306131408.9828-5-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS · 9070733b
      Authored by Michal Hocko
      xfs defined PF_FSTRANS to declare a scoped GFP_NOFS semantic quite
      some time ago.  We would like to make this concept more generic
      and use it for other filesystems as well.  Let's start by giving
      the flag a more generic name, PF_MEMALLOC_NOFS, in line with the
      existing PF_MEMALLOC_NOIO already used for the same purpose for
      GFP_NOIO contexts.  As a first step, replace all PF_FSTRANS usage
      in the xfs code before we introduce a full API for it, since xfs
      uses the flag directly anyway.
      
      This patch doesn't introduce any functional change.
      
      Link: http://lkml.kernel.org/r/20170306131408.9828-4-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lockdep: allow to disable reclaim lockup detection · 7e784422
      Authored by Michal Hocko
      The current implementation of the reclaim lockup detection can
      lead to false positives, and those do happen and usually lead to
      tweaking the code to silence lockdep by using GFP_NOFS even though
      the context could use __GFP_FS just fine.
      
      See
      
        http://lkml.kernel.org/r/20160512080321.GA18496@dastard
      
      as an example.
      
        =================================
        [ INFO: inconsistent lock state ]
        4.5.0-rc2+ #4 Tainted: G           O
        ---------------------------------
        inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
        kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:
      
        (&xfs_nondir_ilock_class){++++-+}, at: xfs_ilock+0x177/0x200 [xfs]
      
        {RECLAIM_FS-ON-R} state was registered at:
          mark_held_locks+0x79/0xa0
          lockdep_trace_alloc+0xb3/0x100
          kmem_cache_alloc+0x33/0x230
          kmem_zone_alloc+0x81/0x120 [xfs]
          xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
          __xfs_refcount_find_shared+0x75/0x580 [xfs]
          xfs_refcount_find_shared+0x84/0xb0 [xfs]
          xfs_getbmap+0x608/0x8c0 [xfs]
          xfs_vn_fiemap+0xab/0xc0 [xfs]
          do_vfs_ioctl+0x498/0x670
          SyS_ioctl+0x79/0x90
          entry_SYSCALL_64_fastpath+0x12/0x6f
      
               CPU0
               ----
          lock(&xfs_nondir_ilock_class);
          <Interrupt>
            lock(&xfs_nondir_ilock_class);
      
         *** DEADLOCK ***
      
        3 locks held by kswapd0/543:
      
        stack backtrace:
        CPU: 0 PID: 543 Comm: kswapd0 Tainted: G           O    4.5.0-rc2+ #4
        Call Trace:
         lock_acquire+0xd8/0x1e0
         down_write_nested+0x5e/0xc0
         xfs_ilock+0x177/0x200 [xfs]
         xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
         xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
         evict+0xc5/0x190
         dispose_list+0x39/0x60
         prune_icache_sb+0x4b/0x60
         super_cache_scan+0x14f/0x1a0
         shrink_slab.part.63.constprop.79+0x1e9/0x4e0
         shrink_zone+0x15e/0x170
         kswapd+0x4f1/0xa80
         kthread+0xf2/0x110
         ret_from_fork+0x3f/0x70
      
      To quote Dave:
       "Ignoring whether reflink should be doing anything or not, that's a
        "xfs_refcountbt_init_cursor() gets called both outside and inside
        transactions" lockdep false positive case. The problem here is lockdep
        has seen this allocation from within a transaction, hence a GFP_NOFS
        allocation, and now it's seeing it in a GFP_KERNEL context. Also note
        that we have an active reference to this inode.
      
        So, because the reclaim annotations overload the interrupt level
        detections and it's seen the inode ilock been taken in reclaim
        ("interrupt") context, this triggers a reclaim context warning where
        it thinks it is unsafe to do this allocation in GFP_KERNEL context
        holding the inode ilock..."
      
      This sounds like a fundamental problem of the reclaim lock
      detection.  It is really impossible to annotate such a special use
      case, IMHO, unless the reclaim lockup detection is reworked
      completely.  Until then it is much better to provide an "I know
      what I am doing" flag and mark the problematic places.  This would
      prevent abuse of the GFP_NOFS flag, which has a runtime effect
      even on configurations that have lockdep disabled.
      
      Introduce the __GFP_NOLOCKDEP flag, which tells the lockdep gfp
      tracking to skip the current allocation request.

      While we are at it, also make sure that the radix tree doesn't
      accidentally override tags stored in the upper part of the
      gfp_mask.
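
      A hedged sketch of how a known-false-positive allocation site can
      then be annotated instead of being degraded to GFP_NOFS
      (illustrative call site, not from the patch):

        /* lockdep has seen this allocation in both reclaim and normal
         * context and would warn; we know the dependency cannot happen,
         * so skip only the lockdep tracking -- the reclaim behaviour of
         * GFP_KERNEL is unchanged */
        cur = kmem_cache_alloc(cache, GFP_KERNEL | __GFP_NOLOCKDEP);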
      
      Link: http://lkml.kernel.org/r/20170306131408.9828-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use is_migrate_highatomic() to simplify the code · a6ffdc07
      Authored by Xishi Qiu
      Introduce two helpers, is_migrate_highatomic() and is_migrate_highatomic_page().
      
      They simplify the code; no functional changes.
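
      Given the akpm note below, the helpers are static inlines; they
      likely read along these lines (sketched from the changelog, not
      copied from the diff):

        static inline bool is_migrate_highatomic(enum migratetype migratetype)
        {
                return migratetype == MIGRATE_HIGHATOMIC;
        }

        static inline bool is_migrate_highatomic_page(struct page *page)
        {
                return get_pageblock_migratetype(page) == MIGRATE_HIGHATOMIC;
        }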
      
      [akpm@linux-foundation.org: use static inlines rather than macros, per mhocko]
      Link: http://lkml.kernel.org/r/58B94F15.6060606@huawei.com
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: provide shmem statistics · 9a4caf1e
      Authored by Johannes Weiner
      Cgroups currently don't report how much shmem they use, which can be
      useful data to have, in particular since shmem is included in the
      cache/file item while being reclaimed like anonymous memory.
      
      Add a counter to track shmem pages during charging and uncharging.
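
      The changelog doesn't spell out the distinguishing test; a
      plausible sketch (an assumption, not the literal patch) keys off
      the page flags, since shmem is page cache that is swap-backed:

        /* assumption: how a charged page could be classified as shmem */
        static inline bool page_is_shmem_like(struct page *page)
        {
                return !PageAnon(page) && PageSwapBacked(page);
        }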
      
      Link: http://lkml.kernel.org/r/20170221164343.32252-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Chris Down <cdown@fb.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: reclaim MADV_FREE pages · 802a3a92
      Authored by Shaohua Li
      When memory pressure is high, we free MADV_FREE pages.  If the
      pages are not dirty in the pte, they can be freed immediately;
      otherwise we can't reclaim them this way.  We put such pages back
      on the anonymous LRU list (by setting the SwapBacked flag), and
      they are reclaimed via the normal swapout path.
      
      We use the normal page reclaim policy.  Since MADV_FREE pages are
      put on the inactive file list, such pages and inactive file pages
      are reclaimed according to their age.  This is expected, because
      we don't want to reclaim too many MADV_FREE pages before used-once
      file pages.
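
      A condensed sketch of that reclaim decision (illustrative; the
      real logic lives in the unmap path and handles locking and
      reference counts):

        /* pteval is the page's pte as seen during unmap */
        static bool can_free_lazyfree(struct page *page, pte_t pteval)
        {
                if (!PageAnon(page) || PageSwapBacked(page))
                        return false;          /* not a lazyfree page */

                if (pte_dirty(pteval) || PageDirty(page)) {
                        /* dirtied after MADV_FREE: make it a normal
                         * anon page again and let swapout reclaim it */
                        SetPageSwapBacked(page);
                        return false;
                }
                /* clean lazyfree page: can be freed immediately */
                return true;
        }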
      
      Based on Minchan's original patch.
      
      [minchan@kernel.org: clean up lazyfree page handling]
        Link: http://lkml.kernel.org/r/20170303025237.GB3503@bbox
      Link: http://lkml.kernel.org/r/14b8eb1d3f6bf6cc492833f183ac8c304e560484.1487965799.git.shli@fb.com
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: move MADV_FREE pages into LRU_INACTIVE_FILE list · f7ad2a6c
      Authored by Shaohua Li
      madvise()'s MADV_FREE indicates pages are 'lazyfree'.  They are
      still anonymous pages, but they can be freed without pageout.  To
      distinguish them from normal anonymous pages, we clear their
      SwapBacked flag.
      
      MADV_FREE pages can be freed without pageout, so they behave much
      like used-once file pages.  For such pages, we'd like to reclaim
      them once there is memory pressure.  It might be unfair to always
      reclaim MADV_FREE pages before used-once file pages, but we
      definitely want to reclaim them before other anonymous and file
      pages.
      
      To speed up MADV_FREE page reclaim, we put the pages on the
      LRU_INACTIVE_FILE list.  The rationale is that the
      LRU_INACTIVE_FILE list is tiny nowadays and should be full of
      used-once file pages, so reclaiming MADV_FREE pages will not
      interfere much with anonymous and active file pages.  And since
      the inactive file pages and MADV_FREE pages are reclaimed
      according to their age, we don't reclaim too many MADV_FREE pages
      either.  Putting the MADV_FREE pages on the LRU_INACTIVE_FILE list
      also means we can reclaim the pages without swap support.  This
      idea was suggested by Johannes.
      
      This patch doesn't move MADV_FREE pages to the LRU_INACTIVE_FILE
      list yet, to avoid bisect failures; the next patch will do it.
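
      The marking step itself is tiny; sketched from the description
      (illustrative, applied per page during madvise(MADV_FREE)):

        static void mark_lazyfree_sketch(struct page *page)
        {
                /* lazyfree == PageAnon(page) && !PageSwapBacked(page) */
                ClearPageSwapBacked(page);
                /* the next patch additionally moves the page to the
                 * inactive file LRU so it ages like used-once cache */
        }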
      
      The patch is based on Minchan's original patch.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/2f87063c1e9354677b7618c647abde77b07561e5.1487965799.git.shli@fb.com
      Signed-off-by: Shaohua Li <shli@fb.com>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: delete unnecessary TTU_* flags · a128ca71
      Authored by Shaohua Li
      Patch series "mm: fix some MADV_FREE issues", v5.
      
      We are trying to use MADV_FREE in jemalloc, and several issues
      were found.  Without solving these issues, jemalloc can't use the
      MADV_FREE feature.
      
       - It doesn't support systems without swap enabled, because if
         swap is off we can't (or can't efficiently) age anonymous
         pages, and since MADV_FREE pages are mixed with other anonymous
         pages, we can't reclaim them.  In the current implementation,
         MADV_FREE falls back to MADV_DONTNEED when swap is not enabled.
         But in our environment a lot of machines don't enable swap,
         which prevents our setup from using MADV_FREE.
      
       - It increases memory pressure.  Page reclaim biases file page
         reclaim against anonymous pages.  This doesn't make sense for
         MADV_FREE pages, because those pages can be freed easily and
         refilled with a very slight penalty.  Even if page reclaim
         didn't bias file pages, there would still be an issue, because
         MADV_FREE pages and other anonymous pages are mixed together:
         to reclaim a MADV_FREE page, we probably must scan a lot of
         other anonymous pages, which is inefficient.  In our tests, we
         usually see OOMs with MADV_FREE enabled and none without it.
      
       - Accounting.  There are two accounting problems.  We don't have
         global accounting: if the system misbehaves, we don't know
         whether it's a problem on the MADV_FREE side.  The other
         problem is RSS accounting: MADV_FREE pages are accounted as
         normal anon pages and reclaimed lazily, so an application's RSS
         grows.  This confuses our workloads.  We have a monitoring
         daemon running, and if it finds an application's RSS becoming
         abnormal, the daemon kills the application even though the
         kernel could reclaim the memory easily.
      
      To address the first two issues, we can either put MADV_FREE pages
      into a separate LRU list (Minchan's previous patches and the V1
      patches), or put them on the LRU_INACTIVE_FILE list (suggested by
      Johannes).  This patchset uses the second idea.  The reason is
      that the LRU_INACTIVE_FILE list is tiny nowadays and should be
      full of used-once file pages, so we can still efficiently reclaim
      MADV_FREE pages there without interfering with other anon and
      active file pages.  Putting the pages on the inactive file list
      also has the advantage of letting page reclaim prioritize
      MADV_FREE pages and used-once file pages.  MADV_FREE pages are put
      on the LRU list with their SwapBacked flag cleared, so
      PageAnon(page) && !PageSwapBacked(page) indicates a MADV_FREE
      page.  These pages are freed directly without pageout if they are
      clean; otherwise normal swapout reclaims them.
      
      For the third issue, the previous post added global accounting and
      a separate RSS count for MADV_FREE pages.  The problem is that we
      never get accurate accounting for MADV_FREE pages: the pages are
      mapped into userspace and can be dirtied without the kernel
      noticing.  To get accurate accounting we could write-protect the
      page, but then there is extra page fault overhead, which people
      don't want to pay.  The jemalloc folks have concerns about the
      inaccurate accounting, so this post drops the accounting patches
      temporarily.  The info exported to /proc/pid/smaps for MADV_FREE
      pages is kept, which is the only place we can get accurate
      accounting right now.
      
      This patch (of 6):
      
      Johannes pointed out that TTU_LZFREE is unnecessary.  This is true
      because we always have the flag set if we want to do an unmap; for
      the cases where we don't unmap, the TTU_LZFREE part of the code
      should never run.

      TTU_UNMAP is also unnecessary: if no other flag (for example,
      TTU_MIGRATION) is set, an unmap is implied.

      The patch includes Johannes's cleanup and the removal of the dead
      TTU_ACTION macro.
      
      Link: http://lkml.kernel.org/r/4be3ea1bc56b26fd98a54d0a6f70bec63f6d8980.1487965799.git.shli@fb.com
      Signed-off-by: Shaohua Li <shli@fb.com>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: delete NR_PAGES_SCANNED and pgdat_reclaimable() · c822f622
      Authored by Johannes Weiner
      NR_PAGES_SCANNED counts the number of pages scanned since the last page free
      event in the allocator.  This was used primarily to measure the
      reclaimability of zones and nodes, and determine when reclaim should
      give up on them.  In that role, it has been replaced in the preceding
      patches by a different mechanism.
      
      Being implemented as an efficient vmstat counter, it was automatically
      exported to userspace as well.  It's however unlikely that anyone
      outside the kernel is using this counter in any meaningful way.
      
      Remove the counter and the unused pgdat_reclaimable().
      
      Link: http://lkml.kernel.org/r/20170228214007.5621-8-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jia He <hejianet@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix 100% CPU kswapd busyloop on unreclaimable nodes · c73322d0
      Authored by Johannes Weiner
      Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
      cleanups".
      
      Jia reported a scenario in which the kswapd of a node indefinitely spins
      at 100% CPU usage.  We have seen similar cases at Facebook.
      
      The kernel's current method of judging its ability to reclaim a node (or
      whether to back off and sleep) is based on the amount of scanned pages
      in proportion to the amount of reclaimable pages.  In Jia's and our
      scenarios, there are no reclaimable pages in the node, however, and the
      condition for backing off is never met.  Kswapd busyloops in an attempt
      to restore the watermarks while having nothing to work with.
      
      This series reworks the definition of an unreclaimable node based not on
      scanning but on whether kswapd is able to actually reclaim pages in
      MAX_RECLAIM_RETRIES (16) consecutive runs.  This is the same criteria
      the page allocator uses for giving up on direct reclaim and invoking the
      OOM killer.  If it cannot free any pages, kswapd will go to sleep and
      leave further attempts to direct reclaim invocations, which will either
      make progress and re-enable kswapd, or invoke the OOM killer.
      
      Patch #1 fixes the immediate problem Jia reported; the remainder
      are smaller fixlets, cleanups, and an overall phasing out of the
      old method.
      
      Patch #6 is the odd one out.  It's a nice cleanup to get_scan_count(),
      and directly related to #5, but in itself not relevant to the series.
      
      If the whole series is too ambitious for 4.11, I would consider the
      first three patches fixes, the rest cleanups.
      
      This patch (of 9):
      
      Jia He reports a problem with kswapd spinning at 100% CPU when
      requesting more hugepages than memory available in the system:
      
      $ echo 4000 >/proc/sys/vm/nr_hugepages
      
      top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
      Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
      %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
      KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
      KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem
      
        PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
         76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3
      
      At that time, there are no reclaimable pages left in the node, but as
      kswapd fails to restore the high watermarks it refuses to go to sleep.
      
      Kswapd needs to back away from nodes that fail to balance.  Up until
      commit 1d82de61 ("mm, vmscan: make kswapd reclaim in terms of
      nodes") kswapd had such a mechanism.  It considered zones whose
      theoretically reclaimable pages it had reclaimed six times over as
      unreclaimable and backed away from them.  This guard was erroneously
      removed as the patch changed the definition of a balanced node.
      
      However, simply restoring this code wouldn't help in the case reported
      here: there *are* no reclaimable pages that could be scanned until the
      threshold is met.  Kswapd would stay awake anyway.
      
      Introduce a new and much simpler way of backing off.  If kswapd runs
      through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
      page, make it back off from the node.  This is the same number of shots
      direct reclaim takes before declaring OOM.  Kswapd will go to sleep on
      that node until a direct reclaimer manages to reclaim some pages, thus
      proving the node reclaimable again.
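
      Sketched from the description above (the constant matches the
      changelog; the field placement is an assumption):

        #define MAX_RECLAIM_RETRIES 16

        /* after each kswapd reclaim cycle */
        if (sc.nr_reclaimed == 0)
                pgdat->kswapd_failures++;   /* another fruitless run */

        /* before starting a new cycle */
        if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
                goto sleep;   /* node looks unreclaimable; back off */

        /* direct reclaim resets kswapd_failures when it frees pages,
         * proving the node reclaimable and re-enabling kswapd */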
      
      [hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
        Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
      [shakeelb@google.com: fix condition for throttle_direct_reclaim]
        Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reported-by: Jia He <hejianet@gmail.com>
      Tested-by: Jia He <hejianet@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 03 May 2017 (1 commit)
    • qed*: Fix issues in the ptp filter config implementation. · 8d3f87d8
      Authored by sudarsana.kalluru@cavium.com
      The PTP hardware filter configuration performed by the driver for
      a given user-requested config is not correct for some of the PTP
      modes.  The following changes are needed in the PTP config-filter
      implementation (a hedged register-write sketch follows the list).
       1. NIG_REG_TX_PTP_EN register - bits 0/1/2 respectively enable
          TimeSync, "V1 frame format support" and "V2 frame format
          support" on the TX side.  Set the associated bits based on the
          user request.
       2. The ptp4l application fails to operate in Peer Delay mode.
          The following changes are needed to fix this:
          a. The driver should enable (set to 0) the DA #1-related bits
             for the IPv4, IPv6 and MAC destination addresses in these
             registers:
               NIG_REG_TX_LLH_PTP_RULE_MASK
               NIG_REG_LLH_PTP_RULE_MASK
          b. NIG_REG_LLH_PTP_PARAM_MASK/NIG_REG_TX_LLH_PTP_PARAM_MASK
             should be set to 0x0 in all modes.
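
      A hedged sketch of the resulting writes, using the driver's
      standard register accessor (the enable value and mask variable are
      assumptions based on the bit descriptions above):

        static void qed_ptp_cfg_sketch(struct qed_hwfn *p_hwfn,
                                       struct qed_ptt *p_ptt, u32 rule_mask)
        {
                /* bits 0/1/2: TimeSync, V1 and V2 frame formats on TX */
                qed_wr(p_hwfn, p_ptt, NIG_REG_TX_PTP_EN, 0x7);

                /* a 0 bit enables the corresponding DA #1 match rule
                 * (IPv4/IPv6/MAC destination addresses) */
                qed_wr(p_hwfn, p_ptt, NIG_REG_TX_LLH_PTP_RULE_MASK, rule_mask);
                qed_wr(p_hwfn, p_ptt, NIG_REG_LLH_PTP_RULE_MASK, rule_mask);

                /* parameter masks are 0x0 in all PTP modes */
                qed_wr(p_hwfn, p_ptt, NIG_REG_TX_LLH_PTP_PARAM_MASK, 0x0);
                qed_wr(p_hwfn, p_ptt, NIG_REG_LLH_PTP_PARAM_MASK, 0x0);
        }
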
      Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com>
      Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 02 May 2017 (3 commits)
  4. 01 May 2017 (4 commits)
    • qed: output the DPM status and WID count · 20b1bd96
      Authored by Ram Amrani
      Report to the RDMA driver whether DPM mode is enabled or disabled
      in the HW and, if so, how many WIDs it supports.
      Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com>
      Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xdp: propagate extended ack to XDP setup · ddf9f970
      Authored by Jakub Kicinski
      Drivers usually have a number of restrictions for running XDP, the
      most common being buffer sizes, LRO and the number of rings.  Even
      though some drivers try to be helpful and print error messages,
      experience shows that users don't often consult kernel logs on
      netlink errors.  Try to use the new extended ack mechanism to
      carry the message back to user space.
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netlink: add NULL-friendly helper for setting extended ACK message · 45d9b378
      Authored by Jakub Kicinski
      As we propagate extended ack reporting throughout various paths in
      the kernel it may be that the same function is called with the
      extended ack parameter passed as NULL.  One place where that happens
      is in drivers which have a centralized reconfiguration function
      called both from ndos and from ethtool_ops.  Add a new helper for
      setting the error message in such conditions.
      
      The existing helper is left as is, to encourage propagating the
      extack fully wherever possible.  It also makes it clear in the
      code which messages may be lost due to the extack being NULL.
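
      The shape of such a helper (the macro name here is hypothetical;
      the patch's actual helper may differ) is simply a NULL-tolerant
      setter:

        /* hypothetical: set the message only when an extack was given */
        #define NL_SET_ERR_MSG_SAFE(extack, msg)                      \
                do {                                                  \
                        struct netlink_ext_ack *__ea = (extack);      \
                        if (__ea)                                     \
                                __ea->_msg = (msg);                   \
                } while (0)
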
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash · 71389703
      Authored by Dan Williams
      The x86 conversion to the generic GUP code included a small change which causes
      crashes and data corruption in the pmem code - not good.
      
      The root cause is that the /dev/pmem driver code implicitly relies on the x86
      get_user_pages() implementation doing a get_page() on the page refcount, because
      get_page() does a get_zone_device_page() which properly refcounts pmem's separate
      page struct arrays that are not present in the regular page struct structures.
      (The pmem driver does this because it can cover huge memory areas.)
      
      But the x86 conversion to the generic GUP code changed the get_page() to
      page_cache_get_speculative() which is faster but doesn't do the
      get_zone_device_page() call the pmem code relies on.
      
      One way to solve the regression would be to change the generic GUP code to use
      get_page(), but that would slow things down a bit and punish other generic-GUP
      using architectures for an x86-ism they did not care about. (Arguably the pmem
      driver was probably not working reliably for them: but nvdimm is an Intel
      feature, so non-x86 exposure is probably still limited.)
      
      So restructure the pmem code's interface with the MM instead: get
      rid of the get/put_zone_device_page() distinction, integrate
      put_zone_device_page() into __put_page(), and restructure the pmem
      completion-wait and teardown machinery:
      
      Kirill points out that the calls to {get,put}_dev_pagemap() can be
      removed from the mm fast path if we take a single get_dev_pagemap()
      reference to signify that the page is alive and use the final put of the
      page to drop that reference.
      
      This does require some care to make sure that any waits for the
      percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
      since it now maintains its own elevated reference.
      
      This speeds up things while also making the pmem refcounting more robust going
      forward.
      Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/149339998297.24933.1129582806028305912.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  5. 30 April 2017 (1 commit)
    • net/mlx5e: Update neighbour 'used' state using HW flow rules counters · f6dfb4c3
      Authored by Hadar Hen Zion
      When IP tunnel encapsulation rules are offloaded, the kernel can't
      see the traffic of the offloaded flow.  The neighbour for the IP
      tunnel destination of the offloaded flow can therefore mistakenly
      become STALE and be deleted by the kernel, since its 'used' value
      isn't updated.
      
      To make sure that a neighbour which is used by the HW won't become
      STALE, we proactively update the neighbour 'used' value every
      DELAY_PROBE_TIME period, when packets were matched and counted by the HW
      for one of the tunnel encap flows related to this neighbour.
      
      The periodic task that updates the used neighbours is scheduled when a
      tunnel encap rule is successfully offloaded into HW and keeps re-scheduling
      itself as long as the representor's neighbours list isn't empty.
      
      Add, remove, lookup and status change operations done over the
      representor's neighbours list or the neighbour hash entry encaps list
      are all serialized by RTNL lock.
      Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
      Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
  6. 29 April 2017 (1 commit)
  7. 28 April 2017 (6 commits)
  8. 27 April 2017 (10 commits)
  9. 26 April 2017 (3 commits)
    • ebtables: remove nf_hook_register usage · aee12a0a
      Authored by Florian Westphal
      Similar to ip_register_table, pass nf_hook_ops to
      ebt_register_table().  This allows hook registration to also be
      handled via pernet_ops and lets us avoid the legacy register_hook
      API.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • net: move xdp_prog field in RX cache lines · 7acedaf5
      Authored by Eric Dumazet
      The (struct net_device, xdp_prog) field should be moved into the
      RX cache lines, reducing latency when a single packet is received
      on an idle host, since netif_elide_gro() needs it.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Generic XDP · b5cdae32
      Authored by David S. Miller
      This provides a generic SKB based non-optimized XDP path which is used
      if either the driver lacks a specific XDP implementation, or the user
      requests it via a new IFLA_XDP_FLAGS value named XDP_FLAGS_SKB_MODE.
      
      It is arguable that perhaps I should have required something like
      this as part of the initial XDP feature merge.
      
      I believe this is critical for two reasons:
      
      1) Accessibility.  More people can play with XDP with less
         dependencies.  Yes I know we have XDP support in virtio_net, but
          that just creates another dependency for learning how to use this
         facility.
      
         I wrote this to make life easier for the XDP newbies.
      
      2) As a model for what the expected semantics are.  If there is a pure
         generic core implementation, it serves as a semantic example for
         driver folks adding XDP support.
      
      One thing I have not tried to address here is the issue of
      XDP_PACKET_HEADROOM, thanks to Daniel for spotting that.  It seems
      incredibly expensive to do a skb_cow(skb, XDP_PACKET_HEADROOM) or
      whatever even if the XDP program doesn't try to push headers at all.
      I think we really need the verifier to somehow propagate whether
      certain XDP helpers are used or not.
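
      A condensed sketch of the generic hook (illustrative; the real
      implementation also linearizes the skb and handles the XDP_TX
      path):

        static u32 generic_xdp_sketch(struct bpf_prog *xdp_prog,
                                      struct sk_buff *skb)
        {
                struct xdp_buff xdp;
                u32 mac_len = skb->data - skb_mac_header(skb);

                /* XDP programs expect to see the packet from the MAC
                 * header onwards */
                xdp.data = skb->data - mac_len;
                xdp.data_end = xdp.data + skb_headlen(skb) + mac_len;
                xdp.data_hard_start = skb->head;

                /* XDP_PASS continues up the stack; XDP_TX and XDP_DROP
                 * short-circuit it entirely */
                return bpf_prog_run_xdp(xdp_prog, &xdp);
        }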
      
      v5:
       - Handle both negative and positive offset after running prog
       - Fix mac length in XDP_TX case (Alexei)
       - Use rcu_dereference_protected() in free_netdev (kbuild test robot)
      
      v4:
       - Fix MAC header adjustment before calling prog (David Ahern)
       - Disable LRO when generic XDP is installed (Michael Chan)
       - Bypass qdisc et al. on XDP_TX and record the event (Alexei)
       - Do not perform generic XDP on reinjected packets (DaveM)
      
      v3:
       - Make sure XDP program sees packet at MAC header, push back MAC
         header if we do XDP_TX.  (Alexei)
       - Elide GRO when generic XDP is in use.  (Alexei)
       - Add XDP_FLAG_SKB_MODE flag which the user can use to request generic
         XDP even if the driver has an XDP implementation.  (Alexei)
       - Report whether SKB mode is in use in rtnl_xdp_fill() via XDP_FLAGS
         attribute.  (Daniel)
      
      v2:
       - Add some "fall through" comments in switch statements based
         upon feedback from Andrew Lunn
       - Use RCU for generic xdp_prog, thanks to Johannes Berg.
      Tested-by: Andy Gospodarek <andy@greyhouse.net>
      Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Tested-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 25 April 2017 (1 commit)