1. 29 Jul, 2016 21 commits
    • M
      mm, mmzone: clarify the usage of zone padding · 0f661148
      Mel Gorman committed
      Zone padding separates write-intensive fields used by page allocation,
      compaction and vmstats but the comments are a little misleading and need
      clarification.
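
For illustration only, the idea behind the padding can be sketched in plain C11. This is not the kernel's code: the kernel uses its own ZONE_PADDING macro in include/linux/mmzone.h, while the sketch below swaps in `alignas`, and the structure name, field names and the 64-byte line size are all assumptions made up for the example.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define TOY_CACHE_LINE 64 /* assumed cache-line size in bytes */

/* Hypothetical stand-in for a zone-like structure; not the kernel layout. */
struct toy_padded_zone {
    /* Read-mostly fields share the first cache line. */
    unsigned long watermark[3];

    /* Write-intensive field used by the page allocator. */
    alignas(TOY_CACHE_LINE) unsigned long nr_free;

    /* Write-intensive field used by compaction. */
    alignas(TOY_CACHE_LINE) unsigned long compact_cursor;

    /* Write-intensive statistics counters. */
    alignas(TOY_CACHE_LINE) long vm_stat[4];
};

/* Every aligned group starts a new line, so the size is a whole number of lines. */
static_assert(sizeof(struct toy_padded_zone) % TOY_CACHE_LINE == 0,
              "struct toy_padded_zone should span whole cache lines");

/* Return 1 when two byte offsets fall within the same cache line. */
static int toy_same_cache_line(size_t a, size_t b)
{
    return a / TOY_CACHE_LINE == b / TOY_CACHE_LINE;
}
```

With this layout, a writer updating `nr_free` does not invalidate the cache line holding `compact_cursor` or `vm_stat`, which is the separation the comments in the real structure are meant to describe.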
      
      Link: http://lkml.kernel.org/r/1467970510-21195-5-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm, vmscan: move LRU lists to node · 599d0c95
      Mel Gorman committed
      This moves the LRU lists from the zone to the node and related data such
      as counters, tracing, congestion tracking and writeback tracking.
      
      Unfortunately, due to reclaim and compaction retry logic, it is
      necessary to account for the number of LRU pages on both zone and node
      logic.  Most reclaim logic is based on the node counters but the retry
      logic uses the zone counters which do not distinguish inactive and
      active sizes.  It would be possible to leave the LRU counters on a
      per-zone basis but it's a heavier calculation across multiple cache
      lines that is much more frequent than the retry checks.
      
      Other than the LRU counters, this is mostly a mechanical patch but note
      that it introduces a number of anomalies.  For example, the scans are
      per-zone but using per-node counters.  We also mark a node as congested
      when a zone is congested.  This causes weird problems that are fixed
      later but is easier to review.
      
      In the event that there is excessive overhead on 32-bit systems due to
      the LRUs being on the node, there are two potential solutions:
      
      1. Long-term isolation of highmem pages when reclaim is lowmem
      
         When pages are skipped, they are immediately added back onto the LRU
         list. If lowmem reclaim persisted for long periods of time, the same
         highmem pages get continually scanned. The idea would be that lowmem
         keeps those pages on a separate list until a reclaim for highmem pages
         arrives that splices the highmem pages back onto the LRU. It potentially
         could be implemented similar to the UNEVICTABLE list.
      
         That would reduce the skip rate, with the potential corner case being that
         highmem pages have to be scanned and reclaimed to free lowmem slab pages.
      
      2. Linear scan lowmem pages if the initial LRU shrink fails
      
         This will break LRU ordering but may be preferable and faster during
         memory pressure than skipping LRU pages.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm, vmscan: move lru_lock to the node · a52633d8
      Mel Gorman committed
      Node-based reclaim requires node-based LRUs and locking.  This is a
      preparation patch that just moves the lru_lock to the node so later
      patches are easier to review.  It is a mechanical change but note this
      patch makes contention worse because the LRU lock is hotter and direct
      reclaim and kswapd can contend on the same lock even when reclaiming
      from different zones.
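
As a rough sketch of why contention gets worse, assume a toy model in which the lock lives in the node and every zone reaches it through a back-pointer. All type and function names below are hypothetical and the lock is a placeholder, not a real spinlock.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_ZONES_PER_NODE 2

/* Placeholder lock type; a real implementation would use a spinlock. */
struct toy_lock { int held; };

struct toy_node;

struct toy_lru_zone {
    struct toy_node *node; /* back-pointer to the owning node */
};

struct toy_node {
    struct toy_lock lru_lock; /* one LRU lock shared by the whole node */
    struct toy_lru_zone zones[TOY_ZONES_PER_NODE];
};

/* Point every zone of the node back at the node itself. */
static void toy_node_init(struct toy_node *n)
{
    assert(n != NULL);
    n->lru_lock.held = 0;
    for (size_t i = 0; i < TOY_ZONES_PER_NODE; i++)
        n->zones[i].node = n;
}

/* Look up the LRU lock that protects a given zone's pages. */
static struct toy_lock *toy_zone_lru_lock(struct toy_lru_zone *z)
{
    assert(z != NULL && z->node != NULL);
    return &z->node->lru_lock;
}
```

In this model two reclaimers working on different zones of the same node resolve to the same lock address, so they serialise against each other where per-zone locks would have let them run in parallel.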
      
      Link: http://lkml.kernel.org/r/1467970510-21195-3-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm, vmstat: add infrastructure for per-node vmstats · 75ef7184
      Mel Gorman committed
      Patchset: "Move LRU page reclaim from zones to nodes v9"
      
      This series moves LRUs from the zones to the node.  While this is a
      current rebase, the test results were based on mmotm as of June 23rd.
      Conceptually, this series is simple but there are a lot of details.
      Some of the broad motivations for this are;
      
      1. The residency of a page partially depends on what zone the page was
         allocated from.  This is partially combatted by the fair zone allocation
         policy but that is a partial solution that introduces overhead in the
         page allocator paths.
      
      2. Currently, reclaim on node 0 behaves slightly differently from node 1. For
         example, direct reclaim scans in zonelist order and reclaims even if
         the zone is over the high watermark regardless of the age of pages
         in that LRU. Kswapd on the other hand starts reclaim on the highest
         unbalanced zone. A difference in distribution of file/anon pages due
         to when they were allocated can result in a difference in
         aging. While the fair zone allocation policy mitigates some of the
         problems here, the page reclaim results on a multi-zone node will
         always be different to a single-zone node.
         it was scheduled on as a result.
      
      3. kswapd and the page allocator scan zones in the opposite order to
         avoid interfering with each other. In the ideal case, this mitigates
         the page allocator using pages that were allocated very recently,
         but it's sensitive to timing. When kswapd is allocating
         from lower zones then it's great but during the rebalancing of the highest
         zone, the page allocator and kswapd interfere with each other. It's worse
         if the highest zone is small and difficult to balance.
      
      4. slab shrinkers are node-based which makes it harder to identify the exact
         relationship between slab reclaim and LRU reclaim.
      
      The reason we have zone-based reclaim is that we used to have
      large highmem zones in common configurations and it was necessary
      to quickly find ZONE_NORMAL pages for reclaim. Today, this is much
      less of a concern as machines with lots of memory will (or should) use
      64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are
      rare. Machines that do use highmem should have lower highmem:lowmem
      ratios than those we worried about in the past.
      
      Conceptually, moving to node LRUs should be easier to understand. The
      page allocator plays fewer tricks to game reclaim and reclaim behaves
      similarly on all nodes.
      
      The series has been tested on a 16 core UMA machine and a 2-socket 48
      core NUMA machine. The UMA results are presented in most cases as the NUMA
      machine behaved similarly.
      
      pagealloc
      ---------
      
      This is a microbenchmark that shows the benefit of removing the fair zone
      allocation policy. It was tested up to order-4 but only orders 0 and 1 are
      shown as the other orders were comparable.
      
                                                 4.7.0-rc4                  4.7.0-rc4
                                            mmotm-20160623                 nodelru-v9
      Min      total-odr0-1               490.00 (  0.00%)           457.00 (  6.73%)
      Min      total-odr0-2               347.00 (  0.00%)           329.00 (  5.19%)
      Min      total-odr0-4               288.00 (  0.00%)           273.00 (  5.21%)
      Min      total-odr0-8               251.00 (  0.00%)           239.00 (  4.78%)
      Min      total-odr0-16              234.00 (  0.00%)           222.00 (  5.13%)
      Min      total-odr0-32              223.00 (  0.00%)           211.00 (  5.38%)
      Min      total-odr0-64              217.00 (  0.00%)           208.00 (  4.15%)
      Min      total-odr0-128             214.00 (  0.00%)           204.00 (  4.67%)
      Min      total-odr0-256             250.00 (  0.00%)           230.00 (  8.00%)
      Min      total-odr0-512             271.00 (  0.00%)           269.00 (  0.74%)
      Min      total-odr0-1024            291.00 (  0.00%)           282.00 (  3.09%)
      Min      total-odr0-2048            303.00 (  0.00%)           296.00 (  2.31%)
      Min      total-odr0-4096            311.00 (  0.00%)           309.00 (  0.64%)
      Min      total-odr0-8192            316.00 (  0.00%)           314.00 (  0.63%)
      Min      total-odr0-16384           317.00 (  0.00%)           315.00 (  0.63%)
      Min      total-odr1-1               742.00 (  0.00%)           712.00 (  4.04%)
      Min      total-odr1-2               562.00 (  0.00%)           530.00 (  5.69%)
      Min      total-odr1-4               457.00 (  0.00%)           433.00 (  5.25%)
      Min      total-odr1-8               411.00 (  0.00%)           381.00 (  7.30%)
      Min      total-odr1-16              381.00 (  0.00%)           356.00 (  6.56%)
      Min      total-odr1-32              372.00 (  0.00%)           346.00 (  6.99%)
      Min      total-odr1-64              372.00 (  0.00%)           343.00 (  7.80%)
      Min      total-odr1-128             375.00 (  0.00%)           351.00 (  6.40%)
      Min      total-odr1-256             379.00 (  0.00%)           351.00 (  7.39%)
      Min      total-odr1-512             385.00 (  0.00%)           355.00 (  7.79%)
      Min      total-odr1-1024            386.00 (  0.00%)           358.00 (  7.25%)
      Min      total-odr1-2048            390.00 (  0.00%)           362.00 (  7.18%)
      Min      total-odr1-4096            390.00 (  0.00%)           362.00 (  7.18%)
      Min      total-odr1-8192            388.00 (  0.00%)           363.00 (  6.44%)
      
      This shows a steady improvement throughout. The primary benefit is from
      reduced system CPU usage which is obvious from the overall times;
      
                 4.7.0-rc4   4.7.0-rc4
             mmotm-20160623 nodelru-v8
      User          189.19      191.80
      System       2604.45     2533.56
      Elapsed      2855.30     2786.39
      
      The vmstats also showed that the fair zone allocation policy was definitely
      removed as can be seen here;
      
                                   4.7.0-rc3   4.7.0-rc3
                               mmotm-20160623 nodelru-v8
      DMA32 allocs               28794729769           0
      Normal allocs              48432501431 77227309877
      Movable allocs                       0           0
      
      tiobench on ext4
      ----------------
      
      tiobench is a benchmark that artificially benefits if old pages remain resident
      while new pages get reclaimed. The fair zone allocation policy mitigates this
      problem so pages age fairly. While the benchmark has problems, it is important
      that tiobench performance remains constant as it implies that page aging
      problems that the fair zone allocation policy fixes are not re-introduced.
      
                                               4.7.0-rc4             4.7.0-rc4
                                          mmotm-20160623            nodelru-v9
      Min      PotentialReadSpeed        89.65 (  0.00%)       90.21 (  0.62%)
      Min      SeqRead-MB/sec-1          82.68 (  0.00%)       82.01 ( -0.81%)
      Min      SeqRead-MB/sec-2          72.76 (  0.00%)       72.07 ( -0.95%)
      Min      SeqRead-MB/sec-4          75.13 (  0.00%)       74.92 ( -0.28%)
      Min      SeqRead-MB/sec-8          64.91 (  0.00%)       65.19 (  0.43%)
      Min      SeqRead-MB/sec-16         62.24 (  0.00%)       62.22 ( -0.03%)
      Min      RandRead-MB/sec-1          0.88 (  0.00%)        0.88 (  0.00%)
      Min      RandRead-MB/sec-2          0.95 (  0.00%)        0.92 ( -3.16%)
      Min      RandRead-MB/sec-4          1.43 (  0.00%)        1.34 ( -6.29%)
      Min      RandRead-MB/sec-8          1.61 (  0.00%)        1.60 ( -0.62%)
      Min      RandRead-MB/sec-16         1.80 (  0.00%)        1.90 (  5.56%)
      Min      SeqWrite-MB/sec-1         76.41 (  0.00%)       76.85 (  0.58%)
      Min      SeqWrite-MB/sec-2         74.11 (  0.00%)       73.54 ( -0.77%)
      Min      SeqWrite-MB/sec-4         80.05 (  0.00%)       80.13 (  0.10%)
      Min      SeqWrite-MB/sec-8         72.88 (  0.00%)       73.20 (  0.44%)
      Min      SeqWrite-MB/sec-16        75.91 (  0.00%)       76.44 (  0.70%)
      Min      RandWrite-MB/sec-1         1.18 (  0.00%)        1.14 ( -3.39%)
      Min      RandWrite-MB/sec-2         1.02 (  0.00%)        1.03 (  0.98%)
      Min      RandWrite-MB/sec-4         1.05 (  0.00%)        0.98 ( -6.67%)
      Min      RandWrite-MB/sec-8         0.89 (  0.00%)        0.92 (  3.37%)
      Min      RandWrite-MB/sec-16        0.92 (  0.00%)        0.93 (  1.09%)
      
                 4.7.0-rc4   4.7.0-rc4
              mmotm-20160623 approx-v9
      User          645.72      525.90
      System        403.85      331.75
      Elapsed      6795.36     6783.67
      
      This shows that the series has little or no impact on tiobench, which is
      desirable, and a reduction in system CPU usage. It indicates that the fair
      zone allocation policy was removed in a manner that didn't reintroduce
      one class of page aging bug. There were only minor differences in overall
      reclaim activity.
      
                                   4.7.0-rc4   4.7.0-rc4
                               mmotm-20160623 nodelru-v8
      Minor Faults                    645838      647465
      Major Faults                       573         640
      Swap Ins                             0           0
      Swap Outs                            0           0
      DMA allocs                           0           0
      DMA32 allocs                  46041453    44190646
      Normal allocs                 78053072    79887245
      Movable allocs                       0           0
      Allocation stalls                   24          67
      Stall zone DMA                       0           0
      Stall zone DMA32                     0           0
      Stall zone Normal                    0           2
      Stall zone HighMem                   0           0
      Stall zone Movable                   0          65
      Direct pages scanned             10969       30609
      Kswapd pages scanned          93375144    93492094
      Kswapd pages reclaimed        93372243    93489370
      Direct pages reclaimed           10969       30609
      Kswapd efficiency                  99%         99%
      Kswapd velocity              13741.015   13781.934
      Direct efficiency                 100%        100%
      Direct velocity                  1.614       4.512
      Percentage direct scans             0%          0%
      
      kswapd activity was roughly comparable. There were differences in direct
      reclaim activity but negligible in the context of the overall workload
      (velocity of 4 pages per second with the patches applied, 1.6 pages per
      second in the baseline kernel).
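
The velocity and efficiency rows in the table above are consistent with two simple ratios: pages scanned divided by elapsed seconds, and pages reclaimed as a truncated percentage of pages scanned. The sketch below reproduces that arithmetic under those assumptions; the function names are hypothetical and this is not the reporting tool's actual code.

```c
#include <assert.h>

/* Pages scanned per second of elapsed wall-clock time. */
static double toy_scan_velocity(unsigned long long scanned, double elapsed_s)
{
    assert(elapsed_s > 0.0);
    return (double)scanned / elapsed_s;
}

/*
 * Reclaimed pages as a whole percentage of scanned pages, truncated.
 * Assumption: with nothing scanned the efficiency is reported as 100,
 * matching the rows where zero pages were scanned.
 */
static unsigned int toy_reclaim_efficiency(unsigned long long reclaimed,
                                           unsigned long long scanned)
{
    if (scanned == 0)
        return 100;
    return (unsigned int)(reclaimed * 100ULL / scanned);
}
```

For the baseline kernel, `toy_scan_velocity(93375144, 6795.36)` gives roughly 13741.015 and `toy_scan_velocity(10969, 6795.36)` gives roughly 1.614, in line with the Kswapd velocity and Direct velocity rows.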
      
      pgbench read-only large configuration on ext4
      ---------------------------------------------
      
      pgbench is a database benchmark that can be sensitive to page reclaim
      decisions. This also checks if removing the fair zone allocation policy
      is safe.
      
      pgbench Transactions
                              4.7.0-rc4             4.7.0-rc4
                         mmotm-20160623            nodelru-v8
      Hmean    1       188.26 (  0.00%)      189.78 (  0.81%)
      Hmean    5       330.66 (  0.00%)      328.69 ( -0.59%)
      Hmean    12      370.32 (  0.00%)      380.72 (  2.81%)
      Hmean    21      368.89 (  0.00%)      369.00 (  0.03%)
      Hmean    30      382.14 (  0.00%)      360.89 ( -5.56%)
      Hmean    32      428.87 (  0.00%)      432.96 (  0.95%)
      
      Negligible differences again. As with tiobench, overall reclaim activity
      was comparable.
      
      bonnie++ on ext4
      ----------------
      
      No interesting performance difference, negligible differences on reclaim
      stats.
      
      paralleldd on ext4
      ------------------
      
      This workload uses varying numbers of dd instances to read large amounts of
      data from disk.
      
                                     4.7.0-rc3             4.7.0-rc3
                                mmotm-20160623            nodelru-v9
      Amean    Elapsd-1       186.04 (  0.00%)      189.41 ( -1.82%)
      Amean    Elapsd-3       192.27 (  0.00%)      191.38 (  0.46%)
      Amean    Elapsd-5       185.21 (  0.00%)      182.75 (  1.33%)
      Amean    Elapsd-7       183.71 (  0.00%)      182.11 (  0.87%)
      Amean    Elapsd-12      180.96 (  0.00%)      181.58 ( -0.35%)
      Amean    Elapsd-16      181.36 (  0.00%)      183.72 ( -1.30%)
      
                 4.7.0-rc4   4.7.0-rc4
              mmotm-20160623 nodelru-v9
      User         1548.01     1552.44
      System       8609.71     8515.08
      Elapsed      3587.10     3594.54
      
      There is little or no change in performance but some drop in system CPU usage.
      
                                   4.7.0-rc3   4.7.0-rc3
                              mmotm-20160623  nodelru-v9
      Minor Faults                    362662      367360
      Major Faults                      1204        1143
      Swap Ins                            22           0
      Swap Outs                         2855        1029
      DMA allocs                           0           0
      DMA32 allocs                  31409797    28837521
      Normal allocs                 46611853    49231282
      Movable allocs                       0           0
      Direct pages scanned                 0           0
      Kswapd pages scanned          40845270    40869088
      Kswapd pages reclaimed        40830976    40855294
      Direct pages reclaimed               0           0
      Kswapd efficiency                  99%         99%
      Kswapd velocity              11386.711   11369.769
      Direct efficiency                 100%        100%
      Direct velocity                  0.000       0.000
      Percentage direct scans             0%          0%
      Page writes by reclaim            2855        1029
      Page writes file                     0           0
      Page writes anon                  2855        1029
      Page reclaim immediate             771        1628
      Sector Reads                 293312636   293536360
      Sector Writes                 18213568    18186480
      Page rescued immediate               0           0
      Slabs scanned                   128257      132747
      Direct inode steals                181          56
      Kswapd inode steals                 59        1131
      
      It basically shows that kswapd was active at roughly the same rate in
      both kernels. There was also comparable slab scanning activity and direct
      reclaim was avoided in both cases. There appears to be a large difference
      in numbers of inodes reclaimed but the workload has few active inodes and
      this is likely a timing artifact.
      
      stutter
      -------
      
      stutter simulates a simple workload. One part uses a lot of anonymous
      memory, a second measures mmap latency and a third copies a large file.
      The primary metric is checking for mmap latency.
      
      stutter
                                   4.7.0-rc4             4.7.0-rc4
                              mmotm-20160623            nodelru-v8
      Min         mmap     16.6283 (  0.00%)     13.4258 ( 19.26%)
      1st-qrtle   mmap     54.7570 (  0.00%)     34.9121 ( 36.24%)
      2nd-qrtle   mmap     57.3163 (  0.00%)     46.1147 ( 19.54%)
      3rd-qrtle   mmap     58.9976 (  0.00%)     47.1882 ( 20.02%)
      Max-90%     mmap     59.7433 (  0.00%)     47.4453 ( 20.58%)
      Max-93%     mmap     60.1298 (  0.00%)     47.6037 ( 20.83%)
      Max-95%     mmap     73.4112 (  0.00%)     82.8719 (-12.89%)
      Max-99%     mmap     92.8542 (  0.00%)     88.8870 (  4.27%)
      Max         mmap   1440.6569 (  0.00%)    121.4201 ( 91.57%)
      Mean        mmap     59.3493 (  0.00%)     42.2991 ( 28.73%)
      Best99%Mean mmap     57.2121 (  0.00%)     41.8207 ( 26.90%)
      Best95%Mean mmap     55.9113 (  0.00%)     39.9620 ( 28.53%)
      Best90%Mean mmap     55.6199 (  0.00%)     39.3124 ( 29.32%)
      Best50%Mean mmap     53.2183 (  0.00%)     33.1307 ( 37.75%)
      Best10%Mean mmap     45.9842 (  0.00%)     20.4040 ( 55.63%)
      Best5%Mean  mmap     43.2256 (  0.00%)     17.9654 ( 58.44%)
      Best1%Mean  mmap     32.9388 (  0.00%)     16.6875 ( 49.34%)
      
      This shows a number of improvements with the worst-case outlier greatly
      improved.
      
      Some of the vmstats are interesting
      
                                   4.7.0-rc4   4.7.0-rc4
                               mmotm-20160623 nodelru-v8
      Swap Ins                           163         502
      Swap Outs                            0           0
      DMA allocs                           0           0
      DMA32 allocs                 618719206  1381662383
      Normal allocs                891235743   564138421
      Movable allocs                       0           0
      Allocation stalls                 2603           1
      Direct pages scanned            216787           2
      Kswapd pages scanned          50719775    41778378
      Kswapd pages reclaimed        41541765    41777639
      Direct pages reclaimed          209159           0
      Kswapd efficiency                  81%         99%
      Kswapd velocity              16859.554   14329.059
      Direct efficiency                  96%          0%
      Direct velocity                 72.061       0.001
      Percentage direct scans             0%          0%
      Page writes by reclaim         6215049           0
      Page writes file               6215049           0
      Page writes anon                     0           0
      Page reclaim immediate           70673          90
      Sector Reads                  81940800    81680456
      Sector Writes                100158984    98816036
      Page rescued immediate               0           0
      Slabs scanned                  1366954       22683
      
      While this is not guaranteed in all cases, this particular test showed
      a large reduction in direct reclaim activity. It's also worth noting
      that no page writes were issued from reclaim context.
      
      This series is not without its hazards. There are at least three areas
      that I'm concerned with even though I could not reproduce any problems in
      that area.
      
      1. Reclaim/compaction is going to be affected because the amount of reclaim is
         no longer targeted at a specific zone. Compaction works on a per-zone basis
         so there is no guarantee that reclaiming a few THPs' worth of pages will
         have a positive impact on compaction success rates.
      
      2. The Slab/LRU reclaim ratio is affected because the frequency the shrinkers
         are called is now different. This may or may not be a problem but if it
         is, it'll be because shrinkers are not called enough and some balancing
         is required.
      
      3. The anon/file reclaim ratio may be affected. Pages about to be dirtied are
         distributed between zones and the fair zone allocation policy used to do
         something very similar for anon. The distribution is now different but not
         necessarily in any way that matters but it's still worth bearing in mind.
      
      VM statistic counters for reclaim decisions are zone-based.  If the kernel
      is to reclaim on a per-node basis then we need to track per-node
      statistics but there is no infrastructure for that.  The most notable
      change is that the old node_page_state is renamed to
      sum_zone_node_page_state.  The new node_page_state takes a pglist_data and
      uses per-node stats but none exist yet.  There is some renaming such as
      vm_stat to vm_zone_stat and the addition of vm_node_stat and the renaming
      of mod_state to mod_zone_state.  Otherwise, this is mostly a mechanical
      patch with no functional change.  There is a lot of similarity between the
      node and zone helpers which is unfortunate but there was no obvious way of
      reusing the code and maintaining type safety.
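
To make the split concrete, here is a minimal userspace sketch of the two kinds of lookup described above. It only mirrors the naming; the toy_ types, the enum items and the fixed zone count are assumptions for the example, and the real helpers deal with per-cpu deltas that are left out here.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_ZONES 3

/* Separate enums keep zone items and node items apart by intent. */
enum toy_zone_stat_item { TOY_NR_FREE_PAGES, TOY_NR_ZONE_ITEMS };
enum toy_node_stat_item { TOY_NR_INACTIVE_FILE, TOY_NR_NODE_ITEMS };

struct toy_stat_zone {
    long vm_stat[TOY_NR_ZONE_ITEMS]; /* per-zone counters */
};

struct toy_pgdat {
    struct toy_stat_zone zones[TOY_MAX_ZONES];
    long vm_stat[TOY_NR_NODE_ITEMS]; /* per-node counters */
};

/* Old-style lookup: add up one zone counter across the node's zones. */
static long toy_sum_zone_node_page_state(const struct toy_pgdat *pgdat,
                                         enum toy_zone_stat_item item)
{
    long sum = 0;

    assert(pgdat != NULL);
    for (size_t i = 0; i < TOY_MAX_ZONES; i++)
        sum += pgdat->zones[i].vm_stat[item];
    return sum;
}

/* New-style lookup: read a counter that is kept on the node itself. */
static long toy_node_page_state(const struct toy_pgdat *pgdat,
                                enum toy_node_stat_item item)
{
    assert(pgdat != NULL);
    return pgdat->vm_stat[item];
}
```

The two helpers look almost identical, which is the duplication the message regrets. C enums convert implicitly, so the distinct parameter types document intent and let compilers warn on a mix-up rather than strictly enforcing it.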
      
      Link: http://lkml.kernel.org/r/1467970510-21195-2-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm, meminit: remove early_page_nid_uninitialised · a621184a
      Mel Gorman committed
      The helper early_page_nid_uninitialised() has been dead since commit
      974a786e ("mm, page_alloc: remove MIGRATE_RESERVE") so remove the
      dead code.
      
      Link: http://lkml.kernel.org/r/1468008031-3848-2-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      cpuset, mm: fix TIF_MEMDIE check in cpuset_change_task_nodemask · fec1e5f9
      Michal Hocko committed
      Commit c0ff7453 ("cpuset,mm: fix no node to alloc memory when
      changing cpuset's mems") added TIF_MEMDIE and PF_EXITING checks but
      it tests the flags on the current task rather than the given one.
      
      This doesn't make much sense and it is actually wrong.  If the current
      task which updates the nodemask of a cpuset got killed by the OOM killer
      then a part of the cpuset cgroup processes would have incompatible
      nodemask which is surprising to say the least.
      
      The comment suggests the intention was to skip an oom victim or an exiting
      task so we should be checking the given task.  But even then it would be
      a layering violation because it is up to the memory allocator to interpret
      the TIF_MEMDIE meaning.  Simply drop both checks.  All tasks in the cpuset
      should simply follow the same mask.
      
      Link: http://lkml.kernel.org/r/1467029719-17602-3-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Cc: Miao Xie <miaoxie@huawei.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      freezer, oom: check TIF_MEMDIE on the correct task · a34c80a7
      Michal Hocko committed
      freezing_slow_path() is checking TIF_MEMDIE to skip OOM killed tasks.
      It is, however, checking the flag on the current task rather than the
      given one.  This is really confusing because freezing() can be called
      also on !current tasks.  It would end up working correctly for its main
      purpose because __refrigerator will be always called on the current task
      so the oom victim will never get frozen.  But it could lead to
      surprising results when a task which is freezing a cgroup got oom killed
      because only part of the cgroup would get frozen.  This is highly
      unlikely but worth fixing as the resulting code would be more clear
      anyway.
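
The difference between the two checks can be shown with a small userspace model. Everything here is hypothetical: `toy_task`, the `memdie` flag and `toy_current` merely stand in for the task structure, TIF_MEMDIE and the current task, and the functions are not the kernel's freezing_slow_path().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task with a flag standing in for TIF_MEMDIE. */
struct toy_task {
    bool memdie; /* true when the task was picked as an OOM victim */
};

/* Stand-in for the task that is executing the check. */
static struct toy_task *toy_current;

/* Buggy variant: consults the caller's flag and ignores the argument. */
static bool toy_should_freeze_buggy(const struct toy_task *p)
{
    (void)p;
    assert(toy_current != NULL);
    return !toy_current->memdie;
}

/* Fixed variant: consults the flag of the task being examined. */
static bool toy_should_freeze_fixed(const struct toy_task *p)
{
    assert(p != NULL);
    return !p->memdie;
}
```

If the task doing the freezing is itself an OOM victim, the buggy variant refuses to freeze every task it inspects from then on, which is how only part of a cgroup could end up frozen. The fixed variant still freezes the healthy members.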
      
      Link: http://lkml.kernel.org/r/1467029719-17602-2-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Cc: Miao Xie <miaoxie@huawei.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • G
      mm/compaction: remove unnecessary order check in try_to_compact_pages() · b2b331f9
      Ganesh Mahendran committed
      The caller __alloc_pages_direct_compact() already checked (order == 0)
      so there's no need to check again.
      
      Link: http://lkml.kernel.org/r/1465973568-3496-1-git-send-email-opensource.ganesh@gmail.com
      Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      mm: fix vm-scalability regression in cgroup-aware workingset code · 55779ec7
      Johannes Weiner committed
      Commit 23047a96 ("mm: workingset: per-cgroup cache thrash
      detection") added a page->mem_cgroup lookup to the cache eviction,
      refault, and activation paths, as well as locking to the activation
      path, and the vm-scalability tests showed a regression of -23%.
      
      While the test in question is an artificial worst-case scenario that
      doesn't occur in real workloads - reading two sparse files in parallel
      at full CPU speed just to hammer the LRU paths - there are still some
      optimizations that can be done in those paths.
      
      Inline the lookup functions to eliminate calls.  Also, page->mem_cgroup
      doesn't need to be stabilized when counting an activation; we merely
      need to hold the RCU lock to prevent the memcg from being freed.
      
      This cuts down on overhead quite a bit:
      
      23047a96 063f6715e77a7be5770d6081fe
      ---------------- --------------------------
               %stddev     %change         %stddev
                   \          |                \
        21621405 +- 0%     +11.3%   24069657 +- 2%  vm-scalability.throughput
      
      [linux@roeck-us.net: drop unnecessary include file]
      [hannes@cmpxchg.org: add WARN_ON_ONCE()s]
        Link: http://lkml.kernel.org/r/20160707194024.GA26580@cmpxchg.org
      Link: http://lkml.kernel.org/r/20160624175101.GA3024@cmpxchg.org
      Reported-by: Ye Xiaolong <xiaolong.ye@intel.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      55779ec7
    • Z
      mm: update the comment in __isolate_free_page · 400bc7fd
      zhong jiang committed
      We need to ensure that the comment is consistent with the code.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/1466171914-21027-1-git-send-email-zhongjiang@huawei.com
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      400bc7fd
    • M
      mm, oom: tighten task_will_free_mem() locking · 091f362c
      Michal Hocko committed
      "mm, oom: fortify task_will_free_mem" dropped task_lock around
      task_will_free_mem in oom_kill_process because it assumed that a
      potential race when the selected task exits would not be a problem, as
      the oom_reaper will call exit_oom_victim.
      
      Tetsuo objected that nommu doesn't have the oom_reaper, so the race
      would still be possible.  Without the oom reaper the code would
      theoretically be racy and lockup prone in other respects anyway, so I
      didn't consider this a big deal.  But it seems that further changes I am
      planning in this area will benefit from a stable task->mm in this path
      as well.  So let's drop find_lock_task_mm from task_will_free_mem and
      call it from under task_lock as we did previously.  Just pull the
      task->mm != NULL check inside the function.
      
      Link: http://lkml.kernel.org/r/1467201562-6709-1-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      091f362c
    • M
      mm, oom: hide mm which is shared with kthread or global init · a373966d
      Michal Hocko committed
      The only case where the oom_reaper is not triggered for the oom victim
      is when it shares the memory with a kernel thread (aka use_mm) or with
      the global init.  After "mm, oom: skip vforked tasks from being
      selected" the victim cannot be a vforked task of the global init so we
      are left with clone(CLONE_VM) (without CLONE_SIGHAND).  use_mm() users
      are quite rare as well.
      
      In order to help forward progress for the OOM killer, make sure that
      this really rare case will not get in the way - we do this by hiding the
      mm from the oom killer by setting MMF_OOM_REAPED flag for it.
      oom_scan_process_thread will ignore any TIF_MEMDIE task if it has
      MMF_OOM_REAPED flag set to catch these oom victims.
      
      After this patch we should guarantee forward progress for the OOM killer
      even when the selected victim is sharing memory with a kernel thread or
      global init as long as the victim's mm is still alive.
      
      Link: http://lkml.kernel.org/r/1466426628-15074-11-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a373966d
    • M
      mm, oom_reaper: do not attempt to reap a task more than twice · 11a410d5
      Michal Hocko committed
      oom_reaper relies on taking the mmap_sem for read to do its job.  Many
      places which might block readers have been converted to use
      down_write_killable, and that has greatly reduced the chance of
      contention.  Some paths where the mmap_sem is held for write can take
      other locks, though, and they might either not be prepared to fail due
      to a pending fatal signal or be too impractical to change.
      
      This patch introduces the MMF_OOM_NOT_REAPABLE flag, which gets set after
      the first attempt to reap a task's mm fails.  If the flag is already
      present when a later attempt fails, we set MMF_OOM_REAPED to hide this mm
      from the oom killer completely so it can go and choose another victim.
      
      As a result, the risk of an OOM deadlock, where the oom victim is blocked
      indefinitely and so the oom killer cannot make any progress, should be
      mitigated considerably, while we still try really hard to perform all
      reclaim attempts and stay predictable in behavior.
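
The two-strike bookkeeping described above can be sketched as a toy user-space model. This is not kernel code: the set-based flags and the helper names `record_reap_attempt` and `visible_to_oom_killer` are hypothetical, and marking the mm as reaped after a successful reap is an assumption made to keep the model complete.

```python
MMF_OOM_NOT_REAPABLE = "MMF_OOM_NOT_REAPABLE"
MMF_OOM_REAPED = "MMF_OOM_REAPED"


def record_reap_attempt(mm_flags: set, reaped: bool) -> None:
    """Update the flags of one mm after a single reap attempt."""
    if reaped:
        # Assumption: a successful reap also hides the mm from the oom killer.
        mm_flags.add(MMF_OOM_REAPED)
        return
    if MMF_OOM_NOT_REAPABLE in mm_flags:
        # Second failure: give up and hide the mm so another victim is chosen.
        mm_flags.add(MMF_OOM_REAPED)
    else:
        # First failure: only remember that one attempt has already failed.
        mm_flags.add(MMF_OOM_NOT_REAPABLE)


def visible_to_oom_killer(mm_flags: set) -> bool:
    """In this model an mm is skipped once MMF_OOM_REAPED is set."""
    return MMF_OOM_REAPED not in mm_flags


flags = set()
record_reap_attempt(flags, reaped=False)   # first failed attempt
still_visible_after_first = visible_to_oom_killer(flags)
record_reap_attempt(flags, reaped=False)   # second failed attempt
still_visible_after_second = visible_to_oom_killer(flags)
```

In this model the mm stays visible after one failure and is hidden after the second, which bounds how long an unreapable victim can hold up the oom killer.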
      
      Link: http://lkml.kernel.org/r/1466426628-15074-10-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      11a410d5
    • M
      mm, oom: task_will_free_mem should skip oom_reaped tasks · 696453e6
      Michal Hocko committed
      The 0-day robot has encountered the following:
      
         Out of memory: Kill process 3914 (trinity-c0) score 167 or sacrifice child
         Killed process 3914 (trinity-c0) total-vm:55864kB, anon-rss:1512kB, file-rss:1088kB, shmem-rss:25616kB
         oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26488kB
         oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
         oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
         oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:27296kB
         oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:28148kB
      
      oom_reaper is trying to reap the same task again and again.
      
      This is possible only when the oom killer is bypassed because of
      task_will_free_mem because we skip over tasks with MMF_OOM_REAPED
      already set during select_bad_process.  Teach task_will_free_mem to skip
      over MMF_OOM_REAPED tasks as well because they will be unlikely to free
      anything more.
      
      Analyzed by Tetsuo Handa.
      
      Link: http://lkml.kernel.org/r/1466426628-15074-9-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      696453e6
    • M
      mm, oom: fortify task_will_free_mem() · 1af8bb43
      Michal Hocko committed
      task_will_free_mem is rather weak.  It doesn't really tell whether the
      task has a chance to drop its mm.  98748bd7 ("oom: consider
      multi-threaded tasks in task_will_free_mem") made a first step toward
      making it more robust for multi-threaded applications, so now we know
      that the whole process is going down and will probably drop the mm.
      
      This patch builds on top of that to cover more complex scenarios where
      the mm is shared between different processes - CLONE_VM without
      CLONE_SIGHAND, or in-kernel use_mm().
      
      Make sure that all processes sharing the mm are killed or exiting.  This
      will allow us to replace try_oom_reaper by wake_oom_reaper because
      task_will_free_mem implies the task is reapable now.  Therefore all paths
      which bypass the oom killer are now reapable and so they shouldn't lock up
      the oom killer.
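
The rule "all processes sharing the mm are killed or exiting" can be sketched as a small predicate. This is a toy user-space model with hypothetical names (`Task`, `will_free_mem_model`); the real task_will_free_mem inspects kernel task state that is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Task:
    pid: int
    mm_id: int               # tasks with the same mm_id share one address space
    killed_or_exiting: bool  # stands in for "has a fatal signal or is exiting"


def will_free_mem_model(task: Task, all_tasks: list) -> bool:
    """True only if every task sharing task's mm is going away."""
    if not task.killed_or_exiting:
        return False
    sharers = (t for t in all_tasks if t.mm_id == task.mm_id)
    return all(t.killed_or_exiting for t in sharers)


victim = Task(pid=100, mm_id=1, killed_or_exiting=True)
sibling = Task(pid=101, mm_id=1, killed_or_exiting=False)  # CLONE_VM sharer
other = Task(pid=200, mm_id=2, killed_or_exiting=False)
tasks = [victim, sibling, other]

before = will_free_mem_model(victim, tasks)  # sibling still pins the mm
sibling.killed_or_exiting = True
after = will_free_mem_model(victim, tasks)   # every sharer is now going away
```

Only once every sharer is on its way out does the model report that the mm will be freed, which is what makes the oom-killer bypass safe to hand to the reaper.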
      
      Link: http://lkml.kernel.org/r/1466426628-15074-8-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1af8bb43
    • M
      mm, oom: kill all tasks sharing the mm · 97fd49c2
      Michal Hocko committed
      Currently oom_kill_process skips both the oom reaper and SIGKILL if a
      process sharing the same mm is unkillable via OOM_SCORE_ADJ_MIN.  After
      "mm, oom_adj: make sure processes sharing mm have same view of
      oom_score_adj" all such processes share the same value, so we shouldn't
      see such a task at all (oom_badness would rule them out).
      
      We can still encounter an oom-disabled vforked task, which has to be
      killed as well if we want the other tasks sharing the mm to be reapable,
      because it can access the memory before doing exec.  Killing such a task
      should be acceptable because it is highly unlikely that it has done
      anything useful, as it cannot modify any memory before it calls exec.
      An alternative would be to keep the task alive, skip the oom reaper, and
      risk all the weird corner cases where the OOM killer cannot make forward
      progress because the oom victim hung somewhere on the way to exit.
      
      [rientjes@google.com - drop printk when OOM_SCORE_ADJ_MIN killed task
       the setting is inherently racy and we cannot do much about it without
       introducing locks in hot paths]
      Link: http://lkml.kernel.org/r/1466426628-15074-7-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      97fd49c2
    • M
      mm, oom: skip vforked tasks from being selected · b18dc5f2
      Michal Hocko committed
      vforked tasks are not really sitting on any memory.  They share the mm
      with their parent until they exec into new code.  Until then they are
      just pinning the address space.  The OOM killer will kill the vforked
      task along with its parent, but we can still end up selecting the
      vforked task when the parent wouldn't be selected, e.g. when init does a
      vfork to launch a task, or when the vforked task is a child of an
      oom-unkillable task and has updated its oom_score_adj to be killable.
      
      Add a new helper to check whether a task is in vfork, sharing memory
      with its parent, and use it in oom_badness to skip over these tasks.
      
      Link: http://lkml.kernel.org/r/1466426628-15074-6-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b18dc5f2
    • M
      mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj · 44a70ade
      Michal Hocko committed
      oom_score_adj is shared within a thread group (via struct signal) but
      this is not sufficient to cover processes sharing an mm (CLONE_VM without
      CLONE_SIGHAND), so we can easily end up in a situation where some
      processes update their oom_score_adj and confuse the oom killer.  In the
      worst case some of those processes might hide from the oom killer
      altogether via OOM_SCORE_ADJ_MIN while others are eligible.  The OOM
      killer would then pick the eligible ones but wouldn't be allowed to kill
      the others sharing the same mm, so the mm, and with it the memory,
      wouldn't be released.
      
      It would be ideal to have the oom_score_adj per mm_struct because that is
      the natural entity OOM killer considers.  But this will not work because
      some programs are doing
      
      	vfork()
      	set_oom_adj()
      	exec()
      
      We can achieve the same though.  oom_score_adj write handler can set the
      oom_score_adj for all processes sharing the same mm if the task is not in
      the middle of vfork.  As a result all the processes will share the same
      oom_score_adj.  The current implementation is rather pessimistic and
      checks all the existing processes by default if there is more than 1
      holder of the mm but we do not have any reliable way to check for external
      users yet.
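
The propagation rule described above can be sketched as a toy user-space model. The `Process` class, its fields and `write_oom_score_adj` are hypothetical names used for illustration only; the real write handler operates on kernel task structures and handles details (such as skipping the global init) that are left out here.

```python
from dataclasses import dataclass


@dataclass
class Process:
    pid: int
    mm_id: int               # processes with the same mm_id share one mm
    oom_score_adj: int = 0
    in_vfork: bool = False   # True between vfork() and exec()


def write_oom_score_adj(writer: Process, value: int, all_processes: list) -> None:
    """Model of a write to /proc/<pid>/oom_score_adj for `writer`."""
    writer.oom_score_adj = value
    if writer.in_vfork:
        # vfork(); set_oom_adj(); exec() must only affect the child itself.
        return
    for p in all_processes:
        if p is not writer and p.mm_id == writer.mm_id:
            p.oom_score_adj = value


a = Process(pid=1, mm_id=10)
b = Process(pid=2, mm_id=10)   # shares the mm with a (CLONE_VM)
c = Process(pid=3, mm_id=20)   # unrelated address space
procs = [a, b, c]

write_oom_score_adj(a, -500, procs)   # b follows a, c is untouched

child = Process(pid=4, mm_id=10, in_vfork=True)
procs.append(child)
write_oom_score_adj(child, -1000, procs)   # only the vforked child changes
```

After the first write both sharers report -500; the second write leaves them alone because the writer is still in the middle of vfork.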
      
      Link: http://lkml.kernel.org/r/1466426628-15074-5-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      44a70ade
    • M
      proc, oom_adj: extract oom_score_adj setting into a helper · 1d5f0acb
      Michal Hocko committed
      Currently we have two proc interfaces to set oom_score_adj: the legacy
      /proc/<pid>/oom_adj and /proc/<pid>/oom_score_adj, each with its own
      handler.  A big part of the logic is duplicated, so extract the common
      code into the __set_oom_adj helper.  The legacy knob still treats some
      details slightly differently, so make sure those are handled the same
      way as before - e.g. the legacy mode ignores oom_score_adj_min and it
      warns about the usage.
      
      This patch shouldn't introduce any functional changes.
      
      Link: http://lkml.kernel.org/r/1466426628-15074-4-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1d5f0acb
    • M
      proc, oom: drop bogus sighand lock · f913da59
      Michal Hocko committed
      Oleg has pointed out that we can simplify both oom_adj_{read,write} and
      oom_score_adj_{read,write} even further and drop the sighand lock.  The
      main purpose of the lock was to protect p->signal from going away but this
      will not happen since ea6d290c ("signals: make task_struct->signal
      immutable/refcountable").
      
      The other role of the lock was to synchronize different writers,
      especially those with CAP_SYS_RESOURCE.  Introduce a mutex for this
      purpose.  Later patches will need this lock anyway.
      
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Link: http://lkml.kernel.org/r/1466426628-15074-3-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f913da59
    • M
      proc, oom: drop bogus task_lock and mm check · d49fbf76
      Michal Hocko committed
      Series "Handle oom bypass more gracefully", V5
      
      The following 10 patches should bring some order to the very rare cases
      of an mm shared between processes and finally make the paths which bypass
      the oom killer oom reapable and therefore much more reliable.  Even
      though an mm shared outside of a thread group is rare (either vforked
      tasks for a short period, use_mm by kernel threads or the exotic thread
      model of clone(CLONE_VM) without CLONE_SIGHAND) it is better to cover
      these cases.  Not only do they make the current oom killer logic quite
      hard to follow and reason about, they can also lead to weird corner
      cases.  E.g. it is possible to select an oom victim which shares the mm
      with an unkillable process, or to bypass the oom killer even when other
      processes sharing the mm are still alive.
      
      Patch 1 drops bogus task_lock and mm check from oom_{score_}adj_write.
      This can be considered a bug fix with a low impact as nobody has noticed
      for years.
      
      Patch 2 drops sighand lock because it is not needed anymore as pointed
      by Oleg.
      
      Patch 3 is a clean up of oom_score_adj handling and a preparatory work
      for later patches.
      
      Patch 4 enforces oom_score_adj to be consistent between processes
      sharing the mm to behave consistently with the regular thread groups.
      This can be considered a user visible behavior change because one thread
      group updating oom_score_adj will affect others which share the same mm
      via clone(CLONE_VM).  I argue that this should be acceptable because we
      already have the same behavior for threads in the same thread group and
      sharing the mm without signal struct is just a different model of
      threading.  This is probably the most controversial part of the series,
      I would like to find some consensus here.  There were some suggestions
      to hook some counter/oom_score_adj into the mm_struct but I feel that
      this is not necessary right now and we can rely on proc handler +
      oom_kill_process to DTRT.  I can be convinced otherwise but I strongly
      think that whatever we do the userspace has to have a way to see the
      current oom priority as consistently as possible.
      
      Patch 5 makes sure that no vforked task is selected if it is sharing the
      mm with oom unkillable task.
      
      Patch 6 ensures that all user tasks sharing the mm are killed which in
      turn makes sure that all oom victims are oom reapable.
      
      Patch 7 guarantees that task_will_free_mem will always imply reapable
      bypass of the oom killer.
      
      Patch 8 is new in this version and it addresses an issue pointed out by
      0-day OOM report where an oom victim was reaped several times.
      
      Patch 9 puts an upper bound on how many times oom_reaper tries to reap a
      task and hides it from the oom killer to move on when no progress can be
      made.  This will give an upper bound to how long an oom_reapable task
      can block the oom killer from selecting another victim if the oom_reaper
      is not able to reap the victim.
      
      Patch 10 tries to plug the (hopefully) last hole where we can still lock
      up, when the oom victim's mm is shared with oom unkillable tasks
      (kthreads and global init).  In that case we make a best effort and
      fall back to killing something else rather than risk a lockup.
      
      This patch (of 10):
      
      Both oom_adj_write and oom_score_adj_write are using task_lock, check for
      task->mm and fail if it is NULL.  This is not needed because the
      oom_score_adj is per signal struct so we do not need mm at all.  The code
      has been introduced by 3d5992d2 ("oom: add per-mm oom disable count")
      but we do not do per-mm oom disable since c9f01245 ("oom: remove
      oom_disable_count").
      
      The task->mm check is not even correct, because the current thread might
      have exited while the thread group is still alive.  E.g. if the thread
      group leader has exited, echo $VAL > /proc/pid/oom_score_adj would always
      fail with EINVAL while /proc/pid/task/$other_tid/oom_score_adj would
      succeed.  This is unexpected at best.
      
      Remove the lock along with the check to fix the unexpected behavior and
      also because there is no real need for the lock in the first place.
      
      Link: http://lkml.kernel.org/r/1466426628-15074-2-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d49fbf76
  2. 28 Jul, 2016 19 commits
    • L
      Add braces to avoid "ambiguous ‘else’" compiler warnings · 194dc870
      Linus Torvalds committed
      Some of our "for_each_xyz()" macro constructs make gcc unhappy about
      lack of braces around if-statements inside or outside the loop, because
      the loop construct itself has an "if-then-else" statement inside of it.
      
      The resulting warnings look something like this:
      
        drivers/gpu/drm/i915/i915_debugfs.c: In function ‘i915_dump_lrc’:
        drivers/gpu/drm/i915/i915_debugfs.c:2103:6: warning: suggest explicit braces to avoid ambiguous ‘else’ [-Wparentheses]
           if (ctx != dev_priv->kernel_context)
              ^
      
      even if the code itself is fine.
      
      Since the warning is fairly easy to avoid by adding braces around the
      if-statement near the for_each_xyz() construct, do so, rather than
      disabling the otherwise potentially useful warning.
      
      (The if-then-else statements used in the "for_each_xyz()" constructs are
      designed to be inherently safe even with no braces, but in this case
      it's quite understandable that gcc isn't really able to tell that).
      
      This finally leaves the standard "allmodconfig" build with just a
      handful of remaining warnings, so new and valid warnings hopefully will
      stand out.
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      194dc870
    • L
      Disable "frame-address" warning · 124a3d88
      Linus Torvalds committed
      Newer versions of gcc warn about the use of __builtin_return_address()
      with a non-zero argument when "-Wall" is specified:
      
        kernel/trace/trace_irqsoff.c: In function ‘stop_critical_timings’:
        kernel/trace/trace_irqsoff.c:433:86: warning: calling ‘__builtin_return_address’ with a nonzero argument is unsafe [-Wframe-address]
           stop_critical_timing(CALLER_ADDR0, CALLER_ADDR1);
        [ .. repeats a few times for other similar cases .. ]
      
      It is true that a non-zero argument is somewhat dangerous, and we do not
      actually have very many uses of that in the kernel - but the ftrace code
      does use it, and as Steven Rostedt says:
      
       "We are well aware of the danger of using __builtin_return_address() of
        > 0.  In fact that's part of the reason for having the "thunk" code in
        x86 (See arch/x86/entry/thunk_{64,32}.S).  [..] it adds extra frames
        when tracking irqs off sections, to prevent __builtin_return_address()
        from accessing bad areas.  In fact the thunk_32.S states: 'Trampoline to
        trace irqs off.  (otherwise CALLER_ADDR1 might crash)'."
      
      For now, __builtin_return_address() with a non-zero argument is the best
      we can do, and the warning is not helpful and can end up making people
      miss other warnings for real problems.
      
      So disable the frame-address warning on compilers that need it.
      
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      124a3d88
    • L
      Merge tag 'hsi-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-hsi · 8448cefe
      Linus Torvalds committed
      Pull HSI updates from Sebastian Reichel:
      
       - proper runtime pm support for omap-ssi and ssi-protocol
      
       - misc fixes
      
      * tag 'hsi-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-hsi: (24 commits)
        HSI: omap_ssi: drop pm_runtime_irq_safe
        HSI: omap_ssi_port: use rpm autosuspend API
        HSI: omap_ssi: call msg->complete() from process context
        HSI: omap_ssi_port: ensure clocks are kept enabled during transfer
        HSI: omap_ssi_port: replace pm_runtime_put_sync with non-sync variant
        HSI: omap_ssi_port: avoid calling runtime_pm_*_sync inside spinlock
        HSI: omap_ssi_port: avoid pm_runtime_get_sync in ssi_start_dma and ssi_start_pio
        HSI: omap_ssi_port: switch to threaded pio irq
        HSI: omap_ssi_core: remove pm_runtime_get_sync call from tasklet
        HSI: omap_ssi_core: use pm_runtime_put instead of pm_runtime_put_sync
        HSI: omap_ssi_port: prepare start_tx/stop_tx for blocking pm_runtime calls
        HSI: core: switch port event notifier from atomic to blocking
        HSI: omap_ssi_port: replace wkin_cken with atomic bitmap operations
        HSI: omap_ssi: convert cawake irq handler to thread
        HSI: ssi_protocol: fix ssip_xmit invocation
        HSI: ssi_protocol: replace spin_lock with spin_lock_bh
        HSI: ssi_protocol: avoid ssi_waketest call with held spinlock
        HSI: omap_ssi: do not reset module
        HSI: omap_ssi_port: remove useless newline
        hsi: Only descend into hsi directory when CONFIG_HSI is set
        ...
      8448cefe
    • L
      Merge tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random · 818e607b
      Linus Torvalds committed
      Pull random driver updates from Ted Ts'o:
       "A number of improvements for the /dev/random driver; the most
        important is the use of a ChaCha20-based CRNG for /dev/urandom, which
        is faster, more efficient, and easier to make scalable for
        silly/abusive userspace programs that want to read from /dev/urandom
        in a tight loop on NUMA systems.
      
        This set of patches also improves entropy gathering on VM's running on
        Microsoft Azure, and will take advantage of a hw random number
        generator (if present) to initialize the /dev/urandom pool"
      
      (It turns out that the random tree hadn't been in linux-next this time
      around, because it had been dropped earlier as being too quiet.  Oh
      well).
      
      * tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random:
        random: strengthen input validation for RNDADDTOENTCNT
        random: add backtracking protection to the CRNG
        random: make /dev/urandom scalable for silly userspace programs
        random: replace non-blocking pool with a Chacha20-based CRNG
        random: properly align get_random_int_hash
        random: add interrupt callback to VMBus IRQ handler
        random: print a warning for the first ten uninitialized random users
        random: initialize the non-blocking pool via add_hwgenerator_randomness()
      818e607b
    • L
      Merge tag 'media/v4.8-4' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media · ff9a082f
      Linus Torvalds committed
      Pull media documentation updates from Mauro Carvalho Chehab:
       "This patch series does the conversion of all media documentation stuff
        to the reStructuredText markup format and adds them to the
        Documentation/index.rst file.
      
        The media documentation was grouped into 4 books:
      
          - media uAPI
          - media kAPI
          - V4L driver-specific documentation
          - DVB driver-specific documentation
      
        It also contains several documentation improvements and one fixup
        patch for a core issue with cropcap.
      
        PS.  After this patch series, the media DocBook is deprecated and
        should be removed.  I'll add such patch on a future pull request"
      
      * tag 'media/v4.8-4' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (322 commits)
        [media] cx23885-cardlist.rst: add a new card
        [media] doc-rst: add some needed escape codes
        [media] doc-rst: kapi: use :c:func: instead of :cpp:func
        doc-rst: kernel-doc: fix a change introduced by mistake
        [media] v4l2-ioctl.h add debug info for struct v4l2_ioctl_ops
        [media] dvb_ringbuffer.h: some documentation improvements
        [media] v4l2-ctrls.h: fully document the header file
        [media] doc-rst: Fix some typedef ugly warnings
        [media] doc-rst: reorganize the kAPI v4l2 chapters
        [media] rename v4l2-framework.rst to v4l2-intro.rst
        [media] move V4L2 clocks to a separate .rst file
        [media] v4l2-fh.rst: add cross references and markups
        [media] v4l2-fh.rst: add fh contents from v4l2-framework.rst
        [media] v4l2-fh.h: add documentation for it
        [media] v4l2-event.rst: add cross-references and markups
        [media] v4l2-event.h: document all functions
        [media] v4l2-event.rst: add text from v4l2-framework.rst
        [media] v4l2-framework.rst: remove videobuf quick chapter
        [media] v4l2-dev: add cross-references and improve markup
        [media] doc-rst: move v4l2-dev doc to a separate file
        ...
      ff9a082f
    • L
      Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 6a492b0f
      Linus Torvalds committed
      Pull SCSI updates from James Bottomley:
       "This update includes the usual round of driver updates (fcoe, lpfc,
        ufs, qla2xxx, hisi_sas).  The most important other change is removing
        the flag to allow non-blk_mq on a per host basis (it's unused); there
        is still a global module parameter for all of SCSI just in case.
      
        The rest are an assortment of minor fixes and typo updates"
      
      * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (101 commits)
        scsi:libsas: fix oops caused by assigning a freed task to ->lldd_task
        fnic: pci_dma_mapping_error() doesn't return an error code
        scsi: lpfc: avoid harmless comparison warning
        fcoe: implement FIP VLAN responder
        fcoe: Rename 'fip_frame' to 'fip_vn2vn_notify_frame'
        lpfc: call lpfc_sli_validate_fcp_iocb() with the hbalock held
        scsi: ufs: remove unnecessary goto label
        hpsa: change hpsa_passthru_ioctl timeout
        hpsa: correct skipping masked peripherals
        qla2xxx: Update driver version to 8.07.00.38-k
        qla2xxx: Fix BBCR offset
        qla2xxx: Fix duplicate message id.
        qla2xxx: Disable the adapter and skip error recovery in case of register disconnect.
        qla2xxx: Separate ISP type bits out from device type.
        qla2xxx: Correction to function qla26xx_dport_diagnostics().
        qla2xxx: Add support to handle Loop Init error Asynchronus event.
        qla2xxx: Let DPORT be enabled purely by nvram.
        qla2xxx: Add bsg interface to support statistics counter reset.
        qla2xxx: Add bsg interface to support D_Port Diagnostics.
        qla2xxx: Check for device state before unloading the driver.
        ...
      6a492b0f
    • L
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input · d85486d4
      Linus Torvalds committed
      Pull input updates from Dmitry Torokhov:
       "Updates for the input subsystem.  This contains the following new
        drivers promised in the last merge window:
      
         - driver for touchscreen controller found in Surface 3
         - driver for Pegasus Notetaker tablet
         - driver for Atmel Captouch Buttons
         - driver for Raydium I2C touchscreen controllers
         - powerkey driver for HISI 65xx SoC
      
        plus a few fixes"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (40 commits)
        Input: tty/vt/keyboard - use memdup_user()
        Input: pegasus_notetaker - set device mode in reset_resume() if in use
        Input: pegasus_notetaker - cancel workqueue's work in suspend()
        Input: pegasus_notetaker - fix usb_autopm calls to be balanced
        Input: pegasus_notetaker - handle usb control msg errors
        Input: wacom_w8001 - handle errors from input_mt_init_slots()
        Input: wacom_w8001 - resolution wasn't set for ABS_MT_POSITION_X/Y
        Input: pixcir_ts - add support for axis inversion / swapping
        Input: icn8318 - use of_touchscreen helpers for inverting / swapping axes
        Input: edt-ft5x06 - add support for inverting / swapping axes
        Input: of_touchscreen - add support for inverted / swapped axes
        Input: synaptics-rmi4 - use the RMI_F11_REL_BYTES define in rmi_f11_rel_pos_report
        Input: synaptics-rmi4 - remove unneeded variable
        Input: synaptics-rmi4 - remove pointer to rmi_function in f12_data
        Input: synaptics-rmi4 - support regulator supplies
        Input: raydium_i2c_ts - check CRC of incoming packets
        Input: xen-kbdfront - prefer xenbus_write() over xenbus_printf() where possible
        Input: fix a double word "is is" in include/linux/input.h
        Input: add powerkey driver for HISI 65xx SoC
        Input: apanel - spelling mistake - "skiping" -> "skipping"
        ...
      d85486d4
    • L
      Merge branch 'i2c/for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · 66304207
      Committed by Linus Torvalds
      Pull i2c updates from Wolfram Sang:
       "Here is the I2C pull request for 4.8:
      
         - the core and i801 driver gained support for SMBus Host Notify
      
         - core support for more than one address in DT
      
         - i2c_add_adapter() now has better error messages.  We can remove all
           error messages from drivers calling it as a next step.
      
         - bigger updates to rk3x driver to support rk3399 SoC
      
         - the at24 eeprom driver got refactored and can now read special
           variants with unique serials or fixed MAC addresses.
      
        The rest is regular driver updates and bugfixes"
      
      * 'i2c/for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (66 commits)
        i2c: i801: use IS_ENABLED() instead of checking for built-in or module
        Documentation: i2c: slave: give proper example for pm usage
        Documentation: i2c: slave: describe buffer problems a bit better
        i2c: bcm2835: Don't complain on -EPROBE_DEFER from getting our clock
        i2c: i2c-smbus: drop useless stubs
        i2c: efm32: fix a failure path in efm32_i2c_probe()
        Revert "i2c: core: Cleanup I2C ACPI namespace"
        Revert "i2c: core: Add function for finding the bus speed from ACPI"
        i2c: Update the description of I2C_SMBUS
        i2c: i2c-smbus: fix i2c_handle_smbus_host_notify documentation
        eeprom: at24: tweak the loop_until_timeout() macro
        eeprom: at24: add support for at24mac series
        eeprom: at24: support reading the serial number for 24csxx
        eeprom: at24: platform_data: use BIT() macro
        eeprom: at24: split at24_eeprom_write() into specialized functions
        eeprom: at24: split at24_eeprom_read() into specialized functions
        eeprom: at24: hide the read/write loop behind a macro
        eeprom: at24: call read/write functions via function pointers
        eeprom: at24: coding style fixes
        eeprom: at24: move at24_read() below at24_eeprom_write()
        ...
      66304207
    • L
      Merge tag 'spi-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi · 7ae0ae4a
      Committed by Linus Torvalds
      Pull spi updates from Mark Brown:
       "Quite a lot of cleanup and maintenance work going on this release in
        various drivers, and also a fix for a nasty locking issue in the core:
      
         - A fix for locking issues when external drivers explicitly locked
           the bus with spi_bus_lock() - we were using the same lock to both
           control access to the physical bus in multi-threaded I/O operations
           and exclude multiple callers.
      
           Confusion between these two caused us to have scenarios where we
           were dropping locks.  These are fixed by splitting into two
           separate locks, as should have been done originally, making
           everything much clearer and correct.
      
         - Support for DMA in spi_flash_read().
      
         - Support for instantiating spidev on ACPI systems, including some
           test devices used in Windows validation.
      
         - Use of the core DMA mapping functionality in the McSPI driver.
      
         - Start of support for ThunderX SPI controllers, involving a very big
           set of changes to the Cavium driver.
      
         - Support for Braswell, Exynos 5433, Kaby Lake, Merrifield, RK3036,
           RK3228, RK3368 controllers"
      
      * tag 'spi-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi: (64 commits)
        spi: Split bus and I/O locking
        spi: octeon: Split driver into Octeon specific and common parts
        spi: octeon: Move include file from arch/mips to drivers/spi
        spi: octeon: Put register offsets into a struct
        spi: octeon: Store system clock freqency in struct octeon_spi
        spi: octeon: Convert driver to use readq()/writeq() functions
        spi: pic32-sqi: fixup wait_for_completion_timeout return handling
        spi: pic32: fixup wait_for_completion_timeout return handling
        spi: rockchip: limit transfers to (64K - 1) bytes
        spi: xilinx: Return IRQ_NONE if no interrupts were detected
        spi: xilinx: Handle errors from platform_get_irq()
        spi: s3c64xx: restore removed comments
        spi: s3c64xx: add Exynos5433 compatible for ioclk handling
        spi: s3c64xx: use error code from clk_prepare_enable()
        spi: s3c64xx: rename goto labels to meaningful names
        spi: s3c64xx: document the clocks and the clock-name property
        spi: s3c64xx: add exynos5433 spi compatible
        spi: s3c64xx: fix reference leak to master in s3c64xx_spi_remove()
        spi: spi-sh: Remove deprecated create_singlethread_workqueue
        spi: spi-topcliff-pch: Remove deprecated create_singlethread_workqueue
        ...
      7ae0ae4a
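
The locking split described in the pull message above can be illustrated with a small userspace analogy. This is only a sketch under stated assumptions: it uses POSIX mutexes rather than kernel mutexes, and `struct toy_bus`, `caller_lock`, `io_lock` and every `toy_*` function are hypothetical names invented for illustration, not the SPI core's API. One lock excludes other callers while someone owns the bus; a second, separate lock serialises the physical I/O.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical toy bus: two locks with two distinct jobs. */
struct toy_bus {
    pthread_mutex_t caller_lock; /* excludes other callers while the bus is owned */
    pthread_mutex_t io_lock;     /* serialises access to the "hardware" */
    int transfers;               /* counts completed transfers */
};

void toy_bus_init(struct toy_bus *b)
{
    pthread_mutex_init(&b->caller_lock, NULL);
    pthread_mutex_init(&b->io_lock, NULL);
    b->transfers = 0;
}

/* One physical transfer: only the I/O lock is taken here. */
static void toy_do_io(struct toy_bus *b)
{
    pthread_mutex_lock(&b->io_lock);
    b->transfers++;
    pthread_mutex_unlock(&b->io_lock);
}

/* Ordinary path: wait for any exclusive owner, then do the I/O. */
void toy_transfer(struct toy_bus *b)
{
    pthread_mutex_lock(&b->caller_lock);
    toy_do_io(b);
    pthread_mutex_unlock(&b->caller_lock);
}

/* Exclusive ownership: hold caller_lock across several transfers. */
void toy_bus_lock(struct toy_bus *b)
{
    pthread_mutex_lock(&b->caller_lock);
}

/* The owner already holds caller_lock, so only io_lock is needed. */
void toy_transfer_locked(struct toy_bus *b)
{
    toy_do_io(b);
}

void toy_bus_unlock(struct toy_bus *b)
{
    pthread_mutex_unlock(&b->caller_lock);
}
```

Because the exclusive owner never has to release and retake the lock that guards the I/O path, there is no window in which that lock is dropped mid-sequence, which is the kind of confusion the pull message attributes to the single-lock design.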
    • L
      Merge tag 'leds_for_4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds · 607e11ab
      Committed by Linus Torvalds
      Pull LED updates from Jacek Anaszewski:
       "New LED class driver:
         - LED driver for TI LP3952 6-Channel Color LED
      
        LED core improvements:
         - Only descend into leds directory when CONFIG_NEW_LEDS is set
         - Add no-op gpio_led_register_device when LED subsystem is disabled
         - MAINTAINERS: Add file patterns for led device tree bindings
      
        LED Trigger core improvements:
         - return error if invalid trigger name is provided via sysfs
      
        LED class drivers improvements
         - is31fl32xx: define complete i2c_device_id table
         - is31fl32xx: fix typo in id and match table names
         - leds-gpio: Set of_node for created LED devices
         - pca9532: Add device tree support
      
        Conversion of IDE trigger to common disk trigger:
         - leds: convert IDE trigger to common disk trigger
         - leds: documentation: 'ide-disk' to 'disk-activity'
         - unicore32: use the new LED disk activity trigger
         - parisc: use the new LED disk activity trigger
         - mips: use the new LED disk activity trigger
         - arm: use the new LED disk activity trigger
         - powerpc: use the new LED disk activity trigger"
      
      * tag 'leds_for_4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds:
        leds: is31fl32xx: define complete i2c_device_id table
        leds: is31fl32xx: fix typo in id and match table names
        leds: LED driver for TI LP3952 6-Channel Color LED
        leds: leds-gpio: Set of_node for created LED devices
        leds: triggers: return error if invalid trigger name is provided via sysfs
        leds: Only descend into leds directory when CONFIG_NEW_LEDS is set
        leds: Add no-op gpio_led_register_device when LED subsystem is disabled
        unicore32: use the new LED disk activity trigger
        parisc: use the new LED disk activity trigger
        mips: use the new LED disk activity trigger
        arm: use the new LED disk activity trigger
        powerpc: use the new LED disk activity trigger
        leds: documentation: 'ide-disk' to 'disk-activity'
        leds: convert IDE trigger to common disk trigger
        leds: pca9532: Add device tree support
        MAINTAINERS: Add file patterns for led device tree bindings
      607e11ab
    • L
      Merge tag 'for-linus-4.8' of git://git.code.sf.net/p/openipmi/linux-ipmi · 78d51aee
      Committed by Linus Torvalds
      Pull IPMI updates from Corey Minyard:
       "Remove some old cruft that was disabled by default a long time ago.
      
        No modern hardware should need this, and anybody who really doesn't
        have something to automatically detect IPMI can add the device by hand
        on the module commandline or hot add it"
      
      * tag 'for-linus-4.8' of git://git.code.sf.net/p/openipmi/linux-ipmi:
        ipmi: remove trydefaults parameter and default init
      78d51aee
    • L
      Merge tag 'edac_for_4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp · c79a14de
      Committed by Linus Torvalds
      Pull EDAC updates from Borislav Petkov:
       "This last cycle, Thor was busy adding Arria10 eth FIFO support to the
        altera_edac driver along with other improvements.  We have two
        cleanups/fixes too.
      
        Summary:
      
         - Altera Arria10 ethernet FIFO buffer support (Thor Thayer)
      
         - Minor cleanups"
      
      * tag 'edac_for_4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp:
        ARM: dts: Add Arria10 Ethernet EDAC devicetree entry
        EDAC, altera: Add Arria10 Ethernet EDAC support
        EDAC, altera: Add Arria10 ECC memory init functions
        Documentation: dt: socfpga: Add Arria10 Ethernet binding
        EDAC, altera: Drop some ifdeffery
        EDAC, altera: Add panic flag check to A10 IRQ
        EDAC, altera: Check parent status for Arria10 EDAC block
        EDAC, altera: Make all private data structures static
        EDAC: Correct channel count limit
        EDAC, amd64_edac: Init opstate at the proper time during init
        EDAC, altera: Handle Arria10 SDRAM child node
        EDAC, altera: Add ECC Manager IRQ controller support
        Documentation: dt: socfpga: Add interrupt-controller to ecc-manager
      c79a14de
    • L
      Disable "maybe-uninitialized" warning globally · 6e8d666e
      Committed by Linus Torvalds
      Several build configurations had already disabled this warning because
      it generates a lot of false positives.  But some had not, and it was
      still enabled for "allmodconfig" builds, for example.
      
      Looking at the warnings produced, every single one I looked at was a
      false positive, and the warnings are frequent enough (and big enough)
      that they can easily hide real problems that you don't notice in the
      noise generated by -Wmaybe-uninitialized.
      
      The warning is good in theory, but this is a classic case of a warning
      that causes more problems than the warning can solve.
      
      If gcc gets better at avoiding false positives, we may be able to
      re-enable this warning.  But as is, we're better off without it, and I
      want to be able to see the *real* warnings.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6e8d666e
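
As an illustration of the kind of false positive the commit message above describes, consider the following sketch. The function `scaled_sum` is a hypothetical example, not code from the kernel, and whether a given GCC version actually reports -Wmaybe-uninitialized for it depends on the compiler version and optimisation level. The variable is written and read under the same condition, so it is never used uninitialised at run time, yet the compiler has to correlate the two branches to prove that.

```c
#include <assert.h>

/* Hypothetical example: 'scale' is assigned and read under the same
 * condition, so no uninitialised read can happen at run time. */
int scaled_sum(const int *vals, int n, int use_scale)
{
    int scale;          /* deliberately left uninitialised */
    int sum = 0;

    if (use_scale)
        scale = 3;      /* only assigned when use_scale is true */

    for (int i = 0; i < n; i++)
        sum += vals[i];

    if (use_scale)
        sum *= scale;   /* only read when use_scale is true */

    return sum;
}
```

Initialising `scale` at its declaration would silence any such warning, but it would also hide a genuine bug if a later change read the variable on a path where it was never meaningfully set.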
    • L
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next · 468fc7ed
      Committed by Linus Torvalds
      Pull networking updates from David Miller:
      
       1) Unified UDP encapsulation offload methods for drivers, from
          Alexander Duyck.
      
       2) Make DSA binding more sane, from Andrew Lunn.
      
       3) Support QCA9888 chips in ath10k, from Anilkumar Kolli.
      
       4) Several workqueue usage cleanups, from Bhaktipriya Shridhar.
      
       5) Add XDP (eXpress Data Path), essentially running BPF programs on RX
          packets as soon as the device sees them, with the option to mirror
          the packet on TX via the same interface.  From Brenden Blanco and
          others.
      
       6) Allow qdisc/class stats dumps to run lockless, from Eric Dumazet.
      
       7) Add VLAN support to b53 and bcm_sf2, from Florian Fainelli.
      
       8) Simplify netlink conntrack entry layout, from Florian Westphal.
      
       9) Add ipv4 forwarding support to mlxsw spectrum driver, from Ido
          Schimmel, Yotam Gigi, and Jiri Pirko.
      
      10) Add SKB array infrastructure and convert tun and macvtap over to it.
          From Michael S Tsirkin and Jason Wang.
      
      11) Support qdisc packet injection in pktgen, from John Fastabend.
      
      12) Add neighbour monitoring framework to TIPC, from Jon Paul Maloy.
      
      13) Add NV congestion control support to TCP, from Lawrence Brakmo.
      
      14) Add GSO support to SCTP, from Marcelo Ricardo Leitner.
      
      15) Allow GRO and RPS to function on macsec devices, from Paolo Abeni.
      
      16) Support MPLS over IPV4, from Simon Horman.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits)
        xgene: Fix build warning with ACPI disabled.
        be2net: perform temperature query in adapter regardless of its interface state
        l2tp: Correctly return -EBADF from pppol2tp_getname.
        net/mlx5_core/health: Remove deprecated create_singlethread_workqueue
        net: ipmr/ip6mr: update lastuse on entry change
        macsec: ensure rx_sa is set when validation is disabled
        tipc: dump monitor attributes
        tipc: add a function to get the bearer name
        tipc: get monitor threshold for the cluster
        tipc: make cluster size threshold for monitoring configurable
        tipc: introduce constants for tipc address validation
        net: neigh: disallow transition to NUD_STALE if lladdr is unchanged in neigh_update()
        MAINTAINERS: xgene: Add driver and documentation path
        Documentation: dtb: xgene: Add MDIO node
        dtb: xgene: Add MDIO node
        drivers: net: xgene: ethtool: Use phy_ethtool_gset and sset
        drivers: net: xgene: Use exported functions
        drivers: net: xgene: Enable MDIO driver
        drivers: net: xgene: Add backward compatibility
        drivers: net: phy: xgene: Add MDIO driver
        ...
      468fc7ed
    • L
      Merge tag 'for-linus-4.8-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip · 08fd8c17
      Committed by Linus Torvalds
      Pull xen updates from David Vrabel:
       "Features and fixes for 4.8-rc0:
      
         - ACPI support for guests on ARM platforms.
         - Generic steal time support for arm and x86.
         - Support cases where kernel cpu is not Xen VCPU number (e.g., if
           in-guest kexec is used).
         - Use the system workqueue instead of a custom workqueue in various
           places"
      
      * tag 'for-linus-4.8-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (47 commits)
        xen: add static initialization of steal_clock op to xen_time_ops
        xen/pvhvm: run xen_vcpu_setup() for the boot CPU
        xen/evtchn: use xen_vcpu_id mapping
        xen/events: fifo: use xen_vcpu_id mapping
        xen/events: use xen_vcpu_id mapping in events_base
        x86/xen: use xen_vcpu_id mapping when pointing vcpu_info to shared_info
        x86/xen: use xen_vcpu_id mapping for HYPERVISOR_vcpu_op
        xen: introduce xen_vcpu_id mapping
        x86/acpi: store ACPI ids from MADT for future usage
        x86/xen: update cpuid.h from Xen-4.7
        xen/evtchn: add IOCTL_EVTCHN_RESTRICT
        xen-blkback: really don't leak mode property
        xen-blkback: constify instance of "struct attribute_group"
        xen-blkfront: prefer xenbus_scanf() over xenbus_gather()
        xen-blkback: prefer xenbus_scanf() over xenbus_gather()
        xen: support runqueue steal time on xen
        arm/xen: add support for vm_assist hypercall
        xen: update xen headers
        xen-pciback: drop superfluous variables
        xen-pciback: short-circuit read path used for merging write values
        ...
      08fd8c17
    • L
      Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · e831101a
      Committed by Linus Torvalds
      Pull arm64 updates from Catalin Marinas:
      
       - Kexec support for arm64
      
       - Kprobes support
      
       - Expose MIDR_EL1 and REVIDR_EL1 CPU identification registers to sysfs
      
       - Trapping of user space cache maintenance operations and emulation in
         the kernel (CPU errata workaround)
      
       - Clean-up of the early page tables creation (kernel linear mapping,
         EFI run-time maps) to avoid splitting larger blocks (e.g.  pmds) into
         smaller ones (e.g.  ptes)
      
       - VDSO support for CLOCK_MONOTONIC_RAW in clock_gettime()
      
       - ARCH_HAS_KCOV enabled for arm64
      
       - Optimise IP checksum helpers
      
       - SWIOTLB optimisation to only allocate/initialise the buffer if the
         available RAM is beyond the 32-bit mask
      
       - Properly handle the "nosmp" command line argument
      
       - Fix for the initialisation of the CPU debug state during early boot
      
       - vdso-offsets.h build dependency workaround
      
       - Build fix when RANDOMIZE_BASE is enabled with MODULES off
      
      * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (64 commits)
        arm64: arm: Fix-up the removal of the arm64 regs_query_register_name() prototype
        arm64: Only select ARM64_MODULE_PLTS if MODULES=y
        arm64: mm: run pgtable_page_ctor() on non-swapper translation table pages
        arm64: mm: make create_mapping_late() non-allocating
        arm64: Honor nosmp kernel command line option
        arm64: Fix incorrect per-cpu usage for boot CPU
        arm64: kprobes: Add KASAN instrumentation around stack accesses
        arm64: kprobes: Cleanup jprobe_return
        arm64: kprobes: Fix overflow when saving stack
        arm64: kprobes: WARN if attempting to step with PSTATE.D=1
        arm64: debug: remove unused local_dbg_{enable, disable} macros
        arm64: debug: remove redundant spsr manipulation
        arm64: debug: unmask PSTATE.D earlier
        arm64: localise Image objcopy flags
        arm64: ptrace: remove extra define for CPSR's E bit
        kprobes: Add arm64 case in kprobe example module
        arm64: Add kernel return probes support (kretprobes)
        arm64: Add trampoline code for kretprobes
        arm64: kprobes instruction simulation support
        arm64: Treat all entry code as non-kprobe-able
        ...
      e831101a
    • L
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile · f9abf53a
      Committed by Linus Torvalds
      Pull tile architecture updates from Chris Metcalf:
       "A few stray changes"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
        tile: Define AT_VECTOR_SIZE_ARCH for ARCH_DLINFO
        tile: support gcc 7 optimization to use __multi3
        tile 32-bit big-endian: fix bugs in syscall argument order
        tile: allow disabling CONFIG_EARLY_PRINTK
      f9abf53a
    • L
      Merge tag 'dlm-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm · ba4f6789
      Committed by Linus Torvalds
      Pull dlm updates from David Teigland:
       "This set includes two trivial changes, one to use kmemdup and another
        to control the log level of recovery messages"
      
      * tag 'dlm-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
        dlm: Use kmemdup instead of kmalloc and memcpy
        dlm: add log_info config option
      ba4f6789
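
The kmemdup change mentioned above replaces an allocate-then-copy pair with a single call. A minimal userspace sketch of that pattern follows; `memdup_sketch` is a hypothetical stand-in that uses `malloc` in place of `kmalloc` and takes no allocation-flags argument, so it mirrors only the shape of the kernel helper, not its exact signature.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace stand-in for the kernel's kmemdup():
 * allocate len bytes and copy src into them in one step.
 * Returns NULL if the allocation fails; the caller frees the result. */
void *memdup_sketch(const void *src, size_t len)
{
    void *copy = malloc(len);

    if (copy)
        memcpy(copy, src, len);
    return copy;
}
```

Folding the two steps into one helper leaves a single place to check for allocation failure and removes the chance of the allocation size and the copy size drifting apart.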
    • L
      Merge tag 'for-f2fs-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs · 4fc29c1a
      Committed by Linus Torvalds
      Pull f2fs updates from Jaegeuk Kim:
       "The major change in this version is mitigating cpu overheads on write
        paths by replacing redundant inode page updates with mark_inode_dirty
        calls.  And we tried to reduce lock contentions as well to improve
        filesystem scalability.  Another feature is setting F2FS automatically
        when detecting host-managed SMR.
      
        Enhancements:
         - ioctl to move a range of data between files
         - inject orphan inode errors
         - avoid flush commands congestion
         - support lazytime
      
        Bug fixes:
         - return proper results for some dentry operations
         - fix deadlock in add_link failure
         - disable extent_cache for fcollapse/finsert"
      
      * tag 'for-f2fs-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (68 commits)
        f2fs: clean up coding style and redundancy
        f2fs: get victim segment again after new cp
        f2fs: handle error case with f2fs_bug_on
        f2fs: avoid data race when deciding checkpoin in f2fs_sync_file
        f2fs: support an ioctl to move a range of data blocks
        f2fs: fix to report error number of f2fs_find_entry
        f2fs: avoid memory allocation failure due to a long length
        f2fs: reset default idle interval value
        f2fs: use blk_plug in all the possible paths
        f2fs: fix to avoid data update racing between GC and DIO
        f2fs: add maximum prefree segments
        f2fs: disable extent_cache for fcollapse/finsert inodes
        f2fs: refactor __exchange_data_block for speed up
        f2fs: fix ERR_PTR returned by bio
        f2fs: avoid mark_inode_dirty
        f2fs: move i_size_write in f2fs_write_end
        f2fs: fix to avoid redundant discard during fstrim
        f2fs: avoid mismatching block range for discard
        f2fs: fix incorrect f_bfree calculation in ->statfs
        f2fs: use percpu_rw_semaphore
        ...
      4fc29c1a