1. 25 Jul 2018, 14 commits
    • sched/numa: Modify migrate_swap() to accept additional parameters · 0ad4e3df
      Srikar Dronamraju committed
      There are checks in migrate_swap_stop() that check if the task/CPU
      combination is as per migrate_swap_arg before migrating.
      
      However, at least one of the two tasks to be swapped by migrate_swap() could
      have migrated to a completely different CPU before the migrate_swap_arg is
      updated. The new CPU where the task is currently running could be on a
      different node too. If the task has migrated, the NUMA balancer might end up
      placing the task on the wrong node.  Instead of achieving node
      consolidation, it may end up spreading the load across nodes.
      
      To avoid that, pass the CPUs as additional parameters.
      
      While here, place migrate_swap under CONFIG_NUMA_BALANCING.
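
      The shape of that change can be pictured with the following sketch; this is
      illustrative plain C with invented names (swap_request, swap_still_valid),
      not the actual kernel hunk:

          /*
           * Capture the CPUs at decision time and have the stopper callback
           * verify them, instead of re-reading the (possibly stale) CPU of
           * each task later.
           */
          struct swap_request {
                  int src_cpu;    /* where the source task ran when the swap was decided */
                  int dst_cpu;    /* where the destination task ran at decision time */
          };

          /* Stopper-side check: abort if either task has already migrated away. */
          static int swap_still_valid(const struct swap_request *req,
                                      int cur_src_cpu, int cur_dst_cpu)
          {
                  return req->src_cpu == cur_src_cpu && req->dst_cpu == cur_dst_cpu;
          }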
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25377.3     25226.6     -0.59
      1     72287       73326       1.437
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-10-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0ad4e3df
    • sched/numa: Remove unused task_capacity from 'struct numa_stats' · 10864a9e
      Srikar Dronamraju committed
      The task_capacity field in 'struct numa_stats' is redundant.
      Also move nr_running for better packing within the struct.
      
      No functional changes.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25308.6     25377.3     0.271
      1     72964       72287       -0.92
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Rik van Riel <riel@surriel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-9-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      10864a9e
    • sched/numa: Skip nodes that are at 'hoplimit' · 0ee7e74d
      Srikar Dronamraju committed
      When comparing two nodes at a distance of 'hoplimit', we should consider
      nodes only up to 'hoplimit'. Currently we also consider nodes at 'hoplimit'
      distance. Hence two nodes at a distance of 'hoplimit' will have the same
      groupweight. Fix this by skipping nodes at 'hoplimit'.
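
      The off-by-one in the distance cut-off can be pictured with the standalone
      sketch below (plain C, invented names; not the scheduler code). The fix
      amounts to excluding nodes sitting exactly at the hop limit:

          /* Sum faults of nodes strictly closer than the hop limit. */
          static long faults_within_hoplimit(const long faults[], const int dist[],
                                             int nr_nodes, int hoplimit)
          {
                  long score = 0;
                  int n;

                  for (n = 0; n < nr_nodes; n++) {
                          if (dist[n] >= hoplimit)  /* '>=' also skips nodes at the limit */
                                  continue;
                          score += faults[n];
                  }
                  return score;
          }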
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25375.3     25308.6     -0.26
      1     72617       72964       0.477
      
      Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      8     113372      108750      -4.07684
      1     177403      183115      3.21979
      
      (numbers from v1 based on v4.17-rc5)
      Testcase       Time:         Min         Max         Avg      StdDev
      numa01.sh      Real:      478.45      565.90      515.11       30.87
      numa01.sh       Sys:      207.79      271.04      232.94       21.33
      numa01.sh      User:    39763.93    47303.12    43210.73     2644.86
      numa02.sh      Real:       60.00       61.46       60.78        0.49
      numa02.sh       Sys:       15.71       25.31       20.69        3.42
      numa02.sh      User:     5175.92     5265.86     5235.97       32.82
      numa03.sh      Real:      776.42      834.85      806.01       23.22
      numa03.sh       Sys:      114.43      128.75      121.65        5.49
      numa03.sh      User:    60773.93    64855.25    62616.91     1576.39
      numa04.sh      Real:      456.93      511.95      482.91       20.88
      numa04.sh       Sys:      178.09      460.89      356.86       94.58
      numa04.sh      User:    36312.09    42553.24    39623.21     2247.96
      numa05.sh      Real:      393.98      493.48      436.61       35.59
      numa05.sh       Sys:      164.49      329.15      265.87       61.78
      numa05.sh      User:    33182.65    36654.53    35074.51     1187.71
      
      Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
      numa01.sh      Real:      414.64      819.20      556.08      147.70 	 -7.36%
      numa01.sh       Sys:       77.52      205.04      139.40       52.05 	 67.10%
      numa01.sh      User:    37043.24    61757.88    45517.48     9290.38 	 -5.06%
      numa02.sh      Real:       60.80       63.32       61.63        0.88 	 -1.37%
      numa02.sh       Sys:       17.35       39.37       25.71        7.33 	 -19.5%
      numa02.sh      User:     5213.79     5374.73     5268.90       55.09 	 -0.62%
      numa03.sh      Real:      780.09      948.64      831.43       63.02 	 -3.05%
      numa03.sh       Sys:      104.96      136.92      116.31       11.34 	 4.591%
      numa03.sh      User:    60465.42    73339.78    64368.03     4700.14 	 -2.72%
      numa04.sh      Real:      412.60      681.92      521.29       96.64 	 -7.36%
      numa04.sh       Sys:      210.32      314.10      251.77       37.71 	 41.74%
      numa04.sh      User:    34026.38    45581.20    38534.49     4198.53 	 2.825%
      numa05.sh      Real:      394.79      439.63      411.35       16.87 	 6.140%
      numa05.sh       Sys:      238.32      330.09      292.31       38.32 	 -9.04%
      numa05.sh      User:    33456.45    34876.07    34138.62      609.45 	 2.741%
      
      While this change shows a regression, it is needed from a correctness
      perspective. It also helps consolidation, as seen from the perf bench
      output.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-8-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0ee7e74d
    • sched/debug: Reverse the order of printing faults · 67d9f6c2
      Srikar Dronamraju committed
      Fix the order in which the private and shared numa faults are getting
      printed.
      
      No functional changes.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25215.7     25375.3     0.63
      1     72107       72617       0.70
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-7-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      67d9f6c2
    • sched/numa: Use task faults only if numa_group is not yet set up · f03bb676
      Srikar Dronamraju committed
      When numa_group faults are available, task_numa_placement() only uses
      numa_group faults to evaluate the preferred node. However, it still accounts
      task faults and even evaluates a preferred node based on task faults alone,
      only to discard it in favour of the preferred node chosen on the basis of
      numa_group.
      
      Instead, use task faults only if numa_group is not set.
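
      A hedged, standalone sketch of that decision (plain C, invented names; not
      the kernel hunk) looks like this:

          /*
           * Pick the node with the highest fault score; the group score is used
           * whenever the task belongs to a numa_group, the per-task score only
           * when it does not.
           */
          static int pick_preferred_node(int has_numa_group,
                                         const long group_score[],
                                         const long task_score[],
                                         int nr_nodes)
          {
                  const long *score = has_numa_group ? group_score : task_score;
                  int best = 0, n;

                  for (n = 1; n < nr_nodes; n++)
                          if (score[n] > score[best])
                                  best = n;
                  return best;
          }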
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25549.6     25215.7     -1.30
      1     73190       72107       -1.47
      
      Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      8     113437      113372      -0.05
      1     196130      177403      -9.54
      
      (numbers from v1 based on v4.17-rc5)
      Testcase       Time:         Min         Max         Avg      StdDev
      numa01.sh      Real:      506.35      794.46      599.06      104.26
      numa01.sh       Sys:      150.37      223.56      195.99       24.94
      numa01.sh      User:    43450.69    61752.04    49281.50     6635.33
      numa02.sh      Real:       60.33       62.40       61.31        0.90
      numa02.sh       Sys:       18.12       31.66       24.28        5.89
      numa02.sh      User:     5203.91     5325.32     5260.29       49.98
      numa03.sh      Real:      696.47      853.62      745.80       57.28
      numa03.sh       Sys:       85.68      123.71       97.89       13.48
      numa03.sh      User:    55978.45    66418.63    59254.94     3737.97
      numa04.sh      Real:      444.05      514.83      497.06       26.85
      numa04.sh       Sys:      230.39      375.79      316.23       48.58
      numa04.sh      User:    35403.12    41004.10    39720.80     2163.08
      numa05.sh      Real:      423.09      460.41      439.57       13.92
      numa05.sh       Sys:      287.38      480.15      369.37       68.52
      numa05.sh      User:    34732.12    38016.80    36255.85     1070.51
      
      Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
      numa01.sh      Real:      478.45      565.90      515.11       30.87 	 16.29%
      numa01.sh       Sys:      207.79      271.04      232.94       21.33 	 -15.8%
      numa01.sh      User:    39763.93    47303.12    43210.73     2644.86 	 14.04%
      numa02.sh      Real:       60.00       61.46       60.78        0.49 	 0.871%
      numa02.sh       Sys:       15.71       25.31       20.69        3.42 	 17.35%
      numa02.sh      User:     5175.92     5265.86     5235.97       32.82 	 0.464%
      numa03.sh      Real:      776.42      834.85      806.01       23.22 	 -7.47%
      numa03.sh       Sys:      114.43      128.75      121.65        5.49 	 -19.5%
      numa03.sh      User:    60773.93    64855.25    62616.91     1576.39 	 -5.36%
      numa04.sh      Real:      456.93      511.95      482.91       20.88 	 2.930%
      numa04.sh       Sys:      178.09      460.89      356.86       94.58 	 -11.3%
      numa04.sh      User:    36312.09    42553.24    39623.21     2247.96 	 0.246%
      numa05.sh      Real:      393.98      493.48      436.61       35.59 	 0.677%
      numa05.sh       Sys:      164.49      329.15      265.87       61.78 	 38.92%
      numa05.sh      User:    33182.65    36654.53    35074.51     1187.71 	 3.368%
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-6-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f03bb676
    • sched/numa: Set preferred_node based on best_cpu · 8cd45eee
      Srikar Dronamraju committed
      Currently preferred node is set to dst_nid which is the last node in the
      iteration whose group weight or task weight is greater than the current
      node. However it doesn't guarantee that dst_nid has the numa capacity
      to move. It also doesn't guarantee that dst_nid has the best_cpu which
      is the CPU/node ideal for node migration.
      
      Let's consider faults on a 4 node system with group weight numbers
      in different nodes being in 0 < 1 < 2 < 3 proportion. Consider the task
      is running on 3 and 0 is its preferred node but its capacity is full.
      Consider nodes 1, 2 and 3 have capacity. Then the task should be
      migrated to node 1. Currently the task gets moved to node 2. env.dst_nid
      points to the last node whose faults were greater than current node.
      
      Modify the code to set the preferred node based on best_cpu. Earlier, setting
      the preferred node was skipped if nr_active_nodes is 1. This could result in
      the task being moved out of the preferred node to a random node during
      regular load balancing.
      
      Also, while modifying task_numa_migrate(), use sched_setnuma() to set the
      preferred node. This ensures our NUMA accounting is correct.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25122.9     25549.6     1.698
      1     73850       73190       -0.89
      
      Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      8     105930      113437      7.08676
      1     178624      196130      9.80047
      
      (numbers from v1 based on v4.17-rc5)
      Testcase       Time:         Min         Max         Avg      StdDev
      numa01.sh      Real:      435.78      653.81      534.58       83.20
      numa01.sh       Sys:      121.93      187.18      145.90       23.47
      numa01.sh      User:    37082.81    51402.80    43647.60     5409.75
      numa02.sh      Real:       60.64       61.63       61.19        0.40
      numa02.sh       Sys:       14.72       25.68       19.06        4.03
      numa02.sh      User:     5210.95     5266.69     5233.30       20.82
      numa03.sh      Real:      746.51      808.24      780.36       23.88
      numa03.sh       Sys:       97.26      108.48      105.07        4.28
      numa03.sh      User:    58956.30    61397.05    60162.95     1050.82
      numa04.sh      Real:      465.97      519.27      484.81       19.62
      numa04.sh       Sys:      304.43      359.08      334.68       20.64
      numa04.sh      User:    37544.16    41186.15    39262.44     1314.91
      numa05.sh      Real:      411.57      457.20      433.29       16.58
      numa05.sh       Sys:      230.05      435.48      339.95       67.58
      numa05.sh      User:    33325.54    36896.31    35637.84     1222.64
      
      Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
      numa01.sh      Real:      506.35      794.46      599.06      104.26 	 -10.76%
      numa01.sh       Sys:      150.37      223.56      195.99       24.94 	 -25.55%
      numa01.sh      User:    43450.69    61752.04    49281.50     6635.33 	 -11.43%
      numa02.sh      Real:       60.33       62.40       61.31        0.90 	 -0.195%
      numa02.sh       Sys:       18.12       31.66       24.28        5.89 	 -21.49%
      numa02.sh      User:     5203.91     5325.32     5260.29       49.98 	 -0.513%
      numa03.sh      Real:      696.47      853.62      745.80       57.28 	 4.6339%
      numa03.sh       Sys:       85.68      123.71       97.89       13.48 	 7.3347%
      numa03.sh      User:    55978.45    66418.63    59254.94     3737.97 	 1.5323%
      numa04.sh      Real:      444.05      514.83      497.06       26.85 	 -2.464%
      numa04.sh       Sys:      230.39      375.79      316.23       48.58 	 5.8343%
      numa04.sh      User:    35403.12    41004.10    39720.80     2163.08 	 -1.153%
      numa05.sh      Real:      423.09      460.41      439.57       13.92 	 -1.428%
      numa05.sh       Sys:      287.38      480.15      369.37       68.52 	 -7.964%
      numa05.sh      User:    34732.12    38016.80    36255.85     1070.51 	 -1.704%
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-5-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8cd45eee
    • sched/numa: Simplify load_too_imbalanced() · 5f95ba7a
      Srikar Dronamraju committed
      Currently load_too_imbalanced() cares about the slope of the imbalance.
      It doesn't care about the direction of the imbalance.
      
      However this may not work if the nodes being compared have dissimilar
      capacities. A few nodes might have more cores than other nodes in the
      system. Also, unlike traditional load balancing at a NUMA sched domain,
      multiple requests to migrate from the same source node to the same
      destination node may run in parallel. This can cause huge load imbalance.
      This is especially true on larger machines with either many cores per node
      or a large number of nodes. Hence allow a move/swap only if the imbalance
      is going to be reduced.
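
      A minimal standalone sketch of such a direction-aware check (plain C; the
      variable names are assumptions, not quoted from the patch) compares the
      capacity-scaled imbalance before and after the proposed move:

          #include <stdlib.h>     /* labs() */

          static int move_reduces_imbalance(long src_load, long dst_load,
                                            long orig_src_load, long orig_dst_load,
                                            long src_capacity, long dst_capacity)
          {
                  /* Capacity-scaled imbalance after and before the move. */
                  long new_imb = labs(dst_load * src_capacity - src_load * dst_capacity);
                  long old_imb = labs(orig_dst_load * src_capacity - orig_src_load * dst_capacity);

                  return new_imb <= old_imb;      /* reject moves that make things worse */
          }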
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25058.2     25122.9     0.25
      1     72950       73850       1.23
      
      (numbers from v1 based on v4.17-rc5)
      Testcase       Time:         Min         Max         Avg      StdDev
      numa01.sh      Real:      516.14      892.41      739.84      151.32
      numa01.sh       Sys:      153.16      192.99      177.70       14.58
      numa01.sh      User:    39821.04    69528.92    57193.87    10989.48
      numa02.sh      Real:       60.91       62.35       61.58        0.63
      numa02.sh       Sys:       16.47       26.16       21.20        3.85
      numa02.sh      User:     5227.58     5309.61     5265.17       31.04
      numa03.sh      Real:      739.07      917.73      795.75       64.45
      numa03.sh       Sys:       94.46      136.08      109.48       14.58
      numa03.sh      User:    57478.56    72014.09    61764.48     5343.69
      numa04.sh      Real:      442.61      715.43      530.31       96.12
      numa04.sh       Sys:      224.90      348.63      285.61       48.83
      numa04.sh      User:    35836.84    47522.47    40235.41     3985.26
      numa05.sh      Real:      386.13      489.17      434.94       43.59
      numa05.sh       Sys:      144.29      438.56      278.80      105.78
      numa05.sh      User:    33255.86    36890.82    34879.31     1641.98
      
      Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
      numa01.sh      Real:      435.78      653.81      534.58       83.20 	 38.39%
      numa01.sh       Sys:      121.93      187.18      145.90       23.47 	 21.79%
      numa01.sh      User:    37082.81    51402.80    43647.60     5409.75 	 31.03%
      numa02.sh      Real:       60.64       61.63       61.19        0.40 	 0.637%
      numa02.sh       Sys:       14.72       25.68       19.06        4.03 	 11.22%
      numa02.sh      User:     5210.95     5266.69     5233.30       20.82 	 0.608%
      numa03.sh      Real:      746.51      808.24      780.36       23.88 	 1.972%
      numa03.sh       Sys:       97.26      108.48      105.07        4.28 	 4.197%
      numa03.sh      User:    58956.30    61397.05    60162.95     1050.82 	 2.661%
      numa04.sh      Real:      465.97      519.27      484.81       19.62 	 9.385%
      numa04.sh       Sys:      304.43      359.08      334.68       20.64 	 -14.6%
      numa04.sh      User:    37544.16    41186.15    39262.44     1314.91 	 2.478%
      numa05.sh      Real:      411.57      457.20      433.29       16.58 	 0.380%
      numa05.sh       Sys:      230.05      435.48      339.95       67.58 	 -17.9%
      numa05.sh      User:    33325.54    36896.31    35637.84     1222.64 	 -2.12%
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-4-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5f95ba7a
    • sched/numa: Evaluate move once per node · 305c1fac
      Srikar Dronamraju committed
      task_numa_compare() helps choose the best CPU to move or swap the
      selected task. To achieve this task_numa_compare() is called for every
      CPU in the node. Currently it evaluates if the task can be moved/swapped
      for each of the CPUs. However the move evaluation is mostly independent
      of the CPU. Evaluating the move logic once per node provides scope for
      simplifying task_numa_compare().
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25705.2     25058.2     -2.51
      1     74433       72950       -1.99
      
      Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      8     96589.6     105930      9.670
      1     181830      178624      -1.76
      
      (numbers from v1 based on v4.17-rc5)
      Testcase       Time:         Min         Max         Avg      StdDev
      numa01.sh      Real:      440.65      941.32      758.98      189.17
      numa01.sh       Sys:      183.48      320.07      258.42       50.09
      numa01.sh      User:    37384.65    71818.14    60302.51    13798.96
      numa02.sh      Real:       61.24       65.35       62.49        1.49
      numa02.sh       Sys:       16.83       24.18       21.40        2.60
      numa02.sh      User:     5219.59     5356.34     5264.03       49.07
      numa03.sh      Real:      822.04      912.40      873.55       37.35
      numa03.sh       Sys:      118.80      140.94      132.90        7.60
      numa03.sh      User:    62485.19    70025.01    67208.33     2967.10
      numa04.sh      Real:      690.66      872.12      778.49       65.44
      numa04.sh       Sys:      459.26      563.03      494.03       42.39
      numa04.sh      User:    51116.44    70527.20    58849.44     8461.28
      numa05.sh      Real:      418.37      562.28      525.77       54.27
      numa05.sh       Sys:      299.45      481.00      392.49       64.27
      numa05.sh      User:    34115.09    41324.02    39105.30     2627.68
      
      Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
      numa01.sh      Real:      516.14      892.41      739.84      151.32 	 2.587%
      numa01.sh       Sys:      153.16      192.99      177.70       14.58 	 45.42%
      numa01.sh      User:    39821.04    69528.92    57193.87    10989.48 	 5.435%
      numa02.sh      Real:       60.91       62.35       61.58        0.63 	 1.477%
      numa02.sh       Sys:       16.47       26.16       21.20        3.85 	 0.943%
      numa02.sh      User:     5227.58     5309.61     5265.17       31.04 	 -0.02%
      numa03.sh      Real:      739.07      917.73      795.75       64.45 	 9.776%
      numa03.sh       Sys:       94.46      136.08      109.48       14.58 	 21.39%
      numa03.sh      User:    57478.56    72014.09    61764.48     5343.69 	 8.813%
      numa04.sh      Real:      442.61      715.43      530.31       96.12 	 46.79%
      numa04.sh       Sys:      224.90      348.63      285.61       48.83 	 72.97%
      numa04.sh      User:    35836.84    47522.47    40235.41     3985.26 	 46.26%
      numa05.sh      Real:      386.13      489.17      434.94       43.59 	 20.88%
      numa05.sh       Sys:      144.29      438.56      278.80      105.78 	 40.77%
      numa05.sh      User:    33255.86    36890.82    34879.31     1641.98 	 12.11%
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-3-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      305c1fac
    • sched/debug: Show the sum wait time of a task group · 3d6c50c2
      Yun Wang committed
      Although we can rely on cpuacct to present the CPU usage of task
      groups, it is hard to tell how intense the competition is between
      these groups on CPU resources.
      
      Monitoring the wait time or sched_debug of each process could be
      very expensive, and there is no good way to accurately represent the
      conflict with that information, so we need the wait time at the group level.
      
      Thus we introduce group's wait_sum to represent the resource conflict
      between task groups, which is simply the sum of the wait time of
      the group's cfs_rq.
      
      The 'cpu.stat' is modified to show the statistic, like:
      
         nr_periods 0
         nr_throttled 0
         throttled_time 0
         wait_sum 2035098795584
      
      Now we can monitor the changes of wait_sum to tell how much a
      task group is suffering in the fight for CPU resources.
      
      For example:
      
         (wait_sum - last_wait_sum) * 100 / (nr_cpu * period_ns) == X%
      
      means the task group spent X percent of the period waiting
      for the CPU.
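
      The arithmetic can be exercised with a small userspace example (the sample
      values and the 1-second sampling period below are assumptions chosen for
      illustration):

          #include <stdio.h>

          int main(void)
          {
                  unsigned long long last_wait_sum = 2035098795584ULL; /* previous sample, ns */
                  unsigned long long wait_sum      = 2035598795584ULL; /* current sample, ns */
                  unsigned long long period_ns     = 1000000000ULL;    /* 1 s sampling period */
                  unsigned int nr_cpu              = 4;

                  double pct = (double)(wait_sum - last_wait_sum) * 100.0 /
                               ((double)nr_cpu * (double)period_ns);

                  printf("group waited for a CPU %.2f%% of the period\n", pct);
                  return 0;
          }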
      Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/ff7dae3b-e5f9-7157-1caa-ff02c6b23dc1@linux.alibaba.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3d6c50c2
    • sched/fair: Remove #ifdefs from scale_rt_capacity() · 2e62c474
      Vincent Guittot committed
      Reuse cpu_util_irq() that has been defined for schedutil and set irq util
      to 0 when !CONFIG_IRQ_TIME_ACCOUNTING.
      
      But the compiler is not able to optimize the sequence (at least with
      aarch64 GCC 7.2.1):
      
      	free *= (max - irq);
      	free /= max;
      
      when irq is fixed to 0
      
      Add a new inline function scale_irq_capacity() that will scale utilization
      when irq is accounted. Reuse this function in schedutil, which applies a
      similar formula.
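
      A hedged sketch of what such a helper boils down to (plain C; in the kernel
      it would be conditional on IRQ time accounting) is:

          /* Scale 'util' by the fraction of 'max' capacity not consumed by IRQ time. */
          static inline unsigned long scale_irq_capacity(unsigned long util,
                                                         unsigned long irq,
                                                         unsigned long max)
          {
                  util *= (max - irq);
                  util /= max;
                  return util;
          }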
      Suggested-by: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: rjw@rjwysocki.net
      Link: http://lkml.kernel.org/r/1532001606-6689-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2e62c474
    • sched/rt: Restore rt_runtime after disabling RT_RUNTIME_SHARE · f3d133ee
      Hailong Liu committed
      The NO_RT_RUNTIME_SHARE feature is used to prevent a CPU from borrowing
      enough runtime for a spinning RT task.
      
      However, if the RT_RUNTIME_SHARE feature was enabled and the rt_rq had
      already borrowed enough rt_runtime at the beginning, rt_runtime can't be
      restored to its initial bandwidth after we disable RT_RUNTIME_SHARE.
      
      E.g. on my PC with 4 cores, procedure to reproduce:
      1) Make sure  RT_RUNTIME_SHARE is enabled
       cat /sys/kernel/debug/sched_features
        GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY
        CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_DOUBLE_TICK
        LB_BIAS NONTASK_CAPACITY TTWU_QUEUE NO_SIS_AVG_CPU SIS_PROP
        NO_WARN_DOUBLE_CLOCK RT_PUSH_IPI RT_RUNTIME_SHARE NO_LB_MIN
        ATTACH_AGE_LOAD WA_IDLE WA_WEIGHT WA_BIAS
      2) Start a spin-rt-task
       ./loop_rr &
      3) set affinity to the last cpu
       taskset -p 8 $pid_of_loop_rr
      4) Observe that the last CPU has borrowed enough runtime.
       cat /proc/sched_debug | grep rt_runtime
        .rt_runtime                    : 950.000000
        .rt_runtime                    : 900.000000
        .rt_runtime                    : 950.000000
        .rt_runtime                    : 1000.000000
      5) Disable RT_RUNTIME_SHARE
       echo NO_RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features
      6) Observe that rt_runtime has not been restored
       cat /proc/sched_debug | grep rt_runtime
        .rt_runtime                    : 950.000000
        .rt_runtime                    : 900.000000
        .rt_runtime                    : 950.000000
        .rt_runtime                    : 1000.000000
      
      This patch helps to restore rt_runtime after we disable
      RT_RUNTIME_SHARE.
      Signed-off-by: Hailong Liu <liu.hailong6@zte.com.cn>
      Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: zhong.weidong@zte.com.cn
      Link: http://lkml.kernel.org/r/1531874815-39357-1-git-send-email-liu.hailong6@zte.com.cn
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f3d133ee
    • sched/deadline: Update rq_clock of later_rq when pushing a task · 840d7196
      Daniel Bristot de Oliveira committed
      Daniel Casini got this warning while running a DL task here at RetisLab:
      
        [  461.137582] ------------[ cut here ]------------
        [  461.137583] rq->clock_update_flags < RQCF_ACT_SKIP
        [  461.137599] WARNING: CPU: 4 PID: 2354 at kernel/sched/sched.h:967 assert_clock_updated.isra.32.part.33+0x17/0x20
            [a ton of modules]
        [  461.137646] CPU: 4 PID: 2354 Comm: label_image Not tainted 4.18.0-rc4+ #3
        [  461.137647] Hardware name: ASUS All Series/Z87-K, BIOS 0801 09/02/2013
        [  461.137649] RIP: 0010:assert_clock_updated.isra.32.part.33+0x17/0x20
        [  461.137649] Code: ff 48 89 83 08 09 00 00 eb c6 66 0f 1f 84 00 00 00 00 00 55 48 c7 c7 98 7a 6c a5 c6 05 bc 0d 54 01 01 48 89 e5 e8 a9 84 fb ff <0f> 0b 5d c3 0f 1f 44 00 00 0f 1f 44 00 00 83 7e 60 01 74 0a 48 3b
        [  461.137673] RSP: 0018:ffffa77e08cafc68 EFLAGS: 00010082
        [  461.137674] RAX: 0000000000000000 RBX: ffff8b3fc1702d80 RCX: 0000000000000006
        [  461.137674] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff8b3fded164b0
        [  461.137675] RBP: ffffa77e08cafc68 R08: 0000000000000026 R09: 0000000000000339
        [  461.137676] R10: ffff8b3fd060d410 R11: 0000000000000026 R12: ffffffffa4e14e20
        [  461.137677] R13: ffff8b3fdec22940 R14: ffff8b3fc1702da0 R15: ffff8b3fdec22940
        [  461.137678] FS:  00007efe43ee5700(0000) GS:ffff8b3fded00000(0000) knlGS:0000000000000000
        [  461.137679] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [  461.137680] CR2: 00007efe30000010 CR3: 0000000301744003 CR4: 00000000001606e0
        [  461.137680] Call Trace:
        [  461.137684]  push_dl_task.part.46+0x3bc/0x460
        [  461.137686]  task_woken_dl+0x60/0x80
        [  461.137689]  ttwu_do_wakeup+0x4f/0x150
        [  461.137690]  ttwu_do_activate+0x77/0x80
        [  461.137692]  try_to_wake_up+0x1d6/0x4c0
        [  461.137693]  wake_up_q+0x32/0x70
        [  461.137696]  do_futex+0x7e7/0xb50
        [  461.137698]  __x64_sys_futex+0x8b/0x180
        [  461.137701]  do_syscall_64+0x5a/0x110
        [  461.137703]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [  461.137705] RIP: 0033:0x7efe4918ca26
        [  461.137705] Code: 00 00 00 74 17 49 8b 48 20 44 8b 59 10 41 83 e3 30 41 83 fb 20 74 1e be 85 00 00 00 41 ba 01 00 00 00 41 b9 01 00 00 04 0f 05 <48> 3d 01 f0 ff ff 73 1f 31 c0 c3 be 8c 00 00 00 49 89 c8 4d 31 d2
        [  461.137738] RSP: 002b:00007efe43ee4928 EFLAGS: 00000283 ORIG_RAX: 00000000000000ca
        [  461.137739] RAX: ffffffffffffffda RBX: 0000000005094df0 RCX: 00007efe4918ca26
        [  461.137740] RDX: 0000000000000001 RSI: 0000000000000085 RDI: 0000000005094e24
        [  461.137741] RBP: 00007efe43ee49c0 R08: 0000000005094e20 R09: 0000000004000001
        [  461.137741] R10: 0000000000000001 R11: 0000000000000283 R12: 0000000000000000
        [  461.137742] R13: 0000000005094df8 R14: 0000000000000001 R15: 0000000000448a10
        [  461.137743] ---[ end trace 187df4cad2bf7649 ]---
      
      This warning happened in the push_dl_task(), because
      __add_running_bw()->cpufreq_update_util() is getting the rq_clock of
      the later_rq before its update, which takes place at activate_task().
      The fix then is to update the rq_clock before calling add_running_bw().
      
      To avoid a double rq_clock_update() call, we pass the ENQUEUE_NOCLOCK flag
      to activate_task().
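
      Sketched as a kernel-style fragment (this reflects the description above;
      it is not quoted from the patch):

          /* Refresh the clock of the rq we are pushing to before the
           * bandwidth/cpufreq hooks read it, and tell activate_task()
           * not to update it a second time. */
          update_rq_clock(later_rq);
          activate_task(later_rq, next_task, ENQUEUE_NOCLOCK);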
      Reported-by: Daniel Casini <daniel.casini@santannapisa.it>
      Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Juri Lelli <juri.lelli@redhat.com>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@santannapisa.it>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tommaso Cucinotta <tommaso.cucinotta@santannapisa.it>
      Fixes: e0367b12 ("sched/deadline: Move CPU frequency selection triggering points")
      Link: http://lkml.kernel.org/r/ca31d073a4788acf0684a8b255f14fea775ccf20.1532077269.git.bristot@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      840d7196
    • stop_machine: Disable preemption after queueing stopper threads · 2610e889
      Isaac J. Manjarres committed
      This commit:
      
        9fb8d5dc ("stop_machine, Disable preemption when waking two stopper threads")
      
      does not fully address the race condition that can occur
      as follows:
      
      On one CPU, call it CPU 3, thread 1 invokes
      cpu_stop_queue_two_works(2, 3,...), and the execution is such
      that thread 1 queues the works for migration/2 and migration/3,
      and is preempted after releasing the locks for migration/2 and
      migration/3, but before waking the threads.
      
      Then, on CPU 2, a kworker, call it thread 2, is running,
      and it invokes cpu_stop_queue_two_works(1, 2,...), such that
      thread 2 queues the works for migration/1 and migration/2.
      Meanwhile, on CPU 3, thread 1 resumes execution, and wakes
      migration/2 and migration/3. This means that when CPU 2
      releases the locks for migration/1 and migration/2, but before
      it wakes those threads, it can be preempted by migration/2.
      
      If thread 2 is preempted by migration/2, then migration/2 will
      execute the first work item successfully, since migration/3
      was woken up by CPU 3, but when it goes to execute the second
      work item, it disables preemption, calls multi_cpu_stop(),
      and thus, CPU 2 will wait forever for migration/1, which should
      have been woken up by thread 2. However migration/1 cannot be
      woken up by thread 2, since it is a kworker, so it is affine to
      CPU 2, but CPU 2 is running migration/2 with preemption
      disabled, so thread 2 will never run.
      
      Disable preemption after queueing works for stopper threads
      to ensure that the operation of queueing the works and waking
      the stopper threads is atomic.
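
      Conceptually the fix is the pattern below (a hedged sketch, not the literal
      hunk): preemption stays off from the moment the works are queued until both
      stopper threads have been woken.

          preempt_disable();
          /* ...queue the work items on both stoppers and drop their locks... */
          wake_up_q(&wakeq);
          preempt_enable();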
      Co-Developed-by: Prasad Sodagudi <psodagud@codeaurora.org>
      Co-Developed-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
      Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org>
      Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
      Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bigeasy@linutronix.de
      Cc: gregkh@linuxfoundation.org
      Cc: matt@codeblueprint.co.uk
      Fixes: 9fb8d5dc ("stop_machine, Disable preemption when waking two stopper threads")
      Link: http://lkml.kernel.org/r/1531856129-9871-1-git-send-email-isaacm@codeaurora.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2610e889
    • sched/topology: Check variable group before dereferencing it · 6cd0c583
      Yi Wang committed
      The 'group' variable in sched_domain_debug_one() is not checked
      when first used in cpumask_test_cpu(cpu, sched_group_span(group)),
      but it might be NULL (it is only checked later in the following while loop)
      and may cause a NULL pointer dereference.
      
      We need to check it before using it, to avoid the NULL dereference.
      Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: zhong.weidong@zte.com.cn
      Link: http://lkml.kernel.org/r/1532319547-33335-1-git-send-email-wang.yi59@zte.com.cn
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6cd0c583
  2. 22 Jul 2018, 3 commits
    • mm: make vm_area_alloc() initialize core fields · 490fc053
      Linus Torvalds committed
      Like vm_area_dup(), it initializes the anon_vma_chain head, and the
      basic mm pointer.
      
      The rest of the fields end up being different for different users,
      although the plan is to also initialize the 'vm_ops' field to a dummy
      entry.
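
      A hedged sketch of the shape of such a helper (this mirrors the description
      above; details may differ from the actual commit):

          struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
          {
                  struct vm_area_struct *vma;

                  vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
                  if (vma) {
                          vma->vm_mm = mm;                        /* the basic mm pointer */
                          INIT_LIST_HEAD(&vma->anon_vma_chain);   /* the anon_vma_chain head */
                  }
                  return vma;
          }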
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      490fc053
    • mm: make vm_area_dup() actually copy the old vma data · 95faf699
      Linus Torvalds committed
      .. and re-initialize the anon_vma_chain head.
      
      This removes some boiler-plate from the users, and also makes it clear
      why it didn't need to use the 'zalloc()' version.
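
      As a hedged sketch (not necessarily the exact upstream code), the helper
      now amounts to:

          struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
          {
                  struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);

                  if (new) {
                          *new = *orig;                           /* copy the old vma wholesale */
                          INIT_LIST_HEAD(&new->anon_vma_chain);   /* then reset the list head */
                  }
                  return new;
          }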
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      95faf699
    • mm: use helper functions for allocating and freeing vm_area structs · 3928d4f5
      Linus Torvalds committed
      The vm_area_struct is one of the most fundamental memory management
      objects, but the management of it is entirely open-coded everywhere,
      ranging from allocation and freeing (using kmem_cache_[z]alloc and
      kmem_cache_free) to initializing all the fields.
      
      We want to unify this in order to end up having some unified
      initialization of the vmas, and the first step to this is to at least
      have basic allocation functions.
      
      Right now those functions are literally just wrappers around the
      kmem_cache_*() calls.  This is a purely mechanical conversion:
      
          # new vma:
          kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL) -> vm_area_alloc()
      
          # copy old vma
          kmem_cache_alloc(vm_area_cachep, GFP_KERNEL) -> vm_area_dup(old)
      
          # free vma
          kmem_cache_free(vm_area_cachep, vma) -> vm_area_free(vma)
      
      to the point where the old vma passed in to the vm_area_dup() function
      isn't even used yet (because I've left all the old manual initialization
      alone).
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3928d4f5
  3. 18 Jul 2018, 1 commit
    • Mark HI and TASKLET softirq synchronous · 3c53776e
      Linus Torvalds committed
      Way back in 4.9, we committed 4cd13c21 ("softirq: Let ksoftirqd do
      its job"), and ever since we've had small nagging issues with it.  For
      example, we've had:
      
        1ff68820 ("watchdog: core: make sure the watchdog_worker is not deferred")
        8d5755b3 ("watchdog: softdog: fire watchdog even if softirqs do not get to run")
        217f6974 ("net: busy-poll: allow preemption in sk_busy_loop()")
      
      all of which worked around some of the effects of that commit.
      
      The DVB people have also complained that the commit causes excessive USB
      URB latencies, which seems to be due to the USB code using tasklets to
      schedule USB traffic.  This seems to be an issue mainly when already
      living on the edge, but waiting for ksoftirqd to handle it really does
      seem to cause excessive latencies.
      
      Now Hanna Hawa reports that this issue isn't just limited to USB URB and
      DVB, but also causes timeout problems for the Marvell SoC team:
      
       "I'm facing kernel panic issue while running raid 5 on sata disks
        connected to Macchiatobin (Marvell community board with Armada-8040
        SoC with 4 ARMv8 cores of CA72) Raid 5 built with Marvell DMA engine
        and async_tx mechanism (ASYNC_TX_DMA [=y]); the DMA driver (mv_xor_v2)
        uses a tasklet to clean the done descriptors from the queue"
      
      The latency problem causes a panic:
      
        mv_xor_v2 f0400000.xor: dma_sync_wait: timeout!
        Kernel panic - not syncing: async_tx_quiesce: DMA error waiting for transaction
      
      We've discussed simply just reverting the original commit entirely, and
      also much more involved solutions (with per-softirq threads etc).  This
      patch is intentionally stupid and fairly limited, because the issue
      still remains, and the other solutions either got sidetracked or had
      other issues.
      
      We should probably also consider the timer softirqs to be synchronous
      and not be delayed to ksoftirqd (since they were the issue with the
      earlier watchdog problems), but that should be done as a separate patch.
      This does only the tasklet cases.
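
      One way to express "these softirqs are synchronous" can be sketched as below
      (identifier names are assumptions, not quoted from the patch): raising HI or
      TASKLET no longer defers the work to ksoftirqd even when it is running.

          #define SOFTIRQ_NOW_MASK ((1 << HI_SOFTIRQ) | (1 << TASKLET_SOFTIRQ))

          static bool defer_to_ksoftirqd(unsigned long pending, bool ksoftirqd_running)
          {
                  if (pending & SOFTIRQ_NOW_MASK)     /* handle HI/TASKLET immediately */
                          return false;
                  return ksoftirqd_running;
          }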
      Reported-and-tested-by: Hanna Hawa <hannah@marvell.com>
      Reported-and-tested-by: Josef Griebichler <griebichler.josef@gmx.at>
      Reported-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3c53776e
  4. 16 Jul 2018, 16 commits
  5. 15 Jul 2018, 1 commit
    • stop_machine: Disable preemption when waking two stopper threads · 9fb8d5dc
      Isaac J. Manjarres committed
      When cpu_stop_queue_two_works() begins to wake the stopper threads, it does
      so without preemption disabled, which leads to the following race
      condition:
      
      The source CPU calls cpu_stop_queue_two_works(), with cpu1 as the source
      CPU, and cpu2 as the destination CPU. When adding the stopper threads to
      the wake queue used in this function, the source CPU stopper thread is
      added first, and the destination CPU stopper thread is added last.
      
      When wake_up_q() is invoked to wake the stopper threads, the threads are
      woken up in the order that they are queued in, so the source CPU's stopper
      thread is woken up first, and it preempts the thread running on the source
      CPU.
      
      The stopper thread will then execute on the source CPU, disable preemption,
      and begin executing multi_cpu_stop(), and wait for an ack from the
      destination CPU's stopper thread, with preemption still disabled. Since the
      worker thread that woke up the stopper thread on the source CPU is affine
      to the source CPU, and preemption is disabled on the source CPU, that
      thread will never run to dequeue the destination CPU's stopper thread from
      the wake queue, and thus, the destination CPU's stopper thread will never
      run, causing the source CPU's stopper thread to wait forever, and stall.
      
      Disable preemption when waking the stopper threads in
      cpu_stop_queue_two_works().
      
      Fixes: 0b26351b ("stop_machine, sched: Fix migrate_swap() vs. active_balance() deadlock")
      Co-Developed-by: Prasad Sodagudi <psodagud@codeaurora.org>
      Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
      Co-Developed-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
      Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
      Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: peterz@infradead.org
      Cc: matt@codeblueprint.co.uk
      Cc: bigeasy@linutronix.de
      Cc: gregkh@linuxfoundation.org
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/1530655334-4601-1-git-send-email-isaacm@codeaurora.org
      9fb8d5dc
  6. 13 Jul 2018, 2 commits
    • tracing: Reorder display of TGID to be after PID · f8494fa3
      Joel Fernandes (Google) committed
      Currently ftrace displays data in trace output like so:
      
                                             _-----=> irqs-off
                                            / _----=> need-resched
                                           | / _---=> hardirq/softirq
                                           || / _--=> preempt-depth
                                           ||| /     delay
                  TASK-PID   CPU    TGID   ||||    TIMESTAMP  FUNCTION
                     | |       |      |    ||||       |         |
                  bash-1091  [000] ( 1091) d..2    28.313544: sched_switch:
      
      However, Android's trace visualization tools expect a slightly different
      format due to an out-of-tree patch that has been carried for a decade;
      notice that the TGID and CPU fields are reversed:
      
                                             _-----=> irqs-off
                                            / _----=> need-resched
                                           | / _---=> hardirq/softirq
                                           || / _--=> preempt-depth
                                           ||| /     delay
                  TASK-PID    TGID   CPU   ||||    TIMESTAMP  FUNCTION
                     | |        |      |   ||||       |         |
                  bash-1091  ( 1091) [002] d..2    64.965177: sched_switch:
      
      From kernel v4.13 onwards, in which TGID support was introduced, tracing
      with systrace on all Android kernels will break (most Android kernels
      have been on 4.9 with Android patches, so this issue hasn't been seen
      yet).
      
      The chrome browser's tracing tools also embed the systrace viewer which
      uses the legacy TGID format and updates to that are known to be
      difficult to make.
      
      Considering this, I suggest we make this change to the upstream kernel
      and backport it to all Android kernels. I believe this feature is merged
      recently enough into the upstream kernel that it shouldn't be a problem.
      Also logically, IMO it makes more sense to group the TGID with the
      TASK-PID and the CPU after these.
      
      Link: http://lkml.kernel.org/r/20180626000822.113931-1-joel@joelfernandes.org
      
      Cc: jreck@google.com
      Cc: tkjos@google.com
      Cc: stable@vger.kernel.org
      Fixes: 441dae8f ("tracing: Add support for display of tgid in trace output")
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      f8494fa3
    • bpf: don't leave partial mangled prog in jit_subprogs error path · c7a89784
      Daniel Borkmann committed
      syzkaller managed to trigger the following bug through fault injection:
      
        [...]
        [  141.043668] verifier bug. No program starts at insn 3
        [  141.044648] WARNING: CPU: 3 PID: 4072 at kernel/bpf/verifier.c:1613
                       get_callee_stack_depth kernel/bpf/verifier.c:1612 [inline]
        [  141.044648] WARNING: CPU: 3 PID: 4072 at kernel/bpf/verifier.c:1613
                       fixup_call_args kernel/bpf/verifier.c:5587 [inline]
        [  141.044648] WARNING: CPU: 3 PID: 4072 at kernel/bpf/verifier.c:1613
                       bpf_check+0x525e/0x5e60 kernel/bpf/verifier.c:5952
        [  141.047355] CPU: 3 PID: 4072 Comm: a.out Not tainted 4.18.0-rc4+ #51
        [  141.048446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),BIOS 1.10.2-1 04/01/2014
        [  141.049877] Call Trace:
        [  141.050324]  __dump_stack lib/dump_stack.c:77 [inline]
        [  141.050324]  dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
        [  141.050950]  ? dump_stack_print_info.cold.2+0x52/0x52 lib/dump_stack.c:60
        [  141.051837]  panic+0x238/0x4e7 kernel/panic.c:184
        [  141.052386]  ? add_taint.cold.5+0x16/0x16 kernel/panic.c:385
        [  141.053101]  ? __warn.cold.8+0x148/0x1ba kernel/panic.c:537
        [  141.053814]  ? __warn.cold.8+0x117/0x1ba kernel/panic.c:530
        [  141.054506]  ? get_callee_stack_depth kernel/bpf/verifier.c:1612 [inline]
        [  141.054506]  ? fixup_call_args kernel/bpf/verifier.c:5587 [inline]
        [  141.054506]  ? bpf_check+0x525e/0x5e60 kernel/bpf/verifier.c:5952
        [  141.055163]  __warn.cold.8+0x163/0x1ba kernel/panic.c:538
        [  141.055820]  ? get_callee_stack_depth kernel/bpf/verifier.c:1612 [inline]
        [  141.055820]  ? fixup_call_args kernel/bpf/verifier.c:5587 [inline]
        [  141.055820]  ? bpf_check+0x525e/0x5e60 kernel/bpf/verifier.c:5952
        [...]
      
      What happens in jit_subprogs() is that the kcalloc() for the subprog func
      buffer fails and returns NULL, at which point we bail out. The latter is a
      plain return -ENOMEM, and this is definitely not okay since earlier in the
      loop we walk all subprogs and temporarily rewrite insn->off to remember the
      subprog id, as well as insn->imm to temporarily point the call to
      __bpf_call_base + 1 for the initial JIT pass. Thus, bailing out in such a
      state and handing this over to the interpreter is troublesome, since
      later/subsequent find_subprog() lookups are based on the wrong
      insn->imm.
      
      Therefore, once we hit this point, we need to jump to out_free path
      where we undo all changes from earlier loop, so that interpreter can
      work on unmodified insn->{off,imm}.
      
      Another point is that should find_subprog() fail in jit_subprogs() due
      to a verifier bug, then we also should not simply defer the program to
      the interpreter since also here we did partial modifications. Instead
      we should just bail out entirely and return an error to the user who is
      trying to load the program.
      
      Fixes: 1c2a088a ("bpf: x64: add JIT support for multi-function programs")
      Reported-by: syzbot+7d427828b2ea6e592804@syzkaller.appspotmail.com
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c7a89784
  7. 12 Jul 2018, 2 commits
  8. 11 Jul 2018, 1 commit
    • rseq: uapi: Declare rseq_cs field as union, update includes · ec9c82e0
      Mathieu Desnoyers committed
      Declaring the rseq_cs field as a union between __u64 and two __u32
      allows both 32-bit and 64-bit kernels to read the full __u64, and
      therefore validate that a 32-bit user-space cleared the upper 32
      bits, thus ensuring a consistent behavior between native 32-bit
      kernels and 32-bit compat tasks on 64-bit kernels.
      
      Check that the rseq_cs value read is < TASK_SIZE.
      
      The asm/byteorder.h header needs to be included by rseq.h, now
      that it is not using linux/types_32_64.h anymore.
      
      Considering that only __u32 and __u64 types are declared in linux/rseq.h,
      the linux/types.h header should always be included for both kernel and
      user-space code: including stdint.h is just for u64 and u32, which are
      not used in this header at all.
      
      Use copy_from_user()/clear_user() to interact with a 64-bit field,
      because arm32 does not implement 64-bit __get_user(), and ppc32 does not
      implement 64-bit get_user(). Considering that the rseq_cs pointer does not need to
      be loaded/stored with single-copy atomicity from the kernel anymore, we
      can simply use copy_from_user()/clear_user().
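
      The declared layout can be pictured roughly as follows (a hedged sketch;
      the field names and the endianness handling in the real header may differ):

          union {
                  __u64 ptr64;
                  struct {
                          __u32 ptr32;    /* written by 32-bit user-space */
                          __u32 padding;  /* must be cleared to zero */
                  } ptr;
          } rseq_cs;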
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-api@vger.kernel.org
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Link: https://lkml.kernel.org/r/20180709195155.7654-5-mathieu.desnoyers@efficios.com
      ec9c82e0