1. 25 7月, 2018 20 次提交
    • S
      sched/numa: Move task_numa_placement() closer to numa_migrate_preferred() · b6a60cf3
      Srikar Dronamraju 提交于
      numa_migrate_preferred() is called periodically or when task preferred
      node changes. Preferred node evaluations happen once per scan sequence.
      
      If the scan completion happens just after the periodic NUMA migration,
      then we try to migrate to the preferred node and the preferred node might
      change, needing another node migration.
      
      Avoid this by checking for scan sequence completion only when checking
      for periodic migration.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25862.6     26158.1     1.14258
      1     74357       72725       -2.19482
      
      Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      8     117019      113992      -2.58
      1     179095      174947      -2.31
      
      (numbers from v1 based on v4.17-rc5)
      Testcase       Time:         Min         Max         Avg      StdDev
      numa01.sh      Real:      449.46      770.77      615.22      101.70
      numa01.sh       Sys:      132.72      208.17      170.46       24.96
      numa01.sh      User:    39185.26    60290.89    50066.76     6807.84
      numa02.sh      Real:       60.85       61.79       61.28        0.37
      numa02.sh       Sys:       15.34       24.71       21.08        3.61
      numa02.sh      User:     5204.41     5249.85     5231.21       17.60
      numa03.sh      Real:      785.50      916.97      840.77       44.98
      numa03.sh       Sys:      108.08      133.60      119.43        8.82
      numa03.sh      User:    61422.86    70919.75    64720.87     3310.61
      numa04.sh      Real:      429.57      587.37      480.80       57.40
      numa04.sh       Sys:      240.61      321.97      290.84       33.58
      numa04.sh      User:    34597.65    40498.99    37079.48     2060.72
      numa05.sh      Real:      392.09      431.25      414.65       13.82
      numa05.sh       Sys:      229.41      372.48      297.54       53.14
      numa05.sh      User:    33390.86    34697.49    34222.43      556.42
      
      Testcase       Time:         Min         Max         Avg      StdDev 	%Change
      numa01.sh      Real:      424.63      566.18      498.12       59.26 	 23.50%
      numa01.sh       Sys:      160.19      256.53      208.98       37.02 	 -18.4%
      numa01.sh      User:    37320.00    46225.58    42001.57     3482.45 	 19.20%
      numa02.sh      Real:       60.17       62.47       60.91        0.85 	 0.607%
      numa02.sh       Sys:       15.30       22.82       17.04        2.90 	 23.70%
      numa02.sh      User:     5202.13     5255.51     5219.08       20.14 	 0.232%
      numa03.sh      Real:      823.91      844.89      833.86        8.46 	 0.828%
      numa03.sh       Sys:      130.69      148.29      140.47        6.21 	 -14.9%
      numa03.sh      User:    62519.15    64262.20    63613.38      620.05 	 1.740%
      numa04.sh      Real:      515.30      603.74      548.56       30.93 	 -12.3%
      numa04.sh       Sys:      459.73      525.48      489.18       21.63 	 -40.5%
      numa04.sh      User:    40561.96    44919.18    42047.87     1526.85 	 -11.8%
      numa05.sh      Real:      396.58      454.37      421.13       19.71 	 -1.53%
      numa05.sh       Sys:      208.72      422.02      348.90       73.60 	 -14.7%
      numa05.sh      User:    33124.08    36109.35    34846.47     1089.74 	 -1.79%
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-20-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      b6a60cf3
    • S
      sched/numa: Use group_weights to identify if migration degrades locality · f35678b6
      Srikar Dronamraju 提交于
      On NUMA_BACKPLANE and NUMA_GLUELESS_MESH systems, tasks/memory should be
      consolidated to the closest group of nodes. In such a case, relying on
      group_fault metric may not always help to consolidate. There can always
      be a case where a node closer to the preferred node may have lesser
      faults than a node further away from the preferred node. In such a case,
      moving to node with more faults might avoid numa consolidation.
      
      Using group_weight would help to consolidate task/memory around the
      preferred_node.
      
      While here, to be on the conservative side, don't override migrate thread
      degrades locality logic for CPU_NEWLY_IDLE load balancing.
      
      Note: Similar problems exist with should_numa_migrate_memory and will be
      dealt separately.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25645.4     25960       1.22
      1     72142       73550       1.95
      
      Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      8     110199      120071      8.958
      1     176303      176249      -0.03
      
      (numbers from v1 based on v4.17-rc5)
      Testcase       Time:         Min         Max         Avg      StdDev
      numa01.sh      Real:      490.04      774.86      596.26       96.46
      numa01.sh       Sys:      151.52      242.88      184.82       31.71
      numa01.sh      User:    41418.41    60844.59    48776.09     6564.27
      numa02.sh      Real:       60.14       62.94       60.98        1.00
      numa02.sh       Sys:       16.11       30.77       21.20        5.28
      numa02.sh      User:     5184.33     5311.09     5228.50       44.24
      numa03.sh      Real:      790.95      856.35      826.41       24.11
      numa03.sh       Sys:      114.93      118.85      117.05        1.63
      numa03.sh      User:    60990.99    64959.28    63470.43     1415.44
      numa04.sh      Real:      434.37      597.92      504.87       59.70
      numa04.sh       Sys:      237.63      397.40      289.74       55.98
      numa04.sh      User:    34854.87    41121.83    38572.52     2615.84
      numa05.sh      Real:      386.77      448.90      417.22       22.79
      numa05.sh       Sys:      149.23      379.95      303.04       79.55
      numa05.sh      User:    32951.76    35959.58    34562.18     1034.05
      
      Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
      numa01.sh      Real:      493.19      672.88      597.51       59.38 	 -0.20%
      numa01.sh       Sys:      150.09      245.48      207.76       34.26 	 -11.0%
      numa01.sh      User:    41928.51    53779.17    48747.06     3901.39 	 0.059%
      numa02.sh      Real:       60.63       62.87       61.22        0.83 	 -0.39%
      numa02.sh       Sys:       16.64       27.97       20.25        4.06 	 4.691%
      numa02.sh      User:     5222.92     5309.60     5254.03       29.98 	 -0.48%
      numa03.sh      Real:      821.52      902.15      863.60       32.41 	 -4.30%
      numa03.sh       Sys:      112.04      130.66      118.35        7.08 	 -1.09%
      numa03.sh      User:    62245.16    69165.14    66443.04     2450.32 	 -4.47%
      numa04.sh      Real:      414.53      519.57      476.25       37.00 	 6.009%
      numa04.sh       Sys:      181.84      335.67      280.41       54.07 	 3.327%
      numa04.sh      User:    33924.50    39115.39    37343.78     1934.26 	 3.290%
      numa05.sh      Real:      408.30      441.45      417.90       12.05 	 -0.16%
      numa05.sh       Sys:      233.41      381.60      295.58       57.37 	 2.523%
      numa05.sh      User:    33301.31    35972.50    34335.19      938.94 	 0.661%
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-16-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      f35678b6
    • S
      sched/numa: Update the scan period without holding the numa_group lock · 30619c89
      Srikar Dronamraju 提交于
      The metrics for updating scan periods are local or task specific.
      Currently this update happens under the numa_group lock, which seems
      unnecessary. Hence move this update outside the lock.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25355.9     25645.4     1.141
      1     72812       72142       -0.92
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NRik van Riel <riel@surriel.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-15-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      30619c89
    • S
      sched/numa: Remove numa_has_capacity() · 2d4056fa
      Srikar Dronamraju 提交于
      task_numa_find_cpu() helps to find the CPU to swap/move the task to.
      It's guarded by numa_has_capacity(). However node not having capacity
      shouldn't deter a task swapping if it helps NUMA placement.
      
      Further load_too_imbalanced(), which evaluates possibilities of move/swap,
      provides similar checks as numa_has_capacity.
      
      Hence remove numa_has_capacity() to enhance possibilities of task
      swapping even if load is imbalanced.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25657.9     25804.1     0.569
      1     74435       73413       -1.37
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NRik van Riel <riel@surriel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-13-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      2d4056fa
    • S
      sched/numa: Modify migrate_swap() to accept additional parameters · 0ad4e3df
      Srikar Dronamraju 提交于
      There are checks in migrate_swap_stop() that check if the task/CPU
      combination is as per migrate_swap_arg before migrating.
      
      However atleast one of the two tasks to be swapped by migrate_swap() could
      have migrated to a completely different CPU before updating the
      migrate_swap_arg. The new CPU where the task is currently running could
      be a different node too. If the task has migrated, numa balancer might
      end up placing a task in a wrong node.  Instead of achieving node
      consolidation, it may end up spreading the load across nodes.
      
      To avoid that pass the CPUs as additional parameters.
      
      While here, place migrate_swap under CONFIG_NUMA_BALANCING.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25377.3     25226.6     -0.59
      1     72287       73326       1.437
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NRik van Riel <riel@surriel.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-10-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      0ad4e3df
    • S
      sched/numa: Remove unused task_capacity from 'struct numa_stats' · 10864a9e
      Srikar Dronamraju 提交于
      The task_capacity field in 'struct numa_stats' is redundant.
      Also move nr_running for better packing within the struct.
      
      No functional changes.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25308.6     25377.3     0.271
      1     72964       72287       -0.92
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NRik van Riel <riel@surriel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-9-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      10864a9e
    • S
      sched/numa: Skip nodes that are at 'hoplimit' · 0ee7e74d
      Srikar Dronamraju 提交于
      When comparing two nodes at a distance of 'hoplimit', we should consider
      nodes only up to 'hoplimit'. Currently we also consider nodes at 'oplimit'
      distance too. Hence two nodes at a distance of 'hoplimit' will have same
      groupweight. Fix this by skipping nodes at hoplimit.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25375.3     25308.6     -0.26
      1     72617       72964       0.477
      
      Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      8     113372      108750      -4.07684
      1     177403      183115      3.21979
      
      (numbers from v1 based on v4.17-rc5)
      Testcase       Time:         Min         Max         Avg      StdDev
      numa01.sh      Real:      478.45      565.90      515.11       30.87
      numa01.sh       Sys:      207.79      271.04      232.94       21.33
      numa01.sh      User:    39763.93    47303.12    43210.73     2644.86
      numa02.sh      Real:       60.00       61.46       60.78        0.49
      numa02.sh       Sys:       15.71       25.31       20.69        3.42
      numa02.sh      User:     5175.92     5265.86     5235.97       32.82
      numa03.sh      Real:      776.42      834.85      806.01       23.22
      numa03.sh       Sys:      114.43      128.75      121.65        5.49
      numa03.sh      User:    60773.93    64855.25    62616.91     1576.39
      numa04.sh      Real:      456.93      511.95      482.91       20.88
      numa04.sh       Sys:      178.09      460.89      356.86       94.58
      numa04.sh      User:    36312.09    42553.24    39623.21     2247.96
      numa05.sh      Real:      393.98      493.48      436.61       35.59
      numa05.sh       Sys:      164.49      329.15      265.87       61.78
      numa05.sh      User:    33182.65    36654.53    35074.51     1187.71
      
      Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
      numa01.sh      Real:      414.64      819.20      556.08      147.70 	 -7.36%
      numa01.sh       Sys:       77.52      205.04      139.40       52.05 	 67.10%
      numa01.sh      User:    37043.24    61757.88    45517.48     9290.38 	 -5.06%
      numa02.sh      Real:       60.80       63.32       61.63        0.88 	 -1.37%
      numa02.sh       Sys:       17.35       39.37       25.71        7.33 	 -19.5%
      numa02.sh      User:     5213.79     5374.73     5268.90       55.09 	 -0.62%
      numa03.sh      Real:      780.09      948.64      831.43       63.02 	 -3.05%
      numa03.sh       Sys:      104.96      136.92      116.31       11.34 	 4.591%
      numa03.sh      User:    60465.42    73339.78    64368.03     4700.14 	 -2.72%
      numa04.sh      Real:      412.60      681.92      521.29       96.64 	 -7.36%
      numa04.sh       Sys:      210.32      314.10      251.77       37.71 	 41.74%
      numa04.sh      User:    34026.38    45581.20    38534.49     4198.53 	 2.825%
      numa05.sh      Real:      394.79      439.63      411.35       16.87 	 6.140%
      numa05.sh       Sys:      238.32      330.09      292.31       38.32 	 -9.04%
      numa05.sh      User:    33456.45    34876.07    34138.62      609.45 	 2.741%
      
      While there is a regression with this change, this change is needed from a
      correctness perspective. Also it helps consolidation as seen from perf bench
      output.
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NRik van Riel <riel@surriel.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-8-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      0ee7e74d
    • S
      sched/debug: Reverse the order of printing faults · 67d9f6c2
      Srikar Dronamraju 提交于
      Fix the order in which the private and shared numa faults are getting
      printed.
      
      No functional changes.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25215.7     25375.3     0.63
      1     72107       72617       0.70
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NRik van Riel <riel@surriel.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-7-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      67d9f6c2
    • S
      sched/numa: Use task faults only if numa_group is not yet set up · f03bb676
      Srikar Dronamraju 提交于
      When numa_group faults are available, task_numa_placement only uses
      numa_group faults to evaluate preferred node. However it still accounts
      task faults and even evaluates the preferred node just based on task
      faults just to discard it in favour of preferred node chosen on the
      basis of numa_group.
      
      Instead use task faults only if numa_group is not set.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25549.6     25215.7     -1.30
      1     73190       72107       -1.47
      
      Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      8     113437      113372      -0.05
      1     196130      177403      -9.54
      
      (numbers from v1 based on v4.17-rc5)
      Testcase       Time:         Min         Max         Avg      StdDev
      numa01.sh      Real:      506.35      794.46      599.06      104.26
      numa01.sh       Sys:      150.37      223.56      195.99       24.94
      numa01.sh      User:    43450.69    61752.04    49281.50     6635.33
      numa02.sh      Real:       60.33       62.40       61.31        0.90
      numa02.sh       Sys:       18.12       31.66       24.28        5.89
      numa02.sh      User:     5203.91     5325.32     5260.29       49.98
      numa03.sh      Real:      696.47      853.62      745.80       57.28
      numa03.sh       Sys:       85.68      123.71       97.89       13.48
      numa03.sh      User:    55978.45    66418.63    59254.94     3737.97
      numa04.sh      Real:      444.05      514.83      497.06       26.85
      numa04.sh       Sys:      230.39      375.79      316.23       48.58
      numa04.sh      User:    35403.12    41004.10    39720.80     2163.08
      numa05.sh      Real:      423.09      460.41      439.57       13.92
      numa05.sh       Sys:      287.38      480.15      369.37       68.52
      numa05.sh      User:    34732.12    38016.80    36255.85     1070.51
      
      Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
      numa01.sh      Real:      478.45      565.90      515.11       30.87 	 16.29%
      numa01.sh       Sys:      207.79      271.04      232.94       21.33 	 -15.8%
      numa01.sh      User:    39763.93    47303.12    43210.73     2644.86 	 14.04%
      numa02.sh      Real:       60.00       61.46       60.78        0.49 	 0.871%
      numa02.sh       Sys:       15.71       25.31       20.69        3.42 	 17.35%
      numa02.sh      User:     5175.92     5265.86     5235.97       32.82 	 0.464%
      numa03.sh      Real:      776.42      834.85      806.01       23.22 	 -7.47%
      numa03.sh       Sys:      114.43      128.75      121.65        5.49 	 -19.5%
      numa03.sh      User:    60773.93    64855.25    62616.91     1576.39 	 -5.36%
      numa04.sh      Real:      456.93      511.95      482.91       20.88 	 2.930%
      numa04.sh       Sys:      178.09      460.89      356.86       94.58 	 -11.3%
      numa04.sh      User:    36312.09    42553.24    39623.21     2247.96 	 0.246%
      numa05.sh      Real:      393.98      493.48      436.61       35.59 	 0.677%
      numa05.sh       Sys:      164.49      329.15      265.87       61.78 	 38.92%
      numa05.sh      User:    33182.65    36654.53    35074.51     1187.71 	 3.368%
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-6-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      f03bb676
    • S
      sched/numa: Set preferred_node based on best_cpu · 8cd45eee
      Srikar Dronamraju 提交于
      Currently preferred node is set to dst_nid which is the last node in the
      iteration whose group weight or task weight is greater than the current
      node. However it doesn't guarantee that dst_nid has the numa capacity
      to move. It also doesn't guarantee that dst_nid has the best_cpu which
      is the CPU/node ideal for node migration.
      
      Lets consider faults on a 4 node system with group weight numbers
      in different nodes being in 0 < 1 < 2 < 3 proportion. Consider the task
      is running on 3 and 0 is its preferred node but its capacity is full.
      Consider nodes 1, 2 and 3 have capacity. Then the task should be
      migrated to node 1. Currently the task gets moved to node 2. env.dst_nid
      points to the last node whose faults were greater than current node.
      
      Modify to set the preferred node based of best_cpu. Earlier setting
      preferred node was skipped if nr_active_nodes is 1. This could result in
      the task being moved out of the preferred node to a random node during
      regular load balancing.
      
      Also while modifying task_numa_migrate(), use sched_setnuma to set
      preferred node. This ensures out numa accounting is correct.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25122.9     25549.6     1.698
      1     73850       73190       -0.89
      
      Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      8     105930      113437      7.08676
      1     178624      196130      9.80047
      
      (numbers from v1 based on v4.17-rc5)
      Testcase       Time:         Min         Max         Avg      StdDev
      numa01.sh      Real:      435.78      653.81      534.58       83.20
      numa01.sh       Sys:      121.93      187.18      145.90       23.47
      numa01.sh      User:    37082.81    51402.80    43647.60     5409.75
      numa02.sh      Real:       60.64       61.63       61.19        0.40
      numa02.sh       Sys:       14.72       25.68       19.06        4.03
      numa02.sh      User:     5210.95     5266.69     5233.30       20.82
      numa03.sh      Real:      746.51      808.24      780.36       23.88
      numa03.sh       Sys:       97.26      108.48      105.07        4.28
      numa03.sh      User:    58956.30    61397.05    60162.95     1050.82
      numa04.sh      Real:      465.97      519.27      484.81       19.62
      numa04.sh       Sys:      304.43      359.08      334.68       20.64
      numa04.sh      User:    37544.16    41186.15    39262.44     1314.91
      numa05.sh      Real:      411.57      457.20      433.29       16.58
      numa05.sh       Sys:      230.05      435.48      339.95       67.58
      numa05.sh      User:    33325.54    36896.31    35637.84     1222.64
      
      Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
      numa01.sh      Real:      506.35      794.46      599.06      104.26 	 -10.76%
      numa01.sh       Sys:      150.37      223.56      195.99       24.94 	 -25.55%
      numa01.sh      User:    43450.69    61752.04    49281.50     6635.33 	 -11.43%
      numa02.sh      Real:       60.33       62.40       61.31        0.90 	 -0.195%
      numa02.sh       Sys:       18.12       31.66       24.28        5.89 	 -21.49%
      numa02.sh      User:     5203.91     5325.32     5260.29       49.98 	 -0.513%
      numa03.sh      Real:      696.47      853.62      745.80       57.28 	 4.6339%
      numa03.sh       Sys:       85.68      123.71       97.89       13.48 	 7.3347%
      numa03.sh      User:    55978.45    66418.63    59254.94     3737.97 	 1.5323%
      numa04.sh      Real:      444.05      514.83      497.06       26.85 	 -2.464%
      numa04.sh       Sys:      230.39      375.79      316.23       48.58 	 5.8343%
      numa04.sh      User:    35403.12    41004.10    39720.80     2163.08 	 -1.153%
      numa05.sh      Real:      423.09      460.41      439.57       13.92 	 -1.428%
      numa05.sh       Sys:      287.38      480.15      369.37       68.52 	 -7.964%
      numa05.sh      User:    34732.12    38016.80    36255.85     1070.51 	 -1.704%
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-5-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      8cd45eee
    • S
      sched/numa: Simplify load_too_imbalanced() · 5f95ba7a
      Srikar Dronamraju 提交于
      Currently load_too_imbalance() cares about the slope of imbalance.
      It doesn't care of the direction of the imbalance.
      
      However this may not work if nodes that are being compared have
      dissimilar capacities. Few nodes might have more cores than other nodes
      in the system. Also unlike traditional load balance at a NUMA sched
      domain, multiple requests to migrate from the same source node to same
      destination node may run in parallel. This can cause huge load
      imbalance. This is specially true on a larger machines with either large
      cores per node or more number of nodes in the system. Hence allow
      move/swap only if the imbalance is going to reduce.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25058.2     25122.9     0.25
      1     72950       73850       1.23
      
      (numbers from v1 based on v4.17-rc5)
      Testcase       Time:         Min         Max         Avg      StdDev
      numa01.sh      Real:      516.14      892.41      739.84      151.32
      numa01.sh       Sys:      153.16      192.99      177.70       14.58
      numa01.sh      User:    39821.04    69528.92    57193.87    10989.48
      numa02.sh      Real:       60.91       62.35       61.58        0.63
      numa02.sh       Sys:       16.47       26.16       21.20        3.85
      numa02.sh      User:     5227.58     5309.61     5265.17       31.04
      numa03.sh      Real:      739.07      917.73      795.75       64.45
      numa03.sh       Sys:       94.46      136.08      109.48       14.58
      numa03.sh      User:    57478.56    72014.09    61764.48     5343.69
      numa04.sh      Real:      442.61      715.43      530.31       96.12
      numa04.sh       Sys:      224.90      348.63      285.61       48.83
      numa04.sh      User:    35836.84    47522.47    40235.41     3985.26
      numa05.sh      Real:      386.13      489.17      434.94       43.59
      numa05.sh       Sys:      144.29      438.56      278.80      105.78
      numa05.sh      User:    33255.86    36890.82    34879.31     1641.98
      
      Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
      numa01.sh      Real:      435.78      653.81      534.58       83.20 	 38.39%
      numa01.sh       Sys:      121.93      187.18      145.90       23.47 	 21.79%
      numa01.sh      User:    37082.81    51402.80    43647.60     5409.75 	 31.03%
      numa02.sh      Real:       60.64       61.63       61.19        0.40 	 0.637%
      numa02.sh       Sys:       14.72       25.68       19.06        4.03 	 11.22%
      numa02.sh      User:     5210.95     5266.69     5233.30       20.82 	 0.608%
      numa03.sh      Real:      746.51      808.24      780.36       23.88 	 1.972%
      numa03.sh       Sys:       97.26      108.48      105.07        4.28 	 4.197%
      numa03.sh      User:    58956.30    61397.05    60162.95     1050.82 	 2.661%
      numa04.sh      Real:      465.97      519.27      484.81       19.62 	 9.385%
      numa04.sh       Sys:      304.43      359.08      334.68       20.64 	 -14.6%
      numa04.sh      User:    37544.16    41186.15    39262.44     1314.91 	 2.478%
      numa05.sh      Real:      411.57      457.20      433.29       16.58 	 0.380%
      numa05.sh       Sys:      230.05      435.48      339.95       67.58 	 -17.9%
      numa05.sh      User:    33325.54    36896.31    35637.84     1222.64 	 -2.12%
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NRik van Riel <riel@surriel.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-4-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      5f95ba7a
    • S
      sched/numa: Evaluate move once per node · 305c1fac
      Srikar Dronamraju 提交于
      task_numa_compare() helps choose the best CPU to move or swap the
      selected task. To achieve this task_numa_compare() is called for every
      CPU in the node. Currently it evaluates if the task can be moved/swapped
      for each of the CPUs. However the move evaluation is mostly independent
      of the CPU. Evaluating the move logic once per node, provides scope for
      simplifying task_numa_compare().
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25705.2     25058.2     -2.51
      1     74433       72950       -1.99
      
      Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      8     96589.6     105930      9.670
      1     181830      178624      -1.76
      
      (numbers from v1 based on v4.17-rc5)
      Testcase       Time:         Min         Max         Avg      StdDev
      numa01.sh      Real:      440.65      941.32      758.98      189.17
      numa01.sh       Sys:      183.48      320.07      258.42       50.09
      numa01.sh      User:    37384.65    71818.14    60302.51    13798.96
      numa02.sh      Real:       61.24       65.35       62.49        1.49
      numa02.sh       Sys:       16.83       24.18       21.40        2.60
      numa02.sh      User:     5219.59     5356.34     5264.03       49.07
      numa03.sh      Real:      822.04      912.40      873.55       37.35
      numa03.sh       Sys:      118.80      140.94      132.90        7.60
      numa03.sh      User:    62485.19    70025.01    67208.33     2967.10
      numa04.sh      Real:      690.66      872.12      778.49       65.44
      numa04.sh       Sys:      459.26      563.03      494.03       42.39
      numa04.sh      User:    51116.44    70527.20    58849.44     8461.28
      numa05.sh      Real:      418.37      562.28      525.77       54.27
      numa05.sh       Sys:      299.45      481.00      392.49       64.27
      numa05.sh      User:    34115.09    41324.02    39105.30     2627.68
      
      Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
      numa01.sh      Real:      516.14      892.41      739.84      151.32 	 2.587%
      numa01.sh       Sys:      153.16      192.99      177.70       14.58 	 45.42%
      numa01.sh      User:    39821.04    69528.92    57193.87    10989.48 	 5.435%
      numa02.sh      Real:       60.91       62.35       61.58        0.63 	 1.477%
      numa02.sh       Sys:       16.47       26.16       21.20        3.85 	 0.943%
      numa02.sh      User:     5227.58     5309.61     5265.17       31.04 	 -0.02%
      numa03.sh      Real:      739.07      917.73      795.75       64.45 	 9.776%
      numa03.sh       Sys:       94.46      136.08      109.48       14.58 	 21.39%
      numa03.sh      User:    57478.56    72014.09    61764.48     5343.69 	 8.813%
      numa04.sh      Real:      442.61      715.43      530.31       96.12 	 46.79%
      numa04.sh       Sys:      224.90      348.63      285.61       48.83 	 72.97%
      numa04.sh      User:    35836.84    47522.47    40235.41     3985.26 	 46.26%
      numa05.sh      Real:      386.13      489.17      434.94       43.59 	 20.88%
      numa05.sh       Sys:      144.29      438.56      278.80      105.78 	 40.77%
      numa05.sh      User:    33255.86    36890.82    34879.31     1641.98 	 12.11%
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-3-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      305c1fac
    • S
      sched/numa: Remove redundant field · 6e303967
      Srikar Dronamraju 提交于
      'numa_entry' is a struct list_head defined in task_struct, but never used.
      
      No functional change.
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NRik van Riel <riel@surriel.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-2-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6e303967
    • Y
      sched/debug: Show the sum wait time of a task group · 3d6c50c2
      Yun Wang 提交于
      Although we can rely on cpuacct to present the CPU usage of task
      groups, it is hard to tell how intense the competition is between
      these groups on CPU resources.
      
      Monitoring the wait time or sched_debug of each process could be
      very expensive, and there is no good way to accurately represent the
      conflict with these info, we need the wait time on group dimension.
      
      Thus we introduce group's wait_sum to represent the resource conflict
      between task groups, which is simply the sum of the wait time of
      the group's cfs_rq.
      
      The 'cpu.stat' is modified to show the statistic, like:
      
         nr_periods 0
         nr_throttled 0
         throttled_time 0
         wait_sum 2035098795584
      
      Now we can monitor the changes of wait_sum to tell how much a
      a task group is suffering in the fight of CPU resources.
      
      For example:
      
         (wait_sum - last_wait_sum) * 100 / (nr_cpu * period_ns) == X%
      
      means the task group paid X percentage of period on waiting
      for the CPU.
      Signed-off-by: NMichael Wang <yun.wang@linux.alibaba.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/ff7dae3b-e5f9-7157-1caa-ff02c6b23dc1@linux.alibaba.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      3d6c50c2
    • V
      sched/fair: Remove #ifdefs from scale_rt_capacity() · 2e62c474
      Vincent Guittot 提交于
      Reuse cpu_util_irq() that has been defined for schedutil and set irq util
      to 0 when !CONFIG_IRQ_TIME_ACCOUNTING.
      
      But the compiler is not able to optimize the sequence (at least with
      aarch64 GCC 7.2.1):
      
      	free *= (max - irq);
      	free /= max;
      
      when irq is fixed to 0
      
      Add a new inline function scale_irq_capacity() that will scale utilization
      when irq is accounted. Reuse this funciton in schedutil which applies
      similar formula.
      Suggested-by: NIngo Molnar <mingo@redhat.com>
      Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NViresh Kumar <viresh.kumar@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: rjw@rjwysocki.net
      Link: http://lkml.kernel.org/r/1532001606-6689-1-git-send-email-vincent.guittot@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      2e62c474
    • I
      4765096f
    • H
      sched/rt: Restore rt_runtime after disabling RT_RUNTIME_SHARE · f3d133ee
      Hailong Liu 提交于
      NO_RT_RUNTIME_SHARE feature is used to prevent a CPU borrow enough
      runtime with a spin-rt-task.
      
      However, if RT_RUNTIME_SHARE feature is enabled and rt_rq has borrowd
      enough rt_runtime at the beginning, rt_runtime can't be restored to
      its initial bandwidth rt_runtime after we disable RT_RUNTIME_SHARE.
      
      E.g. on my PC with 4 cores, procedure to reproduce:
      1) Make sure  RT_RUNTIME_SHARE is enabled
       cat /sys/kernel/debug/sched_features
        GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY
        CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_DOUBLE_TICK
        LB_BIAS NONTASK_CAPACITY TTWU_QUEUE NO_SIS_AVG_CPU SIS_PROP
        NO_WARN_DOUBLE_CLOCK RT_PUSH_IPI RT_RUNTIME_SHARE NO_LB_MIN
        ATTACH_AGE_LOAD WA_IDLE WA_WEIGHT WA_BIAS
      2) Start a spin-rt-task
       ./loop_rr &
      3) set affinity to the last cpu
       taskset -p 8 $pid_of_loop_rr
      4) Observe that last cpu have borrowed enough runtime.
       cat /proc/sched_debug | grep rt_runtime
        .rt_runtime                    : 950.000000
        .rt_runtime                    : 900.000000
        .rt_runtime                    : 950.000000
        .rt_runtime                    : 1000.000000
      5) Disable RT_RUNTIME_SHARE
       echo NO_RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features
      6) Observe that rt_runtime can not been restored
       cat /proc/sched_debug | grep rt_runtime
        .rt_runtime                    : 950.000000
        .rt_runtime                    : 900.000000
        .rt_runtime                    : 950.000000
        .rt_runtime                    : 1000.000000
      
      This patch help to restore rt_runtime after we disable
      RT_RUNTIME_SHARE.
      Signed-off-by: NHailong Liu <liu.hailong6@zte.com.cn>
      Signed-off-by: NJiang Biao <jiang.biao2@zte.com.cn>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: zhong.weidong@zte.com.cn
      Link: http://lkml.kernel.org/r/1531874815-39357-1-git-send-email-liu.hailong6@zte.com.cnSigned-off-by: NIngo Molnar <mingo@kernel.org>
      f3d133ee
    • D
      sched/deadline: Update rq_clock of later_rq when pushing a task · 840d7196
      Daniel Bristot de Oliveira 提交于
      Daniel Casini got this warn while running a DL task here at RetisLab:
      
        [  461.137582] ------------[ cut here ]------------
        [  461.137583] rq->clock_update_flags < RQCF_ACT_SKIP
        [  461.137599] WARNING: CPU: 4 PID: 2354 at kernel/sched/sched.h:967 assert_clock_updated.isra.32.part.33+0x17/0x20
            [a ton of modules]
        [  461.137646] CPU: 4 PID: 2354 Comm: label_image Not tainted 4.18.0-rc4+ #3
        [  461.137647] Hardware name: ASUS All Series/Z87-K, BIOS 0801 09/02/2013
        [  461.137649] RIP: 0010:assert_clock_updated.isra.32.part.33+0x17/0x20
        [  461.137649] Code: ff 48 89 83 08 09 00 00 eb c6 66 0f 1f 84 00 00 00 00 00 55 48 c7 c7 98 7a 6c a5 c6 05 bc 0d 54 01 01 48 89 e5 e8 a9 84 fb ff <0f> 0b 5d c3 0f 1f 44 00 00 0f 1f 44 00 00 83 7e 60 01 74 0a 48 3b
        [  461.137673] RSP: 0018:ffffa77e08cafc68 EFLAGS: 00010082
        [  461.137674] RAX: 0000000000000000 RBX: ffff8b3fc1702d80 RCX: 0000000000000006
        [  461.137674] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff8b3fded164b0
        [  461.137675] RBP: ffffa77e08cafc68 R08: 0000000000000026 R09: 0000000000000339
        [  461.137676] R10: ffff8b3fd060d410 R11: 0000000000000026 R12: ffffffffa4e14e20
        [  461.137677] R13: ffff8b3fdec22940 R14: ffff8b3fc1702da0 R15: ffff8b3fdec22940
        [  461.137678] FS:  00007efe43ee5700(0000) GS:ffff8b3fded00000(0000) knlGS:0000000000000000
        [  461.137679] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [  461.137680] CR2: 00007efe30000010 CR3: 0000000301744003 CR4: 00000000001606e0
        [  461.137680] Call Trace:
        [  461.137684]  push_dl_task.part.46+0x3bc/0x460
        [  461.137686]  task_woken_dl+0x60/0x80
        [  461.137689]  ttwu_do_wakeup+0x4f/0x150
        [  461.137690]  ttwu_do_activate+0x77/0x80
        [  461.137692]  try_to_wake_up+0x1d6/0x4c0
        [  461.137693]  wake_up_q+0x32/0x70
        [  461.137696]  do_futex+0x7e7/0xb50
        [  461.137698]  __x64_sys_futex+0x8b/0x180
        [  461.137701]  do_syscall_64+0x5a/0x110
        [  461.137703]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [  461.137705] RIP: 0033:0x7efe4918ca26
        [  461.137705] Code: 00 00 00 74 17 49 8b 48 20 44 8b 59 10 41 83 e3 30 41 83 fb 20 74 1e be 85 00 00 00 41 ba 01 00 00 00 41 b9 01 00 00 04 0f 05 <48> 3d 01 f0 ff ff 73 1f 31 c0 c3 be 8c 00 00 00 49 89 c8 4d 31 d2
        [  461.137738] RSP: 002b:00007efe43ee4928 EFLAGS: 00000283 ORIG_RAX: 00000000000000ca
        [  461.137739] RAX: ffffffffffffffda RBX: 0000000005094df0 RCX: 00007efe4918ca26
        [  461.137740] RDX: 0000000000000001 RSI: 0000000000000085 RDI: 0000000005094e24
        [  461.137741] RBP: 00007efe43ee49c0 R08: 0000000005094e20 R09: 0000000004000001
        [  461.137741] R10: 0000000000000001 R11: 0000000000000283 R12: 0000000000000000
        [  461.137742] R13: 0000000005094df8 R14: 0000000000000001 R15: 0000000000448a10
        [  461.137743] ---[ end trace 187df4cad2bf7649 ]---
      
      This warning happened in the push_dl_task(), because
      __add_running_bw()->cpufreq_update_util() is getting the rq_clock of
      the later_rq before its update, which takes place at activate_task().
      The fix then is to update the rq_clock before calling add_running_bw().
      
      To avoid double rq_clock_update() call, we set ENQUEUE_NOCLOCK flag to
      activate_task().
      Reported-by: NDaniel Casini <daniel.casini@santannapisa.it>
      Signed-off-by: NDaniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NJuri Lelli <juri.lelli@redhat.com>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@santannapisa.it>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tommaso Cucinotta <tommaso.cucinotta@santannapisa.it>
      Fixes: e0367b12 sched/deadline: Move CPU frequency selection triggering points
      Link: http://lkml.kernel.org/r/ca31d073a4788acf0684a8b255f14fea775ccf20.1532077269.git.bristot@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      840d7196
    • I
      stop_machine: Disable preemption after queueing stopper threads · 2610e889
      Isaac J. Manjarres 提交于
      This commit:
      
        9fb8d5dc ("stop_machine, Disable preemption when waking two stopper threads")
      
      does not fully address the race condition that can occur
      as follows:
      
      On one CPU, call it CPU 3, thread 1 invokes
      cpu_stop_queue_two_works(2, 3,...), and the execution is such
      that thread 1 queues the works for migration/2 and migration/3,
      and is preempted after releasing the locks for migration/2 and
      migration/3, but before waking the threads.
      
      Then, On CPU 2, a kworker, call it thread 2, is running,
      and it invokes cpu_stop_queue_two_works(1, 2,...), such that
      thread 2 queues the works for migration/1 and migration/2.
      Meanwhile, on CPU 3, thread 1 resumes execution, and wakes
      migration/2 and migration/3. This means that when CPU 2
      releases the locks for migration/1 and migration/2, but before
      it wakes those threads, it can be preempted by migration/2.
      
      If thread 2 is preempted by migration/2, then migration/2 will
      execute the first work item successfully, since migration/3
      was woken up by CPU 3, but when it goes to execute the second
      work item, it disables preemption, calls multi_cpu_stop(),
      and thus, CPU 2 will wait forever for migration/1, which should
      have been woken up by thread 2. However migration/1 cannot be
      woken up by thread 2, since it is a kworker, so it is affine to
      CPU 2, but CPU 2 is running migration/2 with preemption
      disabled, so thread 2 will never run.
      
      Disable preemption after queueing works for stopper threads
      to ensure that the operation of queueing the works and waking
      the stopper threads is atomic.
      Co-Developed-by: NPrasad Sodagudi <psodagud@codeaurora.org>
      Co-Developed-by: NPavankumar Kondeti <pkondeti@codeaurora.org>
      Signed-off-by: NIsaac J. Manjarres <isaacm@codeaurora.org>
      Signed-off-by: NPrasad Sodagudi <psodagud@codeaurora.org>
      Signed-off-by: NPavankumar Kondeti <pkondeti@codeaurora.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bigeasy@linutronix.de
      Cc: gregkh@linuxfoundation.org
      Cc: matt@codeblueprint.co.uk
      Fixes: 9fb8d5dc ("stop_machine, Disable preemption when waking two stopper threads")
      Link: http://lkml.kernel.org/r/1531856129-9871-1-git-send-email-isaacm@codeaurora.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      2610e889
    • Y
      sched/topology: Check variable group before dereferencing it · 6cd0c583
      Yi Wang 提交于
      The 'group' variable in sched_domain_debug_one() is not checked
      when firstly used in cpumask_test_cpu(cpu, sched_group_span(group)),
      but it might be NULL (it is checked later in the following while loop)
      and may cause NULL pointer dereference.
      
      We need to check it before using to avoid NULL dereference.
      Signed-off-by: NYi Wang <wang.yi59@zte.com.cn>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NJiang Biao <jiang.biao2@zte.com.cn>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: zhong.weidong@zte.com.cn
      Link: http://lkml.kernel.org/r/1532319547-33335-1-git-send-email-wang.yi59@zte.com.cnSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6cd0c583
  2. 23 7月, 2018 4 次提交
    • L
      Linux 4.18-rc6 · d72e90f3
      Linus Torvalds 提交于
      d72e90f3
    • L
      Merge tag 'nvme-for-4.18' of git://git.infradead.org/nvme · 74413084
      Linus Torvalds 提交于
      Pull NVMe fixes from Christoph Hellwig:
      
       - fix a regression in 4.18 that causes a memory leak on probe failure
         (Keith Bush)
      
       - fix a deadlock in the passthrough ioctl code (Scott Bauer)
      
       - don't enable AENs if not supported (Weiping Zhang)
      
       - fix an old regression in metadata handling in the passthrough ioctl
         code (Roland Dreier)
      
      * tag 'nvme-for-4.18' of git://git.infradead.org/nvme:
        nvme: fix handling of metadata_len for NVME_IOCTL_IO_CMD
        nvme: don't enable AEN if not supported
        nvme: ensure forward progress during Admin passthru
        nvme-pci: fix memory leak on probe failure
      74413084
    • L
      Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 165ea0d1
      Linus Torvalds 提交于
      Pull vfs fixes from Al Viro:
       "Fix several places that screw up cleanups after failures halfway
        through opening a file (one open-coding filp_clone_open() and getting
        it wrong, two misusing alloc_file()). That part is -stable fodder from
        the 'work.open' branch.
      
        And Christoph's regression fix for uapi breakage in aio series;
        include/uapi/linux/aio_abi.h shouldn't be pulling in the kernel
        definition of sigset_t, the reason for doing so in the first place had
        been bogus - there's no need to expose struct __aio_sigset in
        aio_abi.h at all"
      
      * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        aio: don't expose __aio_sigset in uapi
        ocxlflash_getfile(): fix double-iput() on alloc_file() failures
        cxl_getfile(): fix double-iput() on alloc_file() failures
        drm_mode_create_lease_ioctl(): fix open-coded filp_clone_open()
      165ea0d1
    • A
      alpha: fix osf_wait4() breakage · f88a333b
      Al Viro 提交于
      kernel_wait4() expects a userland address for status - it's only
      rusage that goes as a kernel one (and needs a copyout afterwards)
      
      [ Also, fix the prototype of kernel_wait4() to have that __user
        annotation   - Linus ]
      
      Fixes: 92ebce5a ("osf_wait4: switch to kernel_wait4()")
      Cc: stable@kernel.org # v4.13+
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f88a333b
  3. 22 7月, 2018 16 次提交