1. 25 Jul, 2018 13 commits
    • S
      sched/numa: Modify migrate_swap() to accept additional parameters · 0ad4e3df
      Srikar Dronamraju committed
      There are checks in migrate_swap_stop() that verify the task/CPU
      combination matches migrate_swap_arg before migrating.
      
      However, at least one of the two tasks to be swapped by migrate_swap() could
      have migrated to a completely different CPU before migrate_swap_arg was
      updated. The new CPU where the task is currently running could
      be on a different node too. If the task has migrated, the NUMA balancer might
      end up placing a task on the wrong node. Instead of achieving node
      consolidation, it may end up spreading the load across nodes.
      
      To avoid that, pass the CPUs as additional parameters.
      
      While here, place migrate_swap under CONFIG_NUMA_BALANCING.
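
The idea of passing the CPUs explicitly can be illustrated with a small user-space sketch. The types and helper names below (struct task, swap_if_unchanged(), migrate_swap_sketch()) are hypothetical stand-ins, not the kernel's definitions; the sketch only shows a swap being refused when a task is no longer on the CPU the caller evaluated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a task: only the CPU it currently runs on. */
struct task {
    int cpu;
};

/* Simplified swap request: both tasks plus the CPUs the caller evaluated. */
struct migration_swap_arg {
    struct task *src_task, *dst_task;
    int src_cpu, dst_cpu;
};

/* Swap the two tasks only if each is still on the CPU recorded in arg. */
static bool swap_if_unchanged(struct migration_swap_arg *arg)
{
    if (arg->src_task->cpu != arg->src_cpu ||
        arg->dst_task->cpu != arg->dst_cpu)
        return false;   /* a task moved in the meantime: do nothing */

    arg->src_task->cpu = arg->dst_cpu;
    arg->dst_task->cpu = arg->src_cpu;
    return true;
}

/*
 * The caller passes the CPUs it based its decision on, instead of
 * re-reading each task's current CPU when filling in the request.
 */
static bool migrate_swap_sketch(struct task *cur, struct task *p,
                                int target_cpu, int curr_cpu)
{
    struct migration_swap_arg arg = {
        .src_task = cur, .src_cpu = curr_cpu,
        .dst_task = p,   .dst_cpu = target_cpu,
    };

    assert(cur != NULL && p != NULL);
    return swap_if_unchanged(&arg);
}
```

If the request were instead filled in from each task's current CPU, the check would pass trivially even after a task had moved, which is the situation the commit message describes.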
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25377.3     25226.6     -0.59
      1     72287       73326       1.437
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-10-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • S
      sched/numa: Remove unused task_capacity from 'struct numa_stats' · 10864a9e
      Srikar Dronamraju committed
      The task_capacity field in 'struct numa_stats' is redundant.
      Also move nr_running for better packing within the struct.
      
      No functional changes.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25308.6     25377.3     0.271
      1     72964       72287       -0.92
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Rik van Riel <riel@surriel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-9-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • S
      sched/numa: Skip nodes that are at 'hoplimit' · 0ee7e74d
      Srikar Dronamraju committed
      When comparing two nodes at a distance of 'hoplimit', we should consider
      nodes only up to 'hoplimit'. Currently we also consider nodes at 'hoplimit'
      distance. Hence two nodes at a distance of 'hoplimit' will have the same
      groupweight. Fix this by skipping nodes at hoplimit.
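
A small user-space sketch may help show why including nodes at exactly 'hoplimit' makes the two groups indistinguishable. The distance table, the fault counts and the group_score() helper are made up for illustration and are much simpler than the kernel's scoring code.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_NODES 4

/* Made-up distance table: nodes {0,1} and {2,3} form two nearby pairs. */
static const int node_distance[NR_NODES][NR_NODES] = {
    { 10, 20, 40, 40 },
    { 20, 10, 40, 40 },
    { 40, 40, 10, 20 },
    { 40, 40, 20, 10 },
};

/*
 * Sum the faults of all nodes that count as "near" nid.
 * With skip_at_hoplimit == false, nodes at exactly hoplimit are included
 * (the old behaviour described above); with true they are skipped.
 */
static unsigned long group_score(int nid, int hoplimit,
                                 const unsigned long faults[NR_NODES],
                                 bool skip_at_hoplimit)
{
    unsigned long score = 0;

    assert(nid >= 0 && nid < NR_NODES);
    for (int node = 0; node < NR_NODES; node++) {
        int dist = node_distance[nid][node];

        if (skip_at_hoplimit ? dist >= hoplimit : dist > hoplimit)
            continue;
        score += faults[node];
    }
    return score;
}
```

With faults of {10, 30, 5, 1} and a hoplimit of 40, the inclusive variant scores nodes 0 and 2 identically because each group then covers every node, while the skipping variant separates them.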
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25375.3     25308.6     -0.26
      1     72617       72964       0.477
      
      Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      8     113372      108750      -4.07684
      1     177403      183115      3.21979
      
      (numbers from v1 based on v4.17-rc5)
      Testcase       Time:         Min         Max         Avg      StdDev
      numa01.sh      Real:      478.45      565.90      515.11       30.87
      numa01.sh       Sys:      207.79      271.04      232.94       21.33
      numa01.sh      User:    39763.93    47303.12    43210.73     2644.86
      numa02.sh      Real:       60.00       61.46       60.78        0.49
      numa02.sh       Sys:       15.71       25.31       20.69        3.42
      numa02.sh      User:     5175.92     5265.86     5235.97       32.82
      numa03.sh      Real:      776.42      834.85      806.01       23.22
      numa03.sh       Sys:      114.43      128.75      121.65        5.49
      numa03.sh      User:    60773.93    64855.25    62616.91     1576.39
      numa04.sh      Real:      456.93      511.95      482.91       20.88
      numa04.sh       Sys:      178.09      460.89      356.86       94.58
      numa04.sh      User:    36312.09    42553.24    39623.21     2247.96
      numa05.sh      Real:      393.98      493.48      436.61       35.59
      numa05.sh       Sys:      164.49      329.15      265.87       61.78
      numa05.sh      User:    33182.65    36654.53    35074.51     1187.71
      
      Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
      numa01.sh      Real:      414.64      819.20      556.08      147.70 	 -7.36%
      numa01.sh       Sys:       77.52      205.04      139.40       52.05 	 67.10%
      numa01.sh      User:    37043.24    61757.88    45517.48     9290.38 	 -5.06%
      numa02.sh      Real:       60.80       63.32       61.63        0.88 	 -1.37%
      numa02.sh       Sys:       17.35       39.37       25.71        7.33 	 -19.5%
      numa02.sh      User:     5213.79     5374.73     5268.90       55.09 	 -0.62%
      numa03.sh      Real:      780.09      948.64      831.43       63.02 	 -3.05%
      numa03.sh       Sys:      104.96      136.92      116.31       11.34 	 4.591%
      numa03.sh      User:    60465.42    73339.78    64368.03     4700.14 	 -2.72%
      numa04.sh      Real:      412.60      681.92      521.29       96.64 	 -7.36%
      numa04.sh       Sys:      210.32      314.10      251.77       37.71 	 41.74%
      numa04.sh      User:    34026.38    45581.20    38534.49     4198.53 	 2.825%
      numa05.sh      Real:      394.79      439.63      411.35       16.87 	 6.140%
      numa05.sh       Sys:      238.32      330.09      292.31       38.32 	 -9.04%
      numa05.sh      User:    33456.45    34876.07    34138.62      609.45 	 2.741%
      
      While there is a regression with this change, this change is needed from a
      correctness perspective. Also it helps consolidation as seen from perf bench
      output.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-8-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • S
      sched/debug: Reverse the order of printing faults · 67d9f6c2
      Srikar Dronamraju committed
      Fix the order in which the private and shared numa faults are getting
      printed.
      
      No functional changes.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25215.7     25375.3     0.63
      1     72107       72617       0.70
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-7-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • S
      sched/numa: Use task faults only if numa_group is not yet set up · f03bb676
      Srikar Dronamraju committed
      When numa_group faults are available, task_numa_placement() uses only
      numa_group faults to evaluate the preferred node. However, it still accounts
      task faults and even evaluates a preferred node based on task
      faults, only to discard it in favour of the preferred node chosen on the
      basis of numa_group.
      
      Instead, use task faults only if numa_group is not set.
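
The selection rule can be sketched in user space as follows. The pick_preferred_node() helper and its array arguments are hypothetical; the kernel's task_numa_placement() also handles fault decay and accounting, which are omitted here.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Pick the node with the most faults. Group faults are used whenever a
 * group exists (group_faults != NULL); task faults are consulted only
 * when there is no group. Returns -1 if no node has any faults.
 */
static int pick_preferred_node(const unsigned long *task_faults,
                               const unsigned long *group_faults,
                               int nr_nodes)
{
    const unsigned long *faults;
    unsigned long max_faults = 0;
    int best = -1;

    assert(task_faults != NULL && nr_nodes > 0);
    faults = group_faults ? group_faults : task_faults;

    for (int nid = 0; nid < nr_nodes; nid++) {
        if (faults[nid] > max_faults) {
            max_faults = faults[nid];
            best = nid;
        }
    }
    return best;
}
```

Only one of the two arrays is scanned per call, which mirrors the point of the change: no task-based candidate is computed just to be thrown away.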
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25549.6     25215.7     -1.30
      1     73190       72107       -1.47
      
      Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      8     113437      113372      -0.05
      1     196130      177403      -9.54
      
      (numbers from v1 based on v4.17-rc5)
      Testcase       Time:         Min         Max         Avg      StdDev
      numa01.sh      Real:      506.35      794.46      599.06      104.26
      numa01.sh       Sys:      150.37      223.56      195.99       24.94
      numa01.sh      User:    43450.69    61752.04    49281.50     6635.33
      numa02.sh      Real:       60.33       62.40       61.31        0.90
      numa02.sh       Sys:       18.12       31.66       24.28        5.89
      numa02.sh      User:     5203.91     5325.32     5260.29       49.98
      numa03.sh      Real:      696.47      853.62      745.80       57.28
      numa03.sh       Sys:       85.68      123.71       97.89       13.48
      numa03.sh      User:    55978.45    66418.63    59254.94     3737.97
      numa04.sh      Real:      444.05      514.83      497.06       26.85
      numa04.sh       Sys:      230.39      375.79      316.23       48.58
      numa04.sh      User:    35403.12    41004.10    39720.80     2163.08
      numa05.sh      Real:      423.09      460.41      439.57       13.92
      numa05.sh       Sys:      287.38      480.15      369.37       68.52
      numa05.sh      User:    34732.12    38016.80    36255.85     1070.51
      
      Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
      numa01.sh      Real:      478.45      565.90      515.11       30.87 	 16.29%
      numa01.sh       Sys:      207.79      271.04      232.94       21.33 	 -15.8%
      numa01.sh      User:    39763.93    47303.12    43210.73     2644.86 	 14.04%
      numa02.sh      Real:       60.00       61.46       60.78        0.49 	 0.871%
      numa02.sh       Sys:       15.71       25.31       20.69        3.42 	 17.35%
      numa02.sh      User:     5175.92     5265.86     5235.97       32.82 	 0.464%
      numa03.sh      Real:      776.42      834.85      806.01       23.22 	 -7.47%
      numa03.sh       Sys:      114.43      128.75      121.65        5.49 	 -19.5%
      numa03.sh      User:    60773.93    64855.25    62616.91     1576.39 	 -5.36%
      numa04.sh      Real:      456.93      511.95      482.91       20.88 	 2.930%
      numa04.sh       Sys:      178.09      460.89      356.86       94.58 	 -11.3%
      numa04.sh      User:    36312.09    42553.24    39623.21     2247.96 	 0.246%
      numa05.sh      Real:      393.98      493.48      436.61       35.59 	 0.677%
      numa05.sh       Sys:      164.49      329.15      265.87       61.78 	 38.92%
      numa05.sh      User:    33182.65    36654.53    35074.51     1187.71 	 3.368%
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-6-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • S
      sched/numa: Set preferred_node based on best_cpu · 8cd45eee
      Srikar Dronamraju committed
      Currently the preferred node is set to dst_nid, which is the last node in the
      iteration whose group weight or task weight is greater than the current
      node. However, that doesn't guarantee that dst_nid has the NUMA capacity
      to move. It also doesn't guarantee that dst_nid has the best_cpu, which
      is the CPU/node ideal for node migration.
      
      Let's consider faults on a 4 node system with group weight numbers
      in different nodes being in 0 < 1 < 2 < 3 proportion. Consider the task
      is running on 3 and 0 is its preferred node but its capacity is full.
      Consider nodes 1, 2 and 3 have capacity. Then the task should be
      migrated to node 1. Currently the task gets moved to node 2. env.dst_nid
      points to the last node whose faults were greater than current node.
      
      Modify to set the preferred node based on best_cpu. Earlier, setting the
      preferred node was skipped if nr_active_nodes is 1. This could result in
      the task being moved out of the preferred node to a random node during
      regular load balancing.
      
      Also while modifying task_numa_migrate(), use sched_setnuma() to set the
      preferred node. This ensures our NUMA accounting is correct.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25122.9     25549.6     1.698
      1     73850       73190       -0.89
      
      Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      8     105930      113437      7.08676
      1     178624      196130      9.80047
      
      (numbers from v1 based on v4.17-rc5)
      Testcase       Time:         Min         Max         Avg      StdDev
      numa01.sh      Real:      435.78      653.81      534.58       83.20
      numa01.sh       Sys:      121.93      187.18      145.90       23.47
      numa01.sh      User:    37082.81    51402.80    43647.60     5409.75
      numa02.sh      Real:       60.64       61.63       61.19        0.40
      numa02.sh       Sys:       14.72       25.68       19.06        4.03
      numa02.sh      User:     5210.95     5266.69     5233.30       20.82
      numa03.sh      Real:      746.51      808.24      780.36       23.88
      numa03.sh       Sys:       97.26      108.48      105.07        4.28
      numa03.sh      User:    58956.30    61397.05    60162.95     1050.82
      numa04.sh      Real:      465.97      519.27      484.81       19.62
      numa04.sh       Sys:      304.43      359.08      334.68       20.64
      numa04.sh      User:    37544.16    41186.15    39262.44     1314.91
      numa05.sh      Real:      411.57      457.20      433.29       16.58
      numa05.sh       Sys:      230.05      435.48      339.95       67.58
      numa05.sh      User:    33325.54    36896.31    35637.84     1222.64
      
      Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
      numa01.sh      Real:      506.35      794.46      599.06      104.26 	 -10.76%
      numa01.sh       Sys:      150.37      223.56      195.99       24.94 	 -25.55%
      numa01.sh      User:    43450.69    61752.04    49281.50     6635.33 	 -11.43%
      numa02.sh      Real:       60.33       62.40       61.31        0.90 	 -0.195%
      numa02.sh       Sys:       18.12       31.66       24.28        5.89 	 -21.49%
      numa02.sh      User:     5203.91     5325.32     5260.29       49.98 	 -0.513%
      numa03.sh      Real:      696.47      853.62      745.80       57.28 	 4.6339%
      numa03.sh       Sys:       85.68      123.71       97.89       13.48 	 7.3347%
      numa03.sh      User:    55978.45    66418.63    59254.94     3737.97 	 1.5323%
      numa04.sh      Real:      444.05      514.83      497.06       26.85 	 -2.464%
      numa04.sh       Sys:      230.39      375.79      316.23       48.58 	 5.8343%
      numa04.sh      User:    35403.12    41004.10    39720.80     2163.08 	 -1.153%
      numa05.sh      Real:      423.09      460.41      439.57       13.92 	 -1.428%
      numa05.sh       Sys:      287.38      480.15      369.37       68.52 	 -7.964%
      numa05.sh      User:    34732.12    38016.80    36255.85     1070.51 	 -1.704%
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-5-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • S
      sched/numa: Simplify load_too_imbalanced() · 5f95ba7a
      Srikar Dronamraju committed
      Currently load_too_imbalanced() cares about the slope of imbalance.
      It doesn't care about the direction of the imbalance.
      
      However, this may not work if the nodes being compared have
      dissimilar capacities. A few nodes might have more cores than other nodes
      in the system. Also, unlike traditional load balancing at a NUMA sched
      domain, multiple requests to migrate from the same source node to the same
      destination node may run in parallel. This can cause a huge load
      imbalance. This is especially true on larger machines with either many
      cores per node or a larger number of nodes in the system. Hence allow
      move/swap only if the imbalance is going to reduce.
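
The rule "allow the move only if the imbalance shrinks" can be sketched in user space as below. The struct node_stats type and the function signature are simplified assumptions for illustration and are not copied from the kernel source; loads are cross-multiplied by the other node's capacity so that nodes of different sizes compare fairly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-node statistics: current load and compute capacity. */
struct node_stats {
    long load;
    long compute_capacity;
};

/*
 * Return true if moving to (src_load, dst_load) would leave the two nodes
 * more imbalanced than they are now, taking capacities into account.
 */
static bool load_too_imbalanced(long src_load, long dst_load,
                                const struct node_stats *src,
                                const struct node_stats *dst)
{
    long imb, old_imb;

    assert(src != NULL && dst != NULL);

    /* Capacity-weighted imbalance after the proposed move or swap. */
    imb = labs(dst_load * src->compute_capacity -
               src_load * dst->compute_capacity);

    /* Capacity-weighted imbalance as things stand today. */
    old_imb = labs(dst->load * src->compute_capacity -
                   src->load * dst->compute_capacity);

    /* Would this change make things worse? */
    return imb > old_imb;
}
```

For two equal-capacity nodes loaded 800 and 200, a move that yields 600 and 400 is accepted, while one that overshoots to 0 and 1000 is rejected.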
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25058.2     25122.9     0.25
      1     72950       73850       1.23
      
      (numbers from v1 based on v4.17-rc5)
      Testcase       Time:         Min         Max         Avg      StdDev
      numa01.sh      Real:      516.14      892.41      739.84      151.32
      numa01.sh       Sys:      153.16      192.99      177.70       14.58
      numa01.sh      User:    39821.04    69528.92    57193.87    10989.48
      numa02.sh      Real:       60.91       62.35       61.58        0.63
      numa02.sh       Sys:       16.47       26.16       21.20        3.85
      numa02.sh      User:     5227.58     5309.61     5265.17       31.04
      numa03.sh      Real:      739.07      917.73      795.75       64.45
      numa03.sh       Sys:       94.46      136.08      109.48       14.58
      numa03.sh      User:    57478.56    72014.09    61764.48     5343.69
      numa04.sh      Real:      442.61      715.43      530.31       96.12
      numa04.sh       Sys:      224.90      348.63      285.61       48.83
      numa04.sh      User:    35836.84    47522.47    40235.41     3985.26
      numa05.sh      Real:      386.13      489.17      434.94       43.59
      numa05.sh       Sys:      144.29      438.56      278.80      105.78
      numa05.sh      User:    33255.86    36890.82    34879.31     1641.98
      
      Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
      numa01.sh      Real:      435.78      653.81      534.58       83.20 	 38.39%
      numa01.sh       Sys:      121.93      187.18      145.90       23.47 	 21.79%
      numa01.sh      User:    37082.81    51402.80    43647.60     5409.75 	 31.03%
      numa02.sh      Real:       60.64       61.63       61.19        0.40 	 0.637%
      numa02.sh       Sys:       14.72       25.68       19.06        4.03 	 11.22%
      numa02.sh      User:     5210.95     5266.69     5233.30       20.82 	 0.608%
      numa03.sh      Real:      746.51      808.24      780.36       23.88 	 1.972%
      numa03.sh       Sys:       97.26      108.48      105.07        4.28 	 4.197%
      numa03.sh      User:    58956.30    61397.05    60162.95     1050.82 	 2.661%
      numa04.sh      Real:      465.97      519.27      484.81       19.62 	 9.385%
      numa04.sh       Sys:      304.43      359.08      334.68       20.64 	 -14.6%
      numa04.sh      User:    37544.16    41186.15    39262.44     1314.91 	 2.478%
      numa05.sh      Real:      411.57      457.20      433.29       16.58 	 0.380%
      numa05.sh       Sys:      230.05      435.48      339.95       67.58 	 -17.9%
      numa05.sh      User:    33325.54    36896.31    35637.84     1222.64 	 -2.12%
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-4-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • S
      sched/numa: Evaluate move once per node · 305c1fac
      Srikar Dronamraju committed
      task_numa_compare() helps choose the best CPU to move or swap the
      selected task. To achieve this, task_numa_compare() is called for every
      CPU in the node. Currently it evaluates if the task can be moved/swapped
      for each of the CPUs. However, the move evaluation is mostly independent
      of the CPU. Evaluating the move logic once per node provides scope for
      simplifying task_numa_compare().
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25705.2     25058.2     -2.51
      1     74433       72950       -1.99
      
      Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      8     96589.6     105930      9.670
      1     181830      178624      -1.76
      
      (numbers from v1 based on v4.17-rc5)
      Testcase       Time:         Min         Max         Avg      StdDev
      numa01.sh      Real:      440.65      941.32      758.98      189.17
      numa01.sh       Sys:      183.48      320.07      258.42       50.09
      numa01.sh      User:    37384.65    71818.14    60302.51    13798.96
      numa02.sh      Real:       61.24       65.35       62.49        1.49
      numa02.sh       Sys:       16.83       24.18       21.40        2.60
      numa02.sh      User:     5219.59     5356.34     5264.03       49.07
      numa03.sh      Real:      822.04      912.40      873.55       37.35
      numa03.sh       Sys:      118.80      140.94      132.90        7.60
      numa03.sh      User:    62485.19    70025.01    67208.33     2967.10
      numa04.sh      Real:      690.66      872.12      778.49       65.44
      numa04.sh       Sys:      459.26      563.03      494.03       42.39
      numa04.sh      User:    51116.44    70527.20    58849.44     8461.28
      numa05.sh      Real:      418.37      562.28      525.77       54.27
      numa05.sh       Sys:      299.45      481.00      392.49       64.27
      numa05.sh      User:    34115.09    41324.02    39105.30     2627.68
      
      Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
      numa01.sh      Real:      516.14      892.41      739.84      151.32 	 2.587%
      numa01.sh       Sys:      153.16      192.99      177.70       14.58 	 45.42%
      numa01.sh      User:    39821.04    69528.92    57193.87    10989.48 	 5.435%
      numa02.sh      Real:       60.91       62.35       61.58        0.63 	 1.477%
      numa02.sh       Sys:       16.47       26.16       21.20        3.85 	 0.943%
      numa02.sh      User:     5227.58     5309.61     5265.17       31.04 	 -0.02%
      numa03.sh      Real:      739.07      917.73      795.75       64.45 	 9.776%
      numa03.sh       Sys:       94.46      136.08      109.48       14.58 	 21.39%
      numa03.sh      User:    57478.56    72014.09    61764.48     5343.69 	 8.813%
      numa04.sh      Real:      442.61      715.43      530.31       96.12 	 46.79%
      numa04.sh       Sys:      224.90      348.63      285.61       48.83 	 72.97%
      numa04.sh      User:    35836.84    47522.47    40235.41     3985.26 	 46.26%
      numa05.sh      Real:      386.13      489.17      434.94       43.59 	 20.88%
      numa05.sh       Sys:      144.29      438.56      278.80      105.78 	 40.77%
      numa05.sh      User:    33255.86    36890.82    34879.31     1641.98 	 12.11%
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-3-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • Y
      sched/debug: Show the sum wait time of a task group · 3d6c50c2
      Yun Wang committed
      Although we can rely on cpuacct to present the CPU usage of task
      groups, it is hard to tell how intense the competition is between
      these groups on CPU resources.
      
      Monitoring the wait time or sched_debug of each process could be
      very expensive, and there is no good way to accurately represent the
      conflict with this information; we need the wait time at the group level.
      
      Thus we introduce the group's wait_sum to represent the resource conflict
      between task groups, which is simply the sum of the wait time of
      the group's cfs_rq.
      
      The 'cpu.stat' is modified to show the statistic, like:
      
         nr_periods 0
         nr_throttled 0
         throttled_time 0
         wait_sum 2035098795584
      
      Now we can monitor the changes of wait_sum to tell how much
      a task group is suffering in the fight for CPU resources.
      
      For example:
      
         (wait_sum - last_wait_sum) * 100 / (nr_cpu * period_ns) == X%
      
      means the task group spent X percent of the period waiting
      for the CPU.
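
As a worked illustration of the formula, a monitoring tool could compute the percentage from two samples of wait_sum roughly as follows. The function name group_wait_percent() is hypothetical, and reading the samples from 'cpu.stat' is left out of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Percentage of one sampling period that a task group spent waiting for
 * a CPU, following the formula from the commit message:
 *   (wait_sum - last_wait_sum) * 100 / (nr_cpu * period_ns)
 * Both wait_sum samples and period_ns are in nanoseconds.
 */
static uint64_t group_wait_percent(uint64_t wait_sum, uint64_t last_wait_sum,
                                   unsigned int nr_cpu, uint64_t period_ns)
{
    assert(nr_cpu > 0 && period_ns > 0);

    /* wait_sum only grows; treat a smaller new sample as "no data". */
    if (wait_sum < last_wait_sum)
        return 0;

    return (wait_sum - last_wait_sum) * 100 / ((uint64_t)nr_cpu * period_ns);
}
```

For instance, on 4 CPUs with a 1 second sampling period, a wait_sum increase of 2 seconds corresponds to 50 percent.
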
      Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/ff7dae3b-e5f9-7157-1caa-ff02c6b23dc1@linux.alibaba.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • V
      sched/fair: Remove #ifdefs from scale_rt_capacity() · 2e62c474
      Vincent Guittot committed
      Reuse cpu_util_irq() that has been defined for schedutil and set irq util
      to 0 when !CONFIG_IRQ_TIME_ACCOUNTING.
      
      But the compiler is not able to optimize the sequence (at least with
      aarch64 GCC 7.2.1):
      
      	free *= (max - irq);
      	free /= max;
      
      when irq is fixed to 0.
      
      Add a new inline function scale_irq_capacity() that will scale utilization
      when irq is accounted. Reuse this function in schedutil, which applies
      a similar formula.
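
The scaling described above can be sketched in user space as two variants of one helper. This is an illustration under assumptions, not the kernel source: the name scale_irq_capacity() comes from the commit message, while the second function name and the idea of selecting between the two by the config option are how this sketch models the "irq not accounted" case.

```c
#include <assert.h>

/*
 * Variant for when IRQ time is accounted: scale util by the fraction
 * of capacity that is not consumed by IRQ handling.
 */
static unsigned long scale_irq_capacity(unsigned long util,
                                        unsigned long irq,
                                        unsigned long max)
{
    assert(max > 0 && irq <= max);

    util *= (max - irq);
    util /= max;
    return util;
}

/*
 * Variant for when IRQ time is not accounted (hypothetical name):
 * return util unchanged, so no multiply/divide pair is emitted for a
 * constant irq of 0.
 */
static unsigned long scale_irq_capacity_noirq(unsigned long util,
                                              unsigned long irq,
                                              unsigned long max)
{
    (void)irq;
    (void)max;
    return util;
}
```

With max = 1024 and irq = 256, a utilization of 512 scales to 384; the no-accounting variant leaves it at 512.
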
      Suggested-by: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: rjw@rjwysocki.net
      Link: http://lkml.kernel.org/r/1532001606-6689-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • H
      sched/rt: Restore rt_runtime after disabling RT_RUNTIME_SHARE · f3d133ee
      Hailong Liu committed
      The NO_RT_RUNTIME_SHARE feature is used to prevent a CPU from borrowing
      enough runtime with a spin-rt-task.
      
      However, if the RT_RUNTIME_SHARE feature is enabled and rt_rq has borrowed
      enough rt_runtime at the beginning, rt_runtime can't be restored to
      its initial bandwidth rt_runtime after we disable RT_RUNTIME_SHARE.
      
      E.g. on my PC with 4 cores, procedure to reproduce:
      1) Make sure RT_RUNTIME_SHARE is enabled
       cat /sys/kernel/debug/sched_features
        GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY
        CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_DOUBLE_TICK
        LB_BIAS NONTASK_CAPACITY TTWU_QUEUE NO_SIS_AVG_CPU SIS_PROP
        NO_WARN_DOUBLE_CLOCK RT_PUSH_IPI RT_RUNTIME_SHARE NO_LB_MIN
        ATTACH_AGE_LOAD WA_IDLE WA_WEIGHT WA_BIAS
      2) Start a spin-rt-task
       ./loop_rr &
      3) Set affinity to the last CPU
       taskset -p 8 $pid_of_loop_rr
      4) Observe that the last CPU has borrowed enough runtime.
       cat /proc/sched_debug | grep rt_runtime
        .rt_runtime                    : 950.000000
        .rt_runtime                    : 900.000000
        .rt_runtime                    : 950.000000
        .rt_runtime                    : 1000.000000
      5) Disable RT_RUNTIME_SHARE
       echo NO_RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features
      6) Observe that rt_runtime cannot be restored
       cat /proc/sched_debug | grep rt_runtime
        .rt_runtime                    : 950.000000
        .rt_runtime                    : 900.000000
        .rt_runtime                    : 950.000000
        .rt_runtime                    : 1000.000000
      
      This patch helps to restore rt_runtime after we disable
      RT_RUNTIME_SHARE.
      Signed-off-by: Hailong Liu <liu.hailong6@zte.com.cn>
      Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: zhong.weidong@zte.com.cn
      Link: http://lkml.kernel.org/r/1531874815-39357-1-git-send-email-liu.hailong6@zte.com.cn
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • D
      sched/deadline: Update rq_clock of later_rq when pushing a task · 840d7196
      Daniel Bristot de Oliveira committed
      Daniel Casini got this warning while running a DL task here at RetisLab:
      
        [  461.137582] ------------[ cut here ]------------
        [  461.137583] rq->clock_update_flags < RQCF_ACT_SKIP
        [  461.137599] WARNING: CPU: 4 PID: 2354 at kernel/sched/sched.h:967 assert_clock_updated.isra.32.part.33+0x17/0x20
            [a ton of modules]
        [  461.137646] CPU: 4 PID: 2354 Comm: label_image Not tainted 4.18.0-rc4+ #3
        [  461.137647] Hardware name: ASUS All Series/Z87-K, BIOS 0801 09/02/2013
        [  461.137649] RIP: 0010:assert_clock_updated.isra.32.part.33+0x17/0x20
        [  461.137649] Code: ff 48 89 83 08 09 00 00 eb c6 66 0f 1f 84 00 00 00 00 00 55 48 c7 c7 98 7a 6c a5 c6 05 bc 0d 54 01 01 48 89 e5 e8 a9 84 fb ff <0f> 0b 5d c3 0f 1f 44 00 00 0f 1f 44 00 00 83 7e 60 01 74 0a 48 3b
        [  461.137673] RSP: 0018:ffffa77e08cafc68 EFLAGS: 00010082
        [  461.137674] RAX: 0000000000000000 RBX: ffff8b3fc1702d80 RCX: 0000000000000006
        [  461.137674] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff8b3fded164b0
        [  461.137675] RBP: ffffa77e08cafc68 R08: 0000000000000026 R09: 0000000000000339
        [  461.137676] R10: ffff8b3fd060d410 R11: 0000000000000026 R12: ffffffffa4e14e20
        [  461.137677] R13: ffff8b3fdec22940 R14: ffff8b3fc1702da0 R15: ffff8b3fdec22940
        [  461.137678] FS:  00007efe43ee5700(0000) GS:ffff8b3fded00000(0000) knlGS:0000000000000000
        [  461.137679] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [  461.137680] CR2: 00007efe30000010 CR3: 0000000301744003 CR4: 00000000001606e0
        [  461.137680] Call Trace:
        [  461.137684]  push_dl_task.part.46+0x3bc/0x460
        [  461.137686]  task_woken_dl+0x60/0x80
        [  461.137689]  ttwu_do_wakeup+0x4f/0x150
        [  461.137690]  ttwu_do_activate+0x77/0x80
        [  461.137692]  try_to_wake_up+0x1d6/0x4c0
        [  461.137693]  wake_up_q+0x32/0x70
        [  461.137696]  do_futex+0x7e7/0xb50
        [  461.137698]  __x64_sys_futex+0x8b/0x180
        [  461.137701]  do_syscall_64+0x5a/0x110
        [  461.137703]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [  461.137705] RIP: 0033:0x7efe4918ca26
        [  461.137705] Code: 00 00 00 74 17 49 8b 48 20 44 8b 59 10 41 83 e3 30 41 83 fb 20 74 1e be 85 00 00 00 41 ba 01 00 00 00 41 b9 01 00 00 04 0f 05 <48> 3d 01 f0 ff ff 73 1f 31 c0 c3 be 8c 00 00 00 49 89 c8 4d 31 d2
        [  461.137738] RSP: 002b:00007efe43ee4928 EFLAGS: 00000283 ORIG_RAX: 00000000000000ca
        [  461.137739] RAX: ffffffffffffffda RBX: 0000000005094df0 RCX: 00007efe4918ca26
        [  461.137740] RDX: 0000000000000001 RSI: 0000000000000085 RDI: 0000000005094e24
        [  461.137741] RBP: 00007efe43ee49c0 R08: 0000000005094e20 R09: 0000000004000001
        [  461.137741] R10: 0000000000000001 R11: 0000000000000283 R12: 0000000000000000
        [  461.137742] R13: 0000000005094df8 R14: 0000000000000001 R15: 0000000000448a10
        [  461.137743] ---[ end trace 187df4cad2bf7649 ]---
      
      This warning happened in push_dl_task() because
      __add_running_bw()->cpufreq_update_util() reads the rq_clock of
      the later_rq before it is updated, which only happens in activate_task().
      The fix is to update the rq_clock before calling add_running_bw().
      
      To avoid a double rq clock update, we pass the ENQUEUE_NOCLOCK flag to
      activate_task().
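
The ordering described above can be sketched as a small user-space model. This is not the kernel code: `toy_rq`, `toy_update_rq_clock()`, `toy_activate_task()` and `TOY_ENQUEUE_NOCLOCK` are hypothetical stand-ins, and the clock read in `toy_push_task()` merely plays the role of the accounting done by add_running_bw().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's ENQUEUE_NOCLOCK flag. */
#define TOY_ENQUEUE_NOCLOCK 0x08

/* Hypothetical, heavily simplified runqueue. */
struct toy_rq {
	unsigned long long clock; /* last sampled time */
	bool clock_updated;       /* models rq->clock_update_flags */
	int clock_updates;        /* number of clock refreshes */
	int nr_running;
};

static void toy_update_rq_clock(struct toy_rq *rq, unsigned long long now)
{
	rq->clock = now;
	rq->clock_updated = true;
	rq->clock_updates++;
}

/* Reading a stale clock is what triggered the warning in the report. */
static unsigned long long toy_rq_clock(const struct toy_rq *rq)
{
	assert(rq->clock_updated);
	return rq->clock;
}

/* Enqueue a task; refresh the clock unless the caller already did. */
static void toy_activate_task(struct toy_rq *rq, unsigned long long now, int flags)
{
	if (!(flags & TOY_ENQUEUE_NOCLOCK))
		toy_update_rq_clock(rq, now);
	rq->nr_running++;
}

/* Push path after the fix: refresh, account, enqueue without a second refresh. */
static unsigned long long toy_push_task(struct toy_rq *later_rq, unsigned long long now)
{
	unsigned long long seen;

	toy_update_rq_clock(later_rq, now);
	seen = toy_rq_clock(later_rq); /* stands in for add_running_bw() */
	toy_activate_task(later_rq, now, TOY_ENQUEUE_NOCLOCK);
	return seen;
}
```

In this model, one call to `toy_push_task()` refreshes the clock exactly once and the accounting step always sees the fresh value.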
      Reported-by: Daniel Casini <daniel.casini@santannapisa.it>
      Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Juri Lelli <juri.lelli@redhat.com>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@santannapisa.it>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tommaso Cucinotta <tommaso.cucinotta@santannapisa.it>
      Fixes: e0367b12 sched/deadline: Move CPU frequency selection triggering points
      Link: http://lkml.kernel.org/r/ca31d073a4788acf0684a8b255f14fea775ccf20.1532077269.git.bristot@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      840d7196
    • Y
      sched/topology: Check variable group before dereferencing it · 6cd0c583
      Yi Wang committed
      The 'group' variable in sched_domain_debug_one() is not checked
      when first used in cpumask_test_cpu(cpu, sched_group_span(group)),
      but it might be NULL (it is checked later, in the following while loop),
      so this may cause a NULL pointer dereference.
      
      Check it before use to avoid the NULL dereference.
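
A minimal sketch of the check-before-dereference pattern, assuming a simplified group type: `toy_group` and `toy_group_has_cpu()` are hypothetical, and a plain bitmask replaces the kernel's cpumask. It is not the actual change to sched_domain_debug_one().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified group: a bitmask of CPUs instead of a cpumask. */
struct toy_group {
	unsigned long span;
	struct toy_group *next;
};

/* Returns true only when a group exists and its span contains the CPU. */
static bool toy_group_has_cpu(const struct toy_group *group, int cpu)
{
	if (!group) /* check before the first dereference */
		return false;
	return (group->span >> cpu) & 1UL;
}
```

The NULL test comes first, so the span is only read when a group is actually present.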
      Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: zhong.weidong@zte.com.cn
      Link: http://lkml.kernel.org/r/1532319547-33335-1-git-send-email-wang.yi59@zte.com.cn
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6cd0c583
  2. 16 Jul, 2018 15 commits
  3. 03 Jul, 2018 6 commits
    • P
      kthread, sched/core: Fix kthread_parkme() (again...) · 1cef1150
      Peter Zijlstra committed
      Gaurav reports that commit:
      
        85f1abe0 ("kthread, sched/wait: Fix kthread_parkme() completion issue")
      
      isn't working for him because of the following race:
      
      > controller Thread                               CPUHP Thread
      > takedown_cpu
      > kthread_park
      > kthread_parkme
      > Set KTHREAD_SHOULD_PARK
      >                                                 smpboot_thread_fn
      >                                                 set Task interruptible
      >
      >
      > wake_up_process
      >  if (!(p->state & state))
      >                 goto out;
      >
      >                                                 Kthread_parkme
      >                                                 SET TASK_PARKED
      >                                                 schedule
      >                                                 raw_spin_lock(&rq->lock)
      > ttwu_remote
      > waiting for __task_rq_lock
      >                                                 context_switch
      >
      >                                                 finish_lock_switch
      >
      >
      >
      >                                                 Case TASK_PARKED
      >                                                 kthread_park_complete
      >
      >
      > SET Running
      
      Furthermore, Oleg noticed that the whole scheduler TASK_PARKED
      handling is buggered: unlike the TASK_DEAD thing, which is done with
      preemption disabled, the current code can still complete early on
      preemption :/
      
      So basically revert that earlier fix and go with a variant of the
      alternative mentioned in the commit. Promote TASK_PARKED to a special
      state to avoid the store-store issue on task->state leading to the
      WARN in kthread_unpark() -> __kthread_bind().
      
      But in addition, add wait_task_inactive() to kthread_park() to ensure
      the task really is PARKED when we return from kthread_park(). This
      avoids the whole "kthread still gets migrated" nonsense -- although it
      would be really good to get this done differently.
      Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 85f1abe0 ("kthread, sched/wait: Fix kthread_parkme() completion issue")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1cef1150
    • V
      sched/util_est: Fix util_est_dequeue() for throttled cfs_rq · 3482d98b
      Vincent Guittot committed
      When a cfs_rq is throttled, the parent cfs_rq->nr_running is decreased and
      everything happens at cfs_rq level. Currently util_est stays unchanged
      in such a case and keeps accounting the utilization of throttled tasks.
      This can somewhat make sense, as we don't dequeue the tasks but only
      throttle the cfs_rq.
      
      If a task of another group is enqueued/dequeued and the root cfs_rq becomes
      idle during the dequeue, util_est will be cleared even though it was
      accounting the util_est of throttled tasks before. So the behavior of util_est
      is not always the same regarding throttled tasks and depends on side
      activity. Furthermore, util_est will not be updated when the cfs_rq is
      unthrottled, as everything happens at cfs_rq level. The main result is that
      util_est will stay null even though we now have running tasks. We have to wait
      for the next dequeue/enqueue of the previously throttled tasks to get an
      up-to-date util_est.
      
      Remove the assumption that the cfs_rq's estimated utilization of a CPU is 0
      if there is no running task, so that the util_est of a task remains until the
      latter is dequeued even if its cfs_rq has been throttled.
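
The behavioral difference can be illustrated with a toy model. It is only loosely based on the description above: `toy_cfs_rq`, `toy_dequeue_old()` and `toy_dequeue_new()` are hypothetical names, and the numbers in the usage note are made up.

```c
#include <assert.h>

/* Hypothetical, simplified root cfs_rq. */
struct toy_cfs_rq {
	unsigned int nr_running;        /* excludes tasks of throttled groups */
	unsigned int util_est_enqueued; /* sum of estimates, throttled ones included */
};

static unsigned int toy_min(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Old behavior: the estimate is forced to 0 when nothing is runnable. */
static void toy_dequeue_old(struct toy_cfs_rq *rq, unsigned int task_util_est)
{
	unsigned int enqueued = 0;

	if (rq->nr_running) {
		enqueued = rq->util_est_enqueued;
		enqueued -= toy_min(enqueued, task_util_est);
	}
	rq->util_est_enqueued = enqueued;
}

/* New behavior: always subtract only the dequeued task's estimate. */
static void toy_dequeue_new(struct toy_cfs_rq *rq, unsigned int task_util_est)
{
	unsigned int enqueued = rq->util_est_enqueued;

	enqueued -= toy_min(enqueued, task_util_est);
	rq->util_est_enqueued = enqueued;
}
```

With an estimate of 300 (200 from throttled tasks, 100 from the task being dequeued) and no runnable task left, the old variant drops to 0 while the new one keeps the 200 that belongs to the throttled tasks.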
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 7f65ea42 ("sched/fair: Add util_est on top of PELT")
      Link: http://lkml.kernel.org/r/1528972380-16268-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3482d98b
    • X
      sched/fair: Advance global expiration when period timer is restarted · f1d1be8a
      Xunlei Pang committed
      When the period timer gets restarted after some idle time, start_cfs_bandwidth()
      doesn't update the expiration information, so expire_cfs_rq_runtime() will
      see cfs_rq->runtime_expires smaller than the rq clock and go to the clock
      drift logic, wasting CPU cycles on the scheduler hot path.
      
      Update the global expiration in start_cfs_bandwidth() to avoid frequent
      expire_cfs_rq_runtime() calls once a new period begins.
      Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180620101834.24455-2-xlpang@linux.alibaba.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f1d1be8a
    • X
      sched/fair: Fix bandwidth timer clock drift condition · 512ac999
      Xunlei Pang committed
      I noticed that cgroup task groups constantly get throttled even
      if they have low CPU usage. This causes some jitter in the response
      time of some of our business containers when CPU quotas are enabled.
      
      It's very simple to reproduce:
      
        mkdir /sys/fs/cgroup/cpu/test
        cd /sys/fs/cgroup/cpu/test
        echo 100000 > cpu.cfs_quota_us
        echo $$ > tasks
      
      then repeat:
      
        cat cpu.stat | grep nr_throttled  # nr_throttled will increase steadily
      
      After some analysis, we found that cfs_rq::runtime_remaining will
      be cleared by expire_cfs_rq_runtime() due to two equal but stale
      "cfs_{b|q}->runtime_expires" after the period timer is re-armed.
      
      The current condition to judge clock drift in expire_cfs_rq_runtime()
      is wrong: the two runtime_expires are actually the same when clock
      drift happens, so this condition can never hit. The original design was
      correctly done by this commit:
      
        a9cf55b2 ("sched: Expire invalid runtime")
      
      ... but was changed to be the current implementation due to its locking bug.
      
      This patch introduces another way: it adds a new field to both the
      cfs_rq and cfs_bandwidth structures to record the expiration update sequence,
      and uses them to figure out whether clock drift has happened (true if they are equal).
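
The sequence-based check can be sketched in user space as follows. This is a simplified model under stated assumptions, not the kernel implementation: all `toy_` names and the 1 ms `TOY_TICK_NSEC` value are hypothetical, and locking is left out entirely.

```c
#include <assert.h>

#define TOY_TICK_NSEC 1000000ULL /* assumed tick length, for illustration */

/* Hypothetical global (per task group) bandwidth state. */
struct toy_cfs_bandwidth {
	unsigned long long runtime_expires;
	int expires_seq; /* bumped whenever runtime_expires is advanced */
};

/* Hypothetical per-CPU state. */
struct toy_cfs_rq {
	long long runtime_remaining;
	unsigned long long runtime_expires;
	int expires_seq; /* snapshot taken when runtime was handed out */
};

/* A new period begins: move the global deadline and bump the sequence. */
static void toy_refill_period(struct toy_cfs_bandwidth *b, unsigned long long period)
{
	b->runtime_expires += period;
	b->expires_seq++;
}

/* Hand out runtime and copy both the deadline and its sequence number. */
static void toy_assign_runtime(struct toy_cfs_rq *rq,
			       const struct toy_cfs_bandwidth *b, long long amount)
{
	rq->runtime_remaining += amount;
	rq->runtime_expires = b->runtime_expires;
	rq->expires_seq = b->expires_seq;
}

static void toy_expire_runtime(struct toy_cfs_rq *rq,
			       const struct toy_cfs_bandwidth *b, unsigned long long now)
{
	if (now < rq->runtime_expires)
		return; /* local deadline not reached yet */

	if (rq->expires_seq == b->expires_seq)
		rq->runtime_expires += TOY_TICK_NSEC; /* same period: clock drift */
	else
		rq->runtime_remaining = 0; /* period moved on: runtime is stale */
}
```

Comparing sequence numbers instead of the two deadlines avoids the case where both deadlines are equal but stale, which is the situation the commit message describes.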
      Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 51f2176d ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")
      Link: http://lkml.kernel.org/r/20180620101834.24455-1-xlpang@linux.alibaba.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      512ac999
    • V
      sched/rt: Fix call to cpufreq_update_util() · 296b2ffe
      Vincent Guittot committed
      With commit:
      
        8f111bc3 ("cpufreq/schedutil: Rewrite CPUFREQ_RT support")
      
      the schedutil governor uses rq->rt.rt_nr_running to detect whether an
      RT task is currently running on the CPU and to set frequency to max
      if necessary.
      
      cpufreq_update_util() is called in enqueue/dequeue_top_rt_rq() but
      rq->rt.rt_nr_running has not been updated yet when dequeue_top_rt_rq() is
      called so schedutil still considers that an RT task is running when the
      last task is dequeued. The update of rq->rt.rt_nr_running happens later
      in dequeue_rt_stack().
      
      In fact, we can take advantage of the fact that rt entities are dequeued and
      then re-enqueued whenever an rt task is enqueued or dequeued.
      As a result, enqueue_top_rt_rq() is always called when a task is
      enqueued or dequeued and also when groups are throttled or unthrottled.
      The only place that does not use enqueue_top_rt_rq() is when the root rt_rq is
      throttled.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: efault@gmx.de
      Cc: juri.lelli@redhat.com
      Cc: patrick.bellasi@arm.com
      Cc: viresh.kumar@linaro.org
      Fixes: 8f111bc3 ('cpufreq/schedutil: Rewrite CPUFREQ_RT support')
      Link: http://lkml.kernel.org/r/1530021202-21695-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      296b2ffe
    • F
      sched/nohz: Skip remote tick on idle task entirely · d9c0ffca
      Frederic Weisbecker committed
      Some people have reported that the warning in sched_tick_remote()
      occasionally triggers, especially under RCU-torture
      pressure:
      
      	WARNING: CPU: 11 PID: 906 at kernel/sched/core.c:3138 sched_tick_remote+0xb6/0xc0
      	Modules linked in:
      	CPU: 11 PID: 906 Comm: kworker/u32:3 Not tainted 4.18.0-rc2+ #1
      	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
      	Workqueue: events_unbound sched_tick_remote
      	RIP: 0010:sched_tick_remote+0xb6/0xc0
      	Code: e8 0f 06 b8 00 c6 03 00 fb eb 9d 8b 43 04 85 c0 75 8d 48 8b 83 e0 0a 00 00 48 85 c0 75 81 eb 88 48 89 df e8 bc fe ff ff eb aa <0f> 0b eb c5 66 0f 1f 44 00 00 bf 17 00 00 00 e8 b6 2e fe ff 0f b6
      	Call Trace:
      	 process_one_work+0x1df/0x3b0
      	 worker_thread+0x44/0x3d0
      	 kthread+0xf3/0x130
      	 ? set_worker_desc+0xb0/0xb0
      	 ? kthread_create_worker_on_cpu+0x70/0x70
      	 ret_from_fork+0x35/0x40
      
      This happens when the remote tick applies on an idle task. Usually the
      idle_cpu() check avoids that, but it is performed before we lock the
      runqueue and it is therefore racy. It was intended to be that way in
      order to avoid useless runqueue locks, since the idle task tick
      callback is a no-op.
      
      Now if the racy check slips out of our hands and we end up remotely
      ticking an idle task, the empty task_tick_idle() is harmless. Still
      it won't pass the WARN_ON_ONCE() test that ensures rq_clock_task() is
      not too far from curr->se.exec_start because update_curr_idle() doesn't
      update the exec_start value like other scheduler policies. Hence the
      reported false positive.
      
      So let's have another check, while the rq is locked, to make sure we
      don't remote tick on an idle task. The lockless idle_cpu() still applies
      to avoid unnecessary rq lock contention.
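
The "cheap lockless check, then authoritative re-check under the lock" pattern can be sketched with a toy model. Everything here is hypothetical: `toy_rq` is not the kernel runqueue, the lock is modeled by a counter, and `race_hook` only simulates another CPU changing state between the two checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified runqueue. */
struct toy_rq {
	bool curr_is_idle;
	int lock_acquisitions; /* how often the "lock" was taken */
	int ticks_delivered;
};

static void toy_rq_lock(struct toy_rq *rq)   { rq->lock_acquisitions++; }
static void toy_rq_unlock(struct toy_rq *rq) { (void)rq; }

/* Helper used to simulate the CPU going idle right after the first check. */
static void toy_go_idle(struct toy_rq *rq)
{
	rq->curr_is_idle = true;
}

static void toy_tick_remote(struct toy_rq *rq, void (*race_hook)(struct toy_rq *))
{
	if (rq->curr_is_idle) /* lockless fast path: avoids lock contention */
		return;

	if (race_hook)
		race_hook(rq); /* state may change before the lock is taken */

	toy_rq_lock(rq);
	if (!rq->curr_is_idle) /* re-check while the state cannot change */
		rq->ticks_delivered++;
	toy_rq_unlock(rq);
}
```

The first check keeps idle CPUs off the lock entirely, while the second one closes the race window that the first check leaves open.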
      Reported-by: Jacek Tomaka <jacekt@dug.com>
      Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reported-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1530203381-31234-1-git-send-email-frederic@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d9c0ffca
  4. 21 Jun, 2018 2 commits
  5. 20 Jun, 2018 3 commits
  6. 15 Jun, 2018 1 commit
    • M
      sched/core / kcov: avoid kcov_area during task switch · 0ed557aa
      Mark Rutland committed
      During a context switch, we first switch_mm() to the next task's mm,
      then switch_to() that new task.  This means that vmalloc'd regions which
      had previously been faulted in can transiently disappear in the context
      of the prev task.
      
      Functions instrumented by KCOV may try to access a vmalloc'd kcov_area
      during this window, and as the fault handling code is instrumented, this
      results in a recursive fault.
      
      We must avoid accessing any kcov_area during this window.  We can do so
      with a new flag in kcov_mode, set prior to switching the mm, and cleared
      once the new task is live.  Since task_struct::kcov_mode isn't always a
      specific enum kcov_mode value, this is made an unsigned int.
      
      The manipulation is hidden behind kcov_{prepare,finish}_switch() helpers,
      which are empty for !CONFIG_KCOV kernels.
      
      The code uses macros because I can't use static inline functions without a
      circular include dependency between <linux/sched.h> and <linux/kcov.h>,
      since the definition of task_struct uses things defined in <linux/kcov.h>
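
A minimal sketch of the flag-in-mode idea, written with macros as the message describes. The names and values (`toy_task`, `TOY_KCOV_IN_CTXSW`, `TOY_KCOV_MODE_TRACE_PC`, `toy_may_record()`) are assumptions made for illustration and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mode value and flag bit; the real values may differ. */
#define TOY_KCOV_MODE_TRACE_PC 2u
#define TOY_KCOV_IN_CTXSW      (1u << 30)

/* kcov_mode is an unsigned int because it can hold a mode plus the flag. */
struct toy_task {
	unsigned int kcov_mode;
};

/* Set before switching the mm, cleared once the new task is live. */
#define toy_kcov_prepare_switch(t) \
	do { (t)->kcov_mode |= TOY_KCOV_IN_CTXSW; } while (0)
#define toy_kcov_finish_switch(t) \
	do { (t)->kcov_mode &= ~TOY_KCOV_IN_CTXSW; } while (0)

/* Coverage may touch the area only when the mode matches exactly. */
static bool toy_may_record(const struct toy_task *t, unsigned int needed_mode)
{
	return t->kcov_mode == needed_mode;
}
```

Because the comparison is an exact match, setting the extra bit is enough to make every instrumented function skip the coverage area during the switch window.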
      
      Link: http://lkml.kernel.org/r/20180504135535.53744-4-mark.rutland@arm.com
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0ed557aa