  1. 02 October 2018 (1 commit)
    • sched/numa: Stop multiple tasks from moving to the CPU at the same time · a4739eca
      Authored by Srikar Dronamraju
      Task migration under NUMA balancing can happen in parallel. More than
      one task might choose to migrate to the same CPU at the same time. This
      can result in:
      
      - During task swap, choosing a task that was not part of the evaluation.
      - During task swap, a task that has just moved to its preferred node being
        moved to a completely different node.
      - During task swap, a task failing to move to its preferred node having to
        wait an extra interval for the next migration opportunity.
      - During task movement, multiple simultaneous movements causing a load imbalance.
      
      This problem is more likely if there are more cores per node or more
      nodes in the system.
      
      Use a per-run-queue variable to track whether NUMA balancing is already
      active on the run-queue.
      
      Specjbb2005 results (8 warehouses)
      Higher bops are better
      
      2 Socket - 2  Node Haswell - X86
      JVMS  Prev    Current  %Change
      4     200194  203353   1.57797
      1     311331  328205   5.41995
      
      2 Socket - 4 Node Power8 - PowerNV
      JVMS  Prev    Current  %Change
      1     197654  214384   8.46429
      
      2 Socket - 2  Node Power9 - PowerNV
      JVMS  Prev    Current  %Change
      4     192605  188553   -2.10379
      1     213402  196273   -8.02664
      
      4 Socket - 4  Node Power7 - PowerVM
      JVMS  Prev     Current  %Change
      8     52227.1  57581.2  10.2516
      1     102529   103468   0.915838
      
      There is a regression on the Power9 box. Looking at the details, that box
      shows a sudden jump in cache misses with this patch. All other parameters
      seem to point towards NUMA consolidation.
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        13,345,784      13,941,377
      migrations                1,127,820       1,157,323
      faults                    374,736         382,175
      cache-misses              55,132,054,603  54,993,823,500
      sched:sched_move_numa     1,923           2,005
      sched:sched_stick_numa    52              14
      sched:sched_swap_numa     595             529
      migrate:mm_migrate_pages  1,932           1,573
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        60605   67099
      numa_hint_faults_local  51804   58456
      numa_hit                239945  240416
      numa_huge_pte_updates   14      18
      numa_interleave         60      65
      numa_local              239865  240339
      numa_other              80      77
      numa_pages_migrated     1931    1574
      numa_pte_updates        67823   77182
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        3,016,467       3,176,453
      migrations                37,326          30,238
      faults                    115,342         87,869
      cache-misses              11,692,155,554  12,544,479,391
      sched:sched_move_numa     965             23
      sched:sched_stick_numa    8               0
      sched:sched_swap_numa     35              6
      migrate:mm_migrate_pages  1,168           10
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        16286   236
      numa_hint_faults_local  11863   201
      numa_hit                112482  72293
      numa_huge_pte_updates   33      0
      numa_interleave         20      26
      numa_local              112419  72233
      numa_other              63      60
      numa_pages_migrated     1144    8
      numa_pte_updates        32859   0
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before       After
      cs                        8,629,724    8,478,820
      migrations                221,052      171,323
      faults                    308,661      307,499
      cache-misses              135,574,913  240,353,599
      sched:sched_move_numa     147          214
      sched:sched_stick_numa    0            0
      sched:sched_swap_numa     2            4
      migrate:mm_migrate_pages  64           89
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        11481   5301
      numa_hint_faults_local  10968   4745
      numa_hit                89773   92943
      numa_huge_pte_updates   0       0
      numa_interleave         1116    899
      numa_local              89220   92345
      numa_other              553     598
      numa_pages_migrated     62      88
      numa_pte_updates        11694   5505
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before     After
      cs                        2,272,887  2,066,172
      migrations                12,206     11,076
      faults                    163,704    149,544
      cache-misses              4,801,186  10,398,067
      sched:sched_move_numa     44         43
      sched:sched_stick_numa    0          0
      sched:sched_swap_numa     0          0
      migrate:mm_migrate_pages  17         6
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        2261    3552
      numa_hint_faults_local  1993    3347
      numa_hit                25726   25611
      numa_huge_pte_updates   0       0
      numa_interleave         239     213
      numa_local              25498   25583
      numa_other              228     28
      numa_pages_migrated     17      6
      numa_pte_updates        2266    3535
      
      perf stats 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before           After
      cs                        117,980,962      99,358,136
      migrations                3,950,220        4,041,607
      faults                    736,979          749,653
      cache-misses              224,976,072,879  225,562,543,251
      sched:sched_move_numa     504              771
      sched:sched_stick_numa    50               14
      sched:sched_swap_numa     239              204
      migrate:mm_migrate_pages  1,260            1,180
      
      vmstat 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        18293   27409
      numa_hint_faults_local  11969   20677
      numa_hit                240854  239988
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              240851  239983
      numa_other              3       5
      numa_pages_migrated     1190    1016
      numa_pte_updates        18106   27916
      
      perf stats 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before          After
      cs                        61,053,158      60,899,307
      migrations                551,586         544,668
      faults                    244,174         270,834
      cache-misses              74,326,766,973  74,543,455,635
      sched:sched_move_numa     344             735
      sched:sched_stick_numa    24              25
      sched:sched_swap_numa     140             174
      migrate:mm_migrate_pages  568             816
      
      vmstat 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        6461    11059
      numa_hint_faults_local  2283    4733
      numa_hit                35661   41384
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              35661   41383
      numa_other              0       1
      numa_pages_migrated     568     815
      numa_pte_updates        6518    11323
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1537552141-27815-2-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 29 September 2018 (1 commit)
  3. 26 September 2018 (1 commit)
  4. 22 September 2018 (2 commits)
  5. 21 September 2018 (2 commits)
  6. 18 September 2018 (1 commit)
  7. 13 September 2018 (2 commits)
  8. 11 September 2018 (1 commit)
  9. 10 September 2018 (12 commits)
    • perf/core: Force USER_DS when recording user stack data · 02e18447
      Authored by Yabin Cui
      Perf can record user stack data in response to a synchronous request, such
      as a tracepoint firing. If this happens under set_fs(KERNEL_DS), then we
      end up reading user stack data using __copy_from_user_inatomic() under
      set_fs(KERNEL_DS). I think this conflicts with the intention of using
      set_fs(KERNEL_DS). And it is explicitly forbidden by hardware on ARM64
      when both CONFIG_ARM64_UAO and CONFIG_ARM64_PAN are used.
      
      So fix this by forcing USER_DS when recording user stack data.
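
      A hedged sketch of the fix's shape (the wrapper function below is
      illustrative; the real code lives in the perf sample-output path): save
      the current address limit, force USER_DS for the duration of the user
      stack copy, then restore it.

        /* Sketch only; the surrounding perf output code is elided. */
        static void sample_user_stack(void *buf, const void __user *user_sp, size_t len)
        {
            mm_segment_t old_fs = get_fs();     /* may be KERNEL_DS here */

            set_fs(USER_DS);                    /* treat the copy as a real user access */
            if (__copy_from_user_inatomic(buf, user_sp, len))
                memset(buf, 0, len);            /* a faulting copy yields empty data */
            set_fs(old_fs);
        }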
      Signed-off-by: Yabin Cui <yabinc@google.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 88b0193d ("perf/callchain: Force USER_DS when invoking perf_callchain_user()")
      Link: http://lkml.kernel.org/r/20180823225935.27035-1-yabinc@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/ww_mutex: Fix spelling mistake "cylic" -> "cyclic" · 0b405c65
      Authored by Colin Ian King
      Trivial fix for a spelling mistake in a pr_err() error message.
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kernel-janitors@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180824112235.8842-1-colin.king@canonical.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/lockdep: Delete unnecessary #include · dc5591a0
      Authored by Ben Hutchings
      Commit:
      
        c3bc8fd6 ("tracing: Centralize preemptirq tracepoints and unify their usage")
      
      added the inclusion of <trace/events/preemptirq.h>.
      
      liblockdep doesn't have a stub version of that header so now fails to build.
      
      However, commit:
      
        bff1b208 ("tracing: Partial revert of "tracing: Centralize preemptirq tracepoints and unify their usage"")
      
      removed the use of functions declared in that header. So delete the #include.
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <alexander.levin@verizon.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Fixes: bff1b208 ("tracing: Partial revert of "tracing: Centralize ...")
      Fixes: c3bc8fd6 ("tracing: Centralize preemptirq tracepoints ...")
      Link: http://lkml.kernel.org/r/20180828203315.GD18030@decadent.org.uk
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Fix kernel-doc notation warning · 882a78a9
      Authored by Randy Dunlap
      Fix kernel-doc warning for missing 'flags' parameter description:
      
      ../kernel/sched/fair.c:3371: warning: Function parameter or member 'flags' not described in 'attach_entity_load_avg'
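
      For reference, kernel-doc wants a matching '@name:' line for every
      parameter; the missing description would look roughly like this (the
      wording is illustrative):

        /**
         * attach_entity_load_avg - attach this entity to its cfs_rq load average
         * @cfs_rq: cfs_rq to attach to
         * @se: sched_entity to attach
         * @flags: migration hint flags (illustrative wording)
         */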
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: ea14b57e ("sched/cpufreq: Provide migration hint")
      Link: http://lkml.kernel.org/r/cdda0d42-880d-4229-a9f7-5899c977a063@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • jump_label: Fix typo in warning message · da260fe1
      Authored by Borislav Petkov
      There's no 'allocatote' - use the next best thing: 'allocate' :-)
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180907103521.31344-1-bp@alien8.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Fix load_balance redo for !imbalance · bb3485c8
      Authored by Vincent Guittot
      It can happen that load_balance() finds a busiest group and then a
      busiest rq but the calculated imbalance is in fact 0.
      
      In such a situation, detach_tasks() returns immediately and leaves the
      LBF_ALL_PINNED flag set. The busiest CPU is then wrongly assumed to
      have pinned tasks and is removed from the load balance mask. We then
      redo the load balance without the busiest CPU, which creates a wrong
      load balance situation and generates spurious task migrations.
      
      If the calculated imbalance is 0, it's useless to try to find a
      busiest rq as no task will be migrated and we can return immediately.
      
      This situation can happen on heterogeneous systems, or on SMP systems when
      RT tasks reduce the capacity of some CPUs.
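
      A hedged sketch of the fix's shape (surrounding load_balance() logic
      elided): bail out before looking for a busiest run-queue when the
      computed imbalance is zero.

        /* Sketch only; inside load_balance(), after the imbalance is computed. */
        if (!env.imbalance) {
            /*
             * Nothing to migrate: do not fall through to
             * find_busiest_queue()/detach_tasks(), which would leave
             * LBF_ALL_PINNED set and wrongly exclude the busiest CPU
             * from the redo pass.
             */
            goto out_balanced;
        }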
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: jhugo@codeaurora.org
      Link: http://lkml.kernel.org/r/1536306664-29827-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Fix scale_rt_capacity() for SMT · 287cdaac
      Authored by Vincent Guittot
      Since commit:
      
        523e979d ("sched/core: Use PELT for scale_rt_capacity()")
      
      scale_rt_capacity() returns the remaining capacity and not a scale factor
      to apply on cpu_capacity_orig. arch_scale_cpu() is directly called by
      scale_rt_capacity() so we must take the sched_domain argument.
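
      A hedged sketch of the change's shape (details simplified; the real
      function also accounts for other pressure): pass the sched_domain down
      so arch_scale_cpu_capacity() can apply SMT scaling while computing the
      remaining capacity.

        /* Sketch only: scale_rt_capacity() now receives the sched_domain. */
        static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
        {
            struct rq *rq = cpu_rq(cpu);
            unsigned long max = arch_scale_cpu_capacity(sd, cpu);  /* SMT-aware */
            unsigned long used;

            used = READ_ONCE(rq->avg_rt.util_avg) + READ_ONCE(rq->avg_dl.util_avg);
            if (unlikely(used >= max))
                return 1;

            return max - used;      /* remaining capacity, not a scale factor */
        }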
      Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 523e979d ("sched/core: Use PELT for scale_rt_capacity()")
      Link: http://lkml.kernel.org/r/20180904093626.GA23936@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Fix vruntime_normalized() for remote non-migration wakeup · d0cdb3ce
      Authored by Steve Muckle
      When a task which previously ran on a given CPU is remotely queued to
      wake up on that same CPU, there is a period where the task's state is
      TASK_WAKING and its vruntime is not normalized. This is not accounted
      for in vruntime_normalized() which will cause an error in the task's
      vruntime if it is switched from the fair class during this time.
      
      For example if it is boosted to RT priority via rt_mutex_setprio(),
      rq->min_vruntime will not be subtracted from the task's vruntime but
      it will be added again when the task returns to the fair class. The
      task's vruntime will have been erroneously doubled and the effective
      priority of the task will be reduced.
      
      Note this will also lead to inflation of all vruntimes since the doubled
      vruntime value will become the rq's min_vruntime when other tasks leave
      the rq. This leads to repeated doubling of the vruntime and priority
      penalty.
      
      Fix this by recognizing a WAKING task's vruntime as normalized only if
      sched_remote_wakeup is true. This indicates a migration, in which case
      the vruntime would have been normalized in migrate_task_rq_fair().
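
      As a hedged sketch, the relevant branch of vruntime_normalized() ends up
      shaped roughly like this: a WAKING task only counts as normalized when
      sched_remote_wakeup is set, i.e. migrate_task_rq_fair() has already
      subtracted min_vruntime.

        /* Sketch only; one branch of vruntime_normalized(). */
        if (p->state == TASK_WAKING)
            return p->sched_remote_wakeup;  /* normalized only after a migration */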
      
      Based on a similar patch from John Dias <joaodias@google.com>.
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Steve Muckle <smuckle@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Redpath <Chris.Redpath@arm.com>
      Cc: John Dias <joaodias@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Miguel de Dios <migueldedios@google.com>
      Cc: Morten Rasmussen <Morten.Rasmussen@arm.com>
      Cc: Patrick Bellasi <Patrick.Bellasi@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: kernel-team@android.com
      Fixes: b5179ac7 ("sched/fair: Prepare to fix fairness problems on migration")
      Link: http://lkml.kernel.org/r/20180831224217.169476-1-smuckle@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/pelt: Fix update_blocked_averages() for RT and DL classes · 12b04875
      Authored by Vincent Guittot
      update_blocked_averages() is called to periodically decay the stalled load
      of idle CPUs and to sync all loads before running load balance.
      
      When the cfs rq is idle, it triggers a load balance during pick_next_task_fair()
      in order to potentially pull tasks and make use of this newly idle CPU. This
      load balance happens while the prev task from another class has not yet been
      put and its utilization updated. This may lead to wrongly accounting running
      time as idle time for the RT or DL classes.
      
      Test that no RT or DL task is running when updating their utilization in
      update_blocked_averages().
      
      We still update RT and DL utilization instead of simply skipping them to
      make sure that all metrics are synced when used during load balance.
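
      A hedged sketch of the idea (helper signatures simplified): when decaying
      the RT and DL averages from update_blocked_averages(), only report a
      class as running if the current task actually belongs to it.

        /* Sketch only; inside update_blocked_averages(), with the rq lock held. */
        const struct sched_class *curr_class = rq->curr->sched_class;

        update_rt_rq_load_avg(rq_clock_task(rq), rq,
                              curr_class == &rt_sched_class);
        update_dl_rq_load_avg(rq_clock_task(rq), rq,
                              curr_class == &dl_sched_class);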
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 371bf427 ("sched/rt: Add rt_rq utilization tracking")
      Fixes: 3727e0e1 ("sched/dl: Add dl_rq utilization tracking")
      Link: http://lkml.kernel.org/r/1535728975-22799-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/topology: Set correct NUMA topology type · e5e96faf
      Authored by Srikar Dronamraju
      With the following commit:
      
        051f3ca0 ("sched/topology: Introduce NUMA identity node sched domain")
      
      the scheduler introduced a new NUMA level. However, this causes the NUMA
      topology on 2-node systems to no longer be marked as NUMA_DIRECT.
      
      After this commit, it gets reported as NUMA_BACKPLANE, because
      sched_domains_numa_level is now 2 on 2-node systems.
      
      Fix this by allowing systems with up to 2 NUMA levels to be classified as
      NUMA_DIRECT.
      
      While at it, remove code that assumes that the level can be 0.
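
      As a hedged sketch, the classification in init_numa_topology_type()
      becomes roughly:

        /* Sketch only: up to 2 NUMA levels counts as a directly connected system. */
        if (sched_domains_numa_levels <= 2) {
            sched_numa_topology_type = NUMA_DIRECT;
            return;
        }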
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andre Wild <wild@linux.vnet.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
      Fixes: 051f3ca0 "Introduce NUMA identity node sched domain"
      Link: http://lkml.kernel.org/r/1533920419-17410-1-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/debug: Fix potential deadlock when writing to sched_features · e73e8197
      Authored by Jiada Wang
      The following lockdep report can be triggered by writing to /sys/kernel/debug/sched_features:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        4.18.0-rc6-00152-gcd3f77d7-dirty #18 Not tainted
        ------------------------------------------------------
        sh/3358 is trying to acquire lock:
        000000004ad3989d (cpu_hotplug_lock.rw_sem){++++}, at: static_key_enable+0x14/0x30
        but task is already holding lock:
        00000000c1b31a88 (&sb->s_type->i_mutex_key#3){+.+.}, at: sched_feat_write+0x160/0x428
        which lock already depends on the new lock.
        the existing dependency chain (in reverse order) is:
        -> #3 (&sb->s_type->i_mutex_key#3){+.+.}:
               lock_acquire+0xb8/0x148
               down_write+0xac/0x140
               start_creating+0x5c/0x168
               debugfs_create_dir+0x18/0x220
               opp_debug_register+0x8c/0x120
               _add_opp_dev+0x104/0x1f8
               dev_pm_opp_get_opp_table+0x174/0x340
               _of_add_opp_table_v2+0x110/0x760
               dev_pm_opp_of_add_table+0x5c/0x240
               dev_pm_opp_of_cpumask_add_table+0x5c/0x100
               cpufreq_init+0x160/0x430
               cpufreq_online+0x1cc/0xe30
               cpufreq_add_dev+0x78/0x198
               subsys_interface_register+0x168/0x270
               cpufreq_register_driver+0x1c8/0x278
               dt_cpufreq_probe+0xdc/0x1b8
               platform_drv_probe+0xb4/0x168
               driver_probe_device+0x318/0x4b0
               __device_attach_driver+0xfc/0x1f0
               bus_for_each_drv+0xf8/0x180
               __device_attach+0x164/0x200
               device_initial_probe+0x10/0x18
               bus_probe_device+0x110/0x178
               device_add+0x6d8/0x908
               platform_device_add+0x138/0x3d8
               platform_device_register_full+0x1cc/0x1f8
               cpufreq_dt_platdev_init+0x174/0x1bc
               do_one_initcall+0xb8/0x310
               kernel_init_freeable+0x4b8/0x56c
               kernel_init+0x10/0x138
               ret_from_fork+0x10/0x18
        -> #2 (opp_table_lock){+.+.}:
               lock_acquire+0xb8/0x148
               __mutex_lock+0x104/0xf50
               mutex_lock_nested+0x1c/0x28
               _of_add_opp_table_v2+0xb4/0x760
               dev_pm_opp_of_add_table+0x5c/0x240
               dev_pm_opp_of_cpumask_add_table+0x5c/0x100
               cpufreq_init+0x160/0x430
               cpufreq_online+0x1cc/0xe30
               cpufreq_add_dev+0x78/0x198
               subsys_interface_register+0x168/0x270
               cpufreq_register_driver+0x1c8/0x278
               dt_cpufreq_probe+0xdc/0x1b8
               platform_drv_probe+0xb4/0x168
               driver_probe_device+0x318/0x4b0
               __device_attach_driver+0xfc/0x1f0
               bus_for_each_drv+0xf8/0x180
               __device_attach+0x164/0x200
               device_initial_probe+0x10/0x18
               bus_probe_device+0x110/0x178
               device_add+0x6d8/0x908
               platform_device_add+0x138/0x3d8
               platform_device_register_full+0x1cc/0x1f8
               cpufreq_dt_platdev_init+0x174/0x1bc
               do_one_initcall+0xb8/0x310
               kernel_init_freeable+0x4b8/0x56c
               kernel_init+0x10/0x138
               ret_from_fork+0x10/0x18
        -> #1 (subsys mutex#6){+.+.}:
               lock_acquire+0xb8/0x148
               __mutex_lock+0x104/0xf50
               mutex_lock_nested+0x1c/0x28
               subsys_interface_register+0xd8/0x270
               cpufreq_register_driver+0x1c8/0x278
               dt_cpufreq_probe+0xdc/0x1b8
               platform_drv_probe+0xb4/0x168
               driver_probe_device+0x318/0x4b0
               __device_attach_driver+0xfc/0x1f0
               bus_for_each_drv+0xf8/0x180
               __device_attach+0x164/0x200
               device_initial_probe+0x10/0x18
               bus_probe_device+0x110/0x178
               device_add+0x6d8/0x908
               platform_device_add+0x138/0x3d8
               platform_device_register_full+0x1cc/0x1f8
               cpufreq_dt_platdev_init+0x174/0x1bc
               do_one_initcall+0xb8/0x310
               kernel_init_freeable+0x4b8/0x56c
               kernel_init+0x10/0x138
               ret_from_fork+0x10/0x18
        -> #0 (cpu_hotplug_lock.rw_sem){++++}:
               __lock_acquire+0x203c/0x21d0
               lock_acquire+0xb8/0x148
               cpus_read_lock+0x58/0x1c8
               static_key_enable+0x14/0x30
               sched_feat_write+0x314/0x428
               full_proxy_write+0xa0/0x138
               __vfs_write+0xd8/0x388
               vfs_write+0xdc/0x318
               ksys_write+0xb4/0x138
               sys_write+0xc/0x18
               __sys_trace_return+0x0/0x4
        other info that might help us debug this:
        Chain exists of:
          cpu_hotplug_lock.rw_sem --> opp_table_lock --> &sb->s_type->i_mutex_key#3
         Possible unsafe locking scenario:
               CPU0                    CPU1
               ----                    ----
          lock(&sb->s_type->i_mutex_key#3);
                                       lock(opp_table_lock);
                                       lock(&sb->s_type->i_mutex_key#3);
          lock(cpu_hotplug_lock.rw_sem);
         *** DEADLOCK ***
        2 locks held by sh/3358:
         #0: 00000000a8c4b363 (sb_writers#10){.+.+}, at: vfs_write+0x238/0x318
         #1: 00000000c1b31a88 (&sb->s_type->i_mutex_key#3){+.+.}, at: sched_feat_write+0x160/0x428
        stack backtrace:
        CPU: 5 PID: 3358 Comm: sh Not tainted 4.18.0-rc6-00152-gcd3f77d7-dirty #18
        Hardware name: Renesas H3ULCB Kingfisher board based on r8a7795 ES2.0+ (DT)
        Call trace:
         dump_backtrace+0x0/0x288
         show_stack+0x14/0x20
         dump_stack+0x13c/0x1ac
         print_circular_bug.isra.10+0x270/0x438
         check_prev_add.constprop.16+0x4dc/0xb98
         __lock_acquire+0x203c/0x21d0
         lock_acquire+0xb8/0x148
         cpus_read_lock+0x58/0x1c8
         static_key_enable+0x14/0x30
         sched_feat_write+0x314/0x428
         full_proxy_write+0xa0/0x138
         __vfs_write+0xd8/0x388
         vfs_write+0xdc/0x318
         ksys_write+0xb4/0x138
         sys_write+0xc/0x18
         __sys_trace_return+0x0/0x4
      
      This is because when loading the cpufreq_dt module we first acquire the
      cpu_hotplug_lock.rw_sem lock, and then, in cpufreq_init(), we take the
      &sb->s_type->i_mutex_key lock.
      
      But when writing to /sys/kernel/debug/sched_features, the
      cpu_hotplug_lock.rw_sem lock depends on the &sb->s_type->i_mutex_key lock.
      
      To fix this bug, reverse the lock acquisition order when writing to
      sched_features; this way cpu_hotplug_lock.rw_sem no longer depends on
      &sb->s_type->i_mutex_key.
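
      A hedged sketch of the reordering in sched_feat_write() (error handling
      elided): take cpus_read_lock() before the inode mutex, so that
      cpu_hotplug_lock no longer nests inside i_mutex.

        /* Sketch only; inside sched_feat_write(). */
        inode = file_inode(filp);
        cpus_read_lock();           /* hotplug lock first, matching the cpufreq path */
        inode_lock(inode);
        ret = sched_feat_set(cmp);  /* may flip a static key */
        inode_unlock(inode);
        cpus_read_unlock();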
      Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Jiada Wang <jiada_wang@mentor.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Eugeniu Rosca <erosca@de.adit-jv.com>
      Cc: George G. Davis <george_davis@mentor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180731121222.26195-1-jiada_wang@mentor.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/mutex: Fix mutex debug call and ww_mutex documentation · e13e2366
      Authored by Thomas Hellstrom
      The following commit:
      
        08295b3b ("Implement an algorithm choice for Wound-Wait mutexes")
      
      introduced a reference in the documentation to a function that was
      removed in an earlier commit.
      
      It also forgot to remove a call to debug_mutex_add_waiter() which is now
      unconditionally called by __mutex_add_waiter().
      
      Fix those bugs.
      Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dri-devel@lists.freedesktop.org
      Fixes: 08295b3b ("Implement an algorithm choice for Wound-Wait mutexes")
      Link: http://lkml.kernel.org/r/20180903140708.2401-1-thellstrom@vmware.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  10. 07 September 2018 (1 commit)
  11. 06 September 2018 (3 commits)
    • printk/tracing: Do not trace printk_nmi_enter() · d1c392c9
      Authored by Steven Rostedt (VMware)
      I hit the following splat in my tests:
      
      ------------[ cut here ]------------
      IRQs not enabled as expected
      WARNING: CPU: 3 PID: 0 at kernel/time/tick-sched.c:982 tick_nohz_idle_enter+0x44/0x8c
      Modules linked in: ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables ipv6
      CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.19.0-rc2-test+ #2
      Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
      EIP: tick_nohz_idle_enter+0x44/0x8c
      Code: ec 05 00 00 00 75 26 83 b8 c0 05 00 00 00 75 1d 80 3d d0 36 3e c1 00
      75 14 68 94 63 12 c1 c6 05 d0 36 3e c1 01 e8 04 ee f8 ff <0f> 0b 58 fa bb a0
      e5 66 c1 e8 25 0f 04 00 64 03 1d 28 31 52 c1 8b
      EAX: 0000001c EBX: f26e7f8c ECX: 00000006 EDX: 00000007
      ESI: f26dd1c0 EDI: 00000000 EBP: f26e7f40 ESP: f26e7f38
      DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010296
      CR0: 80050033 CR2: 0813c6b0 CR3: 2f342000 CR4: 001406f0
      Call Trace:
       do_idle+0x33/0x202
       cpu_startup_entry+0x61/0x63
       start_secondary+0x18e/0x1ed
       startup_32_smp+0x164/0x168
      irq event stamp: 18773830
      hardirqs last  enabled at (18773829): [<c040150c>] trace_hardirqs_on_thunk+0xc/0x10
      hardirqs last disabled at (18773830): [<c040151c>] trace_hardirqs_off_thunk+0xc/0x10
      softirqs last  enabled at (18773824): [<c0ddaa6f>] __do_softirq+0x25f/0x2bf
      softirqs last disabled at (18773767): [<c0416bbe>] call_on_stack+0x45/0x4b
      ---[ end trace b7c64aa79e17954a ]---
      
      After a bit of debugging, I found what was happening. This would trigger
      when performing "perf" with a high NMI interrupt rate, while enabling and
      disabling function tracer. Ftrace uses breakpoints to convert the nops at
      the start of functions to calls to the function trampolines. The breakpoint
      traps disable interrupts and this makes calls into lockdep via the
      trace_hardirqs_off_thunk in the entry.S code. What happens is the following:
      
        do_idle {
      
          [interrupts enabled]
      
          <interrupt> [interrupts disabled]
      	TRACE_IRQS_OFF [lockdep says irqs off]
      	[...]
      	TRACE_IRQS_IRET
      	    test if pt_regs say return to interrupts enabled [yes]
      	    TRACE_IRQS_ON [lockdep says irqs are on]
      
      	    <nmi>
      		nmi_enter() {
      		    printk_nmi_enter() [traced by ftrace]
      		    [ hit ftrace breakpoint ]
      		    <breakpoint exception>
      			TRACE_IRQS_OFF [lockdep says irqs off]
      			[...]
      			TRACE_IRQS_IRET [return from breakpoint]
      			   test if pt_regs say interrupts enabled [no]
      			   [iret back to interrupt]
      	   [iret back to code]
      
          tick_nohz_idle_enter() {
      
      	lockdep_assert_irqs_enabled() [lockdep say no!]
      
      Although interrupts are indeed enabled, lockdep thinks they are not, and since
      we now do asserts via lockdep, it gives a false warning. The issue here is
      that printk_nmi_enter() is called before lockdep_off(), which disables
      lockdep (for this reason) in NMIs. By simply not allowing ftrace to see
      printk_nmi_enter() (via notrace annotation) we keep lockdep from getting
      confused.
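
      The fix is essentially an annotation; as a hedged sketch:

        /* Sketch only: keep ftrace away from the NMI printk entry points. */
        void notrace printk_nmi_enter(void)
        {
            this_cpu_or(printk_context, PRINTK_NMI_CONTEXT_MASK);
        }

        void notrace printk_nmi_exit(void)
        {
            this_cpu_and(printk_context, ~PRINTK_NMI_CONTEXT_MASK);
        }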
      
      Cc: stable@vger.kernel.org
      Fixes: 42a0bb3f ("printk/nmi: generic solution for safe printk in NMI")
      Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Petr Mladek <pmladek@suse.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • cpu/hotplug: Prevent state corruption on error rollback · 69fa6eb7
      Authored by Thomas Gleixner
      When a teardown callback fails, the CPU hotplug code brings the CPU back to
      the previous state. The previous state becomes the new target state. The
      rollback happens in undo_cpu_down() which increments the state
      unconditionally even if the state is already the same as the target.
      
      As a consequence the next CPU hotplug operation will start at the wrong
      state. This is easy to observe when __cpu_disable() fails.
      
      Prevent the unconditional undo by checking the state vs. target before
      incrementing state and fix up the consequently wrong conditional in the
      unplug code which handles the failure of the final CPU take down on the
      control CPU side.
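
      A hedged, simplified sketch of the guard (the surrounding rollback code
      is elided): only step the state when it actually differs from the target.

        /* Sketch only: the rollback no longer increments unconditionally. */
        if (st->state < st->target)
            st->state++;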
      
      Fixes: 4dddfb5f ("smp/hotplug: Rewrite AP state machine core")
      Reported-by: Neeraj Upadhyay <neeraju@codeaurora.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Tested-by: Sudeep Holla <sudeep.holla@arm.com>
      Tested-by: Neeraj Upadhyay <neeraju@codeaurora.org>
      Cc: josh@joshtriplett.org
      Cc: peterz@infradead.org
      Cc: jiangshanlai@gmail.com
      Cc: dzickus@redhat.com
      Cc: brendan.jackman@arm.com
      Cc: malat@debian.org
      Cc: sramana@codeaurora.org
      Cc: linux-arm-msm@vger.kernel.org
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1809051419580.1416@nanos.tec.linutronix.de
      
    • cpu/hotplug: Adjust misplaced smp_mb() in cpuhp_thread_fun() · f8b7530a
      Authored by Neeraj Upadhyay
      The smp_mb() in cpuhp_thread_fun() is misplaced. It needs to be after the
      load of st->should_run to prevent reordering of the later load/stores
      w.r.t. the load of st->should_run.
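
      A hedged sketch of the ordering requirement (surrounding thread-function
      code elided):

        /* Sketch only; inside cpuhp_thread_fun(). */
        if (WARN_ON_ONCE(!st->should_run))
            return;

        /*
         * Ordered after the st->should_run load: later loads/stores of the
         * hotplug state must not be reordered before it.
         */
        smp_mb();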
      
      Fixes: 4dddfb5f ("smp/hotplug: Rewrite AP state machine core")
      Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: josh@joshtriplett.org
      Cc: peterz@infradead.org
      Cc: jiangshanlai@gmail.com
      Cc: dzickus@redhat.com
      Cc: brendan.jackman@arm.com
      Cc: malat@debian.org
      Cc: mojha@codeaurora.org
      Cc: sramana@codeaurora.org
      Cc: linux-arm-msm@vger.kernel.org
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/1536126727-11629-1-git-send-email-neeraju@codeaurora.org
  12. 05 September 2018 (1 commit)
  13. 03 September 2018 (1 commit)
    • bpf: avoid misuse of psock when TCP_ULP_BPF collides with another ULP · 597222f7
      Authored by John Fastabend
      Currently we check sk_user_data is non NULL to determine if the sk
      exists in a map. However, this is not sufficient to ensure the psock
      or the ULP ops are not in use by another user, such as kcm or TLS. To
      avoid this when adding a sock to a map also verify it is of the
      correct ULP type. Additionally, when releasing a psock verify that
      it is the TCP_ULP_BPF type before releasing the ULP. The error case
      where we abort an update due to ULP collision can cause this error
      path.
      
      For example,
      
        __sock_map_ctx_update_elem()
           [...]
           err = tcp_set_ulp_id(sock, TCP_ULP_BPF) <- collides with TLS
           if (err)                                <- so err out here
              goto out_free
           [...]
        out_free:
           smap_release_sock() <- calling tcp_cleanup_ulp releases the
                                  TLS ULP incorrectly.
      
      Fixes: 2f857d04 ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  14. 01 September 2018 (1 commit)
  15. 31 August 2018 (5 commits)
  16. 30 August 2018 (2 commits)
  17. 28 August 2018 (3 commits)
    • bpf: sockmap, decrement copied count correctly in redirect error case · 501ca817
      Authored by John Fastabend
      Currently, when a redirect occurs in sockmap and an error occurs in
      the redirect call we unwind the scatterlist once in the error path
      of bpf_tcp_sendmsg_do_redirect() and then again in sendmsg(). Then
      in the error path of sendmsg we decrement the copied count by the
      send size.
      
      However, it's possible we partially sent data before the error was
      generated. This can happen if do_tcp_sendpages() partially sends the
      scatterlist before encountering a memory pressure error. If this
      happens we need to decrement the copied value (the value tracking
      how many bytes were actually sent to TCP stack) by the number of
      remaining bytes _not_ the entire send size. Otherwise we risk
      confusing userspace.
      
      Also, we don't need two calls to free the scatterlist; one is
      good enough. So remove the one in bpf_tcp_sendmsg_do_redirect() and
      then properly reduce copied by the number of remaining bytes which
      may in fact be the entire send size if no bytes were sent.
      
      To do this use bool to indicate if free_start_sg() should do mem
      accounting or not.
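
      A hedged sketch of the accounting idea (the helper below and the bool
      argument are approximations based on the description above):

        /* Sketch only: on redirect failure, subtract only what was NOT sent. */
        if (ret < 0) {
            size_t remaining = sg_remaining_bytes(sg);  /* illustrative helper */

            copied -= remaining;            /* not the entire send size */
            free_start_sg(sk, sg, true);    /* single free; true = do mem accounting */
        }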
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf, sockmap: fix psock refcount leak in bpf_tcp_recvmsg · 15c480ef
      Authored by Daniel Borkmann
      In bpf_tcp_recvmsg() we first took a reference on the psock, however
      once we find that there are skbs in the normal socket's receive queue
      we return and process them through tcp_recvmsg(). The problem is that
      we leak the taken reference on the psock in that path. Given we don't
      really do anything with the psock at this point, move the skb_queue_empty()
      test before we fetch the psock to fix this case.
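
      A hedged sketch of the reordering (locking and the rest of
      bpf_tcp_recvmsg() elided): check the normal receive queue before taking
      the psock reference, so the early-return path has nothing to leak.

        /* Sketch only; top of bpf_tcp_recvmsg(). */
        if (unlikely(flags & MSG_ERRQUEUE))
            return inet_recv_error(sk, msg, len, addr_len);
        if (!skb_queue_empty(&sk->sk_receive_queue))
            return tcp_recvmsg(sk, msg, len, nonblock, flags, addr_len);

        rcu_read_lock();
        psock = smap_psock_sk(sk);      /* the reference is taken only past this point */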
      
      Fixes: 8934ce2f ("bpf: sockmap redirect ingress support")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, sockmap: fix potential use after free in bpf_tcp_close · e06fa9c1
      Authored by Daniel Borkmann
      In bpf_tcp_close() we pop the psock linkage to a map via psock_map_pop().
      A parallel update on the sock hash map can happen between psock_map_pop()
      and lookup_elem_raw() where we override the element under link->hash /
      link->key. In bpf_tcp_close()'s lookup_elem_raw() we subsequently only
      test whether an element is present, but we do not test whether the
      element is in fact the element we were looking for.
      
      We lock the sock in bpf_tcp_close() during that time, and so does
      sock_hash_update_elem(). However, the latter locks the
      sock which is newly updated, not the one we're purging from the hash
      table. This means that while one CPU is doing the lookup from bpf_tcp_close(),
      another CPU doing the map update in parallel may have dropped our sock from
      the hlist and released the psock.
      
      Subsequently the first CPU will find the new sock and attempts to drop
      and release the old sock yet another time. Fix is that we need to check
      the elements for a match after lookup, similar as we do in the sock map.
      Note that the hash tab elems are freed via RCU, so access to their
      link->hash / link->key is fine since we're under RCU read side there.
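
      A hedged sketch of the added check (names approximate): after
      lookup_elem_raw(), only drop and release the sock if the element still
      refers to the sock being closed.

        /* Sketch only; hash-map branch of bpf_tcp_close(). */
        elem = lookup_elem_raw(head, hash, key, key_size);
        /* A parallel update may have replaced the slot: make sure it is still ours. */
        if (elem && elem->sk == sk) {
            hlist_del_rcu(&elem->hash_node);
            smap_release_sock(psock, elem->sk);
            free_htab_elem(htab, elem);
        }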
      
      Fixes: e9db4ef6 ("bpf: sockhash fix omitted bucket lock in sock_close")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>