1. 13 Nov 2019, 1 commit
• sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices · 502bd151
  Dave Chiluk authored
      commit de53fd7aedb100f03e5d2231cfce0e4993282425 upstream.
      
      It has been observed, that highly-threaded, non-cpu-bound applications
      running under cpu.cfs_quota_us constraints can hit a high percentage of
      periods throttled while simultaneously not consuming the allocated
      amount of quota. This use case is typical of user-interactive non-cpu
      bound applications, such as those running in kubernetes or mesos when
      run on multiple cpu cores.
      
This has been root-caused to the cpu-local run queues being allocated per-cpu
bandwidth slices and then not fully using those slices within the period,
at which point the slice and quota expire. This expiration of unused
slices results in applications not being able to utilize the quota
they were allocated.
      
      The non-expiration of per-cpu slices was recently fixed by
      'commit 512ac999 ("sched/fair: Fix bandwidth timer clock drift
      condition")'. Prior to that it appears that this had been broken since
      at least 'commit 51f2176d ("sched/fair: Fix unlocked reads of some
      cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
      added the following conditional which resulted in slices never being
      expired.
      
      if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
      	/* extend local deadline, drift is bounded above by 2 ticks */
      	cfs_rq->runtime_expires += TICK_NSEC;
      
      Because this was broken for nearly 5 years, and has recently been fixed
      and is now being noticed by many users running kubernetes
      (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
      that the mechanisms around expiring runtime should be removed
      altogether.
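
To make the mechanism being removed concrete, here is a minimal,
self-contained sketch (hypothetical simplified types and field names, not
the kernel's actual cfs_rq/cfs_bandwidth code): each CPU holds a local
slice of the group quota, and before this change any unused remainder was
discarded once the global period rolled over.

  #include <stdio.h>

  /* Hypothetical, simplified stand-ins for the kernel's bandwidth state. */
  struct bandwidth { unsigned long period_seq; };   /* global period counter */
  struct local_rq {
          long runtime_remaining;             /* ns left of the per-cpu slice */
          unsigned long period_seq;           /* period the slice was handed out in */
  };

  /* Pre-change behaviour being removed: discard the slice on period rollover. */
  static void expire_local_slice(struct local_rq *rq, const struct bandwidth *b)
  {
          if (rq->period_seq != b->period_seq)
                  rq->runtime_remaining = 0;
  }

  int main(void)
  {
          struct bandwidth b = { .period_seq = 2 };
          struct local_rq rq = { .runtime_remaining = 1000000, .period_seq = 1 };

          expire_local_slice(&rq, &b);        /* the unused ~1ms slice is thrown away */
          printf("runtime remaining after rollover: %ld ns\n", rq.runtime_remaining);
          return 0;
  }

With the expiration removed, that leftover runtime (at most
min_cfs_rq_runtime per CPU) simply remains usable past the period boundary.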
      
      This allows quota already allocated to per-cpu run-queues to live longer
      than the period boundary. This allows threads on runqueues that do not
      use much CPU to continue to use their remaining slice over a longer
period of time than cpu.cfs_period_us. In turn, this helps prevent the
above condition of hitting throttling while not fully utilizing the
allocated cpu quota.
      
This theoretically allows a machine to use slightly more than its
allotted quota in some periods. This overflow would be bounded by the
remaining quota left on each per-cpu runqueue. This is typically no
more than min_cfs_rq_runtime=1ms per cpu (roughly 80ms in total on the
80-CPU test machine mentioned below). For CPU bound tasks this will
      change nothing, as they should theoretically fully utilize all of their
      quota in each period. For user-interactive tasks as described above this
      provides a much better user/application experience as their cpu
      utilization will more closely match the amount they requested when they
      hit throttling. This means that cpu limits no longer strictly apply per
      period for non-cpu bound applications, but that they are still accurate
      over longer timeframes.
      
      This greatly improves performance of high-thread-count, non-cpu bound
      applications with low cfs_quota_us allocation on high-core-count
machines. In the case of an artificial testcase (10ms/100ms of quota on
an 80-CPU machine), this commit resulted in almost 30x performance
      improvement, while still maintaining correct cpu quota restrictions.
      That testcase is available at https://github.com/indeedeng/fibtest.
      
      Fixes: 512ac999 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Ben Segall <bsegall@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: John Hammond <jhammond@indeed.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kyle Anderson <kwa@yelp.com>
      Cc: Gabriel Munos <gmunoz@netflix.com>
      Cc: Peter Oskolkov <posk@posk.io>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Brendan Gregg <bgregg@netflix.com>
Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      502bd151
2. 10 Nov 2019, 1 commit
• tracing: Fix "gfp_t" format for synthetic events · fa18f803
  Zhengjun Xing authored
      [ Upstream commit 9fa8c9c647be624e91b09ecffa7cd97ee0600b40 ]
      
In the format of synthetic events, "gfp_t" is shown as "signed:1",
but in fact "gfp_t" is unsigned and should be shown as "signed:0".
      
      The issue can be reproduced by the following commands:
      
      echo 'memlatency u64 lat; unsigned int order; gfp_t gfp_flags; int migratetype' > /sys/kernel/debug/tracing/synthetic_events
      cat  /sys/kernel/debug/tracing/events/synthetic/memlatency/format
      
      name: memlatency
      ID: 2233
      format:
              field:unsigned short common_type;       offset:0;       size:2; signed:0;
              field:unsigned char common_flags;       offset:2;       size:1; signed:0;
              field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
              field:int common_pid;   offset:4;       size:4; signed:1;
      
              field:u64 lat;  offset:8;       size:8; signed:0;
              field:unsigned int order;       offset:16;      size:4; signed:0;
              field:gfp_t gfp_flags;  offset:24;      size:4; signed:1;
              field:int migratetype;  offset:32;      size:4; signed:1;
      
      print fmt: "lat=%llu, order=%u, gfp_flags=%x, migratetype=%d", REC->lat, REC->order, REC->gfp_flags, REC->migratetype
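
A minimal sketch of the kind of type-name check implied by the fix (a
hypothetical stand-in, not the kernel's actual synthetic-event helper):
gfp_t has to be classified as unsigned so that the format shows signed:0.

  #include <string.h>

  /* Hypothetical helper: derive the "signed:" flag from a field's type name. */
  static int synth_field_is_signed(const char *type)
  {
          if (strncmp(type, "u", 1) == 0)     /* u8/u16/u32/u64, unsigned ... */
                  return 0;
          if (strcmp(type, "gfp_t") == 0)     /* gfp_t is an unsigned bitmask type */
                  return 0;
          return 1;
  }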
      
Link: http://lkml.kernel.org/r/20191018012034.6404-1-zhengjun.xing@linux.intel.com
Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Zhengjun Xing <zhengjun.xing@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
      fa18f803
3. 06 Nov 2019, 2 commits
• tracing: Initialize iter->seq after zeroing in tracing_read_pipe() · 394c90d9
  Petr Mladek authored
      [ Upstream commit d303de1fcf344ff7c15ed64c3f48a991c9958775 ]
      
      A customer reported the following softlockup:
      
      [899688.160002] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [test.sh:16464]
      [899688.160002] CPU: 0 PID: 16464 Comm: test.sh Not tainted 4.12.14-6.23-azure #1 SLE12-SP4
      [899688.160002] RIP: 0010:up_write+0x1a/0x30
      [899688.160002] Kernel panic - not syncing: softlockup: hung tasks
      [899688.160002] RIP: 0010:up_write+0x1a/0x30
      [899688.160002] RSP: 0018:ffffa86784d4fde8 EFLAGS: 00000257 ORIG_RAX: ffffffffffffff12
      [899688.160002] RAX: ffffffff970fea00 RBX: 0000000000000001 RCX: 0000000000000000
      [899688.160002] RDX: ffffffff00000001 RSI: 0000000000000080 RDI: ffffffff970fea00
      [899688.160002] RBP: ffffffffffffffff R08: ffffffffffffffff R09: 0000000000000000
      [899688.160002] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8b59014720d8
      [899688.160002] R13: ffff8b59014720c0 R14: ffff8b5901471090 R15: ffff8b5901470000
      [899688.160002]  tracing_read_pipe+0x336/0x3c0
      [899688.160002]  __vfs_read+0x26/0x140
      [899688.160002]  vfs_read+0x87/0x130
      [899688.160002]  SyS_read+0x42/0x90
      [899688.160002]  do_syscall_64+0x74/0x160
      
      It caught the process in the middle of trace_access_unlock(). There is
      no loop. So, it must be looping in the caller tracing_read_pipe()
      via the "waitagain" label.
      
Crashdump analysis uncovered that iter->seq was completely zeroed
      at this point, including iter->seq.seq.size. It means that
      print_trace_line() was never able to print anything and
      there was no forward progress.
      
      The culprit seems to be in the code:
      
      	/* reset all but tr, trace, and overruns */
      	memset(&iter->seq, 0,
      	       sizeof(struct trace_iterator) -
      	       offsetof(struct trace_iterator, seq));
      
It was added by commit 53d0aa77 ("ftrace: add logic to record overruns")
in v2.6.27-rc1, at a time when iter->seq looked like:
      
           struct trace_seq {
      	unsigned char		buffer[PAGE_SIZE];
      	unsigned int		len;
           };
      
      There was no "size" variable and zeroing was perfectly fine.
      
      The solution is to reinitialize the structure after or without
      zeroing.
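
A self-contained illustration of the pitfall and of the described fix
(simplified, hypothetical types standing in for struct trace_seq, not the
kernel's actual code):

  #include <stdio.h>
  #include <string.h>

  /* Stand-in for struct trace_seq: it now carries a "size" field that
   * must never be left at zero. */
  struct seq_sketch {
          char buffer[64];
          unsigned int size;   /* capacity; zero means nothing can ever be printed */
          unsigned int len;
  };

  static void seq_init(struct seq_sketch *s)
  {
          s->size = sizeof(s->buffer);
          s->len = 0;
  }

  int main(void)
  {
          struct seq_sketch seq;

          seq_init(&seq);
          memset(&seq, 0, sizeof(seq));  /* wholesale zeroing also clears ->size ... */
          seq_init(&seq);                /* ... so reinitialize right after zeroing */

          printf("size after reinit: %u\n", seq.size);
          return 0;
  }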
      
Link: http://lkml.kernel.org/r/20191011142134.11997-1-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
      394c90d9
• sched/vtime: Fix guest/system mis-accounting on task switch · 2cd003a8
  Frederic Weisbecker authored
      [ Upstream commit 68e7a4d66b0ce04bf18ff2ffded5596ab3618585 ]
      
      vtime_account_system() assumes that the target task to account cputime
      to is always the current task. This is most often true indeed except on
      task switch where we call:
      
      	vtime_common_task_switch(prev)
      		vtime_account_system(prev)
      
Here prev is the scheduling-out task, the one we account the cputime to. It
doesn't match current, which is already the scheduling-in task at this
stage of the context switch.
      
      So we end up checking the wrong task flags to determine if we are
      accounting guest or system time to the previous task.
      
      As a result the wrong task is used to check if the target is running in
      guest mode. We may then spuriously account or leak either system or
      guest time on task switch.
      
Fix this assumption, and also make vtime_guest_enter/exit() use the
task passed as a parameter, to avoid similar issues in the future.
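
As a rough, self-contained illustration (hypothetical types and names; the
kernel checks PF_VCPU on the task it accounts to): the classification must
look at the task passed in, not at current, which is already the incoming
task at this stage of the switch.

  #include <stdbool.h>
  #include <stdio.h>

  /* Hypothetical stand-ins; not the kernel's task_struct or flags. */
  struct task { bool running_guest; const char *name; };

  static struct task *current_task;           /* stand-in for "current" */

  /* Post-fix behaviour: classify by the task we account to, not by current. */
  static const char *cputime_class(const struct task *target)
  {
          return target->running_guest ? "guest" : "system";
  }

  int main(void)
  {
          struct task prev = { .running_guest = true,  .name = "qemu-vcpu" };
          struct task next = { .running_guest = false, .name = "worker" };

          current_task = &next;   /* during the switch, current is already next */
          printf("account %s time to %s\n", cputime_class(&prev), prev.name);
          return 0;
  }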
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Fixes: 2a42eb95 ("sched/cputime: Accumulate vtime on top of nsec clocksource")
Link: https://lkml.kernel.org/r/20190925214242.21873-1-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
      2cd003a8
4. 29 Oct 2019, 2 commits
• tracing: Fix race in perf_trace_buf initialization · 5ce7528c
  Prateek Sood authored
      commit 6b1340cc00edeadd52ebd8a45171f38c8de2a387 upstream.
      
A race condition exists while initializing perf_trace_buf from
      perf_trace_init() and perf_kprobe_init().
      
            CPU0                                        CPU1
      perf_trace_init()
        mutex_lock(&event_mutex)
          perf_trace_event_init()
            perf_trace_event_reg()
              total_ref_count == 0
      	buf = alloc_percpu()
              perf_trace_buf[i] = buf
              tp_event->class->reg() //fails       perf_kprobe_init()
      	goto fail                              perf_trace_event_init()
                                                       perf_trace_event_reg()
              fail:
      	  total_ref_count == 0
      
                                                         total_ref_count == 0
                                                         buf = alloc_percpu()
                                                         perf_trace_buf[i] = buf
                                                         tp_event->class->reg()
                                                         total_ref_count++
      
                free_percpu(perf_trace_buf[i])
                perf_trace_buf[i] = NULL
      
      Any subsequent call to perf_trace_event_reg() will observe total_ref_count > 0,
      causing the perf_trace_buf to be always NULL. This can result in perf_trace_buf
      getting accessed from perf_trace_buf_alloc() without being initialized. Acquiring
      event_mutex in perf_kprobe_init() before calling perf_trace_event_init() should
      fix this race.
      
      The race caused the following bug:
      
       Unable to handle kernel paging request at virtual address 0000003106f2003c
       Mem abort info:
         ESR = 0x96000045
         Exception class = DABT (current EL), IL = 32 bits
         SET = 0, FnV = 0
         EA = 0, S1PTW = 0
       Data abort info:
         ISV = 0, ISS = 0x00000045
         CM = 0, WnR = 1
       user pgtable: 4k pages, 39-bit VAs, pgdp = ffffffc034b9b000
       [0000003106f2003c] pgd=0000000000000000, pud=0000000000000000
       Internal error: Oops: 96000045 [#1] PREEMPT SMP
       Process syz-executor (pid: 18393, stack limit = 0xffffffc093190000)
       pstate: 80400005 (Nzcv daif +PAN -UAO)
       pc : __memset+0x20/0x1ac
       lr : memset+0x3c/0x50
       sp : ffffffc09319fc50
      
        __memset+0x20/0x1ac
        perf_trace_buf_alloc+0x140/0x1a0
        perf_trace_sys_enter+0x158/0x310
        syscall_trace_enter+0x348/0x7c0
        el0_svc_common+0x11c/0x368
        el0_svc_handler+0x12c/0x198
        el0_svc+0x8/0xc
      
      Ramdumps showed the following:
        total_ref_count = 3
        perf_trace_buf = (
            0x0 -> NULL,
            0x0 -> NULL,
            0x0 -> NULL,
            0x0 -> NULL)
      
      Link: http://lkml.kernel.org/r/1571120245-4186-1-git-send-email-prsood@codeaurora.org
      
      Cc: stable@vger.kernel.org
      Fixes: e12f03d7 ("perf/core: Implement the 'perf_kprobe' PMU")
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Prateek Sood <prsood@codeaurora.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      5ce7528c
• perf/aux: Fix AUX output stopping · 96202569
  Alexander Shishkin authored
      commit f3a519e4add93b7b31a6616f0b09635ff2e6a159 upstream.
      
      Commit:
      
        8a58ddae2379 ("perf/core: Fix exclusive events' grouping")
      
      allows CAP_EXCLUSIVE events to be grouped with other events. Since all
      of those also happen to be AUX events (which is not the case the other
      way around, because arch/s390), this changes the rules for stopping the
      output: the AUX event may not be on its PMU's context any more, if it's
      grouped with a HW event, in which case it will be on that HW event's
      context instead. If that's the case, munmap() of the AUX buffer can't
      find and stop the AUX event, potentially leaving the last reference with
      the atomic context, which will then end up freeing the AUX buffer. This
will then trip warnings.
      
      Fix this by using the context's PMU context when looking for events
      to stop, instead of the event's PMU context.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20191022073940.61814-1-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      96202569
5. 18 Oct 2019, 7 commits
6. 12 Oct 2019, 7 commits
• tick: broadcast-hrtimer: Fix a race in bc_set_next · e5331c37
  Balasubramani Vivekanandan authored
      [ Upstream commit b9023b91dd020ad7e093baa5122b6968c48cc9e0 ]
      
      When a cpu requests broadcasting, before starting the tick broadcast
      hrtimer, bc_set_next() checks if the timer callback (bc_handler) is active
      using hrtimer_try_to_cancel(). But hrtimer_try_to_cancel() does not provide
the required synchronization when the callback is active on another core.
      
      The callback could have already executed tick_handle_oneshot_broadcast()
      and could have also returned. But still there is a small time window where
      the hrtimer_try_to_cancel() returns -1. In that case bc_set_next() returns
      without doing anything, but the next_event of the tick broadcast clock
      device is already set to a timeout value.
      
      In the race condition diagram below, CPU #1 is running the timer callback
      and CPU #2 is entering idle state and so calls bc_set_next().
      
      In the worst case, the next_event will contain an expiry time, but the
      hrtimer will not be started which happens when the racing callback returns
HRTIMER_NORESTART. The hrtimer might never recover if all further requests
from the CPUs to subscribe to tick broadcast have a timeout greater than the
next_event of the tick broadcast clock device. This leads to a cascade of
failures, finally noticed as RCU stall warnings.
      
      Here is a depiction of the race condition
      
      CPU #1 (Running timer callback)                   CPU #2 (Enter idle
                                                        and subscribe to
                                                        tick broadcast)
      ---------------------                             ---------------------
      
      __run_hrtimer()                                   tick_broadcast_enter()
      
        bc_handler()                                      __tick_broadcast_oneshot_control()
      
          tick_handle_oneshot_broadcast()
      
            raw_spin_lock(&tick_broadcast_lock);
      
            dev->next_event = KTIME_MAX;                  //wait for tick_broadcast_lock
            //next_event for tick broadcast clock
            set to KTIME_MAX since no other cores
            subscribed to tick broadcasting
      
            raw_spin_unlock(&tick_broadcast_lock);
      
          if (dev->next_event == KTIME_MAX)
            return HRTIMER_NORESTART
          // callback function exits without
             restarting the hrtimer                      //tick_broadcast_lock acquired
                                                         raw_spin_lock(&tick_broadcast_lock);
      
                                                         tick_broadcast_set_event()
      
                                                           clockevents_program_event()
      
                                                             dev->next_event = expires;
      
                                                             bc_set_next()
      
                                                               hrtimer_try_to_cancel()
                                                               //returns -1 since the timer
                                                               callback is active. Exits without
                                                               restarting the timer
        cpu_base->running = NULL;
      
      The comment that hrtimer cannot be armed from within the callback is
      wrong. It is fine to start the hrtimer from within the callback. Also it is
      safe to start the hrtimer from the enter/exit idle code while the broadcast
      handler is active. The enter/exit idle code and the broadcast handler are
      synchronized using tick_broadcast_lock. So there is no need for the
      existing try to cancel logic. All this can be removed which will eliminate
      the race condition as well.
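
In kernel terms, the resulting flow can be sketched roughly as follows (a
simplified, non-literal sketch of what the paragraph above describes, not
the exact patch; it will not build outside the tick broadcast code):

  /* With tick_broadcast_lock serializing the idle enter/exit path and the
   * broadcast handler, bc_set_next() can simply (re)arm the hrtimer, with
   * no try-to-cancel dance. */
  static int bc_set_next(ktime_t expires, struct clock_event_device *bc)
  {
          hrtimer_start(&bctimer, expires, HRTIMER_MODE_ABS_PINNED);
          return 0;
  }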
      
      Fixes: 5d1638ac ("tick: Introduce hrtimer based broadcast")
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Balasubramani Vivekanandan <balasubramani_vivekanandan@mentor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20190926135101.12102-2-balasubramani_vivekanandan@mentor.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
      e5331c37
• kernel/elfcore.c: include proper prototypes · b0aaf65b
  Valdis Kletnieks authored
      [ Upstream commit 0f74914071ab7e7b78731ed62bf350e3a344e0a5 ]
      
When building with W=1, gcc properly complains that there are no prototypes:
      
        CC      kernel/elfcore.o
      kernel/elfcore.c:7:17: warning: no previous prototype for 'elf_core_extra_phdrs' [-Wmissing-prototypes]
          7 | Elf_Half __weak elf_core_extra_phdrs(void)
            |                 ^~~~~~~~~~~~~~~~~~~~
      kernel/elfcore.c:12:12: warning: no previous prototype for 'elf_core_write_extra_phdrs' [-Wmissing-prototypes]
         12 | int __weak elf_core_write_extra_phdrs(struct coredump_params *cprm, loff_t offset)
            |            ^~~~~~~~~~~~~~~~~~~~~~~~~~
      kernel/elfcore.c:17:12: warning: no previous prototype for 'elf_core_write_extra_data' [-Wmissing-prototypes]
         17 | int __weak elf_core_write_extra_data(struct coredump_params *cprm)
            |            ^~~~~~~~~~~~~~~~~~~~~~~~~
      kernel/elfcore.c:22:15: warning: no previous prototype for 'elf_core_extra_data_size' [-Wmissing-prototypes]
         22 | size_t __weak elf_core_extra_data_size(void)
            |               ^~~~~~~~~~~~~~~~~~~~~~~~
      
Provide the include file so gcc is happy, and we don't have potential code drift.
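
Concretely, the fix amounts to pulling the declarations into the file that
defines these weak hooks, roughly (hedged sketch of the one-line change):

  /* kernel/elfcore.c: include the header declaring the weak hooks so the
   * definitions are checked against their prototypes. */
  #include <linux/elfcore.h>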
      
Link: http://lkml.kernel.org/r/29875.1565224705@turing-police
Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
      b0aaf65b
• sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr() · 46ff0e2f
  KeMeng Shi authored
      [ Upstream commit 714e501e16cd473538b609b3e351b2cc9f7f09ed ]
      
      An oops can be triggered in the scheduler when running qemu on arm64:
      
       Unable to handle kernel paging request at virtual address ffff000008effe40
       Internal error: Oops: 96000007 [#1] SMP
       Process migration/0 (pid: 12, stack limit = 0x00000000084e3736)
       pstate: 20000085 (nzCv daIf -PAN -UAO)
       pc : __ll_sc___cmpxchg_case_acq_4+0x4/0x20
       lr : move_queued_task.isra.21+0x124/0x298
       ...
       Call trace:
        __ll_sc___cmpxchg_case_acq_4+0x4/0x20
        __migrate_task+0xc8/0xe0
        migration_cpu_stop+0x170/0x180
        cpu_stopper_thread+0xec/0x178
        smpboot_thread_fn+0x1ac/0x1e8
        kthread+0x134/0x138
        ret_from_fork+0x10/0x18
      
__set_cpus_allowed_ptr() will choose an active dest_cpu in the affinity mask to
migrate the process to if the process is not currently running on any of the
CPUs specified in the affinity mask. __set_cpus_allowed_ptr() will choose an
invalid dest_cpu (dest_cpu >= nr_cpu_ids, 1024 in my virtual machine) if the
CPUs in the affinity mask are deactivated by cpu_down after the
cpumask_intersects check. The subsequent cpumask_test_cpu() of dest_cpu is then
out of bounds and may pass if the corresponding bit is coincidentally set. As a
consequence, the kernel will access an invalid rq address associated with the
invalid CPU in migration_cpu_stop->__migrate_task->move_queued_task and the
Oops occurs.
      
To reproduce the crash:
      
        1) A process repeatedly binds itself to cpu0 and cpu1 in turn by calling
        sched_setaffinity.
      
        2) A shell script repeatedly does "echo 0 > /sys/devices/system/cpu/cpu1/online"
        and "echo 1 > /sys/devices/system/cpu/cpu1/online" in turn.
      
  3) The Oops appears if the invalid CPU is set in memory after the cpumask has been tested.
Signed-off-by: KeMeng Shi <shikemeng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1568616808-16808-1-git-send-email-shikemeng@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
      46ff0e2f
• sched/membarrier: Fix private expedited registration check · 6cb7aa1b
  Mathieu Desnoyers authored
      [ Upstream commit fc0d77387cb5ae883fd774fc559e056a8dde024c ]
      
      Fix a logic flaw in the way membarrier_register_private_expedited()
      handles ready state checks for private expedited sync core and private
      expedited registrations.
      
      If a private expedited membarrier registration is first performed, and
      then a private expedited sync_core registration is performed, the ready
      state check will skip the second registration when it really should not.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190919173705.2181-2-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
      6cb7aa1b
• Revert "locking/pvqspinlock: Don't wait if vCPU is preempted" · e409b81d
  Wanpeng Li authored
      commit 89340d0935c9296c7b8222b6eab30e67cb57ab82 upstream.
      
This patch reverts commit 75437bb3 (locking/pvqspinlock: Don't
wait if vCPU is preempted).  A large performance regression was caused
by this commit in over-subscription scenarios.
      
      The test was run on a Xeon Skylake box, 2 sockets, 40 cores, 80 threads,
      with three VMs of 80 vCPUs each.  The score of ebizzy -M is reduced from
      13000-14000 records/s to 1700-1800 records/s:
      
                Host                Guest                score
      
      vanilla w/o kvm optimizations     upstream    1700-1800 records/s
      vanilla w/o kvm optimizations     revert      13000-14000 records/s
      vanilla w/ kvm optimizations      upstream    4500-5000 records/s
      vanilla w/ kvm optimizations      revert      14000-15500 records/s
      
      Exit from aggressive wait-early mechanism can result in premature yield
      and extra scheduling latency.
      
      Actually, only 6% of wait_early events are caused by vcpu_is_preempted()
      being true.  However, when one vCPU voluntarily releases its vCPU, all
the subsequent waiters in the queue will do the same and the cascading
      effect leads to bad performance.
      
      kvm optimizations:
      [1] commit d73eb57b80b (KVM: Boost vCPUs that are delivering interrupts)
      [2] commit 266e85a5ec9 (KVM: X86: Boost queue head vCPU to mitigate lock waiter preemption)
      
      Tested-by: loobinliu@tencent.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: loobinliu@tencent.com
      Cc: stable@vger.kernel.org
      Fixes: 75437bb3 (locking/pvqspinlock: Don't wait if vCPU is preempted)
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e409b81d
• timer: Read jiffies once when forwarding base clk · 06f25021
  Li RongQing authored
      commit e430d802d6a3aaf61bd3ed03d9404888a29b9bf9 upstream.
      
      The timer delayed for more than 3 seconds warning was triggered during
      testing.
      
        Workqueue: events_unbound sched_tick_remote
        RIP: 0010:sched_tick_remote+0xee/0x100
        ...
        Call Trace:
         process_one_work+0x18c/0x3a0
         worker_thread+0x30/0x380
         kthread+0x113/0x130
         ret_from_fork+0x22/0x40
      
      The reason is that the code in collect_expired_timers() uses jiffies
      unprotected:
      
          if (next_event > jiffies)
              base->clk = jiffies;
      
As the compiler is allowed to reload the value, base->clk can advance
between the check and the store and, in the worst case, advance farther than
next_event. That causes the timer expiry to be delayed until the wheel
pointer wraps around.
      
Convert the code to use READ_ONCE().
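
A self-contained sketch of the conversion (READ_ONCE() is spelled out here
as a volatile read; "jiffies_sketch" and "clk" are stand-ins, not the
kernel's symbols):

  #define READ_ONCE_SKETCH(x) (*(const volatile __typeof__(x) *)&(x))

  static unsigned long jiffies_sketch;

  static void forward_clk(unsigned long *clk, unsigned long next_event)
  {
          unsigned long now = READ_ONCE_SKETCH(jiffies_sketch); /* read once */

          if (next_event > now)
                  *clk = now;   /* check and store use the same snapshot */
  }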
      
      Fixes: 23696838 ("timers: Optimize collect_expired_timers() for NOHZ")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Liang ZhiCheng <liangzhicheng@baidu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1568894687-14499-1-git-send-email-lirongqing@baidu.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      06f25021
• tracing: Make sure variable reference alias has correct var_ref_idx · e010c983
  Tom Zanussi authored
      commit 17f8607a1658a8e70415eef67909f990d13017b5 upstream.
      
      Original changelog from Steve Rostedt (except last sentence which
      explains the problem, and the Fixes: tag):
      
      I performed a three way histogram with the following commands:
      
      echo 'irq_lat u64 lat pid_t pid' > synthetic_events
      echo 'wake_lat u64 lat u64 irqlat pid_t pid' >> synthetic_events
      echo 'hist:keys=common_pid:irqts=common_timestamp.usecs if function == 0xffffffff81200580' > events/timer/hrtimer_start/trigger
      echo 'hist:keys=common_pid:lat=common_timestamp.usecs-$irqts:onmatch(timer.hrtimer_start).irq_lat($lat,pid) if common_flags & 1' > events/sched/sched_waking/trigger
      echo 'hist:keys=pid:wakets=common_timestamp.usecs,irqlat=lat' > events/synthetic/irq_lat/trigger
      echo 'hist:keys=next_pid:lat=common_timestamp.usecs-$wakets,irqlat=$irqlat:onmatch(synthetic.irq_lat).wake_lat($lat,$irqlat,next_pid)' > events/sched/sched_switch/trigger
      echo 1 > events/synthetic/wake_lat/enable
      
      Basically I wanted to see:
      
       hrtimer_start (calling function tick_sched_timer)
      
      Note:
      
        # grep tick_sched_timer /proc/kallsyms
      ffffffff81200580 t tick_sched_timer
      
      And save the time of that, and then record sched_waking if it is called
      in interrupt context and with the same pid as the hrtimer_start, it
      will record the latency between that and the waking event.
      
      I then look at when the task that is woken is scheduled in, and record
      the latency between the wakeup and the task running.
      
      At the end, the wake_lat synthetic event will show the wakeup to
scheduled latency, as well as the irq latency from hrtimer_start to
      the wakeup. The problem is that I found this:
      
                <idle>-0     [007] d...   190.485261: wake_lat: lat=27 irqlat=190485230 pid=698
                <idle>-0     [005] d...   190.485283: wake_lat: lat=40 irqlat=190485239 pid=10
                <idle>-0     [002] d...   190.488327: wake_lat: lat=56 irqlat=190488266 pid=335
                <idle>-0     [005] d...   190.489330: wake_lat: lat=64 irqlat=190489262 pid=10
                <idle>-0     [003] d...   190.490312: wake_lat: lat=43 irqlat=190490265 pid=77
                <idle>-0     [005] d...   190.493322: wake_lat: lat=54 irqlat=190493262 pid=10
                <idle>-0     [005] d...   190.497305: wake_lat: lat=35 irqlat=190497267 pid=10
                <idle>-0     [005] d...   190.501319: wake_lat: lat=50 irqlat=190501264 pid=10
      
      The irqlat seemed quite large! Investigating this further, if I had
      enabled the irq_lat synthetic event, I noticed this:
      
                <idle>-0     [002] d.s.   249.429308: irq_lat: lat=164968 pid=335
                <idle>-0     [002] d...   249.429369: wake_lat: lat=55 irqlat=249429308 pid=335
      
      Notice that the timestamp of the irq_lat "249.429308" is awfully
      similar to the reported irqlat variable. In fact, all instances were
      like this. It appeared that:
      
        irqlat=$irqlat
      
      Wasn't assigning the old $irqlat to the new irqlat variable, but
      instead was assigning the $irqts to it.
      
      The issue is that assigning the old $irqlat to the new irqlat variable
      creates a variable reference alias, but the alias creation code
      forgets to make sure the alias uses the same var_ref_idx to access the
      reference.
      
      Link: http://lkml.kernel.org/r/1567375321.5282.12.camel@kernel.org
      
      Cc: Linux Trace Devel <linux-trace-devel@vger.kernel.org>
      Cc: linux-rt-users <linux-rt-users@vger.kernel.org>
      Cc: stable@vger.kernel.org
      Fixes: 7e8b88a3 ("tracing: Add hist trigger support for variable reference aliases")
Reported-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e010c983
7. 08 Oct 2019, 3 commits
• kexec: bail out upon SIGKILL when allocating memory. · d85bc11a
  Tetsuo Handa authored
      commit 7c3a6aedcd6aae0a32a527e68669f7dd667492d1 upstream.
      
      syzbot found that a thread can stall for minutes inside kexec_load() after
      that thread was killed by SIGKILL [1].  It turned out that the reproducer
      was trying to allocate 2408MB of memory using kimage_alloc_page() from
      kimage_load_normal_segment().  Let's check for SIGKILL before doing memory
      allocation.
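
A hedged sketch of the described check, as it would sit inside the
allocation loop (simplified; not the exact diff):

  /* Bail out of kimage_alloc_page()'s allocation loop if the caller has
   * been SIGKILLed, instead of grinding through the remaining pages. */
  if (fatal_signal_pending(current))
          return NULL;
  page = kimage_alloc_pages(gfp_mask, 0);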
      
      [1] https://syzkaller.appspot.com/bug?id=a0e3436829698d5824231251fad9d8e998f94f5e
      
Link: http://lkml.kernel.org/r/993c9185-d324-2640-d061-bed2dd18b1f7@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+8ab2d0f39fb79fe6ca40@syzkaller.appspotmail.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d85bc11a
• bpf: fix use after free in prog symbol exposure · ed568ca7
  Daniel Borkmann authored
      commit c751798aa224fadc5124b49eeb38fb468c0fa039 upstream.
      
      syzkaller managed to trigger the warning in bpf_jit_free() which checks via
      bpf_prog_kallsyms_verify_off() for potentially unlinked JITed BPF progs
      in kallsyms, and subsequently trips over GPF when walking kallsyms entries:
      
        [...]
        8021q: adding VLAN 0 to HW filter on device batadv0
        8021q: adding VLAN 0 to HW filter on device batadv0
        WARNING: CPU: 0 PID: 9869 at kernel/bpf/core.c:810 bpf_jit_free+0x1e8/0x2a0
        Kernel panic - not syncing: panic_on_warn set ...
        CPU: 0 PID: 9869 Comm: kworker/0:7 Not tainted 5.0.0-rc8+ #1
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        Workqueue: events bpf_prog_free_deferred
        Call Trace:
         __dump_stack lib/dump_stack.c:77 [inline]
         dump_stack+0x113/0x167 lib/dump_stack.c:113
         panic+0x212/0x40b kernel/panic.c:214
         __warn.cold.8+0x1b/0x38 kernel/panic.c:571
         report_bug+0x1a4/0x200 lib/bug.c:186
         fixup_bug arch/x86/kernel/traps.c:178 [inline]
         do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:271
         do_invalid_op+0x36/0x40 arch/x86/kernel/traps.c:290
         invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:973
        RIP: 0010:bpf_jit_free+0x1e8/0x2a0
        Code: 02 4c 89 e2 83 e2 07 38 d0 7f 08 84 c0 0f 85 86 00 00 00 48 ba 00 02 00 00 00 00 ad de 0f b6 43 02 49 39 d6 0f 84 5f fe ff ff <0f> 0b e9 58 fe ff ff 48 b8 00 00 00 00 00 fc ff df 4c 89 e2 48 c1
        RSP: 0018:ffff888092f67cd8 EFLAGS: 00010202
        RAX: 0000000000000007 RBX: ffffc90001947000 RCX: ffffffff816e9d88
        RDX: dead000000000200 RSI: 0000000000000008 RDI: ffff88808769f7f0
        RBP: ffff888092f67d00 R08: fffffbfff1394059 R09: fffffbfff1394058
        R10: fffffbfff1394058 R11: ffffffff89ca02c7 R12: ffffc90001947002
        R13: ffffc90001947020 R14: ffffffff881eca80 R15: ffff88808769f7e8
        BUG: unable to handle kernel paging request at fffffbfff400d000
        #PF error: [normal kernel read fault]
        PGD 21ffee067 P4D 21ffee067 PUD 21ffed067 PMD 9f942067 PTE 0
        Oops: 0000 [#1] PREEMPT SMP KASAN
        CPU: 0 PID: 9869 Comm: kworker/0:7 Not tainted 5.0.0-rc8+ #1
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        Workqueue: events bpf_prog_free_deferred
        RIP: 0010:bpf_get_prog_addr_region kernel/bpf/core.c:495 [inline]
        RIP: 0010:bpf_tree_comp kernel/bpf/core.c:558 [inline]
        RIP: 0010:__lt_find include/linux/rbtree_latch.h:115 [inline]
        RIP: 0010:latch_tree_find include/linux/rbtree_latch.h:208 [inline]
        RIP: 0010:bpf_prog_kallsyms_find+0x107/0x2e0 kernel/bpf/core.c:632
        Code: 00 f0 ff ff 44 38 c8 7f 08 84 c0 0f 85 fa 00 00 00 41 f6 45 02 01 75 02 0f 0b 48 39 da 0f 82 92 00 00 00 48 89 d8 48 c1 e8 03 <42> 0f b6 04 30 84 c0 74 08 3c 03 0f 8e 45 01 00 00 8b 03 48 c1 e0
        [...]
      
      Upon further debugging, it turns out that whenever we trigger this
      issue, the kallsyms removal in bpf_prog_ksym_node_del() was /skipped/
      but yet bpf_jit_free() reported that the entry is /in use/.
      
      Problem is that symbol exposure via bpf_prog_kallsyms_add() but also
      perf_event_bpf_event() were done /after/ bpf_prog_new_fd(). Once the
      fd is exposed to the public, a parallel close request came in right
      before we attempted to do the bpf_prog_kallsyms_add().
      
      Given at this time the prog reference count is one, we start to rip
      everything underneath us via bpf_prog_release() -> bpf_prog_put().
      The memory is eventually released via deferred free, so we're seeing
      that bpf_jit_free() has a kallsym entry because we added it from
      bpf_prog_load() but /after/ bpf_prog_put() from the remote CPU.
      
      Therefore, move both notifications /before/ we install the fd. The
      issue was never seen between bpf_prog_alloc_id() and bpf_prog_new_fd()
      because upon bpf_prog_get_fd_by_id() we'll take another reference to
      the BPF prog, so we're still holding the original reference from the
      bpf_prog_load().
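
The reordering described above can be sketched roughly as (simplified, not
the exact diff; error handling elided):

  /* Publish the prog in kallsyms and to perf *before* the fd exists, so a
   * parallel close() can no longer race with the notifications. */
  bpf_prog_kallsyms_add(prog);
  perf_event_bpf_event(prog, PERF_BPF_EVENT_PROG_LOAD, 0);

  err = bpf_prog_new_fd(prog);    /* only now does user space see the prog */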
      
      Fixes: 6ee52e2a3fe4 ("perf, bpf: Introduce PERF_RECORD_BPF_EVENT")
      Fixes: 74451e66 ("bpf: make jited programs visible in traces")
      Reported-by: syzbot+bd3bba6ff3fcea7a6ec6@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Zubin Mithra <zsm@chromium.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
      ed568ca7
• livepatch: Nullify obj->mod in klp_module_coming()'s error path · 0f0ced70
  Miroslav Benes authored
      [ Upstream commit 4ff96fb52c6964ad42e0a878be8f86a2e8052ddd ]
      
      klp_module_coming() is called for every module appearing in the system.
      It sets obj->mod to a patched module for klp_object obj. Unfortunately
      it leaves it set even if an error happens later in the function and the
      patched module is not allowed to be loaded.
      
klp_is_object_loaded() uses the obj->mod variable and could currently give a
      wrong return value. The bug is probably harmless as of now.
Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
      0f0ced70
8. 05 Oct 2019, 11 commits
• alarmtimer: Use EOPNOTSUPP instead of ENOTSUPP · 3784576f
  Thadeu Lima de Souza Cascardo authored
      commit f18ddc13af981ce3c7b7f26925f099e7c6929aba upstream.
      
      ENOTSUPP is not supposed to be returned to userspace. This was found on an
      OpenPower machine, where the RTC does not support set_alarm.
      
      On that system, a clock_nanosleep(CLOCK_REALTIME_ALARM, ...) results in
      "524 Unknown error 524"
      
      Replace it with EOPNOTSUPP which results in the expected "95 Operation not
      supported" error.
      
      Fixes: 1c6b39ad (alarmtimers: Return -ENOTSUPP if no RTC device is present)
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20190903171802.28314-1-cascardo@canonical.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3784576f
• printk: Do not lose last line in kmsg buffer dump · 40b07199
  Vincent Whitchurch authored
      commit c9dccacfccc72c32692eedff4a27a4b0833a2afd upstream.
      
      kmsg_dump_get_buffer() is supposed to select all the youngest log
      messages which fit into the provided buffer.  It determines the correct
      start index by using msg_print_text() with a NULL buffer to calculate
      the size of each entry.  However, when performing the actual writes,
      msg_print_text() only writes the entry to the buffer if the written len
      is lesser than the size of the buffer.  So if the lengths of the
      selected youngest log messages happen to precisely fill up the provided
      buffer, the last log message is not included.
      
      We don't want to modify msg_print_text() to fill up the buffer and start
      returning a length which is equal to the size of the buffer, since
      callers of its other users, such as kmsg_dump_get_line(), depend upon
      the current behaviour.
      
      Instead, fix kmsg_dump_get_buffer() to compensate for this.
      
      For example, with the following two final prints:
      
      [    6.427502] AAAAAAAAAAAAA
      [    6.427769] BBBBBBBB12345
      
      A dump of a 64-byte buffer filled by kmsg_dump_get_buffer(), before this
      patch:
      
       00000000: 3c 30 3e 5b 20 20 20 20 36 2e 35 32 32 31 39 37  <0>[    6.522197
       00000010: 5d 20 41 41 41 41 41 41 41 41 41 41 41 41 41 0a  ] AAAAAAAAAAAAA.
       00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
       00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      
      After this patch:
      
       00000000: 3c 30 3e 5b 20 20 20 20 36 2e 34 35 36 36 37 38  <0>[    6.456678
       00000010: 5d 20 42 42 42 42 42 42 42 42 31 32 33 34 35 0a  ] BBBBBBBB12345.
       00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
       00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      
      Link: http://lkml.kernel.org/r/20190711142937.4083-1-vincent.whitchurch@axis.com
      Fixes: e2ae715d ("kmsg - kmsg_dump() use iterator to receive log buffer content")
      To: rostedt@goodmis.org
      Cc: linux-kernel@vger.kernel.org
      Cc: <stable@vger.kernel.org> # v3.5+
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      40b07199
• kprobes: Prohibit probing on BUG() and WARN() address · fad90d4b
  Masami Hiramatsu authored
      [ Upstream commit e336b4027775cb458dc713745e526fa1a1996b2a ]
      
Since BUG() and WARN() may use a trap (e.g. UD2 on x86) to
get the address where the BUG() has occurred, kprobes cannot
single-step that instruction out-of-line. So prohibit
probing on such addresses.
      
Without this fix, if someone puts a kprobe on WARN(), the
kernel will crash with an invalid opcode error instead of
outputting the warning message, because the kernel cannot find
the correct bug address.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: David S . Miller <davem@davemloft.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Naveen N . Rao <naveen.n.rao@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/156750890133.19112.3393666300746167111.stgit@devnote2
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
      fad90d4b
• sched/cpufreq: Align trace event behavior of fast switching · 01e8f487
  Douglas RAILLARD authored
      [ Upstream commit 77c84dd1881d0f0176cb678d770bfbda26c54390 ]
      
The fast switching path only emits an event for the CPU of interest, whereas the
regular path emits an event for all the CPUs that had their frequency changed,
i.e. all the CPUs sharing the same policy.

With the current behavior, looking at the cpu_frequency event for a given CPU that
is using the fast switching path will not give the correct frequency signal.
Signed-off-by: Douglas RAILLARD <douglas.raillard@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
      01e8f487
• posix-cpu-timers: Sanitize bogus WARNONS · 8d5fccff
  Thomas Gleixner authored
      [ Upstream commit 692117c1f7a6770ed41dd8f277cd9fed1dfb16f1 ]
      
      Warning when p == NULL and then proceeding and dereferencing p does not
      make any sense as the kernel will crash with a NULL pointer dereference
      right away.
      
Bailing out when p == NULL and returning an error code does not cure the
underlying problem which caused p to be NULL, though it might allow
proper debugging.
      
      Same applies to the clock id check in set_process_cpu_timer().
      
      Clean them up and make them return without trying to do further damage.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190819143801.846497772@linutronix.de
Signed-off-by: Sasha Levin <sashal@kernel.org>
      8d5fccff
• idle: Prevent late-arriving interrupts from disrupting offline · 51111023
  Peter Zijlstra authored
      [ Upstream commit e78a7614f3876ac649b3df608789cb6ef74d0480 ]
      
      Scheduling-clock interrupts can arrive late in the CPU-offline process,
      after idle entry and the subsequent call to cpuhp_report_idle_dead().
      Once execution passes the call to rcu_report_dead(), RCU is ignoring
      the CPU, which results in lockdep complaints when the interrupt handler
      uses RCU:
      
      ------------------------------------------------------------------------
      
      =============================
      WARNING: suspicious RCU usage
      5.2.0-rc1+ #681 Not tainted
      -----------------------------
      kernel/sched/fair.c:9542 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      RCU used illegally from offline CPU!
      rcu_scheduler_active = 2, debug_locks = 1
      no locks held by swapper/5/0.
      
      stack backtrace:
      CPU: 5 PID: 0 Comm: swapper/5 Not tainted 5.2.0-rc1+ #681
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
      Call Trace:
       <IRQ>
       dump_stack+0x5e/0x8b
       trigger_load_balance+0xa8/0x390
       ? tick_sched_do_timer+0x60/0x60
       update_process_times+0x3b/0x50
       tick_sched_handle+0x2f/0x40
       tick_sched_timer+0x32/0x70
       __hrtimer_run_queues+0xd3/0x3b0
       hrtimer_interrupt+0x11d/0x270
       ? sched_clock_local+0xc/0x74
       smp_apic_timer_interrupt+0x79/0x200
       apic_timer_interrupt+0xf/0x20
       </IRQ>
      RIP: 0010:delay_tsc+0x22/0x50
      Code: ff 0f 1f 80 00 00 00 00 65 44 8b 05 18 a7 11 48 0f ae e8 0f 31 48 89 d6 48 c1 e6 20 48 09 c6 eb 0e f3 90 65 8b 05 fe a6 11 48 <41> 39 c0 75 18 0f ae e8 0f 31 48 c1 e2 20 48 09 c2 48 89 d0 48 29
      RSP: 0000:ffff8f92c0157ed0 EFLAGS: 00000212 ORIG_RAX: ffffffffffffff13
      RAX: 0000000000000005 RBX: ffff8c861f356400 RCX: ffff8f92c0157e64
      RDX: 000000321214c8cc RSI: 00000032120daa7f RDI: 0000000000260f15
      RBP: 0000000000000005 R08: 0000000000000005 R09: 0000000000000000
      R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
      R13: 0000000000000000 R14: ffff8c861ee18000 R15: ffff8c861ee18000
       cpuhp_report_idle_dead+0x31/0x60
       do_idle+0x1d5/0x200
       ? _raw_spin_unlock_irqrestore+0x2d/0x40
       cpu_startup_entry+0x14/0x20
       start_secondary+0x151/0x170
       secondary_startup_64+0xa4/0xb0
      
      ------------------------------------------------------------------------
      
This happens rarely, but can be forced to happen more often by
      placing delays in cpuhp_report_idle_dead() following the call to
      rcu_report_dead().  With this in place, the following rcutorture
      scenario reproduces the problem within a few minutes:
      
      tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 8 --duration 5 --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y" --configs "TREE04"
      
      This commit uses the crude but effective expedient of moving the disabling
      of interrupts within the idle loop to precede the cpu_is_offline()
      check.  It also invokes tick_nohz_idle_stop_tick() instead of
      tick_nohz_idle_stop_tick_protected() to shut off the scheduling-clock
      interrupt.
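
A hedged sketch of the reordering described above, inside the idle loop
(simplified; not the exact diff):

  while (!need_resched()) {
          rmb();

          local_irq_disable();            /* now disabled before the offline check */

          if (cpu_is_offline(cpu)) {
                  tick_nohz_idle_stop_tick();     /* shut off the scheduling-clock tick */
                  cpuhp_report_idle_dead();
                  arch_cpu_idle_dead();
          }
          /* ... rest of the idle loop unchanged ... */
  }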
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      [ paulmck: Revert tick_nohz_idle_stop_tick_protected() removal, new callers. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
      51111023
• sched/fair: Use rq_lock/unlock in online_fair_sched_group · 9addfbd4
  Phil Auld authored
      [ Upstream commit a46d14eca7b75fffe35603aa8b81df654353d80f ]
      
      Enabling WARN_DOUBLE_CLOCK in /sys/kernel/debug/sched_features causes
a warning to fire in update_rq_clock. This seems to be caused by onlining
      a new fair sched group not using the rq lock wrappers.
      
        [] rq->clock_update_flags & RQCF_UPDATED
        [] WARNING: CPU: 5 PID: 54385 at kernel/sched/core.c:210 update_rq_clock+0xec/0x150
      
        [] Call Trace:
        []  online_fair_sched_group+0x53/0x100
        []  cpu_cgroup_css_online+0x16/0x20
        []  online_css+0x1c/0x60
        []  cgroup_apply_control_enable+0x231/0x3b0
        []  cgroup_mkdir+0x41b/0x530
        []  kernfs_iop_mkdir+0x61/0xa0
        []  vfs_mkdir+0x108/0x1a0
        []  do_mkdirat+0x77/0xe0
        []  do_syscall_64+0x55/0x1d0
        []  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Using the wrappers in online_fair_sched_group instead of the raw locking
      removes this warning.
      
      [ tglx: Use rq_*lock_irq() ]
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20190801133749.11033-1-pauld@redhat.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
      9addfbd4
• sched/deadline: Fix bandwidth accounting at all levels after offline migration · 0f308569
  Juri Lelli authored
      [ Upstream commit 59d06cea1198d665ba11f7e8c5f45b00ff2e4812 ]
      
      If a task happens to be throttled while the CPU it was running on gets
      hotplugged off, the bandwidth associated with the task is not correctly
      migrated with it when the replenishment timer fires (offline_migration).
      
Fix things up for this_bw, running_bw and total_bw when the replenishment
timer fires and the task is migrated (dl_task_offline_migration()).
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bristot@redhat.com
      Cc: claudio@evidence.eu.com
      Cc: lizefan@huawei.com
      Cc: longman@redhat.com
      Cc: luca.abeni@santannapisa.it
      Cc: mathieu.poirier@linaro.org
      Cc: rostedt@goodmis.org
      Cc: tj@kernel.org
      Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-5-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
      0f308569
• sched/core: Fix CPU controller for !RT_GROUP_SCHED · f381d3d2
  Juri Lelli authored
      [ Upstream commit a07db5c0865799ebed1f88be0df50c581fb65029 ]
      
On !CONFIG_RT_GROUP_SCHED configurations it is currently not possible to
move RT tasks between cgroups to which the CPU controller has been attached,
but it is oddly possible to first move tasks around and then make them
RT (sched_setscheduler() to FIFO/RR).
      
      E.g.:
      
        # mkdir /sys/fs/cgroup/cpu,cpuacct/group1
        # chrt -fp 10 $$
        # echo $$ > /sys/fs/cgroup/cpu,cpuacct/group1/tasks
        bash: echo: write error: Invalid argument
        # chrt -op 0 $$
        # echo $$ > /sys/fs/cgroup/cpu,cpuacct/group1/tasks
        # chrt -fp 10 $$
        # cat /sys/fs/cgroup/cpu,cpuacct/group1/tasks
        2345
        2598
        # chrt -p 2345
        pid 2345's current scheduling policy: SCHED_FIFO
        pid 2345's current scheduling priority: 10
      
      Also, as Michal noted, it is currently not possible to enable CPU
      controller on unified hierarchy with !CONFIG_RT_GROUP_SCHED (if there
      are any kernel RT threads in root cgroup, they can't be migrated to the
      newly created CPU controller's root in cgroup_update_dfl_csses()).
      
The existing code comes with a comment saying that "we don't support RT-tasks
being in separate groups". That comment is however stale and belongs to
pre-RT_GROUP_SCHED times. Also, it doesn't make much sense for
!RT_GROUP_SCHED configurations, since checks related to RT bandwidth
are not performed at all in these cases.
      
      Make moving RT tasks between CPU controller groups viable by removing
      special case check for RT (and DEADLINE) tasks.
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: lizefan@huawei.com
      Cc: longman@redhat.com
      Cc: luca.abeni@santannapisa.it
      Cc: rostedt@goodmis.org
      Link: https://lkml.kernel.org/r/20190719063455.27328-1-juri.lelli@redhat.com
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      f381d3d2
    • V
      sched/fair: Fix imbalance due to CPU affinity · 417cf53b
      Vincent Guittot 提交于
      [ Upstream commit f6cad8df6b30a5d2bbbd2e698f74b4cafb9fb82b ]
      
      load_balance() has a dedicated mechanism to detect when an imbalance
      is due to CPU affinity and must be handled at the parent level. In this case,
      the imbalance field of the parent's sched_group is set.
      
      The description of sg_imbalanced() gives a typical example of two groups
      of 4 CPUs each and 4 tasks each with a cpumask covering 1 CPU of the first
      group and 3 CPUs of the second group. Something like:
      
      	{ 0 1 2 3 } { 4 5 6 7 }
      	        *     * * *
      
      But load_balance() fails to fix this use case on my octo-core system
      made of 2 clusters of quad cores.
      
      While load_balance() is able to detect that the imbalance is due to
      CPU affinity, it fails to fix it because the imbalance field is cleared
      before the parent level gets a chance to run. In fact, when the imbalance is
      detected, load_balance() reruns without the CPU with pinned tasks. But
      there are no other running tasks in the situation described above, and
      everything looks balanced this time, so the imbalance field is immediately
      cleared.
      
      The imbalance field should not be cleared if there is no other task to move
      when the imbalance is detected.
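
      Concretely, the clearing in load_balance()'s out_balanced path becomes
      conditional on other tasks actually having had a chance to move (a sketch,
      close to but not necessarily identical to the upstream hunk):

      	out_balanced:
      		/*
      		 * We reached balance, but possibly only because of affinity
      		 * constraints. Clear the imbalance flag only if other tasks
      		 * got a chance to move and fix it.
      		 */
      		if (sd_parent && !(env.flags & LBF_ALL_PINNED)) {
      			int *group_imbalance = &sd_parent->groups->sgc->imbalance;

      			if (*group_imbalance)
      				*group_imbalance = 0;
      		}
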
      Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/1561996022-28829-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      417cf53b
    • P
      time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint · 7cebdfa6
      Paul E. McKenney 提交于
      [ Upstream commit 84ec3a0787086fcd25f284f59b3aa01fd6fc0a5d ]
      
      The TASKS03 and TREE04 rcutorture scenarios produce the following
      lockdep complaint:
      
      	WARNING: inconsistent lock state
      	5.2.0-rc1+ #513 Not tainted
      	--------------------------------
      	inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
      	migration/1/14 [HC0[0]:SC0[0]:HE1:SE1] takes:
      	(____ptrval____) (tick_broadcast_lock){?...}, at: tick_broadcast_offline+0xf/0x70
      	{IN-HARDIRQ-W} state was registered at:
      	  lock_acquire+0xb0/0x1c0
      	  _raw_spin_lock_irqsave+0x3c/0x50
      	  tick_broadcast_switch_to_oneshot+0xd/0x40
      	  tick_switch_to_oneshot+0x4f/0xd0
      	  hrtimer_run_queues+0xf3/0x130
      	  run_local_timers+0x1c/0x50
      	  update_process_times+0x1c/0x50
      	  tick_periodic+0x26/0xc0
      	  tick_handle_periodic+0x1a/0x60
      	  smp_apic_timer_interrupt+0x80/0x2a0
      	  apic_timer_interrupt+0xf/0x20
      	  _raw_spin_unlock_irqrestore+0x4e/0x60
      	  rcu_nocb_gp_kthread+0x15d/0x590
      	  kthread+0xf3/0x130
      	  ret_from_fork+0x3a/0x50
      	irq event stamp: 171
      	hardirqs last  enabled at (171): [<ffffffff8a201a37>] trace_hardirqs_on_thunk+0x1a/0x1c
      	hardirqs last disabled at (170): [<ffffffff8a201a53>] trace_hardirqs_off_thunk+0x1a/0x1c
      	softirqs last  enabled at (0): [<ffffffff8a264ee0>] copy_process.part.56+0x650/0x1cb0
      	softirqs last disabled at (0): [<0000000000000000>] 0x0
      
              [...]
      
      To reproduce, run the following rcutorture test:
      
       $ tools/testing/selftests/rcutorture/bin/kvm.sh --duration 5 --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y" --configs "TASKS03 TREE04"
      
      It turns out that tick_broadcast_offline() was an innocent bystander.
      After all, interrupts are supposed to be disabled throughout
      take_cpu_down(), and therefore should have been disabled upon entry to
      tick_offline_cpu() and thus to tick_broadcast_offline().  This suggests
      that one of the CPU-hotplug notifiers was incorrectly enabling interrupts,
      and leaving them enabled on return.
      
      Some debugging code showed that the culprit was sched_cpu_dying():
      it had irqs enabled after return from sched_tick_stop(), which in turn
      had irqs enabled after return from cancel_delayed_work_sync(), which is a
      wrapper around __cancel_work_timer(), which can sleep in the case where
      something else is concurrently trying to cancel the same delayed work.
      As Thomas Gleixner pointed out on IRC, sleeping is a decidedly bad
      idea when you are invoked from take_cpu_down(), regardless of the state
      you leave interrupts in upon return.
      
      Code inspection located no reason why the delayed work absolutely
      needed to be canceled from sched_tick_stop():  The work is not
      bound to the outgoing CPU by design, given that the whole point is
      to collect statistics without disturbing the outgoing CPU.
      
      This commit therefore simply drops the cancel_delayed_work_sync() from
      sched_tick_stop().  Instead, a new ->state field is added to the tick_work
      structure so that the delayed-work handler function sched_tick_remote()
      can avoid reposting itself.  A cpu_is_offline() check is also added to
      sched_tick_remote() to avoid mucking with the state of an offlined CPU
      (though it does appear safe to do so).  The sched_tick_start() and
      sched_tick_stop() functions also update ->state, and sched_tick_start()
      also schedules the delayed work if ->state indicates that it is not
      already in flight.
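
      A condensed sketch of the resulting state handling (the names below are
      illustrative; the actual patch uses a dedicated enum plus a few more
      transitions and sanity checks):

      	enum { TICK_REMOTE_OFFLINE, TICK_REMOTE_OFFLINING, TICK_REMOTE_RUNNING };

      	struct tick_work {
      		int			cpu;
      		atomic_t		state;	/* new: replaces cancel_delayed_work_sync() */
      		struct delayed_work	work;
      	};

      	static struct tick_work __percpu *tick_work_cpu;

      	static void sched_tick_remote(struct work_struct *work)
      	{
      		struct delayed_work *dwork = to_delayed_work(work);
      		struct tick_work *twork = container_of(dwork, struct tick_work, work);
      		int os;

      		/* Leave an offlined CPU's runqueue alone. */
      		if (!cpu_is_offline(twork->cpu)) {
      			/* ... remote tick statistics work, unchanged ... */
      		}

      		/*
      		 * If sched_tick_stop() moved us to OFFLINING, drop to OFFLINE
      		 * and stop; if still RUNNING, leave the state alone and repost.
      		 */
      		os = atomic_fetch_add_unless(&twork->state, -1, TICK_REMOTE_RUNNING);
      		if (os == TICK_REMOTE_RUNNING)
      			queue_delayed_work(system_unbound_wq, dwork, HZ);
      	}

      	static void sched_tick_stop(int cpu)
      	{
      		struct tick_work *twork = per_cpu_ptr(tick_work_cpu, cpu);

      		/* No cancel_delayed_work_sync(): just ask the handler to stop reposting. */
      		atomic_fetch_add_unless(&twork->state, -1, TICK_REMOTE_OFFLINE);
      	}
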
      Signed-off-by: NPaul E. McKenney <paulmck@linux.ibm.com>
      [ paulmck: Apply Peter Zijlstra and Frederic Weisbecker atomics feedback. ]
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NFrederic Weisbecker <frederic@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190625165238.GJ26519@linux.ibm.com
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      7cebdfa6
  9. 01 10月, 2019 1 次提交
  10. 21 9月, 2019 1 次提交
    • M
      kallsyms: Don't let kallsyms_lookup_size_offset() fail on retrieving the first symbol · 9a74f799
      Marc Zyngier 提交于
      [ Upstream commit 2a1a3fa0f29270583f0e6e3100d609e09697add1 ]
      
      An arm64 kernel configured with
      
        CONFIG_KPROBES=y
        CONFIG_KALLSYMS=y
        # CONFIG_KALLSYMS_ALL is not set
        CONFIG_KALLSYMS_BASE_RELATIVE=y
      
      reports the following kprobe failure:
      
        [    0.032677] kprobes: failed to populate blacklist: -22
        [    0.033376] Please take care of using kprobes.
      
      It appears that kprobe fails to retrieve the symbol at address
      0xffff000010081000, despite this symbol being in System.map:
      
        ffff000010081000 T __exception_text_start
      
      This symbol is part of the first group of aliases in the
      kallsyms_offsets array (symbol names generated using ugly hacks in
      scripts/kallsyms.c):
      
        kallsyms_offsets:
                .long   0x1000 // do_undefinstr
                .long   0x1000 // efi_header_end
                .long   0x1000 // _stext
                .long   0x1000 // __exception_text_start
                .long   0x12b0 // do_cp15instr
      
      Looking at the implementation of get_symbol_pos(), it returns the
      lowest index for aliasing symbols. In this case, it returns 0.
      
      But kallsyms_lookup_size_offset() treats 0 as a failure, which
      is obviously wrong (there is definitely a valid symbol living there).
      In turn, the kprobe blacklisting stops abruptly, hence the original
      error.
      
      A CONFIG_KALLSYMS_ALL kernel wouldn't fail, as there are always
      some symbols at the beginning of this array which are never
      looked up via kallsyms_lookup_size_offset().
      
      Fix it by considering that get_symbol_pos() is always successful
      (which is consistent with the other uses of this function).
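
      The shape of the fix in kallsyms_lookup_size_offset() is roughly the
      following (simplified sketch; the module/BPF fallback path is abbreviated):

      	int kallsyms_lookup_size_offset(unsigned long addr, unsigned long *symbolsize,
      					unsigned long *offset)
      	{
      		char namebuf[KSYM_NAME_LEN];

      		if (is_ksym_addr(addr)) {
      			/* Position 0 is a valid symbol, not an error. */
      			get_symbol_pos(addr, symbolsize, offset);
      			return 1;
      		}
      		return !!module_address_lookup(addr, symbolsize, offset, NULL, namebuf);
      	}
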
      
      Fixes: ffc50891 ("[PATCH] Create kallsyms_lookup_size_offset()")
      Reviewed-by: NMasami Hiramatsu <mhiramat@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NMarc Zyngier <maz@kernel.org>
      Signed-off-by: NWill Deacon <will@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      9a74f799
  11. 19 9月, 2019 3 次提交
    • Y
      modules: fix compile error if don't have strict module rwx · 52bfcc9c
      Yang Yingliang 提交于
      commit 93651f80dcb616b8c9115cdafc8e57a781af22d0 upstream.
      
      If CONFIG_ARCH_HAS_STRICT_MODULE_RWX is not defined,
      we need stubs for module_enable_nx() and module_enable_x().
      
      If CONFIG_ARCH_HAS_STRICT_MODULE_RWX is defined but
      CONFIG_STRICT_MODULE_RWX is disabled, we need a stub for
      module_enable_nx().
      
      Move frob_text() outside of the CONFIG_STRICT_MODULE_RWX block,
      because it is needed in either case.
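
      The resulting #ifdef layout in kernel/module.c looks roughly like this
      (sketch only, real function bodies elided):

      	#ifdef CONFIG_ARCH_HAS_STRICT_MODULE_RWX
      	/* frob_text() and the real module_enable_x() live here now. */
      	#else /* !CONFIG_ARCH_HAS_STRICT_MODULE_RWX */
      	static void module_enable_x(const struct module *mod) { }
      	#endif

      	#ifdef CONFIG_STRICT_MODULE_RWX
      	/* module_enable_ro(), module_enable_nx(), ... real implementations */
      	#else /* !CONFIG_STRICT_MODULE_RWX */
      	static void module_enable_nx(const struct module *mod) { }
      	#endif
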
      
      Fixes: 2eef1399a866 ("modules: fix BUG when load module with rodata=n")
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: NJessica Yu <jeyu@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      52bfcc9c
    • Y
      modules: fix BUG when load module with rodata=n · ae415d7a
      Yang Yingliang 提交于
      commit 2eef1399a866c57687962e15142b141a4f8e7862 upstream.
      
      Loading a module with rodata=n triggers an "executing
      NX-protected page" BUG.
      
      [   32.379191] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
      [   32.382917] BUG: unable to handle page fault for address: ffffffffc0005000
      [   32.385947] #PF: supervisor instruction fetch in kernel mode
      [   32.387662] #PF: error_code(0x0011) - permissions violation
      [   32.389352] PGD 240c067 P4D 240c067 PUD 240e067 PMD 421a52067 PTE 8000000421a53063
      [   32.391396] Oops: 0011 [#1] SMP PTI
      [   32.392478] CPU: 7 PID: 2697 Comm: insmod Tainted: G           O      5.2.0-rc5+ #202
      [   32.394588] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
      [   32.398157] RIP: 0010:ko_test_init+0x0/0x1000 [ko_test]
      [   32.399662] Code: Bad RIP value.
      [   32.400621] RSP: 0018:ffffc900029f3ca8 EFLAGS: 00010246
      [   32.402171] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      [   32.404332] RDX: 00000000000004c7 RSI: 0000000000000cc0 RDI: ffffffffc0005000
      [   32.406347] RBP: ffffffffc0005000 R08: ffff88842fbebc40 R09: ffffffff810ede4a
      [   32.408392] R10: ffffea00108e3480 R11: 0000000000000000 R12: ffff88842bee21a0
      [   32.410472] R13: 0000000000000001 R14: 0000000000000001 R15: ffffc900029f3e78
      [   32.412609] FS:  00007fb4f0c0a700(0000) GS:ffff88842fbc0000(0000) knlGS:0000000000000000
      [   32.414722] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   32.416290] CR2: ffffffffc0004fd6 CR3: 0000000421a90004 CR4: 0000000000020ee0
      [   32.418471] Call Trace:
      [   32.419136]  do_one_initcall+0x41/0x1df
      [   32.420199]  ? _cond_resched+0x10/0x40
      [   32.421433]  ? kmem_cache_alloc_trace+0x36/0x160
      [   32.422827]  do_init_module+0x56/0x1f7
      [   32.423946]  load_module+0x1e67/0x2580
      [   32.424947]  ? __alloc_pages_nodemask+0x150/0x2c0
      [   32.426413]  ? map_vm_area+0x2d/0x40
      [   32.427530]  ? __vmalloc_node_range+0x1ef/0x260
      [   32.428850]  ? __do_sys_init_module+0x135/0x170
      [   32.430060]  ? _cond_resched+0x10/0x40
      [   32.431249]  __do_sys_init_module+0x135/0x170
      [   32.432547]  do_syscall_64+0x43/0x120
      [   32.433853]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Because with rodata=n set_memory_x() was never called on the module's text,
      fix this by calling set_memory_x() from complete_formation().
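
      A sketch of that fix: a small helper makes the module text executable and
      complete_formation() calls it unconditionally (abridged; the exact wiring
      is in the upstream commit):

      	static void module_enable_x(const struct module *mod)
      	{
      		frob_text(&mod->core_layout, set_memory_x);
      		frob_text(&mod->init_layout, set_memory_x);
      	}

      	static int complete_formation(struct module *mod, struct load_info *info)
      	{
      		/* ... */
      		module_enable_ro(mod, false);
      		module_enable_nx(mod);
      		module_enable_x(mod);	/* text must be executable even with rodata=n */
      		/* ... */
      		return 0;
      	}
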
      
      Fixes: f2c65fb3221a ("x86/modules: Avoid breaking W^X while loading modules")
      Suggested-by: NJian Cheng <cj.chengjian@huawei.com>
      Reviewed-by: NNadav Amit <namit@vmware.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: NJessica Yu <jeyu@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ae415d7a
    • Y
      genirq: Prevent NULL pointer dereference in resend_irqs() · 991b3458
      Yunfeng Ye 提交于
      commit eddf3e9c7c7e4d0707c68d1bb22cc6ec8aef7d4a upstream.
      
      The following crash was observed:
      
        Unable to handle kernel NULL pointer dereference at 0000000000000158
        Internal error: Oops: 96000004 [#1] SMP
        pc : resend_irqs+0x68/0xb0
        lr : resend_irqs+0x64/0xb0
        ...
        Call trace:
         resend_irqs+0x68/0xb0
         tasklet_action_common.isra.6+0x84/0x138
         tasklet_action+0x2c/0x38
         __do_softirq+0x120/0x324
         run_ksoftirqd+0x44/0x60
         smpboot_thread_fn+0x1ac/0x1e8
         kthread+0x134/0x138
         ret_from_fork+0x10/0x18
      
      The reason for this is that the interrupt resend mechanism runs in soft
      interrupt context, which is asynchronous with respect to other
      operations on interrupts. free_irq() does not take resend handling into
      account. Thus, the irq descriptor might already be freed before the resend
      tasklet is executed. resend_irqs() does not check the return value of the
      interrupt descriptor lookup and dereferences the return value
      unconditionally.
      
        1):
        __setup_irq
          irq_startup
            check_irq_resend  // activate softirq to handle resend irq
        2):
        irq_domain_free_irqs
          irq_free_descs
            free_desc
              call_rcu(&desc->rcu, delayed_free_desc)
        3):
        __do_softirq
          tasklet_action
            resend_irqs
              desc = irq_to_desc(irq)
              desc->handle_irq(desc)  // desc is NULL --> Ooops
      
      Fix this by adding a NULL pointer check in resend_irqs() before dereferencing
      the irq descriptor.
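
      The guarded lookup then looks roughly like this (close to, but not
      necessarily identical to, the upstream hunk):

      	static void resend_irqs(unsigned long arg)
      	{
      		struct irq_desc *desc;
      		int irq;

      		while (!bitmap_empty(irqs_resend, nr_irqs)) {
      			irq = find_first_bit(irqs_resend, nr_irqs);
      			clear_bit(irq, irqs_resend);
      			desc = irq_to_desc(irq);
      			if (!desc)	/* descriptor already freed, skip it */
      				continue;
      			local_irq_disable();
      			desc->handle_irq(desc);
      			local_irq_enable();
      		}
      	}
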
      
      Fixes: a4633adc ("[PATCH] genirq: add genirq sw IRQ-retrigger")
      Signed-off-by: NYunfeng Ye <yeyunfeng@huawei.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NZhiqiang Liu <liuzhiqiang26@huawei.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/1630ae13-5c8e-901e-de09-e740b6a426a7@huawei.com
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      991b3458
  12. 16 9月, 2019 1 次提交