1. 12 Oct, 2019 · 40 commits
    • arm64: Add sysfs vulnerability show for spectre-v1 · 047aac35
      Committed by Mian Yousaf Kaukab
      [ Upstream commit 3891ebccace188af075ce143d8b072b65e90f695 ]
      
      spectre-v1 has been mitigated and the mitigation is always active.
      Report this to userspace via sysfs.

      Signed-off-by: Mian Yousaf Kaukab <ykaukab@suse.de>
      Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
      Reviewed-by: Andre Przywara <andre.przywara@arm.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Tested-by: Stefan Wahren <stefan.wahren@i2se.com>
      Acked-by: Suzuki K Poulose <suzuki.poulose@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • arm64: fix SSBS sanitization · edfc0266
      Committed by Mark Rutland
      [ Upstream commit f54dada8274643e3ff4436df0ea124aeedc43cae ]
      
      In valid_user_regs() we treat SSBS as a RES0 bit, and consequently it is
      unexpectedly cleared when we restore a sigframe or fiddle with GPRs via
      ptrace.
      
      This patch fixes valid_user_regs() to account for this, updating the
      function to refer to the latest ARM ARM (ARM DDI 0487D.a). For AArch32
      tasks, SSBS appears in bit 23 of SPSR_EL1, matching its position in the
      AArch32-native PSR format, and we don't need to translate it as we have
      to for DIT.
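To illustrate why treating SSBS as RES0 loses the bit, here is a minimal C sketch. Only the bit position (bit 23 for AArch32 tasks) comes from the commit message; the mask names, the other mask bits, and sanitize_spsr() are hypothetical stand-ins, not the kernel's valid_user_regs() code.

```c
#include <stdint.h>

#define BIT_ULL(n)        (1ULL << (n))

/* From the commit message: SSBS is bit 23 of SPSR_EL1 for AArch32 tasks. */
#define AA32_SSBS_BIT     BIT_ULL(23)

/* Hypothetical RES0 masks; the only difference that matters is bit 23. */
#define AA32_RES0_BEFORE  (BIT_ULL(23) | BIT_ULL(22) | BIT_ULL(20))
#define AA32_RES0_AFTER   (BIT_ULL(22) | BIT_ULL(20))

/* Clear every bit the mask declares reserved, as a sanitizer would. */
static inline uint64_t sanitize_spsr(uint64_t spsr, uint64_t res0_mask)
{
	return spsr & ~res0_mask;
}
```

With the "before" mask, a task that had SSBS set comes back from sanitization with the bit cleared; with the "after" mask, the bit survives while the genuinely reserved bits are still cleared.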
      
      There are no other bit assignments that we need to account for today.
      As the recent documentation describes the DIT bit, we can drop our
      comment regarding DIT.
      
      While removing SSBS from the RES0 masks, existing inconsistent
      whitespace is corrected.
      
      Fixes: d71be2b6c0e19180 ("arm64: cpufeature: Detect SSBS and advertise to userspace")
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • arm64: docs: Document SSBS HWCAP · 09c22781
      Committed by Will Deacon
      [ Upstream commit ee91176120bd584aa10c564e7e9fdcaf397190a1 ]
      
      We advertise the MRS/MSR instructions for toggling SSBS at EL0 using an
      HWCAP, so document it along with the others.

      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: arm64: Set SCTLR_EL2.DSSBS if SSBD is forcefully disabled and !vhe · a59d42ac
      Committed by Will Deacon
      [ Upstream commit 7c36447ae5a090729e7b129f24705bb231a07e0b ]
      
      When running without VHE, it is necessary to set SCTLR_EL2.DSSBS if SSBD
      has been forcefully disabled on the kernel command-line.

      Acked-by: Christoffer Dall <christoffer.dall@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • arm64: ssbd: Add support for PSTATE.SSBS rather than trapping to EL3 · 1eaff33e
      Committed by Will Deacon
      [ Upstream commit 8f04e8e6e29c93421a95b61cad62e3918425eac7 ]
      
      On CPUs with support for PSTATE.SSBS, the kernel can toggle the SSBD
      state without needing to call into firmware.
      
      This patch hooks into the existing SSBD infrastructure so that SSBS is
      used on CPUs that support it, but it's all made horribly complicated by
      the very real possibility of big/little systems that don't uniformly
      provide the new capability.

      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • riscv: Avoid interrupts being erroneously enabled in handle_exception() · d286a374
      Committed by Vincent Chen
      [ Upstream commit c82dd6d078a2bb29d41eda032bb96d05699a524d ]
      
      When handle_exception() handles an exception, interrupts are
      unconditionally enabled once the context save finishes. However, this
      erroneously enables interrupts if they were disabled before entering
      handle_exception().

      For example, one of the WARN_ON() conditions is satisfied in the
      scheduler while interrupts are disabled and rq.lock is held. The WARN_ON()
      triggers a break exception, and handle_exception() enables interrupts
      before entering do_trap_break(). If a timer interrupt is pending at that
      point, it is taken as soon as interrupts are enabled. This may cause a
      deadlock if the timer ISR tries to take rq.lock again.
      
      Hence, the handle_exception() can only enable interrupts when the state of
      sstatus.SPIE is 1.
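The rule can be modelled in C as follows. This is only a sketch of the condition, not the kernel's entry code; may_reenable_irqs() is a hypothetical helper, and the bit values are the standard RISC-V sstatus positions for SIE (bit 1) and SPIE (bit 5).

```c
#include <stdbool.h>

#define SR_SIE   0x00000002UL  /* sstatus.SIE: supervisor interrupt enable */
#define SR_SPIE  0x00000020UL  /* sstatus.SPIE: value SIE had before the trap */

/*
 * Decide whether the exception handler may turn interrupts back on.
 * SPIE records whether interrupts were enabled when the trap was taken,
 * so they are re-enabled only if the interrupted context had them on.
 */
static inline bool may_reenable_irqs(unsigned long saved_sstatus)
{
	return (saved_sstatus & SR_SPIE) != 0;
}
```

In the WARN_ON() scenario above, the saved sstatus has SPIE clear, so the model returns false and the pending timer interrupt is not taken while rq.lock is held.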
      
      This patch was tested on a HiFive Unleashed board.

      Signed-off-by: Vincent Chen <vincent.chen@sifive.com>
      Reviewed-by: Palmer Dabbelt <palmer@sifive.com>
      [paul.walmsley@sifive.com: updated to apply]
      Fixes: bcae803a ("RISC-V: Enable IRQ during exception handling")
      Cc: David Abdurachmanov <david.abdurachmanov@sifive.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • perf stat: Reset previous counts on repeat with interval · 5b67a472
      Committed by Srikar Dronamraju
      [ Upstream commit b63fd11cced17fcb8e133def29001b0f6aaa5e06 ]
      
      When using 'perf stat' with the repeat and interval options, it shows
      wrong values for events.
      
      The wrong values will be shown for the first interval on the second and
      subsequent repetitions.
      
      Without the fix:
      
        # perf stat -r 3 -I 2000 -e faults -e sched:sched_switch -a sleep 5
      
           2.000282489                 53      faults
           2.000282489                513      sched:sched_switch
           4.005478208              3,721      faults
           4.005478208              2,666      sched:sched_switch
           5.025470933                395      faults
           5.025470933              1,307      sched:sched_switch
           2.009602825 1,84,46,74,40,73,70,95,47,520      faults 		<------
           2.009602825 1,84,46,74,40,73,70,95,49,568      sched:sched_switch  <------
           4.019612206              4,730      faults
           4.019612206              2,746      sched:sched_switch
           5.039615484              3,953      faults
           5.039615484              1,496      sched:sched_switch
           2.000274620 1,84,46,74,40,73,70,95,47,520      faults		<------
           2.000274620 1,84,46,74,40,73,70,95,47,520      sched:sched_switch	<------
           4.000480342              4,282      faults
           4.000480342              2,303      sched:sched_switch
           5.000916811              1,322      faults
           5.000916811              1,064      sched:sched_switch
        #
      
      prev_raw_counts is allocated when using intervals. This is used when
      calculating the difference in the counts of events when using interval.
      
      The current counts are stored in prev_raw_counts to calculate the
      differences in the next iteration.
      
      On the first interval of the second and subsequent repetitions,
      prev_raw_counts would be the values stored in the last interval of the
      previous repetitions, while the current counts will only be for the
      first interval of the current repetition.
      
      Hence events can show up as a huge number.
      
      Fix this by resetting prev_raw_counts whenever perf stat repeats the
      command.
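The huge values are what unsigned 64-bit subtraction produces when the stale previous count is larger than the current one. The sketch below assumes the delta is a plain u64 subtraction; interval_delta() and reset_prev() are hypothetical names, and the sample counts 53 and 4149 are made up so that the difference is -4096. Note that 18446744073709547520, the value printed in the broken output, is exactly 2^64 - 4096.

```c
#include <stdint.h>

/* Per-interval delta as computed from unsigned 64-bit counters. */
static inline uint64_t interval_delta(uint64_t cur, uint64_t prev)
{
	return cur - prev;	/* wraps modulo 2^64 when prev > cur */
}

/* Hypothetical reset hook, standing in for clearing prev_raw_counts. */
static inline void reset_prev(uint64_t *prev)
{
	*prev = 0;
}
```

Without the reset, interval_delta(53, 4149) wraps to 18446744073709547520; after reset_prev() the same first-interval reading yields 53.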
      
      With the fix:
      
        # perf stat -r 3 -I 2000 -e faults -e sched:sched_switch -a sleep 5
      
           2.019349347              2,597      faults
           2.019349347              2,753      sched:sched_switch
           4.019577372              3,098      faults
           4.019577372              2,532      sched:sched_switch
           5.019415481              1,879      faults
           5.019415481              1,356      sched:sched_switch
           2.000178813              8,468      faults
           2.000178813              2,254      sched:sched_switch
           4.000404621              7,440      faults
           4.000404621              1,266      sched:sched_switch
           5.040196079              2,458      faults
           5.040196079                556      sched:sched_switch
           2.000191939              6,870      faults
           2.000191939              1,170      sched:sched_switch
           4.000414103                541      faults
           4.000414103                902      sched:sched_switch
           5.000809863                450      faults
           5.000809863                364      sched:sched_switch
        #
      
      Committer notes:
      
      This was broken since the cset introducing the --interval feature, i.e.
      --repeat + --interval wasn't tested at that point, add the Fixes tag so
      that automatic scripts can pick this up.
      
      Fixes: 13370a9b ("perf stat: Add interval printing")
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Tested-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: stable@vger.kernel.org # v3.9+
      Link: http://lore.kernel.org/lkml/20190904094738.9558-2-srikar@linux.vnet.ibm.com
      [ Fixed up conflicts with libperf, i.e. some perf_{evsel,evlist} lost the 'perf' prefix ]
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • perf tools: Fix segfault in cpu_cache_level__read() · 15c57bf9
      Committed by Jiri Olsa
      [ Upstream commit 0216234c2eed1367a318daeb9f4a97d8217412a0 ]
      
      We release the wrong pointer on the error path in the
      cpu_cache_level__read() function, leading to a segfault:
      
        (gdb) r record ls
        Starting program: /root/perf/tools/perf/perf record ls
        ...
        [ perf record: Woken up 1 times to write data ]
        double free or corruption (out)
      
        Thread 1 "perf" received signal SIGABRT, Aborted.
        0x00007ffff7463798 in raise () from /lib64/power9/libc.so.6
        (gdb) bt
        #0  0x00007ffff7463798 in raise () from /lib64/power9/libc.so.6
        #1  0x00007ffff7443bac in abort () from /lib64/power9/libc.so.6
        #2  0x00007ffff74af8bc in __libc_message () from /lib64/power9/libc.so.6
        #3  0x00007ffff74b92b8 in malloc_printerr () from /lib64/power9/libc.so.6
        #4  0x00007ffff74bb874 in _int_free () from /lib64/power9/libc.so.6
        #5  0x0000000010271260 in __zfree (ptr=0x7fffffffa0b0) at ../../lib/zalloc..
        #6  0x0000000010139340 in cpu_cache_level__read (cache=0x7fffffffa090, cac..
        #7  0x0000000010143c90 in build_caches (cntp=0x7fffffffa118, size=<optimiz..
        ...
      
      Fix this by releasing the proper pointer.
      
      Fixes: 720e98b5 ("perf tools: Add perf data cache feature")
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Michael Petlan <mpetlan@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: stable@vger.kernel.org: # v4.6+
      Link: http://lore.kernel.org/lkml/20190912105235.10689-1-jolsa@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • tick: broadcast-hrtimer: Fix a race in bc_set_next · e5331c37
      Committed by Balasubramani Vivekanandan
      [ Upstream commit b9023b91dd020ad7e093baa5122b6968c48cc9e0 ]
      
      When a cpu requests broadcasting, before starting the tick broadcast
      hrtimer, bc_set_next() checks if the timer callback (bc_handler) is active
      using hrtimer_try_to_cancel(). But hrtimer_try_to_cancel() does not provide
      the required synchronization when the callback is active on another core.
      
      The callback could have already executed tick_handle_oneshot_broadcast()
      and could have also returned. But still there is a small time window where
      the hrtimer_try_to_cancel() returns -1. In that case bc_set_next() returns
      without doing anything, but the next_event of the tick broadcast clock
      device is already set to a timeout value.
      
      In the race condition diagram below, CPU #1 is running the timer callback
      and CPU #2 is entering idle state and so calls bc_set_next().
      
      In the worst case, the next_event will contain an expiry time, but the
      hrtimer will not be started which happens when the racing callback returns
      HRTIMER_NORESTART. The hrtimer might never recover if all further requests
      from the CPUs to subscribe to tick broadcast have timeout greater than the
      next_event of the tick broadcast clock device. This leads to cascading
      failures, which are finally noticed as RCU stall warnings.

      Here is a depiction of the race condition:
      
      CPU #1 (Running timer callback)                   CPU #2 (Enter idle
                                                        and subscribe to
                                                        tick broadcast)
      ---------------------                             ---------------------
      
      __run_hrtimer()                                   tick_broadcast_enter()
      
        bc_handler()                                      __tick_broadcast_oneshot_control()
      
          tick_handle_oneshot_broadcast()
      
            raw_spin_lock(&tick_broadcast_lock);
      
            dev->next_event = KTIME_MAX;                  //wait for tick_broadcast_lock
            //next_event for tick broadcast clock
            set to KTIME_MAX since no other cores
            subscribed to tick broadcasting
      
            raw_spin_unlock(&tick_broadcast_lock);
      
          if (dev->next_event == KTIME_MAX)
            return HRTIMER_NORESTART
          // callback function exits without
             restarting the hrtimer                      //tick_broadcast_lock acquired
                                                         raw_spin_lock(&tick_broadcast_lock);
      
                                                         tick_broadcast_set_event()
      
                                                           clockevents_program_event()
      
                                                             dev->next_event = expires;
      
                                                             bc_set_next()
      
                                                               hrtimer_try_to_cancel()
                                                               //returns -1 since the timer
                                                               callback is active. Exits without
                                                               restarting the timer
        cpu_base->running = NULL;
      
      The comment that hrtimer cannot be armed from within the callback is
      wrong. It is fine to start the hrtimer from within the callback. Also it is
      safe to start the hrtimer from the enter/exit idle code while the broadcast
      handler is active. The enter/exit idle code and the broadcast handler are
      synchronized using tick_broadcast_lock. So there is no need for the
      existing try to cancel logic. All this can be removed which will eliminate
      the race condition as well.
      
      Fixes: 5d1638ac ("tick: Introduce hrtimer based broadcast")
      Originally-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Balasubramani Vivekanandan <balasubramani_vivekanandan@mentor.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190926135101.12102-2-balasubramani_vivekanandan@mentor.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • tools lib traceevent: Do not free tep->cmdlines in add_new_comm() on failure · 140acbb0
      Committed by Steven Rostedt (VMware)
      [ Upstream commit e0d2615856b2046c2e8d5bfd6933f37f69703b0b ]
      
      If the re-allocation of tep->cmdlines succeeds, then the previous
      allocation of tep->cmdlines will be freed. If we later fail in
      add_new_comm(), we must not free cmdlines, and also should assign
      tep->cmdlines to the new allocation. Otherwise when freeing tep, the
      tep->cmdlines will be pointing to garbage.
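The ownership rule described above can be sketched in a few lines of C. This is a simplified illustration under assumed names (struct table, add_comm()), not the libtraceevent source: once realloc() succeeds, the new pointer is stored in the owner immediately, and a later failure returns without freeing it.

```c
#include <stdlib.h>
#include <string.h>

struct cmdline { char *comm; int pid; };
struct table { struct cmdline *cmdlines; int count; };

/* Append one entry; returns 0 on success, -1 on allocation failure. */
static inline int add_comm(struct table *t, const char *comm, int pid)
{
	struct cmdline *grown;
	size_t len = strlen(comm) + 1;
	char *copy;

	grown = realloc(t->cmdlines, sizeof(*grown) * (t->count + 1));
	if (!grown)
		return -1;	/* realloc failed: the old array is untouched */

	/* realloc succeeded: the old array may be gone, so publish the new one now. */
	t->cmdlines = grown;

	copy = malloc(len);
	if (!copy)
		return -1;	/* do not free(grown); t->cmdlines still owns it */
	memcpy(copy, comm, len);

	grown[t->count].comm = copy;
	grown[t->count].pid = pid;
	t->count++;
	return 0;
}
```

Because t->cmdlines is updated before anything else can fail, whatever later tears down the table always sees a valid pointer, whichever branch returned.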
      
      Fixes: a6d2a61a ("tools lib traceevent: Remove some die() calls")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: linux-trace-devel@vger.kernel.org
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20190828191819.970121417@goodmis.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • powerpc/book3s64/radix: Rename CPU_FTR_P9_TLBIE_BUG feature flag · d1e4b4cc
      Committed by Aneesh Kumar K.V
      commit 09ce98cacd51fcd0fa0af2f79d1e1d3192f4cbb0 upstream.
      
      Rename the #define to indicate this is related to the store vs tlbie
      ordering issue. In the next patch, we will be adding another feature
      flag that is used to handle the ERAT flush vs tlbie ordering issue.
      
      Fixes: a5d4b589 ("powerpc/mm: Fixup tlbie vs store ordering issue on POWER9")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190924035254.24612-2-aneesh.kumar@linux.ibm.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • powerpc/pseries: Fix cpu_hotplug_lock acquisition in resize_hpt() · f5f31a6e
      Committed by Gautham R. Shenoy
      [ Upstream commit c784be435d5dae28d3b03db31753dd7a18733f0c ]
      
      The calls to arch_add_memory()/arch_remove_memory() are always made
      with the read-side cpu_hotplug_lock acquired via memory_hotplug_begin().
      On pSeries, arch_add_memory()/arch_remove_memory() eventually call
      resize_hpt() which in turn calls stop_machine() which acquires the
      read-side cpu_hotplug_lock again, thereby resulting in the recursive
      acquisition of this lock.
      
      In the absence of CONFIG_PROVE_LOCKING, we hadn't observed a system
      lockup during a memory hotplug operation because cpus_read_lock() is a
      per-cpu rwsem read, which, in the fast-path (in the absence of the
      writer, which in our case is a CPU-hotplug operation) simply
      increments the read_count on the semaphore. Thus a recursive read in
      the fast-path doesn't cause any problems.
      
      However, we can hit this problem in practice if there is a concurrent
      CPU-Hotplug operation in progress which is waiting to acquire the
      write-side of the lock. This will cause the second recursive read to
      block until the writer finishes, while the writer itself is blocked
      because the first read still holds the lock. Thus neither the reader
      nor the writer makes any progress, blocking both CPU-Hotplug and
      Memory Hotplug operations.
      
      Memory-Hotplug				CPU-Hotplug
      CPU 0					CPU 1
      ------                                  ------
      
      1. down_read(cpu_hotplug_lock.rw_sem)
         [memory_hotplug_begin]
      					2. down_write(cpu_hotplug_lock.rw_sem)
      					[cpu_up/cpu_down]
      3. down_read(cpu_hotplug_lock.rw_sem)
         [stop_machine()]
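The three steps in the diagram can be replayed against a toy lock model in C. This is a deliberately simplified sketch (struct rwsem_model and both helpers are hypothetical, and it is not the percpu rwsem implementation); each helper returns false where the real call would block.

```c
#include <stdbool.h>

struct rwsem_model {
	int readers;          /* current read-side holders */
	bool writer_waiting;  /* a writer has queued for the lock */
};

/* A reader gets in only while no writer is queued. */
static inline bool model_down_read(struct rwsem_model *s)
{
	if (s->writer_waiting)
		return false;	/* would block behind the writer */
	s->readers++;
	return true;
}

/* A writer gets in only once all readers have left; otherwise it queues. */
static inline bool model_down_write(struct rwsem_model *s)
{
	if (s->readers > 0) {
		s->writer_waiting = true;
		return false;	/* would block until readers drain */
	}
	return true;
}
```

Step 1 succeeds, step 2 queues the writer behind the first reader, and step 3 then blocks behind the queued writer. The first reader can never release because it is the same task that is stuck in step 3, which is the deadlock lockdep reports.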
      
      Lockdep complains as follows in these code-paths.
      
       swapper/0/1 is trying to acquire lock:
       (____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: stop_machine+0x2c/0x60
      
      but task is already holding lock:
      (____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: mem_hotplug_begin+0x20/0x50
      
       other info that might help us debug this:
        Possible unsafe locking scenario:
      
              CPU0
              ----
         lock(cpu_hotplug_lock.rw_sem);
         lock(cpu_hotplug_lock.rw_sem);
      
        *** DEADLOCK ***
      
        May be due to missing lock nesting notation
      
       3 locks held by swapper/0/1:
        #0: (____ptrval____) (&dev->mutex){....}, at: __driver_attach+0x12c/0x1b0
        #1: (____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: mem_hotplug_begin+0x20/0x50
        #2: (____ptrval____) (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x54/0x1a0
      
      stack backtrace:
       CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc5-58373-gbc99402235f3-dirty #166
       Call Trace:
         dump_stack+0xe8/0x164 (unreliable)
         __lock_acquire+0x1110/0x1c70
         lock_acquire+0x240/0x290
         cpus_read_lock+0x64/0xf0
         stop_machine+0x2c/0x60
         pseries_lpar_resize_hpt+0x19c/0x2c0
         resize_hpt_for_hotplug+0x70/0xd0
         arch_add_memory+0x58/0xfc
         devm_memremap_pages+0x5e8/0x8f0
         pmem_attach_disk+0x764/0x830
         nvdimm_bus_probe+0x118/0x240
         really_probe+0x230/0x4b0
         driver_probe_device+0x16c/0x1e0
         __driver_attach+0x148/0x1b0
         bus_for_each_dev+0x90/0x130
         driver_attach+0x34/0x50
         bus_add_driver+0x1a8/0x360
         driver_register+0x108/0x170
         __nd_driver_register+0xd0/0xf0
         nd_pmem_driver_init+0x34/0x48
         do_one_initcall+0x1e0/0x45c
         kernel_init_freeable+0x540/0x64c
         kernel_init+0x2c/0x160
         ret_from_kernel_thread+0x5c/0x68
      
      Fix this issue by
        1) Requiring all the calls to pseries_lpar_resize_hpt() be made
           with cpu_hotplug_lock held.
      
        2) In pseries_lpar_resize_hpt() invoke stop_machine_cpuslocked()
           as a consequence of 1)
      
        3) To satisfy 1), in hpt_order_set(), call mmu_hash_ops.resize_hpt()
           with cpu_hotplug_lock held.
      
      Fixes: dbcf929c ("powerpc/pseries: Add support for hash table resizing")
      Cc: stable@vger.kernel.org # v4.11+
      Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/1557906352-29048-1-git-send-email-ego@linux.vnet.ibm.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • nbd: fix crash when the blksize is zero · c688982f
      Committed by Xiubo Li
      [ Upstream commit 553768d1169a48c0cd87c4eb4ab57534ee663415 ]
      
      This allows the blksize to be set to zero, in which case 1024 is used
      as the default.
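The fallback amounts to a one-line substitution, sketched here in C. The macro and function names are hypothetical; only the default of 1024 comes from the commit message.

```c
/* Default block size named in the commit message. */
#define DEF_BLKSIZE 1024UL

/* Map a requested block size of zero to the default; pass others through. */
static inline unsigned long effective_blksize(unsigned long requested)
{
	return requested ? requested : DEF_BLKSIZE;
}
```

Any code that later relies on the block size being non-zero then sees 1024 instead of 0.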
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Xiubo Li <xiubli@redhat.com>
      [fix to use goto out instead of return in genl_connect]
      Signed-off-by: Mike Christie <mchristi@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • KVM: nVMX: Fix consistency check on injected exception error code · 63bb8b76
      Committed by Sean Christopherson
      [ Upstream commit 567926cca99ba1750be8aae9c4178796bf9bb90b ]
      
      Current versions of Intel's SDM incorrectly state that "bits 31:15 of
      the VM-Entry exception error-code field" must be zero.  In reality, bits
      31:16 must be zero, i.e. error codes are 16-bit values.
      
      The bogus error code check manifests as an unexpected VM-Entry failure
      due to an invalid code field (error number 7) in L1, e.g. when injecting
      a #GP with error_code=0x9f00.
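The off-by-one in the bit range is easy to see with the example value. The sketch below is illustrative C, not the KVM source; GENMASK32() and both helper names are assumptions made for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask with bits h..l set (0 <= l <= h <= 31). */
#define GENMASK32(h, l)  ((~0U >> (31 - (h))) & (~0U << (l)))

/* Check as documented in the SDM: bits 31:15 must be zero. */
static inline bool error_code_ok_bits_31_15(uint32_t ec)
{
	return (ec & GENMASK32(31, 15)) == 0;
}

/* Check matching real behaviour: bits 31:16 must be zero. */
static inline bool error_code_ok_bits_31_16(uint32_t ec)
{
	return (ec & GENMASK32(31, 16)) == 0;
}
```

The value 0x9f00 has bit 15 set, so the 31:15 check rejects it even though it fits in 16 bits; the 31:16 check accepts it and still rejects anything wider, such as 0x10000.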
      
      Nadav previously reported the bug[*], both to KVM and Intel, and fixed
      the associated kvm-unit-test.
      
      [*] https://patchwork.kernel.org/patch/11124749/

      Reported-by: Nadav Amit <namit@vmware.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • KVM: PPC: Book3S HV: XIVE: Free escalation interrupts before disabling the VP · 34b13ff6
      Committed by Cédric Le Goater
      [ Upstream commit 237aed48c642328ff0ab19b63423634340224a06 ]
      
      When a vCPU is brought down, the XIVE VP (Virtual Processor) is first
      disabled and then the event notification queues are freed. When freeing
      the queues, we check for possible escalation interrupts and free them
      also.
      
      But when a XIVE VP is disabled, the underlying XIVE ENDs also are
      disabled in OPAL. When an END (Event Notification Descriptor) is
      disabled, its ESB pages (ESn and ESe) are disabled and loads return all
      1s. Which means that any access on the ESB page of the escalation
      interrupt will return invalid values.
      
      When an interrupt is freed, the shutdown handler computes a 'saved_p'
      field from the value returned by a load in xive_do_source_set_mask().
      This value is incorrect for escalation interrupts for the reason
      described above.
      
      This has no impact on Linux/KVM today because we don't make use of it,
      but future changes will introduce a xive_get_irqchip_state() handler.
      This handler will use the 'saved_p' field to return the state of an
      interrupt and, with 'saved_p' being incorrect, a softlockup will occur.

      Fix the vCPU cleanup sequence by first freeing the escalation interrupts,
      if any, then disabling the XIVE VP, and finally freeing the queues.
      
      Fixes: 90c73795afa2 ("KVM: PPC: Book3S HV: Add a new KVM device for the XIVE native exploitation mode")
      Fixes: 5af50993 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: Cédric Le Goater <clg@kaod.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190806172538.5087-1-clg@kaod.org
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • drm/radeon: Bail earlier when radeon.cik_/si_support=0 is passed · 1b155b4f
      Committed by Hans de Goede
      [ Upstream commit 9dbc88d013b79c62bd845cb9e7c0256e660967c5 ]
      
      Bail from the pci_driver probe function instead of from the drm_driver
      load function.
      
      This avoids /dev/dri/card0 temporarily getting registered and then
      unregistered again, sending unwanted add / remove udev events to
      userspace.
      
      Specifically this avoids triggering the (userspace) bug fixed by this
      plymouth merge-request:
      https://gitlab.freedesktop.org/plymouth/plymouth/merge_requests/59
      
      Note that despite that being a userspace bug, not sending unnecessary
      udev events is a good idea in general.
      
      BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1490490
      Reviewed-by: Michel Dänzer <mdaenzer@redhat.com>
      Signed-off-by: Hans de Goede <hdegoede@redhat.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • nfp: flower: fix memory leak in nfp_flower_spawn_vnic_reprs · 04e0c84f
      Committed by Navid Emamdoost
      [ Upstream commit 8ce39eb5a67aee25d9f05b40b673c95b23502e3e ]
      
      In the loop in nfp_flower_spawn_vnic_reprs(), memory is leaked if
      initialization or the allocations fail. Add the appropriate releases.
      
      Fixes: b9452452 ("nfp: flower: add per repr private data for LAG offload")
      Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com>
      Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • perf unwind: Fix libunwind build failure on i386 systems · 575a5bb3
      Committed by Arnaldo Carvalho de Melo
      [ Upstream commit 26acf400d2dcc72c7e713e1f55db47ad92010cc2 ]
      
      Naresh Kamboju reported that on the i386 build pr_err()
      doesn't get defined properly due to header ordering:
      
        perf-in.o: In function `libunwind__x86_reg_id':
        tools/perf/util/libunwind/../../arch/x86/util/unwind-libunwind.c:109:
        undefined reference to `pr_err'

      Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • kernel/elfcore.c: include proper prototypes · b0aaf65b
      Committed by Valdis Kletnieks
      [ Upstream commit 0f74914071ab7e7b78731ed62bf350e3a344e0a5 ]
      
      When building with W=1, gcc properly complains that there are no prototypes:
      
        CC      kernel/elfcore.o
      kernel/elfcore.c:7:17: warning: no previous prototype for 'elf_core_extra_phdrs' [-Wmissing-prototypes]
          7 | Elf_Half __weak elf_core_extra_phdrs(void)
            |                 ^~~~~~~~~~~~~~~~~~~~
      kernel/elfcore.c:12:12: warning: no previous prototype for 'elf_core_write_extra_phdrs' [-Wmissing-prototypes]
         12 | int __weak elf_core_write_extra_phdrs(struct coredump_params *cprm, loff_t offset)
            |            ^~~~~~~~~~~~~~~~~~~~~~~~~~
      kernel/elfcore.c:17:12: warning: no previous prototype for 'elf_core_write_extra_data' [-Wmissing-prototypes]
         17 | int __weak elf_core_write_extra_data(struct coredump_params *cprm)
            |            ^~~~~~~~~~~~~~~~~~~~~~~~~
      kernel/elfcore.c:22:15: warning: no previous prototype for 'elf_core_extra_data_size' [-Wmissing-prototypes]
         22 | size_t __weak elf_core_extra_data_size(void)
            |               ^~~~~~~~~~~~~~~~~~~~~~~~
      
      Provide the include file so gcc is happy, and we don't have potential code drift
      
      Link: http://lkml.kernel.org/r/29875.1565224705@turing-policeSigned-off-by: NValdis Kletnieks <valdis.kletnieks@vt.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      b0aaf65b
    • T
      perf build: Add detection of java-11-openjdk-devel package · bab46480
      Thomas Richter committed
      [ Upstream commit 815c1560bf8fd522b8d93a1d727868b910c1cc24 ]
      
      With Java 11 there is no separate JRE anymore.
      
      Details:
      
        https://coderanch.com/t/701603/java/JRE-JDK
      
      Therefore the detection of the JRE needs to be adapted.
      
      This change works for s390 and x86.  I have not tested other platforms.
      
      Committer testing:
      
      Continues to work with the OpenJDK 8:
      
        $ rm -f ~acme/lib64/libperf-jvmti.so
        $ rpm -qa | grep jdk-devel
        java-1.8.0-openjdk-devel-1.8.0.222.b10-0.fc30.x86_64
        $ git log --oneline -1
        a51937170f33 (HEAD -> perf/core) perf build: Add detection of java-11-openjdk-devel package
        $ rm -rf /tmp/build/perf ; mkdir -p /tmp/build/perf ; make -C tools/perf O=/tmp/build/perf install > /dev/null 2>1
        $ ls -la ~acme/lib64/libperf-jvmti.so
        -rwxr-xr-x. 1 acme acme 230744 Sep 24 16:46 /home/acme/lib64/libperf-jvmti.so
        $
      Suggested-by: Andreas Krebbel <krebbel@linux.ibm.com>
      Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hendrik Brueckner <brueckner@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Link: http://lore.kernel.org/lkml/20190909114116.50469-4-tmricht@linux.ibm.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      bab46480
    • K
      sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr() · 46ff0e2f
      KeMeng Shi committed
      [ Upstream commit 714e501e16cd473538b609b3e351b2cc9f7f09ed ]
      
      An oops can be triggered in the scheduler when running qemu on arm64:
      
       Unable to handle kernel paging request at virtual address ffff000008effe40
       Internal error: Oops: 96000007 [#1] SMP
       Process migration/0 (pid: 12, stack limit = 0x00000000084e3736)
       pstate: 20000085 (nzCv daIf -PAN -UAO)
       pc : __ll_sc___cmpxchg_case_acq_4+0x4/0x20
       lr : move_queued_task.isra.21+0x124/0x298
       ...
       Call trace:
        __ll_sc___cmpxchg_case_acq_4+0x4/0x20
        __migrate_task+0xc8/0xe0
        migration_cpu_stop+0x170/0x180
        cpu_stopper_thread+0xec/0x178
        smpboot_thread_fn+0x1ac/0x1e8
        kthread+0x134/0x138
        ret_from_fork+0x10/0x18
      
      __set_cpus_allowed_ptr() chooses an active dest_cpu in the affinity mask to
      migrate the process to if the process is not currently running on any of the
      CPUs specified in the affinity mask. __set_cpus_allowed_ptr() will choose an
      invalid dest_cpu (dest_cpu >= nr_cpu_ids, 1024 in my virtual machine) if the
      CPUs in the affinity mask are deactivated by cpu_down after the
      cpumask_intersects check. The subsequent cpumask_test_cpu() of dest_cpu then
      reads beyond the end of the mask and may pass if the corresponding bit
      happens to be set. As a consequence, the kernel accesses an invalid rq
      address associated with the invalid CPU in
      migration_cpu_stop->__migrate_task->move_queued_task and the Oops occurs.
      
      To reproduce the crash:
      
        1) A process repeatedly binds itself to cpu0 and cpu1 in turn by calling
        sched_setaffinity.
      
        2) A shell script repeatedly does "echo 0 > /sys/devices/system/cpu/cpu1/online"
        and "echo 1 > /sys/devices/system/cpu/cpu1/online" in turn.
      
        3) Oops appears if the invalid CPU is set in memory after tested cpumask.
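
The failure mode described above can be sketched in user-space C. This is only an illustration under stated assumptions: `TOY_NR_CPUS`, `toy_any_and()` and `pick_dest_cpu()` are hypothetical names, and a plain `unsigned int` stands in for a CPU mask. The point it shows is that a "find any CPU in both masks" search returns an out-of-range value when the intersection is empty, and that value has to be rejected before it is used as an index.

```c
#include <errno.h>

#define TOY_NR_CPUS 8u  /* hypothetical stand-in for nr_cpu_ids */

/*
 * Return the lowest CPU set in both masks, or TOY_NR_CPUS if the
 * intersection is empty (an out-of-range "not found" value).
 */
unsigned int toy_any_and(unsigned int a, unsigned int b)
{
	unsigned int both = a & b;
	unsigned int cpu;

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (both & (1u << cpu))
			return cpu;
	return TOY_NR_CPUS;
}

/*
 * Pick a destination CPU and reject the "not found" value instead of
 * letting it index past the end of any per-CPU data.
 */
int pick_dest_cpu(unsigned int active_mask, unsigned int new_mask)
{
	unsigned int dest_cpu = toy_any_and(active_mask, new_mask);

	if (dest_cpu >= TOY_NR_CPUS)
		return -EINVAL;
	return (int)dest_cpu;
}
```

With CPU 1 taken offline (`active_mask = 0x1`) and an affinity of only CPU 1 (`new_mask = 0x2`), `pick_dest_cpu()` returns `-EINVAL` rather than an index that would be used out of bounds.
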
      Signed-off-by: KeMeng Shi <shikemeng@huawei.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/1568616808-16808-1-git-send-email-shikemeng@huawei.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      46ff0e2f
    • M
      sched/membarrier: Fix private expedited registration check · 6cb7aa1b
      Mathieu Desnoyers committed
      [ Upstream commit fc0d77387cb5ae883fd774fc559e056a8dde024c ]
      
      Fix a logic flaw in the way membarrier_register_private_expedited()
      handles ready state checks for private expedited sync core and private
      expedited registrations.
      
      If a private expedited membarrier registration is first performed, and
      then a private expedited sync_core registration is performed, the ready
      state check will skip the second registration when it really should not.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190919173705.2181-2-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      6cb7aa1b
    • M
      sched/membarrier: Call sync_core only before usermode for same mm · e250f2b6
      Mathieu Desnoyers committed
      [ Upstream commit 2840cf02fae627860156737e83326df354ee4ec6 ]
      
      When the prev and next task's mm change, switch_mm() provides the core
      serializing guarantees before returning to usermode. The only case
      where an explicit core serialization is needed is when the scheduler
      keeps the same mm for prev and next.
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190919173705.2181-4-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      e250f2b6
    • N
      libnvdimm/nfit_test: Fix acpi_handle redefinition · 9f33b178
      Nathan Chancellor committed
      [ Upstream commit 59f08896f058a92f03a0041b397a1a227c5e8529 ]
      
      After commit 62974fc389b3 ("libnvdimm: Enable unit test infrastructure
      compile checks"), clang warns:
      
      In file included from
      ../drivers/nvdimm/../../tools/testing/nvdimm/test/iomap.c:15:
      ../drivers/nvdimm/../../tools/testing/nvdimm/test/nfit_test.h:206:15:
      warning: redefinition of typedef 'acpi_handle' is a C11 feature
      [-Wtypedef-redefinition]
      typedef void *acpi_handle;
                    ^
      ../include/acpi/actypes.h:424:15: note: previous definition is here
      typedef void *acpi_handle;      /* Actually a ptr to a NS Node */
                    ^
      1 warning generated.
      
      The include chain:
      
      iomap.c ->
          linux/acpi.h ->
              acpi/acpi.h ->
                  acpi/actypes.h
          nfit_test.h
      
      Avoid this by including linux/acpi.h in nfit_test.h, which allows us to
      remove both the typedef and the forward declaration of acpi_object.
      
      Link: https://github.com/ClangBuiltLinux/linux/issues/660
      Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Link: https://lore.kernel.org/r/20190918042148.77553-1-natechancellor@gmail.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      9f33b178
    • Z
      fuse: fix memleak in cuse_channel_open · 7b4f541f
      zhengbin committed
      [ Upstream commit 9ad09b1976c562061636ff1e01bfc3a57aebe56b ]
      
      If cuse_send_init() fails, we need to call fuse_conn_put() on cc->fc.
      
      cuse_channel_open->fuse_conn_init->refcount_set(&fc->count, 1)
                       ->fuse_dev_alloc->fuse_conn_get
                       ->fuse_dev_free->fuse_conn_put
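
The reference counting in the call chain above can be walked through with a toy model. This is a user-space sketch, not the fuse code: `struct toy_conn` and the `toy_*` helpers are hypothetical, and the `with_fix` flag only marks where the extra put would go on the error path.

```c
/* Hypothetical stand-in for a reference-counted connection object. */
struct toy_conn {
	int count;  /* current reference count */
	int freed;  /* set to 1 once the last reference is dropped */
};

void toy_conn_init(struct toy_conn *fc)
{
	fc->count = 1;  /* initial reference held by the opener */
	fc->freed = 0;
}

void toy_conn_get(struct toy_conn *fc)
{
	fc->count++;
}

void toy_conn_put(struct toy_conn *fc)
{
	if (--fc->count == 0)
		fc->freed = 1;
}

/*
 * Model the error path of an open() whose init request fails.
 * Returns 1 if the connection ended up released, 0 if it leaked.
 */
int toy_open_fail(int with_fix)
{
	struct toy_conn fc;

	toy_conn_init(&fc);          /* count = 1 */
	toy_conn_get(&fc);           /* device allocation takes a ref: count = 2 */

	/* ... sending the init request fails here ... */

	toy_conn_put(&fc);           /* device free drops its ref: count = 1 */
	if (with_fix)
		toy_conn_put(&fc);   /* drop the initial ref too: count = 0 */

	return fc.freed;
}
```

Without the second put, the count stays at 1 and the object is never released, which is the leak the commit message describes.
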
      
      Fixes: cc080e9e ("fuse: introduce per-instance fuse_dev structure")
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: zhengbin <zhengbin13@huawei.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      7b4f541f
    • A
      libnvdimm/region: Initialize bad block for volatile namespaces · 2e93d24a
      Aneesh Kumar K.V committed
      [ Upstream commit c42adf87e4e7ed77f6ffe288dc90f980d07d68df ]
      
      We check for bad blocks during namespace init, and that check uses the
      region bad block list. We need to initialize the bad block list
      for volatile regions for this to work. We also observe the lockdep
      warning below because the lock is not initialized correctly,
      since we skip bad block init for volatile regions.
      
       INFO: trying to register non-static key.
       the code is fine but needs lockdep annotation.
       turning off the locking correctness validator.
       CPU: 2 PID: 1 Comm: swapper/0 Not tainted 5.3.0-rc1-15699-g3dee241c937e #149
       Call Trace:
       [c0000000f95cb250] [c00000000147dd84] dump_stack+0xe8/0x164 (unreliable)
       [c0000000f95cb2a0] [c00000000022ccd8] register_lock_class+0x308/0xa60
       [c0000000f95cb3a0] [c000000000229cc0] __lock_acquire+0x170/0x1ff0
       [c0000000f95cb4c0] [c00000000022c740] lock_acquire+0x220/0x270
       [c0000000f95cb580] [c000000000a93230] badblocks_check+0xc0/0x290
       [c0000000f95cb5f0] [c000000000d97540] nd_pfn_validate+0x5c0/0x7f0
       [c0000000f95cb6d0] [c000000000d98300] nd_dax_probe+0xd0/0x1f0
       [c0000000f95cb760] [c000000000d9b66c] nd_pmem_probe+0x10c/0x160
       [c0000000f95cb790] [c000000000d7f5ec] nvdimm_bus_probe+0x10c/0x240
       [c0000000f95cb820] [c000000000d0f844] really_probe+0x254/0x4e0
       [c0000000f95cb8b0] [c000000000d0fdfc] driver_probe_device+0x16c/0x1e0
       [c0000000f95cb930] [c000000000d10238] device_driver_attach+0x68/0xa0
       [c0000000f95cb970] [c000000000d1040c] __driver_attach+0x19c/0x1c0
       [c0000000f95cb9f0] [c000000000d0c4c4] bus_for_each_dev+0x94/0x130
       [c0000000f95cba50] [c000000000d0f014] driver_attach+0x34/0x50
       [c0000000f95cba70] [c000000000d0e208] bus_add_driver+0x178/0x2f0
       [c0000000f95cbb00] [c000000000d117c8] driver_register+0x108/0x170
       [c0000000f95cbb70] [c000000000d7edb0] __nd_driver_register+0xe0/0x100
       [c0000000f95cbbd0] [c000000001a6baa4] nd_pmem_driver_init+0x34/0x48
       [c0000000f95cbbf0] [c0000000000106f4] do_one_initcall+0x1d4/0x4b0
       [c0000000f95cbcd0] [c0000000019f499c] kernel_init_freeable+0x544/0x65c
       [c0000000f95cbdb0] [c000000000010d6c] kernel_init+0x2c/0x180
       [c0000000f95cbe20] [c00000000000b954] ret_from_kernel_thread+0x5c/0x68
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Link: https://lore.kernel.org/r/20190919083355.26340-1-aneesh.kumar@linux.ibm.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      2e93d24a
    • S
      thermal_hwmon: Sanitize thermal_zone type · 9025adf3
      Stefan Mavrodiev committed
      [ Upstream commit 8c7aa184281c01fc26f319059efb94725012921d ]
      
      When calling thermal_add_hwmon_sysfs(), the device type is sanitized by
      replacing '-' with '_'. However tz->type remains unsanitized. Thus
      calling thermal_hwmon_lookup_by_type() returns no device. And if there is
      no device, thermal_remove_hwmon_sysfs() fails with "hwmon device lookup
      failed!".
      
      The result is unregistered hwmon devices in sysfs.
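
The mismatch can be reproduced with a small user-space sketch. The names `toy_sanitize()` and `toy_lookup_matches()` are hypothetical and the 32-byte buffers are an arbitrary choice for the example; the only behavior taken from the text above is that the registered name has '-' replaced with '_' while the lookup key may not.

```c
#include <stdio.h>
#include <string.h>

/* Replace every '-' with '_' in place. */
void toy_sanitize(char *s)
{
	for (; *s; s++)
		if (*s == '-')
			*s = '_';
}

/*
 * Register a device under the sanitized form of tz_type, then look it
 * up again. Returns 1 if the lookup key matches the registered name.
 */
int toy_lookup_matches(const char *tz_type, int sanitize_key)
{
	char registered[32];
	char key[32];

	snprintf(registered, sizeof(registered), "%s", tz_type);
	toy_sanitize(registered);       /* done at registration time */

	snprintf(key, sizeof(key), "%s", tz_type);
	if (sanitize_key)
		toy_sanitize(key);      /* sanitize the lookup key as well */

	return strcmp(registered, key) == 0;
}
```

A type such as "cpu-thermal" is found only when the key is sanitized too; a type without '-' matches either way, which is why the problem shows up only for some zones.
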
      
      Fixes: 409ef0ba ("thermal_hwmon: Sanitize attribute name passed to hwmon")
      Signed-off-by: Stefan Mavrodiev <stefan@olimex.com>
      Signed-off-by: Zhang Rui <rui.zhang@intel.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      9025adf3
    • I
      thermal: Fix use-after-free when unregistering thermal zone device · c01a9dbe
      Ido Schimmel committed
      [ Upstream commit 1851799e1d2978f68eea5d9dff322e121dcf59c1 ]
      
      thermal_zone_device_unregister() cancels the delayed work that polls the
      thermal zone, but it does not wait for it to finish. This is racy with
      respect to the freeing of the thermal zone device, which can result in a
      use-after-free [1].
      
      Fix this by waiting for the delayed work to finish before freeing the
      thermal zone device. Note that thermal_zone_device_set_polling() is
      never invoked from an atomic context, so it is safe to call
      cancel_delayed_work_sync() that can block.
      
      [1]
      [  +0.002221] ==================================================================
      [  +0.000064] BUG: KASAN: use-after-free in __mutex_lock+0x1076/0x11c0
      [  +0.000016] Read of size 8 at addr ffff8881e48e0450 by task kworker/1:0/17
      
      [  +0.000023] CPU: 1 PID: 17 Comm: kworker/1:0 Not tainted 5.2.0-rc6-custom-02495-g8e73ca3be4af #1701
      [  +0.000010] Hardware name: Mellanox Technologies Ltd. MSN2100-CB2FO/SA001017, BIOS 5.6.5 06/07/2016
      [  +0.000016] Workqueue: events_freezable_power_ thermal_zone_device_check
      [  +0.000012] Call Trace:
      [  +0.000021]  dump_stack+0xa9/0x10e
      [  +0.000020]  print_address_description.cold.2+0x9/0x25e
      [  +0.000018]  __kasan_report.cold.3+0x78/0x9d
      [  +0.000016]  kasan_report+0xe/0x20
      [  +0.000016]  __mutex_lock+0x1076/0x11c0
      [  +0.000014]  step_wise_throttle+0x72/0x150
      [  +0.000018]  handle_thermal_trip+0x167/0x760
      [  +0.000019]  thermal_zone_device_update+0x19e/0x5f0
      [  +0.000019]  process_one_work+0x969/0x16f0
      [  +0.000017]  worker_thread+0x91/0xc40
      [  +0.000014]  kthread+0x33d/0x400
      [  +0.000015]  ret_from_fork+0x3a/0x50
      
      [  +0.000020] Allocated by task 1:
      [  +0.000015]  save_stack+0x19/0x80
      [  +0.000015]  __kasan_kmalloc.constprop.4+0xc1/0xd0
      [  +0.000014]  kmem_cache_alloc_trace+0x152/0x320
      [  +0.000015]  thermal_zone_device_register+0x1b4/0x13a0
      [  +0.000015]  mlxsw_thermal_init+0xc92/0x23d0
      [  +0.000014]  __mlxsw_core_bus_device_register+0x659/0x11b0
      [  +0.000013]  mlxsw_core_bus_device_register+0x3d/0x90
      [  +0.000013]  mlxsw_pci_probe+0x355/0x4b0
      [  +0.000014]  local_pci_probe+0xc3/0x150
      [  +0.000013]  pci_device_probe+0x280/0x410
      [  +0.000013]  really_probe+0x26a/0xbb0
      [  +0.000013]  driver_probe_device+0x208/0x2e0
      [  +0.000013]  device_driver_attach+0xfe/0x140
      [  +0.000013]  __driver_attach+0x110/0x310
      [  +0.000013]  bus_for_each_dev+0x14b/0x1d0
      [  +0.000013]  driver_register+0x1c0/0x400
      [  +0.000015]  mlxsw_sp_module_init+0x5d/0xd3
      [  +0.000014]  do_one_initcall+0x239/0x4dd
      [  +0.000013]  kernel_init_freeable+0x42b/0x4e8
      [  +0.000012]  kernel_init+0x11/0x18b
      [  +0.000013]  ret_from_fork+0x3a/0x50
      
      [  +0.000015] Freed by task 581:
      [  +0.000013]  save_stack+0x19/0x80
      [  +0.000014]  __kasan_slab_free+0x125/0x170
      [  +0.000013]  kfree+0xf3/0x310
      [  +0.000013]  thermal_release+0xc7/0xf0
      [  +0.000014]  device_release+0x77/0x200
      [  +0.000014]  kobject_put+0x1a8/0x4c0
      [  +0.000014]  device_unregister+0x38/0xc0
      [  +0.000014]  thermal_zone_device_unregister+0x54e/0x6a0
      [  +0.000014]  mlxsw_thermal_fini+0x184/0x35a
      [  +0.000014]  mlxsw_core_bus_device_unregister+0x10a/0x640
      [  +0.000013]  mlxsw_devlink_core_bus_device_reload+0x92/0x210
      [  +0.000015]  devlink_nl_cmd_reload+0x113/0x1f0
      [  +0.000014]  genl_family_rcv_msg+0x700/0xee0
      [  +0.000013]  genl_rcv_msg+0xca/0x170
      [  +0.000013]  netlink_rcv_skb+0x137/0x3a0
      [  +0.000012]  genl_rcv+0x29/0x40
      [  +0.000013]  netlink_unicast+0x49b/0x660
      [  +0.000013]  netlink_sendmsg+0x755/0xc90
      [  +0.000013]  __sys_sendto+0x3de/0x430
      [  +0.000013]  __x64_sys_sendto+0xe2/0x1b0
      [  +0.000013]  do_syscall_64+0xa4/0x4d0
      [  +0.000013]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      [  +0.000017] The buggy address belongs to the object at ffff8881e48e0008
                     which belongs to the cache kmalloc-2k of size 2048
      [  +0.000012] The buggy address is located 1096 bytes inside of
                     2048-byte region [ffff8881e48e0008, ffff8881e48e0808)
      [  +0.000007] The buggy address belongs to the page:
      [  +0.000012] page:ffffea0007923800 refcount:1 mapcount:0 mapping:ffff88823680d0c0 index:0x0 compound_mapcount: 0
      [  +0.000020] flags: 0x200000000010200(slab|head)
      [  +0.000019] raw: 0200000000010200 ffffea0007682008 ffffea00076ab808 ffff88823680d0c0
      [  +0.000016] raw: 0000000000000000 00000000000d000d 00000001ffffffff 0000000000000000
      [  +0.000007] page dumped because: kasan: bad access detected
      
      [  +0.000012] Memory state around the buggy address:
      [  +0.000012]  ffff8881e48e0300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  +0.000012]  ffff8881e48e0380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  +0.000012] >ffff8881e48e0400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  +0.000008]                                                  ^
      [  +0.000012]  ffff8881e48e0480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  +0.000012]  ffff8881e48e0500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  +0.000007] ==================================================================
      
      Fixes: b1569e99 ("ACPI: move thermal trip handling to generic thermal layer")
      Reported-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Zhang Rui <rui.zhang@intel.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      c01a9dbe
    • S
      ntb: point to right memory window index · 55ebeb4e
      Sanjay R Mehta committed
      [ Upstream commit ae89339b08f3fe02457ec9edd512ddc3d246d0f8 ]
      
      The second parameter of ntb_peer_mw_get_addr() points to the wrong memory
      window index because "peer gidx" is passed instead of "local gidx".
      
      For example, suppose the "local gidx" value is '0' and the "peer gidx" value is '1'.
      
      On the peer side, the ntb_mw_set_trans() API is used as below with gidx pointing to
      the local side gidx, which is '0', so memory window '0' is chosen and XLAT '0'
      will be programmed by the peer side.
      
          ntb_mw_set_trans(perf->ntb, peer->pidx, peer->gidx, peer->inbuf_xlat,
                          peer->inbuf_size);
      
      Now, on the local side, ntb_peer_mw_get_addr() is used as below with gidx
      pointing to "peer gidx", which is '1', so it points to memory window '1'
      instead of memory window '0'.
      
          ntb_peer_mw_get_addr(perf->ntb,  peer->gidx, &phys_addr,
                              &peer->outbuf_size);
      
      So this patch passes "local gidx" as the parameter to ntb_peer_mw_get_addr().
      Signed-off-by: Sanjay R Mehta <sanju.mehta@amd.com>
      Signed-off-by: Jon Mason <jdmason@kudzu.us>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      55ebeb4e
    • A
      x86/purgatory: Disable the stackleak GCC plugin for the purgatory · 9dabade5
      Arvind Sankar committed
      [ Upstream commit ca14c996afe7228ff9b480cf225211cc17212688 ]
      
      Since commit:
      
        b059f801a937 ("x86/purgatory: Use CFLAGS_REMOVE rather than reset KBUILD_CFLAGS")
      
      kexec breaks if GCC_PLUGIN_STACKLEAK=y is enabled, as the purgatory
      contains undefined references to stackleak_track_stack.
      
      Attempting to load a kexec kernel results in this failure:
      
        kexec: Undefined symbol: stackleak_track_stack
        kexec-bzImage64: Loading purgatory failed
      
      Fix this by disabling the stackleak plugin for the purgatory.
      Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
      Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: b059f801a937 ("x86/purgatory: Use CFLAGS_REMOVE rather than reset KBUILD_CFLAGS")
      Link: https://lkml.kernel.org/r/20190923171753.GA2252517@rani.riverdale.lan
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      9dabade5
    • F
      pwm: stm32-lp: Add check in case requested period cannot be achieved · 65348659
      Fabrice Gasnier committed
      [ Upstream commit c91e3234c6035baf5a79763cb4fcd5d23ce75c2b ]
      
      LPTimer can use a 32 kHz clock for counting, depending on the clock tree
      configuration. In such a case, the PWM output frequency range is limited.
      Although unlikely, nothing prevents the user from requesting a PWM frequency
      above the counting clock (32 kHz for instance):
      - This causes (prd - 1) = 0xffff to be written to the ARR register later in
      the apply() routine.
      This results in a badly configured PWM period (and also duty_cycle).
      Add a check to report an error in such a case.
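
The arithmetic behind the 0xffff value can be sketched as follows. This is an illustration under assumptions, not the driver code: `toy_period_ticks()` and `toy_arr_value()` are hypothetical helpers, the period is taken to be converted to counter ticks as rate * period / 1e9, and the reload register is modeled as 16 bits wide.

```c
#include <errno.h>
#include <stdint.h>

#define TOY_NSEC_PER_SEC 1000000000ULL

/*
 * Convert a requested period in nanoseconds into counter ticks.
 * Returns -EINVAL when the clock is too slow to count even one tick
 * within the requested period.
 */
int toy_period_ticks(uint64_t clk_rate_hz, uint64_t period_ns, uint64_t *ticks)
{
	uint64_t div = clk_rate_hz * period_ns / TOY_NSEC_PER_SEC;

	if (div == 0)
		return -EINVAL;
	*ticks = div;
	return 0;
}

/* Value a 16-bit reload register would receive for (ticks - 1). */
uint16_t toy_arr_value(uint64_t ticks)
{
	return (uint16_t)(ticks - 1);
}
```

With a 32768 Hz clock and a 10000 ns period (100 kHz), the tick count truncates to 0; without the check, `toy_arr_value(0)` wraps to 0xffff, the badly configured period described above.
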
      Signed-off-by: Fabrice Gasnier <fabrice.gasnier@st.com>
      Reviewed-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Signed-off-by: Thierry Reding <thierry.reding@gmail.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      65348659
    • T
      pNFS: Ensure we do clear the return-on-close layout stateid on fatal errors · 19b1c70e
      Trond Myklebust committed
      [ Upstream commit 9c47b18cf722184f32148784189fca945a7d0561 ]
      
      If the server rejected our layout return with a state error such as
      NFS4ERR_BAD_STATEID, or even a stale inode error, then we do want
      to clear out all the remaining layout segments and mark that stateid
      as invalid.
      
      Fixes: 1c5bd76d ("pNFS: Enable layoutreturn operation for...")
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      19b1c70e
    • T
      drm/amdgpu: Check for valid number of registers to read · 1c70ae6a
      Trek committed
      [ Upstream commit 73d8e6c7b841d9bf298c8928f228fb433676635c ]
      
      Do not try to allocate any amount of memory requested by the user.
      Instead limit it to 128 registers. Actually the longest series of
      consecutive allowed registers is 48: mmGB_TILE_MODE0-31 and
      mmGB_MACROTILE_MODE0-15 (0x2644-0x2673).
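
A minimal sketch of this kind of bound check, assuming a user-supplied register count that sizes a heap buffer. `toy_read_regs()` and `TOY_MAX_READ_REGS` are hypothetical names, and the rejection of a zero count is an extra guard of this sketch only, not something the commit message states.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_MAX_READ_REGS 128u  /* cap taken from the text above */

/*
 * Validate the requested register count before allocating.
 * On success *out points to a zeroed buffer the caller must free().
 */
int toy_read_regs(uint32_t count, uint32_t **out)
{
	/* The zero check is an assumption made for this sketch. */
	if (count == 0 || count > TOY_MAX_READ_REGS)
		return -EINVAL;

	*out = calloc(count, sizeof(**out));
	if (!*out)
		return -ENOMEM;
	return 0;
}
```

The range 0x2644-0x2673 is inclusive, so it covers 0x2673 - 0x2644 + 1 = 48 registers (32 tile modes plus 16 macrotile modes), comfortably below the cap of 128.
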
      
      Bug: https://bugs.freedesktop.org/show_bug.cgi?id=111273
      Signed-off-by: Trek <trek00@inbox.ru>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      1c70ae6a
    • F
      drm/amdgpu: Fix KFD-related kernel oops on Hawaii · e0af3b19
      Felix Kuehling committed
      [ Upstream commit dcafbd50f2e4d5cc964aae409fb5691b743fba23 ]
      
      Hawaii needs to flush caches explicitly, submitting an IB in a user
      VMID from kernel mode. There is no s_fence in this case.
      
      Fixes: eb3961a5 ("drm/amdgpu: remove fence context from the job")
      Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Reviewed-by: Christian König <christian.koenig@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      e0af3b19
    • F
      netfilter: nf_tables: allow lookups in dynamic sets · f7ace7f2
      Florian Westphal committed
      [ Upstream commit acab713177377d9e0889c46bac7ff0cfb9a90c4d ]
      
      This un-breaks lookups in sets that have the 'dynamic' flag set.
      Given this active example configuration:
      
      table filter {
        set set1 {
          type ipv4_addr
          size 64
          flags dynamic,timeout
          timeout 1m
        }
      
        chain input {
           type filter hook input priority 0; policy accept;
        }
      }
      
      ... this works:
      nft add rule ip filter input add @set1 { ip saddr }
      
      -> whenever rule is triggered, the source ip address is inserted
      into the set (if it did not exist).
      
      This won't work:
      nft add rule ip filter input ip saddr @set1 counter
      Error: Could not process rule: Operation not supported
      
      In other words, we can add entries to the set, but then can't make
      matching decision based on that set.
      
      That is just wrong -- all set backends support lookups (else they would
      not be very useful).
      The failure comes from an explicit rejection in nft_lookup.c.
      
      Looking at the history, it seems like NFT_SET_EVAL used to mean
      'set contains expressions' (aka. "is a meter"), for instance something like
      
       nft add rule ip filter input meter example { ip saddr limit rate 10/second }
       or
       nft add rule ip filter input meter example { ip saddr counter }
      
      The actual meaning of NFT_SET_EVAL however, is
      'set can be updated from the packet path'.
      
      'meters' and packet-path insertions into sets, such as
      'add @set { ip saddr }' use exactly the same kernel code (nft_dynset.c)
      and thus require a set backend that provides the ->update() function.
      
      The only set that provides this also is the only one that has the
      NFT_SET_EVAL feature flag.
      
      Removing the wrong check makes the above example work.
      While at it, also fix the flag check during set instantiation to
      allow supported combinations only.
      
      Fixes: 8aeff920 ("netfilter: nf_tables: add stateful object reference to set elements")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      f7ace7f2
    • R
      watchdog: aspeed: Add support for AST2600 · f217883b
      Ryan Chen committed
      [ Upstream commit b3528b4874480818e38e4da019d655413c233e6a ]
      
      The ast2600 can be supported by the same code as the ast2500.
      Signed-off-by: Ryan Chen <ryan_chen@aspeedtech.com>
      Signed-off-by: Joel Stanley <joel@jms.id.au>
      Reviewed-by: Guenter Roeck <linux@roeck-us.net>
      Link: https://lore.kernel.org/r/20190819051738.17370-3-joel@jms.id.au
      Signed-off-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      f217883b
    • E
      ceph: reconnect connection if session hang in opening state · 520c2a64
      Erqi Chen committed
      [ Upstream commit 71a228bc8d65900179e37ac309e678f8c523f133 ]
      
      If a client MDS session is evicted in the CEPH_MDS_SESSION_OPENING state,
      the MDS won't send a session msg to the client, and delayed_work skips
      sessions in the CEPH_MDS_SESSION_OPENING state, so the session hangs forever.
      
      Allow ceph_con_keepalive to reconnect a session in OPENING to avoid
      session hang. Also, ensure that we skip sessions in RESTARTING and
      REJECTED states since those states can't be resurrected by issuing
      a keepalive.
      
      Link: https://tracker.ceph.com/issues/41551
      Signed-off-by: Erqi Chen <chenerqi@gmail.com>
      Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      520c2a64
    • L
      ceph: fix directories inode i_blkbits initialization · 0275113f
      Luis Henriques committed
      [ Upstream commit 750670341a24cb714e624e0fd7da30900ad93752 ]
      
      When filling an inode with info from the MDS, i_blkbits is being
      initialized using fl_stripe_unit, which contains the stripe unit in
      bytes.  Unfortunately, this doesn't make sense for directories as they
      have fl_stripe_unit set to '0'.  This means that i_blkbits will be set
      to 0xff, causing an UBSAN undefined behaviour in i_blocksize():
      
        UBSAN: Undefined behaviour in ./include/linux/fs.h:731:12
        shift exponent 255 is too large for 32-bit type 'int'
      
      Fix this by initializing i_blkbits to CEPH_BLOCK_SHIFT if fl_stripe_unit
      is zero.
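
How a zero stripe unit turns into 0xff can be shown with a short sketch. It assumes the shift is derived as "index of the highest set bit" (fls(x) - 1) and stored in an 8-bit field; `toy_fls()`, `toy_blkbits_unchecked()` and `toy_blkbits()` are hypothetical helpers, and `TOY_BLOCK_SHIFT` is an assumed fallback value standing in for CEPH_BLOCK_SHIFT.

```c
#include <stdint.h>

#define TOY_BLOCK_SHIFT 22  /* assumed fallback shift, used only in this sketch */

/* 1-based position of the highest set bit; 0 when x is 0. */
int toy_fls(uint32_t x)
{
	int n = 0;

	while (x) {
		n++;
		x >>= 1;
	}
	return n;
}

/* No zero check: a stripe unit of 0 gives -1, stored as 0xff. */
uint8_t toy_blkbits_unchecked(uint32_t stripe_unit)
{
	return (uint8_t)(toy_fls(stripe_unit) - 1);
}

/* With the check: fall back to a fixed shift when the unit is 0. */
uint8_t toy_blkbits(uint32_t stripe_unit)
{
	if (stripe_unit == 0)
		return TOY_BLOCK_SHIFT;
	return (uint8_t)(toy_fls(stripe_unit) - 1);
}
```

A shift of 255 applied to a 32-bit int is exactly the out-of-range exponent that UBSAN reports above, while any power-of-two stripe unit yields its log2 as before.
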
      Signed-off-by: Luis Henriques <lhenriques@suse.com>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      0275113f
    • I
      xen/pci: reserve MCFG areas earlier · 2bc2a90a
      Igor Druzhinin committed
      [ Upstream commit a4098bc6eed5e31e0391bcc068e61804c98138df ]
      
      If the MCFG area is not reserved in E820, Xen by default will defer its usage
      until Dom0 registers it explicitly after the ACPI parser recognizes it as
      a reserved resource in DSDT. Having it reserved in E820 is not
      mandatory according to "PCI Firmware Specification, rev 3.2" (par. 4.1.2)
      and firmware is free to keep a hole in E820 in that place. Xen doesn't know
      what exactly is inside this hole since it lacks a full ACPI view of the
      platform. It is therefore potentially harmful to access the MCFG region
      without additional checks, as some machines are known to provide
      inconsistent information on the size of the region.
      
      Now xen_mcfg_late() runs after acpi_init() which is too late as some basic
      PCI enumeration starts exactly there as well. Trying to register a device
      prior to MCFG reservation causes multiple problems with PCIe extended
      capability initializations in Xen (e.g. SR-IOV VF BAR sizing). There are
      no convenient hooks for us to subscribe to so register MCFG areas earlier
      upon the first invocation of xen_add_device(). It should be safe to do once
      since all the boot time buses must have their MCFG areas in MCFG table
      already and we don't support PCI bus hot-plug.
      Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      2bc2a90a
    • C
      9p: avoid attaching writeback_fid on mmap with type PRIVATE · 18dd2b05
      Chengguang Xu committed
      [ Upstream commit c87a37ebd40b889178664c2c09cc187334146292 ]
      
      Currently, under the mmap cache policy, we always attach writeback_fid
      whether the mmap type is SHARED or PRIVATE. However, in the kata-container
      use case, which combines 9p (guest OS) with overlayfs (host OS),
      this behavior triggers overlayfs' copy-up when a command is executed
      inside the container.
      
      Link: http://lkml.kernel.org/r/20190820100325.10313-1-cgxu519@zoho.com.cn
      Signed-off-by: Chengguang Xu <cgxu519@zoho.com.cn>
      Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      18dd2b05