1. 27 December 2019 (40 commits)
    • genirq: Provide NMI handlers · 51faab01
      Julien Thierry committed
      hulk inclusion
      category: feature
      bugzilla: 9290
      CVE: NA
      
      ported from https://lore.kernel.org/patchwork/patch/1037463/
      
      -------------------------------------------------
      
      Provide flow handlers that are NMI safe for interrupts and percpu_devid
      interrupts.
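      
      As a hedged sketch (simplified, not a verbatim copy of the patch), an
      NMI-safe flow handler is essentially the fasteoi flow with all locking
      and mask/unmask handling stripped away, since none of that is NMI safe:
      
        /* sketch along the lines of handle_fasteoi_nmi():
         * no desc->lock, no irq_may_run(), no mask/unmask dance */
        void handle_fasteoi_nmi(struct irq_desc *desc)
        {
                struct irq_chip *chip = irq_desc_get_chip(desc);
                struct irqaction *action = desc->action;
                irqreturn_t res;
      
                res = action->handler(irq_desc_get_irq(desc), action->dev_id);
      
                if (chip->irq_eoi)
                        chip->irq_eoi(&desc->irq_data);
        }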
      Signed-off-by: Julien Thierry <julien.thierry@arm.com>
      Acked-by: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Wei Li <liwei391@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • genirq: Provide NMI management for percpu_devid interrupts · 11fb5378
      Julien Thierry committed
      hulk inclusion
      category: feature
      bugzilla: 9290
      CVE: NA
      
      ported from https://lore.kernel.org/patchwork/patch/1037461/
      
      -------------------------------------------------
      
      Add support for percpu_devid interrupts treated as NMIs.
      
      Percpu_devid NMIs need to be setup/torn down on each CPU they target.
      
      The same restrictions as for global NMIs still apply for percpu_devid NMIs.
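      
      A hedged usage sketch (function names from the mainline series; the
      driver-side details are illustrative): each CPU that wants the NMI has
      to run the prepare/enable pair locally:
      
        static DEFINE_PER_CPU(int, my_nmi_dev);         /* hypothetical */
      
        static irqreturn_t my_nmi_handler(int irq, void *dev_id)
        {
                /* NMI context: no locks, no sleeping */
                return IRQ_HANDLED;
        }
      
        /* once, globally */
        err = request_percpu_nmi(irq, my_nmi_handler, "my_nmi", &my_nmi_dev);
      
        /* then, on each target CPU */
        err = prepare_percpu_nmi(irq);
        if (!err)
                enable_percpu_nmi(irq, IRQ_TYPE_NONE);
      
        /* teardown, again per CPU, then globally */
        disable_percpu_nmi(irq);
        teardown_percpu_nmi(irq);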
      Signed-off-by: Julien Thierry <julien.thierry@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Wei Li <liwei391@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • genirq: Provide basic NMI management for interrupt lines · efe69ab8
      Julien Thierry committed
      hulk inclusion
      category: feature
      bugzilla: 9290
      CVE: NA
      
      ported from https://lore.kernel.org/patchwork/patch/1037460/
      
      -------------------------------------------------
      
      Add functionality to allocate interrupt lines that will deliver IRQs
      as Non-Maskable Interrupts. These allocations are only successful if
      the irqchip provides the necessary support and allows NMI delivery for the
      interrupt line.
      
      Interrupt lines allocated for NMI delivery must be enabled/disabled through
      enable_nmi/disable_nmi_nosync to keep their state consistent.
      
      To treat a PERCPU IRQ as NMI, the interrupt must not be shared nor threaded,
      the irqchip directly managing the IRQ must be the root irqchip and the
      irqchip cannot be behind a slow bus.
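      
      A hedged sketch of the resulting API for a global interrupt line
      (names as introduced by the series; error handling trimmed):
      
        static irqreturn_t my_nmi_handler(int irq, void *dev_id)
        {
                /* runs as an NMI: keep it minimal and lock-free */
                return IRQ_HANDLED;
        }
      
        err = request_nmi(irq, my_nmi_handler, 0, "my-nmi", dev);
        if (!err)
                enable_nmi(irq);  /* per the text above, NMI lines must be
                                     enabled explicitly via enable_nmi() */
        /* ... later ... */
        disable_nmi_nosync(irq);
        free_nmi(irq, dev);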
      Signed-off-by: Julien Thierry <julien.thierry@arm.com>
      Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Wei Li <liwei391@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • blk-mq: not embed .mq_kobj and ctx->kobj into queue instance · d4118616
      Ming Lei committed
      mainline inclusion
      from mainline-5.0-rc1
      commit 1db4909e76f64a85f4aaa187f0f683f5c85a471d
      category: bugfix
      bugzilla: 5901
      CVE: NA
      ---------------------------
      
      Even though .mq_kobj, ctx->kobj and q->kobj appear to share the same
      lifetime from the block layer's view, they actually don't, because
      userspace may grab a kobject at any time via sysfs.
      
      This patch fixes the issue by the following approach:
      
      1) introduce 'struct blk_mq_ctxs' for holding .mq_kobj and managing
      all ctxs
      
      2) free all allocated ctxs and the 'blk_mq_ctxs' instance in release
      handler of .mq_kobj
      
      3) grab one ref of .mq_kobj before initializing each ctx->kobj, so that
      .mq_kobj is always released after all ctxs are freed.
      
      This patch fixes a kernel panic during boot when DEBUG_KOBJECT_RELEASE
      is enabled.
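      
      A hedged sketch of the refcount pattern in step 3) (structure and
      helper names are illustrative, not the exact blk-mq code):
      
        static void blk_mq_ctx_sysfs_release(struct kobject *kobj)
        {
                struct blk_mq_ctx *ctx =
                        container_of(kobj, struct blk_mq_ctx, kobj);
      
                /* drop the ref on the shared mq_kobj taken at init time,
                 * so mq_kobj can only go away after every ctx is gone */
                kobject_put(&ctx->ctxs->kobj);
        }
      
        /* at init time, before exposing ctx->kobj: */
        kobject_get(&ctxs->kobj);
        kobject_init(&ctx->kobj, &blk_mq_ctx_ktype);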
      Reported-by: Guenter Roeck <linux@roeck-us.net>
      Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
      Tested-by: Guenter Roeck <linux@roeck-us.net>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Reviewed-by: Miao Xie <miaoxie@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • cpuidle: menu: Fix wakeup statistics updates for polling state · d3b52881
      Rafael J. Wysocki committed
      mainline inclusion
      from mainline-4.20
      commit 5f26bdceb9c0
      category: bugfix
      bugzilla: 6468
      CVE: NA
      
      -------------------------------------------------
      
      If the CPU exits the "polling" state due to the time limit in the
      loop in poll_idle(), this is not a real wakeup and it just means
      that the "polling" state selection was not adequate.  The governor
      mispredicted short idle duration, but had a more suitable state been
      selected, the CPU might have spent more time in it.  In fact, there
      is no reason to expect that there would have been a wakeup event
      earlier than the next timer in that case.
      
      Handling such cases as regular wakeups in menu_update() may cause the
      menu governor to make suboptimal decisions going forward, but ignoring
      them altogether would not be correct either, because every time
      menu_select() is invoked, it makes a separate new attempt to predict
      the idle duration taking distinct time to the closest timer event as
      input and the outcomes of all those attempts should be recorded.
      
      For this reason, make menu_update() always assume that if the
      "polling" state was exited due to the time limit, the next proper
      wakeup event for the CPU would be the next timer event (not
      including the tick).
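      
      A hedged sketch of what that amounts to in menu_update() (field names
      follow the mainline fix; surrounding logic omitted):
      
        if ((drv->states[last_idx].flags & CPUIDLE_FLAG_POLLING) &&
            dev->poll_time_limit) {
                /* polling exited on its time limit: count the next timer
                 * as the wakeup rather than a bogus short idle period */
                measured_us = data->next_timer_us;
        }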
      
      Fixes: a37b969a "cpuidle: poll_state: Add time limit to poll_idle()"
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Reviewed-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • dmaengine: dw-dmac: implement dma protection control setting · c80fabc6
      Christian Lamparter committed
      mainline inclusion
      from mainline-5.0-rc1
      commit 7b0c03ecc42fb223baf015877fee9d517c2c8af1
      category: bugfix
      bugzilla: 6517
      CVE: NA
      
      ---------------------------
      
      This patch adds a new device-tree property that allows specifying
      the DMA protection control bits for all of the DMA controller's
      channels uniformly.
      
      Setting the "correct" bits can have a huge impact on the
      PPC460EX and APM82181 that use this DMA engine in combination
      with a DesignWare' SATA-II core (sata_dwc_460ex driver).
      
      In the OpenWrt Forum, the user takimata reported that:
      |It seems your patch unleashed the full power of the SATA port.
      |Where I was previously hitting a really hard limit at around
      |82 MB/s for reading and 27 MB/s for writing, I am now getting this:
      |
      |root@OpenWrt:/mnt# time dd if=/dev/zero of=tempfile bs=1M count=1024
      |1024+0 records in
      |1024+0 records out
      |real    0m 13.65s
      |user    0m 0.01s
      |sys     0m 11.89s
      |
      |root@OpenWrt:/mnt# time dd if=tempfile of=/dev/null bs=1M count=1024
      |1024+0 records in
      |1024+0 records out
      |real    0m 8.41s
      |user    0m 0.01s
      |sys     0m 4.70s
      |
      |This means: 121 MB/s reading and 75 MB/s writing!
      |
      |The drive is a WD Green WD10EARX taken from an older MBL Single.
      |I repeated the test a few times with even larger files to rule out
      |any caching, I'm still seeing the same great performance. OpenWrt is
      |now completely on par with the original MBL firmware's performance.
      
      Another user And.short reported:
      |I can report that your fix worked! Boots up fine with two
      |drives even with more partitions, and no more reboot on
      |concurrent disk access!
      
      A closer look into the sata_dwc_460ex code revealed that
      the driver did initially set the correct protection control
      bits. However, this feature was lost when the sata_dwc_460ex
      driver was converted to the generic DMA driver framework.
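      
      A hedged sketch of the driver side (the binding is the mainline
      "snps,dma-protection-control" property; the bounds check shown here
      is illustrative):
      
        u32 tmp;
      
        if (!of_property_read_u32(np, "snps,dma-protection-control", &tmp)) {
                if (tmp > CHAN_PROTCTL_MASK)    /* reject out-of-range values */
                        return NULL;
                pdata->protctl = tmp;
        }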
      
      BugLink: https://forum.openwrt.org/t/wd-mybook-live-duo-two-disks/16195/55
      BugLink: https://forum.openwrt.org/t/wd-mybook-live-duo-two-disks/16195/50
      Fixes: 8b344485 ("sata_dwc_460ex: move to generic DMA driver")
      Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: Christian Lamparter <chunkeey@gmail.com>
      Signed-off-by: Vinod Koul <vkoul@kernel.org>
      Signed-off-by: Jason Yan <yanaijie@huawei.com>
      Reviewed-by: zhengbin <zhengbin13@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • include/linux/notifier.h: SRCU: fix ctags · 89073af9
      Sam Protsenko committed
      mainline inclusion
      from mainline-4.20-rc1
      commit 94e297c50b529f5d01cfd1dbc808d61e95180ab7
      category: bugfix
      bugzilla: 5542
      CVE: NA
      
      ---------------------------
      
      ctags indexing ("make tags" command) throws this warning:
      
          ctags: Warning: include/linux/notifier.h:125:
          null expansion of name pattern "\1"
      
      This is the result of DEFINE_PER_CPU() macro expansion.  Fix that by
      getting rid of the line break.
      
      A similar fix was already done in commit 25528213 ("tags: Fix
      DEFINE_PER_CPU expansions"), but this one probably wasn't noticed.
      
      Link: http://lkml.kernel.org/r/20181030202808.28027-1-semen.protsenko@linaro.org
      Fixes: 9c80172b ("kernel/SRCU: provide a static initializer")
      Signed-off-by: Sam Protsenko <semen.protsenko@linaro.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Jason Yan <yanaijie@huawei.com>
      Reviewed-by: zhengbin <zhengbin13@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • bitops: protect variables in bit_clear_unless() macro · 18781385
      Miklos Szeredi committed
      mainline inclusion
      from mainline-4.20-rc1
      commit edfa87281f4fa1b78a21f6db999935a2faa2f6b8
      category: bugfix
      bugzilla: 5544
      CVE: NA
      
      ---------------------------
      
      Unprotected naming of local variables within bit_clear_unless() can easily
      lead to using the wrong scope.
      
      Noticed this by code review after having hit this issue in
      set_mask_bits().
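      
      The fix follows the usual statement-expression convention of giving
      the macro's temporaries reserved-looking names; roughly (a hedged
      sketch of the fixed shape, not a verbatim copy):
      
        #define bit_clear_unless(ptr, clear, test)                      \
        ({                                                              \
                const typeof(*(ptr)) clear__ = (clear), test__ = (test);\
                typeof(*(ptr)) old__, new__;                            \
                                                                        \
                do {                                                    \
                        old__ = READ_ONCE(*(ptr));                      \
                        new__ = old__ & ~clear__;                       \
                } while (!(old__ & test__) &&                           \
                         cmpxchg(ptr, old__, new__) != old__);          \
                                                                        \
                !(old__ & test__);                                      \
        })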
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Fixes: 85ad1d13 ("md: set MD_CHANGE_PENDING in a atomic region")
      Cc: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Jason Yan <yanaijie@huawei.com>
      Reviewed-by: zhengbin <zhengbin13@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • linux/bitmap.h: fix type of nbits in bitmap_shift_right() · 40ddf945
      Rasmus Villemoes committed
      mainline inclusion
      from mainline-4.20-rc1
      commit d9873969fa8725dc6a5a21ab788c057fd8719751
      category: bugfix
      bugzilla: 5555
      CVE: NA
      
      ---------------------------
      
      Most other bitmap APIs, including the out-of-line version
      __bitmap_shift_right(), take an unsigned nbits.  This was accidentally
      left out of commit 2fbad299.
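      
      After the change, the inline wrapper presumably lines up with its
      out-of-line counterpart (a hedged sketch of the signature):
      
        static inline void bitmap_shift_right(unsigned long *dst,
                                              const unsigned long *src,
                                              unsigned int shift,
                                              unsigned int nbits);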
      
      Link: http://lkml.kernel.org/r/20180818131623.8755-5-linux@rasmusvillemoes.dk
      Fixes: 2fbad299 ("lib: bitmap: change bitmap_shift_right to take unsigned parameters")
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Reported-by: Yury Norov <ynorov@caviumnetworks.com>
      Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Jason Yan <yanaijie@huawei.com>
      Reviewed-by: zhengbin <zhengbin13@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • UAPI: ndctl: Remove use of PAGE_SIZE · e89320ff
      David Howells committed
      mainline inclusion
      from mainline-4.20-rc1
      commit f366d322aea782cf786aa821d5accdc1609f9e10
      category: bugfix
      bugzilla: 5557
      CVE: NA
      
      ---------------------------
      
      The macro PAGE_SIZE isn't valid outside of the kernel, so it should not
      appear in UAPI headers.
      
      Furthermore, the actual machine page size could theoretically change from
      an application's point of view if it's running in a container that gets
      migrated to another machine (say 4K/ppc64 to 64K/ppc64).
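      
      The fix is simply to spell the constant out in the UAPI header; as a
      hedged sketch of the mainline change:
      
        ND_MIN_NAMESPACE_SIZE = 0x00001000,     /* was PAGE_SIZE */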
      
      Fixes: f2ba5a5b ("libnvdimm, namespace: make min namespace size 4K")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Jason Yan <yanaijie@huawei.com>
      Reviewed-by: zhengbin <zhengbin13@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • bitops: protect variables in set_mask_bits() macro · 8a2b4f79
      Miklos Szeredi committed
      mainline inclusion
      from mainline-4.20-rc1
      commit 18127429a854e7607b859484880b8e26cee9ddab
      category: bugfix
      bugzilla: 5559
      CVE: NA
      
      ---------------------------
      
      Unprotected naming of local variables within the set_mask_bits() macro
      can easily lead to using the wrong scope.
      
      Noticed this when "set_mask_bits(&foo->bar, 0, mask)" behaved as a
      no-op.
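      
      A hedged illustration of how the no-op comes about: if the macro's
      internal temporaries are named plain 'mask' and 'bits' (as before the
      fix), then in
      
        unsigned long mask = BIT(0) | BIT(1);
      
        set_mask_bits(&foo->bar, 0, mask);
      
      the macro first binds its internal 'mask' to the second argument, 0;
      the subsequent internal 'bits = (mask)' then picks up that inner
      'mask' (0) instead of the caller's variable, so nothing gets set.
      Suffixed names such as 'mask__' cannot collide with reasonable
      caller code.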
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Fixes: 00a1a053 ("ext4: atomically set inode->i_flags in ext4_set_inode_flags()")
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Jason Yan <yanaijie@huawei.com>
      Reviewed-by: zhengbin <zhengbin13@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • net/mlx5: Fix atomic_mode enum values · 12d4e210
      Moni Shoua committed
      mainline inclusion
      from mainline-4.20-rc1
      commit aa7e80b220f3a543eefbe4b7e2c5d2b73e2e2ef7
      category: bugfix
      bugzilla: 5560
      CVE: NA
      
      ---------------------------
      
      The field atomic_mode is 4 bits wide and therefore can hold values
      from 0x0 to 0xf. Remove the unnecessary 20-bit shift that made the
      values incorrect. While at it, remove the unused enum values.
      
      Fixes: 57cda166 ("net/mlx5: Add DCT command interface")
      Signed-off-by: Moni Shoua <monis@mellanox.com>
      Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Jason Yan <yanaijie@huawei.com>
      Reviewed-by: zhengbin <zhengbin13@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • gpio: fix kernel-doc notation warning for 'request_key' · 3f446a97
      Randy Dunlap committed
      mainline inclusion
      from mainline-4.20-rc1
      commit 02ad0437decf2e5dba975c23b1a89775f4b211e1
      category: bugfix
      bugzilla: 5562
      CVE: NA
      
      ---------------------------
      
      Fix kernel-doc warning for missing struct member 'request_key':
      
      ../include/linux/gpio/driver.h:142: warning: Function parameter or member 'request_key' not described in 'gpio_irq_chip'
      
      Fixes: 39c3fd58 ("kernel/irq: Extend lockdep class for request mutex")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Andrew Lunn <andrew@lunn.ch>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Linus Walleij <linus.walleij@linaro.org>
      Cc: linux-gpio@vger.kernel.org
      Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: Jason Yan <yanaijie@huawei.com>
      Reviewed-by: zhengbin <zhengbin13@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • kernfs: update comment about kernfs_path() return value · 0435096f
      Konstantin Khlebnikov committed
      mainline inclusion
      from mainline-4.20-rc1
      commit 8f5be0ec23bb9ef3f96659c8dff1340b876600bf
      category: bugfix
      bugzilla: 5564
      CVE: NA
      
      ---------------------------
      
      Now it returns the length of the full path or an error code.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Fixes: 3abb1d90 ("kernfs: make kernfs_path*() behave in the style of strlcpy()")
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Jason Yan <yanaijie@huawei.com>
      Reviewed-by: zhengbin <zhengbin13@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • build_bug.h: remove most of dummy BUILD_BUG_ON stubs for Sparse · 9baadf68
      Masahiro Yamada committed
      mainline inclusion
      from mainline-5.0-rc1
      commit 527edbc18a70e745740ef31edb0ffefb2f161afa
      category: bugfix
      bugzilla: 6820
      CVE: NA
      
      ---------------------------
      
      The introduction of these dummy BUILD_BUG_ON stubs dates back to commit
      903c0c7c ("sparse: define dummy BUILD_BUG_ON definition for
      sparse").
      
      At that time, BUILD_BUG_ON() was implemented with the negative array
      trick *and* the link-time trick, like this:
      
        extern int __build_bug_on_failed;
        #define BUILD_BUG_ON(condition)                                \
                do {                                                   \
                        ((void)sizeof(char[1 - 2*!!(condition)]));     \
                        if (condition) __build_bug_on_failed = 1;      \
                } while(0)
      
      Sparse is more strict about the negative array trick than GCC because
      Sparse requires the array length to be really constant.
      
      Here is the simple test code for the macro above:
      
        static const int x = 0;
        BUILD_BUG_ON(x);
      
      GCC is absolutely fine with it (-Wvla was enabled only very recently),
      but Sparse warns like this:
      
        error: bad constant expression
        error: cannot size expression
      
      (If you are using a newer version of Sparse, you will see a different
      warning message, "warning: Variable length array is used".)
      
      Anyway, Sparse was producing many false positives, and was noisier
      than it should have been at the time.
      
      With the previous commit, the leftover negative array trick is gone.
      Sparse is fine with the current BUILD_BUG_ON(), which is implemented by
      using the 'error' attribute.
      
      I am keeping the stub for BUILD_BUG_ON_ZERO().  Otherwise, Sparse would
      complain about the following code, which GCC is fine with:
      
        static const int x = 0;
        int y = BUILD_BUG_ON_ZERO(x);
      
      Link: http://lkml.kernel.org/r/1542856462-18836-3-git-send-email-yamada.masahiro@socionext.com
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
      Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
      Tested-by: Nick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Jason Yan <yanaijie@huawei.com>
      Reviewed-by: zhengbin <zhengbin13@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • include/linux/compiler*.h: fix OPTIMIZER_HIDE_VAR · 05505fc0
      Michael S. Tsirkin committed
      mainline inclusion
      from mainline-v5.0-rc3
      commit 3e2ffd655cc6a694608d997738989ff5572a8266
      category: bugfix
      bugzilla: 7094
      CVE: NA
      
      ---------------------------
      
      Since commit 815f0ddb ("include/linux/compiler*.h: make compiler-*.h
      mutually exclusive") clang no longer reuses the OPTIMIZER_HIDE_VAR macro
      from compiler-gcc - instead it gets the version in
      include/linux/compiler.h.  Unfortunately that version doesn't actually
      prevent compiler from optimizing out the variable.
      
      Fix up by moving the macro out from compiler-gcc.h to compiler.h.
      Compilers without inline asm support will keep working
      since it's protected by an ifdef.
      
      Also fix up comments to match reality since we are no longer overriding
      any macros.
      
      Build-tested with gcc and clang.
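      
      For reference, the macro itself is a one-liner; roughly (hedged):
      
        /* make the optimizer believe 'var' is read and written by the asm,
         * without emitting any actual instructions */
        #define OPTIMIZER_HIDE_VAR(var) \
                __asm__ ("" : "=r" (var) : "0" (var))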
      
      Fixes: 815f0ddb ("include/linux/compiler*.h: make compiler-*.h mutually exclusive")
      Cc: Eli Friedman <efriedma@codeaurora.org>
      Cc: Joe Perches <joe@perches.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      Signed-off-by: Jason Yan <yanaijie@huawei.com>
      Reviewed-by: zhengbin <zhengbin13@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • of: Fix property name in of_node_get_device_type · 4e3cc0cf
      Rob Herring committed
      mainline inclusion
      from mainline-4.20-rc1
      commit 5d5a0ab1a7918fce5ca5c0fb1871a3e2000f85de
      category: bugfix
      bugzilla: 5547
      CVE: NA
      
      ---------------------------
      
      Commit 0413beda ("of: Add device_type access helper functions")
      added a new helper not yet used in preparation for some treewide clean
      up of accesses to 'device_type' properties. Unfortunately, there's an
      error and 'type' was used for the property name. Fix this.
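      
      The corrected helper presumably reads (a hedged sketch):
      
        static inline const char *of_node_get_device_type(const struct device_node *np)
        {
                return of_get_property(np, "device_type", NULL); /* not "type" */
        }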
      
      Fixes: 0413beda ("of: Add device_type access helper functions")
      Cc: Frank Rowand <frowand.list@gmail.com>
      Signed-off-by: Rob Herring <robh@kernel.org>
      Signed-off-by: Jason Yan <yanaijie@huawei.com>
      Reviewed-by: zhengbin <zhengbin13@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • gpio: drop broken to_gpio_irq_chip() helper · 7c7db3d1
      Johan Hovold committed
      mainline inclusion
      from mainline-5.0-rc1
      commit eee3919c5f2949a8b7b1e9fa239d153be1538656
      category: bugfix
      bugzilla: 5548
      CVE: NA
      
      ---------------------------
      
      Drop the broken to_gpio_irq_chip() container_of() helper, which would
      break the build for anyone who tries to use it.
      
      Specifically, struct gpio_irq_chip only holds a pointer to a struct
      irq_chip so using container_of() on an irq-chip pointer makes no sense.
      
      Fixes: da80ff81 ("gpio: Move irqchip into struct gpio_irq_chip")
      Cc: Thierry Reding <treding@nvidia.com>
      Cc: Grygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: Johan Hovold <johan@kernel.org>
      Reviewed-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
      Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: Jason Yan <yanaijie@huawei.com>
      Reviewed-by: zhengbin <zhengbin13@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • srcu: Fix kernel-doc missing notation · 6533f24f
      Randy Dunlap committed
      mainline inclusion
      from mainline-5.0-rc1
      commit f3e763c3e544b73ae5c4a3842cedb9ff6ca37715
      category: bugfix
      bugzilla: 5552
      CVE: NA
      
      ---------------------------
      
      Fix kernel-doc warnings for missing parameter descriptions:
      
      ../include/linux/srcu.h:175: warning: Function parameter or member 'p' not described in 'srcu_dereference_notrace'
      ../include/linux/srcu.h:175: warning: Function parameter or member 'sp' not described in 'srcu_dereference_notrace'
      
      Fixes: 0b764a6e ("srcu: Add notrace variant of srcu_dereference")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Jason Yan <yanaijie@huawei.com>
      Reviewed-by: zhengbin <zhengbin13@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • compiler.h: give up __compiletime_assert_fallback() · bca68b4a
      Masahiro Yamada committed
      mainline inclusion
      from mainline-4.20-rc1
      commit 81b45683487a51b0f4d3b29d37f20d6d078544e4
      category: bugfix
      bugzilla: 5553
      CVE: NA
      
      ---------------------------
      
      __compiletime_assert_fallback() is supposed to stop building earlier
      by using the negative-array-size method in case the compiler does not
      support "error" attribute, but has never worked like that.
      
      You can simply try:
      
          BUILD_BUG_ON(1);
      
      GCC immediately terminates the build, but Clang does not report
      anything because Clang does not support the "error" attribute yet.
      The build will fail later, at link time, so at the very least
      __compiletime_assert_fallback() is not doing its job.
      
      The root cause is commit 1d6a0d19 ("bug.h: prevent double evaluation
      of `condition' in BUILD_BUG_ON").  Prior to that commit, BUILD_BUG_ON()
      was checked by the negative-array-size method *and* the link-time trick.
      Since that commit, the negative-array-size is not effective because
      '__cond' is no longer constant.  As the comment in <linux/build_bug.h>
      says, GCC (and Clang as well) only emits the error for obvious cases.
      
      When '__cond' is a variable,
      
          ((void)sizeof(char[1 - 2 * __cond]))
      
      ... is not obvious for the compiler to know the array size is negative.
      
      Reverting that commit would break BUILD_BUG() because the negative-size
      array is evaluated before the code is optimized out.
      
      Let's give up __compiletime_assert_fallback().  This commit does not
      change the current behavior since it just rips off the useless code.
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Jason Yan <yanaijie@huawei.com>
      Reviewed-by: zhengbin <zhengbin13@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • gpiolib: Fix return value of gpio_to_desc() stub if !GPIOLIB · f46a293c
      Krzysztof Kozlowski committed
      mainline inclusion
      from mainline-5.0-rc1
      commit c5510b8dafce5f3f5a039c9b262ebcae0092c462
      category: bugfix
      bugzilla: 5539
      CVE: NA
      
      ---------------------------
      
      If CONFIG_GPIOLIB is not set, the stub of gpio_to_desc() should return
      the same type of error as the regular version: NULL.  All the callers
      compare the return value of gpio_to_desc() against NULL, so a returned
      ERR_PTR would be treated as a non-error case, leading to dereferencing
      of an error value.
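      
      The fixed stub is then just (a hedged sketch):
      
        static inline struct gpio_desc *gpio_to_desc(unsigned gpio)
        {
                return NULL;    /* previously an ERR_PTR value, which the
                                   callers never check for */
        }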
      
      Fixes: 79a9becd ("gpiolib: export descriptor-based GPIO interface")
      Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
      Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: Jason Yan <yanaijie@huawei.com>
      Reviewed-by: zhengbin <zhengbin13@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • cpu/hotplug: Fix "SMT disabled by BIOS" detection for KVM · d396de12
      Josh Poimboeuf committed
      commit b284909abad48b07d3071a9fc9b5692b3e64914b upstream.
      
      With the following commit:
      
        73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
      
      ... the hotplug code attempted to detect when SMT was disabled by BIOS,
      in which case it reported SMT as permanently disabled.  However, that
      code broke a virt hotplug scenario, where the guest is booted with only
      primary CPU threads, and a sibling is brought online later.
      
      The problem is that there doesn't seem to be a way to reliably
      distinguish between the HW "SMT disabled by BIOS" case and the virt
      "sibling not yet brought online" case.  So the above-mentioned commit
      was a bit misguided, as it permanently disabled SMT for both cases,
      preventing future virt sibling hotplugs.
      
      Going back and reviewing the original problems that commit attempted
      to solve, when SMT was disabled in BIOS:
      
        1) /sys/devices/system/cpu/smt/control showed "on" instead of
           "notsupported"; and
      
        2) vmx_vm_init() was incorrectly showing the L1TF_MSG_SMT warning.
      
      I'd propose that we instead consider #1 above to not actually be a
      problem.  Because, at least in the virt case, it's possible that SMT
      wasn't disabled by BIOS and a sibling thread could be brought online
      later.  So it makes sense to just always default the smt control to "on"
      to allow for that possibility (assuming cpuid indicates that the CPU
      supports SMT).
      
      The real problem is #2, which has a simple fix: change vmx_vm_init() to
      query the actual current SMT state -- i.e., whether any siblings are
      currently online -- instead of looking at the SMT "control" sysfs value.
      
      So fix it by:
      
        a) reverting the original "fix" and its followup fix:
      
           73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
           bc2d8d26 ("cpu/hotplug: Fix SMT supported evaluation")
      
           and
      
        b) changing vmx_vm_init() to query the actual current SMT state --
           instead of the sysfs control value -- to determine whether the L1TF
           warning is needed.  This also requires the 'sched_smt_present'
           variable to be exported, instead of 'cpu_smt_control'.
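      
      A hedged sketch of the vmx_vm_init() side of b):
      
        /* warn only if a sibling thread is actually online right now */
        if (sched_smt_active())
                pr_warn_once(L1TF_MSG_SMT);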
      
      Fixes: 73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
      Reported-by: Igor Mammedov <imammedo@redhat.com>
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Joe Mario <jmario@redhat.com>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: kvm@vger.kernel.org
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/e3a85d585da28cc333ecbc1e78ee9216e6da9396.1548794349.git.jpoimboe@redhat.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • kvm: Change offset in kvm_write_guest_offset_cached to unsigned · d7023bb1
      Jim Mattson committed
      [ Upstream commit 7a86dab8cf2f0fdf508f3555dddfc236623bff60 ]
      
      Since the offset is added directly to the hva from the
      gfn_to_hva_cache, a negative offset could result in an out of bounds
      write. The existing BUG_ON only checks for addresses beyond the end of
      the gfn_to_hva_cache, not for addresses before the start of the
      gfn_to_hva_cache.
      
      Note that all current call sites have non-negative offsets.
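      
      The hardened prototype then becomes (a hedged sketch):
      
        int kvm_write_guest_offset_cached(struct kvm *kvm,
                                          struct gfn_to_hva_cache *ghc,
                                          void *data, unsigned int offset,
                                          unsigned long len);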
      
      Fixes: 4ec6e863 ("kvm: Introduce kvm_write_guest_offset_cached()")
      Reported-by: Cfir Cohen <cfir@google.com>
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Cfir Cohen <cfir@google.com>
      Reviewed-by: Peter Shier <pshier@google.com>
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • drbd: Avoid Clang warning about pointless switch statement · 2b750730
      Nathan Chancellor committed
      [ Upstream commit a52c5a16cf19d8a85831bb1b915a221dd4ffae3c ]
      
      There are several warnings from Clang about no case statement matching
      the constant 0:
      
      In file included from drivers/block/drbd/drbd_receiver.c:48:
      In file included from drivers/block/drbd/drbd_int.h:48:
      In file included from ./include/linux/drbd_genl_api.h:54:
      In file included from ./include/linux/genl_magic_struct.h:236:
      ./include/linux/drbd_genl.h:321:1: warning: no case matching constant
      switch condition '0'
      GENL_struct(DRBD_NLA_HELPER, 24, drbd_helper_info,
      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      ./include/linux/genl_magic_struct.h:220:10: note: expanded from macro
      'GENL_struct'
              switch (0) {
                      ^
      
      Silence this warning by adding a 'case 0:' statement. Additionally,
      adjust the alignment of the statements in the ct_assert_unique macro to
      avoid a checkpatch warning.
      
      This solution was originally sent by Arnd Bergmann with a default case
      statement: https://lore.kernel.org/patchwork/patch/756723/
      
      Link: https://github.com/ClangBuiltLinux/linux/issues/43
      Suggested-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • net/mlx5: EQ, Use the right place to store/read IRQ affinity hint · 64042b03
      Saeed Mahameed committed
      [ Upstream commit 1e86ace4c140fd5a693e266c9b23409358f25381 ]
      
      Currently the cpu affinity hint mask for completion EQs is stored and
      read from the wrong place. Since reading and storing are done at the
      same index there is no actual issue with that, but the internal
      irq_info for completion EQs starts at the MLX5_EQ_VEC_COMP_BASE offset
      in the irq_info array, so this patch changes the code to use the
      correct offset to store and read the IRQ affinity hint.
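      
      A hedged sketch of the offset fix (identifier names follow the
      mainline driver; the surrounding call sites are omitted):
      
        /* completion EQ i lives at MLX5_EQ_VEC_COMP_BASE + i, not at i */
        int vecidx = MLX5_EQ_VEC_COMP_BASE + i;
        struct mlx5_irq_info *irq_info = &mdev->priv.irq_info[vecidx];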
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • gpiolib: Fix possible use after free on label · b11d82db
      Muchun Song committed
      [ Upstream commit 18534df419041e6c1f4b41af56ee7d41f757815c ]
      
      gpiod_request_commit() copies the pointer to the label passed as
      an argument only to be used later. But there's a chance the caller
      could immediately free the passed string (e.g., a local variable).
      This could trigger a use after free when we later use the GPIO label
      (e.g., in gpiochip_unlock_as_irq() or gpiochip_is_requested()).
      
      To be on the safe side: duplicate the string with kstrdup_const()
      so that if an unaware user passes an address to a stack-allocated
      buffer, we won't end up with an arbitrary label.
      
      Also fix gpiod_set_consumer_name().
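      
      A hedged sketch of the fix:
      
        const char *new_label = kstrdup_const(label, GFP_KERNEL);
      
        if (!new_label)
                return -ENOMEM;
        desc->label = new_label;
      
        /* on release or relabel, the const-aware free pairs with it */
        kfree_const(desc->label);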
      Signed-off-by: Muchun Song <smuchun@gmail.com>
      Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • HID: debug: fix the ring buffer implementation · 6b0c9002
      Vladis Dronov committed
      mainline inclusion
      from mainline-5.0
      commit 13054abbaa4f1fd4e6f3b4b63439ec033b4c8035
      category: bugfix
      bugzilla: NA
      CVE: CVE-2019-3819
      
      -------------------------------------------------
      
      The ring buffer implementation in hid_debug_event() and
      hid_debug_events_read() is odd, allowing data to be lost or corrupted.
      After commit 717adfda ("HID: debug: check length before
      copy_to_user()") it is possible to enter an infinite loop in
      hid_debug_events_read() by providing 0 as count; this locks up a
      system. Fix this by rewriting the ring buffer implementation with
      kfifo and simplifying the code.
      
      This fixes CVE-2019-3819.
      
      v2: fix an execution logic and add a comment
      v3: use __set_current_state() instead of set_current_state()
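      
      A hedged sketch of the kfifo-based ring (declaration style as in the
      mainline fix; buffer sizes and error handling trimmed):
      
        DECLARE_KFIFO_PTR(hid_debug_fifo, char);   /* in the per-reader struct */
      
        kfifo_alloc(&list->hid_debug_fifo, HID_DEBUG_BUFSIZE, GFP_KERNEL);
        kfifo_in(&list->hid_debug_fifo, buf, len);              /* producer */
        ret = kfifo_to_user(&list->hid_debug_fifo, ubuf, count, &copied);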
      
      Link: https://bugzilla.redhat.com/show_bug.cgi?id=1669187
      Cc: stable@vger.kernel.org # v4.18+
      Fixes: cd667ce2 ("HID: use debugfs for events/reports dumping")
      Fixes: 717adfda ("HID: debug: check length before copy_to_user()")
      Signed-off-by: Vladis Dronov <vdronov@redhat.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Acked-by: zhangyi (F) <yi.zhang@huawei.com>
    • mm/memblock.c: skip kmemleak for kasan_init() · ad5aa6ec
      Qian Cai committed
      mainline inclusion
      from mainline-5.0-rc1
      commit fed84c78527009d4f799a3ed9a566502fa026d82
      category: bugfix
      bugzilla: 7440
      CVE: NA
      
      -------------------------------------------------
      
      Kmemleak does not play well with KASAN (tested on both HPE Apollo 70 and
      Huawei TaiShan 2280 aarch64 servers).
      
      After calling start_kernel()->setup_arch()->kasan_init(), the kmemleak
      early log buffer went from something like 280 entries to 260000, which
      caused kmemleak to be disabled and crash-dump memory reservation to
      fail.  The multitude of kmemleak_alloc() calls comes from nested loops
      while KASAN is setting up full memory mappings, so let early kmemleak
      allocations skip those memblock_alloc_internal() calls that came from
      kasan_init(), given that those early KASAN memory mappings should not
      reference other memory.  Hence, no kmemleak false positives.
      
      kasan_init
        kasan_map_populate [1]
          kasan_pgd_populate [2]
            kasan_pud_populate [3]
              kasan_pmd_populate [4]
                kasan_pte_populate [5]
                  kasan_alloc_zeroed_page
                    memblock_alloc_try_nid
                      memblock_alloc_internal
                        kmemleak_alloc
      
      [1] for_each_memblock(memory, reg)
      [2] while (pgdp++, addr = next, addr != end)
      [3] while (pudp++, addr = next, addr != end && pud_none(READ_ONCE(*pudp)))
      [4] while (pmdp++, addr = next, addr != end && pmd_none(READ_ONCE(*pmdp)))
      [5] while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep)))
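      
      A hedged sketch of the skip itself (the flag name follows the mainline
      commit; the comparison site sits inside memblock_alloc_internal()):
      
        if (max_addr != MEMBLOCK_ALLOC_KASAN)
                kmemleak_alloc(ptr, size, 0, 0);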
      
      Link: http://lkml.kernel.org/r/1543442925-17794-1-git-send-email-cai@gmx.us
      Signed-off-by: Qian Cai <cai@gmx.us>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Zengruan Ye <yezengruan@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      [yyl: adjust context]
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • of: overlay: add tests to validate kfrees from overlay removal · 5969ad33
      Frank Rowand committed
      commit 144552c786925314c1e7cb8f91a71dae1aca8798 upstream.
      
      Add checks:
        - attempted kfree due to refcount reaching zero before overlay
          is removed
        - properties linked to an overlay node when the node is removed
        - node refcount > one during node removal in a changeset destroy,
          if the node was created by the changeset
      
      After applying this patch, several validation warnings will be
      reported from the devicetree unittest during boot due to
      pre-existing devicetree bugs. The warnings will be similar to:
      
        OF: ERROR: of_node_release(), unexpected properties in /testcase-data/overlay-node/test-bus/test-unittest11
        OF: ERROR: memory leak, expected refcount 1 instead of 2, of_node_get()/of_node_put() unbalanced - destroy cset entry: attach overlay node /testcase-data-2/substation@100/
        hvac-medium-2
      Tested-by: Alan Tull <atull@kernel.org>
      Signed-off-by: Frank Rowand <frank.rowand@sony.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • oom, oom_reaper: do not enqueue same task twice · 0b13c78a
      Tetsuo Handa committed
      commit 9bcdeb51bd7d2ae9fe65ea4d60643d2aeef5bfe3 upstream.
      
      Arkadiusz reported that enabling memcg's group oom killing causes
      strange memcg statistics where there is no task in a memcg even though
      the number of tasks in that memcg is not 0.  It turned out that there
      is a bug in wake_oom_reaper() which allows enqueuing the same task
      twice, which makes it impossible to decrease the number of tasks in
      that memcg due to a refcount leak.
      
      This bug existed since the OOM reaper became invokable from
      task_will_free_mem(current) path in out_of_memory() in Linux 4.7,
      
        T1@P1     |T2@P1     |T3@P1     |OOM reaper
        ----------+----------+----------+------------
                                         # Processing an OOM victim in a different memcg domain.
                              try_charge()
                                mem_cgroup_out_of_memory()
                                  mutex_lock(&oom_lock)
                   try_charge()
                     mem_cgroup_out_of_memory()
                       mutex_lock(&oom_lock)
        try_charge()
          mem_cgroup_out_of_memory()
            mutex_lock(&oom_lock)
                                  out_of_memory()
                                    oom_kill_process(P1)
                                      do_send_sig_info(SIGKILL, @P1)
                                      mark_oom_victim(T1@P1)
                                      wake_oom_reaper(T1@P1) # T1@P1 is enqueued.
                                  mutex_unlock(&oom_lock)
                       out_of_memory()
                         mark_oom_victim(T2@P1)
                         wake_oom_reaper(T2@P1) # T2@P1 is enqueued.
                       mutex_unlock(&oom_lock)
            out_of_memory()
              mark_oom_victim(T1@P1)
              wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL.
            mutex_unlock(&oom_lock)
                                         # Completed processing an OOM victim in a different memcg domain.
                                         spin_lock(&oom_reaper_lock)
                                         # T1@P1 is dequeued.
                                         spin_unlock(&oom_reaper_lock)
      
      but memcg's group oom killing made it easier to trigger this bug by
      calling wake_oom_reaper() on the same task from one out_of_memory()
      request.
      
      Fix this bug using an approach used by commit 855b0183 ("oom,
      oom_reaper: disable oom_reaper for oom_kill_allocating_task").  As a
      side effect of this patch, this patch also avoids enqueuing multiple
      threads sharing memory via task_will_free_mem(current) path.
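      
      A hedged sketch of the dedup check in wake_oom_reaper() (the bit name
      follows the mainline fix; surrounding code omitted):
      
        /* a victim's mm is queued for reaping at most once */
        if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
                return;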
      
      Link: http://lkml.kernel.org/r/e865a044-2c10-9858-f4ef-254bc71d6cc2@i-love.sakura.ne.jp
      Link: http://lkml.kernel.org/r/5ee34fc6-1485-34f8-8790-903ddabaa809@i-love.sakura.ne.jp
      Fixes: af8e15cc ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head")
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
      Tested-by: Arkadiusz Miskiewicz <arekm@maven.pl>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Aleksa Sarai <asarai@suse.de>
      Cc: Jay Kamat <jgkamat@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • ipvlan, l3mdev: fix broken l3s mode wrt local routes · 15ef6c67
      Daniel Borkmann committed
      [ Upstream commit d5256083f62e2720f75bb3c5a928a0afe47d6bc3 ]
      
      While implementing ipvlan l3 and l3s mode for kubernetes CNI plugin,
      I ran into the issue that while l3 mode is working fine, l3s mode
      does not have any connectivity to kube-apiserver and hence all pods
      end up in Error state as well. The ipvlan master device sits on
      top of a bond device and hostns traffic to kube-apiserver (also running
      in hostns) is DNATed from 10.152.183.1:443 to 139.178.29.207:37573
      where the latter is the address of the bond0. While in l3 mode, a
      curl to https://10.152.183.1:443 or to https://139.178.29.207:37573
      works fine from hostns, neither of them do in case of l3s. In the
      latter only a curl to https://127.0.0.1:37573 appeared to work where
      for local addresses of bond0 I saw kernel suddenly starting to emit
      ARP requests to query HW address of bond0 which remained unanswered
      and neighbor entries in INCOMPLETE state. These ARP requests only
      happen while in l3s.
      
      Debugging this further, I found the issue is that l3s mode is piggy-
      backing on l3 master device, and in this case local routes are using
      l3mdev_master_dev_rcu(dev) instead of net->loopback_dev as per commit
      f5a0aab8 ("net: ipv4: dst for local input routes should use l3mdev
      if relevant") and 5f02ce24 ("net: l3mdev: Allow the l3mdev to be
      a loopback"). I found that reverting them back into using the
      net->loopback_dev fixed ipvlan l3s connectivity and got everything
      working for the CNI.
      
      Now judging from 4fbae7d8 ("ipvlan: Introduce l3s mode") and the
      l3mdev paper in [0], the sole reason why ipvlan l3s is relying
      on the l3 master device is to get the l3mdev_ip_rcv() receive hook for
      setting the dst entry of the input route without adding its own
      ipvlan specific hacks into the receive path; however, any l3 domain
      semantics beyond just that are breaking l3s operation. Note that
      ipvlan also has the ability to dynamically switch its internal
      operation from l3 to l3s for all ports via ipvlan_set_port_mode()
      at runtime. In any case, l3s solely distinguishes itself from l3 by
      'de-confusing' netfilter through switching skb->dev to the ipvlan
      slave device late in NF_INET_LOCAL_IN before handing the skb to L4.
      
      Minimal fix taken here is to add a IFF_L3MDEV_RX_HANDLER flag which,
      if set from ipvlan setup, gets us only the wanted l3mdev_l3_rcv() hook
      without any additional l3mdev semantics on top. This should also have
      minimal impact since dev->priv_flags is already hot in cache. With
      this set, l3s mode is working fine and I also get things like
      masquerading pod traffic on the ipvlan master properly working.
      
        [0] https://netdevconf.org/1.2/papers/ahern-what-is-l3mdev-paper.pdf
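      
      A hedged sketch of the flag and its helper (names as in the mainline
      fix; ipvlan sets the flag from its l3s port setup):
      
        /* ipvlan l3s setup opts in with:
         *   dev->priv_flags |= IFF_L3MDEV_RX_HANDLER;
         */
        static inline bool netif_has_l3_rx_handler(const struct net_device *dev)
        {
                return dev->priv_flags & IFF_L3MDEV_RX_HANDLER;
        }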
      
      Fixes: f5a0aab8 ("net: ipv4: dst for local input routes should use l3mdev if relevant")
      Fixes: 5f02ce24 ("net: l3mdev: Allow the l3mdev to be a loopback")
      Fixes: 4fbae7d8 ("ipvlan: Introduce l3s mode")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Cc: David Ahern <dsa@cumulusnetworks.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Martynas Pumputis <m@lambda.lt>
      Acked-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • bpf: fix sanitation of alu op with pointer / scalar type from different paths · 52cf1114
      Daniel Borkmann committed
      [ commit d3bd7413e0ca40b60cf60d4003246d067cafdeda upstream ]
      
      While 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer
      arithmetic") took care of rejecting alu op on pointer when e.g. pointer
      came from two different map values with different map properties such as
      value size, Jann reported that a case was not covered yet when a given
      alu op is used in both "ptr_reg += reg" and "numeric_reg += reg" from
      different branches where we would incorrectly try to sanitize based
      on the pointer's limit. Catch this corner case and reject the program
      instead.
      
      Fixes: 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer arithmetic")
      Reported-by: Jann Horn <jannh@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • bpf: prevent out of bounds speculation on pointer arithmetic · cedaa820
      Daniel Borkmann committed
      [ commit 979d63d50c0c0f7bc537bf821e056cc9fe5abd38 upstream ]
      
      Jann reported that the original commit back in b2157399
      ("bpf: prevent out-of-bounds speculation") was not sufficient
      to stop the CPU from speculating out of bounds memory access:
      While b2157399 only focussed on masking array map access
      for unprivileged users for tail calls and data access such
      that the user provided index gets sanitized from BPF program
      and syscall side, there is still a more generic form affected
      from BPF programs that applies to most maps that hold user
      data in relation to dynamic map access when dealing with
      unknown scalars or "slow" known scalars as access offset, for
      example:
      
        - Load a map value pointer into R6
        - Load an index into R7
        - Do a slow computation (e.g. with a memory dependency) that
          loads a limit into R8 (e.g. load the limit from a map for
          high latency, then mask it to make the verifier happy)
        - Exit if R7 >= R8 (mispredicted branch)
        - Load R0 = R6[R7]
        - Load R0 = R6[R0]
      
      For unknown scalars there are two options in the BPF verifier
      where we could derive knowledge from in order to guarantee
      safe access to the memory: i) While </>/<=/>= variants won't
      allow to derive any lower or upper bounds from the unknown
      scalar where it would be safe to add it to the map value
      pointer, it is possible through ==/!= test however. ii) another
      option is to transform the unknown scalar into a known scalar,
      for example, through ALU ops combination such as R &= <imm>
      followed by R |= <imm> or any similar combination where the
      original information from the unknown scalar would be destroyed
      entirely leaving R with a constant. The initial slow load still
      precedes the latter ALU ops on that register, so the CPU
      executes speculatively from that point. Once we have the known
      scalar, any compare operation would work then. A third option
      only involving registers with known scalars could be crafted
      as described in [0] where a CPU port (e.g. Slow Int unit)
      would be filled with many dependent computations such that
      the subsequent condition depending on its outcome has to wait
      for evaluation on its execution port and thereby executing
      speculatively if the speculated code can be scheduled on a
      different execution port, or any other form of mistraining
      as described in [1], for example. Given this is not limited
      to only unknown scalars, not only map but also stack access
      is affected since both is accessible for unprivileged users
      and could potentially be used for out of bounds access under
      speculation.
      
      In order to prevent any of these cases, the verifier is now
      sanitizing pointer arithmetic on the offset such that any
      out of bounds speculation would be masked in a way where the
      pointer arithmetic result in the destination register will
      stay unchanged, meaning offset masked into zero similar as
      in array_index_nospec() case. With regards to implementation,
      there are three options that were considered: i) new insn
      for sanitation, ii) push/pop insn and sanitation as inlined
      BPF, iii) reuse of ax register and sanitation as inlined BPF.
      
      Option i) has the downside that we end up using from reserved
      bits in the opcode space, but also that we would require
      each JIT to emit masking as native arch opcodes meaning
      mitigation would have slow adoption till everyone implements
      it eventually which is counter-productive. Option ii) and iii)
      have both in common that a temporary register is needed in
      order to implement the sanitation as inlined BPF since we
      are not allowed to modify the source register. While a push /
      pop insn in ii) would be useful to have in any case, it
      requires once again that every JIT needs to implement it
      first. While possible, amount of changes needed would also
      be unsuitable for a -stable patch. Therefore, the path which
      has fewer changes, less BPF instructions for the mitigation
      and does not require anything to be changed in the JITs is
      option iii) which this work is pursuing. The ax register is
      already mapped to a register in all JITs (modulo arm32 where
      it's mapped to stack as various other BPF registers there)
      and used in constant blinding for JITs-only so far. It can
      be reused for verifier rewrites under certain constraints.
      The interpreter's tmp "register" has therefore been remapped
      into extending the register set with hidden ax register and
      reusing that for a number of instructions that needed the
      prior temporary variable internally (e.g. div, mod). This
      allows for zero increase in stack space usage in the interpreter,
      and enables (restricted) generic use in rewrites otherwise as
      long as such a patchlet does not make use of these instructions.
      The sanitation mask is dynamic and relative to the offset the
      map value or stack pointer currently holds.
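      
      For reference, the array_index_nospec() idiom from <linux/nospec.h>
      that this masking mirrors looks like the following for a plain array
      access:
      
        #include <linux/nospec.h>
      
        /* branchless clamp: under mis-speculation with idx out of range,
         * the index is forced to 0 instead of running past the array */
        idx = array_index_nospec(idx, ARRAY_SIZE(arr));
        val = arr[idx];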
      
      There are various cases that need to be taken into consideration
      for the masking, e.g. such an operation could look as follows:
      ptr += val or val += ptr or ptr -= val. Thus, the value to be
      sanitized could reside either in the source or in the destination
      register, and the limit differs depending on whether the
      ALU op is addition or subtraction and depending on the
      current known and bounded offset. The limit is derived as
      follows: for addition, limit := max_value_size - (smin_value +
      off); for subtraction, limit := umax_value + off. This holds
      because we do not allow any pointer arithmetic that would
      temporarily go out of bounds or would have an unknown
      value with mixed signed bounds where it is unclear at
      verification time whether the actual runtime value would
      be negative or positive. For example, given a derived map
      pointer value with a constant offset and a bounded one, the
      limit based on smin_value works because the verifier
      requires that statically analyzed arithmetic on the pointer
      must be in bounds, and thus it checks whether the resulting
      smin_value + off and umax_value + off are still within map
      value bounds at the time of the arithmetic in addition to the
      time of access. Similarly, for the case of stack access we
      derive the limit as follows: MAX_BPF_STACK + off for subtraction
      and -off for the case of addition, where off := ptr_reg->off +
      ptr_reg->var_off.value. Subtraction is a special case for
      the masking, which can come in the form of ptr += -val,
      ptr -= -val, or ptr -= val. In the first two cases, where we
      know that the value is negative, we need to temporarily negate
      the value in order to do the sanitation on a positive value,
      swap the ALU op, and afterwards restore the original source
      register if the value was in the source.
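      
      A hedged C sketch of that limit derivation; the function name and
      the flattened parameters mirror the prose above, not the exact
      kernel sources:
      
        #include <stdint.h>
      
        #define MAX_BPF_STACK 512
      
        /* off := ptr_reg->off + ptr_reg->var_off.value */
        static uint64_t derive_limit(int is_map_value, int is_add,
                                     int64_t off, uint64_t max_value_size,
                                     int64_t smin_value, uint64_t umax_value)
        {
                if (is_map_value)
                        return is_add ? max_value_size - (smin_value + off)
                                      : umax_value + off;
                /* stack: -off for addition, MAX_BPF_STACK + off for
                 * subtraction (off is negative for the stack).
                 */
                return is_add ? (uint64_t)-off : MAX_BPF_STACK + off;
        }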
      
      The sanitation of pointer arithmetic alone is still not
      sufficient, since a scenario like the following could
      happen ...
      
        PTR += 0x1000 (e.g. K-based imm)
        PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
        PTR += 0x1000
        PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
        [...]
      
      ... which under speculation could end up as ...
      
        PTR += 0x1000
        PTR -= 0 [ truncated by mitigation ]
        PTR += 0x1000
        PTR -= 0 [ truncated by mitigation ]
        [...]
      
      ... and therefore still access out of bounds. To prevent such
      a case, the verifier also analyzes safety for potential out
      of bounds access under speculative execution. That is, it
      also simulates pointer access under truncation: we therefore
      "branch off" and push the current verification state after the
      ALU operation with a known value of 0 onto the verification
      stack for later analysis. Given that the current path analysis
      succeeded, it is likely that the one under speculation can be
      pruned. In any case, it is also subject to the existing
      complexity limits, and anything beyond this point will
      therefore be rejected. In terms of pruning, it needs to be
      ensured that a verification state from the speculative
      execution simulation never prunes a non-speculative execution
      path; we therefore mark the verifier state accordingly at the
      time of push_stack(). If the verifier detects an out of bounds
      access under speculative execution on one of the possible
      paths that includes a truncation, it will reject such a
      program.
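      
      A hedged sketch of the two pieces involved, as kernel-style
      fragments (names follow our reading of the verifier code at the
      time and are abridged, not verbatim hunks):
      
        /* In the ALU sanitation path: queue the speculative
         * continuation, in which the sanitized value is known to
         * be 0, for later verification.
         */
        struct bpf_verifier_state *branch;
      
        branch = push_stack(env, env->insn_idx + 1, env->insn_idx,
                            true /* speculative */);
      
        /* In states_equal(): a speculative state must never prune
         * a non-speculative path.
         */
        if (old->speculative && !cur->speculative)
                return false;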
      
      Given we mask every reg-based pointer arithmetic for
      unprivileged programs, we've been looking into how this could
      affect real-world programs in terms of size increase. As the
      majority of programs target the privileged-only use case,
      we've unconditionally enabled masking (with its alu
      restrictions on top) for privileged programs for the sake of
      testing, in order to check i) whether they get rejected
      in their current form, and ii) by how much the number of
      instructions and the size increase. We've tested this by
      using Katran, Cilium and test_l4lb from the kernel selftests.
      For Katran we've evaluated balancer_kern.o, for Cilium bpf_lxc.o
      and an older test object bpf_lxc_opt_-DUNKNOWN.o, and for l4lb
      we've used test_l4lb.o as well as test_l4lb_noinline.o. We
      found that none of the programs got rejected by the verifier
      with this change, and that the impact is minimal to none.
      balancer_kern.o had 13,904 bytes (1,738 insns) xlated and
      7,797 bytes JITed before and after the change. The most complex
      program in bpf_lxc.o had 30,544 bytes (3,817 insns) xlated
      and 18,538 bytes JITed before and after, and none of the other
      tail call programs in bpf_lxc.o had any changes either. For
      the older bpf_lxc_opt_-DUNKNOWN.o object we found a small
      increase from 20,616 bytes (2,576 insns) and 12,536 bytes JITed
      before to 20,664 bytes (2,582 insns) and 12,558 bytes JITed
      after the change. Other programs from that object file had a
      similar small increase. test_l4lb.o had no change and
      remained at 6,544 bytes (817 insns) xlated and 3,401 bytes
      JITed, and test_l4lb_noinline.o stayed constant at 5,080 bytes
      (634 insns) xlated and 3,313 bytes JITed. This can be explained
      by the fact that LLVM typically optimizes stack based pointer
      arithmetic into K-based operations, and that dynamic map access
      is not overly frequent. However, in the future we may decide to
      optimize the algorithm further under known guarantees from
      branch and value speculation. The latter also seems unclear in
      terms of the prediction heuristics that today's CPUs apply, as
      well as whether there could be collisions in e.g. the predictor's
      Value History/Pattern Table triggering out of bounds access;
      thus masking is performed unconditionally at this point, but
      could be subject to relaxation later on. We also generally
      brainstormed various other approaches for mitigation, but the
      blocker was always the lack of available registers at runtime
      and/or the overhead of runtime tracking of limits belonging to
      a specific pointer. Thus, we found this to be minimally
      intrusive under the given constraints.
      
      With that in place, a simple example of a sanitized access on an
      unprivileged load, at post-verification time, looks as follows:
      
        # bpftool prog dump xlated id 282
        [...]
        28: (79) r1 = *(u64 *)(r7 +0)
        29: (79) r2 = *(u64 *)(r7 +8)
        30: (57) r1 &= 15
        31: (79) r3 = *(u64 *)(r0 +4608)
        32: (57) r3 &= 1
        33: (47) r3 |= 1
        34: (2d) if r2 > r3 goto pc+19
        35: (b4) (u32) r11 = (u32) 20479  |
        36: (1f) r11 -= r2                | Dynamic sanitation for pointer
        37: (4f) r11 |= r2                | arithmetic with registers
        38: (87) r11 = -r11               | containing bounded or known
        39: (c7) r11 s>>= 63              | scalars in order to prevent
        40: (5f) r11 &= r2                | out of bounds speculation.
        41: (0f) r4 += r11                |
        42: (71) r4 = *(u8 *)(r4 +0)
        43: (6f) r4 <<= r1
        [...]
      
      For the case where the scalar sits in the destination register
      as opposed to the source register, the following code is emitted
      for the above example:
      
        [...]
        16: (b4) (u32) r11 = (u32) 20479
        17: (1f) r11 -= r2
        18: (4f) r11 |= r2
        19: (87) r11 = -r11
        20: (c7) r11 s>>= 63
        21: (5f) r2 &= r11
        22: (0f) r2 += r0
        23: (61) r0 = *(u32 *)(r2 +0)
        [...]
      
      JIT blinding example with non-conflicting use of r10:
      
        [...]
         d5:	je     0x0000000000000106    _
         d7:	mov    0x0(%rax),%edi       |
         da:	mov    $0xf153246,%r10d     | Index load from map value and
         e0:	xor    $0xf153259,%r10      | (const blinded) mask with 0x1f.
         e7:	and    %r10,%rdi            |_
         ea:	mov    $0x2f,%r10d          |
         f0:	sub    %rdi,%r10            | Sanitized addition. Both use r10
         f3:	or     %rdi,%r10            | but do not interfere with each
         f6:	neg    %r10                 | other. (Neither do these instructions
         f9:	sar    $0x3f,%r10           | interfere with the use of ax as temp
         fd:	and    %r10,%rdi            | in interpreter.)
        100:	add    %rax,%rdi            |_
        103:	mov    0x0(%rdi),%eax
        [...]
      
      Tested that it fixes Jann's reproducer, and also checked that the
      test_verifier and test_progs suites run successfully with the
      interpreter, the JIT, and the JIT with hardening enabled on
      x86-64 and arm64.
      
        [0] Speculose: Analyzing the Security Implications of Speculative
            Execution in CPUs, Giorgi Maisuradze and Christian Rossow,
            https://arxiv.org/pdf/1801.04084.pdf
      
        [1] A Systematic Evaluation of Transient Execution Attacks and
            Defenses, Claudio Canella, Jo Van Bulck, Michael Schwarz,
            Moritz Lipp, Benjamin von Berg, Philipp Ortner, Frank Piessens,
            Dmitry Evtyushkin, Daniel Gruss,
            https://arxiv.org/pdf/1811.05441.pdf
      
      Fixes: b2157399 ("bpf: prevent out-of-bounds speculation")
      Reported-by: Jann Horn <jannh@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      cedaa820
    • D
      bpf: enable access to ax register also from verifier rewrite · 2facace8
      Daniel Borkmann committed
      [ commit 9b73bfdd08e73231d6a90ae6db4b46b3fbf56c30 upstream ]
      
      Right now we are using the BPF ax register in the JIT for constant
      blinding as well as in the interpreter as a temporary variable.
      The verifier is not able to use it, simply because its use would
      get overridden by the former in bpf_jit_blind_insn(). However, it
      can be made to work in that blinding is skipped if there is prior
      use of ax in either the source or the destination register of the
      instruction, as sketched below. Taking the constraints of ax into
      account, the verifier is then open to use it in rewrites. Note
      that the ax register already has a mapping in every eBPF JIT.
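      
      A hedged sketch of the skip condition in bpf_jit_blind_insn()
      (abridged; "from" is the instruction considered for blinding):
      
        if (from->dst_reg == BPF_REG_AX || from->src_reg == BPF_REG_AX)
                goto out; /* prior ax use: leave the insn unblinded */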
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      2facace8
    • D
      bpf: move tmp variable into ax register in interpreter · 648f6859
      Daniel Borkmann committed
      [ commit 144cd91c4c2bced6eb8a7e25e590f6618a11e854 upstream ]
      
      This change moves the on-stack 64 bit tmp variable in ___bpf_prog_run()
      into the hidden ax register. The latter is currently only used in JITs
      for constant blinding as a temporary scratch register, meaning the BPF
      interpreter will never see the use of ax. Therefore it is safe to use
      it for the cases where tmp was used before. This is needed to later
      allow restricted hidden use of ax in both the interpreter and the
      JITs, as sketched below.
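      
      A hedged, abridged sketch of the shape of the change in
      ___bpf_prog_run(); only one affected opcode is shown, and the
      macro mirrors the interpreter's DST/SRC conventions:
      
        #define AX regs[BPF_REG_AX]
      
        ALU64_MOD_X:
                /* was: an on-stack tmp; now the hidden ax register */
                div64_u64_rem(DST, SRC, &AX);
                DST = AX;
                CONT;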
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      648f6859
    • D
      bpf: move {prev_,}insn_idx into verifier env · f38a7d20
      Daniel Borkmann committed
      [ commit c08435ec7f2bc8f4109401f696fd55159b4b40cb upstream ]
      
      Move prev_insn_idx and insn_idx from the do_check() function into
      the verifier environment, so they can be read inside the various
      helper functions for handling the instructions. It's easier to put
      them into the environment than to change all call sites just to
      pass them along. insn_idx is useful in particular since it later
      allows holding state in env->insn_aux_data[env->insn_idx], as
      sketched below.
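      
      A hedged sketch of the resulting shape (abridged; the exact field
      placement in struct bpf_verifier_env is illustrative):
      
        struct bpf_verifier_env {
                /* ... */
                u32 insn_idx;
                u32 prev_insn_idx;
                /* ... */
        };
      
        /* helpers can then do, without extra parameters: */
        env->insn_aux_data[env->insn_idx].seen = true;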
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      f38a7d20
    • D
      Drivers: hv: vmbus: Check for ring when getting debug info · 028e002b
      Dexuan Cui committed
      commit ba50bf1ce9a51fc97db58b96d01306aa70bc3979 upstream.
      
      fc96df16a1ce is good and already fixes the "return stack garbage"
      issue, but let's also improve hv_ringbuffer_get_debuginfo(),
      which would silently return stack garbage if people forget to
      check channel->state or ring_info->ring_buffer when using the
      function in the future.
      
      Having an error check in the function would eliminate the
      potential risk; a hedged sketch of the guard follows.
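      
      The guard, abridged and with the signature simplified; the return
      type becomes int so that callers can detect the error:
      
        int hv_ringbuffer_get_debuginfo(struct hv_ring_buffer_info *ring_info,
                                        struct hv_ring_buffer_debug_info *debug_info)
        {
                if (!ring_info->ring_buffer)
                        return -EINVAL;
      
                /* ... fill in *debug_info as before ... */
                return 0;
        }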
      
      Add a Fixes tag to indicate the patch dependency.
      
      Fixes: fc96df16a1ce ("Drivers: hv: vmbus: Return -EINVAL for the sys files for unopened channels")
      Cc: stable@vger.kernel.org
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
      Signed-off-by: Dexuan Cui <decui@microsoft.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      028e002b
    • R
      net: Fix usage of pskb_trim_rcsum · ff3f0d43
      Ross Lagerwall committed
      [ Upstream commit 6c57f0458022298e4da1729c67bd33ce41c14e7a ]
      
      In certain cases, pskb_trim_rcsum() may change skb pointers.
      Reinitialize header pointers afterwards to avoid potential
      use-after-frees (see the pattern sketched below). Add a note to
      the documentation of pskb_trim_rcsum(). Found by KASAN.
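      
      A hedged illustration of the pattern the fix applies at the call
      sites (the surrounding code is illustrative, not a verbatim hunk):
      
        if (pskb_trim_rcsum(skb, len))
                goto drop;
        /* pskb_trim_rcsum() may have reallocated skb->data,
         * so re-derive any cached header pointers:
         */
        iph = ip_hdr(skb);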
      Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      ff3f0d43
    • X
      sched.h: depend on resctl & intel_rdt · 21d8578b
      Xie XiuQi committed
      hulk inclusion
      category: feature
      bugzilla: 5510
      CVE: NA
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      21d8578b
    • Y
      resctrlfs: mpam: init struct for mpam · 1abcabe9
      Yang Yingliang committed
      hulk inclusion
      category: feature
      bugzilla: 5510
      CVE: NA
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      1abcabe9