1. 21 Feb, 2022 1 commit
  2. 19 Feb, 2022 1 commit
  3. 17 Feb, 2022 1 commit
  4. 14 Feb, 2022 1 commit
    • E
      net_sched: add __rcu annotation to netdev->qdisc · 5891cd5e
      Eric Dumazet authored
      syzbot found a data-race [1] which led me to add __rcu
      annotations to netdev->qdisc, and proper accessors
      to get LOCKDEP support.
      
      [1]
      BUG: KCSAN: data-race in dev_activate / qdisc_lookup_rcu
      
      write to 0xffff888168ad6410 of 8 bytes by task 13559 on cpu 1:
       attach_default_qdiscs net/sched/sch_generic.c:1167 [inline]
       dev_activate+0x2ed/0x8f0 net/sched/sch_generic.c:1221
       __dev_open+0x2e9/0x3a0 net/core/dev.c:1416
       __dev_change_flags+0x167/0x3f0 net/core/dev.c:8139
       rtnl_configure_link+0xc2/0x150 net/core/rtnetlink.c:3150
       __rtnl_newlink net/core/rtnetlink.c:3489 [inline]
       rtnl_newlink+0xf4d/0x13e0 net/core/rtnetlink.c:3529
       rtnetlink_rcv_msg+0x745/0x7e0 net/core/rtnetlink.c:5594
       netlink_rcv_skb+0x14e/0x250 net/netlink/af_netlink.c:2494
       rtnetlink_rcv+0x18/0x20 net/core/rtnetlink.c:5612
       netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
       netlink_unicast+0x602/0x6d0 net/netlink/af_netlink.c:1343
       netlink_sendmsg+0x728/0x850 net/netlink/af_netlink.c:1919
       sock_sendmsg_nosec net/socket.c:705 [inline]
       sock_sendmsg net/socket.c:725 [inline]
       ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
       ___sys_sendmsg net/socket.c:2467 [inline]
       __sys_sendmsg+0x195/0x230 net/socket.c:2496
       __do_sys_sendmsg net/socket.c:2505 [inline]
       __se_sys_sendmsg net/socket.c:2503 [inline]
       __x64_sys_sendmsg+0x42/0x50 net/socket.c:2503
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      read to 0xffff888168ad6410 of 8 bytes by task 13560 on cpu 0:
       qdisc_lookup_rcu+0x30/0x2e0 net/sched/sch_api.c:323
       __tcf_qdisc_find+0x74/0x3a0 net/sched/cls_api.c:1050
       tc_del_tfilter+0x1c7/0x1350 net/sched/cls_api.c:2211
       rtnetlink_rcv_msg+0x5ba/0x7e0 net/core/rtnetlink.c:5585
       netlink_rcv_skb+0x14e/0x250 net/netlink/af_netlink.c:2494
       rtnetlink_rcv+0x18/0x20 net/core/rtnetlink.c:5612
       netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
       netlink_unicast+0x602/0x6d0 net/netlink/af_netlink.c:1343
       netlink_sendmsg+0x728/0x850 net/netlink/af_netlink.c:1919
       sock_sendmsg_nosec net/socket.c:705 [inline]
       sock_sendmsg net/socket.c:725 [inline]
       ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
       ___sys_sendmsg net/socket.c:2467 [inline]
       __sys_sendmsg+0x195/0x230 net/socket.c:2496
       __do_sys_sendmsg net/socket.c:2505 [inline]
       __se_sys_sendmsg net/socket.c:2503 [inline]
       __x64_sys_sendmsg+0x42/0x50 net/socket.c:2503
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      value changed: 0xffffffff85dee080 -> 0xffff88815d96ec00
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 13560 Comm: syz-executor.2 Not tainted 5.17.0-rc3-syzkaller-00116-gf1baf68e-dirty #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: 470502de ("net: sched: unlock rules update API")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Vlad Buslov <vladbu@mellanox.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 12 Feb, 2022 2 commits
    • P
      kfence: make test case compatible with run time set sample interval · 8913c610
      Peng Liu authored
      The parameter kfence_sample_interval can be set via a boot parameter or
      later from the shell, which is convenient for automated tests and KFENCE
      parameter optimization.  However, the KFENCE test case only uses the
      compile-time CONFIG_KFENCE_SAMPLE_INTERVAL, so it does not run as users
      intend once the interval has been changed.  Export
      kfence_sample_interval, so that the KFENCE test case can use the
      run-time-set sample interval.
      
      Link: https://lkml.kernel.org/r/20220207034432.185532-1-liupeng256@huawei.com
      Signed-off-by: Peng Liu <liupeng256@huawei.com>
      Reviewed-by: Marco Elver <elver@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Christian König <christian.koenig@amd.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • R
      mm: memcg: synchronize objcg lists with a dedicated spinlock · 0764db9b
      Roman Gushchin authored
      Alexander reported a circular lock dependency revealed by the mmap1 ltp
      test:
      
        LOCKDEP_CIRCULAR (suite: ltp, case: mtest06 (mmap1))
                WARNING: possible circular locking dependency detected
                5.17.0-20220113.rc0.git0.f2211f194038.300.fc35.s390x+debug #1 Not tainted
                ------------------------------------------------------
                mmap1/202299 is trying to acquire lock:
                00000001892c0188 (css_set_lock){..-.}-{2:2}, at: obj_cgroup_release+0x4a/0xe0
                but task is already holding lock:
                00000000ca3b3818 (&sighand->siglock){-.-.}-{2:2}, at: force_sig_info_to_task+0x38/0x180
                which lock already depends on the new lock.
                the existing dependency chain (in reverse order) is:
                -> #1 (&sighand->siglock){-.-.}-{2:2}:
                       __lock_acquire+0x604/0xbd8
                       lock_acquire.part.0+0xe2/0x238
                       lock_acquire+0xb0/0x200
                       _raw_spin_lock_irqsave+0x6a/0xd8
                       __lock_task_sighand+0x90/0x190
                       cgroup_freeze_task+0x2e/0x90
                       cgroup_migrate_execute+0x11c/0x608
                       cgroup_update_dfl_csses+0x246/0x270
                       cgroup_subtree_control_write+0x238/0x518
                       kernfs_fop_write_iter+0x13e/0x1e0
                       new_sync_write+0x100/0x190
                       vfs_write+0x22c/0x2d8
                       ksys_write+0x6c/0xf8
                       __do_syscall+0x1da/0x208
                       system_call+0x82/0xb0
                -> #0 (css_set_lock){..-.}-{2:2}:
                       check_prev_add+0xe0/0xed8
                       validate_chain+0x736/0xb20
                       __lock_acquire+0x604/0xbd8
                       lock_acquire.part.0+0xe2/0x238
                       lock_acquire+0xb0/0x200
                       _raw_spin_lock_irqsave+0x6a/0xd8
                       obj_cgroup_release+0x4a/0xe0
                       percpu_ref_put_many.constprop.0+0x150/0x168
                       drain_obj_stock+0x94/0xe8
                       refill_obj_stock+0x94/0x278
                       obj_cgroup_charge+0x164/0x1d8
                       kmem_cache_alloc+0xac/0x528
                       __sigqueue_alloc+0x150/0x308
                       __send_signal+0x260/0x550
                       send_signal+0x7e/0x348
                       force_sig_info_to_task+0x104/0x180
                       force_sig_fault+0x48/0x58
                       __do_pgm_check+0x120/0x1f0
                       pgm_check_handler+0x11e/0x180
                other info that might help us debug this:
                 Possible unsafe locking scenario:
                       CPU0                    CPU1
                       ----                    ----
                  lock(&sighand->siglock);
                                               lock(css_set_lock);
                                               lock(&sighand->siglock);
                  lock(css_set_lock);
                 *** DEADLOCK ***
                2 locks held by mmap1/202299:
                 #0: 00000000ca3b3818 (&sighand->siglock){-.-.}-{2:2}, at: force_sig_info_to_task+0x38/0x180
                 #1: 00000001892ad560 (rcu_read_lock){....}-{1:2}, at: percpu_ref_put_many.constprop.0+0x0/0x168
                stack backtrace:
                CPU: 15 PID: 202299 Comm: mmap1 Not tainted 5.17.0-20220113.rc0.git0.f2211f194038.300.fc35.s390x+debug #1
                Hardware name: IBM 3906 M04 704 (LPAR)
                Call Trace:
                  dump_stack_lvl+0x76/0x98
                  check_noncircular+0x136/0x158
                  check_prev_add+0xe0/0xed8
                  validate_chain+0x736/0xb20
                  __lock_acquire+0x604/0xbd8
                  lock_acquire.part.0+0xe2/0x238
                  lock_acquire+0xb0/0x200
                  _raw_spin_lock_irqsave+0x6a/0xd8
                  obj_cgroup_release+0x4a/0xe0
                  percpu_ref_put_many.constprop.0+0x150/0x168
                  drain_obj_stock+0x94/0xe8
                  refill_obj_stock+0x94/0x278
                  obj_cgroup_charge+0x164/0x1d8
                  kmem_cache_alloc+0xac/0x528
                  __sigqueue_alloc+0x150/0x308
                  __send_signal+0x260/0x550
                  send_signal+0x7e/0x348
                  force_sig_info_to_task+0x104/0x180
                  force_sig_fault+0x48/0x58
                  __do_pgm_check+0x120/0x1f0
                  pgm_check_handler+0x11e/0x180
                INFO: lockdep is turned off.
      
      In this example a slab allocation from __send_signal() caused a
      refill and drain of a percpu objcg stock, which resulted in the release
      of another, unrelated objcg.  The objcg release path requires taking
      css_set_lock, which is used to synchronize objcg lists.
      
      This can create a circular dependency with the sighand lock (siglock),
      which the freezer code takes while holding css_set_lock (to freeze a
      task).
      
      In general it seems that using css_set_lock to synchronize objcg lists
      makes any slab allocation or deallocation performed while holding
      css_set_lock, or any intervening lock, risky.
      
      To fix the problem and make the code more robust let's stop using
      css_set_lock to synchronize objcg lists and use a new dedicated spinlock
      instead.
      
      Link: https://lkml.kernel.org/r/Yfm1IHmoGdyUR81T@carbon.dhcp.thefacebook.com
      Fixes: bf4f0599 ("mm: memcg/slab: obj_cgroup API")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reported-by: Alexander Egorenkov <egorenar@linux.ibm.com>
      Tested-by: Alexander Egorenkov <egorenar@linux.ibm.com>
      Reviewed-by: Waiman Long <longman@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Jeremy Linton <jeremy.linton@arm.com>
      Tested-by: Jeremy Linton <jeremy.linton@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 09 Feb, 2022 1 commit
  7. 08 Feb, 2022 2 commits
    • R
      PM: s2idle: ACPI: Fix wakeup interrupts handling · cb1f65c1
      Rafael J. Wysocki authored
      After commit e3728b50 ("ACPI: PM: s2idle: Avoid possible race
      related to the EC GPE") wakeup interrupts occurring immediately after
      the one discarded by acpi_s2idle_wake() may be missed.  Moreover, if
      the SCI triggers again immediately after the rearming in
      acpi_s2idle_wake(), that wakeup may be missed too.
      
      The problem is that pm_system_irq_wakeup() only calls pm_system_wakeup()
      when pm_wakeup_irq is 0, but that's not the case any more after the
      interrupt causing acpi_s2idle_wake() to run until pm_wakeup_irq is
      cleared by the pm_wakeup_clear() call in s2idle_loop().  However,
      there may be wakeup interrupts occurring in that time frame and if
      that happens, they will be missed.
      
      To address that issue first move the clearing of pm_wakeup_irq to
      the point at which it is known that the interrupt causing
      acpi_s2idle_wake() to run will be discarded, before rearming the SCI
      for wakeup.  Moreover, because that only reduces the size of the
      time window in which the issue may manifest itself, allow
      pm_system_irq_wakeup() to register two wakeup interrupts in
      a row and, when discarding the first one, replace it with the second
      one.  [Of course, this assumes that only one wakeup interrupt can be
      discarded in one go, but currently that is the case and I am not
      aware of any plans to change that.]
      
      Fixes: e3728b50 ("ACPI: PM: s2idle: Avoid possible race related to the EC GPE")
      Cc: 5.4+ <stable@vger.kernel.org> # 5.4+
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • M
      Drivers: hv: vmbus: Rework use of DMA_BIT_MASK(64) · 6bf625a4
      Michael Kelley authored
      Using DMA_BIT_MASK(64) as an initializer for a global variable
      causes problems with Clang 12.0.1. The compiler doesn't understand
      that value 64 is excluded from the shift at compile time, resulting
      in a build error.
      
      While this is a compiler problem, avoid the issue by setting up
      the dma_mask memory as part of struct hv_device, and initialize
      it using dma_set_mask().

      Reported-by: Nathan Chancellor <nathan@kernel.org>
      Reported-by: Vitaly Chikunov <vt@altlinux.org>
      Reported-by: Jakub Kicinski <kuba@kernel.org>
      Fixes: 743b237c ("scsi: storvsc: Add Isolation VM support for storvsc driver")
      Signed-off-by: Michael Kelley <mikelley@microsoft.com>
      Reviewed-by: Nathan Chancellor <nathan@kernel.org>
      Tested-by: Nathan Chancellor <nathan@kernel.org>
      Link: https://lore.kernel.org/r/1644176216-12531-1-git-send-email-mikelley@microsoft.com
      Signed-off-by: Wei Liu <wei.liu@kernel.org>
  8. 07 Feb, 2022 1 commit
    • D
      ata: libata-core: Fix ata_dev_config_cpr() · fda17afc
      Damien Le Moal authored
      The concurrent positioning ranges log page 47h is a general purpose log
      page and not a subpage of the identify device log. Using
      ata_identify_page_supported() to test for concurrent positioning ranges
      support is thus wrong. ata_log_supported() must be used.
      
      Furthermore, unlike other advanced ATA features (e.g. NCQ priority),
      accesses to the concurrent positioning ranges log page are not gated by
      a feature bit from the device IDENTIFY data. Since many older drives
      react badly to the READ LOG EXT and/or READ LOG DMA EXT commands issued
      to read device log pages, avoid problems with older drives by limiting
      the concurrent positioning ranges support detection to drives
      implementing at least the ACS-4 ATA standard (major version 11). This
      additional condition effectively turns ata_dev_config_cpr() into a nop
      for older drives, avoiding problems in the field.
      
      Fixes: fe22e1c2 ("libata: support concurrent positioning ranges log")
      BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=215519
      Cc: stable@vger.kernel.org
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Tested-by: Abderraouf Adjal <adjal.arf@gmail.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
  9. 05 Feb, 2022 2 commits
  10. 04 Feb, 2022 2 commits
    • A
      ata: libata-core: Introduce ATA_HORKAGE_NO_LOG_DIR horkage · ac9f0c81
      Anton Lundin authored
      06f6c4c6 ("ata: libata: add missing ata_identify_page_supported() calls")
      introduced additional calls to ata_identify_page_supported(), thus also
      indirectly adding accesses to the device log directory log page through
      ata_log_supported(). Reading this log page causes SATADOM-ML 3ME devices
      to lock up.
      
      Introduce the horkage flag ATA_HORKAGE_NO_LOG_DIR to prevent accesses to
      the log directory in ata_log_supported() and add a blacklist entry
      with this flag for "SATADOM-ML 3ME" devices.
      
      Fixes: 636f6e2a ("libata: add horkage for missing Identify Device log")
      Cc: stable@vger.kernel.org # v5.10+
      Signed-off-by: Anton Lundin <glance@acc.umu.se>
      Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
    • I
      Revert "module, async: async_synchronize_full() on module init iff async is used" · 67d6212a
      Igor Pylypiv authored
      This reverts commit 774a1221.
      
      We need to finish all async code before the module init sequence is
      done.  In the reverted commit the PF_USED_ASYNC flag was added to mark a
      thread that called async_schedule().  Then the PF_USED_ASYNC flag was
      used to determine whether or not async_synchronize_full() needs to be
      invoked.  This works when modprobe thread is calling async_schedule(),
      but it does not work if module dispatches init code to a worker thread
      which then calls async_schedule().
      
      For example, PCI driver probing is invoked from a worker thread based on
      a node where device is attached:
      
      	if (cpu < nr_cpu_ids)
      		error = work_on_cpu(cpu, local_pci_probe, &ddi);
      	else
      		error = local_pci_probe(&ddi);
      
      We end up in a situation where a worker thread gets the PF_USED_ASYNC
      flag set instead of the modprobe thread.  As a result,
      async_synchronize_full() is not invoked and modprobe completes without
      waiting for the async code to finish.
      
      The issue was discovered while loading the pm80xx driver:
      (scsi_mod.scan=async)
      
      modprobe pm80xx                      worker
      ...
        do_init_module()
        ...
          pci_call_probe()
            work_on_cpu(local_pci_probe)
                                           local_pci_probe()
                                             pm8001_pci_probe()
                                               scsi_scan_host()
                                                 async_schedule()
                                                 worker->flags |= PF_USED_ASYNC;
                                           ...
            < return from worker >
        ...
        if (current->flags & PF_USED_ASYNC) <--- false
        	async_synchronize_full();
      
      Commit 21c3c5d2 ("block: don't request module during elevator init")
      fixed the deadlock issue which the reverted commit 774a1221
      ("module, async: async_synchronize_full() on module init iff async is
      used") tried to fix.
      
      Since commit 0fdff3ec ("async, kmod: warn on synchronous
      request_module() from async workers") synchronous module loading from
      async is not allowed.
      
      Given that the original deadlock issue is fixed and it is no longer
      allowed to call synchronous request_module() from async we can remove
      PF_USED_ASYNC flag to make module init consistently invoke
      async_synchronize_full() unless async module probe is requested.

      Signed-off-by: Igor Pylypiv <ipylypiv@google.com>
      Reviewed-by: Changyuan Lyu <changyuanl@google.com>
      Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 03 Feb, 2022 6 commits
  12. 02 Feb, 2022 4 commits
    • T
      NFS: Avoid duplicate uncached readdir calls on eof · e1d2699b
      Trond Myklebust authored
      If we've reached the end of the directory, then cache that information
      in the context so that we don't need to do an uncached readdir in order
      to rediscover that fact.
      
      Fixes: 794092c5 ("NFS: Do uncached readdir when we're seeking a cookie in an empty page cache")
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • H
      Revert "fbdev: Garbage collect fbdev scrolling acceleration, part 1 (from TODO list)" · 1148836f
      Helge Deller authored
      This reverts commit b3ec8cdf.
      
      Revert the second (of 2) commits which disabled scrolling acceleration
      in fbcon/fbdev.  It introduced a regression for fbdev-supported graphic
      cards because of the performance penalty by doing screen scrolling by
      software instead of using the existing graphic card 2D hardware
      acceleration.
      
      Console scrolling acceleration was disabled by dropping code which
      checked at runtime the driver hardware capabilities for the
      FBINFO_HWACCEL_COPYAREA or FBINFO_HWACCEL_FILLRECT flags and if set, it
      enabled scrollmode SCROLL_MOVE which uses hardware acceleration to move
      screen contents.  After dropping those checks scrollmode was hard-wired
      to SCROLL_REDRAW instead, which forces all graphic cards to redraw every
      character at the new screen position when scrolling.
      
      This change effectively disabled all hardware-based scrolling acceleration for
      ALL drivers, because now all kind of 2D hardware acceleration (bitblt,
      fillrect) in the drivers isn't used any longer.
      
      The original commit message mentions that only 3 DRM drivers (nouveau, omapdrm
      and gma500) used hardware acceleration in the past and thus code for checking
      and using scrolling acceleration is obsolete.
      
      This statement is NOT TRUE, because beside the DRM drivers there are around 35
      other fbdev drivers which depend on fbdev/fbcon and still provide hardware
      acceleration for fbdev/fbcon.
      
      The original commit message also states that syzbot found lots of bugs in fbcon
      and thus it's "often the solution to just delete code and remove features".
      This is true, and the bugs - which actually affected all users of fbcon,
      including DRM - were fixed, or code was dropped like e.g. the support for
      software scrollback in vgacon (commit 973c096f).
      
      So to further analyze which bugs were found by syzbot, I've looked through all
      patches in drivers/video which were tagged with syzbot or syzkaller back to
      year 2005. The vast majority fixed the reported issues on a higher level, e.g.
      when screen is to be resized, or when font size is to be changed. The few ones
      which touched driver code fixed a real driver bug, e.g. by adding a check.
      
      But NONE of those patches touched code of either the SCROLL_MOVE or the
      SCROLL_REDRAW case.
      
      That means, there was no real reason why SCROLL_MOVE had to be ripped-out and
      just SCROLL_REDRAW had to be used instead. The only reason I can imagine so far
      was that SCROLL_MOVE wasn't used by DRM and as such it was assumed that it
      could go away. That argument completely missed the fact that SCROLL_MOVE is
      still heavily used by fbdev (non-DRM) drivers.
      
      Some people mention that using memcpy() instead of the hardware acceleration is
      pretty much the same speed. But that's not true, at least not for older graphic
      cards and machines where we see speed decreases by factor 10 and more and thus
      this change leads to console responsiveness way worse than before.
      
      That's why the original commit is to be reverted. By reverting we
      reintroduce hardware-based scrolling acceleration and fix the
      performance regression for fbdev drivers.
      
      There isn't any impact on DRM when reverting those patches.

      Signed-off-by: Helge Deller <deller@gmx.de>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: Sven Schnelle <svens@stackframe.org>
      Cc: stable@vger.kernel.org # v5.16+
      Signed-off-by: Helge Deller <deller@gmx.de>
      Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      Link: https://patchwork.freedesktop.org/patch/msgid/20220202135531.92183-2-deller@gmx.de
    • K
      net/mlx5e: Use struct_group() for memcpy() region · 6d5c900e
      Kees Cook authored
      In preparation for FORTIFY_SOURCE performing compile-time and run-time
      field bounds checking for memcpy(), memmove(), and memset(), avoid
      intentionally writing across neighboring fields.
      
      Use struct_group() in struct vlan_ethhdr around members h_dest and
      h_source, so they can be referenced together. This will allow memcpy()
      and sizeof() to more easily reason about sizes, improve readability,
      and avoid future warnings about writing beyond the end of h_dest.
      
      "pahole" shows no size nor member offset changes to struct vlan_ethhdr.
      "objdump -d" shows no object code changes.
      
      Fixes: 34802a42 ("net/mlx5e: Do not modify the TX SKB")
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • D
      netfs, cachefiles: Add a method to query presence of data in the cache · bee9f655
      David Howells authored
      Add a netfs_cache_ops method by which a network filesystem can ask the
      cache about what data it has available and where so that it can make a
      multipage read more efficient.

      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: linux-cachefs@redhat.com
      Acked-by: Jeff Layton <jlayton@kernel.org>
      Reviewed-by: Rohith Surabattula <rohiths@microsoft.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
  13. 01 Feb, 2022 1 commit
    • M
      kvm: add guest_state_{enter,exit}_irqoff() · ef9989af
      Mark Rutland authored
      When transitioning to/from guest mode, it is necessary to inform
      lockdep, tracing, and RCU in a specific order, similar to the
      requirements for transitions to/from user mode. Additionally, it is
      necessary to perform vtime accounting for a window around running the
      guest, with RCU enabled, such that timer interrupts taken from the guest
      can be accounted as guest time.
      
      Most architectures don't handle all the necessary pieces, and have a
      number of common bugs, including unsafe usage of RCU during the window
      between guest_enter() and guest_exit().
      
      On x86, this was dealt with across commits:
      
        87fa7f3e ("x86/kvm: Move context tracking where it belongs")
        0642391e ("x86/kvm/vmx: Add hardirq tracing to guest enter/exit")
        9fc975e9 ("x86/kvm/svm: Add hardirq tracing on guest enter/exit")
        3ebccdf3 ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text")
        135961e0 ("x86/kvm/svm: Move guest enter/exit into .noinstr.text")
        16045714 ("KVM: x86: Defer vtime accounting 'til after IRQ handling")
        bc908e09 ("KVM: x86: Consolidate guest enter/exit logic to common helpers")
      
      ... but those fixes are specific to x86, and as the resulting logic
      (while correct) is split across generic helper functions and
      x86-specific helper functions, it is difficult to see that the
      entry/exit accounting is balanced.
      
      This patch adds generic helpers which architectures can use to handle
      guest entry/exit consistently and correctly. The guest_{enter,exit}()
      helpers are split into guest_timing_{enter,exit}() to perform vtime
      accounting, and guest_context_{enter,exit}() to perform the necessary
      context tracking and RCU management. The existing guest_{enter,exit}()
      helpers are left as wrappers of these.
      
      Atop this, new guest_state_enter_irqoff() and guest_state_exit_irqoff()
      helpers are added to handle the ordering of lockdep, tracing, and RCU
      management. These are intended to mirror exit_to_user_mode() and
      enter_from_user_mode().
      
      Subsequent patches will migrate architectures over to the new helpers,
      following a sequence:
      
      	guest_timing_enter_irqoff();
      
      	guest_state_enter_irqoff();
      	< run the vcpu >
      	guest_state_exit_irqoff();
      
      	< take any pending IRQs >
      
      	guest_timing_exit_irqoff();
      
      This sequence handles all of the above correctly, and more clearly
      balances the entry and exit portions, making it easier to understand.
      
      The existing helpers are marked as deprecated, and will be removed once
      all architectures have been converted.
      
      There should be no functional change as a result of this patch.

      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Reviewed-by: Marc Zyngier <maz@kernel.org>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
      Message-Id: <20220201132926.3301912-2-mark.rutland@arm.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  14. 30 Jan, 2022 4 commits
  15. 29 Jan, 2022 2 commits
    • M
      block: add bio_start_io_acct_time() to control start_time · e45c47d1
      Mike Snitzer authored
      The bio_start_io_acct_time() interface is like bio_start_io_acct(),
      except that it allows start_time to be passed in. This gives drivers the
      ability to defer starting accounting until after IO is issued (but
      possibly not entirely due to bio splitting).

      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Link: https://lore.kernel.org/r/20220128155841.39644-2-snitzer@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • V
      security, lsm: dentry_init_security() Handle multi LSM registration · 7f5056b9
      Vivek Goyal authored
      A ceph user has reported that ceph is crashing with kernel NULL pointer
      dereference. Following is the backtrace.
      
      /proc/version: Linux version 5.16.2-arch1-1 (linux@archlinux) (gcc (GCC)
      11.1.0, GNU ld (GNU Binutils) 2.36.1) #1 SMP PREEMPT Thu, 20 Jan 2022
      16:18:29 +0000
      distro / arch: Arch Linux / x86_64
      SELinux is not enabled
      ceph cluster version: 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503)
      
      relevant dmesg output:
      [   30.947129] BUG: kernel NULL pointer dereference, address:
      0000000000000000
      [   30.947206] #PF: supervisor read access in kernel mode
      [   30.947258] #PF: error_code(0x0000) - not-present page
      [   30.947310] PGD 0 P4D 0
      [   30.947342] Oops: 0000 [#1] PREEMPT SMP PTI
      [   30.947388] CPU: 5 PID: 778 Comm: touch Not tainted 5.16.2-arch1-1 #1
      86fbf2c313cc37a553d65deb81d98e9dcc2a3659
      [   30.947486] Hardware name: Gigabyte Technology Co., Ltd. B365M
      DS3H/B365M DS3H, BIOS F5 08/13/2019
      [   30.947569] RIP: 0010:strlen+0x0/0x20
      [   30.947616] Code: b6 07 38 d0 74 16 48 83 c7 01 84 c0 74 05 48 39 f7 75
      ec 31 c0 31 d2 89 d6 89 d7 c3 48 89 f8 31 d2 89 d6 89 d7 c3 0
      f 1f 40 00 <80> 3f 00 74 12 48 89 f8 48 83 c0 01 80 38 00 75 f7 48 29 f8 31
      ff
      [   30.947782] RSP: 0018:ffffa4ed80ffbbb8 EFLAGS: 00010246
      [   30.947836] RAX: 0000000000000000 RBX: ffffa4ed80ffbc60 RCX:
      0000000000000000
      [   30.947904] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
      0000000000000000
      [   30.947971] RBP: ffff94b0d15c0ae0 R08: 0000000000000000 R09:
      0000000000000000
      [   30.948040] R10: 0000000000000000 R11: 0000000000000000 R12:
      0000000000000000
      [   30.948106] R13: 0000000000000001 R14: ffffa4ed80ffbc60 R15:
      0000000000000000
      [   30.948174] FS:  00007fc7520f0740(0000) GS:ffff94b7ced40000(0000)
      knlGS:0000000000000000
      [   30.948252] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   30.948308] CR2: 0000000000000000 CR3: 0000000104a40001 CR4:
      00000000003706e0
      [   30.948376] Call Trace:
      [   30.948404]  <TASK>
      [   30.948431]  ceph_security_init_secctx+0x7b/0x240 [ceph
      49f9c4b9bf5be8760f19f1747e26da33920bce4b]
      [   30.948582]  ceph_atomic_open+0x51e/0x8a0 [ceph
      49f9c4b9bf5be8760f19f1747e26da33920bce4b]
      [   30.948708]  ? get_cached_acl+0x4d/0xa0
      [   30.948759]  path_openat+0x60d/0x1030
      [   30.948809]  do_filp_open+0xa5/0x150
      [   30.948859]  do_sys_openat2+0xc4/0x190
      [   30.948904]  __x64_sys_openat+0x53/0xa0
      [   30.948948]  do_syscall_64+0x5c/0x90
      [   30.948989]  ? exc_page_fault+0x72/0x180
      [   30.949034]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [   30.949091] RIP: 0033:0x7fc7521e25bb
      [   30.950849] Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18 00
      00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 0
      0 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 91 00 00 00 48 8b 54 24 28 64 48 2b 14
      25
      
      The core of the problem is that ceph checks the return code from
      security_dentry_init_security() and, if it is 0, assumes everything
      is fine and goes on to call strlen(name), which crashes.
      
      Typically the SELinux LSM returns 0 and sets name to "security.selinux",
      which is not a problem. If SELinux is not compiled in or is disabled, the
      hook returns -EOPNOTSUPP and ceph deals with it.
      
      But somehow in this configuration, 0 is being returned and "name" is
      not being initialized and that's creating the problem.
      
      Our suspicion is that BPF LSM is registering a hook for
      dentry_init_security() and returns hook default of 0.
      
      LSM_HOOK(int, 0, dentry_init_security, struct dentry *dentry,...)
      
      I have not been able to reproduce it just by doing CONFIG_BPF_LSM=y.
      Stephen has tested the patch though and confirms it solves the problem
      for him.
      
      dentry_init_security() is written in such a way that it expects only one
      LSM to register the hook. At least that's the expectation with current code.
      
      If another LSM registers a hook and returns the default, the call will
      simply return 0 as of now, and that will break ceph.
      
      Hence the suggestion is to change the semantics of this hook a bit. If there
      are no LSMs, or no LSM takes ownership and initializes the security context,
      then return -EOPNOTSUPP. Also allow at most one LSM to initialize the
      security context. This hook can't deal with multiple LSMs trying to init
      the security context. This patch implements this new behavior.
      Reported-by: Stephen Muth <smuth4@gmail.com>
      Tested-by: Stephen Muth <smuth4@gmail.com>
      Suggested-by: Casey Schaufler <casey@schaufler-ca.com>
      Acked-by: Casey Schaufler <casey@schaufler-ca.com>
      Reviewed-by: Serge Hallyn <serge@hallyn.com>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: <stable@vger.kernel.org> # 5.16.0
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Acked-by: Paul Moore <paul@paul-moore.com>
      Acked-by: Christian Brauner <brauner@kernel.org>
      Signed-off-by: James Morris <jmorris@namei.org>
      7f5056b9
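
The hook semantics proposed in the commit above can be modelled in userspace roughly as follows. This is a minimal sketch, not the kernel patch itself: the type `init_secctx_hook`, the two example hooks, and `call_init_security()` are hypothetical names introduced for illustration, and the model assumes the hook default becomes -EOPNOTSUPP so that "no LSM handled this" is distinguishable from success.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace model of an LSM hook; not the kernel's types. */
typedef int (*init_secctx_hook)(const char **name);

/* Assumed hook default: "this LSM did not take ownership". */
#define HOOK_DEFAULT (-EOPNOTSUPP)

/* Models an LSM such as SELinux: takes ownership and sets the xattr name. */
static int selinux_like_hook(const char **name)
{
	*name = "security.selinux";
	return 0;
}

/* Models an LSM that registers the hook but only returns the default. */
static int passive_hook(const char **name)
{
	(void)name;
	return HOOK_DEFAULT;
}

/*
 * Walk the registered hooks and stop at the first one that returns
 * something other than the default. If none does, report -EOPNOTSUPP
 * so the caller never sees 0 with an uninitialized name.
 */
static int call_init_security(const init_secctx_hook *hooks, size_t n,
			      const char **name)
{
	assert(hooks != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		int rc = hooks[i](name);

		if (rc != HOOK_DEFAULT)
			return rc;	/* at most one LSM initializes the context */
	}
	return HOOK_DEFAULT;
}
```

Under this model a caller such as ceph can rely on a return value of 0 meaning that `name` was set, which is the assumption that led to the strlen() crash when 0 was returned without any LSM taking ownership.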
  16. 28 Jan, 2022 1 commit
  17. 27 Jan, 2022 2 commits
    • L
      pid: Introduce helper task_is_in_init_pid_ns() · d7e4f854
      Leo Yan committed
      Currently the kernel open-codes, in multiple places, the check for
      whether a task is in the root PID namespace, in this form:
      
        if (task_active_pid_ns(current) == &init_pid_ns)
            do_something();
      
      This patch creates a new helper function, task_is_in_init_pid_ns(),
      which returns true if the passed task is in the root PID namespace and
      false otherwise. It will be used to replace the open-coded checks.
      Suggested-by: Suzuki K Poulose <suzuki.poulose@arm.com>
      Signed-off-by: Leo Yan <leo.yan@linaro.org>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Acked-by: Suzuki K Poulose <suzuki.poulose@arm.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      d7e4f854
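
The helper described in the commit above is a thin wrapper around the open-coded comparison. The sketch below models it in userspace: `struct pid_namespace`, `struct task_struct`, `init_pid_ns`, and `task_active_pid_ns()` are simplified stand-ins written for this example (the field names are assumptions), not the kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; fields are assumptions. */
struct pid_namespace {
	int level;
};

struct task_struct {
	struct pid_namespace *active_ns;
};

/* Stand-in for the root PID namespace object. */
static struct pid_namespace init_pid_ns = { .level = 0 };

/* Stand-in for the kernel accessor of the same name. */
static struct pid_namespace *task_active_pid_ns(const struct task_struct *tsk)
{
	assert(tsk != NULL);
	return tsk->active_ns;
}

/* The helper: true only if the task's active namespace is the root one. */
static bool task_is_in_init_pid_ns(const struct task_struct *tsk)
{
	return task_active_pid_ns(tsk) == &init_pid_ns;
}
```

Call sites then read `if (task_is_in_init_pid_ns(current))` instead of repeating the pointer comparison.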
    • D
      xfs, iomap: limit individual ioend chain lengths in writeback · ebb7fb15
      Dave Chinner committed
      Trond Myklebust reported soft lockups in XFS IO completion such as
      this:
      
       watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [kworker/12:1:3106]
       CPU: 12 PID: 3106 Comm: kworker/12:1 Not tainted 4.18.0-305.10.2.el8_4.x86_64 #1
       Workqueue: xfs-conv/md127 xfs_end_io [xfs]
       RIP: 0010:_raw_spin_unlock_irqrestore+0x11/0x20
       Call Trace:
        wake_up_page_bit+0x8a/0x110
        iomap_finish_ioend+0xd7/0x1c0
        iomap_finish_ioends+0x7f/0xb0
        xfs_end_ioend+0x6b/0x100 [xfs]
        xfs_end_io+0xb9/0xe0 [xfs]
        process_one_work+0x1a7/0x360
        worker_thread+0x1fa/0x390
        kthread+0x116/0x130
        ret_from_fork+0x35/0x40
      
      Ioends are processed as an atomic completion unit when all the
      chained bios in the ioend have completed their IO. Logically
      contiguous ioends can also be merged and completed as a single,
      larger unit.  Both of these things can be problematic because the
      bio chain length per ioend and the size of the merged ioends processed
      as a single completion are unbounded.
      
      If we have a large sequential dirty region in the page cache,
      write_cache_pages() will keep feeding us sequential pages and we
      will keep mapping them into ioends and bios until we get a dirty
      page at a non-sequential file offset. These large sequential runs
      will result in bio and ioend chaining to optimise the IO
      patterns. The pages under writeback are pinned within these chains
      until the submission chaining is broken, allowing the entire chain
      to be completed. This can result in huge chains being processed
      in IO completion context.
      
      We get deep bio chaining if we have large contiguous physical
      extents. We will keep adding pages to the current bio until it is
      full, then we'll chain a new bio to keep adding pages for writeback.
      Hence we can build bio chains that map millions of pages and tens of
      gigabytes of RAM if the page cache contains big enough contiguous
      dirty file regions. This long bio chain pins those pages until the
      final bio in the chain completes and the ioend can iterate all the
      chained bios and complete them.
      
      OTOH, if we have a physically fragmented file, we end up submitting
      one ioend per physical fragment, each of which has a small bio or bio
      chain attached to it. We do not chain these at IO submission time,
      but instead we chain them at completion time based on file
      offset via iomap_ioend_try_merge(). Hence we can end up with unbounded
      ioend chains being built via completion merging.
      
      XFS can then do COW remapping or unwritten extent conversion on that
      merged chain, which involves walking an extent fragment at a time
      and running a transaction to modify the physical extent information.
      IOWs, we merge all the discontiguous ioends together into a
      contiguous file range, only to then process them individually as
      discontiguous extents.
      
      This extent manipulation is computationally expensive and can run in
      a tight loop, so merging logically contiguous but physically
      discontiguous ioends gains us nothing except for hiding the fact
      that we broke the ioends up into individual physical extents at
      submission and then need to loop over those individual physical
      extents at completion.
      
      Hence we need to have mechanisms to limit ioend sizes and
      to break up completion processing of large merged ioend chains:
      
      1. bio chains per ioend need to be bounded in length. Pure overwrites
      go straight to iomap_finish_ioend() in softirq context with the
      exact bio chain attached to the ioend by submission. Hence the only
      way to prevent long holdoffs here is to bound ioend submission
      sizes because we can't reschedule in softirq context.
      
      2. iomap_finish_ioends() has to handle unbounded merged ioend chains
      correctly. This relies on any one call to iomap_finish_ioend() being
      bounded in runtime so that cond_resched() can be issued regularly as
      the long ioend chain is processed. i.e. this relies on mechanism #1
      to limit individual ioend sizes to work correctly.
      
      3. filesystems have to loop over the merged ioends to process
      physical extent manipulations. This means they can loop internally,
      and so we break merging at physical extent boundaries so the
      filesystem can easily insert reschedule points between individual
      extent manipulations.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reported-and-tested-by: Trond Myklebust <trondmy@hammerspace.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      ebb7fb15
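
The three mechanisms listed in the commit above can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the iomap code: `struct ioend` here is a simplified stand-in, the function names are hypothetical, and the limits `MAX_PAGES_PER_IOEND` and `RESCHED_PAGES` are illustrative values chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative limits only; the real values are a tuning decision. */
#define MAX_PAGES_PER_IOEND 4096u
#define RESCHED_PAGES (MAX_PAGES_PER_IOEND * 8u)
#define SECTOR_SHIFT 9

/* Simplified stand-in for an ioend. */
struct ioend {
	uint64_t offset;	/* logical file offset, bytes */
	uint64_t size;		/* length, bytes */
	uint64_t sector;	/* first 512-byte sector on disk */
	uint32_t pages;		/* pages attached so far */
};

/* Mechanism 1: bound the size of an ioend at submission time. */
static bool ioend_can_add_page(const struct ioend *io)
{
	assert(io != NULL);
	return io->pages < MAX_PAGES_PER_IOEND;
}

/* Mechanism 3: merge only if logically AND physically contiguous. */
static bool ioend_can_merge(const struct ioend *io, const struct ioend *next)
{
	assert(io != NULL && next != NULL);
	if (io->offset + io->size != next->offset)
		return false;	/* gap in file offset */
	if (io->sector + (io->size >> SECTOR_SHIFT) != next->sector)
		return false;	/* physical extent boundary */
	return true;
}

/*
 * Mechanism 2: walk a chain of bounded ioends and take a reschedule
 * point whenever enough pages have been completed. Returns how many
 * reschedule points were taken; kernel code would yield the CPU there.
 */
static unsigned int finish_ioends(const struct ioend *ios, size_t n)
{
	unsigned int yields = 0;
	uint64_t done = 0;

	assert(ios != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		done += ios[i].pages;	/* stand-in for completing the ioend */
		if (done >= RESCHED_PAGES) {
			yields++;
			done = 0;
		}
	}
	return yields;
}
```

Because each ioend is capped in size, the work between two reschedule points in `finish_ioends()` is bounded too, which is the dependency of mechanism 2 on mechanism 1 that the commit message points out.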
  18. 26 Jan, 2022 2 commits
  19. 24 Jan, 2022 2 commits
  20. 22 Jan, 2022 2 commits