1. 17 11月, 2020 2 次提交
    • J
      sched/deadline: Fix priority inheritance with multiple scheduling classes · 2279f540
      Juri Lelli 提交于
      Glenn reported that "an application [he developed produces] a BUG in
      deadline.c when a SCHED_DEADLINE task contends with CFS tasks on nested
      PTHREAD_PRIO_INHERIT mutexes.  I believe the bug is triggered when a CFS
      task that was boosted by a SCHED_DEADLINE task boosts another CFS task
      (nested priority inheritance).
      
       ------------[ cut here ]------------
       kernel BUG at kernel/sched/deadline.c:1462!
       invalid opcode: 0000 [#1] PREEMPT SMP
       CPU: 12 PID: 19171 Comm: dl_boost_bug Tainted: ...
       Hardware name: ...
       RIP: 0010:enqueue_task_dl+0x335/0x910
       Code: ...
       RSP: 0018:ffffc9000c2bbc68 EFLAGS: 00010002
       RAX: 0000000000000009 RBX: ffff888c0af94c00 RCX: ffffffff81e12500
       RDX: 000000000000002e RSI: ffff888c0af94c00 RDI: ffff888c10b22600
       RBP: ffffc9000c2bbd08 R08: 0000000000000009 R09: 0000000000000078
       R10: ffffffff81e12440 R11: ffffffff81e1236c R12: ffff888bc8932600
       R13: ffff888c0af94eb8 R14: ffff888c10b22600 R15: ffff888bc8932600
       FS:  00007fa58ac55700(0000) GS:ffff888c10b00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007fa58b523230 CR3: 0000000bf44ab003 CR4: 00000000007606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       PKRU: 55555554
       Call Trace:
        ? intel_pstate_update_util_hwp+0x13/0x170
        rt_mutex_setprio+0x1cc/0x4b0
        task_blocks_on_rt_mutex+0x225/0x260
        rt_spin_lock_slowlock_locked+0xab/0x2d0
        rt_spin_lock_slowlock+0x50/0x80
        hrtimer_grab_expiry_lock+0x20/0x30
        hrtimer_cancel+0x13/0x30
        do_nanosleep+0xa0/0x150
        hrtimer_nanosleep+0xe1/0x230
        ? __hrtimer_init_sleeper+0x60/0x60
        __x64_sys_nanosleep+0x8d/0xa0
        do_syscall_64+0x4a/0x100
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
       RIP: 0033:0x7fa58b52330d
       ...
       ---[ end trace 0000000000000002 ]—
      
      He also provided a simple reproducer creating the situation below:
      
       So the execution order of locking steps are the following
       (N1 and N2 are non-deadline tasks. D1 is a deadline task. M1 and M2
       are mutexes that are enabled * with priority inheritance.)
      
       Time moves forward as this timeline goes down:
      
       N1              N2               D1
       |               |                |
       |               |                |
       Lock(M1)        |                |
       |               |                |
       |             Lock(M2)           |
       |               |                |
       |               |              Lock(M2)
       |               |                |
       |             Lock(M1)           |
       |             (!!bug triggered!) |
      
      Daniel reported a similar situation as well, by just letting ksoftirqd
      run with DEADLINE (and eventually block on a mutex).
      
      Problem is that boosted entities (Priority Inheritance) use static
      DEADLINE parameters of the top priority waiter. However, there might be
      cases where top waiter could be a non-DEADLINE entity that is currently
      boosted by a DEADLINE entity from a different lock chain (i.e., nested
      priority chains involving entities of non-DEADLINE classes). In this
      case, top waiter static DEADLINE parameters could be null (initialized
      to 0 at fork()) and replenish_dl_entity() would hit a BUG().
      
      Fix this by keeping track of the original donor and using its parameters
      when a task is boosted.
      Reported-by: NGlenn Elliott <glenn@aurora.tech>
      Reported-by: NDaniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: NJuri Lelli <juri.lelli@redhat.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: NDaniel Bristot de Oliveira <bristot@redhat.com>
      Link: https://lkml.kernel.org/r/20201117061432.517340-1-juri.lelli@redhat.com
      2279f540
    • P
      sched: Fix data-race in wakeup · f97bb527
      Peter Zijlstra 提交于
      Mel reported that on some ARM64 platforms loadavg goes bananas and
      Will tracked it down to the following race:
      
        CPU0					CPU1
      
        schedule()
          prev->sched_contributes_to_load = X;
          deactivate_task(prev);
      
      					try_to_wake_up()
      					  if (p->on_rq &&) // false
      					  if (smp_load_acquire(&p->on_cpu) && // true
      					      ttwu_queue_wakelist())
      					        p->sched_remote_wakeup = Y;
      
          smp_store_release(prev->on_cpu, 0);
      
      where both p->sched_contributes_to_load and p->sched_remote_wakeup are
      in the same word, and thus the stores X and Y race (and can clobber
      one another's data).
      
      Whereas prior to commit c6e7bd7a ("sched/core: Optimize ttwu()
      spinning on p->on_cpu") the p->on_cpu handoff serialized access to
      p->sched_remote_wakeup (just as it still does with
      p->sched_contributes_to_load) that commit broke that by calling
      ttwu_queue_wakelist() with p->on_cpu != 0.
      
      However, due to
      
        p->XXX = X			ttwu()
        schedule()			  if (p->on_rq && ...) // false
          smp_mb__after_spinlock()	  if (smp_load_acquire(&p->on_cpu) &&
          deactivate_task()		      ttwu_queue_wakelist())
            p->on_rq = 0;		        p->sched_remote_wakeup = Y;
      
      We can be sure any 'current' store is complete and 'current' is
      guaranteed asleep. Therefore we can move p->sched_remote_wakeup into
      the current flags word.
      
      Note: while the observed failure was loadavg accounting gone wrong due
      to ttwu() cobbering p->sched_contributes_to_load, the reverse problem
      is also possible where schedule() clobbers p->sched_remote_wakeup,
      this could result in enqueue_entity() wrecking ->vruntime and causing
      scheduling artifacts.
      
      Fixes: c6e7bd7a ("sched/core: Optimize ttwu() spinning on p->on_cpu")
      Reported-by: NMel Gorman <mgorman@techsingularity.net>
      Debugged-by: NWill Deacon <will@kernel.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20201117083016.GK3121392@hirez.programming.kicks-ass.net
      f97bb527
  2. 12 10月, 2020 2 次提交
  3. 10 10月, 2020 1 次提交
    • M
      genirq/irqdomain: Allow partial trimming of irq_data hierarchy · 55567976
      Marc Zyngier 提交于
      It appears that some HW is ugly enough that not all the interrupts
      connected to a particular interrupt controller end up with the same
      hierarchy depth (some of them are terminated early). This leaves
      the irqchip hacker with only two choices, both equally bad:
      
      - create discrete domain chains, one for each "hierarchy depth",
        which is very hard to maintain
      
      - create fake hierarchy levels for the shallow paths, leading
        to all kind of problems (what are the safe hwirq values for these
        fake levels?)
      
      Implement the ability to cut short a single interrupt hierarchy
      from a level marked as being disconnected by using the new
      irq_domain_disconnect_hierarchy() helper.
      
      The irqdomain allocation code will then perform the trimming
      Signed-off-by: NMarc Zyngier <maz@kernel.org>
      55567976
  4. 09 10月, 2020 3 次提交
  5. 08 10月, 2020 1 次提交
  6. 07 10月, 2020 1 次提交
    • T
      x86/mce: Recover from poison found while copying from user space · c0ab7ffc
      Tony Luck 提交于
      Existing kernel code can only recover from a machine check on code that
      is tagged in the exception table with a fault handling recovery path.
      
      Add two new fields in the task structure to pass information from
      machine check handler to the "task_work" that is queued to run before
      the task returns to user mode:
      
      + mce_vaddr: will be initialized to the user virtual address of the fault
        in the case where the fault occurred in the kernel copying data from
        a user address.  This is so that kill_me_maybe() can provide that
        information to the user SIGBUS handler.
      
      + mce_kflags: copy of the struct mce.kflags needed by kill_me_maybe()
        to determine if mce_vaddr is applicable to this error.
      
      Add code to recover from a machine check while copying data from user
      space to the kernel. Action for this case is the same as if the user
      touched the poison directly; unmap the page and send a SIGBUS to the task.
      
      Use a new helper function to share common code between the "fault
      in user mode" case and the "fault while copying from user" case.
      
      New code paths will be activated by the next patch which sets
      MCE_IN_KERNEL_COPYIN.
      Suggested-by: NBorislav Petkov <bp@alien8.de>
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Link: https://lkml.kernel.org/r/20201006210910.21062-6-tony.luck@intel.com
      c0ab7ffc
  7. 06 10月, 2020 2 次提交
  8. 03 10月, 2020 7 次提交
  9. 02 10月, 2020 1 次提交
    • L
      pipe: remove pipe_wait() and fix wakeup race with splice · 472e5b05
      Linus Torvalds 提交于
      The pipe splice code still used the old model of waiting for pipe IO by
      using a non-specific "pipe_wait()" that waited for any pipe event to
      happen, which depended on all pipe IO being entirely serialized by the
      pipe lock.  So by checking the state you were waiting for, and then
      adding yourself to the wait queue before dropping the lock, you were
      guaranteed to see all the wakeups.
      
      Strictly speaking, the actual wakeups were not done under the lock, but
      the pipe_wait() model still worked, because since the waiter held the
      lock when checking whether it should sleep, it would always see the
      current state, and the wakeup was always done after updating the state.
      
      However, commit 0ddad21d ("pipe: use exclusive waits when reading or
      writing") split the single wait-queue into two, and in the process also
      made the "wait for event" code wait for _two_ wait queues, and that then
      showed a race with the wakers that were not serialized by the pipe lock.
      
      It's only splice that used that "pipe_wait()" model, so the problem
      wasn't obvious, but Josef Bacik reports:
      
       "I hit a hang with fstest btrfs/187, which does a btrfs send into
        /dev/null. This works by creating a pipe, the write side is given to
        the kernel to write into, and the read side is handed to a thread that
        splices into a file, in this case /dev/null.
      
        The box that was hung had the write side stuck here [pipe_write] and
        the read side stuck here [splice_from_pipe_next -> pipe_wait].
      
        [ more details about pipe_wait() scenario ]
      
        The problem is we're doing the prepare_to_wait, which sets our state
        each time, however we can be woken up either with reads or writes. In
        the case above we race with the WRITER waking us up, and re-set our
        state to INTERRUPTIBLE, and thus never break out of schedule"
      
      Josef had a patch that avoided the issue in pipe_wait() by just making
      it set the state only once, but the deeper problem is that pipe_wait()
      depends on a level of synchonization by the pipe mutex that it really
      shouldn't.  And the whole "wait for any pipe state change" model really
      isn't very good to begin with.
      
      So rather than trying to work around things in pipe_wait(), remove that
      legacy model of "wait for arbitrary pipe event" entirely, and actually
      create functions that wait for the pipe actually being readable or
      writable, and can do so without depending on the pipe lock serializing
      everything.
      
      Fixes: 0ddad21d ("pipe: use exclusive waits when reading or writing")
      Link: https://lore.kernel.org/linux-fsdevel/bfa88b5ad6f069b2b679316b9e495a970130416c.1601567868.git.josef@toxicpanda.com/Reported-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-and-tested-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      472e5b05
  10. 01 10月, 2020 3 次提交
  11. 30 9月, 2020 4 次提交
    • P
      Partially revert "video: fbdev: amba-clcd: Retire elder CLCD driver" · 112c3523
      Peter Collingbourne 提交于
      Also partially revert the follow-up change "drm: pl111: Absorb the
      external register header".
      
      This reverts the parts of commits
      7e4e589d and
      0fb81256 that touch paths outside
      of drivers/gpu/drm/pl111.
      
      The fbdev driver is used by Android's FVP configuration. Using the
      DRM driver together with DRM's fbdev emulation results in a failure
      to boot Android. The root cause is that Android's generic fbdev
      userspace driver relies on the ability to set the pixel format via
      FBIOPUT_VSCREENINFO, which is not supported by fbdev emulation.
      
      There have been other less critical behavioral differences identified
      between the fbdev driver and the DRM driver with fbdev emulation. The
      DRM driver exposes different values for the panel's width, height and
      refresh rate, and the DRM driver fails a FBIOPUT_VSCREENINFO syscall
      with yres_virtual greater than the maximum supported value instead
      of letting the syscall succeed and setting yres_virtual based on yres.
      Signed-off-by: NPeter Collingbourne <pcc@google.com>
      Signed-off-by: NDaniel Vetter <daniel.vetter@ffwll.ch>
      Link: https://patchwork.freedesktop.org/patch/msgid/20200929195344.2219796-1-pcc@google.com
      112c3523
    • A
      efi: efivars: un-export efivars_sysfs_init() · 5d3c8617
      Ard Biesheuvel 提交于
      efivars_sysfs_init() is only used locally in the source file that
      defines it, so make it static and unexport it.
      Signed-off-by: NArd Biesheuvel <ardb@kernel.org>
      5d3c8617
    • A
      efi: pstore: move workqueue handling out of efivars · c9b51a2d
      Ard Biesheuvel 提交于
      The worker thread that gets kicked off to sync the state of the
      EFI variable list is only used by the EFI pstore implementation,
      and is defined in its source file. So let's move its scheduling
      there as well. Since our efivar_init() scan will bail on duplicate
      entries, there is no need to disable the workqueue like we did
      before, so we can run it unconditionally.
      Signed-off-by: NArd Biesheuvel <ardb@kernel.org>
      c9b51a2d
    • A
      efi: pstore: disentangle from deprecated efivars module · 232f4eb6
      Ard Biesheuvel 提交于
      The EFI pstore implementation relies on the 'efivars' abstraction,
      which encapsulates the EFI variable store in a way that can be
      overridden by other backing stores, like the Google SMI one.
      
      On top of that, the EFI pstore implementation also relies on the
      efivars.ko module, which is a separate layer built on top of the
      'efivars' abstraction that exposes the [deprecated] sysfs entries
      for each variable that exists in the backing store.
      
      Since the efivars.ko module is deprecated, and all users appear to
      have moved to the efivarfs file system instead, let's prepare for
      its removal, by removing EFI pstore's dependency on it.
      Signed-off-by: NArd Biesheuvel <ardb@kernel.org>
      232f4eb6
  12. 29 9月, 2020 2 次提交
    • T
      net: core: add nested_level variable in net_device · 1fc70edb
      Taehee Yoo 提交于
      This patch is to add a new variable 'nested_level' into the net_device
      structure.
      This variable will be used as a parameter of spin_lock_nested() of
      dev->addr_list_lock.
      
      netif_addr_lock() can be called recursively so spin_lock_nested() is
      used instead of spin_lock() and dev->lower_level is used as a parameter
      of spin_lock_nested().
      But, dev->lower_level value can be updated while it is being used.
      So, lockdep would warn a possible deadlock scenario.
      
      When a stacked interface is deleted, netif_{uc | mc}_sync() is
      called recursively.
      So, spin_lock_nested() is called recursively too.
      At this moment, the dev->lower_level variable is used as a parameter of it.
      dev->lower_level value is updated when interfaces are being unlinked/linked
      immediately.
      Thus, After unlinking, dev->lower_level shouldn't be a parameter of
      spin_lock_nested().
      
          A (macvlan)
          |
          B (vlan)
          |
          C (bridge)
          |
          D (macvlan)
          |
          E (vlan)
          |
          F (bridge)
      
          A->lower_level : 6
          B->lower_level : 5
          C->lower_level : 4
          D->lower_level : 3
          E->lower_level : 2
          F->lower_level : 1
      
      When an interface 'A' is removed, it releases resources.
      At this moment, netif_addr_lock() would be called.
      Then, netdev_upper_dev_unlink() is called recursively.
      Then dev->lower_level is updated.
      There is no problem.
      
      But, when the bridge module is removed, 'C' and 'F' interfaces
      are removed at once.
      If 'F' is removed first, a lower_level value is like below.
          A->lower_level : 5
          B->lower_level : 4
          C->lower_level : 3
          D->lower_level : 2
          E->lower_level : 1
          F->lower_level : 1
      
      Then, 'C' is removed. at this moment, netif_addr_lock() is called
      recursively.
      The ordering is like this.
      C(3)->D(2)->E(1)->F(1)
      At this moment, the lower_level value of 'E' and 'F' are the same.
      So, lockdep warns a possible deadlock scenario.
      
      In order to avoid this problem, a new variable 'nested_level' is added.
      This value is the same as dev->lower_level - 1.
      But this value is updated in rtnl_unlock().
      So, this variable can be used as a parameter of spin_lock_nested() safely
      in the rtnl context.
      
      Test commands:
         ip link add br0 type bridge vlan_filtering 1
         ip link add vlan1 link br0 type vlan id 10
         ip link add macvlan2 link vlan1 type macvlan
         ip link add br3 type bridge vlan_filtering 1
         ip link set macvlan2 master br3
         ip link add vlan4 link br3 type vlan id 10
         ip link add macvlan5 link vlan4 type macvlan
         ip link add br6 type bridge vlan_filtering 1
         ip link set macvlan5 master br6
         ip link add vlan7 link br6 type vlan id 10
         ip link add macvlan8 link vlan7 type macvlan
      
         ip link set br0 up
         ip link set vlan1 up
         ip link set macvlan2 up
         ip link set br3 up
         ip link set vlan4 up
         ip link set macvlan5 up
         ip link set br6 up
         ip link set vlan7 up
         ip link set macvlan8 up
         modprobe -rv bridge
      
      Splat looks like:
      [   36.057436][  T744] WARNING: possible recursive locking detected
      [   36.058848][  T744] 5.9.0-rc6+ #728 Not tainted
      [   36.059959][  T744] --------------------------------------------
      [   36.061391][  T744] ip/744 is trying to acquire lock:
      [   36.062590][  T744] ffff8c4767509280 (&vlan_netdev_addr_lock_key){+...}-{2:2}, at: dev_set_rx_mode+0x19/0x30
      [   36.064922][  T744]
      [   36.064922][  T744] but task is already holding lock:
      [   36.066626][  T744] ffff8c4767769280 (&vlan_netdev_addr_lock_key){+...}-{2:2}, at: dev_uc_add+0x1e/0x60
      [   36.068851][  T744]
      [   36.068851][  T744] other info that might help us debug this:
      [   36.070731][  T744]  Possible unsafe locking scenario:
      [   36.070731][  T744]
      [   36.072497][  T744]        CPU0
      [   36.073238][  T744]        ----
      [   36.074007][  T744]   lock(&vlan_netdev_addr_lock_key);
      [   36.075290][  T744]   lock(&vlan_netdev_addr_lock_key);
      [   36.076590][  T744]
      [   36.076590][  T744]  *** DEADLOCK ***
      [   36.076590][  T744]
      [   36.078515][  T744]  May be due to missing lock nesting notation
      [   36.078515][  T744]
      [   36.080491][  T744] 3 locks held by ip/744:
      [   36.081471][  T744]  #0: ffffffff98571df0 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x236/0x490
      [   36.083614][  T744]  #1: ffff8c4767769280 (&vlan_netdev_addr_lock_key){+...}-{2:2}, at: dev_uc_add+0x1e/0x60
      [   36.085942][  T744]  #2: ffff8c476c8da280 (&bridge_netdev_addr_lock_key/4){+...}-{2:2}, at: dev_uc_sync+0x39/0x80
      [   36.088400][  T744]
      [   36.088400][  T744] stack backtrace:
      [   36.089772][  T744] CPU: 6 PID: 744 Comm: ip Not tainted 5.9.0-rc6+ #728
      [   36.091364][  T744] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      [   36.093630][  T744] Call Trace:
      [   36.094416][  T744]  dump_stack+0x77/0x9b
      [   36.095385][  T744]  __lock_acquire+0xbc3/0x1f40
      [   36.096522][  T744]  lock_acquire+0xb4/0x3b0
      [   36.097540][  T744]  ? dev_set_rx_mode+0x19/0x30
      [   36.098657][  T744]  ? rtmsg_ifinfo+0x1f/0x30
      [   36.099711][  T744]  ? __dev_notify_flags+0xa5/0xf0
      [   36.100874][  T744]  ? rtnl_is_locked+0x11/0x20
      [   36.101967][  T744]  ? __dev_set_promiscuity+0x7b/0x1a0
      [   36.103230][  T744]  _raw_spin_lock_bh+0x38/0x70
      [   36.104348][  T744]  ? dev_set_rx_mode+0x19/0x30
      [   36.105461][  T744]  dev_set_rx_mode+0x19/0x30
      [   36.106532][  T744]  dev_set_promiscuity+0x36/0x50
      [   36.107692][  T744]  __dev_set_promiscuity+0x123/0x1a0
      [   36.108929][  T744]  dev_set_promiscuity+0x1e/0x50
      [   36.110093][  T744]  br_port_set_promisc+0x1f/0x40 [bridge]
      [   36.111415][  T744]  br_manage_promisc+0x8b/0xe0 [bridge]
      [   36.112728][  T744]  __dev_set_promiscuity+0x123/0x1a0
      [   36.113967][  T744]  ? __hw_addr_sync_one+0x23/0x50
      [   36.115135][  T744]  __dev_set_rx_mode+0x68/0x90
      [   36.116249][  T744]  dev_uc_sync+0x70/0x80
      [   36.117244][  T744]  dev_uc_add+0x50/0x60
      [   36.118223][  T744]  macvlan_open+0x18e/0x1f0 [macvlan]
      [   36.119470][  T744]  __dev_open+0xd6/0x170
      [   36.120470][  T744]  __dev_change_flags+0x181/0x1d0
      [   36.121644][  T744]  dev_change_flags+0x23/0x60
      [   36.122741][  T744]  do_setlink+0x30a/0x11e0
      [   36.123778][  T744]  ? __lock_acquire+0x92c/0x1f40
      [   36.124929][  T744]  ? __nla_validate_parse.part.6+0x45/0x8e0
      [   36.126309][  T744]  ? __lock_acquire+0x92c/0x1f40
      [   36.127457][  T744]  __rtnl_newlink+0x546/0x8e0
      [   36.128560][  T744]  ? lock_acquire+0xb4/0x3b0
      [   36.129623][  T744]  ? deactivate_slab.isra.85+0x6a1/0x850
      [   36.130946][  T744]  ? __lock_acquire+0x92c/0x1f40
      [   36.132102][  T744]  ? lock_acquire+0xb4/0x3b0
      [   36.133176][  T744]  ? is_bpf_text_address+0x5/0xe0
      [   36.134364][  T744]  ? rtnl_newlink+0x2e/0x70
      [   36.135445][  T744]  ? rcu_read_lock_sched_held+0x32/0x60
      [   36.136771][  T744]  ? kmem_cache_alloc_trace+0x2d8/0x380
      [   36.138070][  T744]  ? rtnl_newlink+0x2e/0x70
      [   36.139164][  T744]  rtnl_newlink+0x47/0x70
      [ ... ]
      
      Fixes: 845e0ebb ("net: change addr_list_lock back to static key")
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1fc70edb
    • T
      net: core: introduce struct netdev_nested_priv for nested interface infrastructure · eff74233
      Taehee Yoo 提交于
      Functions related to nested interface infrastructure such as
      netdev_walk_all_{ upper | lower }_dev() pass both private functions
      and "data" pointer to handle their own things.
      At this point, the data pointer type is void *.
      In order to make it easier to expand common variables and functions,
      this new netdev_nested_priv structure is added.
      
      In the following patch, a new member variable will be added into this
      struct to fix the lockdep issue.
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eff74233
  13. 28 9月, 2020 4 次提交
  14. 27 9月, 2020 3 次提交
    • L
      mm: don't rely on system state to detect hot-plug operations · f85086f9
      Laurent Dufour 提交于
      In register_mem_sect_under_node() the system_state's value is checked to
      detect whether the call is made during boot time or during an hot-plug
      operation.  Unfortunately, that check against SYSTEM_BOOTING is wrong
      because regular memory is registered at SYSTEM_SCHEDULING state.  In
      addition, memory hot-plug operation can be triggered at this system
      state by the ACPI [1].  So checking against the system state is not
      enough.
      
      The consequence is that on system with interleaved node's ranges like this:
      
       Early memory node ranges
         node   1: [mem 0x0000000000000000-0x000000011fffffff]
         node   2: [mem 0x0000000120000000-0x000000014fffffff]
         node   1: [mem 0x0000000150000000-0x00000001ffffffff]
         node   0: [mem 0x0000000200000000-0x000000048fffffff]
         node   2: [mem 0x0000000490000000-0x00000007ffffffff]
      
      This can be seen on PowerPC LPAR after multiple memory hot-plug and
      hot-unplug operations are done.  At the next reboot the node's memory
      ranges can be interleaved and since the call to link_mem_sections() is
      made in topology_init() while the system is in the SYSTEM_SCHEDULING
      state, the node's id is not checked, and the sections registered to
      multiple nodes:
      
        $ ls -l /sys/devices/system/memory/memory21/node*
        total 0
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node1 -> ../../node/node1
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node2 -> ../../node/node2
      
      In that case, the system is able to boot but if later one of theses
      memory blocks is hot-unplugged and then hot-plugged, the sysfs
      inconsistency is detected and this is triggering a BUG_ON():
      
        kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
        Oops: Exception in kernel mode, sig: 5 [#1]
        LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
        CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
        Call Trace:
          add_memory_resource+0x23c/0x340 (unreliable)
          __add_memory+0x5c/0xf0
          dlpar_add_lmb+0x1b4/0x500
          dlpar_memory+0x1f8/0xb80
          handle_dlpar_errorlog+0xc0/0x190
          dlpar_store+0x198/0x4a0
          kobj_attr_store+0x30/0x50
          sysfs_kf_write+0x64/0x90
          kernfs_fop_write+0x1b0/0x290
          vfs_write+0xe8/0x290
          ksys_write+0xdc/0x130
          system_call_exception+0x160/0x270
          system_call_common+0xf0/0x27c
      
      This patch addresses the root cause by not relying on the system_state
      value to detect whether the call is due to a hot-plug operation.  An
      extra parameter is added to link_mem_sections() detailing whether the
      operation is due to a hot-plug operation.
      
      [1] According to Oscar Salvador, using this qemu command line, ACPI
      memory hotplug operations are raised at SYSTEM_SCHEDULING state:
      
        $QEMU -enable-kvm -machine pc -smp 4,sockets=4,cores=1,threads=1 -cpu host -monitor pty \
              -m size=$MEM,slots=255,maxmem=4294967296k  \
              -numa node,nodeid=0,cpus=0-3,mem=512 -numa node,nodeid=1,mem=512 \
              -object memory-backend-ram,id=memdimm0,size=134217728 -device pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0 \
              -object memory-backend-ram,id=memdimm1,size=134217728 -device pc-dimm,node=0,memdev=memdimm1,id=dimm1,slot=1 \
              -object memory-backend-ram,id=memdimm2,size=134217728 -device pc-dimm,node=0,memdev=memdimm2,id=dimm2,slot=2 \
              -object memory-backend-ram,id=memdimm3,size=134217728 -device pc-dimm,node=0,memdev=memdimm3,id=dimm3,slot=3 \
              -object memory-backend-ram,id=memdimm4,size=134217728 -device pc-dimm,node=1,memdev=memdimm4,id=dimm4,slot=4 \
              -object memory-backend-ram,id=memdimm5,size=134217728 -device pc-dimm,node=1,memdev=memdimm5,id=dimm5,slot=5 \
              -object memory-backend-ram,id=memdimm6,size=134217728 -device pc-dimm,node=1,memdev=memdimm6,id=dimm6,slot=6 \
      
      Fixes: 4fbce633 ("mm/memory_hotplug.c: make register_mem_sect_under_node() a callback of walk_memory_range()")
      Signed-off-by: NLaurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200915094143.79181-3-ldufour@linux.ibm.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f85086f9
    • L
      mm: replace memmap_context by meminit_context · c1d0da83
      Laurent Dufour 提交于
      Patch series "mm: fix memory to node bad links in sysfs", v3.
      
      Sometimes, firmware may expose interleaved memory layout like this:
      
       Early memory node ranges
         node   1: [mem 0x0000000000000000-0x000000011fffffff]
         node   2: [mem 0x0000000120000000-0x000000014fffffff]
         node   1: [mem 0x0000000150000000-0x00000001ffffffff]
         node   0: [mem 0x0000000200000000-0x000000048fffffff]
         node   2: [mem 0x0000000490000000-0x00000007ffffffff]
      
      In that case, we can see memory blocks assigned to multiple nodes in
      sysfs:
      
        $ ls -l /sys/devices/system/memory/memory21
        total 0
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node1 -> ../../node/node1
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node2 -> ../../node/node2
        -rw-r--r-- 1 root root 65536 Aug 24 05:27 online
        -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device
        -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index
        drwxr-xr-x 2 root root     0 Aug 24 05:27 power
        -r--r--r-- 1 root root 65536 Aug 24 05:27 removable
        -rw-r--r-- 1 root root 65536 Aug 24 05:27 state
        lrwxrwxrwx 1 root root     0 Aug 24 05:25 subsystem -> ../../../../bus/memory
        -rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent
        -r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones
      
      The same applies in the node's directory with a memory21 link in both
      the node1 and node2's directory.
      
      This is wrong but doesn't prevent the system to run.  However when
      later, one of these memory blocks is hot-unplugged and then hot-plugged,
      the system is detecting an inconsistency in the sysfs layout and a
      BUG_ON() is raised:
      
        kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
        LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
        CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
        Call Trace:
          add_memory_resource+0x23c/0x340 (unreliable)
          __add_memory+0x5c/0xf0
          dlpar_add_lmb+0x1b4/0x500
          dlpar_memory+0x1f8/0xb80
          handle_dlpar_errorlog+0xc0/0x190
          dlpar_store+0x198/0x4a0
          kobj_attr_store+0x30/0x50
          sysfs_kf_write+0x64/0x90
          kernfs_fop_write+0x1b0/0x290
          vfs_write+0xe8/0x290
          ksys_write+0xdc/0x130
          system_call_exception+0x160/0x270
          system_call_common+0xf0/0x27c
      
      This has been seen on PowerPC LPAR.
      
      The root cause of this issue is that when node's memory is registered,
      the range used can overlap another node's range, thus the memory block
      is registered to multiple nodes in sysfs.
      
      There are two issues here:
      
       (a) The sysfs memory and node's layouts are broken due to these
           multiple links
      
       (b) The link errors in link_mem_sections() should not lead to a system
           panic.
      
      To address (a) register_mem_sect_under_node should not rely on the
      system state to detect whether the link operation is triggered by a hot
      plug operation or not.  This is addressed by the patches 1 and 2 of this
      series.
      
      Issue (b) will be addressed separately.
      
      This patch (of 2):
      
      The memmap_context enum is used to detect whether a memory operation is
      due to a hot-add operation or happening at boot time.
      
      Make it general to the hotplug operation and rename it as
      meminit_context.
      
      There is no functional change introduced by this patch
      Suggested-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NLaurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J . Wysocki" <rafael@kernel.org>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com
      Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c1d0da83
    • V
      mm/gup: fix gup_fast with dynamic page table folding · d3f7b1bb
      Vasily Gorbik 提交于
      Currently to make sure that every page table entry is read just once
      gup_fast walks perform READ_ONCE and pass pXd value down to the next
      gup_pXd_range function by value e.g.:
      
        static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                                 unsigned int flags, struct page **pages, int *nr)
        ...
                pudp = pud_offset(&p4d, addr);
      
      This function passes a reference on that local value copy to pXd_offset,
      and might get the very same pointer in return.  This happens when the
      level is folded (on most arches), and that pointer should not be
      iterated.
      
      On s390 due to the fact that each task might have different 5,4 or
      3-level address translation and hence different levels folded the logic
      is more complex and non-iteratable pointer to a local copy leads to
      severe problems.
      
      Here is an example of what happens with gup_fast on s390, for a task
      with 3-level paging, crossing a 2 GB pud boundary:
      
        // addr = 0x1007ffff000, end = 0x10080001000
        static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                                 unsigned int flags, struct page **pages, int *nr)
        {
              unsigned long next;
              pud_t *pudp;
      
              // pud_offset returns &p4d itself (a pointer to a value on stack)
              pudp = pud_offset(&p4d, addr);
              do {
                      // on second iteratation reading "random" stack value
                      pud_t pud = READ_ONCE(*pudp);
      
                      // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
                      next = pud_addr_end(addr, end);
                      ...
              } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack
      
              return 1;
        }
      
      This happens since s390 moved to common gup code with commit
      d1874a0c ("s390/mm: make the pxd_offset functions more robust") and
      commit 1a42010c ("s390/mm: convert to the generic
      get_user_pages_fast code").
      
      s390 tried to mimic static level folding by changing pXd_offset
      primitives to always calculate top level page table offset in pgd_offset
      and just return the value passed when pXd_offset has to act as folded.
      
      What is crucial for gup_fast and what has been overlooked is that
      PxD_SIZE/MASK and thus pXd_addr_end should also change correspondingly.
      And the latter is not possible with dynamic folding.
      
      To fix the issue in addition to pXd values pass original pXdp pointers
      down to gup_pXd_range functions.  And introduce pXd_offset_lockless
      helpers, which take an additional pXd entry value parameter.  This has
      already been discussed in
      
        https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
      
      Fixes: 1a42010c ("s390/mm: convert to the generic get_user_pages_fast code")
      Signed-off-by: NVasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NGerald Schaefer <gerald.schaefer@linux.ibm.com>
      Reviewed-by: NAlexander Gordeev <agordeev@linux.ibm.com>
      Reviewed-by: NJason Gunthorpe <jgg@nvidia.com>
      Reviewed-by: NMike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: NJohn Hubbard <jhubbard@nvidia.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: <stable@vger.kernel.org>	[5.2+]
      Link: https://lkml.kernel.org/r/patch.git-943f1e5dcff2.your-ad-here.call-01599856292-ext-8676@work.hoursSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d3f7b1bb
  15. 26 9月, 2020 1 次提交
  16. 25 9月, 2020 3 次提交