1. 17 Feb 2021, 1 commit
  2. 28 Jan 2021, 1 commit
  3. 14 Jan 2021, 1 commit
  4. 23 Dec 2020, 1 commit
  5. 02 Dec 2020, 1 commit
    • kernel: Implement selective syscall userspace redirection · 1446e1df
      Authored by Gabriel Krisman Bertazi
      Introduce a mechanism to quickly disable/enable syscall handling for a
      specific process and redirect to userspace via SIGSYS.  This is useful
      for processes with parts that require syscall redirection and parts that
      don't, but which need to perform this boundary crossing really fast,
      without paying the cost of a system call to reconfigure syscall handling
      on each boundary transition.  This is particularly important for Windows
      games running over Wine.
      
      The proposed interface looks like this:
      
        prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <off>, <length>, [selector])
      
      The range [<off>, <off>+<length>) is a part of the process memory map
      that is allowed to bypass the redirection code and dispatch syscalls
      directly, such that on fast paths a process doesn't need to disable the
      trap, nor does the kernel have to check the selector.  This is
      essential for returning from SIGSYS to a blocked area without
      triggering another SIGSYS from rt_sigreturn.
      
      selector is an optional pointer to a char-sized userspace memory region
      that acts as a key switch for the mechanism.  This key switch is set to
      either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF to enable or disable
      the redirection without calling into the kernel.
      
      The feature is meant to be set per-thread and it is disabled on
      fork/clone/execv.
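
      As a hedged illustration of the interface described above (a minimal
      sketch, not part of the patch; the fallback constant values are
      assumptions for older <linux/prctl.h> headers, and the selector values
      follow the PR_SYS_DISPATCH_ON/OFF convention stated in this changelog):

        #include <signal.h>
        #include <stdio.h>
        #include <sys/prctl.h>
        #include <unistd.h>

        #ifndef PR_SET_SYSCALL_USER_DISPATCH
        # define PR_SET_SYSCALL_USER_DISPATCH 59
        # define PR_SYS_DISPATCH_OFF 0
        # define PR_SYS_DISPATCH_ON  1
        #endif

        /* Key switch in plain user memory; flipping it toggles redirection
         * without entering the kernel. */
        static volatile char selector = PR_SYS_DISPATCH_OFF;

        static void sigsys_handler(int sig, siginfo_t *info, void *uc)
        {
                /* A real dispatcher would emulate the trapped syscall here. */
                (void)sig; (void)info; (void)uc;
        }

        int main(void)
        {
                struct sigaction sa = { 0 };

                sa.sa_sigaction = sigsys_handler;
                sa.sa_flags = SA_SIGINFO;
                sigaction(SIGSYS, &sa, NULL);

                /* Enable the mechanism.  In real use [<off>, <off>+<length>)
                 * covers the dispatcher/libc code that must keep issuing
                 * native syscalls (including the signal return path); here
                 * the range is left empty and only the selector is used. */
                if (prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON,
                          0, 0, &selector))
                        perror("prctl");

                selector = PR_SYS_DISPATCH_ON;   /* syscalls now raise SIGSYS */
                /* ... code whose syscalls should be intercepted ... */
                selector = PR_SYS_DISPATCH_OFF;  /* back to native syscalls */

                write(1, "done\n", 5);
                return 0;
        }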
      
      Internally, this doesn't add overhead to the syscall hot path, and it
      requires very little per-architecture support.  Even though it
      duplicates some seccomp functionality, I avoided using seccomp, due to
      previous feedback that it maybe shouldn't be mixed with seccomp since
      it is not a security mechanism.  And obviously, this should never be
      considered a security mechanism, since any part of the program can
      bypass it by using the syscall dispatcher.
      
      For the sysinfo benchmark, which measures the overhead added to
      executing a native syscall that doesn't require interception, the
      overhead of using only the direct dispatcher region to issue syscalls
      is pretty much irrelevant.  The overhead of using the selector is
      around 40ns for a native (unredirected) syscall on my system, and it is
      (as expected) dominated by the supervisor-mode user-address access.  In
      fact, with SMAP off, the overhead is consistently less than 5ns on my
      test box.
      Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Andy Lutomirski <luto@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20201127193238.821364-4-krisman@collabora.com
  6. 24 Nov 2020, 2 commits
  7. 17 Nov 2020, 2 commits
    • sched/deadline: Fix priority inheritance with multiple scheduling classes · 2279f540
      Authored by Juri Lelli
      Glenn reported that "an application [he developed produces] a BUG in
      deadline.c when a SCHED_DEADLINE task contends with CFS tasks on nested
      PTHREAD_PRIO_INHERIT mutexes.  I believe the bug is triggered when a CFS
      task that was boosted by a SCHED_DEADLINE task boosts another CFS task
      (nested priority inheritance).
      
       ------------[ cut here ]------------
       kernel BUG at kernel/sched/deadline.c:1462!
       invalid opcode: 0000 [#1] PREEMPT SMP
       CPU: 12 PID: 19171 Comm: dl_boost_bug Tainted: ...
       Hardware name: ...
       RIP: 0010:enqueue_task_dl+0x335/0x910
       Code: ...
       RSP: 0018:ffffc9000c2bbc68 EFLAGS: 00010002
       RAX: 0000000000000009 RBX: ffff888c0af94c00 RCX: ffffffff81e12500
       RDX: 000000000000002e RSI: ffff888c0af94c00 RDI: ffff888c10b22600
       RBP: ffffc9000c2bbd08 R08: 0000000000000009 R09: 0000000000000078
       R10: ffffffff81e12440 R11: ffffffff81e1236c R12: ffff888bc8932600
       R13: ffff888c0af94eb8 R14: ffff888c10b22600 R15: ffff888bc8932600
       FS:  00007fa58ac55700(0000) GS:ffff888c10b00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007fa58b523230 CR3: 0000000bf44ab003 CR4: 00000000007606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       PKRU: 55555554
       Call Trace:
        ? intel_pstate_update_util_hwp+0x13/0x170
        rt_mutex_setprio+0x1cc/0x4b0
        task_blocks_on_rt_mutex+0x225/0x260
        rt_spin_lock_slowlock_locked+0xab/0x2d0
        rt_spin_lock_slowlock+0x50/0x80
        hrtimer_grab_expiry_lock+0x20/0x30
        hrtimer_cancel+0x13/0x30
        do_nanosleep+0xa0/0x150
        hrtimer_nanosleep+0xe1/0x230
        ? __hrtimer_init_sleeper+0x60/0x60
        __x64_sys_nanosleep+0x8d/0xa0
        do_syscall_64+0x4a/0x100
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
       RIP: 0033:0x7fa58b52330d
       ...
       ---[ end trace 0000000000000002 ]---
      
      He also provided a simple reproducer creating the situation below:
      
       So the execution order of locking steps is the following
       (N1 and N2 are non-deadline tasks.  D1 is a deadline task.  M1 and M2
       are mutexes that are enabled with priority inheritance.)
      
       Time moves forward as this timeline goes down:
      
       N1              N2               D1
       |               |                |
       |               |                |
       Lock(M1)        |                |
       |               |                |
       |             Lock(M2)           |
       |               |                |
       |               |              Lock(M2)
       |               |                |
       |             Lock(M1)           |
       |             (!!bug triggered!) |
      
      Daniel reported a similar situation as well, by just letting ksoftirqd
      run with DEADLINE (and eventually block on a mutex).
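
      The key ingredient of such a reproducer is priority-inheritance
      mutexes; below is a minimal hedged sketch of setting one up (not the
      reporter's actual test; D1 would additionally be made SCHED_DEADLINE
      via sched_setattr()):

        #include <pthread.h>

        /* Illustrative only: M1/M2 in the timeline above would be created
         * like this, so that a waiter boosts the current owner. */
        static void init_pi_mutex(pthread_mutex_t *m)
        {
                pthread_mutexattr_t attr;

                pthread_mutexattr_init(&attr);
                pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
                pthread_mutex_init(m, &attr);
                pthread_mutexattr_destroy(&attr);
        }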
      
      The problem is that boosted entities (priority inheritance) use the
      static DEADLINE parameters of the top priority waiter.  However, there
      might be cases where the top waiter is a non-DEADLINE entity that is
      currently boosted by a DEADLINE entity from a different lock chain
      (i.e., nested priority chains involving entities of non-DEADLINE
      classes).  In this case, the top waiter's static DEADLINE parameters
      could be null (initialized to 0 at fork()) and replenish_dl_entity()
      would hit a BUG().
      
      Fix this by keeping track of the original donor and using its parameters
      when a task is boosted.
      Reported-by: Glenn Elliott <glenn@aurora.tech>
      Reported-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Link: https://lkml.kernel.org/r/20201117061432.517340-1-juri.lelli@redhat.com
    • sched: Fix data-race in wakeup · f97bb527
      Authored by Peter Zijlstra
      Mel reported that on some ARM64 platforms loadavg goes bananas and
      Will tracked it down to the following race:
      
        CPU0					CPU1
      
        schedule()
          prev->sched_contributes_to_load = X;
          deactivate_task(prev);
      
      					try_to_wake_up()
      					  if (p->on_rq &&) // false
      					  if (smp_load_acquire(&p->on_cpu) && // true
      					      ttwu_queue_wakelist())
      					        p->sched_remote_wakeup = Y;
      
          smp_store_release(prev->on_cpu, 0);
      
      where both p->sched_contributes_to_load and p->sched_remote_wakeup are
      in the same word, and thus the stores X and Y race (and can clobber
      one another's data).
      
      Prior to commit c6e7bd7a ("sched/core: Optimize ttwu() spinning on
      p->on_cpu"), the p->on_cpu handoff serialized access to
      p->sched_remote_wakeup (just as it still does with
      p->sched_contributes_to_load); that commit broke this by calling
      ttwu_queue_wakelist() with p->on_cpu != 0.
      
      However, due to
      
        p->XXX = X			ttwu()
        schedule()			  if (p->on_rq && ...) // false
          smp_mb__after_spinlock()	  if (smp_load_acquire(&p->on_cpu) &&
          deactivate_task()		      ttwu_queue_wakelist())
            p->on_rq = 0;		        p->sched_remote_wakeup = Y;
      
      We can be sure any 'current' store is complete and 'current' is
      guaranteed asleep. Therefore we can move p->sched_remote_wakeup into
      the current flags word.
      
      Note: while the observed failure was loadavg accounting gone wrong due
      to ttwu() clobbering p->sched_contributes_to_load, the reverse problem
      is also possible where schedule() clobbers p->sched_remote_wakeup;
      this could result in enqueue_entity() wrecking ->vruntime and causing
      scheduling artifacts.
      
      Fixes: c6e7bd7a ("sched/core: Optimize ttwu() spinning on p->on_cpu")
      Reported-by: Mel Gorman <mgorman@techsingularity.net>
      Debugged-by: Will Deacon <will@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20201117083016.GK3121392@hirez.programming.kicks-ass.net
  8. 11 Nov 2020, 3 commits
  9. 17 Oct 2020, 1 commit
  10. 14 Oct 2020, 1 commit
  11. 13 Oct 2020, 1 commit
  12. 07 Oct 2020, 1 commit
    • x86/mce: Recover from poison found while copying from user space · c0ab7ffc
      Authored by Tony Luck
      Existing kernel code can only recover from a machine check on code that
      is tagged in the exception table with a fault handling recovery path.
      
      Add two new fields to the task structure to pass information from the
      machine check handler to the "task_work" that is queued to run before
      the task returns to user mode:
      
      + mce_vaddr: will be initialized to the user virtual address of the fault
        in the case where the fault occurred in the kernel copying data from
        a user address.  This is so that kill_me_maybe() can provide that
        information to the user SIGBUS handler.
      
      + mce_kflags: copy of the struct mce.kflags needed by kill_me_maybe()
        to determine if mce_vaddr is applicable to this error.
      
      Add code to recover from a machine check while copying data from user
      space to the kernel.  The action for this case is the same as if the
      user touched the poison directly: unmap the page and send a SIGBUS to
      the task.
      
      Use a new helper function to share common code between the "fault
      in user mode" case and the "fault while copying from user" case.
      
      New code paths will be activated by the next patch which sets
      MCE_IN_KERNEL_COPYIN.
      Suggested-by: Borislav Petkov <bp@alien8.de>
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Link: https://lkml.kernel.org/r/20201006210910.21062-6-tony.luck@intel.com
  13. 03 Oct 2020, 1 commit
  14. 01 Oct 2020, 1 commit
    • io_uring: don't rely on weak ->files references · 0f212204
      Authored by Jens Axboe
      Grab actual references to the files_struct.  To avoid circular
      reference issues due to this, we add a per-task note that keeps track
      of what io_uring contexts a task has used.  When the task execs or
      exits its assigned files, we cancel requests based on this tracking.
      
      With that, we can grab proper references to the files table, and no
      longer need to rely on stashing away ring_fd and ring_file to check
      if the ring_fd may have been closed.
      
      Cc: stable@vger.kernel.org # v5.5+
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  15. 26 Aug 2020, 2 commits
  16. 06 Aug 2020, 2 commits
    • posix-cpu-timers: Provide mechanisms to defer timer handling to task_work · 1fb497dd
      Authored by Thomas Gleixner
      Running posix CPU timers in hard interrupt context has a few downsides:
      
       - For PREEMPT_RT it cannot work as the expiry code needs to take
         sighand lock, which is a 'sleeping spinlock' in RT. The original RT
         approach of offloading the posix CPU timer handling into a high
         priority thread was clumsy and provided no real benefit in general.
      
       - For fine grained accounting it's just wrong to run this in context of
         the timer interrupt because that way a process specific CPU time is
         accounted to the timer interrupt.
      
       - Long running timer interrupts can be caused by a large number of
         expiring timers which can be created and armed by unprivileged user
         space.
      
      There is no hard requirement to expire them in interrupt context.
      
      If the signal is targeted at the task itself then it won't be delivered
      before the task returns to user space anyway. If the signal is targeted at
      a supervisor process then it might be slightly delayed, but posix CPU
      timers are inaccurate anyway due to the fact that they are tied to the
      tick.
      
      Provide infrastructure to schedule task work which allows splitting the
      posix CPU timer code into a quick check in interrupt context and a thread
      context expiry and signal delivery function. This has to be enabled by
      architectures as it requires that the architecture specific KVM
      implementation handles pending task work before exiting to guest mode.
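
      A hedged sketch of the general task_work pattern this infrastructure
      builds on, written against the current generic task_work API (not the
      actual posix-cpu-timers code, which additionally guards against
      double-queueing with a per-task flag):

        #include <linux/sched.h>
        #include <linux/task_work.h>

        static void posix_timer_expiry_work(struct callback_head *head)
        {
                /* Task context, before returning to user mode: safe to take
                 * sighand lock, expire timers and deliver signals here. */
        }

        static struct callback_head expiry_work;

        static void timer_interrupt_quick_check(void)
        {
                /* Hard interrupt context: only note that timers may have
                 * expired and defer the heavy lifting to task context. */
                init_task_work(&expiry_work, posix_timer_expiry_work);
                /* Fails only if the task is already exiting. */
                task_work_add(current, &expiry_work, TWA_RESUME);
        }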
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20200730102337.783470146@linutronix.de
    • locking/seqlock, headers: Untangle the spaghetti monster · 0cd39f46
      Authored by Peter Zijlstra
      By using lockdep_assert_*() from seqlock.h, the spaghetti monster
      attacked.
      
      Attack back by reducing seqlock.h dependencies from two key high level headers:
      
       - <linux/seqlock.h>:               -Remove <linux/ww_mutex.h>
       - <linux/time.h>:                  -Remove <linux/seqlock.h>
       - <linux/sched.h>:                 +Add    <linux/seqlock.h>
      
      The price was to add it to sched.h ...
      
      Core header fallout: we add direct header dependencies instead of
      gaining them parasitically from higher level headers:
      
       - <linux/dynamic_queue_limits.h>:  +Add <asm/bug.h>
       - <linux/hrtimer.h>:               +Add <linux/seqlock.h>
       - <linux/ktime.h>:                 +Add <asm/bug.h>
       - <linux/lockdep.h>:               +Add <linux/smp.h>
       - <linux/sched.h>:                 +Add <linux/seqlock.h>
       - <linux/videodev2.h>:             +Add <linux/kernel.h>
      
      Arch headers fallout:
      
       - PARISC: <asm/timex.h>:           +Add <asm/special_insns.h>
       - SH:     <asm/io.h>:              +Add <asm/page.h>
       - SPARC:  <asm/timer_64.h>:        +Add <uapi/asm/asi.h>
       - SPARC:  <asm/vvar.h>:            +Add <asm/processor.h>, <asm/barrier.h>
                                          -Remove <linux/seqlock.h>
       - X86:    <asm/fixmap.h>:          +Add <asm/pgtable_types.h>
                                          -Remove <asm/acpi.h>
      
      There's also a bunch of parasitic header dependency fallout in .c files, not listed
      separately.
      
      [ mingo: Extended the changelog, split up & fixed the original patch. ]
      Co-developed-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lore.kernel.org/r/20200804133438.GK2674@hirez.programming.kicks-ass.net
  17. 31 Jul 2020, 2 commits
  18. 29 Jul 2020, 2 commits
    • sched: tasks: Use sequence counter with associated spinlock · b7505861
      Authored by Ahmed S. Darwish
      A sequence counter write side critical section must be protected by some
      form of locking to serialize writers. A plain seqcount_t does not
      contain the information of which lock must be held when entering a write
      side critical section.
      
      Use the new seqcount_spinlock_t data type, which allows associating a
      spinlock with the sequence counter.  This enables lockdep to verify
      that the spinlock used for writer serialization is held when the write
      side critical section is entered.
      
      If lockdep is disabled this lock association is compiled out and has
      neither storage size nor runtime overhead.
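
      A hedged sketch of the resulting pattern (illustrative names, not code
      taken from the patch):

        #include <linux/seqlock.h>
        #include <linux/spinlock.h>

        struct sample_stats {
                spinlock_t lock;                /* serializes writers */
                seqcount_spinlock_t seq;        /* associated with 'lock' */
                u64 total;
        };

        static void sample_stats_init(struct sample_stats *s)
        {
                spin_lock_init(&s->lock);
                seqcount_spinlock_init(&s->seq, &s->lock);
                s->total = 0;
        }

        static void sample_stats_add(struct sample_stats *s, u64 val)
        {
                spin_lock(&s->lock);            /* writer holds the lock ... */
                write_seqcount_begin(&s->seq);  /* ... which lockdep verifies */
                s->total += val;
                write_seqcount_end(&s->seq);
                spin_unlock(&s->lock);
        }

        static u64 sample_stats_read(struct sample_stats *s)
        {
                unsigned int start;
                u64 total;

                do {    /* lockless reader, retries on concurrent writes */
                        start = read_seqcount_begin(&s->seq);
                        total = s->total;
                } while (read_seqcount_retry(&s->seq, start));

                return total;
        }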
      Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200720155530.1173732-14-a.darwish@linutronix.de
    • sched/uclamp: Add a new sysctl to control RT default boost value · 13685c4a
      Authored by Qais Yousef
      RT tasks by default run at the highest capacity/performance level.
      When uclamp is selected, this default behavior is retained by enforcing
      the requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to
      be uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE, the maximum
      value.
      
      This is also referred to as 'the default boost value of RT tasks'.
      
      See commit 1a00d999 ("sched/uclamp: Set default clamps for RT tasks").
      
      On battery powered devices, it is desired to control this default
      (currently hardcoded) behavior at runtime to reduce energy consumed by
      RT tasks.
      
      For example, for a mobile device manufacturer where the big.LITTLE
      architecture is dominant, the performance of the little cores varies
      across SoCs, and on high end ones the big cores could be too power
      hungry.
      
      Given the diversity of SoCs, the new knob allows manufacturers to tune
      the best performance/power for RT tasks for the particular hardware
      they run on.
      
      They could opt to further tune the value when the user selects
      a different power saving mode or when the device is actively charging.
      
      The runtime aspect of it further helps in creating a single kernel image
      that can be run on multiple devices that require different tuning.
      
      Keep in mind that a lot of RT tasks in the system are created by the
      kernel.  On Android for instance I can see over 50 RT tasks, only
      a handful of which are created by the Android framework.
      
      To let system admins and device integrators control the default
      behavior globally, introduce the new
      sysctl_sched_uclamp_util_min_rt_default to change the default boost
      value of the RT tasks.
      
      I anticipate this to be mostly in the form of modifying the init script
      of a particular device.
      
      To avoid polluting the fast path with unnecessary code, the approach
      taken is to synchronously do the update by traversing all the existing
      tasks in the system.  This could race with a concurrent fork(), which
      is dealt with by introducing a sched_post_fork() function that ensures
      the racy fork gets the right update applied.
      
      Tested on Juno-r2 in combination with the RT capacity awareness [1].
      By default an RT task will go to the highest capacity CPU and run at the
      maximum frequency, which is particularly energy inefficient on high end
      mobile devices because the biggest core[s] are 'huge' and power hungry.
      
      With this patch the RT task can be controlled to run anywhere by
      default, and doesn't cause the frequency to be maximum all the time.
      Yet any task that really needs to be boosted can easily escape this
      default behavior by modifying its requested uclamp.min value
      (p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall.
      
      [1] 804d402f: ("sched/rt: Make RT capacity-aware")
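
      As a hedged userspace sketch of that per-task escape hatch (the struct
      mirrors the sched_setattr(2) uapi layout; the fallback flag value is
      taken from uapi <linux/sched.h> as an assumption for older headers;
      the priority and clamp values are illustrative, and changing RT policy
      needs the usual privileges):

        #include <sched.h>              /* SCHED_FIFO */
        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        #ifndef SCHED_FLAG_UTIL_CLAMP_MIN
        # define SCHED_FLAG_UTIL_CLAMP_MIN 0x20
        #endif

        struct sched_attr {                     /* mirrors the uapi layout */
                uint32_t size;
                uint32_t sched_policy;
                uint64_t sched_flags;
                int32_t  sched_nice;
                uint32_t sched_priority;
                uint64_t sched_runtime, sched_deadline, sched_period;
                uint32_t sched_util_min, sched_util_max;
        };

        int main(void)
        {
                struct sched_attr attr;

                memset(&attr, 0, sizeof(attr));
                attr.size = sizeof(attr);
                attr.sched_policy = SCHED_FIFO;
                attr.sched_priority = 10;
                attr.sched_flags = SCHED_FLAG_UTIL_CLAMP_MIN;
                attr.sched_util_min = 1024;     /* opt back into full boost */

                if (syscall(SYS_sched_setattr, 0, &attr, 0))
                        perror("sched_setattr");
                return 0;
        }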
      Signed-off-by: Qais Yousef <qais.yousef@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200716110347.19553-2-qais.yousef@arm.com
  19. 28 Jul 2020, 1 commit
  20. 22 Jul 2020, 1 commit
  21. 10 Jul 2020, 1 commit
  22. 08 Jul 2020, 3 commits
  23. 04 Jul 2020, 1 commit
  24. 28 Jun 2020, 2 commits
  25. 15 Jun 2020, 2 commits
    • sched: Remove sched_set_*() return value · 8b700983
      Authored by Peter Zijlstra
      Ingo suggested that since the new sched_set_*() functions are
      implemented using the 'nocheck' variants, they really shouldn't ever
      fail, so remove the return value.
      
      Cc: axboe@kernel.dk
      Cc: daniel.lezcano@linaro.org
      Cc: sudeep.holla@arm.com
      Cc: airlied@redhat.com
      Cc: broonie@kernel.org
      Cc: paulmck@kernel.org
      Suggested-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
    • sched: Provide sched_set_fifo() · 7318d4cc
      Authored by Peter Zijlstra
      SCHED_FIFO (or any static priority scheduler) is a broken scheduler
      model; it is fundamentally incapable of resource management, the one
      thing an OS is actually supposed to do.
      
      It is impossible to compose static priority workloads. One cannot take
      two well designed and functional static priority workloads and mash
      them together and still expect them to work.
      
      Therefore it doesn't make sense to expose the priority field; the
      kernel is fundamentally incapable of setting a sensible value, it
      needs systems knowledge that it doesn't have.
      
      Take away sched_setscheduler() / sched_setattr() from modules and
      replace them with:
      
        - sched_set_fifo(p); create a FIFO task (at prio 50)
        - sched_set_fifo_low(p); create a task higher than NORMAL,
      	which ends up being a FIFO task at prio 1.
        - sched_set_normal(p, nice); (re)set the task to normal
      
      This stops the proliferation of randomly chosen, and irrelevant, FIFO
      priorities that don't really mean anything anyway.
      
      The system administrator/integrator, whoever has insight into the
      actual system design and requirements (userspace) can set-up
      appropriate priorities if and when needed.
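
      A hedged sketch of the in-kernel usage this enables (the driver and
      thread names are made up for illustration):

        #include <linux/err.h>
        #include <linux/kthread.h>
        #include <linux/sched.h>

        static int my_worker_fn(void *data)
        {
                while (!kthread_should_stop()) {
                        /* real work would go here */
                        set_current_state(TASK_INTERRUPTIBLE);
                        schedule();
                }
                return 0;
        }

        static struct task_struct *my_worker;

        static int my_driver_start(void)
        {
                my_worker = kthread_run(my_worker_fn, NULL, "my_worker");
                if (IS_ERR(my_worker))
                        return PTR_ERR(my_worker);

                /* Instead of calling sched_setscheduler() with a made-up
                 * SCHED_FIFO priority, let the kernel pick one: */
                sched_set_fifo(my_worker);  /* or sched_set_fifo_low() / sched_set_normal() */
                return 0;
        }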
      
      Cc: airlied@redhat.com
      Cc: alexander.deucher@amd.com
      Cc: awalls@md.metrocast.net
      Cc: axboe@kernel.dk
      Cc: broonie@kernel.org
      Cc: daniel.lezcano@linaro.org
      Cc: gregkh@linuxfoundation.org
      Cc: hannes@cmpxchg.org
      Cc: herbert@gondor.apana.org.au
      Cc: hverkuil@xs4all.nl
      Cc: john.stultz@linaro.org
      Cc: nico@fluxnic.net
      Cc: paulmck@kernel.org
      Cc: rafael.j.wysocki@intel.com
      Cc: rmk+kernel@arm.linux.org.uk
      Cc: sudeep.holla@arm.com
      Cc: tglx@linutronix.de
      Cc: ulf.hansson@linaro.org
      Cc: wim@linux-watchdog.org
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Tested-by: Paul E. McKenney <paulmck@kernel.org>
  26. 11 Jun 2020, 1 commit
    • x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisoned · 17fae129
      Authored by Tony Luck
      An interesting thing happened when a guest Linux instance took a machine
      check. The VMM unmapped the bad page from guest physical space and
      passed the machine check to the guest.
      
      Linux took all the normal actions to offline the page from the process
      that was using it. But then guest Linux crashed because it said there
      was a second machine check inside the kernel with this stack trace:
      
      do_memory_failure
          set_mce_nospec
               set_memory_uc
                    _set_memory_uc
                         change_page_attr_set_clr
                              cpa_flush
                                   clflush_cache_range_opt
      
      This was odd, because a CLFLUSH instruction shouldn't raise a machine
      check (it isn't consuming the data).  Further investigation showed that
      the VMM had passed in another machine check because it appeared that
      the guest was accessing the bad page.
      
      The fix is to check the scope of the poison by checking the MCi_MISC
      register.  If the entire page is affected, then unmap the page.  If
      only part of the page is affected, then mark the page as uncacheable.
      
      This assumes that VMMs will do the logical thing and pass in the "whole
      page scope" via the MCi_MISC register (since they unmapped the entire
      page).
      
        [ bp: Adjust to x86/entry changes. ]
      
      Fixes: 284ce401 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
      Reported-by: Jue Wang <juew@google.com>
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Jue Wang <juew@google.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200520163546.GA7977@agluck-desk2.amr.corp.intel.com
  27. 05 Jun 2020, 1 commit
  28. 03 Jun 2020, 1 commit
    • mm/writeback: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE · a37b0715
      Authored by NeilBrown
      PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the
      loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a
      daemon needs to write to one bdi (the final bdi) in order to free up
      writes queued to another bdi (the client bdi).
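
      For reference, a hedged sketch of how such a userspace flusher daemon
      marks itself via the prctl() mentioned above (the fallback constant is
      an assumption for older headers; the prctl requires CAP_SYS_RESOURCE):

        #include <stdio.h>
        #include <sys/prctl.h>

        #ifndef PR_SET_IO_FLUSHER
        # define PR_SET_IO_FLUSHER 57
        #endif

        int main(void)
        {
                /* Mark this daemon as an I/O flusher: it must keep making
                 * progress on the "final" bdi in order to free dirty pages
                 * queued against the "client" bdi it serves. */
                if (prctl(PR_SET_IO_FLUSHER, 1, 0, 0, 0))
                        perror("PR_SET_IO_FLUSHER");

                /* ... serve the loop device / NFS re-export here ... */
                return 0;
        }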
      
      The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
      pages, so that it can still dirty pages after other processes have been
      throttled.  The purpose of this is to avoid deadlocks that happen when
      the PF_LESS_THROTTLE process must write for any dirty pages to be
      freed, but it is being throttled and cannot write.
      
      This approach was designed when all threads were blocked equally,
      independently of which device they were writing to, or how fast it was.
      Since that time the writeback algorithm has changed substantially with
      different threads getting different allowances based on non-trivial
      heuristics.  This means the simple "add 25%" heuristic is no longer
      reliable.
      
      The important issue is not that the daemon needs a *larger* dirty page
      allowance, but that it needs a *private* dirty page allowance, so that
      dirty pages for the "client" bdi that it is helping to clear (the bdi
      for an NFS filesystem or loop block device etc) do not affect the
      throttling of the daemon writing to the "final" bdi.
      
      This patch changes the heuristic so that the task is not throttled when
      the bdi it is writing to has a dirty page count below (or equal to) the
      free-run threshold for that bdi.  This ensures it will always be able
      to have some pages in flight, and so will not deadlock.
      
      In a steady-state, it is expected that PF_LOCAL_THROTTLE tasks might
      still be throttled by global threshold, but that is acceptable as it is
      only the deadlock state that is interesting for this flag.
      
      This approach of "only throttle when target bdi is busy" is consistent
      with the other use of PF_LESS_THROTTLE in current_may_throttle(), where
      it causes attention to be focused only on the target bdi.
      
      So this patch
       - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE,
       - removes the 25% bonus that that flag gives, and
       - If PF_LOCAL_THROTTLE is set, don't delay at all unless the
         global and the local free-run thresholds are exceeded.
      
      Note that previously realtime threads were treated the same as
      PF_LESS_THROTTLE threads.  This patch does *not* change the behaviour
      for real-time threads, so it is now different from the behaviour of
      nfsd and loop tasks.  I don't know what is wanted for realtime.
      
      [akpm@linux-foundation.org: coding style fixes]
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Chuck Lever <chuck.lever@oracle.com>	[nfsd]
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Link: http://lkml.kernel.org/r/87ftbf7gs3.fsf@notabene.neil.brown.name
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>