    workqueue, ktask: renice helper threads to prevent starvation · cdc79c13
    Submitted by Daniel Jordan
    hulk inclusion
    category: feature
    bugzilla: 13228
    CVE: NA
    ---------------------------
    
    With ktask helper threads running at MAX_NICE, it's possible for one or
    more of them to begin chunks of the task and then have their CPU time
    constrained by higher priority threads.  The main ktask thread, running
    at normal priority, may finish all available chunks of the task and then
    wait on the MAX_NICE helpers to finish the last in-progress chunks, for
    longer than it would have if no helpers were used.
    
    Avoid this by having the main thread assign its priority to each
    unfinished helper one at a time so that on a heavily loaded system,
    exactly one thread in a given ktask call is running at the main thread's
    priority.  At least one thread to ensure forward progress, and at most
    one thread to limit excessive multithreading.
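
    Roughly, the caller side could look like the sketch below, which uses
    the flush_work_at_nice() interface described further down.  The names
    here (struct ktask_work, kw_work, kw_list, ktask_wait_for_helpers,
    unfinished_works) are illustrative only, not the actual ktask
    internals:

      #include <linux/list.h>
      #include <linux/sched.h>
      #include <linux/workqueue.h>

      /* Illustrative only; the real ktask structures differ. */
      struct ktask_work {
              struct work_struct      kw_work;
              struct list_head        kw_list;
      };

      /*
       * Once the main thread runs out of chunks, lend its priority to
       * each unfinished helper in turn: at least one thread makes
       * forward progress at the main thread's nice level, and at most
       * one runs above MAX_NICE at any time, bounding the extra CPU
       * pressure.
       */
      static void ktask_wait_for_helpers(struct list_head *unfinished_works)
      {
              struct ktask_work *kw;
              int nice = task_nice(current);  /* main thread's nice level */

              list_for_each_entry(kw, unfinished_works, kw_list)
                      flush_work_at_nice(&kw->kw_work, nice);
      }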
    
    Since the workqueue interface, on which ktask is built, does not provide
    access to worker threads, ktask can't adjust their priorities directly,
    so add a new interface to allow a previously-queued work item to run at
    a different priority than the one controlled by the corresponding
    workqueue's 'nice' attribute.  The worker assigned to the work item will
    run the work at the given priority, temporarily overriding the worker's
    priority.
    
    The interface is flush_work_at_nice, which ensures the given work item's
    assigned worker runs at the specified nice level and waits for the work
    item to finish.
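
    Based on that description, the declaration presumably sits next to
    flush_work() in <linux/workqueue.h> and looks roughly like the
    following; the bool return type is an assumption modeled on
    flush_work(), not confirmed here:

      #include <linux/workqueue.h>

      /*
       * Assumed shape, mirroring flush_work(): make the worker assigned
       * to @work run it at @nice, then wait for @work to finish.
       * Presumably returns true if it waited and false if @work was
       * already idle, as flush_work() does.
       */
      bool flush_work_at_nice(struct work_struct *work, long nice);

    A caller that wants the work to run at its own priority would then
    pass task_nice(current) as the nice argument, which is how the loop
    sketched above uses it.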
    
    An alternative choice would have been to simply requeue the work item to
    a pool with workers of the new priority, but this doesn't seem feasible
    because a worker may have already started executing the work and there's
    currently no way to interrupt it midway through.  The proposed interface
    solves this issue because a worker's priority can be adjusted while it's
    executing the work.
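
    The mechanism that makes this possible is simply that a worker is an
    ordinary kthread, so another task can change its nice value with
    set_user_nice() while it is in the middle of executing a work item.
    A minimal illustration (lend_worker_nice() is a made-up helper, not
    part of the patch, and the actual patch may handle the restore
    differently, e.g. in the worker itself):

      #include <linux/sched.h>

      /* Temporarily override a running worker's priority from outside. */
      static void lend_worker_nice(struct task_struct *worker_task, long nice)
      {
              long old_nice = task_nice(worker_task);

              /* Takes effect immediately, even mid-work-function. */
              set_user_nice(worker_task, nice);

              /* ... wait here for the in-flight work item to finish ... */

              set_user_nice(worker_task, old_nice);   /* restore */
      }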
    
    TODO:  flush_work_at_nice is a proof-of-concept only, and it may be
    desired to have the interface set the work's nice without also waiting
    for it to finish.  It's implemented in the flush path for this RFC
    because it was fairly simple to write ;-)
    
    I ran tests similar to the ones in the last patch with a couple of
    differences:
     - The non-ktask workload uses 8 CPUs instead of 7 to compete with the
       main ktask thread as well as the ktask helpers, so that when the main
       thread finishes, its CPU is completely occupied by the non-ktask
       workload, meaning MAX_NICE helpers can't run as often.
     - The non-ktask workload starts before the ktask workload, rather
       than after, to maximize the chance that it starves helpers.
    
    Runtimes in seconds.
    
    Case 1: Synthetic, worst-case CPU contention
    
     ktask_test - a tight loop doing integer multiplication to max out on CPU;
                  used for testing only, does not appear in this series
     stress-ng  - cpu stressor ("-c --cpu-method ackerman --cpu-ops 1200");
    
                 8_ktask_thrs           8_ktask_thrs
                 w/o_renice(stdev)   with_renice  (stdev)  1_ktask_thr(stdev)
                 ------------------------------------------------------------
      ktask_test    41.98  ( 0.22)         25.15  ( 2.98)      30.40  ( 0.61)
      stress-ng     44.79  ( 1.11)         46.37  ( 0.69)      53.29  ( 1.91)
    
    Without renicing, ktask_test finishes just after stress-ng does because
    stress-ng needs to free up CPUs for the helpers to finish (ktask_test
    shows a shorter runtime than stress-ng because ktask_test was started
    later).  Renicing lets ktask_test finish 40% sooner, and running the
    same amount of work in ktask_test with 1 thread instead of 8 finishes
    in a comparable amount of time, though still longer than "with_renice":
    with 8 threads the MAX_NICE helpers still get some CPU time, and that
    effect adds up across the threads.
    
    stress-ng's total runtime gets a little longer going from no renicing
    to renicing, as expected, because each reniced ktask thread takes more
    CPU time than it did when the helpers were starved.
    
    Running with one ktask thread, stress-ng's reported walltime goes up
    because that single thread interferes with fewer stress-ng threads,
    but with more impact, causing a greater spread in the times individual
    stress-ng threads take to finish.  The averages of the per-thread
    stress-ng times are roughly the same, though: 43.81 for "with_renice"
    and 43.89 for "1_ktask_thr".  So the total runtime of stress-ng summed
    across all threads is unaffected, but the wall-clock time stress-ng
    needs to finish all of its threads actually improves when the
    ktask_test work is spread over more threads.
    
    Case 2: Real-world CPU contention
    
     ktask_vfio - VFIO page pin a 32G kvm guest
     usemem     - faults in 86G of anonymous THP per thread, PAGE_SIZE stride;
                  used to mimic the page clearing that dominates in ktask_vfio
                  so that usemem competes for the same system resources
    
                 8_ktask_thrs           8_ktask_thrs
                 w/o_renice  (stdev)   with_renice  (stdev)  1_ktask_thr(stdev)
                 --------------------------------------------------------------
      ktask_vfio    18.59  ( 0.19)         14.62  ( 2.03)      16.24  ( 0.90)
          usemem    47.54  ( 0.89)         48.18  ( 0.77)      49.70  ( 1.20)
    
    These results are similar to case 1's, though the differences between
    times are not as pronounced because ktask_vfio's runtime was short
    relative to usemem's.

    Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
    Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
    Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
    Tested-by: Hongbo Yao <yaohongbo@huawei.com>
    Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>