1. 14 7月, 2012 2 次提交
    • T
      workqueue: reimplement WQ_HIGHPRI using a separate worker_pool · 3270476a
      Tejun Heo 提交于
      WQ_HIGHPRI was implemented by queueing highpri work items at the head
      of the global worklist.  Other than queueing at the head, they weren't
      handled differently; unfortunately, this could lead to execution
      latency of a few seconds on heavily loaded systems.
      
      Now that workqueue code has been updated to deal with multiple
      worker_pools per global_cwq, this patch reimplements WQ_HIGHPRI using
      a separate worker_pool.  NR_WORKER_POOLS is bumped to two and
      gcwq->pools[0] is used for normal pri work items and ->pools[1] for
      highpri.  Highpri workers get -20 nice level and has 'H' suffix in
      their names.  Note that this change increases the number of kworkers
      per cpu.
      
      POOL_HIGHPRI_PENDING, pool_determine_ins_pos() and highpri chain
      wakeup code in process_one_work() are no longer used and removed.
      
      This allows proper prioritization of highpri work items and removes
      high execution latency of highpri work items.
      
      v2: nr_running indexing bug in get_pool_nr_running() fixed.
      
      v3: Refreshed for the get_pool_nr_running() update in the previous
          patch.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NJosh Hunt <joshhunt00@gmail.com>
      LKML-Reference: <CAKA=qzaHqwZ8eqpLNFjxnO2fX-tgAOjmpvxgBFjv6dJeQaOW1w@mail.gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      3270476a
    • T
      workqueue: introduce NR_WORKER_POOLS and for_each_worker_pool() · 4ce62e9e
      Tejun Heo 提交于
      Introduce NR_WORKER_POOLS and for_each_worker_pool() and convert code
      paths which need to manipulate all pools in a gcwq to use them.
      NR_WORKER_POOLS is currently one and for_each_worker_pool() iterates
      over only @gcwq->pool.
      
      Note that nr_running is per-pool property and converted to an array
      with NR_WORKER_POOLS elements and renamed to pool_nr_running.  Note
      that get_pool_nr_running() currently assumes 0 index.  The next patch
      will make use of non-zero index.
      
      The changes in this patch are mechanical and don't caues any
      functional difference.  This is to prepare for multiple pools per
      gcwq.
      
      v2: nr_running indexing bug in get_pool_nr_running() fixed.
      
      v3: Pointer to array is stupid.  Don't use it in get_pool_nr_running()
          as suggested by Linus.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      4ce62e9e
  2. 13 7月, 2012 4 次提交
    • T
      workqueue: separate out worker_pool flags · 11ebea50
      Tejun Heo 提交于
      GCWQ_MANAGE_WORKERS, GCWQ_MANAGING_WORKERS and GCWQ_HIGHPRI_PENDING
      are per-pool properties.  Add worker_pool->flags and make the above
      three flags per-pool flags.
      
      The changes in this patch are mechanical and don't caues any
      functional difference.  This is to prepare for multiple pools per
      gcwq.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      11ebea50
    • T
      workqueue: use @pool instead of @gcwq or @cpu where applicable · 63d95a91
      Tejun Heo 提交于
      Modify all functions which deal with per-pool properties to pass
      around @pool instead of @gcwq or @cpu.
      
      The changes in this patch are mechanical and don't caues any
      functional difference.  This is to prepare for multiple pools per
      gcwq.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      63d95a91
    • T
      workqueue: factor out worker_pool from global_cwq · bd7bdd43
      Tejun Heo 提交于
      Move worklist and all worker management fields from global_cwq into
      the new struct worker_pool.  worker_pool points back to the containing
      gcwq.  worker and cpu_workqueue_struct are updated to point to
      worker_pool instead of gcwq too.
      
      This change is mechanical and doesn't introduce any functional
      difference other than rearranging of fields and an added level of
      indirection in some places.  This is to prepare for multiple pools per
      gcwq.
      
      v2: Comment typo fixes as suggested by Namhyung.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      bd7bdd43
    • T
      workqueue: don't use WQ_HIGHPRI for unbound workqueues · 974271c4
      Tejun Heo 提交于
      Unbound wqs aren't concurrency-managed and try to execute work items
      as soon as possible.  This is currently achieved by implicitly setting
      %WQ_HIGHPRI on all unbound workqueues; however, WQ_HIGHPRI
      implementation is about to be restructured and this usage won't be
      valid anymore.
      
      Add an explicit chain-wakeup path for unbound workqueues in
      process_one_work() instead of piggy backing on %WQ_HIGHPRI.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      974271c4
  3. 15 5月, 2012 2 次提交
    • P
      lockdep: fix oops in processing workqueue · 4d82a1de
      Peter Zijlstra 提交于
      Under memory load, on x86_64, with lockdep enabled, the workqueue's
      process_one_work() has been seen to oops in __lock_acquire(), barfing
      on a 0xffffffff00000000 pointer in the lockdep_map's class_cache[].
      
      Because it's permissible to free a work_struct from its callout function,
      the map used is an onstack copy of the map given in the work_struct: and
      that copy is made without any locking.
      
      Surprisingly, gcc (4.5.1 in Hugh's case) uses "rep movsl" rather than
      "rep movsq" for that structure copy: which might race with a workqueue
      user's wait_on_work() doing lock_map_acquire() on the source of the
      copy, putting a pointer into the class_cache[], but only in time for
      the top half of that pointer to be copied to the destination map.
      
      Boom when process_one_work() subsequently does lock_map_acquire()
      on its onstack copy of the lockdep_map.
      
      Fix this, and a similar instance in call_timer_fn(), with a
      lockdep_copy_map() function which additionally NULLs the class_cache[].
      
      Note: this oops was actually seen on 3.4-next, where flush_work() newly
      does the racing lock_map_acquire(); but Tejun points out that 3.4 and
      earlier are already vulnerable to the same through wait_on_work().
      
      * Patch orginally from Peter.  Hugh modified it a bit and wrote the
        description.
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Reported-by: NHugh Dickins <hughd@google.com>
      LKML-Reference: <alpine.LSU.2.00.1205070951170.1544@eggly.anvils>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      4d82a1de
    • T
      workqueue: skip nr_running sanity check in worker_enter_idle() if trustee is active · 544ecf31
      Tejun Heo 提交于
      worker_enter_idle() has WARN_ON_ONCE() which triggers if nr_running
      isn't zero when every worker is idle.  This can trigger spuriously
      while a cpu is going down due to the way trustee sets %WORKER_ROGUE
      and zaps nr_running.
      
      It first sets %WORKER_ROGUE on all workers without updating
      nr_running, releases gcwq->lock, schedules, regrabs gcwq->lock and
      then zaps nr_running.  If the last running worker enters idle
      inbetween, it would see stale nr_running which hasn't been zapped yet
      and trigger the WARN_ON_ONCE().
      
      Fix it by performing the sanity check iff the trustee is idle.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: N"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: stable@vger.kernel.org
      544ecf31
  4. 24 4月, 2012 1 次提交
    • S
      workqueue: Catch more locking problems with flush_work() · 0976dfc1
      Stephen Boyd 提交于
      If a workqueue is flushed with flush_work() lockdep checking can
      be circumvented. For example:
      
       static DEFINE_MUTEX(mutex);
      
       static void my_work(struct work_struct *w)
       {
               mutex_lock(&mutex);
               mutex_unlock(&mutex);
       }
      
       static DECLARE_WORK(work, my_work);
      
       static int __init start_test_module(void)
       {
               schedule_work(&work);
               return 0;
       }
       module_init(start_test_module);
      
       static void __exit stop_test_module(void)
       {
               mutex_lock(&mutex);
               flush_work(&work);
               mutex_unlock(&mutex);
       }
       module_exit(stop_test_module);
      
      would not always print a warning when flush_work() was called.
      In this trivial example nothing could go wrong since we are
      guaranteed module_init() and module_exit() don't run concurrently,
      but if the work item is schedule asynchronously we could have a
      scenario where the work item is running just at the time flush_work()
      is called resulting in a classic ABBA locking problem.
      
      Add a lockdep hint by acquiring and releasing the work item
      lockdep_map in flush_work() so that we always catch this
      potential deadlock scenario.
      Signed-off-by: NStephen Boyd <sboyd@codeaurora.org>
      Reviewed-by: NYong Zhang <yong.zhang0@gmail.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      0976dfc1
  5. 17 4月, 2012 1 次提交
  6. 13 3月, 2012 1 次提交
  7. 02 3月, 2012 1 次提交
    • A
      Block: use a freezable workqueue for disk-event polling · 62d3c543
      Alan Stern 提交于
      This patch (as1519) fixes a bug in the block layer's disk-events
      polling.  The polling is done by a work routine queued on the
      system_nrt_wq workqueue.  Since that workqueue isn't freezable, the
      polling continues even in the middle of a system sleep transition.
      
      Obviously, polling a suspended drive for media changes and such isn't
      a good thing to do; in the case of USB mass-storage devices it can
      lead to real problems requiring device resets and even re-enumeration.
      
      The patch fixes things by creating a new system-wide, non-reentrant,
      freezable workqueue and using it for disk-events polling.
      Signed-off-by: NAlan Stern <stern@rowland.harvard.edu>
      CC: <stable@kernel.org>
      Acked-by: NTejun Heo <tj@kernel.org>
      Acked-by: NRafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      62d3c543
  8. 11 1月, 2012 1 次提交
  9. 31 10月, 2011 1 次提交
  10. 15 9月, 2011 1 次提交
  11. 20 5月, 2011 1 次提交
    • T
      workqueue: separate out drain_workqueue() from destroy_workqueue() · 9c5a2ba7
      Tejun Heo 提交于
      There are users which want to drain workqueues without destroying it.
      Separate out drain functionality from destroy_workqueue() into
      drain_workqueue() and make it accessible to workqueue users.
      
      To guarantee forward-progress, only chain queueing is allowed while
      drain is in progress.  If a new work item which isn't chained from the
      running or pending work items is queued while draining is in progress,
      WARN_ON_ONCE() is triggered.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
      9c5a2ba7
  12. 30 4月, 2011 1 次提交
  13. 31 3月, 2011 1 次提交
  14. 25 3月, 2011 1 次提交
    • T
      percpu: Always align percpu output section to PAGE_SIZE · 0415b00d
      Tejun Heo 提交于
      Percpu allocator honors alignment request upto PAGE_SIZE and both the
      percpu addresses in the percpu address space and the translated kernel
      addresses should be aligned accordingly.  The calculation of the
      former depends on the alignment of percpu output section in the kernel
      image.
      
      The linker script macros PERCPU_VADDR() and PERCPU() are used to
      define this output section and the latter takes @align parameter.
      Several architectures are using @align smaller than PAGE_SIZE breaking
      percpu memory alignment.
      
      This patch removes @align parameter from PERCPU(), renames it to
      PERCPU_SECTION() and makes it always align to PAGE_SIZE.  While at it,
      add PCPU_SETUP_BUG_ON() checks such that alignment problems are
      reliably detected and remove percpu alignment comment recently added
      in workqueue.c as the condition would trigger BUG way before reaching
      there.
      
      For um, this patch raises the alignment of percpu area.  As the area
      is in .init, there shouldn't be any noticeable difference.
      
      This problem was discovered by David Howells while debugging boot
      failure on mn10300.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NMike Frysinger <vapier@gentoo.org>
      Cc: uclinux-dist-devel@blackfin.uclinux.org
      Cc: David Howells <dhowells@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: user-mode-linux-devel@lists.sourceforge.net
      0415b00d
  15. 23 3月, 2011 1 次提交
  16. 08 3月, 2011 1 次提交
    • S
      debugobjects: Add hint for better object identification · 99777288
      Stanislaw Gruszka 提交于
      In complex subsystems like mac80211 structures can contain several
      timers and work structs, so identifying a specific instance from the
      call trace and object type output of debugobjects can be hard.
      
      Allow the subsystems which support debugobjects to provide a hint
      function. This function returns a pointer to a kernel address
      (preferrably the objects callback function) which is printed along
      with the debugobjects type.
      
      Add hint methods for timer_list, work_struct and hrtimer.
      
      [ tglx: Massaged changelog, made it compile ]
      Signed-off-by: NStanislaw Gruszka <sgruszka@redhat.com>
      LKML-Reference: <20110307085809.GA9334@redhat.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      99777288
  17. 21 2月, 2011 1 次提交
  18. 17 2月, 2011 2 次提交
  19. 14 2月, 2011 1 次提交
  20. 09 2月, 2011 1 次提交
  21. 11 1月, 2011 2 次提交
    • T
      workqueue: note the nested NOT_RUNNING test in worker_clr_flags() isn't a noop · 42c025f3
      Tejun Heo 提交于
      The nested NOT_RUNNING test in worker_clr_flags() is slightly
      misleading in that if NOT_RUNNING were a single flag the nested test
      would be always %true and thus noop.  Add a comment noting that the
      test isn't a noop.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      42c025f3
    • T
      workqueue: relax lockdep annotation on flush_work() · e159489b
      Tejun Heo 提交于
      Currently, the lockdep annotation in flush_work() requires exclusive
      access on the workqueue the target work is queued on and triggers
      warning if a work is trying to flush another work on the same
      workqueue; however, this is no longer true as workqueues can now
      execute multiple works concurrently.
      
      This patch adds lock_map_acquire_read() and make process_one_work()
      hold read access to the workqueue while executing a work and
      start_flush_work() check for write access if concurrnecy level is one
      or the workqueue has a rescuer (as only one execution resource - the
      rescuer - is guaranteed to be available under memory pressure), and
      read access if higher.
      
      This better represents what's going on and removes spurious lockdep
      warnings which are triggered by fake dependency chain created through
      flush_work().
      
      * Peter pointed out that flushing another work from a WQ_MEM_RECLAIM
        wq breaks forward progress guarantee under memory pressure.
        Condition check accordingly updated.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: N"Rafael J. Wysocki" <rjw@sisk.pl>
      Tested-by: N"Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: stable@kernel.org
      e159489b
  22. 21 12月, 2010 1 次提交
    • T
      workqueue: allow chained queueing during destruction · c8efcc25
      Tejun Heo 提交于
      Currently, destroy_workqueue() makes the workqueue deny all new
      queueing by setting WQ_DYING and flushes the workqueue once before
      proceeding with destruction; however, there are cases where work items
      queue more related work items.  Currently, such users need to
      explicitly flush the workqueue multiple times depending on the
      possible depth of such chained queueing.
      
      This patch updates the queueing path such that a work item can queue
      further work items on the same workqueue even when WQ_DYING is set.
      The flush on destruction is automatically retried until the workqueue
      is empty.  This guarantees that the workqueue is empty on destruction
      while allowing chained queueing.
      
      The flush retry logic whines if it takes too many retries to drain the
      workqueue.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      c8efcc25
  23. 14 12月, 2010 1 次提交
    • S
      workqueue: It is likely that WORKER_NOT_RUNNING is true · 2d64672e
      Steven Rostedt 提交于
      Running the annotate branch profiler on three boxes, including my
      main box that runs firefox, evolution, xchat, and is part of the distcc farm,
      showed this with the likelys in the workqueue code:
      
       correct incorrect  %        Function                  File              Line
       ------- ---------  -        --------                  ----              ----
            96   996253  99 wq_worker_sleeping             workqueue.c          703
            96   996247  99 wq_worker_waking_up            workqueue.c          677
      
      The likely()s in this case were assuming that WORKER_NOT_RUNNING will
      most likely be false. But this is not the case. The reason is
      (and shown by adding trace_printks and testing it) that most of the time
      WORKER_PREP is set.
      
      In worker_thread() we have:
      
      	worker_clr_flags(worker, WORKER_PREP);
      
      	[ do work stuff ]
      
      	worker_set_flags(worker, WORKER_PREP, false);
      
      (that 'false' means not to wake up an idle worker)
      
      The wq_worker_sleeping() is called from schedule when a worker thread
      is putting itself to sleep. Which happens most of the time outside
      of that [ do work stuff ].
      
      The wq_worker_waking_up is called by the wakeup worker code, which
      is also callod outside that [ do work stuff ].
      
      Thus, the likely and unlikely used by those two functions are actually
      backwards.
      
      Remove the annotation and let gcc figure it out.
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      2d64672e
  24. 26 11月, 2010 1 次提交
  25. 27 10月, 2010 1 次提交
  26. 26 10月, 2010 1 次提交
    • D
      MN10300: Fix the PERCPU() alignment to allow for workqueues · 52605627
      David Howells 提交于
      In the MN10300 arch, we occasionally see an assertion being tripped in
      alloc_cwqs() at the following line:
      
              /* just in case, make sure it's actually aligned */
        --->  BUG_ON(!IS_ALIGNED(wq->cpu_wq.v, align));
              return wq->cpu_wq.v ? 0 : -ENOMEM;
      
      The values are:
      
              wa->cpu_wq.v => 0x902776e0
              align => 0x100
      
      and align is calculated by the following:
      
              const size_t align = max_t(size_t, 1 << WORK_STRUCT_FLAG_BITS,
                                         __alignof__(unsigned long long));
      
      This is because the pointer in question (wq->cpu_wq.v) loses some of its
      lower bits to control flags, and so the object it points to must be
      sufficiently aligned to avoid the need to use those bits for pointing to
      things.
      
      Currently, 4 control bits and 4 colour bits are used in normal
      circumstances, plus a debugging bit if debugging is set.  This requires
      the cpu_workqueue_struct struct to be at least 256 bytes aligned (or 512
      bytes aligned with debugging).
      
      PERCPU() alignment on MN13000, however, is only 32 bytes as set in
      vmlinux.lds.S.  So we set this to PAGE_SIZE (4096) to match most other
      arches and stick a comment in alloc_cwqs() for anyone else who triggers
      the assertion.
      Reported-by: NAkira Takeuchi <takeuchi.akr@jp.panasonic.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NMark Salter <msalter@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      52605627
  27. 19 10月, 2010 2 次提交
  28. 11 10月, 2010 2 次提交
    • T
      workqueue: add and use WQ_MEM_RECLAIM flag · 6370a6ad
      Tejun Heo 提交于
      Add WQ_MEM_RECLAIM flag which currently maps to WQ_RESCUER, mark
      WQ_RESCUER as internal and replace all external WQ_RESCUER usages to
      WQ_MEM_RECLAIM.
      
      This makes the API users express the intent of the workqueue instead
      of indicating the internal mechanism used to guarantee forward
      progress.  This is also to make it cleaner to add more semantics to
      WQ_MEM_RECLAIM.  For example, if deemed necessary, memory reclaim
      workqueues can be made highpri.
      
      This patch doesn't introduce any functional change.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jeff Garzik <jgarzik@pobox.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      6370a6ad
    • T
      workqueue: fix HIGHPRI handling in keep_working() · 30310045
      Tejun Heo 提交于
      The policy function keep_working() didn't check GCWQ_HIGHPRI_PENDING
      and could return %false with highpri work pending.  This could lead to
      late execution of a highpri work which was delayed due to @max_active
      throttling if other works are actively consuming CPU cycles.
      
      For example, the following could happen.
      
      1. Work W0 which burns CPU cycles.
      
      2. Two works W1 and W2 are queued to a highpri wq w/ @max_active of 1.
      
      3. W1 starts executing and W2 is put to delayed queue.  W0 and W1 are
         both runnable.
      
      4. W1 finishes which puts W2 to pending queue but keep_working()
         incorrectly returns %false and the worker goes to sleep.
      
      5. W0 finishes and W2 starts execution.
      
      With this patch applied, W2 starts execution as soon as W1 finishes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      30310045
  29. 05 10月, 2010 2 次提交
    • T
      workqueue: add queue_work and activate_work trace points · cdadf009
      Tejun Heo 提交于
      These two tracepoints allow tracking when and how a work is queued and
      activated.  This patch is based on Frederic's patch to add queue_work
      trace point.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      cdadf009
    • T
      workqueue: prepare for more tracepoints · 97bd2347
      Tejun Heo 提交于
      Define workqueue_work event class and use it for workqueue_execute_end
      trace point.  Also, move trace/events/workqueue.h include downwards
      such that all struct definitions are visible to it.  This is to
      prepare for more tracepoints and doesn't cause any functional change.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      97bd2347
  30. 19 9月, 2010 1 次提交
    • T
      workqueue: implement flush[_delayed]_work_sync() · 09383498
      Tejun Heo 提交于
      Implement flush[_delayed]_work_sync().  These are flush functions
      which also make sure no CPU is still executing the target work from
      earlier queueing instances.  These are similar to
      cancel[_delayed]_work_sync() except that the target work item is
      flushed instead of cancelled.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      09383498