1. 25 Jan 2013 (14 commits)
    • workqueue: remove worker_pool->gcwq · 4e8f0a60
      Tejun Heo authored
      The only remaining user of pool->gcwq is std_worker_pool_pri().
      Reimplement it using get_gcwq() and remove worker_pool->gcwq.
      
      This is part of an effort to remove global_cwq and make worker_pool
      the top level abstraction, which in turn will help implementing worker
      pools with user-specified attributes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      4e8f0a60
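      A minimal sketch of what the reimplementation described above could look like; the body is an assumption based on the message, not the literal patch:

        static int std_worker_pool_pri(struct worker_pool *pool)
        {
                /* a pool's index within its CPU's gcwq->pools[] is its priority */
                return pool - get_gcwq(pool->cpu)->pools;
        }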
    • workqueue: replace for_each_worker_pool() with for_each_std_worker_pool() · 38db41d9
      Tejun Heo authored
      for_each_std_worker_pool() takes @cpu instead of @gcwq.
      
      This is part of an effort to remove global_cwq and make worker_pool
      the top level abstraction, which in turn will help implementing worker
      pools with user-specified attributes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      38db41d9
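      A hedged sketch of the @cpu-based iterator the message describes; the exact macro body and the NR_STD_WORKER_POOLS bound are assumptions:

        #define for_each_std_worker_pool(pool, cpu)                            \
                for ((pool) = &get_gcwq(cpu)->pools[0];                        \
                     (pool) < &get_gcwq(cpu)->pools[NR_STD_WORKER_POOLS];      \
                     (pool)++)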
    • workqueue: make freezing/thawing per-pool · a1056305
      Tejun Heo authored
      Instead of holding locks from both pools and then processing the pools
      together, make freezing/thawing per-pool - grab locks of one pool,
      process it, release it and then proceed to the next pool.
      
      While this patch changes processing order across pools, order within
      each pool remains the same.  As each pool is independent, this
      shouldn't break anything.
      
      This is part of an effort to remove global_cwq and make worker_pool
      the top level abstraction, which in turn will help implementing worker
      pools with user-specified attributes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      a1056305
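      An illustrative sketch of the per-pool pattern described above; the helper name freeze_pools_of_cpu() is hypothetical and only the lock-one-pool-at-a-time pattern is the point:

        static void freeze_pools_of_cpu(unsigned int cpu)
        {
                struct worker_pool *pool;

                /* one pool at a time: lock, mark frozen, unlock, move on */
                for_each_std_worker_pool(pool, cpu) {
                        spin_lock_irq(&pool->lock);
                        WARN_ON_ONCE(pool->flags & POOL_FREEZING);
                        pool->flags |= POOL_FREEZING;
                        spin_unlock_irq(&pool->lock);
                }
        }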
    • workqueue: make hotplug processing per-pool · 94cf58bb
      Tejun Heo authored
      Instead of holding locks from both pools and then processing the pools
      together, make hotplug processing per-pool - grab locks of one pool,
      process it, release it and then proceed to the next pool.
      
      rebind_workers() is updated to take and process @pool instead of @gcwq
      which results in a lot of de-indentation.  gcwq_claim_assoc_and_lock()
      and its counterpart are replaced with in-line per-pool locking.
      
      While this patch changes processing order across pools, order within
      each pool remains the same.  As each pool is independent, this
      shouldn't break anything.
      
      This is part of an effort to remove global_cwq and make worker_pool
      the top level abstraction, which in turn will help implementing worker
      pools with user-specified attributes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      94cf58bb
    • workqueue: move global_cwq->lock to worker_pool · d565ed63
      Tejun Heo authored
      Move gcwq->lock to pool->lock.  The conversion is mostly
      straight-forward.  Things worth noting are
      
      * In many places, this removes the need to use gcwq completely.  pool
        is used directly instead.  get_std_worker_pool() is added to help
        some of these conversions.  This also leaves get_work_gcwq() without
        any user.  Removed.
      
      * In hotplug and freezer paths, the pools belonging to a CPU are often
        processed together.  This patch makes those paths hold locks of all
        pools, with highpri lock nested inside, to keep the conversion
        straight-forward.  These nested lockings will be removed by
        following patches.
      
      This is part of an effort to remove global_cwq and make worker_pool
      the top level abstraction, which in turn will help implementing worker
      pools with user-specified attributes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      d565ed63
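      A sketch of the transitional "lock all pools of a CPU" pattern mentioned above, with the highpri pool's lock nested inside; the helper names here are illustrative (the real helpers are the gcwq_claim_assoc_and_lock() family) and irq handling is omitted:

        static void gcwq_lock_all_pools(struct global_cwq *gcwq)
        {
                struct worker_pool *pool;

                /* subclass 0 for the normal pool, 1 for the highpri pool */
                for_each_worker_pool(pool, gcwq)
                        spin_lock_nested(&pool->lock, pool - gcwq->pools);
        }

        static void gcwq_unlock_all_pools(struct global_cwq *gcwq)
        {
                struct worker_pool *pool;

                for_each_worker_pool(pool, gcwq)
                        spin_unlock(&pool->lock);
        }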
    • workqueue: move global_cwq->cpu to worker_pool · ec22ca5e
      Tejun Heo authored
      Move gcwq->cpu to pool->cpu.  This introduces a couple places where
      gcwq->pools[0].cpu is used.  These will soon go away as gcwq is
      further reduced.
      
      This is part of an effort to remove global_cwq and make worker_pool
      the top level abstraction, which in turn will help implementing worker
      pools with user-specified attributes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      ec22ca5e
    • workqueue: move busy_hash from global_cwq to worker_pool · c9e7cf27
      Tejun Heo authored
      There's no functional necessity for the two pools on the same CPU to
      share the busy hash table.  It's also likely to be a bottleneck when
      implementing pools with user-specified attributes.
      
      This patch makes busy_hash per-pool.  The conversion is mostly
      straight-forward.  Changes worth noting are,
      
      * The large block of changes in rebind_workers() comes from moving the
        block inside for_each_worker_pool(), as there are now separate hash
        tables for each pool.  This changes the order of operations but
        doesn't break anything.
      
      * The three for_each_worker_pool() loops in gcwq_unbind_fn() are
        combined into one.  This again changes the order of operations but
        doesn't break anything.
      
      This is part of an effort to remove global_cwq and make worker_pool
      the top level abstraction, which in turn will help implementing worker
      pools with user-specified attributes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      c9e7cf27
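      A sketch of the structural change, using the generic hashtable helpers from <linux/hashtable.h>; surrounding fields are elided and only the moved member is shown:

        struct worker_pool {
                struct global_cwq       *gcwq;          /* the owning gcwq */
                /* ... idle_list, worker list, etc. unchanged ... */

                /* workers currently executing work items, hashed by the work
                 * item; previously one table shared by both pools of a CPU */
                DECLARE_HASHTABLE(busy_hash, BUSY_WORKER_HASH_ORDER);
        };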
    • workqueue: record pool ID instead of CPU in work->data when off-queue · 7c3eed5c
      Tejun Heo authored
      Currently, when a work item is off-queue, work->data records the CPU
      it was last on, which is used to locate the last executing instance
      for non-reentrance, flushing, etc.
      
      We're in the process of removing global_cwq and making worker_pool the
      top level abstraction.  This patch makes work->data point to the pool
      it was last associated with instead of CPU.
      
      After the previous WORK_OFFQ_POOL_CPU and worker_pool->id additions,
      the conversion is fairly straight-forward.  WORK_OFFQ constants and
      functions are modified to record and read back pool ID instead.
      worker_pool_by_id() is added to allow looking up pool from ID.
      get_work_pool() replaces get_work_gcwq(), which is reimplemented using
      get_work_pool().  get_work_pool_id() replaces work_cpu().
      
      This patch shouldn't introduce any observable behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      7c3eed5c
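      A hedged sketch of the off-queue encoding and lookup the message describes; the constant and helper names follow the message and neighbouring commits, but the bit layout is simplified:

        /* off-queue work->data layout (simplified):
         *   [ pool ID ........ | OFFQ flags | work_struct flag bits ] */
        static int get_work_pool_id(struct work_struct *work)
        {
                unsigned long data = atomic_long_read(&work->data);

                return data >> WORK_OFFQ_POOL_SHIFT;
        }

        static struct worker_pool *worker_pool_by_id(int pool_id)
        {
                return idr_find(&worker_pool_idr, pool_id);
        }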
    • workqueue: add worker_pool->id · 9daf9e67
      Tejun Heo authored
      Add worker_pool->id which is allocated from worker_pool_idr.  This
      will be used to record the last associated worker_pool in work->data.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      9daf9e67
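      A sketch of ID assignment from worker_pool_idr; idr_alloc() and the mutex name are used here for brevity and are assumptions - the idr interface of that kernel era differed slightly:

        static DEFINE_MUTEX(worker_pool_idr_mutex);
        static DEFINE_IDR(worker_pool_idr);

        /* give @pool an ID so work->data can later refer back to it */
        static int worker_pool_assign_id(struct worker_pool *pool)
        {
                int ret;

                mutex_lock(&worker_pool_idr_mutex);
                ret = idr_alloc(&worker_pool_idr, pool, 0, 0, GFP_KERNEL);
                if (ret >= 0)
                        pool->id = ret;
                mutex_unlock(&worker_pool_idr_mutex);

                return ret < 0 ? ret : 0;
        }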
    • workqueue: introduce WORK_OFFQ_CPU_NONE · 715b06b8
      Tejun Heo authored
      Currently, when a work item is off queue, the high bits of its data
      encode the last CPU it was on.  This is scheduled to be changed to
      pool ID, which will make it impossible to use WORK_CPU_NONE to
      indicate no association.
      
      This patch limits the number of bits which are used for off-queue cpu
      number to 31 (so that the max fits in an int) and uses the highest
      possible value - WORK_OFFQ_CPU_NONE - to indicate no association.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      715b06b8
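      Roughly, the constants work out along these lines; this is a sketch that ignores the flag-bit bookkeeping above the work_struct flags:

        enum {
                WORK_OFFQ_CPU_BITS      = 31,   /* keep the max value within an int */
                WORK_OFFQ_CPU_NONE      = (1LU << WORK_OFFQ_CPU_BITS) - 1,
        };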
    • workqueue: make GCWQ_FREEZING a pool flag · 35b6bb63
      Tejun Heo authored
      Make GCWQ_FREEZING a pool flag POOL_FREEZING.  This patch doesn't
      change locking - FREEZING on both pools of a CPU is set or cleared
      together while holding gcwq->lock.  It shouldn't cause any functional
      difference.
      
      This leaves gcwq->flags w/o any flags.  Removed.
      
      While at it, convert BUG_ON()s in freeze_workqueue_begin() and
      thaw_workqueues() to WARN_ON_ONCE().
      
      This is part of an effort to remove global_cwq and make worker_pool
      the top level abstraction, which in turn will help implementing worker
      pools with user-specified attributes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      35b6bb63
    • workqueue: make GCWQ_DISASSOCIATED a pool flag · 24647570
      Tejun Heo authored
      Make GCWQ_DISASSOCIATED a pool flag POOL_DISASSOCIATED.  This patch
      doesn't change locking - DISASSOCIATED on both pools of a CPU is set
      or cleared together while holding gcwq->lock.  It shouldn't cause any
      functional difference.
      
      This is part of an effort to remove global_cwq and make worker_pool
      the top level abstraction, which in turn will help implementing worker
      pools with user-specified attributes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      24647570
    • workqueue: use std_ prefix for the standard per-cpu pools · e34cdddb
      Tejun Heo authored
      There are currently two worker pools per cpu (including the unbound
      cpu) and they are the only pools in use.  A new class of pools is
      scheduled to be added and some pool-related APIs will be added in
      between.  Call the existing pools the standard pools and prefix them
      with std_.  Do this early so that new APIs can use the std_ prefix
      from the beginning.
      
      This patch doesn't introduce any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      e34cdddb
    • workqueue: unexport work_cpu() · e2905b29
      Tejun Heo authored
      This function no longer has any external users.  Unexport it.  It will
      be removed later on.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      e2905b29
  2. 19 Jan 2013 (3 commits)
  3. 18 Jan 2013 (1 commit)
    • workqueue: set PF_WQ_WORKER on rescuers · 111c225a
      Tejun Heo authored
      PF_WQ_WORKER is used to tell the scheduler that the task is a workqueue
      worker and needs wq_worker_sleeping/waking_up() invoked on it for
      concurrency management.  As rescuers never participate in concurrency
      management, PF_WQ_WORKER wasn't set on them.
      
      There's a need for an interface which can query whether %current is
      executing a work item and if so which.  Such interface requires a way
      to identify all tasks which may execute work items and PF_WQ_WORKER
      will be used for that.  As all normal workers always have PF_WQ_WORKER
      set, we only need to add it to rescuers.
      
      As rescuers start with WORKER_PREP but never clear it, it's always
      NOT_RUNNING and there's no need to worry about it interfering with
      concurrency management even if PF_WQ_WORKER is set; however, unlike
      normal workers, rescuers currently don't have their worker struct as
      kthread_data().  They use the associated workqueue_struct instead.
      This is problematic as wq_worker_sleeping/waking_up() expect struct
      worker at kthread_data().
      
      This patch adds worker->rescue_wq, starts rescuer kthreads with the
      worker struct as kthread_data() and sets PF_WQ_WORKER on rescuers.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      111c225a
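      A sketch of the two halves described above - passing the worker struct as kthread_data() at creation time and marking the task in rescuer_thread(); exact placement and helper usage in the real patch may differ:

        /* in __alloc_workqueue_key(), roughly: the rescuer's struct worker
         * becomes its kthread_data() instead of the workqueue_struct */
        rescuer = alloc_worker();
        rescuer->rescue_wq = wq;
        rescuer->task = kthread_create(rescuer_thread, rescuer, "%s", wq->name);

        static int rescuer_thread(void *__rescuer)
        {
                struct worker *rescuer = __rescuer;
                struct workqueue_struct *wq = rescuer->rescue_wq;

                /* rescuers stay NOT_RUNNING (WORKER_PREP) but must still be
                 * recognizable to wq_worker_sleeping/waking_up() */
                current->flags |= PF_WQ_WORKER;
                set_user_nice(current, RESCUER_NICE_LEVEL);

                /* ... rescue loop unchanged ... */
                return 0;
        }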
  4. 20 Dec 2012 (1 commit)
    • workqueue: fix find_worker_executing_work() breakage from hashtable conversion · 023f27d3
      Tejun Heo authored
      42f8570f ("workqueue: use new hashtable implementation") incorrectly
      made busy workers hashed by the pointer value of worker instead of
      work.  This broke find_worker_executing_work() which in turn broke a
      lot of fundamental operations of workqueue - non-reentrancy and
      flushing among others.  The flush malfunction triggered a warning in
      the disk event code in Fengguang's automated test.
      
       write_dev_root_ (3265) used greatest stack depth: 2704 bytes left
       ------------[ cut here ]------------
       WARNING: at /c/kernel-tests/src/stable/block/genhd.c:1574 disk_clear_events+0xcf/0x108()
       Hardware name: Bochs
       Modules linked in:
       Pid: 3328, comm: ata_id Not tainted 3.7.0-01930-gbff6343 #1167
       Call Trace:
        [<ffffffff810997c4>] warn_slowpath_common+0x83/0x9c
        [<ffffffff810997f7>] warn_slowpath_null+0x1a/0x1c
        [<ffffffff816aea77>] disk_clear_events+0xcf/0x108
        [<ffffffff811bd8be>] check_disk_change+0x27/0x59
        [<ffffffff822e48e2>] cdrom_open+0x49/0x68b
        [<ffffffff81ab0291>] idecd_open+0x88/0xb7
        [<ffffffff811be58f>] __blkdev_get+0x102/0x3ec
        [<ffffffff811bea08>] blkdev_get+0x18f/0x30f
        [<ffffffff811bebfd>] blkdev_open+0x75/0x80
        [<ffffffff8118f510>] do_dentry_open+0x1ea/0x295
        [<ffffffff8118f5f0>] finish_open+0x35/0x41
        [<ffffffff8119c720>] do_last+0x878/0xa25
        [<ffffffff8119c993>] path_openat+0xc6/0x333
        [<ffffffff8119cf37>] do_filp_open+0x38/0x86
        [<ffffffff81190170>] do_sys_open+0x6c/0xf9
        [<ffffffff8119021e>] sys_open+0x21/0x23
        [<ffffffff82c1c3d9>] system_call_fastpath+0x16/0x1b
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      023f27d3
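      The gist of the fix, as a sketch of the busy-hash insertion done around process_one_work(); this is illustrative rather than the literal hunk:

        worker->current_work = work;
        worker->current_func = work->func;
        /* 42f8570f keyed this on "worker"; key it on the work item instead so
         * find_worker_executing_work() can locate the executing worker again */
        hash_add(gcwq->busy_hash, &worker->hentry, (unsigned long)work);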
  5. 19 Dec 2012 (2 commits)
    • workqueue: consider work function when searching for busy work items · a2c1c57b
      Tejun Heo authored
      To avoid executing the same work item concurrently, workqueue hashes
      currently busy workers according to their current work items and looks
      up the table when it wants to execute a new work item.  If there
      already is a worker which is executing the new work item, the new item
      is queued to the found worker so that it gets executed only after the
      current execution finishes.
      
      Unfortunately, a work item may be freed while being executed and thus
      recycled for different purposes.  If it gets recycled for a different
      work item and queued while the previous execution is still in
      progress, workqueue may make the new work item wait for the old one
      although the two aren't really related in any way.
      
      In extreme cases, this false dependency may lead to deadlock although
      it's extremely unlikely given that there aren't too many self-freeing
      work item users and they usually don't wait for other work items.
      
      To alleviate the problem, record the current work function in each
      busy worker and match it together with the work item address in
      find_worker_executing_work().  While this isn't complete, it ensures
      that unrelated work items don't interact with each other and in the
      very unlikely case where a twisted wq user triggers it, it's always
      onto itself making the culprit easy to spot.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Andrey Isakov <andy51@gmx.ru>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=51701
      Cc: stable@vger.kernel.org
      a2c1c57b
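      A sketch of the strengthened lookup; it closely follows the description above but is not the literal patch:

        static struct worker *find_worker_executing_work(struct global_cwq *gcwq,
                                                         struct work_struct *work)
        {
                struct worker *worker;

                hash_for_each_possible(gcwq->busy_hash, worker, hentry,
                                       (unsigned long)work)
                        /* the address alone may belong to a recycled item;
                         * require the work function to match as well */
                        if (worker->current_work == work &&
                            worker->current_func == work->func)
                                return worker;

                return NULL;
        }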
    • workqueue: use new hashtable implementation · 42f8570f
      Sasha Levin authored
      Switch workqueues to use the new hashtable implementation. This reduces the
      amount of generic unrelated code in the workqueues.
      
      This patch depends on d9b482c8 ("hashtable: introduce a small and naive
      hashtable") which was merged in v3.6.
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      42f8570f
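      In outline, a sketch of the conversion: the open-coded bucket array in struct global_cwq is replaced by the generic helpers from <linux/hashtable.h> (the old member and function names below are recalled from the pre-conversion code and should be treated as assumptions):

        /* before: struct hlist_head busy_hash[BUSY_WORKER_HASH_SIZE];
         * plus a hand-rolled hashing helper for bucket selection */
        DECLARE_HASHTABLE(busy_hash, BUSY_WORKER_HASH_ORDER);

        /* insertion, removal and lookup then become hash_add(), hash_del()
         * and hash_for_each_possible() respectively */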
  6. 18 Dec 2012 (9 commits)
  7. 14 Dec 2012 (1 commit)
  8. 13 Dec 2012 (3 commits)
  9. 11 Dec 2012 (6 commits)
    • mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node · 5bca2303
      Mel Gorman authored
      Because migrations are driven by the CPU a task is running on, there
      is no point tracking NUMA faults until one task runs on a new node.
      This patch tracks the first node used by an address space. Until
      it changes, PTE scanning is disabled and no NUMA hinting faults are
      trapped. This should help workloads that are short-lived, do not care
      about NUMA placement or have bound themselves to a single node.
      
      This takes advantage of the logic in "mm: sched: numa: Implement slow
      start for working set sampling" to delay when the checks are made. This
      will take advantage of processes that set their CPU and node bindings
      early in their lifetime. It will also potentially allow any initial load
      balancing to take place.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      5bca2303
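      A sketch of the idea; the mm field and constant names below follow my reading of the patch and should be treated as assumptions:

        /* in task_numa_work(), before doing any PTE scanning */
        if (mm->first_nid == NUMA_PTE_SCAN_INIT)
                mm->first_nid = numa_node_id();
        if (mm->first_nid != NUMA_PTE_SCAN_ACTIVE) {
                /* still on the node the address space started on: skip */
                if (numa_node_id() == mm->first_nid)
                        return;

                mm->first_nid = NUMA_PTE_SCAN_ACTIVE;
        }
        /* fall through to the normal PTE scanning path */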
    • mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG · 3105b86a
      Mel Gorman authored
      The "mm: sched: numa: Control enabling and disabling of NUMA balancing"
      depends on scheduling debug being enabled but it's perfectly legimate to
      disable automatic NUMA balancing even without this option. This should
      take care of it.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      3105b86a
    • mm: sched: numa: Control enabling and disabling of NUMA balancing · 1a687c2e
      Mel Gorman authored
      This patch adds Kconfig options and kernel parameters to allow the
      enabling and disabling of automatic NUMA balancing. The existence
      of such a switch was and is very important when debugging problems
      related to transparent hugepages and we should have the same for
      automatic NUMA placement.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      1a687c2e
    • mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate · b8593bfd
      Mel Gorman authored
      The PTE scanning rate and fault rates are two of the biggest sources of
      system CPU overhead with automatic NUMA placement.  Ideally a proper policy
      would detect if a workload was properly placed, schedule and adjust the
      PTE scanning rate accordingly. We do not track the necessary information
      to do that but we at least know if we migrated or not.
      
      This patch scans slower if a page was not migrated as the result of a
      NUMA hinting fault up to sysctl_numa_balancing_scan_period_max which is
      now higher than the previous default. Once every minute it will reset
      the scanner in case of phase changes.
      
      This is hilariously crude and the numbers are arbitrary. Workloads will
      converge quite slowly in comparison to what a proper policy should be able
      to do. On the plus side, we will chew up less CPU for workloads that have
      no need for automatic balancing.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      b8593bfd
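      The gist, as a sketch of the fault-side adjustment; the exact increment and the once-a-minute reset handling are assumptions:

        void task_numa_fault(int node, int pages, bool migrated)
        {
                struct task_struct *p = current;

                /* back off when the hinting fault did not lead to a migration */
                if (!migrated)
                        p->numa_scan_period = min(sysctl_numa_balancing_scan_period_max,
                                                  p->numa_scan_period + jiffies_to_msecs(10));

                task_numa_placement(p);
        }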
    • sched: numa: Slowly increase the scanning period as NUMA faults are handled · fb003b80
      Mel Gorman authored
      Currently the rate of scanning for an address space is controlled
      by the individual tasks. The next scan is simply determined by
      2*p->numa_scan_period.
      
      The 2*p->numa_scan_period is arbitrary and never changes. At this point
      there is still no proper policy that decides if a task or process is
      properly placed. It just scans and assumes the next NUMA fault will
      place it properly. As it is assumed that pages will get properly placed
      over time, increase the scan window each time a fault is incurred. This
      is a big assumption as noted in the comments.
      
      It should be noted that changing to p->numa_scan_period will increase
      system CPU usage because now the scanning rate has effectively doubled.
      If that is a problem then the min_rate should be made 200ms instead of
      restoring the 2* logic.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      fb003b80
    • mm: numa: Rate limit setting of pte_numa if node is saturated · e14808b4
      Mel Gorman authored
      If there are a large number of NUMA hinting faults and all of them
      are resulting in migrations it may indicate that memory is just
      bouncing uselessly around. NUMA balancing cost is likely exceeding
      any benefit from locality. Rate limit the PTE updates if the node
      is migration rate-limited. As noted in the comments, this distorts
      the NUMA faulting statistics.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      e14808b4