1. 09 Jun, 2010 10 commits
    • sched: Change nohz idle load balancing logic to push model · 83cd4fe2
      Venkatesh Pallipadi committed
      In the new push model, all idle CPUs go into nohz mode. There is still
      the concept of an idle load balancer (performing the load balancing on
      behalf of all the idle CPUs in the system). A busy CPU kicks the nohz
      balancer when any of the nohz CPUs needs idle load balancing. The kicked
      CPU does the idle load balancing on behalf of all idle CPUs instead of
      the normal idle balance.
      
      This addresses the two problems below with the current nohz ilb logic:
      * The idle load balancer continued to have periodic ticks during idle and
        woke up frequently, even though it did not have any rebalancing to do on
        behalf of any of the idle CPUs.
      * On x86 and CPUs that have APIC timer stoppage on idle CPUs, this
        periodic wakeup can result in a periodic additional interrupt on a CPU
        doing the timer broadcast.
      
      Also, currently we migrate the unpinned timers from an idle CPU to the CPU
      doing idle load balancing. (When all the CPUs in the system are idle,
      there is no idle load balancing CPU and timers get added to the same idle
      CPU where the request was made, so the existing optimization works only on
      a semi-idle system.)
      
      And in a semi-idle system, we no longer have periodic ticks on the idle
      load balancer CPU. Using that CPU will add more delays to the timers than
      intended (as that CPU's timer base may not be up to date wrt jiffies
      etc). This was causing mysterious slowdowns during boot etc.
      
      For now, in the semi-idle case, use the nearest busy CPU for migrating
      timers from an idle CPU. This is good for power savings anyway.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      LKML-Reference: <1274486981.2840.46.camel@sbs-t61.sc.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
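The push model described in the commit above can be pictured with a small userspace model. This is only a sketch under stated assumptions: `struct nohz_model`, `pick_ilb_cpu()` and `busy_cpu_kick()` are hypothetical names, the bitmask stands in for whatever the scheduler uses to track nohz CPUs, and incrementing a counter stands in for the real wakeup mechanism.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 16  /* illustrative; must stay below 32 for the uint32_t mask */

/* Hypothetical model state, not the kernel's data structures. */
struct nohz_model {
    uint32_t idle_mask;   /* bit n set: CPU n is idle and in nohz mode */
    int kicks[NR_CPUS];   /* how many times each CPU has been kicked */
};

/* Pick the first nohz-idle CPU to act as idle load balancer, or -1. */
static int pick_ilb_cpu(const struct nohz_model *m)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (m->idle_mask & (UINT32_C(1) << cpu))
            return cpu;
    return -1;
}

/* Called on behalf of a busy CPU: push the balancing work to one idle CPU. */
static int busy_cpu_kick(struct nohz_model *m, int busy_cpu)
{
    assert(busy_cpu >= 0 && busy_cpu < NR_CPUS);
    assert(!(m->idle_mask & (UINT32_C(1) << busy_cpu))); /* kicker is busy */

    int ilb = pick_ilb_cpu(m);
    if (ilb >= 0)
        m->kicks[ilb]++;  /* stands in for waking the chosen CPU */
    return ilb;
}
```

In this model no idle CPU keeps a periodic tick: an idle CPU only does work when a busy CPU selects it, which is the property the commit message is after.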
    • sched: Avoid side-effect of tickless idle on update_cpu_load · fdf3e95d
      Venkatesh Pallipadi committed
      Tickless idle has a negative side effect on update_cpu_load(), which
      in turn can affect load balancing behavior.
      
      update_cpu_load() is supposed to be called every tick, to keep track
      of various load indices. With tickless idle, there are no scheduler
      ticks called on the idle CPUs. Idle CPUs may still do load balancing
      (with the idle_load_balance CPU) using the stale cpu_load. It will also
      cause problems when all CPUs go idle for a while and become active
      again. In this case loads would not degrade as expected.
      
      This is what the change in rq->nr_load_updates looks like under
      different conditions:
      
      <cpu_num> <nr_load_updates change>
      All CPUs idle for 10 seconds (HZ=1000)
      0 1621
      10 496
      11 139
      12 875
      13 1672
      14 12
      15 21
      1 1472
      2 2426
      3 1161
      4 2108
      5 1525
      6 701
      7 249
      8 766
      9 1967
      
      One CPU busy, rest idle for 10 seconds
      0 10003
      10 601
      11 95
      12 966
      13 1597
      14 114
      15 98
      1 3457
      2 93
      3 6679
      4 1425
      5 1479
      6 595
      7 193
      8 633
      9 1687
      
      All CPUs busy for 10 seconds
      0 10026
      10 10026
      11 10026
      12 10026
      13 10025
      14 10025
      15 10025
      1 10026
      2 10026
      3 10026
      4 10026
      5 10026
      6 10026
      7 10026
      8 10026
      9 10026
      
      That is, update_cpu_load() works properly only when all CPUs are busy.
      If all are idle, all the CPUs get far fewer updates.  And when a few
      CPUs are busy and the rest are idle, only the busy and ilb CPUs do proper
      updates and the rest of the idle CPUs do fewer updates.
      
      The patch keeps track of when the last update was done and fixes up
      the load avg based on the current time.
      
      On one of my test systems, running SPECjbb with warehouse 1..numcpus,
      the patch improves throughput numbers by ~1% (average of 6 runs).  On
      another test system (with a different domain hierarchy) there is no
      noticeable change in perf.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      LKML-Reference: <AANLkTilLtDWQsAUrIxJ6s04WTgmw9GuOODc5AOrYsaR5@mail.gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
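The "fix up the load avg based on the current time" idea from the commit above can be sketched as follows. The assumptions are labeled: the per-tick rule used here is an exponential average with weight 1/2^idx, a missed idle tick is treated as a tick with zero new load, and `struct load_model`, `decay_missed_ticks()` and `catch_up()` are hypothetical names. The loop is deliberately naive; it is not claimed to be how the patch itself computes the decay.

```c
#include <assert.h>

#define NR_LOAD_IDX 5  /* illustrative number of load indices */

/* Hypothetical per-runqueue state for this sketch. */
struct load_model {
    unsigned long long load[NR_LOAD_IDX];
    unsigned long last_update;  /* tick count of the last update */
};

/*
 * Apply `missed` idle ticks to one load index. With zero new load,
 * each tick computes: load = load * (2^idx - 1) / 2^idx.
 */
static unsigned long long decay_missed_ticks(unsigned long long load,
                                             unsigned int idx,
                                             unsigned long missed)
{
    assert(idx < 16);  /* keep the multiplication well inside 64 bits */
    unsigned long long scale = 1ULL << idx;

    while (missed-- > 0 && load > 0)
        load = load * (scale - 1) / scale;
    return load;
}

/* Bring every index up to date for the ticks skipped while idle. */
static void catch_up(struct load_model *m, unsigned long now)
{
    unsigned long missed = now - m->last_update;

    for (unsigned int idx = 0; idx < NR_LOAD_IDX; idx++)
        m->load[idx] = decay_missed_ticks(m->load[idx], idx, missed);
    m->last_update = now;
}
```

With this catch-up step, a CPU that slept through many ticks sees its load figures degrade as if the ticks had happened, instead of balancing on stale values.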
    • sched: Simplify the reacquire_kernel_lock() logic · 246d86b5
      Oleg Nesterov committed
      - Contrary to what 6d558c3a says, there is no need to reload
        prev = rq->curr after the context switch. You always schedule
        back to where you came from, prev must be equal to current
        even if cpu/rq was changed.
      
      - This also means reacquire_kernel_lock() can use prev instead
        of current.
      
      - No need to reassign switch_count if reacquire_kernel_lock()
        reports need_resched(), we can just move the initial assignment
        down, under the "need_resched_nonpreemptible:" label.
      
      - Try to update the comment after context_switch().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20100519125711.GA30199@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched_clock: Add local_clock() API and improve documentation · c676329a
      Peter Zijlstra committed
      For people who otherwise get to write: cpu_clock(smp_processor_id()),
      there is now: local_clock().
      
      Also, as per suggestion from Andrew, provide some documentation on
      the various clock interfaces, and minimize the unsigned long long vs
      u64 mess.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jens Axboe <jaxboe@fusionio.com>
      LKML-Reference: <1275052414.1645.52.camel@laptop>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
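The relationship between the two calls named in the commit above can be shown with userspace stand-ins. This is a sketch, not the kernel implementation: `fake_clock_ns` and `fake_current_cpu` are invented for illustration, and the three functions only mimic the names from the commit message so that the equivalence `local_clock() == cpu_clock(smp_processor_id())` is visible.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 4  /* illustrative */

/* Invented test fixtures: a fake per-CPU clock and a fake "current CPU". */
static uint64_t fake_clock_ns[NR_CPUS];
static int fake_current_cpu;

/* Stand-in for the kernel helper of the same name. */
static int smp_processor_id(void)
{
    return fake_current_cpu;
}

/* Stand-in: read the clock of a given CPU, in nanoseconds. */
static uint64_t cpu_clock(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    return fake_clock_ns[cpu];
}

/* The convenience wrapper the commit describes. */
static uint64_t local_clock(void)
{
    return cpu_clock(smp_processor_id());
}
```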
    • I
    • sched: add hooks for workqueue · 21aa9af0
      Tejun Heo committed
      Concurrency managed workqueue needs to know when workers are going to
      sleep and waking up.  Using these two hooks, cmwq keeps track of the
      current concurrency level and throttles execution of new works if it's
      too high and wakes up another worker from the sleep hook if it becomes
      too low.
      
      This patch introduces PF_WQ_WORKER to identify workqueue workers and
      adds the following two hooks.
      
      * wq_worker_waking_up(): called when a worker is woken up.
      
      * wq_worker_sleeping(): called when a worker is going to sleep and may
        return a pointer to a local task which should be woken up.  The
        returned task is woken up using try_to_wake_up_local() which is
        simplified ttwu which is called under rq lock and can only wake up
        local tasks.
      
      Both hooks are currently defined as noop in kernel/workqueue_sched.h.
      Later cmwq implementation will replace them with proper
      implementation.
      
      These hooks are hard coded as they'll always be enabled.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Ingo Molnar <mingo@elte.hu>
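How two such hooks could keep track of a concurrency level is sketched below. Everything here is an assumption made for illustration: `struct pool_model` and the `model_` functions are hypothetical, the hooks in the commit itself are no-ops, and the "wake another worker when nobody is running and work is pending" policy is just one plausible reading of "if it becomes too low".

```c
#include <assert.h>
#include <stddef.h>

struct worker {
    int id;
};

/* Hypothetical per-CPU pool state for this sketch. */
struct pool_model {
    int nr_running;              /* workers currently runnable */
    int nr_pending;              /* queued work items */
    struct worker *idle_worker;  /* a sleeping worker that may be woken, or NULL */
};

/* Waking hook: one more worker is runnable. */
static void model_worker_waking_up(struct pool_model *p)
{
    p->nr_running++;
}

/*
 * Sleeping hook: one fewer worker is runnable. If that leaves nobody
 * running while work is still pending, return a local worker that the
 * caller should wake; otherwise return NULL.
 */
static struct worker *model_worker_sleeping(struct pool_model *p)
{
    assert(p->nr_running > 0);
    p->nr_running--;

    if (p->nr_running == 0 && p->nr_pending > 0)
        return p->idle_worker;
    return NULL;
}
```

Returning the task to wake, rather than waking it inside the hook, mirrors the description above: the caller already holds the runqueue lock and can use a local-only wakeup.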
    • sched: refactor try_to_wake_up() · 9ed3811a
      Tejun Heo committed
      Factor ttwu_activate() and ttwu_woken_up() out of try_to_wake_up().
      The factoring out doesn't affect try_to_wake_up() much
      code-generation-wise.  Depending on configuration options, it ends up
      generating the same object code as before or slightly different one
      due to different register assignment.
      
      This is to help future implementation of try_to_wake_up_local().
      
      Mike Galbraith suggested rename to ttwu_post_activation() from
      ttwu_woken_up() and comment update in try_to_wake_up().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Ingo Molnar <mingo@elte.hu>
    • sched: adjust when cpu_active and cpuset configurations are updated during cpu on/offlining · 3a101d05
      Tejun Heo committed
      Currently, when a cpu goes down, cpu_active is cleared before
      CPU_DOWN_PREPARE starts and cpuset configuration is updated from a
      default priority cpu notifier.  When a cpu is coming up, it's set
      before CPU_ONLINE but cpuset configuration again is updated from the
      same cpu notifier.
      
      For cpu notifiers, this presents an inconsistent state.  Threads which
      a CPU_DOWN_PREPARE notifier expects to be bound to the CPU can be
      migrated to other cpus because the cpu is no longer active.
      
      Fix it by updating cpu_active in the highest priority cpu notifier and
      cpuset configuration in the second highest when a cpu is coming up.
      Down path is updated similarly.  This guarantees that all other cpu
      notifiers see consistent cpu_active and cpuset configuration.
      
      cpuset_track_online_cpus() notifier is converted to
      cpuset_update_active_cpus() which just updates the configuration and
      now called from cpuset_cpu_[in]active() notifiers registered from
      sched_init_smp().  If cpuset is disabled, cpuset_update_active_cpus()
      degenerates into partition_sched_domains() making separate notifier
      for !CONFIG_CPUSETS unnecessary.
      
      This problem is triggered by cmwq.  During CPU_DOWN_PREPARE, hotplug
      callback creates a kthread and kthread_bind()s it to the target cpu,
      and the thread is expected to run on that cpu.
      
      * Ingo's test discovered __cpuinit/exit markups were incorrect.
        Fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Paul Menage <menage@google.com>
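The ordering guarantee described in the commit above (cpu_active first, cpuset second, everyone else afterwards on the up path) can be illustrated with a priority-sorted notifier list. This is a sketch under assumptions: the `MODEL_PRI_*` names and their numeric values are invented, `struct notifier` is a simplified stand-in, and sorting with `qsort()` is used only to make the ordering visible.

```c
#include <assert.h>
#include <stdlib.h>

/* Invented priorities; only their relative order matters here. */
enum model_pri {
    MODEL_PRI_DEFAULT = 0,
    MODEL_PRI_CPUSET_ACTIVE = 99,
    MODEL_PRI_SCHED_ACTIVE = 100
};

struct notifier {
    const char *name;
    int priority;
};

/* Comparator: higher priority sorts first. */
static int by_priority_desc(const void *a, const void *b)
{
    const struct notifier *x = a;
    const struct notifier *y = b;

    return (y->priority > x->priority) - (y->priority < x->priority);
}

/* Order the chain so that it can be walked from highest priority down. */
static void sort_notifiers(struct notifier *chain, size_t n)
{
    assert(chain != NULL && n > 0);
    qsort(chain, n, sizeof(*chain), by_priority_desc);
}
```

Walking the sorted chain on CPU-up runs the cpu_active update before the cpuset update, and both before any default-priority notifier, so later notifiers observe a consistent state.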
    • sched: define and use CPU_PRI_* enums for cpu notifier priorities · 50a323b7
      Tejun Heo committed
      Instead of hardcoding priorities 10 and 20 in sched and perf, collect
      them into CPU_PRI_* enums.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
    • sched: Fix PROVE_RCU vs cpu_cgroup · dc61b1d6
      Peter Zijlstra committed
      PROVE_RCU has a few issues with the cpu_cgroup because the scheduler
      typically holds rq->lock around the css rcu derefs but the generic
      cgroup code doesn't (and can't) know about that lock.
      
      Provide means to add extra checks to the css dereference and use that
      in the scheduler to annotate its users.
      
      The addition of rq->lock to these checks is correct because the
      cgroup_subsys::attach() method takes the rq->lock for each task it
      moves, therefore by holding that lock, we ensure the task is pinned to
      the current cgroup and the RCU dereference is valid.
      
      That leaves one genuine race in __sched_setscheduler() where we used
      task_group() without holding any of the required locks and thus raced
      with the cgroup code. Solve this by moving the check under the
      appropriate lock.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  2. 08 Jun, 2010 6 commits
  3. 07 Jun, 2010 2 commits
  4. 06 Jun, 2010 4 commits
  5. 05 Jun, 2010 18 commits
    • ext4: Fix remaining racy updates of EXT4_I(inode)->i_flags · 84a8dce2
      Dmitry Monakhov committed
      A few functions were still modifying i_flags in a racy manner.
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs · 6c5de280
      Linus Torvalds committed
      * 'for-linus' of git://oss.sgi.com/xfs/xfs:
        xfs: improve xfs_isilocked
        xfs: skip writeback from reclaim context
        xfs: remove done roadmap item from xfs-delayed-logging-design.txt
        xfs: fix race in inode cluster freeing failing to stale inodes
        xfs: fix access to upper inodes without inode64
        xfs: fix might_sleep() warning when initialising per-ag tree
        fs/xfs/quota: Add missing mutex_unlock
        xfs: remove duplicated #include
        xfs: convert more trace events to DEFINE_EVENT
        xfs: xfs_trace.c: remove duplicated #include
        xfs: Check new inode size is OK before preallocating
        xfs: clean up xlog_align
        xfs: cleanup log reservation calculactions
        xfs: be more explicit if RT mount fails due to config
        xfs: replace E2BIG with EFBIG where appropriate
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 · ed7dc1df
      Linus Torvalds committed
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (30 commits)
        X25: remove duplicated #include
        tcp: use correct net ns in cookie_v4_check()
        rps: tcp: fix rps_sock_flow_table table updates
        ppp_generic: fix multilink fragment sizes
        syncookies: remove Kconfig text line about disabled-by-default
        ixgbe: only check pfc bits in hang logic if pfc is enabled
        net: check for refcount if pop a stacked dst_entry
        ixgbe: return IXGBE_ERR_RAR_INDEX when out of range
        act_pedit: access skb->data safely
        sfc: Store port number in net_device::dev_id
        epic100: Test __BIG_ENDIAN instead of (non-existent) CONFIG_BIG_ENDIAN
        tehuti: return -EFAULT on copy_to_user errors
        isdn/kcapi: return -EFAULT on copy_from_user errors
        e1000e: change logical negate to bitwise
        sfc: Get port number from CS_PORT_NUM, not PCI function number
        cls_u32: use skb_header_pointer() to dereference data safely
        TCP: tcp_hybla: Fix integer overflow in slow start increment
        act_nat: fix the wrong checksum when addr isn't in old_addr/mask
        net/fec: fix pm to survive to suspend/resume
        korina: count RX DMA OVR as rx_fifo_error
        ...
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2 · 7926e0bf
      Linus Torvalds committed
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
        nilfs2: remove obsolete declarations of cache constructor and destructor
        nilfs2: fix style issue in nilfs_destroy_cachep
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 · 7f0d384c
      Linus Torvalds committed
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
        Minix: Clean up left over label
        fix truncate inode time modification breakage
        fix setattr error handling in sysfs, configfs
        fcntl: return -EFAULT if copy_to_user fails
        wrong type for 'magic' argument in simple_fill_super()
        fix the deadlock in qib_fs
        mqueue doesn't need make_bad_inode()
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus · 90ec7819
      Linus Torvalds committed
      * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
        module: fix bne2 "gave up waiting for init of module libcrc32c"
        module: verify_export_symbols under the lock
        module: move find_module check to end
        module: make locking more fine-grained.
        module: Make module sysfs functions private.
        module: move sysfs exposure to end of load_module
        module: fix kdb's illicit use of struct module_use.
        module: Make the 'usage' lists be two-way
    • module: fix bne2 "gave up waiting for init of module libcrc32c" · 9bea7f23
      Rusty Russell committed
      Problem: it's hard to avoid an init routine stumbling over a
      request_module these days.  And it's not clear it's always a bad idea:
      for example, a module like kvm with dynamic dependencies on kvm-intel
      or kvm-amd would be neater if it could simply request_module the right
      one.
      
      In this particular case, it's libcrc32c:
      
      	libcrc32c_mod_init
      	 crypto_alloc_shash
      	  crypto_alloc_tfm
      	   crypto_find_alg
      	    crypto_alg_mod_lookup
      	     crypto_larval_lookup
      	      request_module
      
      If another module is waiting inside resolve_symbol() for libcrc32c to
      finish initializing (ie. bne2 depends on libcrc32c) then it does so
      holding the module lock, and our request_module() can't make progress
      until that is released.
      
      Waiting inside resolve_symbol() without the lock isn't all that hard:
      we just need to pass the -EBUSY up the call chain so we can sleep
      where we don't hold the lock.  Error reporting is a bit trickier: we
      need to copy the name of the unfinished module before releasing the
      lock.
      
      Other notes:
      1) This also fixes a theoretical issue where a weak dependency would allow
         symbol version mismatches to be ignored.
      2) We rename use_module to ref_module to make life easier for the only
         external user (the out-of-tree ksplice patches).
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Tim Abbot <tabbott@ksplice.com>
      Tested-by: Brandon Philips <bphilips@suse.de>
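The "pass -EBUSY up the call chain and copy the name before releasing the lock" approach from the commit above can be sketched in a single-threaded model. The assumptions are explicit: `struct mod`, `resolve_symbol_locked()` and `resolve_symbol()` here are simplified illustrations rather than the kernel functions, and a boolean flag stands in for the module lock.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

enum mod_state { MOD_COMING, MOD_LIVE };

struct mod {
    const char *name;
    enum mod_state state;
};

static bool lock_held;  /* stands in for the module lock */

/*
 * Runs with the lock held and never sleeps. If the owner is still
 * initializing, copy its name out for error reporting and return -EBUSY.
 */
static int resolve_symbol_locked(const struct mod *owner,
                                 char *busy_name, size_t len)
{
    assert(lock_held);
    if (owner->state != MOD_LIVE) {
        snprintf(busy_name, len, "%s", owner->name);
        return -EBUSY;
    }
    return 0;
}

/* Takes and drops the lock; on -EBUSY the caller may sleep and retry. */
static int resolve_symbol(const struct mod *owner, char *busy_name, size_t len)
{
    lock_held = true;
    int err = resolve_symbol_locked(owner, busy_name, len);
    lock_held = false;
    return err;
}
```

Because the name is copied while the lock is still held, the caller can report which module it is waiting for after the lock has been dropped, without touching the other module's data again.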
    • module: verify_export_symbols under the lock · be593f4c
      Rusty Russell committed
      It disabled preempt so it was "safe", but nothing stops another module
      slipping in before this module is added to the global list now we don't
      hold the lock the whole time.
      
      So we check this just after we check for duplicate modules, and just
      before we put the module in the global list.
      
      (find_symbol finds symbols in coming and going modules, too).
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
    • module: move find_module check to end · 3bafeb62
      Linus Torvalds committed
      I think Rusty may have made the lock a bit _too_ finegrained there, and
      didn't add it to some places that needed it. It looks, for example, like
      PATCH 1/2 actually drops the lock in places where it's needed
      ("find_module()" is documented to need it, but now load_module() didn't
      hold it at all when it did the find_module()).
      
      Rather than adding a new "module_loading" list, I think we should be able
      to just use the existing "modules" list, and just fix up the locking a
      bit.
      
      In fact, maybe we could just move the "look up existing module" a bit
      later - optimistically assuming that the module doesn't exist, and then
      just undoing the work if it turns out that we were wrong, just before
      adding ourselves to the list.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
    • module: make locking more fine-grained. · 75676500
      Rusty Russell committed
      Kay Sievers <kay.sievers@vrfy.org> reports that we still have some
      contention over module loading which is slowing boot.
      
      Linus also disliked a previous "drop lock and regrab" patch to fix the
      bne2 "gave up waiting for init of module libcrc32c" message.
      
      This is more ambitious: we only grab the lock where we need it.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Brandon Philips <brandon@ifup.org>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
    • module: Make module sysfs functions private. · 6407ebb2
      Rusty Russell committed
      These were placed in the header in ef665c1a to get the various
      SYSFS/MODULE config combinations to compile.
      
      That may have been necessary then, but it's not now.  These functions
      are all local to module.c.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
    • module: move sysfs exposure to end of load_module · 80a3d1bb
      Rusty Russell committed
      This means a little extra work, but is more logical: we don't put
      anything in sysfs until we're about to put the module into the
      global list and parse its parameters.
      
      This also gives us a logical place to put duplicate module detection
      in the next patch.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
    • module: fix kdb's illicit use of struct module_use. · c8e21ced
      Rusty Russell committed
      Linus changed the structure, and luckily this didn't compile any more.
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jason Wessel <jason.wessel@windriver.com>
      Cc: Martin Hicks <mort@sgi.com>
    • module: Make the 'usage' lists be two-way · 2c02dfe7
      Linus Torvalds committed
      When adding a module that depends on another one, we used to create a
      one-way list of "modules_which_use_me", so that module unloading could
      see who needs a module.
      
      It's actually quite simple to make that list go both ways: so that we
      not only can see "who uses me", but also see a list of modules that are
      "used by me".
      
      In fact, we always wanted that list in "module_unload_free()": when we
      unload a module, we want to also release all the other modules that are
      used by that module.  But because we didn't have that list, we used to
      first iterate over all modules, and then iterate over each "used by me"
      list of that module.
      
      By making the list two-way, we simplify module_unload_free(), and it
      allows for some trivial fixes later too.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (cleaned & rebased)
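The benefit of keeping both directions of the usage relation can be sketched with fixed-size arrays. This is a simplified model under assumptions: `struct module_model`, `add_use()`, `drop_used_by()` and `unload_free()` are hypothetical, and plain arrays are used here in place of whatever list structure the real code uses.

```c
#include <assert.h>

#define MAX_LINKS 8  /* illustrative capacity */

struct module_model {
    const char *name;
    struct module_model *uses[MAX_LINKS];     /* modules used by me */
    int nr_uses;
    struct module_model *used_by[MAX_LINKS];  /* modules which use me */
    int nr_used_by;
};

/* Record the dependency in both directions at once. */
static void add_use(struct module_model *user, struct module_model *target)
{
    assert(user->nr_uses < MAX_LINKS && target->nr_used_by < MAX_LINKS);
    user->uses[user->nr_uses++] = target;
    target->used_by[target->nr_used_by++] = user;
}

/* Remove `user` from `target`'s "who uses me" list, if present. */
static void drop_used_by(struct module_model *target, struct module_model *user)
{
    for (int i = 0; i < target->nr_used_by; i++) {
        if (target->used_by[i] == user) {
            target->used_by[i] = target->used_by[--target->nr_used_by];
            return;
        }
    }
}

/* Unload: walk only my own "used by me" list, no scan over all modules. */
static void unload_free(struct module_model *mod)
{
    for (int i = 0; i < mod->nr_uses; i++)
        drop_used_by(mod->uses[i], mod);
    mod->nr_uses = 0;
}
```

With only the one-way list, `unload_free()` would have to visit every module and search each of their lists for `mod`; the second direction turns that into a walk over `mod->uses` alone.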
    • X25: remove duplicated #include · ca733594
      Huang Weiyi committed
      Remove duplicated #include('s) in drivers/net/wan/x25_asy.c
      Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: use correct net ns in cookie_v4_check() · c4464921
      Eric Dumazet committed
      It's better to make a route lookup in the appropriate namespace.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rps: tcp: fix rps_sock_flow_table table updates · ca55158c
      Eric Dumazet committed
      I believe a moderate SYN flood attack can corrupt the RFS flow table
      (rps_sock_flow_table), making RPS/RFS much less effective.
      
      Even in a normal situation, a server handling short-lived sessions
      suffers from bad steering for the first data packet of a session if
      another SYN packet is received for another session.
      
      We do the following in tcp_v4_rcv():
      
      	sock_rps_save_rxhash(sk, skb->rxhash);
      
      We should _not_ do this if sk is a LISTEN socket, as nearly every
      packet received on a LISTEN socket has a different rxhash than the
      previous one.
       -> RPS_NO_CPU markers are spread all over rps_sock_flow_table.
      
      Also, it makes sense to protect sk->rxhash field changes with the socket
      lock. (We can currently change it even if a user thread owns the lock
      and might use rxhash.)
      
      This patch moves sock_rps_save_rxhash() to a sock locked section,
      and only for non-LISTEN sockets.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
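The rule stated in the commit above (record the rxhash only under the socket lock, and never for a LISTEN socket) can be sketched with a small model. The types and the function name below are hypothetical stand-ins: `struct sock_model` is not the kernel's socket structure, and `save_rxhash_if_allowed()` only illustrates the guard, not the real call site.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum sk_state_model { SK_LISTEN, SK_ESTABLISHED };

/* Hypothetical, heavily reduced socket for this sketch. */
struct sock_model {
    enum sk_state_model state;
    bool locked;       /* true while the socket lock is held */
    uint32_t rxhash;   /* last recorded receive hash */
};

/*
 * Record the packet's rxhash, but only in a locked section and only for
 * non-LISTEN sockets. Returns true if the hash was stored.
 */
static bool save_rxhash_if_allowed(struct sock_model *sk, uint32_t rxhash)
{
    assert(sk->locked);  /* caller must hold the socket lock */

    if (sk->state == SK_LISTEN)
        return false;    /* each SYN would otherwise overwrite the hash */

    sk->rxhash = rxhash;
    return true;
}
```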
    • ppp_generic: fix multilink fragment sizes · 536e00e5
      Ben McKeegan committed
      Fix a bug in the multilink fragment size calculation introduced by
      commit 9c705260 ("ppp: ppp_mp_explode() redesign").
      Signed-off-by: Ben McKeegan <ben@netservers.co.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>