1. 10 7月, 2007 8 次提交
  2. 09 6月, 2007 1 次提交
    • A
      pi-futex: fix exit races and locking problems · 778e9a9c
      Alexey Kuznetsov 提交于
      1. New entries can be added to tsk->pi_state_list after task completed
         exit_pi_state_list(). The result is memory leakage and deadlocks.
      
      2. handle_mm_fault() is called under spinlock. The result is obvious.
      
      3. results in self-inflicted deadlock inside glibc.
         Sometimes futex_lock_pi returns -ESRCH, when it is not expected
         and glibc enters to for(;;) sleep() to simulate deadlock. This problem
         is quite obvious and I think the patch is right. Though it looks like
         each "if" in futex_lock_pi() got some stupid special case "else if". :-)
      
      4. sometimes futex_lock_pi() returns -EDEADLK,
         when nobody has the lock. The reason is also obvious (see comment
         in the patch), but correct fix is far beyond my comprehension.
         I guess someone already saw this, the chunk:
      
                              if (rt_mutex_trylock(&q.pi_state->pi_mutex))
                                      ret = 0;
      
         is obviously from the same opera. But it does not work, because the
         rtmutex is really taken at this point: wake_futex_pi() of previous
         owner reassigned it to us. My fix works. But it looks very stupid.
         I would think about removal of shift of ownership in wake_futex_pi()
         and making all the work in context of process taking lock.
      
      From: Thomas Gleixner <tglx@linutronix.de>
      
      Fix 1) Avoid the tasklist lock variant of the exit race fix by adding
          an additional state transition to the exit code.
      
          This fixes also the issue, when a task with recursive segfaults
          is not able to release the futexes.
      
      Fix 2) Cleanup the lookup_pi_state() failure path and solve the -ESRCH
          problem finally.
      
      Fix 3) Solve the fixup_pi_state_owner() problem which needs to do the fixup
          in the lock protected section by using the in_atomic userspace access
          functions.
      
          This removes also the ugly lock drop / unqueue inside of fixup_pi_state()
      
      Fix 4) Fix a stale lock in the error path of futex_wake_pi()
      
      Added some error checks for verification.
      
      The -EDEADLK problem is solved by the rtmutex fixups.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Cc: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      778e9a9c
  3. 24 5月, 2007 2 次提交
    • R
      recalc_sigpending_tsk fixes · 7bb44ade
      Roland McGrath 提交于
      Steve Hawkes discovered a problem where recalc_sigpending_tsk was called in
      do_sigaction but no signal_wake_up call was made, preventing later signals
      from waking up blocked threads with TIF_SIGPENDING already set.
      
      In fact, the few other calls to recalc_sigpending_tsk outside the signals
      code are also subject to this problem in other race conditions.
      
      This change makes recalc_sigpending_tsk private to the signals code.  It
      changes the outside calls, as well as do_sigaction, to use the new
      recalc_sigpending_and_wake instead.
      Signed-off-by: NRoland McGrath <roland@redhat.com>
      Cc: <Steve.Hawkes@motorola.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7bb44ade
    • R
      freezer: fix vfork problem · ba96a0c8
      Rafael J. Wysocki 提交于
      Currently try_to_freeze_tasks() has to wait until all of the vforked processes
      exit and for this reason every user can make it fail.  To fix this problem we
      can introduce the additional process flag PF_FREEZER_SKIP to be used by tasks
      that do not want to be counted as freezable by the freezer and want to have
      TIF_FREEZE set nevertheless.  Then, this flag can be set by tasks using
      sys_vfork() before they call wait_for_completion(&vfork) and cleared after
      they have woken up.  After clearing it, the tasks should call try_to_freeze()
      as soon as possible.
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      Cc: Gautham R Shenoy <ego@in.ibm.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ba96a0c8
  4. 11 5月, 2007 3 次提交
    • D
      signal/timer/event: signalfd core · fba2afaa
      Davide Libenzi 提交于
      This patch series implements the new signalfd() system call.
      
      I took part of the original Linus code (and you know how badly it can be
      broken :), and I added even more breakage ;) Signals are fetched from the same
      signal queue used by the process, so signalfd will compete with standard
      kernel delivery in dequeue_signal().  If you want to reliably fetch signals on
      the signalfd file, you need to block them with sigprocmask(SIG_BLOCK).  This
      seems to be working fine on my Dual Opteron machine.  I made a quick test
      program for it:
      
      http://www.xmailserver.org/signafd-test.c
      
      The signalfd() system call implements signal delivery into a file descriptor
      receiver.  The signalfd file descriptor if created with the following API:
      
      int signalfd(int ufd, const sigset_t *mask, size_t masksize);
      
      The "ufd" parameter allows to change an existing signalfd sigmask, w/out going
      to close/create cycle (Linus idea).  Use "ufd" == -1 if you want a brand new
      signalfd file.
      
      The "mask" allows to specify the signal mask of signals that we are interested
      in.  The "masksize" parameter is the size of "mask".
      
      The signalfd fd supports the poll(2) and read(2) system calls.  The poll(2)
      will return POLLIN when signals are available to be dequeued.  As a direct
      consequence of supporting the Linux poll subsystem, the signalfd fd can use
      used together with epoll(2) too.
      
      The read(2) system call will return a "struct signalfd_siginfo" structure in
      the userspace supplied buffer.  The return value is the number of bytes copied
      in the supplied buffer, or -1 in case of error.  The read(2) call can also
      return 0, in case the sighand structure to which the signalfd was attached,
      has been orphaned.  The O_NONBLOCK flag is also supported, and read(2) will
      return -EAGAIN in case no signal is available.
      
      If the size of the buffer passed to read(2) is lower than sizeof(struct
      signalfd_siginfo), -EINVAL is returned.  A read from the signalfd can also
      return -ERESTARTSYS in case a signal hits the process.  The format of the
      struct signalfd_siginfo is, and the valid fields depends of the (->code &
      __SI_MASK) value, in the same way a struct siginfo would:
      
      struct signalfd_siginfo {
      	__u32 signo;	/* si_signo */
      	__s32 err;	/* si_errno */
      	__s32 code;	/* si_code */
      	__u32 pid;	/* si_pid */
      	__u32 uid;	/* si_uid */
      	__s32 fd;	/* si_fd */
      	__u32 tid;	/* si_fd */
      	__u32 band;	/* si_band */
      	__u32 overrun;	/* si_overrun */
      	__u32 trapno;	/* si_trapno */
      	__s32 status;	/* si_status */
      	__s32 svint;	/* si_int */
      	__u64 svptr;	/* si_ptr */
      	__u64 utime;	/* si_utime */
      	__u64 stime;	/* si_stime */
      	__u64 addr;	/* si_addr */
      };
      
      [akpm@linux-foundation.org: fix signalfd_copyinfo() on i386]
      Signed-off-by: NDavide Libenzi <davidel@xmailserver.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fba2afaa
    • E
      getrusage(): fill ru_inblock and ru_oublock fields if possible · 6eaeeaba
      Eric Dumazet 提交于
      If CONFIG_TASK_IO_ACCOUNTING is defined, we update io accounting counters for
      each task.
      
      This patch permits reporting of values using the well known getrusage()
      syscall, filling ru_inblock and ru_oublock instead of null values.
      
      As TASK_IO_ACCOUNTING currently counts bytes counts, we approximate blocks
      count doing : nr_blocks = nr_bytes / 512
      
      Example of use :
      ----------------------
      After patch is applied, /usr/bin/time command can now give a good
      approximation of IO that the process had to do.
      
      $ /usr/bin/time grep tototo /usr/include/*
      Command exited with non-zero status 1
      0.00user 0.02system 0:02.11elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k
      24288inputs+0outputs (0major+259minor)pagefaults 0swaps
      
      $ /usr/bin/time dd if=/dev/zero of=/tmp/testfile count=1000
      1000+0 enregistrements lus
      1000+0 enregistrements écrits
      512000 octets (512 kB) copiés, 0,00326601 seconde, 157 MB/s
      0.00user 0.00system 0:00.00elapsed 80%CPU (0avgtext+0avgdata 0maxresident)k
      0inputs+3000outputs (0major+299minor)pagefaults 0swaps
      Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6eaeeaba
    • N
      When stacked block devices are in-use (e.g. md or dm), the recursive calls · d89d8796
      Neil Brown 提交于
      to generic_make_request can use up a lot of space, and we would rather they
      didn't.
      
      As generic_make_request is a void function, and as it is generally not
      expected that it will have any effect immediately, it is safe to delay any
      call to generic_make_request until there is sufficient stack space
      available.
      
      As ->bi_next is reserved for the driver to use, it can have no valid value
      when generic_make_request is called, and as __make_request implicitly
      assumes it will be NULL (ELEVATOR_BACK_MERGE fork of switch) we can be
      certain that all callers set it to NULL.  We can therefore safely use
      bi_next to link pending requests together, providing we clear it before
      making the real call.
      
      So, we choose to allow each thread to only be active in one
      generic_make_request at a time.  If a subsequent (recursive) call is made,
      the bio is linked into a per-thread list, and is handled when the active
      call completes.
      
      As the list of pending bios is per-thread, there are no locking issues to
      worry about.
      
      I say above that it is "safe to delay any call...".  There are, however,
      some behaviours of a make_request_fn which would make it unsafe.  These
      include any behaviour that assumes anything will have changed after a
      recursive call to generic_make_request.
      
      These could include:
       - waiting for that call to finish and call it's bi_end_io function.
         md use to sometimes do this (marking the superblock dirty before
         completing a write) but doesn't any more
       - inspecting the bio for fields that generic_make_request might
         change, such as bi_sector or bi_bdev.  It is hard to see a good
         reason for this, and I don't think anyone actually does it.
       - inspecing the queue to see if, e.g. it is 'full' yet.  Again, I
         think this is very unlikely to be useful, or to be done.
      Signed-off-by: NNeil Brown <neilb@suse.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: <dm-devel@redhat.com>
      
      Alasdair G Kergon <agk@redhat.com> said:
      
       I can see nothing wrong with this in principle.
      
       For device-mapper at the moment though it's essential that, while the bio
       mappings may now get delayed, they still get processed in exactly
       the same order as they were passed to generic_make_request().
      
       My main concern is whether the timing changes implicit in this patch
       will make the rare data-corrupting races in the existing snapshot code
       more likely. (I'm working on a fix for these races, but the unfinished
       patch is already several hundred lines long.)
      
       It would be helpful if some people on this mailing list would test
       this patch in various scenarios and report back.
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      d89d8796
  5. 10 5月, 2007 2 次提交
    • R
      rename thread_info to stack · f7e4217b
      Roman Zippel 提交于
      This finally renames the thread_info field in task structure to stack, so that
      the assumptions about this field are gone and archs have more freedom about
      placing the thread_info structure.
      
      Nonbroken archs which have a proper thread pointer can do the access to both
      current thread and task structure via a single pointer.
      
      It'll allow for a few more cleanups of the fork code, from which e.g.  ia64
      could benefit.
      Signed-off-by: NRoman Zippel <zippel@linux-m68k.org>
      [akpm@linux-foundation.org: build fix]
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Ian Molton <spyro@f2s.com>
      Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Roman Zippel <zippel@linux-m68k.org>
      Cc: Greg Ungerer <gerg@uclinux.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
      Cc: Richard Curnow <rc@rc0.org.uk>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
      Cc: Andi Kleen <ak@muc.de>
      Cc: Chris Zankel <chris@zankel.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f7e4217b
    • O
      change kernel threads to ignore signals instead of blocking them · 10ab825b
      Oleg Nesterov 提交于
      Currently kernel threads use sigprocmask(SIG_BLOCK) to protect against
      signals.  This doesn't prevent the signal delivery, this only blocks
      signal_wake_up().  Every "killall -33 kthreadd" means a "struct siginfo"
      leak.
      
      Change kthreadd_setup() to set all handlers to SIG_IGN instead of blocking
      them (make a new helper ignore_signals() for that).  If the kernel thread
      needs some signal, it should use allow_signal() anyway, and in that case it
      should not use CLONE_SIGHAND.
      
      Note that we can't change daemonize() (should die!) in the same way,
      because it can be used along with CLONE_SIGHAND.  This means that
      allow_signal() still should unblock the signal to work correctly with
      daemonize()ed threads.
      
      However, disallow_signal() doesn't block the signal any longer but ignores
      it.
      
      NOTE: with or without this patch the kernel threads are not protected from
      handle_stop_signal(), this seems harmless, but not good.
      Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
      Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      10ab825b
  6. 09 5月, 2007 5 次提交
    • E
      Speed up divides by cpu_power in scheduler · 5517d86b
      Eric Dumazet 提交于
      I noticed expensive divides done in try_to_wakeup() and
      find_busiest_group() on a bi dual core Opteron machine (total of 4 cores),
      moderatly loaded (15.000 context switch per second)
      
      oprofile numbers :
      
      CPU: AMD64 processors, speed 2600.05 MHz (estimated)
      Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit
      mask of 0x00 (No unit mask) count 50000
      samples  %        symbol name
      ...
      613914    1.0498  try_to_wake_up
          834  0.0013 :ffffffff80227ae1:   div    %rcx
      77513  0.1191 :ffffffff80227ae4:   mov    %rax,%r11
      
      608893    1.0413  find_busiest_group
         1841  0.0031 :ffffffff802260bf:       div    %rdi
      140109  0.2394 :ffffffff802260c2:       test   %sil,%sil
      
      Some of these divides can use the reciprocal divides we introduced some
      time ago (currently used in slab AFAIK)
      
      We can assume a load will fit in a 32bits number, because with a
      SCHED_LOAD_SCALE=128 value, its still a theorical limit of 33554432
      
      When/if we reach this limit one day, probably cpus will have a fast
      hardware divide and we can zap the reciprocal divide trick.
      
      Ingo suggested to rename cpu_power to __cpu_power to make clear it should
      not be modified without changing its reciprocal value too.
      
      I did not convert the divide in cpu_avg_load_per_task(), because tracking
      nr_running changes may be not worth it ?  We could use a static table of 32
      reciprocal values but it would add a conditional branch and table lookup.
      
      [akpm@linux-foundation.org: !SMP build fix]
      Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5517d86b
    • S
      sched: dynticks idle load balancing · 46cb4b7c
      Siddha, Suresh B 提交于
      Fix the process idle load balancing in the presence of dynticks.  cpus for
      which ticks are stopped will sleep till the next event wakes it up.
      Potentially these sleeps can be for large durations and during which today,
      there is no periodic idle load balancing being done.
      
      This patch nominates an owner among the idle cpus, which does the idle load
      balancing on behalf of the other idle cpus.  And once all the cpus are
      completely idle, then we can stop this idle load balancing too.  Checks added
      in fast path are minimized.  Whenever there are busy cpus in the system, there
      will be an owner(idle cpu) doing the system wide idle load balancing.
      
      Open items:
      1. Intelligent owner selection (like an idle core in a busy package).
      2. Merge with rcu's nohz_cpu_mask?
      Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      46cb4b7c
    • J
      add touch_all_softlockup_watchdogs() · 04c9167f
      Jeremy Fitzhardinge 提交于
      Add touch_all_softlockup_watchdogs() to allow the softlockup watchdog
      timers on all cpus to be updated.  This is used to prevent sysrq-t from
      generating a spurious watchdog message when generating lots of output.
      
      Softlockup watchdogs use sched_clock() as its timebase, which is inherently
      per-cpu (at least, when it is measuring unstolen time).  Because of this,
      it isn't possible for one CPU to directly update the other CPU's timers,
      but it is possible to tell the other CPUs to do update themselves
      appropriately.
      Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com>
      Acked-by: NChris Lalancette <clalance@redhat.com>
      Signed-off-by: NPrarit Bhargava <prarit@redhat.com>
      Cc: Rick Lindsley <ricklind@us.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      04c9167f
    • R
      <linux/sysdev.h> needs to include <linux/module.h> · 3367b994
      Ralf Baechle 提交于
      sysdev.h uses THIS_MODULE so should include <linux/module.h>.
      
      [akpm@linux-foundation.org: couple of fixes]
      Signed-off-by: NRalf Baechle <ralf@linux-mips.org>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3367b994
    • W
      reduce size of task_struct on 64-bit machines · 97dc32cd
      William Cohen 提交于
      This past week I was playing around with that pahole tool
      (http://oops.ghostprotocols.net:81/acme/dwarves/) and looking at the size
      of various struct in the kernel.  I was surprised by the size of the
      task_struct on x86_64, approaching 4K.  I looked through the fields in
      task_struct and found that a number of them were declared as "unsigned
      long" rather than "unsigned int" despite them appearing okay as 32-bit
      sized fields.  On x86_64 "unsigned long" ends up being 8 bytes in size and
      forces 8 byte alignment.  Is there a reason there a reason they are
      "unsigned long"?
      
      The patch below drops the size of the struct from 3808 bytes (60 64-byte
      cachelines) to 3760 bytes (59 64-byte cachelines).  A couple other fields
      in the task struct take a signficant amount of space:
      
      struct thread_struct       thread;               688
      struct held_lock           held_locks[30];       1680
      
      CONFIG_LOCKDEP is turned on in the .config
      
      [akpm@linux-foundation.org: fix printk warnings]
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      97dc32cd
  7. 28 4月, 2007 1 次提交
  8. 05 3月, 2007 1 次提交
    • C
      [PATCH] sched: remove SMT nice · 69f7c0a1
      Con Kolivas 提交于
      Remove the SMT-nice feature which idles sibling cpus on SMT cpus to
      facilitiate nice working properly where cpu power is shared.  The idling of
      cpus in the presence of runnable tasks is considered too fragile, easy to
      break with outside code, and the complexity of managing this system if an
      architecture comes along with many logical cores sharing cpu power will be
      unworkable.
      
      Remove the associated per_cpu_gain variable in sched_domains used only by
      this code.
      
      Also:
      
        The reason is that with dynticks enabled, this code breaks without yet
        further tweaks so dynticks brought on the rapid demise of this code.  So
        either we tweak this code or kill it off entirely.  It was Ingo's preference
        to kill it off.  Either way this needs to happen for 2.6.21 since dynticks
        has gone in.
      Signed-off-by: NCon Kolivas <kernel@kolivas.org>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      69f7c0a1
  9. 17 2月, 2007 1 次提交
  10. 13 2月, 2007 2 次提交
  11. 12 2月, 2007 1 次提交
  12. 14 12月, 2006 1 次提交
    • R
      [PATCH] PM: Fix SMP races in the freezer · 8a102eed
      Rafael J. Wysocki 提交于
      Currently, to tell a task that it should go to the refrigerator, we set the
      PF_FREEZE flag for it and send a fake signal to it.  Unfortunately there
      are two SMP-related problems with this approach.  First, a task running on
      another CPU may be updating its flags while the freezer attempts to set
      PF_FREEZE for it and this may leave the task's flags in an inconsistent
      state.  Second, there is a potential race between freeze_process() and
      refrigerator() in which freeze_process() running on one CPU is reading a
      task's PF_FREEZE flag while refrigerator() running on another CPU has just
      set PF_FROZEN for the same task and attempts to reset PF_FREEZE for it.  If
      the refrigerator wins the race, freeze_process() will state that PF_FREEZE
      hasn't been set for the task and will set it unnecessarily, so the task
      will go to the refrigerator once again after it's been thawed.
      
      To solve first of these problems we need to stop using PF_FREEZE to tell
      tasks that they should go to the refrigerator.  Instead, we can introduce a
      special TIF_*** flag and use it for this purpose, since it is allowed to
      change the other tasks' TIF_*** flags and there are special calls for it.
      
      To avoid the freeze_process()-refrigerator() race we can make
      freeze_process() to always check the task's PF_FROZEN flag after it's read
      its "freeze" flag.  We should also make sure that refrigerator() will
      always reset the task's "freeze" flag after it's set PF_FROZEN for it.
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      Acked-by: NPavel Machek <pavel@ucw.cz>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Andi Kleen <ak@muc.de>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      8a102eed
  13. 11 12月, 2006 4 次提交
    • C
      [PATCH] sched: remove lb_stopbalance counter · 06066714
      Chen, Kenneth W 提交于
      Remove scheduler stats lb_stopbalance counter.  This counter can be
      calculated by: lb_balanced - lb_nobusyg - lb_nobusyq.  There is no need to
      create gazillion counters while we can derive the value.
      Signed-off-by: NKen Chen <kenneth.w.chen@intel.com>
      Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      06066714
    • S
      [PATCH] sched: decrease number of load balances · 783609c6
      Siddha, Suresh B 提交于
      Currently at a particular domain, each cpu in the sched group will do a
      load balance at the frequency of balance_interval.  More the cores and
      threads, more the cpus will be in each sched group at SMP and NUMA domain.
      And we endup spending quite a bit of time doing load balancing in those
      domains.
      
      Fix this by making only one cpu(first idle cpu or first cpu in the group if
      all the cpus are busy) in the sched group do the load balance at that
      particular sched domain and this load will slowly percolate down to the
      other cpus with in that group(when they do load balancing at lower
      domains).
      Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      783609c6
    • C
      [PATCH] sched: add option to serialize load balancing · 08c183f3
      Christoph Lameter 提交于
      Large sched domains can be very expensive to scan.  Add an option SD_SERIALIZE
      to the sched domain flags.  If that flag is set then we make sure that no
      other such domain is being balanced.
      
      [akpm@osdl.org: build fix]
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Peter Williams <pwil3058@bigpond.net.au>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      08c183f3
    • A
      [PATCH] io-accounting: core statistics · 7c3ab738
      Andrew Morton 提交于
      The present per-task IO accounting isn't very useful.  It simply counts the
      number of bytes passed into read() and write().  So if a process reads 1MB
      from an already-cached file, it is accused of having performed 1MB of I/O,
      which is wrong.
      
      (David Wright had some comments on the applicability of the present logical IO accounting:
      
        For billing purposes it is useless but for workload analysis it is very
        useful
      
        read_bytes/read_calls  average read request size
        write_bytes/write_calls average write request size
      
        read_bytes/read_blocks ie logical/physical can indicate hit rate or thrashing
        write_bytes/write_blocks  ie logical/physical  guess since pdflush writes can
                                                      be missed
      
        I often look for logical larger than physical to see filesystem cache
        problems.  And the bytes/cpusec can help find applications that are
        dominating the cache and causing slow interactive response from page cache
        contention.
      
        I want to find the IO intensive applications and make sure they are doing
        efficient IO.  Thus the acctcms(sysV) or csacms command would give the high
        IO commands).
      
      This patchset adds new accounting which tries to be more accurate.  We account
      for three things:
      
      reads:
      
        attempt to count the number of bytes which this process really did cause
        to be fetched from the storage layer.  Done at the submit_bio() level, so it
        is accurate for block-backed filesystems.  I also attempt to wire up NFS and
        CIFS.
      
      writes:
      
        attempt to count the number of bytes which this process caused to be sent
        to the storage layer.  This is done at page-dirtying time.
      
        The big inaccuracy here is truncate.  If a process writes 1MB to a file
        and then deletes the file, it will in fact perform no writeout.  But it will
        have been accounted as having caused 1MB of write.
      
        So...
      
      cancelled_writes:
      
        account the number of bytes which this process caused to not happen, by
        truncating pagecache.
      
        We _could_ just subtract this from the process's `write' accounting.  But
        that means that some processes would be reported to have done negative
        amounts of write IO, which is silly.
      
        So we just report the raw number and punt this decision up to userspace.
      
      Now, we _could_ account for writes at the physical I/O level.  But
      
      - This would require that we track memory-dirtying tasks at the per-page
        level (would require a new pointer in struct page).
      
      - It would mean that IO statistics for a process are usually only available
        long after that process has exitted.  Which means that we probably cannot
        communicate this info via taskstats.
      
      This patch:
      
      Wire up the kernel-private data structures and the accessor functions to
      manipulate them.
      
      Cc: Jay Lan <jlan@sgi.com>
      Cc: Shailabh Nagar <nagar@watson.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Chris Sturtivant <csturtiv@sgi.com>
      Cc: Tony Ernst <tee@sgi.com>
      Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
      Cc: David Wright <daw@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      7c3ab738
  14. 09 12月, 2006 5 次提交
  15. 08 12月, 2006 3 次提交