1. 25 Feb 2009, 3 commits
    • generic-ipi: remove CSD_FLAG_WAIT · 6e275637
      Authored by Peter Zijlstra
      Oleg noticed that we don't strictly need CSD_FLAG_WAIT; rework
      the code so that we can use CSD_FLAG_LOCK for both purposes.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      6e275637
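      A minimal sketch of the single-flag idea, assuming simplified field
      names and eliding the queueing and most barriers (this is not the
      actual kernel/smp.c code): the caller marks the slot busy with
      CSD_FLAG_LOCK, and for a synchronous call it simply spins until the
      handler releases that same flag, so a separate CSD_FLAG_WAIT is no
      longer needed.

      /* Hedged sketch only; helpers and layout are simplified. */
      #define CSD_FLAG_LOCK   0x02

      struct call_single_data {
          struct list_head list;      /* queue linkage */
          void (*func)(void *info);
          void *info;
          unsigned int flags;
      };

      /* Caller side: claim the slot, queue it, and (for wait != 0) reuse
       * the very same flag as the completion signal. */
      static void csd_lock_queue_and_wait(struct call_single_data *csd, int wait)
      {
          while (csd->flags & CSD_FLAG_LOCK)      /* previous user still active */
              cpu_relax();
          csd->flags = CSD_FLAG_LOCK;

          /* ... queue csd on the target cpu's list and send the IPI ... */

          if (wait)
              while (csd->flags & CSD_FLAG_LOCK)  /* handler clears it when done */
                  cpu_relax();
      }

      /* Handler side: run the callback, then release the slot; the release
       * doubles as "done" for a waiting caller. */
      static void csd_handle(struct call_single_data *csd)
      {
          csd->func(csd->info);
          smp_mb();                   /* publish the callback's effects first */
          csd->flags &= ~CSD_FLAG_LOCK;
      }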
    • generic-ipi: remove kmalloc() · 8969a5ed
      Authored by Peter Zijlstra
      Remove the use of kmalloc() from the smp_call_function_*()
      calls.
      
      Steven's generic-ipi patch (d7240b98: generic-ipi: use per cpu
      data for single cpu ipi calls) started the discussion on the use
      of kmalloc() in this code and fixed the
      smp_call_function_single(.wait=0) fallback case.
      
      In this patch we complete this by also providing means for the
      _many() call, which fully removes the need for kmalloc() in this
      code.
      
      The problem with the _many() call is that other cpus might still
      be observing our entry when we're done with it. The old code solved
      this by dynamically allocating data elements and RCU-freeing them.
      
      We solve it by using a single per-cpu entry which provides
      static storage and solves one half of the problem (avoiding
      referencing freed data).
      
      The other half, ensuring that queue iteration is still possible,
      is done by placing re-used entries at the head of the list. This
      means that if someone was still iterating that entry when it got
      moved, he will now re-visit the entries on the list he had already
      seen, but this avoids skipping over entries, as would have happened
      had we placed the new entry at the end.
      
      Furthermore, visiting entries twice is not a problem, since we
      remove our cpu from the entry's cpumask once it has been called.
      
      Many thanks to Oleg for his suggestions and for poking holes in
      my earlier attempts.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      8969a5ed
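      A hedged sketch of the _many() technique described above, reusing the
      call_single_data sketch from the previous entry; the list, lock and
      function names here are illustrative, not the real kernel/smp.c
      symbols, and the per-entry locking around refs and the cpumask is
      elided.

      struct call_function_data {
          struct call_single_data csd;
          struct cpumask cpumask;     /* cpus that still have to run func */
          int refs;
          struct list_head list;
      };

      static DEFINE_PER_CPU(struct call_function_data, cfd_data);
      static LIST_HEAD(call_function_queue);          /* illustrative */
      static DEFINE_SPINLOCK(call_function_lock);     /* illustrative */

      static void smp_call_function_many_sketch(const struct cpumask *mask,
                                                void (*func)(void *),
                                                void *info, int wait)
      {
          struct call_function_data *data = &per_cpu(cfd_data, smp_processor_id());

          /* Wait for any previous user of our static per-cpu slot; this
           * replaces the kmalloc() + RCU-free of the old code. */
          while (data->csd.flags & CSD_FLAG_LOCK)
              cpu_relax();
          data->csd.flags = CSD_FLAG_LOCK;

          data->csd.func = func;
          data->csd.info = info;
          cpumask_and(&data->cpumask, mask, cpu_online_mask);
          cpumask_clear_cpu(smp_processor_id(), &data->cpumask);
          data->refs = cpumask_weight(&data->cpumask);

          /* Head insertion: a cpu still iterating the list may see some
           * entries twice, but never skips one queued before it started. */
          spin_lock(&call_function_lock);
          list_add_rcu(&data->list, &call_function_queue);
          spin_unlock(&call_function_lock);

          /* ... send the IPI to every cpu in data->cpumask (arch hook) ... */

          if (wait)
              while (data->refs)      /* each target drops itself below */
                  cpu_relax();
      }

      /* Per-target handling of one entry: revisiting is harmless because a
       * cpu removes itself from the cpumask after running func once. */
      static void handle_one_entry_sketch(struct call_function_data *data)
      {
          int cpu = smp_processor_id();

          if (!cpumask_test_cpu(cpu, &data->cpumask))
              return;                 /* not for us, or already handled */
          data->csd.func(data->csd.info);
          cpumask_clear_cpu(cpu, &data->cpumask);
          if (--data->refs == 0)
              data->csd.flags &= ~CSD_FLAG_LOCK;  /* slot can now be reused */
      }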
    • generic IPI: simplify barriers and locking · 15d0d3b3
      Authored by Nick Piggin
      Simplify the barriers in generic remote function call interrupt
      code.
      
      Firstly, just unconditionally take the lock and check the list
      in the generic_call_function_single_interrupt IPI handler. As
      we've just taken an IPI here, the chances are fairly high that
      there will be work on the list for us, so do the locking
      unconditionally. This removes the tricky lockless list_empty
      check and dubious barriers. The change looks bigger than it is
      because it is just removing an outer loop.
      
      Secondly, clarify architecture specific IPI locking rules.
      Generic code has no tools to impose any sane ordering on IPIs if
      they go outside normal cache coherency, ergo the arch code must
      make them appear to obey cache coherency as a "memory operation"
      to initiate an IPI, and a "memory operation" to receive one.
      This way at least they can be reasoned about in generic code,
      and smp_mb used to provide ordering.
      
      The combination of these two changes means that explicit barriers
      can be taken out of queue handling for the single case -- shared
      data is explicitly locked, and ipi ordering must conform to
      that, so no barriers needed. An extra barrier is needed in the
      many handler, so as to ensure we load the list element after the
      IPI is received.
      
      Does any architecture actually *need* these barriers? For the
      initiator I could see it, but for the handler I would be
      surprised. So the other thing we could do for simplicity is,
      rather than just matching with cache coherency, to require a full
      barrier before generating an IPI and after receiving an IPI. In
      which case, the smp_mb()s can go
      away. But just for now, we'll be on the safe side and use the
      barriers (they're in the slow case anyway).
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: linux-arch@vger.kernel.org
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      15d0d3b3
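      A hedged sketch of the simplified handlers described above, again with
      illustrative names (the call_single_queue layout and per-cpu variable)
      rather than the exact kernel/smp.c code, and reusing the
      call_single_data sketch from the earlier entries.

      struct call_single_queue {
          struct list_head list;
          spinlock_t lock;
      };
      static DEFINE_PER_CPU(struct call_single_queue, csq);

      void call_function_single_interrupt_sketch(void)
      {
          struct call_single_queue *q = &per_cpu(csq, smp_processor_id());
          struct call_single_data *data, *next;
          LIST_HEAD(list);

          /* We were just sent an IPI, so there is almost certainly work
           * queued: take the lock unconditionally instead of doing a
           * lockless list_empty() probe with hand-rolled barriers. */
          spin_lock(&q->lock);
          list_replace_init(&q->list, &list);
          spin_unlock(&q->lock);

          list_for_each_entry_safe(data, next, &list, list) {
              unsigned int flags = data->flags;   /* read before func runs */

              data->func(data->info);
              if (flags & CSD_FLAG_LOCK)
                  data->flags &= ~CSD_FLAG_LOCK;  /* releases a waiting caller */
          }
      }

      void call_function_many_interrupt_sketch(void)
      {
          /* The one remaining explicit barrier: make sure the list entries
           * queued before the IPI was sent are visible before we iterate. */
          smp_mb();

          /* ... walk the global call_function_queue as sketched earlier ... */
      }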
  2. 23 Feb 2009, 1 commit
  3. 22 Feb 2009, 5 commits
  4. 19 Feb 2009, 5 commits
  5. 18 Feb 2009, 2 commits
    • block: fix bad definition of BIO_RW_SYNC · 93dbb393
      Authored by Jens Axboe
      We can't OR shift values, so get rid of BIO_RW_SYNC and use BIO_RW_SYNCIO
      and BIO_RW_UNPLUG explicitly. This brings back the behaviour from before
      213d9417.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      93dbb393
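      A small illustration of why the definition was bad (the bit numbers
      below are made up, not the real enum values): the BIO_RW_* constants
      are bit positions that callers shift, so OR-ing two positions yields a
      third, unrelated position rather than a combined mask.

      /* Hypothetical bit numbers, for illustration only. */
      enum {
          BIO_RW_SYNCIO = 4,          /* used as (1 << BIO_RW_SYNCIO) */
          BIO_RW_UNPLUG = 5,          /* used as (1 << BIO_RW_UNPLUG) */
      };

      /* Broken: 4 | 5 == 5, so "SYNC" silently meant only UNPLUG. */
      #define BIO_RW_SYNC_BROKEN  (BIO_RW_SYNCIO | BIO_RW_UNPLUG)

      /* One correct way to express the combination, if a mask is wanted:
       * shift first, then OR -- the commit itself just makes callers use
       * the two flags explicitly. */
      #define BIO_SYNC_UNPLUG_MASK  ((1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG))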
    • tracing/function-graph-tracer: trace the idle tasks · 5b058bcd
      Authored by Frederic Weisbecker
      When the function graph tracer is activated, it iterates over the task_list
      to allocate a stack to store the return addresses.
      
      But the per-cpu idle tasks are not iterated by
      do_each_thread / while_each_thread, so we have to walk them
      manually.

      This fixes some weirdness in the traces and many lost traces.
      Examples on two cpus:
      
       0)   Xorg-4287    |   2.906 us    |              }
       0)   Xorg-4287    |   3.965 us    |            }
       0)   Xorg-4287    |   5.302 us    |          }
       ------------------------------------------
       0)   Xorg-4287    =>    <idle>-0
       ------------------------------------------
      
       0)    <idle>-0    |   2.861 us    |                        }
       0)    <idle>-0    |   0.526 us    |                        set_normalized_timespec();
       0)    <idle>-0    |   7.201 us    |                      }
       0)    <idle>-0    |   8.214 us    |                    }
       0)    <idle>-0    |               |                    clockevents_program_event() {
       0)    <idle>-0    |               |                      lapic_next_event() {
       0)    <idle>-0    |   0.510 us    |                        native_apic_mem_write();
       0)    <idle>-0    |   1.546 us    |                      }
       0)    <idle>-0    |   2.583 us    |                    }
       0)    <idle>-0    | + 12.435 us   |                  }
       0)    <idle>-0    | + 13.470 us   |                }
       0)    <idle>-0    |   0.608 us    |                _spin_unlock_irqrestore();
       0)    <idle>-0    | + 23.270 us   |              }
       0)    <idle>-0    | + 24.336 us   |            }
       0)    <idle>-0    | + 25.417 us   |          }
       0)    <idle>-0    |   0.593 us    |          _spin_unlock();
       0)    <idle>-0    | + 41.869 us   |        }
       0)    <idle>-0    | + 42.906 us   |      }
       0)    <idle>-0    | + 95.035 us   |    }
       0)    <idle>-0    |   0.540 us    |    menu_reflect();
       0)    <idle>-0    | ! 100.404 us  |  }
       0)    <idle>-0    |   0.564 us    |  mce_idle_callback();
       0)    <idle>-0    |               |  enter_idle() {
       0)    <idle>-0    |   0.526 us    |    mce_idle_callback();
       0)    <idle>-0    |   1.757 us    |  }
       0)    <idle>-0    |               |  cpuidle_idle_call() {
       0)    <idle>-0    |               |    menu_select() {
       0)    <idle>-0    |   0.525 us    |      pm_qos_requirement();
       0)    <idle>-0    |   0.518 us    |      tick_nohz_get_sleep_length();
       0)    <idle>-0    |   2.621 us    |    }
      [...]
       1)    <idle>-0    |   0.518 us    |              touch_softlockup_watchdog();
       1)    <idle>-0    | + 14.355 us   |            }
       1)    <idle>-0    | + 22.840 us   |          }
       1)    <idle>-0    | + 25.949 us   |        }
       1)    <idle>-0    |               |        handle_irq() {
       1)    <idle>-0    |   0.511 us    |          irq_to_desc();
       1)    <idle>-0    |               |          handle_edge_irq() {
       1)    <idle>-0    |   0.638 us    |            _spin_lock();
       1)    <idle>-0    |               |            ack_apic_edge() {
       1)    <idle>-0    |   0.510 us    |              irq_to_desc();
       1)    <idle>-0    |               |              move_native_irq() {
       1)    <idle>-0    |   0.510 us    |                irq_to_desc();
       1)    <idle>-0    |   1.532 us    |              }
       1)    <idle>-0    |   0.511 us    |              native_apic_mem_write();
       ------------------------------------------
       1)    <idle>-0    =>    cat-5073
       ------------------------------------------
      
       1)    cat-5073    |   3.731 us    |                    }
       1)    cat-5073    |               |                    run_local_timers() {
       1)    cat-5073    |   0.533 us    |                      hrtimer_run_queues();
       1)    cat-5073    |               |                      raise_softirq() {
       1)    cat-5073    |               |                        __raise_softirq_irqoff() {
       1)    cat-5073    |               |                          /* nr: 1 */
       1)    cat-5073    |   2.718 us    |                        }
       1)    cat-5073    |   3.814 us    |                      }
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      5b058bcd
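      A hedged sketch of the manual iteration (only for_each_online_cpu()
      and idle_task() are real kernel interfaces here; the helper names and
      the exact hook point are illustrative): the per-cpu idle tasks get
      their return-address stacks assigned explicitly, mirroring what the
      task_list walk already does for every other task.

      /* Illustrative helper, not the literal patch. */
      static void graph_trace_init_idle_tasks_sketch(void)
      {
          int cpu;

          for_each_online_cpu(cpu) {
              struct task_struct *idle = idle_task(cpu);

              if (!idle->ret_stack)
                  assign_ret_stack(idle);     /* hypothetical allocator */
          }
      }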
  6. 16 Feb 2009, 2 commits
  7. 14 Feb 2009, 1 commit
  8. 13 Feb 2009, 1 commit
    • timers: more consistently use clock vs timer · 3997ad31
      Authored by Peter Zijlstra
      While reviewing the manpages, I noticed I'd missed some clock vs timer sites.
      
      Make sure that all timer functions call cpu_timer_sample_group() and not
      cpu_clock_sample_group(). This ensures that we enable the process wide timer
      in time, and therefore pay the O(n) thread group cost from the syscall.
      
      Not doing it here would instead have the first jiffy tick after
      setting the timer do this, resulting in a very expensive tick (but
      only once) and a delay in actually starting the timer.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      3997ad31
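      A hedged illustration of the rule being enforced (the wrapper is made
      up, and the signatures follow the 2.6.29-era posix-cpu-timers code
      from memory): timer-arming paths must sample through
      cpu_timer_sample_group(), which also switches the process-wide
      accounting on, not through cpu_clock_sample_group(), which only reads
      the clock.

      /* Illustrative wrapper, not kernel code. */
      static int arm_process_timer_sample_sketch(const clockid_t which_clock,
                                                 struct task_struct *p,
                                                 union cpu_time_count *now)
      {
          /* Right for timer paths: enables the process-wide timer machinery
           * now, so the O(n) thread-group walk happens in this syscall. */
          return cpu_timer_sample_group(which_clock, p, now);

          /* Wrong for timer paths: cpu_clock_sample_group(which_clock, p, now)
           * would leave the machinery off until a later tick notices it. */
      }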
  9. 12 Feb 2009, 4 commits
  10. 11 Feb 2009, 4 commits
  11. 10 Feb 2009, 1 commit
  12. 09 Feb 2009, 5 commits
  13. 07 Feb 2009, 1 commit
  14. 06 Feb 2009, 3 commits
    • wait: prevent exclusive waiter starvation · 777c6c5f
      Authored by Johannes Weiner
      With exclusive waiters, every process woken up through the wait queue must
      ensure that the next waiter down the line is woken when it has finished.
      
      Interruptible waiters don't do that when aborting due to a signal.  And if
      an aborting waiter is concurrently woken up through the waitqueue, no one
      will ever wake up the next waiter.
      
      This has been observed with __wait_on_bit_lock() used by
      lock_page_killable(): the first contender on the queue was aborting when
      the actual lock holder woke it up concurrently.  The aborted contender
      didn't acquire the lock and therefore never did an unlock followed by
      waking up the next waiter.
      
      Add abort_exclusive_wait() which removes the process' wait descriptor from
      the waitqueue, iff still queued, or wakes up the next waiter otherwise.
      It does so under the waitqueue lock.  Racing with a wake up means the
      aborting process is either already woken (removed from the queue) and will
      wake up the next waiter, or it will remove itself from the queue and the
      concurrent wake up will apply to the next waiter after it.
      
      Use abort_exclusive_wait() in __wait_event_interruptible_exclusive() and
      __wait_on_bit_lock() when they were interrupted by other means than a wake
      up through the queue.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Reported-by: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Mentored-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Chuck Lever <cel@citi.umich.edu>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: <stable@kernel.org>		["after some testing"]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      777c6c5f
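      A hedged sketch of the abort path described above, close in shape to
      the helper the patch adds but simplified (wake_up_locked() is used
      here as an illustrative stand-in for the internal wake helper, and
      mode/key are kept only to mirror the real signature): everything
      happens under the waitqueue lock, so the race with a concurrent
      wake-up resolves to exactly one of the two branches.

      void abort_exclusive_wait_sketch(wait_queue_head_t *q, wait_queue_t *wait,
                                       unsigned int mode, void *key)
      {
          unsigned long flags;

          __set_current_state(TASK_RUNNING);
          spin_lock_irqsave(&q->lock, flags);
          if (!list_empty(&wait->task_list)) {
              /* Never woken through the queue: just dequeue ourselves. */
              list_del_init(&wait->task_list);
          } else if (waitqueue_active(q)) {
              /* A concurrent wake-up already dequeued us and we will not
               * consume it: pass it on to the next exclusive waiter. */
              wake_up_locked(q);
          }
          spin_unlock_irqrestore(&q->lock, flags);
      }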
    • revert "rlimit: permit setting RLIMIT_NOFILE to RLIM_INFINITY" · 60fd760f
      Authored by Andrew Morton
      Revert commit 0c2d64fb because it causes
      (arguably poorly designed) existing userspace to spend interminable
      periods closing billions of not-open file descriptors.
      
      We could bring this back, with some sort of opt-in tunable in /proc, which
      defaults to "off".
      
      Peter's analysis follows:
      
      : I spent several hours trying to get to the bottom of a serious
      : performance issue that appeared on one of our servers after upgrading to
      : 2.6.28.  In the end it's what could be considered a userspace bug that
      : was triggered by a change in 2.6.28.  Since this might also affect other
      : people I figured I'd at least document what I found here, and maybe we
      : can even do something about it:
      :
      :
      : So, I upgraded some of debian.org's machines to 2.6.28.1 and immediately
      : the team maintaining our ftp archive complained that one of their
      : scripts that previously ran in a few minutes still hadn't even come
      : close to being done after an hour or so.  Downgrading to 2.6.27 fixed
      : that.
      :
      : Turns out that script is forking a lot and something in it or python or
      : wherever closes all the file descriptors it doesn't want to pass on.
      : That is, it starts at zero and goes up to ulimit -n/RLIMIT_NOFILE and
      : closes them all with a few exceptions.
      :
      : Turns out that takes a long time when your limit -n is now 2^20 (1048576).
      :
      : With 2.6.27.* the ulimit -n was the standard 1024, but with 2.6.28 it is
      : now a thousand times that.
      :
      : 2.6.28 included a patch titled "rlimit: permit setting RLIMIT_NOFILE to
      : RLIM_INFINITY" (0c2d64fb)[1] that
      : allows, as the title implies, to set the limit for number of files to
      : infinity.
      :
      : Closer investigation showed that the broken default ulimit did not apply
      : to "system" processes (like stuff started from init).  In the end I
      : could establish that all processes that passed through pam_limit at one
      : point had the bad resource limit.
      :
      : Apparently the pam library in Debian etch (4.0) initializes the limits
      : to some default values when it doesn't have any settings in limit.conf
      : to override them.  Turns out that for nofiles this is RLIM_INFINITY.
      : Commenting out "case RLIMIT_NOFILE" in pam_limit.c:267 of our pam
      : package version 0.79-5 fixes that - tho I'm not sure what side effects
      : that has.
      :
      : Debian lenny (the upcoming 5.0 version) doesn't have this issue as it
      : uses a different pam (version).
      Reported-by: Peter Palfrader <weasel@debian.org>
      Cc: Adam Tkac <vonsch@gmail.com>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Cc: <stable@kernel.org>		[2.6.28.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      60fd760f
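      For context, the userspace pattern that turned pathological looks like
      this (illustrative C, not the Debian script itself): closing every
      descriptor up to the soft RLIMIT_NOFILE before exec'ing, which jumps
      from about a thousand close() calls per child to about a million once
      the default limit balloons.

      #include <sys/resource.h>
      #include <unistd.h>

      /* Close all inherited fds except stdio before exec (common idiom). */
      static void close_inherited_fds(void)
      {
          struct rlimit rl;
          long fd, max;

          if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
              return;

          /* With rlim_cur at the old default of 1024 this is cheap; at
           * 2^20 the same loop issues a million syscalls per fork. */
          max = (long)rl.rlim_cur;
          for (fd = 3; fd < max; fd++)
              close((int)fd);
      }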
    • kernel/async.c: fix printk warnings · 58763a29
      Authored by Andrew Morton
      alpha:
      
      kernel/async.c: In function 'run_one_entry':
      kernel/async.c:141: warning: format '%lli' expects type 'long long int', but argument 2 has type 'async_cookie_t'
      kernel/async.c:149: warning: format '%lli' expects type 'long long int', but argument 2 has type 'async_cookie_t'
      kernel/async.c:149: warning: format '%lld' expects type 'long long int', but argument 4 has type 's64'
      kernel/async.c: In function 'async_synchronize_cookie_special':
      kernel/async.c:250: warning: format '%lli' expects type 'long long int', but argument 3 has type 's64'
      
      Cc: Arjan van de Ven <arjan@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      58763a29
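      The usual remedy for this class of warning (hedged: shown as a generic
      pattern, not the literal hunks of the patch) is to cast the 64-bit
      values to the type the format string expects, since async_cookie_t is
      u64 and u64 is 'unsigned long' rather than 'long long' on alpha:

      /* Generic pattern, not the literal patch. */
      static void report_async_entry(u64 cookie, s64 delta_us)
      {
          printk(KERN_DEBUG "async entry %lli took %lld usec\n",
                 (long long)cookie, (long long)delta_us);
      }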
  15. 05 Feb 2009, 2 commits
    • timers: split process wide cpu clocks/timers · 4cd4c1b4
      Authored by Peter Zijlstra
      Change the process wide cpu timers/clocks so that we:
      
       1) don't mess up the kernel with too many threads,
       2) don't have a per-cpu allocation for each process,
       3) have no impact when not used.
      
      In order to accomplish this we're going to split it into two parts:
      
       - clocks; which can take all the time they want since they run
                 from user context -- ie. sys_clock_gettime(CLOCK_PROCESS_CPUTIME_ID)
      
       - timers; which need constant time sampling but since they're
                 explicitly used, the user can pay the overhead.
      
      The clock readout will go back to a full sum of the thread group, while the
      timers will run off a global 'clock' that only runs when needed, so only
      programs that make use of the facility pay the price.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Reviewed-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      4cd4c1b4
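      A hedged sketch of the split (field names follow the description above
      but are illustrative): the shared accounting structure only runs while
      some process-wide timer is armed, so unused processes pay nothing,
      while the clock syscall keeps doing the slower full thread-group sum
      from user context.

      struct thread_group_cputimer_sketch {
          struct task_cputime cputime;    /* shared utime/stime/runtime */
          int running;                    /* ticked only while a timer is armed */
          spinlock_t lock;
      };

      /* Timer side: constant-time sample, paid for only by timer users. */
      static void timer_sample_group_sketch(struct thread_group_cputimer_sketch *ct,
                                            struct task_cputime *sample)
      {
          spin_lock(&ct->lock);
          *sample = ct->cputime;
          spin_unlock(&ct->lock);
      }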
    • signal: re-add dead task accumulation stats. · 32bd671d
      Authored by Peter Zijlstra
      We're going to split the process wide cpu accounting into two parts:
      
       - clocks; which can take all the time they want since they run
                 from user context.
      
       - timers; which need constant time tracing but can afford the overhead
                 because they're default off -- and rare.
      
      The clock readout will go back to a full sum of the thread group; for this
      we need to re-add the exit stats that were removed in the initial itimer
      rework (f06febc9: timers: fix itimer/many thread hang).
      
      Furthermore, since that full sum can be rather slow for large thread groups
      and we have the complete dead task stats, revert the do_notify_parent time
      computation.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Reviewed-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      32bd671d
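      A hedged sketch of the accumulation being re-added (the signal_struct
      fields are the usual accounting fields, but the helper itself is
      illustrative): when a thread dies, its CPU time is folded into the
      signal_struct, so a later full thread-group sum for the clock readout
      still includes threads that have already exited.

      /* Illustrative helper; the real accumulation lives in the exit path. */
      static void account_dead_thread_sketch(struct task_struct *tsk)
      {
          struct signal_struct *sig = tsk->signal;

          sig->utime = cputime_add(sig->utime, tsk->utime);
          sig->stime = cputime_add(sig->stime, tsk->stime);
          sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
      }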