1. 13 10月, 2010 1 次提交
    • S
      ring-buffer: Fix typo of time extends per page · d0134324
      Steven Rostedt 提交于
      Time stamps for the ring buffer are created by the difference between
      two events. Each page of the ring buffer holds a full 64 bit timestamp.
      Each event has a 27 bit delta stamp from the last event. The unit of time
      is nanoseconds, so 27 bits can hold ~134 milliseconds. If two events
      happen more than 134 milliseconds apart, a time extend is inserted
      to add more bits for the delta. The time extend has 59 bits, which
      is good for ~18 years.
      
      Currently the time extend is committed separately from the event.
      If an event is discarded before it is committed, due to filtering,
      the time extend still exists. If all events are being filtered, then
      after ~134 milliseconds a new time extend will be added to the buffer.
      
      This can only happen till the end of the page. Since each page holds
      a full timestamp, there is no reason to add a time extend to the
      beginning of a page. Time extends can only fill a page that has actual
      data at the beginning, so there is no fear that time extends will fill
      more than a page without any data.
      
      When reading an event, a loop is made to skip over time extends
      since they are only used to maintain the time stamp and are never
      given to the caller. As a paranoid check to prevent the loop running
      forever, with the knowledge that time extends may only fill a page,
      a check is made that tests the iteration of the loop, and if the
      iteration is more than the number of time extends that can fit in a page
      a warning is printed and the ring buffer is disabled (all of ftrace
      is also disabled with it).
      
      There is another event type that is called a TIMESTAMP which can
      hold 64 bits of data in the theoretical case that two events happen
      18 years apart. This code has not been implemented, but the name
      of this event exists, as well as the structure for it. The
      size of a TIMESTAMP is 16 bytes, where as a time extend is only
      8 bytes. The macro used to calculate how many time extends can fit on
      a page used the TIMESTAMP size instead of the time extend size
      cutting the amount in half.
      
      The following test case can easily trigger the warning since we only
      need to have half the page filled with time extends to trigger the
      warning:
      
       # cd /sys/kernel/debug/tracing/
       # echo function > current_tracer
       # echo 'common_pid < 0' > events/ftrace/function/filter
       # echo > trace
       # echo 1 > trace_marker
       # sleep 120
       # cat trace
      
      Enabling the function tracer and then setting the filter to only trace
      functions where the process id is negative (no events), then clearing
      the trace buffer to ensure that we have nothing in the buffer,
      then write to trace_marker to add an event to the beginning of a page,
      sleep for 2 minutes (only 35 seconds is probably needed, but this
      guarantees the bug), and then finally reading the trace which will
      trigger the bug.
      
      This patch fixes the typo and prevents the false positive of that warning.
      Reported-by: NHans J. Koch <hjk@linutronix.de>
      Tested-by: NHans J. Koch <hjk@linutronix.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Stable Kernel <stable@kernel.org>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      d0134324
  2. 12 10月, 2010 1 次提交
  3. 08 10月, 2010 1 次提交
  4. 07 10月, 2010 1 次提交
    • A
      HWPOISON: Copy si_addr_lsb to user · a337fdac
      Andi Kleen 提交于
      The original hwpoison code added a new siginfo field si_addr_lsb to
      pass the granuality of the fault address to user space. Unfortunately
      this field was never copied to user space. Fix this here.
      
      I added explicit checks for the MCEERR codes to avoid having
      to patch all potential callers to initialize the field.
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      a337fdac
  5. 06 10月, 2010 1 次提交
    • L
      modules: Fix module_bug_list list corruption race · 5336377d
      Linus Torvalds 提交于
      With all the recent module loading cleanups, we've minimized the code
      that sits under module_mutex, fixing various deadlocks and making it
      possible to do most of the module loading in parallel.
      
      However, that whole conversion totally missed the rather obscure code
      that adds a new module to the list for BUG() handling.  That code was
      doubly obscure because (a) the code itself lives in lib/bugs.c (for
      dubious reasons) and (b) it gets called from the architecture-specific
      "module_finalize()" rather than from generic code.
      
      Calling it from arch-specific code makes no sense what-so-ever to begin
      with, and is now actively wrong since that code isn't protected by the
      module loading lock any more.
      
      So this commit moves the "module_bug_{finalize,cleanup}()" calls away
      from the arch-specific code, and into the generic code - and in the
      process protects it with the module_mutex so that the list operations
      are now safe.
      
      Future fixups:
       - move the module list handling code into kernel/module.c where it
         belongs.
       - get rid of 'module_bug_list' and just use the regular list of modules
         (called 'modules' - imagine that) that we already create and maintain
         for other reasons.
      Reported-and-tested-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Adrian Bunk <bunk@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5336377d
  6. 02 10月, 2010 1 次提交
    • I
      kfifo: fix scatterlist usage · 399f1e30
      Ira W. Snyder 提交于
      The kfifo_dma family of functions use sg_mark_end() on the last element in
      their scatterlist.  This forces use of a fresh scatterlist for each DMA
      operation, which makes recycling a single scatterlist impossible.
      
      Change the behavior of the kfifo_dma functions to match the usage of the
      dma_map_sg function.  This means that users must respect the returned
      nents value.  The sample code is updated to reflect the change.
      
      This bug is trivial to cause: call kfifo_dma_in_prepare() such that it
      prepares a scatterlist with a single entry comprising the whole fifo.
      This is the case when you map the entirety of a newly created empty fifo.
      This causes the setup_sgl() function to mark the first scatterlist entry
      as the end of the chain, no matter what comes after it.
      
      Afterwards, add and remove some data from the fifo such that another call
      to kfifo_dma_in_prepare() will create two scatterlist entries.  It returns
      nents=2.  However, due to the previous sg_mark_end() call, sg_is_last()
      will now return true for the first scatterlist element.  This causes the
      sample code to print a single scatterlist element when it should print
      two.
      
      By removing the call to sg_mark_end(), we make the API as similar as
      possible to the DMA mapping API.  All users are required to respect the
      returned nents.
      Signed-off-by: NIra W. Snyder <iws@ovro.caltech.edu>
      Cc: Stefani Seibold <stefani@seibold.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      399f1e30
  7. 23 9月, 2010 1 次提交
  8. 21 9月, 2010 5 次提交
    • S
      tracing/sched: Add sched_pi_setprio tracepoint · a8027073
      Steven Rostedt 提交于
      Add a tracepoint that shows the priority of a task being boosted
      via priority inheritance.
      
      Cc: Gregory Haskins <ghaskins@novell.com>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      a8027073
    • S
      sched: Give CPU bound RT tasks preference · b3bc211c
      Steven Rostedt 提交于
      If a high priority task is waking up on a CPU that is running a
      lower priority task that is bound to a CPU, see if we can move the
      high RT task to another CPU first. Note, if all other CPUs are
      running higher priority tasks than the CPU bounded current task,
      then it will be preempted regardless.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Gregory Haskins <ghaskins@novell.com>
      LKML-Reference: <20100921024138.888922071@goodmis.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      b3bc211c
    • S
      sched: Try not to migrate higher priority RT tasks · 43fa5460
      Steven Rostedt 提交于
      When first working on the RT scheduler design, we concentrated on
      keeping all CPUs running RT tasks instead of having multiple RT
      tasks on a single CPU waiting for the migration thread to move
      them. Instead we take a more proactive stance and push or pull RT
      tasks from one CPU to another on wakeup or scheduling.
      
      When an RT task wakes up on a CPU that is running another RT task,
      instead of preempting it and killing the cache of the running RT
      task, we look to see if we can migrate the RT task that is waking
      up, even if the RT task waking up is of higher priority.
      
      This may sound a bit odd, but RT tasks should be limited in
      migration by the user anyway. But in practice, people do not do
      this, which causes high prio RT tasks to bounce around the CPUs.
      This becomes even worse when we have priority inheritance, because
      a high prio task can block on a lower prio task and boost its
      priority. When the lower prio task wakes up the high prio task, if
      it happens to be on the same CPU it will migrate off of it.
      
      But in reality, the above does not happen much either, because the
      wake up of the lower prio task, which has already been boosted, if
      it was on the same CPU as the higher prio task, it would then
      migrate off of it. But anyway, we do not want to migrate them
      either.
      
      To examine the scheduling, I created a test program and examined it
      under kernelshark. The test program created CPU * 2 threads, where
      each thread had a different priority. The program takes different
      options. The options used in this change log was to have priority
      inheritance mutexes or not.
      
      All threads did the following loop:
      
      static void grab_lock(long id, int iter, int l)
      {
      	ftrace_write("thread %ld iter %d, taking lock %d\n",
      		     id, iter, l);
      	pthread_mutex_lock(&locks[l]);
      	ftrace_write("thread %ld iter %d, took lock %d\n",
      		     id, iter, l);
      	busy_loop(nr_tasks - id);
      	ftrace_write("thread %ld iter %d, unlock lock %d\n",
      		     id, iter, l);
      	pthread_mutex_unlock(&locks[l]);
      }
      
      void *start_task(void *id)
      {
      	[...]
      	while (!done) {
      		for (l = 0; l < nr_locks; l++) {
      			grab_lock(id, i, l);
      			ftrace_write("thread %ld iter %d sleeping\n",
      				     id, i);
      			ms_sleep(id);
      		}
      		i++;
      	}
      	[...]
      }
      
      The busy_loop(ms) keeps the CPU spinning for ms milliseconds. The
      ms_sleep(ms) sleeps for ms milliseconds. The ftrace_write() writes
      to the ftrace buffer to help analyze via ftrace.
      
      The higher the id, the higher the prio, the shorter it does the
      busy loop, but the longer it spins. This is usually the case with
      RT tasks, the lower priority tasks usually run longer than higher
      priority tasks.
      
      At the end of the test, it records the number of loops each thread
      took, as well as the number of voluntary preemptions, non-voluntary
      preemptions, and number of migrations each thread took, taking the
      information from /proc/$$/sched and /proc/$$/status.
      
      Running this on a 4 CPU processor, the results without changes to
      the kernel looked like this:
      
      Task        vol    nonvol   migrated     iterations
      ----        ---    ------   --------     ----------
        0:         53      3220       1470             98
        1:        562       773        724             98
        2:        752       933       1375             98
        3:        749        39        697             98
        4:        758         5        515             98
        5:        764         2        679             99
        6:        761         2        535             99
        7:        757         3        346             99
      
      total:     5156       4977      6341            787
      
      Each thread regardless of priority migrated a few hundred times.
      The higher priority tasks, were a little better but still took
      quite an impact.
      
      By letting higher priority tasks bump the lower prio task from the
      CPU, things changed a bit:
      
      Task        vol    nonvol   migrated     iterations
      ----        ---    ------   --------     ----------
        0:         37      2835       1937             98
        1:        666      1821       1865             98
        2:        654      1003       1385             98
        3:        664       635        973             99
        4:        698       197        352             99
        5:        703       101        159             99
        6:        708         1         75             99
        7:        713         1          2             99
      
      total:     4843       6594      6748            789
      
      The total # of migrations did not change (several runs showed the
      difference all within the noise). But we now see a dramatic
      improvement to the higher priority tasks. (kernelshark showed that
      the watchdog timer bumped the highest priority task to give it the
      2 count. This was actually consistent with every run).
      
      Notice that the # of iterations did not change either.
      
      The above was with priority inheritance mutexes. That is, when the
      higher prority task blocked on a lower priority task, the lower
      priority task would inherit the higher priority task (which shows
      why task 6 was bumped so many times). When not using priority
      inheritance mutexes, the current kernel shows this:
      
      Task        vol    nonvol   migrated     iterations
      ----        ---    ------   --------     ----------
        0:         56      3101       1892             95
        1:        594       713        937             95
        2:        625       188        618             95
        3:        628         4        491             96
        4:        640         7        468             96
        5:        631         2        501             96
        6:        641         1        466             96
        7:        643         2        497             96
      
      total:     4458       4018      5870            765
      
      Not much changed with or without priority inheritance mutexes. But
      if we let the high priority task bump lower priority tasks on
      wakeup we see:
      
      Task        vol    nonvol   migrated     iterations
      ----        ---    ------   --------     ----------
        0:        115      3439       2782             98
        1:        633      1354       1583             99
        2:        652       919       1218             99
        3:        645       713        934             99
        4:        690         3          3             99
        5:        694         1          4             99
        6:        720         3          4             99
        7:        747         0          1            100
      
      Which shows a even bigger change. The big difference between task 3
      and task 4 is because we have only 4 CPUs on the machine, causing
      the 4 highest prio tasks to always have preference.
      
      Although I did not measure cache misses, and I'm sure there would
      be little to measure since the test was not data intensive, I could
      imagine large improvements for higher priority tasks when dealing
      with lower priority tasks. Thus, I'm satisfied with making the
      change and agreeing with what Gregory Haskins argued a few years
      ago when we first had this discussion.
      
      One final note. All tasks in the above tests were RT tasks. Any RT
      task will always preempt a non RT task that is running on the CPU
      the RT task wants to run on.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Gregory Haskins <ghaskins@novell.com>
      LKML-Reference: <20100921024138.605460343@goodmis.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      43fa5460
    • V
      sched: Increment cache_nice_tries only on periodic lb · 58b26c4c
      Venkatesh Pallipadi 提交于
      scheduler uses cache_nice_tries as an indicator to do cache_hot and
      active load balance, when normal load balance fails. Currently,
      this value is changed on any failed load balance attempt. That ends
      up being not so nice to workloads that enter/exit idle often, as
      they do more frequent new_idle balance and that pretty soon results
      in cache hot tasks being pulled in.
      
      Making the cache_nice_tries ignore failed new_idle balance seems to
      make better sense. With that only the failed load balance in
      periodic load balance gets accounted and the rate of accumulation
      of cache_nice_tries will not depend on idle entry/exit (short
      running sleep-wakeup kind of tasks). This reduces movement of
      cache_hot tasks.
      
      schedstat diff (after-before) excerpt from a workload that has
      frequent and short wakeup-idle pattern (:2 in cpu col below refers
      to NEWIDLE idx) This snapshot was across ~400 seconds.
      
      Without this change:
      domainstats:  domain0
       cpu     cnt      bln      fld      imb     gain    hgain  nobusyq  nobusyg
       0:2  306487   219575    73167  110069413    44583    19070     1172   218403
       1:2  292139   194853    81421  120893383    50745    21902     1259   193594
       2:2  283166   174607    91359  129699642    54931    23688     1287   173320
       3:2  273998   161788    93991  132757146    57122    24351     1366   160422
       4:2  289851   215692    62190  83398383    36377    13680      851   214841
       5:2  316312   222146    77605  117582154    49948    20281      988   221158
       6:2  297172   195596    83623  122133390    52801    21301      929   194667
       7:2  283391   178078    86378  126622761    55122    22239      928   177150
       8:2  297655   210359    72995  110246694    45798    19777     1125   209234
       9:2  297357   202011    79363  119753474    50953    22088     1089   200922
      10:2  278797   178703    83180  122514385    52969    22726     1128   177575
      11:2  272661   167669    86978  127342327    55857    24342     1195   166474
      12:2  293039   204031    73211  110282059    47285    19651      948   203083
      13:2  289502   196762    76803  114712942    49339    20547     1016   195746
      14:2  264446   169609    78292  115715605    50459    21017      982   168627
      15:2  260968   163660    80142  116811793    51483    21281     1064   162596
      
      With this change:
      domainstats:  domain0
       cpu     cnt      bln      fld      imb     gain    hgain  nobusyq  nobusyg
       0:2  272347   187380    77455  105420270    24975        1      953   186427
       1:2  267276   172360    86234  116242264    28087        6     1028   171332
       2:2  259769   156777    93281  123243134    30555        1     1043   155734
       3:2  250870   143129    97627  127370868    32026        6     1188   141941
       4:2  248422   177116    64096  78261112    22202        2      757   176359
       5:2  275595   180683    84950  116075022    29400        6      778   179905
       6:2  262418   162609    88944  119256898    31056        4      817   161792
       7:2  252204   147946    92646  122388300    32879        4      824   147122
       8:2  262335   172239    81631  110477214    26599        4      864   171375
       9:2  261563   164775    88016  117203621    28331        3      849   163926
      10:2  243389   140949    93379  121353071    29585        2      909   140040
      11:2  242795   134651    98310  124768957    30895        2     1016   133635
      12:2  255234   166622    79843  104696912    26483        4      746   165876
      13:2  244944   151595    83855  109808099    27787        3      801   150794
      14:2  241301   140982    89935  116954383    30403        6      845   140137
      15:2  232271   128564    92821  119185207    31207        4     1416   127148
      Signed-off-by: NVenkatesh Pallipadi <venki@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1284167957-3675-1-git-send-email-venki@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      58b26c4c
    • S
      sched: Fix nohz balance kick · f6c3f168
      Suresh Siddha 提交于
      There's a situation where the nohz balancer will try to wake itself:
      
      cpu-x is idle which is also ilb_cpu
      got a scheduler tick during idle
      and the nohz_kick_needed() in trigger_load_balance() checks for
      rq_x->nr_running which might not be zero (because of someone waking a
      task on this rq etc) and this leads to the situation of the cpu-x
      sending a kick to itself.
      
      And this can cause a lockup.
      
      Avoid this by not marking ourself eligible for kicking.
      Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1284400941.2684.19.camel@sbsiddha-MOBL3.sc.intel.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      f6c3f168
  9. 17 9月, 2010 1 次提交
  10. 16 9月, 2010 1 次提交
    • H
      sched: Remove branch hints within context_switch() · 31915ab4
      Heiko Carstens 提交于
      With 710390d9 "sched: Optimize branch hint in context_switch()"
      the branch hint logic within context_switch() got inversed.
      
      In fact the hints "if (likely(!mm))" and "if (likely(!prev->mm))"
      mean that it is likely that the previous and next task are kernel
      threads.
      
      That assumption is certainly counter intuitive, but Tim has shown
      that at least with his workload this is true. Nevertheless the
      truth is: it depends on the current workload. So just remove the
      annotations which also improves readability.
      Reported-by: NTim Blechmann <tim@klingt.org>
      Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <20100916124225.GA2209@osiris.boeblingen.de.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      31915ab4
  11. 15 9月, 2010 2 次提交
  12. 14 9月, 2010 2 次提交
  13. 13 9月, 2010 1 次提交
  14. 12 9月, 2010 1 次提交
    • R
      PM / Hibernate: Avoid hitting OOM during preallocation of memory · 6715045d
      Rafael J. Wysocki 提交于
      There is a problem in hibernate_preallocate_memory() that it calls
      preallocate_image_memory() with an argument that may be greater than
      the total number of available non-highmem memory pages.  If that's
      the case, the OOM condition is guaranteed to trigger, which in turn
      can cause significant slowdown to occur during hibernation.
      
      To avoid that, make preallocate_image_memory() adjust its argument
      before calling preallocate_image_pages(), so that the total number of
      saveable non-highem pages left is not less than the minimum size of
      a hibernation image.  Change hibernate_preallocate_memory() to try to
      allocate from highmem if the number of pages allocated by
      preallocate_image_memory() is too low.
      
      Modify free_unnecessary_pages() to take all possible memory
      allocation patterns into account.
      Reported-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      Tested-by: NM. Vefa Bicakci <bicave@superonline.com>
      6715045d
  15. 11 9月, 2010 1 次提交
  16. 10 9月, 2010 11 次提交
  17. 08 9月, 2010 3 次提交
  18. 09 9月, 2010 1 次提交
    • S
      tracing: Do not allow llseek to set_ftrace_filter · 9c55cb12
      Steven Rostedt 提交于
      Reading the file set_ftrace_filter does three things.
      
      1) shows whether or not filters are set for the function tracer
      2) shows what functions are set for the function tracer
      3) shows what triggers are set on any functions
      
      3 is independent from 1 and 2.
      
      The way this file currently works is that it is a state machine,
      and as you read it, it may change state. But this assumption breaks
      when you use lseek() on the file. The state machine gets out of sync
      and the t_show() may use the wrong pointer and cause a kernel oops.
      
      Luckily, this will only kill the app that does the lseek, but the app
      dies while holding a mutex. This prevents anyone else from using the
      set_ftrace_filter file (or any other function tracing file for that matter).
      
      A real fix for this is to rewrite the code, but that is too much for
      a -rc release or stable. This patch simply disables llseek on the
      set_ftrace_filter() file for now, and we can do the proper fix for the
      next major release.
      Reported-by: NRobert Swiecki <swiecki@google.com>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Tavis Ormandy <taviso@google.com>
      Cc: Eugene Teo <eugene@redhat.com>
      Cc: vendor-sec@lst.de
      Cc: <stable@kernel.org>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      9c55cb12
  19. 08 9月, 2010 1 次提交
  20. 05 9月, 2010 2 次提交
  21. 03 9月, 2010 1 次提交