1. 24 Dec, 2010 1 commit
  2. 23 Dec, 2010 1 commit
    • taskstats: pad taskstats netlink response for aligment issues on ia64 · 4be2c95d
      Jeff Mahoney authored
      The taskstats structure is internally aligned on 8 byte boundaries but the
      layout of the aggregate reply, with two NLA headers and the pid (each 4
      bytes), actually forces the entire structure to be unaligned.  This causes
      the kernel to issue unaligned access warnings on some architectures like
      ia64.  Unfortunately, some software out there doesn't properly unroll the
      NLA packet and assumes that the start of the taskstats structure will
      always be 20 bytes from the start of the netlink payload.  Aligning the
      start of the taskstats structure breaks this software, which we don't
      want.  So, for now the alignment only happens on architectures that
      require it and those users will have to update to fixed versions of those
      packages.  Space is reserved in the packet only when needed.  This ifdef
      should be removed in several years e.g.  2012 once we can be confident
      that fixed versions are installed on most systems.  We add the padding
      before the aggregate since the aggregate is already a defined type.
      
      Commit 85893120 ("delayacct: align to 8 byte boundary on 64-bit systems")
      previously addressed the alignment issues by padding out the pid field.
      This was supposed to be a compatible change but the circumstances
      described above mean that it wasn't.  This patch backs out that change,
      since it was a hack, and introduces a new NULL attribute type to provide
      the padding.  Padding the response with 4 bytes avoids allocating an
      aligned taskstats structure and copying it back.  Since the structure
      weighs in at 328 bytes, it's too big to do it on the stack.
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reported-by: Brian Rogers <brian@xyzw.org>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Guillaume Chazarain <guichaz@gmail.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
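The offset arithmetic in the taskstats message above can be checked with a small user-space sketch. The header sizes are hard-coded (16 bytes for struct nlmsghdr, 4 for struct genlmsghdr, 4 for struct nlattr) so that it builds without kernel headers; the function names are hypothetical and the attribute order is an assumption read off the commit message, not the patch itself.

```c
#include <assert.h>
#include <stddef.h>

/* Fixed netlink header sizes in bytes, hard-coded for this sketch. */
enum {
	NLMSG_HDR_LEN = 16, /* struct nlmsghdr */
	GENL_HDR_LEN  = 4,  /* struct genlmsghdr */
	NLA_HDR_LEN   = 4,  /* struct nlattr */
	PID_LEN       = 4   /* 32-bit pid payload */
};

/*
 * Offset of struct taskstats from the start of the netlink payload
 * (just after struct nlmsghdr). When `pad` is non-zero, an empty
 * padding attribute (header only) is placed before the aggregate.
 */
static size_t stats_offset_in_payload(int pad)
{
	size_t off = GENL_HDR_LEN;

	if (pad)
		off += NLA_HDR_LEN;       /* assumed empty "NULL" attribute */
	off += NLA_HDR_LEN;               /* aggregate attribute header */
	off += NLA_HDR_LEN + PID_LEN;     /* pid attribute: header + value */
	off += NLA_HDR_LEN;               /* stats attribute header */
	return off;
}

/* Same offset measured from the start of the whole netlink message. */
static size_t stats_offset_in_message(int pad)
{
	return NLMSG_HDR_LEN + stats_offset_in_payload(pad);
}

static int is_aligned8(size_t off)
{
	return off % 8 == 0;
}
```

Without padding the structure lands 20 bytes into the payload, which is the offset the broken user-space tools hard-code; with the 4-byte padding attribute it moves to 24 and becomes 8-byte aligned.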
  3. 20 Dec, 2010 1 commit
  4. 18 Dec, 2010 2 commits
  5. 17 Dec, 2010 2 commits
  6. 16 Dec, 2010 3 commits
  7. 14 Dec, 2010 1 commit
    • workqueue: It is likely that WORKER_NOT_RUNNING is true · 2d64672e
      Steven Rostedt authored
      Running the annotate branch profiler on three boxes, including my
      main box that runs firefox, evolution, xchat, and is part of the distcc farm,
      showed this with the likelys in the workqueue code:
      
       correct incorrect  %        Function                  File              Line
       ------- ---------  -        --------                  ----              ----
            96   996253  99 wq_worker_sleeping             workqueue.c          703
            96   996247  99 wq_worker_waking_up            workqueue.c          677
      
      The likely()s in this case were assuming that WORKER_NOT_RUNNING will
      most likely be false. But this is not the case. The reason is
      (and shown by adding trace_printks and testing it) that most of the time
      WORKER_PREP is set.
      
      In worker_thread() we have:
      
      	worker_clr_flags(worker, WORKER_PREP);
      
      	[ do work stuff ]
      
      	worker_set_flags(worker, WORKER_PREP, false);
      
      (that 'false' means not to wake up an idle worker)
      
      wq_worker_sleeping() is called from schedule() when a worker thread
      is putting itself to sleep, which happens most of the time outside
      of that [ do work stuff ].
      
      wq_worker_waking_up() is called by the wakeup worker code, which
      is also called outside that [ do work stuff ].
      
      Thus, the likely and unlikely used by those two functions are actually
      backwards.
      
      Remove the annotation and let gcc figure it out.
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
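As a rough illustration of what the branch profiler measures in the workqueue commit above, the sketch below counts how often a likely() hint agrees with the real condition. The likely()/unlikely() macros follow the usual GCC __builtin_expect idiom; struct branch_stats and the helper names are hypothetical and are not the kernel's profiler implementation.

```c
#include <assert.h>

/* Branch hints built on the GCC/Clang __builtin_expect builtin. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Per-annotation hit counters, similar in spirit to the table above. */
struct branch_stats {
	unsigned long correct;
	unsigned long incorrect;
};

/* Record one evaluation: `expected` is the hint, `actual` the real value. */
static void record_branch(struct branch_stats *s, int expected, int actual)
{
	if (!!expected == !!actual)
		s->correct++;
	else
		s->incorrect++;
}

/* A likely() that also records whether the hint was right. */
static int profiled_likely(struct branch_stats *s, int cond)
{
	record_branch(s, 1, cond);
	return likely(cond);
}

/* Integer percentage of mispredicted evaluations (rounded down). */
static unsigned long percent_incorrect(const struct branch_stats *s)
{
	unsigned long total = s->correct + s->incorrect;

	return total ? s->incorrect * 100 / total : 0;
}
```

Feeding in the counts from the table (96 correct, 996253 incorrect) gives 99 percent incorrect, which is why the hints were judged to be backwards.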
  8. 09 Dec, 2010 4 commits
    • nohz: Fix get_next_timer_interrupt() vs cpu hotplug · dbd87b5a
      Heiko Carstens authored
      This fixes a bug as seen on 2.6.32 based kernels where timers got
      enqueued on offline cpus.
      
      If a cpu goes offline it might still have pending timers. These will
      be migrated during CPU_DEAD handling after the cpu is offline.
      However while the cpu is going offline it will schedule the idle task
      which will then call tick_nohz_stop_sched_tick().
      
      That function in turn will call get_next_timer_interrupt() to figure
      out if the tick of the cpu can be stopped or not. If it turns out that
      the next tick is just one jiffy off (delta_jiffies == 1)
      tick_nohz_stop_sched_tick() incorrectly assumes that the tick should
      not stop and takes an early exit and thus it won't update the load
      balancer cpu.
      
      Just afterwards the cpu will be killed and the load balancer cpu could
      be the offline cpu.
      
      On 2.6.32 based kernels get_nohz_load_balancer() gets called to decide
      on which cpu a timer should be enqueued (see __mod_timer()), which
      leads to the possibility that timers get enqueued on an offline cpu.
      These will never expire and can cause a system hang.
      
      This has been observed on 2.6.32 kernels. On current kernels
      __mod_timer() uses get_nohz_timer_target() which doesn't have that
      problem. However there might be other problems because of the too
      early exit from tick_nohz_stop_sched_tick() in case a cpu goes offline.
      
      The easiest and probably safest fix seems to be to let
      get_next_timer_interrupt() just lie and let it say there isn't any
      pending timer if the current cpu is offline.
      
      I also thought of moving migrate_[hr]timers() from CPU_DEAD to
      CPU_DYING, but seeing that there already have been fixes at least in
      the hrtimer code in this area I'm afraid that this could add new
      subtle bugs.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20101201091109.GA8984@osiris.boeblingen.de.ibm.com>
      Cc: stable@kernel.org
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
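The "let it lie" idea from the nohz commit above can be modelled in user space. This is only a sketch under stated assumptions: struct model_cpu, MODEL_MAX_DELTA and the model_* functions are hypothetical stand-ins, not the kernel's timer code, and the early-exit rule is reduced to the "one jiffy off" case the message describes.

```c
#include <assert.h>

/* Illustrative "no timer pending" horizon; the value is an assumption. */
#define MODEL_MAX_DELTA ((1UL << 30) - 1)

/* Minimal per-cpu state for the model. */
struct model_cpu {
	int online;                /* non-zero while the cpu is online */
	unsigned long next_expiry; /* jiffies value of the earliest timer */
};

/*
 * Report when the next timer fires. An offline cpu claims to have
 * nothing pending, so the caller never takes the early exit.
 */
static unsigned long model_next_timer_interrupt(const struct model_cpu *cpu,
						unsigned long now)
{
	if (!cpu->online)
		return now + MODEL_MAX_DELTA;
	return cpu->next_expiry;
}

/* The tick is kept running when the next event is one jiffy away or less. */
static int model_can_stop_tick(const struct model_cpu *cpu, unsigned long now)
{
	unsigned long delta = model_next_timer_interrupt(cpu, now) - now;

	return delta > 1;
}
```

With a timer due in one jiffy, an online cpu keeps its tick, while the same cpu marked offline reports a far-away expiry and proceeds down the normal tick-stop path, which is the path that updates the load balancer cpu.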
    • Sched: fix skip_clock_update optimization · f26f9aff
      Mike Galbraith authored
      idle_balance() drops/retakes rq->lock, leaving the previous task
      vulnerable to set_tsk_need_resched().  Clear it after we return
      from balancing instead, and in setup_thread_stack() as well, so
      no successfully descheduled or never scheduled task has it set.
      
      Need resched confused the skip_clock_update logic, which assumes
      that the next call to update_rq_clock() will come nearly immediately
      after being set.  Make the optimization robust against the case of
      waking a sleeper before it successfully deschedules, by checking that
      the current task has not been dequeued before setting the flag,
      since it is that useless clock update we're trying to save, and
      clear unconditionally in schedule() proper instead of conditionally
      in put_prev_task().
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Reported-by: Bjoern B. Brandenburg <bbb.lst@gmail.com>
      Tested-by: Yong Zhang <yong.zhang0@gmail.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: stable@kernel.org
      LKML-Reference: <1291802742.1417.9.camel@marge.simson.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Cure more NO_HZ load average woes · 0f004f5a
      Peter Zijlstra authored
      There's a long-running regression that proved difficult to fix and
      which is hitting certain people and is rather annoying in its effects.
      
      Damien reported that after 74f5187a (sched: Cure load average vs
      NO_HZ woes) his load average is unnaturally high. He also noted that
      even with that patch reverted the load average numbers are not
      correct.
      
      The problem is that the previous patch only solved half the NO_HZ
      problem: it addressed the part of going into NO_HZ mode, not of
      coming out of NO_HZ mode. This patch implements that missing half.
      
      When coming out of NO_HZ mode there are two important things to take
      care of:
      
       - Folding the pending idle delta into the global active count.
       - Correctly aging the averages for the idle-duration.
      
      So with this patch the NO_HZ interaction should be complete and
      behaviour between CONFIG_NO_HZ=[yn] should be equivalent.
      
      Furthermore, this patch slightly changes the load average computation
      by adding a rounding term to the fixed point multiplication.
      Reported-by: Damien Wyart <damien.wyart@free.fr>
      Reported-by: Tim McGrath <tmhikaru@gmail.com>
      Tested-by: Damien Wyart <damien.wyart@free.fr>
      Tested-by: Orion Poplawski <orion@cora.nwra.com>
      Tested-by: Kyle McMartin <kyle@mcmartin.ca>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: stable@kernel.org
      Cc: Chase Douglas <chase.douglas@canonical.com>
      LKML-Reference: <1291129145.32004.874.camel@laptop>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
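The rounding term and the idle aging mentioned in the load average commit above can be sketched as follows. The fixed-point constants (11 fractional bits, a 1-minute decay factor of 1884) are the conventional load-average values; the function names are chosen for this sketch, and aging by repeated application is a simplification rather than the patch's actual method.

```c
#include <assert.h>

#define FSHIFT  11             /* bits of fixed-point precision */
#define FIXED_1 (1UL << FSHIFT) /* 1.0 in fixed point */
#define EXP_1   1884UL          /* 1-minute decay factor in fixed point */

/* One averaging step, truncating the fixed-point product. */
static unsigned long calc_load_trunc(unsigned long load, unsigned long exp,
				     unsigned long active)
{
	load *= exp;
	load += active * (FIXED_1 - exp);
	return load >> FSHIFT;
}

/* Same step with a rounding term of one half added before the shift. */
static unsigned long calc_load_round(unsigned long load, unsigned long exp,
				     unsigned long active)
{
	load *= exp;
	load += active * (FIXED_1 - exp);
	load += 1UL << (FSHIFT - 1);
	return load >> FSHIFT;
}

/* Age the average over n idle periods (active == 0) by repeating the step. */
static unsigned long age_idle(unsigned long load, unsigned long exp,
			      unsigned int n)
{
	while (n--)
		load = calc_load_round(load, exp, 0);
	return load;
}
```

With truncation a very small average decays straight to zero, while the rounded version keeps it at the nearest representable value; a steady load of 1.0 stays at exactly FIXED_1 in both variants.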
    • perf: Fix duplicate events with multiple-pmu vs software events · 51676957
      Peter Zijlstra authored
      Because the multi-pmu bits can share contexts between struct pmu
      instances we could get duplicate events by iterating the pmu list.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  9. 07 Dec, 2010 2 commits
    • PM / Hibernate: Fix memory corruption related to swap · c9e664f1
      Rafael J. Wysocki authored
      There is a problem that swap pages allocated before the creation of
      a hibernation image can be released and used for storing the contents
      of different memory pages while the image is being saved.  Since the
      kernel stored in the image doesn't know of that, it causes memory
      corruption to occur after resume from hibernation, especially on
      systems with relatively small RAM that need to swap often.
      
      This issue can be addressed by keeping the GFP_IOFS bits clear
      in gfp_allowed_mask during the entire hibernation, including the
      saving of the image, until the system is finally turned off or
      the hibernation is aborted.  Unfortunately, for this purpose
      it's necessary to rework the way in which the hibernate and
      suspend code manipulates gfp_allowed_mask.
      
      This change is based on an earlier patch from Hugh Dickins.
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Reported-by: Ondrej Zary <linux@rainbow-software.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: stable@kernel.org
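The mask handling described in the hibernation commit above can be pictured with a small model: the I/O and filesystem bits are cleared from a global allowed mask for the whole hibernation and restored afterwards. The bit values, the MODEL_* names and the model_* functions are all hypothetical and chosen only for this sketch; they are not taken from the kernel sources.

```c
#include <assert.h>

typedef unsigned int gfp_t;

/* Illustrative flag bits; the real values are not taken from the kernel. */
#define MODEL_GFP_WAIT 0x10u
#define MODEL_GFP_IO   0x40u
#define MODEL_GFP_FS   0x80u
#define MODEL_GFP_IOFS (MODEL_GFP_IO | MODEL_GFP_FS)

/* Global mask that every allocation request is filtered through. */
static gfp_t gfp_allowed_mask = MODEL_GFP_WAIT | MODEL_GFP_IOFS;
static gfp_t saved_gfp_mask;

/* Forbid allocations from starting I/O or filesystem activity. */
static void model_restrict_gfp_mask(void)
{
	saved_gfp_mask = gfp_allowed_mask;
	gfp_allowed_mask &= ~MODEL_GFP_IOFS;
}

/* Undo the restriction once hibernation finishes or is aborted. */
static void model_restore_gfp_mask(void)
{
	if (saved_gfp_mask) {
		gfp_allowed_mask = saved_gfp_mask;
		saved_gfp_mask = 0;
	}
}

/* Flags an allocation may actually use under the current mask. */
static gfp_t model_effective_gfp(gfp_t requested)
{
	return requested & gfp_allowed_mask;
}
```

While the restriction is in place, a request that asks for I/O and filesystem access is reduced to its remaining bits, so in this model an allocation cannot trigger the swap activity that the commit identifies as the source of corruption.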
    • PM / Hibernate: Use async I/O when reading compressed hibernation image · 9f339caf
      Bojan Smojver authored
      This is a fix for reading an LZO-compressed image using async I/O.
      Essentially, instead of having just one page into which we keep
      reading blocks from swap, we allocate enough of them to cover the
      largest compressed size and then let block I/O pick them all up. Once
      we have them all (and here we wait), we decompress them, as usual.
      Obviously, the very first block we still pick up synchronously,
      because we need to know the size of the lot before we pick up the
      rest.
      
      Also fixed the copyright line, which I've forgotten before.
      Signed-off-by: Bojan Smojver <bojan@rexursive.com>
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
  10. 03 Dec, 2010 1 commit
    • do_exit(): make sure that we run with get_fs() == USER_DS · 33dd94ae
      Nelson Elhage authored
      If a user manages to trigger an oops with fs set to KERNEL_DS, fs is not
      otherwise reset before do_exit().  do_exit may later (via mm_release in
      fork.c) do a put_user to a user-controlled address, potentially allowing
      a user to leverage an oops into a controlled write into kernel memory.
      
      This is only triggerable in the presence of another bug, but this
      potentially turns a lot of DoS bugs into privilege escalations, so it's
      worth fixing.  I have proof-of-concept code which uses this bug along
      with CVE-2010-3849 to write a zero to an arbitrary kernel address, so
      I've tested that this is not theoretical.
      
      A more logical place to put this fix might be when we know an oops has
      occurred, before we call do_exit(), but that would involve changing
      every architecture, in multiple places.
      
      Let's just stick it in do_exit instead.
      
      [akpm@linux-foundation.org: update code comment]
      Signed-off-by: Nelson Elhage <nelhage@ksplice.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 01 Dec, 2010 2 commits
  12. 26 Nov, 2010 5 commits
    • nohz: Fix printk_needs_cpu() return value on offline cpus · 61ab2544
      Heiko Carstens authored
      This patch fixes a hang observed with 2.6.32 kernels where timers got enqueued
      on offline cpus.
      
      printk_needs_cpu() may return 1 if called on offline cpus. When a cpu gets
      offlined it schedules the idle process which, before killing its own cpu, will
      call tick_nohz_stop_sched_tick(). That function in turn will call
      printk_needs_cpu() in order to check if the local tick can be disabled. On
      offline cpus this function should naturally return 0, since regardless of
      whether the tick gets disabled or not the cpu will be dead shortly after. That
      is besides the fact that __cpu_disable() should already have made sure that no
      interrupts on the offlined cpu will be delivered anyway.
      
      In this case it prevents tick_nohz_stop_sched_tick() from calling
      select_nohz_load_balancer(). No idea if that really is a problem. However what
      made me debug this is that on 2.6.32 the function get_nohz_load_balancer() is
      used within __mod_timer() to select a cpu on which a timer gets enqueued. If
      printk_needs_cpu() returns 1 then the nohz_load_balancer cpu doesn't get
      updated when a cpu gets offlined. It may contain the cpu number of an offline
      cpu. In turn timers get enqueued on an offline cpu and not very surprisingly
      they never expire and cause system hangs.
      
      This has been observed on 2.6.32 kernels. On current kernels __mod_timer() uses
      get_nohz_timer_target() which doesn't have that problem. However there might be
      other problems because of the too early exit from tick_nohz_stop_sched_tick()
      in case a cpu goes offline.
      
      Easiest way to fix this is just to test if the current cpu is offline and call
      printk_tick() directly which clears the condition.
      
      Alternatively I tried a cpu hotplug notifier which would clear the condition,
      however between calling the notifier function and printk_needs_cpu() something
      could have called printk() again and the problem is back again. This seems to
      be the safest fix.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: stable@kernel.org
      LKML-Reference: <20101126120235.406766476@de.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • printk: Fix wake_up_klogd() vs cpu hotplug · 49f41383
      Heiko Carstens authored
      wake_up_klogd() may get called from preemptible context but uses
      __raw_get_cpu_var() to write to a per cpu variable. If it gets preempted
      between getting the address and writing to it, the cpu in question could be
      offline if the process gets scheduled back and hence writes to the per cpu data
      of an offline cpu.
      
      This buggy behaviour was introduced with fa33507a "printk: robustify
      printk, fix #2" which was supposed to fix a "using smp_processor_id() in
      preemptible" warning.
      
      Let's use this_cpu_write() instead which disables preemption and makes sure
      that the outlined scenario cannot happen.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20101126124247.GC7023@osiris.boeblingen.de.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf: Fix the software context switch counter · ee6dcfa4
      Peter Zijlstra authored
      Stephane noticed that because the perf_sw_event() call is inside the
      perf_event_task_sched_out() call it won't get called unless we
      have a per-task counter.
      Reported-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf: Fix inherit vs. context rotation bug · dddd3379
      Thomas Gleixner authored
      It was found that sometimes children of tasks with inherited events had
      one extra event. Eventually it turned out to be due to the list rotation
      not being exclusive with the list iteration in the inheritance code.
      
      Cure this by temporarily disabling the rotation while we inherit the events.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Cc: <stable@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • workqueue: check the allocation of system_unbound_wq · e5cba24e
      Hitoshi Mitake authored
      I found a trivial bug in the initialization of workqueues.
      The current init_workqueues() doesn't check the result of the
      allocation of system_unbound_wq; this should be checked
      like the other queues.
      Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  13. 20 Nov, 2010 1 commit
    • Revert "kernel: make /proc/kallsyms mode 400 to reduce ease of attacking" · 33e0d57f
      Linus Torvalds authored
      This reverts commit 59365d13.
      
      It turns out that this can break certain existing user land setups.
      Quoth Sarah Sharp:
      
       "On Wednesday, I updated my branch to commit 460781b5 from linus' tree,
        and my box would not boot.  klogd segfaulted, which stalled the whole
        system.
      
        At first I thought it actually hung the box, but it continued booting
        after 5 minutes, and I was able to log in.  It dropped back to the
        text console instead of the graphical bootup display for that period
        of time.  dmesg surprisingly still works.  I've bisected the problem
        down to this commit (commit 59365d13)
      
        The box is running klogd 1.5.5ubuntu3 (from Jaunty).  Yes, I know
        that's old.  I read the bit in the commit about changing the
        permissions of kallsyms after boot, but if I can't boot that doesn't
        help."
      
      So let's just keep the old default, and encourage distributions to do
      the "chmod -r /proc/kallsyms" in their bootup scripts.  This is not
      worth a kernel option to change default behavior, since it's so easily
      done in user space.
      Reported-and-bisected-by: Sarah Sharp <sarah.a.sharp@linux.intel.com>
      Cc: Marcus Meissner <meissner@suse.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Eugene Teo <eugeneteo@kernel.org>
      Cc: Jesper Juhl <jj@chaosbits.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 18 Nov, 2010 7 commits
  15. 17 Nov, 2010 1 commit
    • kernel: make /proc/kallsyms mode 400 to reduce ease of attacking · 59365d13
      Marcus Meissner authored
      Making /proc/kallsyms readable only for root by default makes it
      slightly harder for attackers to write generic kernel exploits by
      removing one source of knowledge where things are in the kernel.
      
      This is the second submission. Discussion of the first submission
      mostly concerned the fact that this is just one hole in the sieve ...
      but one of the bigger ones.
      
      Changing the permissions of at least System.map and vmlinux is also
      required to close the same set of leaks, but that is a packaging issue.
      
      The target of this starter patch and its follow-ups is to remove any
      kind of kernel space address information leak from the kernel.
      
      [ Side note: the default of root-only reading is the "safe" value, and
        it's easy enough to then override at any time after boot.  The /proc
        filesystem allows root to change the permissions with a regular
        chmod, so you can "revert" this at run-time by simply doing
      
          chmod og+r /proc/kallsyms
      
        as root if you really want regular users to see the kernel symbols.
        It does help some tools like "perf" figure them out without any
        setup, so it may well make sense in some situations.  - Linus ]
      Signed-off-by: Marcus Meissner <meissner@suse.de>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Eugene Teo <eugeneteo@kernel.org>
      Reviewed-by: Jesper Juhl <jj@chaosbits.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 16 Nov, 2010 3 commits
  17. 13 Nov, 2010 1 commit
    • tracing: Fix recursive user stack trace · 91e86e56
      Steven Rostedt authored
      The user stack trace can fault when examining the trace, which
      would call the do_page_fault handler, which would trace again,
      which would do the user stack trace, which would fault and call
      do_page_fault again ...
      
      Thus this is causing a recursive bug. We need to have a recursion
      detector here.
      
      [ Resubmitted by Jiri Olsa ]
      
      [ Eric Dumazet recommended using __this_cpu_* instead of __get_cpu_* ]
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Jiri Olsa <jolsa@redhat.com>
      LKML-Reference: <1289390172-9730-3-git-send-email-jolsa@redhat.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
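A recursion detector of the kind the tracing commit above calls for can be sketched in user space with a simple guard counter. The kernel change uses a per-CPU counter; the plain static variable here is a single-threaded simplification, and every model_* name is hypothetical.

```c
#include <assert.h>

/* Guard counter; a per-CPU variable in the kernel, a plain static here. */
static int in_user_stack_trace;

static int traces_recorded; /* completed user stack traces */
static int faults_taken;    /* simulated page faults */

static void model_trace_user_stack(int walk_faults);

/* Simulated fault handler: it is traced too, so it re-enters the tracer. */
static void model_page_fault(void)
{
	faults_taken++;
	model_trace_user_stack(1);
}

/* Walk the user stack, bailing out if a trace is already in progress. */
static void model_trace_user_stack(int walk_faults)
{
	if (in_user_stack_trace)
		return; /* recursion detected: do not trace again */

	in_user_stack_trace++;
	if (walk_faults)
		model_page_fault(); /* the walk touches an unmapped page */
	traces_recorded++;
	in_user_stack_trace--;
}
```

Without the guard, the fault handler and the tracer would call each other indefinitely; with it, the nested call returns immediately and exactly one trace is recorded per outer call.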
  18. 12 Nov, 2010 2 commits