1. 22 February 2012, 2 commits
    • uprobes/core: Remove uprobe_opcode_sz · 96379f60
      Committed by Srikar Dronamraju
      uprobe_opcode_sz refers to the smallest instruction size for
      that architecture. UPROBES_BKPT_INSN_SIZE refers to the size of
      the breakpoint instruction for that architecture.
      
      For now we assume that uprobe_opcode_sz and
      UPROBES_BKPT_INSN_SIZE are the same on all architectures, and
      hence remove uprobe_opcode_sz in favour of UPROBES_BKPT_INSN_SIZE.
      
      However, if we have to support architectures where the smallest
      instruction size differs from the size of the breakpoint
      instruction, we may have to re-introduce uprobe_opcode_sz.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Anton Arapov <anton@redhat.com>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Josh Stone <jistone@redhat.com>
      Link: http://lkml.kernel.org/r/20120222091549.15880.67020.sendpatchset@srdronam.in.ibm.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      96379f60
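
      As a rough illustration of the naming (not part of this commit; the
      exact macro values below are assumptions based on the description
      above), the x86 side boils down to something like:

        /* Sketch only: names/values assumed, not quoted from the patch. */
        typedef u8 uprobe_opcode_t;                /* smallest patchable unit  */
        #define UPROBES_BKPT_INSN       0xcc       /* x86 breakpoint (INT3)    */
        #define UPROBES_BKPT_INSN_SIZE  1          /* size of that instruction */

        /* Copies of the original/breakpoint bytes are now sized by
         * UPROBES_BKPT_INSN_SIZE instead of a separate uprobe_opcode_sz. */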
    • uprobes: Move to kernel/events/ · a5f4374a
      Committed by Ingo Molnar
      Consolidate the uprobes code under kernel/events/, where the various
      core kernel event handling routines live.
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Jim Keniston <jkenisto@us.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Anton Arapov <anton@redhat.com>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Link: http://lkml.kernel.org/n/tip-biuyhhwohxgbp2vzbap5yr8o@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      a5f4374a
  2. 17 February 2012, 2 commits
    • uprobes/core: Clean up, refactor and improve the code · 7b2d81d4
      Committed by Ingo Molnar
      Make the uprobes code readable to me:
      
       - improve the Kconfig text so that a mere mortal gets some idea
         what CONFIG_UPROBES=y is really about
      
       - do trivial renames to standardize around the uprobes_*() namespace
      
       - clean up and simplify various code flow details
      
       - separate basic blocks of functionality
      
       - remove line break artifacts and stray whitespace
      
       - use standard local variable definition blocks
      
       - use vertical spacing to make things more readable
      
       - remove unnecessary volatile
      
       - restructure comment blocks to make them more uniform and
         more readable in general
      
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Jim Keniston <jkenisto@us.ibm.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Anton Arapov <anton@redhat.com>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Link: http://lkml.kernel.org/n/tip-ewbwhb8o6navvllsauu7k07p@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      7b2d81d4
    • uprobes, mm, x86: Add the ability to install and remove uprobes breakpoints · 2b144498
      Committed by Srikar Dronamraju
      Add uprobes support to the core kernel, with x86 support.
      
      This commit adds the kernel facilities, the actual uprobes
      user-space ABI and perf probe support comes in later commits.
      
      General design:
      
      Uprobes are maintained in an rb-tree indexed by inode and offset
      (the offset here is from the start of the mapping). For a unique
      (inode, offset) tuple, there can be at most one uprobe in the
      rb-tree.
      
      Since the (inode, offset) tuple identifies a unique uprobe, more
      than one user may be interested in the same uprobe. This provides
      the ability to connect multiple 'consumers' to the same uprobe.
      
      Each consumer defines a handler and a filter (optional). The
      'handler' is run every time the uprobe is hit, if it matches the
      'filter' criteria.
      
      The first consumer of a uprobe causes the breakpoint to be
      inserted at the specified address; each subsequent consumer is
      simply appended to the existing list of consumers. The breakpoint
      is removed when the last consumer unregisters; any other
      unregistration just removes that consumer from the list.
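
      A minimal sketch of the consumer side, assuming the handler/filter
      callbacks and the (inode, offset) registration described above
      (field and parameter names are illustrative, not quoted from the
      patch):

        struct uprobe_consumer {
        	/* run on every hit, provided the filter (if any) matched */
        	int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
        	/* optional: return true if this task is interesting */
        	bool (*filter)(struct uprobe_consumer *self, struct task_struct *task);
        	struct uprobe_consumer *next;	/* chained on the uprobe's list */
        };

        /* Register/unregister against the (inode, offset) naming the probe. */
        int  uprobe_register(struct inode *inode, loff_t offset,
        		     struct uprobe_consumer *uc);
        void uprobe_unregister(struct inode *inode, loff_t offset,
        		       struct uprobe_consumer *uc);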
      
      Given an inode, we get a list of the mms that have mapped the
      inode, and do the actual registration if the mm maps the page
      where a probe needs to be inserted or removed.
      
      We use a temporary list to walk through the vmas that map the
      inode.
      
       - The number of maps that map the inode is not known before we
         walk the rmap, and it keeps changing.
       - Extending vm_area_struct wasn't recommended; it's a
         size-critical data structure.
       - There can be more than one mapping of the inode in the same mm.
      
      We add callbacks to the mmap methods to keep an eye on text vmas
      that are of interest to uprobes.  When a vma of interest is mapped,
      we insert the breakpoint at the right address.
      
      Uprobe works by replacing the instruction at the address defined
      by (inode, offset) with the arch specific breakpoint
      instruction. We save a copy of the original instruction at the
      uprobed address.
      
      This is needed for:
      
       a. executing the instruction out-of-line (xol).
       b. instruction analysis for any subsequent fixups.
       c. restoring the instruction back when the uprobe is unregistered.
      
      We insert or delete a breakpoint instruction, and this
      breakpoint instruction is assumed to be the smallest instruction
      available on the platform. For fixed-size instruction platforms
      this is trivially true; for variable-size instruction platforms
      the breakpoint instruction is typically the smallest (often a
      single byte).
      
      Writing the instruction is done by COWing the page and changing
      the instruction during the copy, even though most platforms
      allow atomic writes of the breakpoint instruction. This also
      mirrors the behaviour of a ptrace() memory write to a PRIVATE
      file map.
      
      The core worker is derived from KSM's replace_page() logic.
      
      In essence, similar to KSM:
      
       a. allocate a new page and copy over contents of the page that
          has the uprobed vaddr
       b. modify the copy and insert the breakpoint at the required
          address
       c. switch the original page with the copy containing the
          breakpoint
       d. flush page tables.
      
      replace_page() is being replicated here because of some minor
      changes in the type of pages and also because Hugh Dickins had
      plans to improve replace_page() for KSM specific work.
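
      A condensed sketch of steps (a) to (d), modelled on the
      replace_page() idea; helper names are placeholders and error
      handling is omitted:

        static int write_breakpoint(struct vm_area_struct *vma,
        			    struct page *old_page, unsigned long vaddr)
        {
        	struct page *new_page;
        	void *kaddr;

        	/* a. allocate a new page and copy the page holding the probed vaddr */
        	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vaddr);
        	copy_highpage(new_page, old_page);

        	/* b. insert the breakpoint opcode into the copy */
        	kaddr = kmap_atomic(new_page);
        	memset(kaddr + (vaddr & ~PAGE_MASK), 0xcc, 1);	/* x86 INT3 */
        	kunmap_atomic(kaddr);

        	/* c. switch the original page with the modified copy and
        	 * d. flush the page tables (both inside the replace helper) */
        	return uprobes_replace_page(vma, old_page, new_page, vaddr);
        }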
      
      Instruction analysis on x86 is based on the instruction decoder;
      it determines whether an instruction can be probed and what
      fixups are needed after the single-step. The analysis is done at
      probe insertion time so that we avoid having to repeat the same
      analysis every time a probe is hit.
      
      A lot of code here is due to the improvement/suggestions/inputs
      from Peter Zijlstra.
      
      Changelog:
      
      (v10):
       - Add code to clear REX.B prefix as suggested by Denys Vlasenko
         and Masami Hiramatsu.
      
      (v9):
       - Use insn_offset_modrm as suggested by Masami Hiramatsu.
      
      (v7):
      
       Handle comments from Peter Zijlstra:
      
       - Don't take a reference to the inode (expect the inode passed to uprobe_register to be sane).
       - Use PTR_ERR to set the return value.
       - No need to take a reference to the inode.
       - Use PTR_ERR to return the error value.
       - uprobe_register and uprobe_unregister share code.
      
      (v5):
      
       - Modified del_consumer as per comments from Peter.
       - Drop reference to inode before dropping reference to uprobe.
       - Use i_size_read(inode) instead of inode->i_size.
       - Ensure uprobe->consumers is NULL, before __uprobe_unregister() is called.
       - Includes errno.h as recommended by Stephen Rothwell to fix a build issue
         on sparc defconfig
       - Remove restrictions while unregistering.
       - Earlier code leaked inode references under some conditions while
         registering/unregistering.
       - Continue the vma-rmap walk even if the intermediate vma doesn't
         meet the requirements.
       - Validate the vma found by find_vma before inserting/removing the
         breakpoint
       - Call del_consumer under mutex_lock.
       - Use hash locks.
       - Handle mremap.
       - Introduce find_least_offset_node() instead of close match logic in
         find_uprobe
       - Uprobes no longer depends on MM_OWNER; no reference to task_structs
         while inserting/removing a probe.
       - Uses read_mapping_page instead of grab_cache_page so that the pages
         have valid content.
       - pass NULL to get_user_pages for the task parameter.
       - call SetPageUptodate on the new page allocated in write_opcode.
       - fix leaking a reference to the new page under certain conditions.
       - Include Instruction Decoder if Uprobes gets defined.
       - Remove const attributes for instruction prefix arrays.
       - Uses mm_context to know if the application is 32 bit.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Also-written-by: Jim Keniston <jkenisto@us.ibm.com>
      Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Anton Arapov <anton@redhat.com>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Denys Vlasenko <vda.linux@googlemail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linux-mm <linux-mm@kvack.org>
      Link: http://lkml.kernel.org/r/20120209092642.GE16600@linux.vnet.ibm.com
      [ Made various small edits to the commit log ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      2b144498
  3. 11 February 2012, 1 commit
  4. 03 February 2012, 1 commit
  5. 30 January 2012, 1 commit
    • PM / Hibernate: Fix s2disk regression related to freezing workqueues · 181e9bde
      Committed by Rafael J. Wysocki
      Commit 2aede851
      
        PM / Hibernate: Freeze kernel threads after preallocating memory
      
      introduced a mechanism by which kernel threads were frozen after
      the preallocation of hibernate image memory to avoid problems with
      frozen kernel threads not responding to memory freeing requests.
      However, it overlooked the s2disk code path in which the
      SNAPSHOT_CREATE_IMAGE ioctl was run directly after SNAPSHOT_FREE,
      which caused freeze_workqueues_begin() to BUG(), because it saw
      that workqueues had already been frozen.
      
      Although in principle this issue might be addressed by removing
      the relevant BUG_ON() from freeze_workqueues_begin(), that would
      reintroduce the very problem that commit 2aede851
      attempted to avoid into that particular code path.  For this reason,
      to fix the issue at hand, introduce thaw_kernel_threads() and make
      the SNAPSHOT_FREE ioctl execute it.
      
      Special thanks to Srivatsa S. Bhat for detailed analysis of the
      problem.
      Reported-and-tested-by: Jiri Slaby <jslaby@suse.cz>
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Acked-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Cc: stable@kernel.org
      181e9bde
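
      A rough sketch of the shape of the fix, assuming the freezer
      helpers of that era (the actual patch may differ in detail); the
      SNAPSHOT_FREE ioctl handler then calls this after freeing the
      image data:

        /* Thaw only kernel threads and workqueues; user space stays frozen. */
        void thaw_kernel_threads(void)
        {
        	struct task_struct *g, *p;

        	thaw_workqueues();

        	read_lock(&tasklist_lock);
        	do_each_thread(g, p) {
        		if (p->flags & (PF_KTHREAD | PF_WQ_WORKER))
        			__thaw_task(p);
        	} while_each_thread(g, p);
        	read_unlock(&tasklist_lock);

        	schedule();
        }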
  6. 27 January 2012, 7 commits
    • sched/rt: Fix task stack corruption under __ARCH_WANT_INTERRUPTS_ON_CTXSW · cb297a3e
      Committed by Chanho Min
      This issue happens under the following conditions:
      
       1. preemption is off
       2. __ARCH_WANT_INTERRUPTS_ON_CTXSW is defined
       3. RT scheduling class
       4. SMP system
      
      Sequence is as follows:
      
       1. Suppose the current task is A; schedule() starts.
       2. Task A is enqueued as a pushable task at the entry of schedule():
         __schedule
          prev = rq->curr;
          ...
          put_prev_task
           put_prev_task_rt
            enqueue_pushable_task
       3. Task B is picked as the next task:
         next = pick_next_task(rq);
       4. rq->curr is set to task B and context_switch is started:
         rq->curr = next;
       5. At the entry of context_switch, this cpu's rq->lock is released:
         context_switch
          prepare_task_switch
           prepare_lock_switch
            raw_spin_unlock_irq(&rq->lock);
       6. Shortly after rq->lock is released, an interrupt occurs and IRQ
          context starts.
       7. try_to_wake_up(), called from the ISR, acquires rq->lock:
          try_to_wake_up
           ttwu_remote
            rq = __task_rq_lock(p)
            ttwu_do_wakeup(rq, p, wake_flags);
              task_woken_rt
       8. push_rt_task picks task A, which was enqueued before:
         task_woken_rt
          push_rt_tasks(rq)
           next_task = pick_next_pushable_task(rq)
       9. At find_lock_lowest_rq(), if double_lock_balance() returns 0,
          lowest_rq can be the remote rq.
          (But if preemption is on, double_lock_balance always returns 1
          and this doesn't happen.)
         push_rt_task
          find_lock_lowest_rq
           if (double_lock_balance(rq, lowest_rq))..
       10. find_lock_lowest_rq returns the available rq; task A is migrated
           to the remote cpu/rq:
         push_rt_task
          ...
          deactivate_task(rq, next_task, 0);
          set_task_cpu(next_task, lowest_rq->cpu);
          activate_task(lowest_rq, next_task, 0);
       11. But task A is still in IRQ context on this cpu, so task A is
           scheduled by two cpus at the same time until it returns from
           the IRQ. Task A's stack is corrupted.
      
      To fix it, don't migrate an RT task if it's still running.
      Signed-off-by: Chanho Min <chanho.min@lge.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: <stable@kernel.org>
      Link: http://lkml.kernel.org/r/CAOAMb1BHA=5fm7KTewYyke6u-8DP0iUuJMpgQw54vNeXFsGpoQ@mail.gmail.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      cb297a3e
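
      The essence of the fix, sketched: after double_lock_balance() may
      have dropped rq->lock, find_lock_lowest_rq() re-validates the task
      and now also bails out if the task is already running (the exact
      condition layout may differ from the patch):

        if (double_lock_balance(rq, lowest_rq)) {
        	/*
        	 * rq->lock was dropped; the task may have migrated, changed
        	 * affinity, been dequeued, or -- the new case -- started
        	 * running (e.g. woken from IRQ context on this cpu).
        	 */
        	if (unlikely(task_rq(task) != rq ||
        		     !cpumask_test_cpu(lowest_rq->cpu,
        				       tsk_cpus_allowed(task)) ||
        		     task_running(rq, task) ||
        		     !task->on_rq)) {
        		raw_spin_unlock(&lowest_rq->lock);
        		lowest_rq = NULL;
        		break;
        	}
        }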
    • perf: Fix broken interrupt rate throttling · e050e3f0
      Committed by Stephane Eranian
      This patch fixes the sampling interrupt throttling mechanism.
      
      It was broken in v3.2. Events were not being unthrottled. The
      unthrottling mechanism required that events be checked at each
      timer tick.
      
      This patch solves this problem and also separates:
      
        - unthrottling
        - multiplexing
        - frequency-mode period adjustments
      
      Not all of them need to be executed at each timer tick.
      
      This third version of the patch is based on my original patch +
      PeterZ proposal (https://lkml.org/lkml/2012/1/7/87).
      
      At each timer tick, for each context:
      
        - if the current CPU has throttled events, we unthrottle events
      
        - if context has frequency-based events, we adjust sampling periods
      
        - if we have reached the jiffies interval, we multiplex (rotate)
      
      We decoupled rotation (multiplexing) from frequency-mode sampling
      period adjustments.  They should not necessarily happen at the same
      rate. Multiplexing is subject to jiffies_interval (currently at 1
      but could be higher once the tunable is exposed via sysfs).
      
      We have grouped frequency-mode adjustment and unthrottling into the
      same routine to minimize code duplication. When throttled while in
      frequency mode, we scan the events only once.
      
      We have fixed the threshold enforcement code in __perf_event_overflow().
      There was a bug whereby it would allow more than the authorized rate
      because an increment of hwc->interrupts was not executed at the right
      place.
      
      The patch was tested with low sampling limit (2000) and fixed periods,
      frequency mode, overcommitted PMU.
      
      On a 2.1GHz AMD CPU:
      
       $ cat /proc/sys/kernel/perf_event_max_sample_rate
       2000
      
      We set a rate of 3000 samples/sec (2.1GHz/3000 = 700000):
      
       $ perf record -e cycles,cycles -c 700000  noploop 10
       $ perf report -D | tail -21
      
       Aggregated stats:
                 TOTAL events:      80086
                  MMAP events:         88
                  COMM events:          2
                  EXIT events:          4
              THROTTLE events:      19996
            UNTHROTTLE events:      19996
                SAMPLE events:      40000
      
       cycles stats:
                 TOTAL events:      40006
                  MMAP events:          5
                  COMM events:          1
                  EXIT events:          4
              THROTTLE events:       9998
            UNTHROTTLE events:       9998
                SAMPLE events:      20000
      
       cycles stats:
                 TOTAL events:      39996
              THROTTLE events:       9998
            UNTHROTTLE events:       9998
                SAMPLE events:      20000
      
      For 10s, the cap is 2x2000x10 = 40000 samples.
      We get exactly that: 20000 samples/event.
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Cc: <stable@kernel.org> # v3.2+
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20120126160319.GA5655@quad
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      e050e3f0
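
      A simplified sketch of the decoupled per-tick work described above
      (function and field names approximate the patch; they are not
      quoted verbatim):

        void perf_event_task_tick(void)
        {
        	struct list_head *head = &__get_cpu_var(rotation_list);
        	struct perf_cpu_context *cpuctx, *tmp;
        	int throttled;

        	/* did any event on this CPU hit the interrupt rate limit? */
        	throttled = __this_cpu_xchg(perf_throttled_count, 0);

        	list_for_each_entry_safe(cpuctx, tmp, head, rotation_list) {
        		/* unthrottling + frequency-mode adjustment, one pass */
        		perf_adjust_freq_unthr_context(&cpuctx->ctx, throttled);
        		if (cpuctx->task_ctx)
        			perf_adjust_freq_unthr_context(cpuctx->task_ctx, throttled);

        		/* multiplexing runs on its own jiffies interval */
        		if (!(jiffies % cpuctx->jiffies_interval))
        			perf_rotate_context(cpuctx);
        	}
        }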
    • sched: Fix ancient race in do_exit() · b5740f4b
      Committed by Yasunori Goto
      try_to_wake_up() has a problem: it may change a task's state from
      TASK_DEAD back to TASK_RUNNING in a race condition with an SMI or
      in a guest environment of a virtual machine. As a result, the
      exited task is scheduled again and a panic occurs.
      
      Here is the sequence how it occurs:
      
       ----------------------------------+-----------------------------
                                         |
                  CPU A                  |             CPU B
       ----------------------------------+-----------------------------
      
      TASK A calls exit()....
      
      do_exit()
      
        exit_mm()
          down_read(mm->mmap_sem);
      
          rwsem_down_failed_common()
      
            set TASK_UNINTERRUPTIBLE
            set waiter.task <= task A
            list_add to sem->wait_list
                 :
            raw_spin_unlock_irq()
            (I/O interruption occurred)
      
                                            __rwsem_do_wake(mmap_sem)
      
                                              list_del(&waiter->list);
                                              waiter->task = NULL
                                              wake_up_process(task A)
                                                try_to_wake_up()
                                                   (task is still
                                                     TASK_UNINTERRUPTIBLE)
                                                    p->on_rq is still 1.)
      
                                                    ttwu_do_wakeup()
                                                       (*A)
                                                         :
           (I/O interruption handler finished)
      
            if (!waiter.task)
                schedule() is not called
                due to waiter.task is NULL.
      
            tsk->state = TASK_RUNNING
      
                :
                                                    check_preempt_curr();
                                                        :
        task->state = TASK_DEAD
                                                    (*B)
                                              <---    set TASK_RUNNING (*C)
      
           schedule()
           (exit task is running again)
           BUG_ON() is called!
       --------------------------------------------------------
      
      The execution time between (*A) and (*B) is usually very short,
      because interrupts are disabled, and setting TASK_RUNNING at (*C)
      must be executed before setting TASK_DEAD.
      
      HOWEVER, if an SMI interrupts between (*A) and (*B),
      (*C) can be executed AFTER TASK_DEAD has been set!
      Then the exited task is scheduled again, and BUG_ON() is called....
      
      If the system runs as a guest in a virtual machine, the time
      between (*A) and (*B) may also be long due to hypervisor scheduling,
      and the same phenomenon can occur.
      
      With this patch, do_exit() waits for task->pi_lock, which is used
      in try_to_wake_up(), to be released. This guarantees that the task
      becomes TASK_DEAD only after the wakeup has completed.
      Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20120117174031.3118.E1E9C6FF@jp.fujitsu.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      b5740f4b
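
      The core of the fix, sketched: before marking itself TASK_DEAD,
      the exiting task waits for any in-flight try_to_wake_up(), which
      holds ->pi_lock, to finish (comments condensed; details may differ
      from the patch):

        /* do_exit(), near the end -- sketch */
        smp_mb();
        raw_spin_unlock_wait(&tsk->pi_lock);	/* wait out a concurrent wakeup,
        					   so its delayed TASK_RUNNING store
        					   (*C) cannot land after this point */

        /* causes the final put_task_struct() in finish_task_switch() */
        tsk->state = TASK_DEAD;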
    • bugs, x86: Fix printk levels for panic, softlockups and stack dumps · b0f4c4b3
      Committed by Prarit Bhargava
      rsyslog will display KERN_EMERG messages on a connected
      terminal.  However, these messages are useless/undecipherable
      for a general user.
      
      For example, after a softlockup we get:
      
       Message from syslogd@intel-s3e37-04 at Jan 25 14:18:06 ...
       kernel:Stack:
      
       Message from syslogd@intel-s3e37-04 at Jan 25 14:18:06 ...
       kernel:Call Trace:
      
       Message from syslogd@intel-s3e37-04 at Jan 25 14:18:06 ...
       kernel:Code: ff ff a8 08 75 25 31 d2 48 8d 86 38 e0 ff ff 48 89
       d1 0f 01 c8 0f ae f0 48 8b 86 38 e0 ff ff a8 08 75 08 b1 01 4c 89 e0 0f 01 c9 <e8> ea 69 dd ff 4c 29 e8 48 89 c7 e8 0f bc da ff 49 89 c4 49 89
      
      This happens because the printk levels for these messages are
      incorrect. Only an informational message should be displayed on
      a terminal.
      
      I modified the printk levels for various messages in the kernel
      and tested the output by using the drivers/misc/lkdtm.c kernel
      modules (ie, softlockups, panics, hard lockups, etc.) and
      confirmed that the console output was still the same and that
      the output to the terminals was correct.
      
      For example, in the case of a softlockup we now see the much
      more informative:
      
       Message from syslogd@intel-s3e37-04 at Jan 25 10:18:06 ...
       BUG: soft lockup - CPU4 stuck for 60s!
      
      instead of the above confusing messages.
      
      AFAICT, the messages no longer have to be KERN_EMERG.  In the
      most important case of a panic we set console_verbose().  As for
      the other less severe cases the correct data is output to the
      console and /var/log/messages.
      
      Successfully tested by me using the drivers/misc/lkdtm.c module.
      Signed-off-by: Prarit Bhargava <prarit@redhat.com>
      Cc: dzickus@redhat.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1327586134-11926-1-git-send-email-prarit@redhat.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      b0f4c4b3
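
      Illustrative of the direction of the change only; the exact call
      sites and replacement levels are those of the patch and are not
      reproduced here:

        /* Before: the raw dump lines screamed at emergency level, so
         * rsyslog pushed undecipherable hex to every logged-in terminal. */
        printk(KERN_EMERG "Code: ");

        /* After: the dump itself uses a normal level; only the one-line
         * summary (e.g. "BUG: soft lockup - ...") stays loud. */
        printk(KERN_DEFAULT "Code: ");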
    • sched/nohz: Fix nohz cpu idle load balancing state with cpu hotplug · 71325960
      Committed by Suresh Siddha
      With the recent nohz scheduler changes, the rq's nohz flag
      'NOHZ_TICK_STOPPED' and its associated state don't get cleared
      immediately after the cpu exits idle; they are cleared as part
      of the next tick seen on that cpu.
      
      For the cpu offline support, we need to clear this state
      manually. Fix it by registering a cpu notifier, which clears the
      nohz idle load balance state for this rq explicitly during the
      CPU_DYING notification.
      
      There won't be any nohz updates for that cpu after the
      CPU_DYING notification. But let's be extra paranoid and skip
      updating the nohz state in select_nohz_load_balancer() if
      the cpu is no longer in the active state.
      Reported-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Reviewed-and-tested-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1327026538.16150.40.camel@sbsiddha-desk.sc.intel.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      71325960
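
      A sketch of the notifier described above (the notifier and the
      helper that clears the nohz state are named by assumption):

        static int __cpuinit sched_ilb_notifier(struct notifier_block *nfb,
        					unsigned long action, void *hcpu)
        {
        	switch (action & ~CPU_TASKS_FROZEN) {
        	case CPU_DYING:
        		/* clear NOHZ_TICK_STOPPED and the idle load balance state */
        		clear_nohz_tick_stopped(smp_processor_id());
        		return NOTIFY_OK;
        	default:
        		return NOTIFY_DONE;
        	}
        }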
    • sched/s390: Fix compile error in sched/core.c · db7e527d
      Committed by Christian Borntraeger
      Commit 029632fb ("sched: Make
      separate sched*.c translation units") removed the include of
      asm/mutex.h from sched.c.
      
      This breaks the combination of:
      
       CONFIG_MUTEX_SPIN_ON_OWNER=yes
       CONFIG_HAVE_ARCH_MUTEX_CPU_RELAX=yes
      
      like s390 without mutex debugging:
      
        CC      kernel/sched/core.o
        kernel/sched/core.c: In function ‘mutex_spin_on_owner’:
        kernel/sched/core.c:3287: error: implicit declaration of function ‘arch_mutex_cpu_relax’
      
      Let's re-add the include to kernel/sched/core.c.
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1326268696-30904-1-git-send-email-borntraeger@de.ibm.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      db7e527d
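
      The fix is essentially a one-line include; a sketch:

        /* kernel/sched/core.c -- sketch */
        #include <asm/mutex.h>	/* provides arch_mutex_cpu_relax() when the
        			   architecture (e.g. s390) selects
        			   CONFIG_HAVE_ARCH_MUTEX_CPU_RELAX; otherwise
        			   linux/mutex.h falls back to cpu_relax() */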
    • sched: Fix rq->nr_uninterruptible update race · 4ca9b72b
      Committed by Peter Zijlstra
      KOSAKI Motohiro noticed the following race:
      
       > CPU0                    CPU1
       > --------------------------------------------------------
       > deactivate_task()
       >                         task->state = TASK_UNINTERRUPTIBLE;
       > activate_task()
       >    rq->nr_uninterruptible--;
       >
       >                         schedule()
       >                           deactivate_task()
       >                             rq->nr_uninterruptible++;
       >
      
      Kosaki-San's scenario is possible when CPU0 runs
      __sched_setscheduler() against CPU1's current @task.
      
      __sched_setscheduler() does a dequeue/enqueue in order to move
      the task to its new queue (position) to reflect the newly provided
      scheduling parameters. However, it should be completely invariant
      with respect to nr_uninterruptible accounting: sched_setscheduler()
      doesn't affect readiness to run, merely policy on when to run.
      
      So convert the inappropriate activate/deactivate_task usage to
      enqueue/dequeue_task, which avoids the nr_uninterruptible accounting.
      
      Also convert the two other sites, __migrate_task() and
      normalize_task(), that still use activate/deactivate_task. These sites
      aren't really a problem since __migrate_task() will only be called on
      a non-running task (and is therefore immune to the described problem)
      and normalize_task() isn't ever used on regular systems.
      
      Also remove the comments from activate/deactivate_task since they're
      misleading at best.
      Reported-by: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1327486224.2614.45.camel@laptop
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      4ca9b72b
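
      A sketch of the conversion in __sched_setscheduler() (locking and
      permission checks omitted; names follow the scheduler code of that
      time):

        on_rq = p->on_rq;
        running = task_current(rq, p);
        if (on_rq)
        	dequeue_task(rq, p, 0);		/* was: deactivate_task() */
        if (running)
        	p->sched_class->put_prev_task(rq, p);

        __setscheduler(rq, p, policy, param->sched_priority);

        if (running)
        	p->sched_class->set_curr_task(rq);
        if (on_rq)
        	enqueue_task(rq, p, 0);		/* was: activate_task() */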
  7. 24 January 2012, 3 commits
  8. 23 January 2012, 1 commit
    • net: introduce res_counter_charge_nofail() for socket allocations · 0e90b31f
      Committed by Glauber Costa
      There is a case in __sk_mem_schedule() where an allocation
      is beyond the maximum, but we are still allowed to proceed.
      It happens under the following condition:
      
      	sk->sk_wmem_queued + size >= sk->sk_sndbuf
      
      The network code won't revert the allocation in this case,
      meaning that at some point later it'll try to do it. Since
      this is never communicated to the underlying res_counter
      code, there is an imbalance in the res_counter uncharge operation.
      
      I see two ways of fixing this:
      
      1) storing the information about those allocations somewhere
         in memcg, and then deducting from that first, before
         we start draining the res_counter,
      2) providing a slightly different allocation function for
         the res_counter, that matches the original behavior of
         the network code more closely.
      
      I decided to go for #2 here, believing it to be more elegant,
      since #1 would require us to do basically that, but in a more
      obscure way.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      CC: Tejun Heo <tj@kernel.org>
      CC: Li Zefan <lizf@cn.fujitsu.com>
      CC: Laurent Chavey <chavey@google.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0e90b31f
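
      A sketch of option #2: a charge variant that never fails but still
      reports when a limit was exceeded, so the later uncharge stays
      balanced (simplified relative to the real helper):

        int res_counter_charge_nofail(struct res_counter *counter, unsigned long val,
        			      struct res_counter **limit_fail_at)
        {
        	struct res_counter *c;
        	unsigned long flags;
        	int ret = 0;

        	*limit_fail_at = NULL;
        	local_irq_save(flags);
        	for (c = counter; c != NULL; c = c->parent) {
        		spin_lock(&c->lock);
        		if (c->usage + val > c->limit && !ret) {
        			*limit_fail_at = c;	/* over the limit ...    */
        			ret = -ENOMEM;		/* ... report it ...     */
        		}
        		c->usage += val;		/* ... but charge anyway */
        		if (c->usage > c->max_usage)
        			c->max_usage = c->usage;
        		spin_unlock(&c->lock);
        	}
        	local_irq_restore(flags);

        	return ret;
        }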
  9. 21 January 2012, 2 commits
  10. 20 January 2012, 1 commit
  11. 18 January 2012, 19 commits