1. 23 January 2011, 1 commit
    • genirq: Add IRQ affinity notifiers · cd7eab44
      Committed by Ben Hutchings
      When initiating I/O on a multiqueue and multi-IRQ device, we may want
      to select a queue for which the response will be handled on the same
      or a nearby CPU.  This requires a reverse-map of IRQ affinity.  Add a
      notification mechanism to support this.
      
      This is based closely on work by Thomas Gleixner <tglx@linutronix.de>.
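
      For illustration, a driver might hook into the new mechanism along these
      lines (a sketch assuming the interface consists of a struct
      irq_affinity_notify with notify/release callbacks registered via
      irq_set_affinity_notifier(); the my_* names are made up):

      	#include <linux/interrupt.h>

      	struct my_queue {
      		struct irq_affinity_notify affinity_notify;
      		/* ... per-queue state, e.g. the CPU->queue reverse map ... */
      	};

      	static void my_affinity_notify(struct irq_affinity_notify *notify,
      				       const cpumask_t *mask)
      	{
      		struct my_queue *q = container_of(notify, struct my_queue,
      						  affinity_notify);

      		/* rebuild q's reverse map so I/O submitted on a CPU in
      		 * 'mask' is steered to this queue */
      	}

      	static void my_affinity_release(struct kref *ref)
      	{
      		/* the notifier is embedded in my_queue; nothing to free */
      	}

      	static int my_queue_setup(struct my_queue *q, unsigned int irq)
      	{
      		q->affinity_notify.notify = my_affinity_notify;
      		q->affinity_notify.release = my_affinity_release;
      		return irq_set_affinity_notifier(irq, &q->affinity_notify);
      	}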
      Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
      Cc: linux-net-drivers@solarflare.com
      Cc: Tom Herbert <therbert@google.com>
      Cc: David Miller <davem@davemloft.net>
      LKML-Reference: <1295470904.11126.84.camel@bwh-desktop>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      cd7eab44
  2. 21 January 2011, 3 commits
    • genirq: Remove __do_IRQ · 1c77ff22
      Committed by Thomas Gleixner
      All architectures are finally converted. Remove the cruft.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Greg Ungerer <gerg@uclinux.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Acked-by: David Howells <dhowells@redhat.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Chen Liqin <liqin.chen@sunplusct.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      1c77ff22
    • kernel/smp.c: consolidate writes in smp_call_function_interrupt() · 225c8e01
      Committed by Milton Miller
      We have to test the cpu mask in the interrupt handler before checking the
      refs, otherwise we can start to follow an entry before it is deleted and
      find it partially initialized for the next trip.  Presently we also clear
      the cpumask bit before executing the called function, which implies
      getting write access to the line.  After the function is called we then
      decrement refs, and if they go to zero we then unlock the structure.
      
      However, this implies getting write access to the call function data both
      before and after the function is called.  If we can assert that no function
      executed via smp_call_function is allowed to enable interrupts, then we can
      move both writes to after the function is called, hopefully allowing both
      writes with one cache-line bounce.
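
      Roughly, the reordering described above looks like this inside the
      interrupt handler (the "before" form matches the code quoted in the next
      changelog entry; the "after" form is an illustrative sketch, not the
      exact committed diff):

      	/* before: read-modify-write of the mask, call the function,
      	 * then a second dirtying write for the refs */
      	if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
      		continue;
      	data->csd.func(data->csd.info);
      	refs = atomic_dec_return(&data->refs);

      	/* after: only read the mask up front; clear our bit and drop our
      	 * ref together once the function has returned */
      	if (!cpumask_test_cpu(cpu, data->cpumask))
      		continue;
      	data->csd.func(data->csd.info);
      	cpumask_clear_cpu(cpu, data->cpumask);
      	refs = atomic_dec_return(&data->refs);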
      
      On a 256 thread system with a kernel compiled for 1024 threads, the time
      to execute the test case in the "smp_call_function_many race" changelog was
      reduced by about 30-40ms out of about 545 ms.
      
      I decided to keep this as a WARN because it now indicates a buggy function,
      even though the stack trace is of no value -- a simple printk would give us
      the information needed.
      
      Raw data:
      
      Without patch:
        ipi_test startup took 1219366ns complete 539819014ns total 541038380ns
        ipi_test startup took 1695754ns complete 543439872ns total 545135626ns
        ipi_test startup took 7513568ns complete 539606362ns total 547119930ns
        ipi_test startup took 13304064ns complete 533898562ns total 547202626ns
        ipi_test startup took 8668192ns complete 544264074ns total 552932266ns
        ipi_test startup took 4977626ns complete 548862684ns total 553840310ns
        ipi_test startup took 2144486ns complete 541292318ns total 543436804ns
        ipi_test startup took 21245824ns complete 530280180ns total 551526004ns
      
      With patch:
        ipi_test startup took 5961748ns complete 500859628ns total 506821376ns
        ipi_test startup took 8975996ns complete 495098924ns total 504074920ns
        ipi_test startup took 19797750ns complete 492204740ns total 512002490ns
        ipi_test startup took 14824796ns complete 487495878ns total 502320674ns
        ipi_test startup took 11514882ns complete 494439372ns total 505954254ns
        ipi_test startup took 8288084ns complete 502570774ns total 510858858ns
        ipi_test startup took 6789954ns complete 493388112ns total 500178066ns
      
      	#include <linux/module.h>
      	#include <linux/init.h>
      	#include <linux/sched.h> /* sched clock */
      
      	#define ITERATIONS 100
      
      	static void do_nothing_ipi(void *dummy)
      	{
      	}
      
      	static void do_ipis(struct work_struct *dummy)
      	{
      		int i;
      
      		for (i = 0; i < ITERATIONS; i++)
      			smp_call_function(do_nothing_ipi, NULL, 1);
      
      		printk(KERN_DEBUG "cpu %d finished\n", smp_processor_id());
      	}
      
      	static struct work_struct work[NR_CPUS];
      
      	static int __init testcase_init(void)
      	{
      		int cpu;
      		u64 start, started, done;
      
      		start = local_clock();
      		for_each_online_cpu(cpu) {
      			INIT_WORK(&work[cpu], do_ipis);
      			schedule_work_on(cpu, &work[cpu]);
      		}
      		started = local_clock();
      		for_each_online_cpu(cpu)
      			flush_work(&work[cpu]);
      		done = local_clock();
      		pr_info("ipi_test startup took %lldns complete %lldns total %lldns\n",
      			started-start, done-started, done-start);
      
      		return 0;
      	}
      
      	static void __exit testcase_exit(void)
      	{
      	}
      
      	module_init(testcase_init)
      	module_exit(testcase_exit)
      	MODULE_LICENSE("GPL");
      	MODULE_AUTHOR("Anton Blanchard");
      Signed-off-by: Milton Miller <miltonm@bga.com>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      225c8e01
    • kernel/smp.c: fix smp_call_function_many() SMP race · 6dc19899
      Committed by Anton Blanchard
      I noticed a failure where we hit the following WARN_ON in
      generic_smp_call_function_interrupt:
      
                      if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
                              continue;
      
                      data->csd.func(data->csd.info);
      
                      refs = atomic_dec_return(&data->refs);
                      WARN_ON(refs < 0);      <-------------------------
      
      We atomically tested and cleared our bit in the cpumask, and yet the
      number of cpus left (ie refs) was 0.  How can this be?
      
      It turns out commit 54fdade1
      ("generic-ipi: make struct call_function_data lockless") is at fault.  It
      removes locking from smp_call_function_many and in doing so creates a
      rather complicated race.
      
      The problem comes about because:
      
       - The smp_call_function_many interrupt handler walks call_function.queue
         without any locking.
       - We reuse a percpu data structure in smp_call_function_many.
       - We do not wait for any RCU grace period before starting the next
         smp_call_function_many.
      
      Imagine a scenario where CPU A does two smp_call_functions back to back,
      and CPU B does an smp_call_function in between.  We concentrate on how CPU
      C handles the calls:
      
      CPU A               CPU B              CPU C                      CPU D

      smp_call_function
                                             smp_call_function_interrupt
                                                 walks call_function.queue
                                                 sees data from CPU A on list

                          smp_call_function

                                             smp_call_function_interrupt
                                                 walks call_function.queue
                                                 sees (stale) CPU A on list
                                                                        smp_call_function int
                                                                        clears last ref on A
                                                                        list_del_rcu, unlock
      smp_call_function reuses
      percpu *data A
                                             data->cpumask sees and
                                             clears bit in cpumask
                                             might be using old or new fn!
                                             decrements refs below 0

      set data->refs (too late!)
      
      The important thing to note is that since the interrupt handler walks a
      potentially stale call_function.queue without any locking, another
      cpu can view the percpu *data structure at any time, even when the owner
      is in the process of initialising it.
      
      The following test case hits the WARN_ON 100% of the time on my PowerPC
      box (having 128 threads does help :)
      
      #include <linux/module.h>
      #include <linux/init.h>
      
      #define ITERATIONS 100
      
      static void do_nothing_ipi(void *dummy)
      {
      }
      
      static void do_ipis(struct work_struct *dummy)
      {
      	int i;
      
      	for (i = 0; i < ITERATIONS; i++)
      		smp_call_function(do_nothing_ipi, NULL, 1);
      
      	printk(KERN_DEBUG "cpu %d finished\n", smp_processor_id());
      }
      
      static struct work_struct work[NR_CPUS];
      
      static int __init testcase_init(void)
      {
      	int cpu;
      
      	for_each_online_cpu(cpu) {
      		INIT_WORK(&work[cpu], do_ipis);
      		schedule_work_on(cpu, &work[cpu]);
      	}
      
      	return 0;
      }
      
      static void __exit testcase_exit(void)
      {
      }
      
      module_init(testcase_init)
      module_exit(testcase_exit)
      MODULE_LICENSE("GPL");
      MODULE_AUTHOR("Anton Blanchard");
      
      I tried to fix it by ordering the read and the write of ->cpumask and
      ->refs.  In doing so I missed a critical case but Paul McKenney was able
      to spot my bug, thankfully :) To ensure we aren't viewing previous
      iterations, the interrupt handler needs to read ->refs, then ->cpumask,
      then ->refs _again_.
      
      Thanks to Milton Miller and Paul McKenney for helping to debug this issue.
      
      [miltonm@bga.com: add WARN_ON and BUG_ON, remove extra read of refs before initial read of mask that doesn't help (also noted by Peter Zijlstra), adjust comments, hopefully clarify scenario ]
      [miltonm@bga.com: remove excess tests]
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Milton Miller <miltonm@bga.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: <stable@kernel.org> [2.6.32+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6dc19899
  3. 20 January 2011, 2 commits
  4. 19 January 2011, 3 commits
  5. 18 January 2011, 7 commits
  6. 15 January 2011, 2 commits
  7. 14 January 2011, 20 commits
    • rcu: avoid pointless blocked-task warnings · b24efdfd
      Committed by Paul E. McKenney
      If the RCU callback-processing kthread has nothing to do, it parks in
      a wait_event().  If RCU remains idle for more than two minutes, the
      kernel complains about this.  This commit changes from wait_event()
      to wait_event_interruptible() to prevent the kernel from complaining
      just because RCU is idle.
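
      The change itself is essentially a one-liner; a sketch (the waitqueue and
      condition names are illustrative):

      	/* The blocked-task detector only complains about tasks sleeping
      	 * in TASK_UNINTERRUPTIBLE, so an interruptible sleep keeps an
      	 * idle RCU kthread off its radar. */
      	wait_event_interruptible(rcu_kthread_wq, have_rcu_kthread_work);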
      Reported-by: Russell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Thomas Weber <weber@corscience.de>
      Tested-by: Russell King <rmk+kernel@arm.linux.org.uk>
      b24efdfd
    • rcu: demote SRCU_SYNCHRONIZE_DELAY from kernel-parameter status · c072a388
      Committed by Paul E. McKenney
      Because the adaptive synchronize_srcu_expedited() approach has
      worked very well in testing, remove the kernel parameter and
      replace it by a C-preprocessor macro.  If someone finds problems
      with this approach, a more complex and aggressively adaptive
      approach might be required.
      
      Longer term, SRCU will be merged with the other RCU implementations,
      at which point synchronize_srcu_expedited() will be event driven,
      just as synchronize_sched_expedited() currently is.  At that point,
      there will be no need for this adaptive approach.
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      c072a388
    • cgroups: Fix a lockdep warning at cgroup removal · 3ec762ad
      Committed by Li Zefan
      Commit 2fd6b7f5 ("fs: dcache scale subdirs") forgot to annotate a dentry
      lock, which caused a lockdep warning.
      Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      3ec762ad
    • thp: khugepaged · ba76149f
      Committed by Andrea Arcangeli
      Add khugepaged to relocate fragmented pages into hugepages if new
      hugepages become available.  (This is independent of the defrag logic
      that will have to make new hugepages available.)
      
      The fundamental reason why khugepaged is unavoidable is that some memory
      can be fragmented and not everything can be relocated.  So when a virtual
      machine quits and releases gigabytes of hugepages, we want to use those
      freely available hugepages to create huge-pmd in the other virtual
      machines that may be running on fragmented memory, to maximize the CPU
      efficiency at all times.  The scan is slow; it takes nearly zero cpu time,
      except when it copies data (in which case we definitely want to pay for
      that cpu time), so it seems a good tradeoff.
      
      In addition to the hugepages being released by other processes releasing
      memory, we have the strong suspicion that the performance impact of
      potentially defragmenting hugepages during or before each page fault could
      lead to more performance inconsistency than allocating small pages at
      first and having them collapsed into large pages later... if they prove
      themselves to be long-lived mappings (the khugepaged scan is slow, so
      short-lived mappings have a low probability of running into khugepaged
      compared to long-lived mappings).
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ba76149f
    • thp: add pmd_huge_pte to mm_struct · e7a00c45
      Committed by Andrea Arcangeli
      This increases the size of the mm struct a bit, but it is needed to
      preallocate one pte for each hugepage so that split_huge_page will not
      require a fail path.  Guarantee of success is a fundamental property of
      split_huge_page, to avoid decreasing swapping reliability and to avoid
      adding -ENOMEM fail paths that would otherwise force the hugepage-unaware
      VM code to learn rolling back in the middle of its pte mangling operations
      (if anything, we need it to learn to handle pmd_trans_huge natively rather
      than to be capable of rollback).  When split_huge_page runs, a pte is
      needed for the split to succeed, to map the newly split regular pages with
      a regular pte.  This way all existing VM code remains backwards compatible
      by just adding a split_huge_page* one-liner.  The memory waste of those
      preallocated ptes is negligible, so it is worth it.
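
      A sketch of the new field (the guarding config option and the lock named
      in the comment are assumptions here):

      	struct mm_struct {
      		/* ... existing fields ... */
      	#ifdef CONFIG_TRANSPARENT_HUGEPAGE
      		pgtable_t pmd_huge_pte;	/* protected by page_table_lock */
      	#endif
      	};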
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e7a00c45
    • thp: update futex compound knowledge · a5b338f2
      Committed by Andrea Arcangeli
      Futex code is smarter than most other gup_fast O_DIRECT code and knows
      about the compound internals.  However, now doing a put_page(head_page)
      will not release the pin on the tail page taken by gup-fast, leading to
      all sorts of refcounting bugchecks.  Getting a stable head_page is a little
      tricky.
      
      page_head = page is there because if this is not a tail page it's also the
      page_head.  Only if this is a tail page is compound_head called; otherwise
      it's guaranteed unnecessary.  And if it's a tail page, compound_head has to
      run atomically inside the irq-disabled section of __get_user_pages_fast
      before returning.  Otherwise ->first_page won't be a stable pointer.
      
      Disabling irqs before __get_user_pages_fast and re-enabling them after
      running compound_head is needed because if __get_user_pages_fast returns
      == 1, it means the huge pmd is established and cannot go away from under
      us.  pmdp_splitting_flush_notify in __split_huge_page_splitting will have
      to wait for local_irq_enable before the IPI delivery can return.  This
      means __split_huge_page_refcount can't be running from under us, and in
      turn when we run compound_head(page) we're not reading a dangling pointer
      from tailpage->first_page.  Then after we get to a stable head page, we
      are always safe to call compound_lock, and after taking the compound lock
      on the head page we can finally re-check whether the page returned by
      gup-fast is still a tail page, in which case we're set and we didn't need
      to split the hugepage in order to take a futex on it.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a5b338f2
    • oom: allow a non-CAP_SYS_RESOURCE proces to oom_score_adj down · dabb16f6
      Committed by Mandeep Singh Baines
      We'd like to be able to oom_score_adj a process up/down as it
      enters/leaves the foreground.  Currently, it is not possible to oom_adj
      down without CAP_SYS_RESOURCE.  This patch allows a task to decrease its
      oom_score_adj back to the value that a CAP_SYS_RESOURCE thread set it to
      or its inherited value at fork.  Assuming the thread that has forked it
      has oom_score_adj of 0, each process could decrease it back from 0 upon
      activation unless a CAP_SYS_RESOURCE thread elevated it to something
      higher.
      
      Alternative considered:
      
      * a setuid binary
      * a daemon with CAP_SYS_RESOURCE
      
      Since you don't want all processes to be able to reduce their oom_adj, a
      setuid or daemon implementation would be complex.  The alternatives also
      have much higher overhead.
      
      This patch is updated from the original patch based on feedback from
      David Rientjes.
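
      The resulting check in the /proc oom_score_adj write path looks roughly
      like this (a sketch; the oom_score_adj_min field name and the error label
      are assumptions):

      	/* Unprivileged callers may only lower oom_score_adj back down to
      	 * the minimum recorded for the task (its value at fork, or
      	 * whatever a CAP_SYS_RESOURCE thread last set). */
      	if (oom_score_adj < task->signal->oom_score_adj_min &&
      	    !capable(CAP_SYS_RESOURCE)) {
      		err = -EACCES;
      		goto err_sighand;
      	}

      	task->signal->oom_score_adj = oom_score_adj;
      	if (has_capability_noaudit(current, CAP_SYS_RESOURCE))
      		task->signal->oom_score_adj_min = oom_score_adj;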
      Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dabb16f6
    • sched: remove long deprecated CLONE_STOPPED flag · 43bb40c9
      Committed by Dave Jones
      This warning was added in commit bdff746a ("clone: prepare to recycle
      CLONE_STOPPED") three years ago.  2.6.26 came and went.  As far as I know,
      no-one is actually using CLONE_STOPPED.
      Signed-off-by: Dave Jones <davej@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      43bb40c9
    • irq: use per_cpu kstat_irqs · 6c9ae009
      Committed by Eric Dumazet
      Use modern per_cpu API to increment {soft|hard}irq counters, and use
      per_cpu allocation for (struct irq_desc)->kstats_irq instead of an array.
      
      This gives better SMP/NUMA locality and saves a few instructions per irq.
      
      With small nr_cpu_ids values (8 for example), kstats_irq was a small array
      (less than L1_CACHE_BYTES), potentially a source of false sharing.
      
      In the !CONFIG_SPARSE_IRQ case, remove the huge, NUMA/cache unfriendly
      kstat_irqs_all[NR_IRQS][NR_CPUS] array.
      
      Note: we still populate kstats_irq for all possible irqs in
      early_irq_init().  We could probably use on-demand allocations (code is
      included in alloc_descs()).  The problem is that not all IRQs are used
      with a prior alloc_descs() call.
      
      kstat_irqs_this_cpu() is not used anymore, remove it.
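
      In outline, the conversion replaces the static array with a per-cpu
      allocation plus a per-cpu increment (a sketch using the generic per_cpu
      API; the helper names are illustrative):

      	/* allocation, once per descriptor, instead of a slot in a
      	 * [NR_IRQS][NR_CPUS] array */
      	static int alloc_desc_kstat_irqs(struct irq_desc *desc)
      	{
      		desc->kstat_irqs = alloc_percpu(unsigned int);
      		return desc->kstat_irqs ? 0 : -ENOMEM;
      	}

      	/* hot path: bump only the local CPU's counter, no shared line */
      	static void kstat_this_cpu_incr_irq(struct irq_desc *desc)
      	{
      		__this_cpu_inc(*desc->kstat_irqs);
      	}

      	/* slow path: sum over CPUs for /proc/interrupts and friends */
      	unsigned int kstat_irqs(unsigned int irq)
      	{
      		struct irq_desc *desc = irq_to_desc(irq);
      		unsigned int sum = 0;
      		int cpu;

      		for_each_possible_cpu(cpu)
      			sum += *per_cpu_ptr(desc->kstat_irqs, cpu);
      		return sum;
      	}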
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6c9ae009
    • pps: capture MONOTONIC_RAW timestamps as well · e2c18e49
      Committed by Alexander Gordeev
      MONOTONIC_RAW clock timestamps are ideally suited for frequency
      calculation and also fit well into the original NTP hardpps design.  Now
      phase and frequency can be adjusted separately: the former based on the
      REALTIME clock and the latter based on the MONOTONIC_RAW clock.
      
      A new function getnstime_raw_and_real is added to the timekeeping subsystem to
      capture both timestamps at the same time and atomically.
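
      A sketch of the intended use from the PPS/hardpps path (the prototype
      follows the description above; variable names are illustrative):

      	struct timespec ts_raw, ts_real;

      	/* one pass over the timekeeper yields both clocks, so the two
      	 * timestamps describe the same instant */
      	getnstime_raw_and_real(&ts_raw, &ts_real);

      	/* phase is steered against the REALTIME timestamp, frequency
      	 * against the MONOTONIC_RAW one */
      	hardpps(&ts_real, &ts_raw);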
      Signed-off-by: Alexander Gordeev <lasaine@lvk.cs.msu.su>
      Acked-by: John Stultz <johnstul@us.ibm.com>
      Cc: Rodolfo Giometti <giometti@enneenne.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e2c18e49
    • ntp: add hardpps implementation · 025b40ab
      Committed by Alexander Gordeev
      This commit adds a hardpps() implementation based upon the original one
      from the NTPv4 reference kernel code from David Mills.  However, it is
      highly optimized towards very fast synchronization and maximum stickiness
      to the PPS signal.  The typical error is less than a microsecond.
      
      To make it sync faster I had to throw away the exponential phase filter,
      so that the full phase offset is corrected immediately.  I also had to
      throw away the median phase filter, because it gives a bigger error by
      itself when used without the exponential filter.
      
      Maybe we will find an appropriate filtering scheme in the future but it's
      not necessary if the signal quality is ok.
      Signed-off-by: Alexander Gordeev <lasaine@lvk.cs.msu.su>
      Acked-by: John Stultz <johnstul@us.ibm.com>
      Cc: Rodolfo Giometti <giometti@enneenne.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      025b40ab
    • taskstats: use better ifdef for alignment · 9ab020cf
      Committed by Jeff Mahoney
      Commit 4be2c95d ("taskstats: pad taskstats netlink response for aligment
      issues on ia64") added a null field to align the taskstats structure but
      the discussion centered around ia64.  The issue exists on other platforms
      with inefficient unaligned access and adding them piecemeal would be an
      unmaintainable mess.
      
      This patch uses Dave Miller's suggestion of using a combination of
      CONFIG_64BIT && !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS to determine
      whether alignment is needed.
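
      The conditional boils down to something like this (a sketch; the padding
      helper and the reuse of the TASKSTATS_TYPE_NULL attribute from the
      earlier ia64 fix are assumptions):

      	/* Only pad on 64-bit kernels that cannot do unaligned loads
      	 * cheaply; x86_64 and powerpc keep the old, unpadded layout. */
      	#if defined(CONFIG_64BIT) && !defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)
      	#define TASKSTATS_NEEDS_PADDING 1
      	#endif

      	static int taskstats_pad(struct sk_buff *skb)
      	{
      	#ifdef TASKSTATS_NEEDS_PADDING
      		/* a zero-length NULL attribute: its 4-byte header pushes
      		 * the following taskstats payload to an 8-byte boundary */
      		return nla_put(skb, TASKSTATS_TYPE_NULL, 0, NULL);
      	#else
      		return 0;
      	#endif
      	}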
      
      Note that this will cause breakage on those platforms with applications
      like iotop which had hard-coded offsets into the packet to access the
      taskstats structure.
      
      The message seen on systems without the alignment fixes looks like: kernel
      unaligned access to 0xe000023879dca9bc, ip=0xa000000100133d10
      
      The addresses may vary but resolve to locations inside __delayacct_add_tsk.
      
      iotop makes what I'd call unreasonable assumptions about the contents of a
      netlink genetlink packet containing generic attributes.  They're typed and
      have headers that specify value lengths, so the client can (should)
      identify and skip the ones the client doesn't understand.
      
      The kernel, as of version 2.6.36, presented a packet like so:
      +--------------------------------+
      | genlmsghdr - 4 bytes           |
      +--------------------------------+
      | NLA header - 4 bytes           | /* Aggregate header */
      +-+------------------------------+
      | | NLA header - 4 bytes         | /* PID header */
      | +------------------------------+
      | | pid/tgid   - 4 bytes         |
      | +------------------------------+
      | | NLA header - 4 bytes         | /* stats header */
      | + -----------------------------+ <- oops. aligned on 4 byte boundary
      | | struct taskstats - 328 bytes |
      +-+------------------------------+
      
      The iotop code expects that the kernel will behave as it did then,
      assuming that the packet format is set in stone.  The format is set in
      stone, but the packet offsets are not.  There's nothing in the packet
      format that guarantees that the packet will be sent in exactly the same
      way.  The attribute contents are set (or versioned) and the aggregate
      contents are set but they can be anywhere in the packet.
      
      The issue here isn't that an unaligned structure gets passed to userspace,
      it's that the NLA infrastructure has something of a weakness: The 4 byte
      attribute header may force the payload to be unaligned.  The taskstats
      structure is created at an unaligned location and then 64-bit values are
      operated on inside the kernel, so the unaligned access warnings gets
      spewed everywhere.
      
      It's possible to use the unaligned access API to operate on the structure
      in the kernel but it seems like a wasted effort to work around userspace
      code that isn't following the packet format.  Any new additions would also
      need to be worked around.  It's a maintenance nightmare.
      
      The conclusion of the earlier discussion seemed to be "ok fine, if we have
      to break it, don't break it on arches that don't have the problem." Dave
      pointed out that the unaligned access problem doesn't only exist on ia64,
      but also on other 64-bit arches that don't have efficient unaligned access
      and it should be fixed there as well.  The committed version of the patch
      and this addition keep with the conclusion of that discussion not to break
      it unnecessarily, which the pid padding and the packet padding fixes did
      do.  x86_64 and powerpc don't suffer this problem so they shouldn't suffer
      the solution.  Other 64-bit architectures do and will, though.
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reported-by: David S. Miller <davem@davemloft.net>
      Acked-by: David S. Miller <davem@davemloft.net>
      Cc: Dan Carpenter <error27@gmail.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Florian Mickler <florian@mickler.org>
      Cc: Guillaume Chazarain <guichaz@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9ab020cf
    • user_ns: improve the user_ns on-the-slab packaging · 6164281a
      Committed by Pavel Emelyanov
      Currently on a 64-bit arch the user_namespace is 2096 bytes, and when
      kmalloc-ed it resides on a 4k slab, wasting 2003 bytes.
      
      If we allocate a separate cache for it and reduce the hash size from 128
      to 64 chains the packaging becomes *much* better - the struct is 1072
      bytes and the hole between is 98 bytes.
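
      The cache setup is then essentially the following (a sketch; the cachep
      variable and init function names are illustrative, and per the bracketed
      note below module_init is used rather than __initcall):

      	#include <linux/slab.h>
      	#include <linux/user_namespace.h>

      	static struct kmem_cache *user_ns_cachep;

      	/* allocations then use
      	 * kmem_cache_zalloc(user_ns_cachep, GFP_KERNEL) instead of
      	 * kzalloc(sizeof(struct user_namespace), GFP_KERNEL) */

      	static __init int user_namespaces_init(void)
      	{
      		user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC);
      		return 0;
      	}
      	module_init(user_namespaces_init);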
      
      [akpm@linux-foundation.org: s/__initcall/module_init/]
      Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
      Acked-by: Serge E. Hallyn <serge@hallyn.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6164281a
    • sysctl: remove obsolete comments · e020e742
      Committed by Jovi Zhang
      ctl_unnumbered.txt has been removed from the Documentation directory, so
      also remove these now-invalid comments.
      
      [akpm@linux-foundation.org: fix Documentation/sysctl/00-INDEX, per Dave]
      Signed-off-by: Jovi Zhang <bookjovi@gmail.com>
      Cc: Dave Young <hidave.darkstar@gmail.com>
      Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e020e742
    • fs/proc/base.c, kernel/latencytop.c: convert sprintf_symbol() to %ps · 34e49d4f
      Committed by Joe Perches
      Use a temporary lr for struct latency_record for improved readability and
      fewer columns used.  Also remove a trailing space from the output.
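
      The conversion has this general shape (a before/after sketch; the
      seq_file handle, buffer, and index names are illustrative):

      	/* before: resolve the symbol into a temporary buffer by hand */
      	char sym[KSYM_SYMBOL_LEN];

      	sprint_symbol(sym, lr->backtrace[q]);
      	seq_printf(m, " %s", sym);

      	/* after: let vsprintf resolve the symbol via the %ps extension */
      	seq_printf(m, " %ps", (void *)lr->backtrace[q]);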
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Joe Perches <joe@perches.com>
      Cc: Jiri Kosina <trivial@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      34e49d4f
    • printk: use RCU to prevent potential lock contention in kmsg_dump · fb842b00
      Committed by Huang Ying
      dump_list_lock is used to protect dump_list in the kmsg_dumper
      implementation, and kmsg_dump() uses it to traverse dump_list too.  But if
      there is contention on the lock, kmsg_dump() will fail, and valuable
      kernel messages may be lost.
      
      This patch solves this issue with RCU.  Because kmsg_dump() only reads
      the list, no lock is needed in kmsg_dump(), so it will never fail because
      of lock contention.
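
      A sketch of the reader side after the conversion (the dumper callback
      signature shown is the one in use at the time; registration and
      unregistration keep using dump_list_lock to modify the list):

      	void kmsg_dump(enum kmsg_dump_reason reason)
      	{
      		struct kmsg_dumper *dumper;
      		const char *s1 = NULL, *s2 = NULL;
      		unsigned long l1 = 0, l2 = 0;

      		/* ... point s1/l1 and s2/l2 at the (possibly wrapped)
      		 * log buffer contents, exactly as before ... */

      		/* no dump_list_lock here: traversal is safe under RCU
      		 * even while another CPU registers or unregisters a
      		 * dumper */
      		rcu_read_lock();
      		list_for_each_entry_rcu(dumper, &dump_list, list)
      			dumper->dump(dumper, reason, s1, l1, s2, l2);
      		rcu_read_unlock();
      	}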
      Signed-off-by: Huang Ying <ying.huang@intel.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fb842b00
    • kptr_restrict for hiding kernel pointers from unprivileged users · 455cd5ab
      Committed by Dan Rosenberg
      Add the %pK printk format specifier and the /proc/sys/kernel/kptr_restrict
      sysctl.
      
      The %pK format specifier is designed to hide exposed kernel pointers,
      specifically via /proc interfaces.  Exposing these pointers provides an
      easy target for kernel write vulnerabilities, since they reveal the
      locations of writable structures containing easily triggerable function
      pointers.  The behavior of %pK depends on the kptr_restrict sysctl.
      
      If kptr_restrict is set to 0, there is no deviation from the standard %p
      behavior.  If kptr_restrict is set to 1 (the default), kernel pointers
      printed using %pK are shown as 0's unless the current user (intended to be
      a reader via seq_printf(), etc.) has CAP_SYSLOG (currently in the LSM
      tree).  If kptr_restrict is set to 2, kernel pointers using %pK are
      printed as 0's regardless of privileges.  Replacing with 0's was chosen
      over the default "(null)", which cannot be parsed by userland %p, which
      expects "(nil)".
      
      [akpm@linux-foundation.org: check for IRQ context when !kptr_restrict, save an indent level, s/WARN/WARN_ONCE/]
      [akpm@linux-foundation.org: coding-style fixup]
      [randy.dunlap@oracle.com: fix kernel/sysctl.c warning]
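
      Usage is the same as %p from the caller's side; only the format character
      changes (a sketch of a /proc seq_file show handler; the function name is
      made up, and the seq_file pointer itself is printed just for illustration):

      	static int foo_proc_show(struct seq_file *m, void *v)
      	{
      		/* printed as the real address or as zeros, depending on
      		 * /proc/sys/kernel/kptr_restrict and the reader's
      		 * CAP_SYSLOG */
      		seq_printf(m, "seq_file at %pK\n", m);
      		return 0;
      	}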
      Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
      Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Thomas Graf <tgraf@infradead.org>
      Cc: Eugene Teo <eugeneteo@kernel.org>
      Cc: Kees Cook <kees.cook@canonical.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Eric Paris <eparis@parisplace.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      455cd5ab
    • kernel: clean up USE_GENERIC_SMP_HELPERS · 351f8f8e
      Committed by Amerigo Wang
      An arch which needs USE_GENERIC_SMP_HELPERS has to select it, rather than
      leaving the choice to the user, since such arches don't provide their own
      implementations.

      Also, move on_each_cpu() to kernel/smp.c; it is strange to put it in
      kernel/softirq.c.

      For an arch which doesn't use USE_GENERIC_SMP_HELPERS, e.g. blackfin, only
      on_each_cpu() is compiled.
      Signed-off-by: Amerigo Wang <amwang@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      351f8f8e
    • kmsg_dump: add kmsg_dump() calls to the reboot, halt, poweroff and emergency_restart paths · 04c6862c
      Committed by Seiji Aguchi
      For support purposes we need to know the reason why a system rebooted.
      However, we can't inform our customers of the reason because the final
      messages are lost on the current Linux kernel.

      This patch improves the situation: the final messages are now saved by
      adding kmsg_dump() calls to the reboot, halt, poweroff and
      emergency_restart paths.
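
      Schematically, the reboot path gains a call just before control passes to
      the machine-specific code (a sketch; KMSG_DUMP_RESTART is one of the new
      reason codes, and the halt/poweroff/emergency_restart paths get the
      equivalent treatment):

      	void kernel_restart(char *cmd)
      	{
      		kernel_restart_prepare(cmd);
      		if (!cmd)
      			printk(KERN_EMERG "Restarting system.\n");
      		else
      			printk(KERN_EMERG "Restarting system with command '%s'.\n", cmd);
      		/* capture the final messages before they are lost */
      		kmsg_dump(KMSG_DUMP_RESTART);
      		machine_restart(cmd);
      	}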
      Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Marco Stornelli <marco.stornelli@gmail.com>
      Reviewed-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      04c6862c
  8. 13 January 2011, 1 commit
  9. 12 January 2011, 1 commit