1. 21 3月, 2011 1 次提交
    • S
      introduce sys_syncfs to sync a single file system · b7ed78f5
      Sage Weil 提交于
      It is frequently useful to sync a single file system, instead of all
      mounted file systems via sync(2):
      
       - On machines with many mounts, it is not at all uncommon for some of
         them to hang (e.g. unresponsive NFS server).  sync(2) will get stuck on
         those and may never get to the one you do care about (e.g., /).
       - Some applications write lots of data to the file system and then
         want to make sure it is flushed to disk.  Calling fsync(2) on each
         file introduces unnecessary ordering constraints that result in a large
         amount of sub-optimal writeback/flush/commit behavior by the file
         system.
      
      There are currently two ways (that I know of) to sync a single super_block:
      
       - BLKFLSBUF ioctl on the block device: That also invalidates the bdev
         mapping, which isn't usually desirable, and doesn't work for non-block
         file systems.
       - 'mount -o remount,rw' will call sync_filesystem as an artifact of the
         current implemention.  Relying on this little-known side effect for
         something like data safety sounds foolish.
      
      Both of these approaches require root privileges, which some applications
      do not have (nor should they need?) given that sync(2) is an unprivileged
      operation.
      
      This patch introduces a new system call syncfs(2) that takes an fd and
      syncs only the file system it references.  Maybe someday we can
      
       $ sync /some/path
      
      and not get
      
       sync: ignoring all arguments
      
      The syscall is motivated by comments by Al and Christoph at the last LSF.
      syncfs(2) seems like an appropriate name given statfs(2).
      
      A similar ioctl was also proposed a while back, see
      	http://marc.info/?l=linux-fsdevel&m=127970513829285&w=2Signed-off-by: NSage Weil <sage@newdream.net>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b7ed78f5
  2. 18 3月, 2011 3 次提交
    • N
      x86, dumpstack: Correct stack dump info when frame pointer is available · e8e999cf
      Namhyung Kim 提交于
      Current stack dump code scans entire stack and check each entry
      contains a pointer to kernel code. If CONFIG_FRAME_POINTER=y it
      could mark whether the pointer is valid or not based on value of
      the frame pointer. Invalid entries could be preceded by '?' sign.
      
      However this was not going to happen because scan start point
      was always higher than the frame pointer so that they could not
      meet.
      
      Commit 9c0729dc ("x86: Eliminate bp argument from the stack
      tracing routines") delayed bp acquisition point, so the bp was
      read in lower frame, thus all of the entries were marked
      invalid.
      
      This patch fixes this by reverting above commit while retaining
      stack_frame() helper as suggested by Frederic Weisbecker.
      
      End result looks like below:
      
      before:
      
       [    3.508329] Call Trace:
       [    3.508551]  [<ffffffff814f35c9>] ? panic+0x91/0x199
       [    3.508662]  [<ffffffff814f3739>] ? printk+0x68/0x6a
       [    3.508770]  [<ffffffff81a981b2>] ? mount_block_root+0x257/0x26e
       [    3.508876]  [<ffffffff81a9821f>] ? mount_root+0x56/0x5a
       [    3.508975]  [<ffffffff81a98393>] ? prepare_namespace+0x170/0x1a9
       [    3.509216]  [<ffffffff81a9772b>] ? kernel_init+0x1d2/0x1e2
       [    3.509335]  [<ffffffff81003894>] ? kernel_thread_helper+0x4/0x10
       [    3.509442]  [<ffffffff814f6880>] ? restore_args+0x0/0x30
       [    3.509542]  [<ffffffff81a97559>] ? kernel_init+0x0/0x1e2
       [    3.509641]  [<ffffffff81003890>] ? kernel_thread_helper+0x0/0x10
      
      after:
      
       [    3.522991] Call Trace:
       [    3.523351]  [<ffffffff814f35b9>] panic+0x91/0x199
       [    3.523468]  [<ffffffff814f3729>] ? printk+0x68/0x6a
       [    3.523576]  [<ffffffff81a981b2>] mount_block_root+0x257/0x26e
       [    3.523681]  [<ffffffff81a9821f>] mount_root+0x56/0x5a
       [    3.523780]  [<ffffffff81a98393>] prepare_namespace+0x170/0x1a9
       [    3.523885]  [<ffffffff81a9772b>] kernel_init+0x1d2/0x1e2
       [    3.523987]  [<ffffffff81003894>] kernel_thread_helper+0x4/0x10
       [    3.524228]  [<ffffffff814f6880>] ? restore_args+0x0/0x30
       [    3.524345]  [<ffffffff81a97559>] ? kernel_init+0x0/0x1e2
       [    3.524445]  [<ffffffff81003890>] ? kernel_thread_helper+0x0/0x10
      
       -v5:
         * fix build breakage with oprofile
      
       -v4:
         * use 0 instead of regs->bp
         * separate out printk changes
      
       -v3:
         * apply comment from Frederic
         * add a couple of printk fixes
      Signed-off-by: NNamhyung Kim <namhyung@gmail.com>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Soren Sandmann <ssp@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Robert Richter <robert.richter@amd.com>
      LKML-Reference: <1300416006-3163-1-git-send-email-namhyung@gmail.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      e8e999cf
    • L
      x86: Fix common misspellings · 0d2eb44f
      Lucas De Marchi 提交于
      They were generated by 'codespell' and then manually reviewed.
      Signed-off-by: NLucas De Marchi <lucas.demarchi@profusion.mobi>
      Cc: trivial@kernel.org
      LKML-Reference: <1300389856-1099-3-git-send-email-lucas.demarchi@profusion.mobi>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      0d2eb44f
    • S
      KVM guest: Fix section mismatch derived from kvm_guest_cpu_online() · 775077a0
      Sedat Dilek 提交于
      WARNING: arch/x86/built-in.o(.text+0x1bb74): Section mismatch in reference from the function kvm_guest_cpu_online() to the function .cpuinit.text:kvm_guest_cpu_init()
      The function kvm_guest_cpu_online() references
      the function __cpuinit kvm_guest_cpu_init().
      This is often because kvm_guest_cpu_online lacks a __cpuinit
      annotation or the annotation of kvm_guest_cpu_init is wrong.
      
      This patch fixes the warning.
      
      Tested with linux-next (next-20101231)
      Signed-off-by: NSedat Dilek <sedat.dilek@gmail.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      775077a0
  3. 17 3月, 2011 3 次提交
  4. 16 3月, 2011 4 次提交
  5. 15 3月, 2011 3 次提交
    • M
      x86: stop_machine_text_poke() should issue sync_core() · 0e00f7ae
      Mathieu Desnoyers 提交于
      Intel Archiecture Software Developer's Manual section 7.1.3 specifies that a
      core serializing instruction such as "cpuid" should be executed on _each_ core
      before the new instruction is made visible.
      
      Failure to do so can lead to unspecified behavior (Intel XMC erratas include
      General Protection Fault in the list), so we should avoid this at all cost.
      
      This problem can affect modified code executed by interrupt handlers after
      interrupt are re-enabled at the end of stop_machine, because no core serializing
      instruction is executed between the code modification and the moment interrupts
      are reenabled.
      
      Because stop_machine_text_poke performs the text modification from the first CPU
      decrementing stop_machine_first, modified code executed in thread context is
      also affected by this problem. To explain why, we have to split the CPUs in two
      categories: the CPU that initiates the text modification (calls text_poke_smp)
      and all the others. The scheduler, executed on all other CPUs after
      stop_machine, issues an "iret" core serializing instruction, and therefore
      handles core serialization for all these CPUs. However, the text modification
      initiator can continue its execution on the same thread and access the modified
      text without any scheduler call. Given that the CPU that initiates the code
      modification is not guaranteed to be the one actually performing the code
      modification, it falls into the XMC errata.
      
      Q: Isn't this executed from an IPI handler, which will return with IRET (a
         serializing instruction) anyway?
      A: No, now stop_machine uses per-cpu workqueue, so that handler will be
         executed from worker threads. There is no iret anymore.
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      LKML-Reference: <20110303160137.GB1590@Krystal>
      Reviewed-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: <stable@kernel.org>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
      0e00f7ae
    • A
      x86: Add new syscalls for x86_32 · 7dadb755
      Aneesh Kumar K.V 提交于
      This patch adds new syscalls to x86_32
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      7dadb755
    • R
      PM: Drop pm_flags that is not necessary · 6831c6ed
      Rafael J. Wysocki 提交于
      The variable pm_flags is used to prevent APM from being enabled
      along with ACPI, which would lead to problems.  However, acpi_init()
      is always called before apm_init() and after acpi_init() has
      returned, it is known whether or not ACPI will be used.  Namely, if
      acpi_disabled is not set after acpi_init() has returned, this means
      that ACPI is enabled.  Thus, it is sufficient to check acpi_disabled
      in apm_init() to prevent APM from being enabled in parallel with
      ACPI.
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      Acked-by: NLen Brown <len.brown@intel.com>
      6831c6ed
  6. 12 3月, 2011 9 次提交
  7. 11 3月, 2011 2 次提交
  8. 10 3月, 2011 2 次提交
    • S
      ftrace/graph: Trace function entry before updating index · 722b3c74
      Steven Rostedt 提交于
      Currently the index to the ret_stack is updated and the real return address
      is saved in the ret_stack. Then we call the trace function. The trace
      function could decide that it doesn't want to trace this function
      (ex. set_graph_function does not match) and it will return 0 which means
      not to trace this call.
      
      The normal function graph tracer has this code:
      
      	if (!(trace->depth || ftrace_graph_addr(trace->func)) ||
      	      ftrace_graph_ignore_irqs())
      		return 0;
      
      What this states is, if the trace depth (which is curr_ret_stack)
      is zero (top of nested functions) then test if we want to trace this
      function. If this function is not to be traced, then return  0 and
      the rest of the function graph tracer logic will not trace this function.
      
      The problem arises when an interrupt comes in after we updated the
      curr_ret_stack. The next function that gets called will have a trace->depth
      of 1. Which fools this trace code into thinking that we are in a nested
      function, and that we should trace. This causes interrupts to be traced
      when they should not be.
      
      The solution is to trace the function first and then update the ret_stack.
      Reported-by: Nzhiping zhong <xzhong86@163.com>
      Reported-by: Nwu zhangjin <wuzhangjin@gmail.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      722b3c74
    • N
      [CPUFREQ] pcc-cpufreq: don't load driver if get_freq fails during init. · 1f858ef2
      Naga Chumbalkar 提交于
      Return 0 on failure. This will cause the initialization of the driver
      to fail and prevent the driver from loading if the BIOS cannot handle
      the PCC interface command to "get frequency". Otherwise, the driver
      will load and display a very high value like "4294967274" (which is
      actually -EINVAL) for frequency:
      
      # cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
      4294967274
      Signed-off-by: NNaga Chumbalkar <nagananda.chumbalkar@hp.com>
      CC: stable@kernel.org
      Signed-off-by: NDave Jones <davej@redhat.com>
      1f858ef2
  9. 09 3月, 2011 4 次提交
    • N
      x86: Don't check for BIOS corruption in first 64K when there's no need to · a7bd1daf
      Naga Chumbalkar 提交于
      Due to commit 781c5a67 it is
      likely that the number of areas to scan for BIOS corruption is 0
       -- especially when the first 64K is already reserved
      (X86_RESERVE_LOW is 64K by default).
      
      If that's the case then don't set up the scan.
      Signed-off-by: NNaga Chumbalkar <nagananda.chumbalkar@hp.com>
      Cc: <stable@kernel.org>
      LKML-Reference: <20110225202838.2229.71011.sendpatchset@nchumbalkar.americas.hpqcorp.net>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      a7bd1daf
    • S
      x86: Fix binutils-2.21 symbol related build failures · 2ae9d293
      Sedat Dilek 提交于
      New binutils version 2.21.0.20110302-1 started checking that the symbol
      parameter to the .size directive matches the entry name's
      symbol parameter, unearthing two mismatches:
      
        AS      arch/x86/kernel/acpi/wakeup_rm.o
        arch/x86/kernel/acpi/wakeup_rm.S: Assembler messages:
        arch/x86/kernel/acpi/wakeup_rm.S:12: Error: .size expression with symbol `wakeup_code_start' does not evaluate to a constant
      
        arch/x86/kernel/entry_32.S: Assembler messages:
        arch/x86/kernel/entry_32.S:1421: Error: .size expression with
        symbol `apf_page_fault' does not evaluate to a constant
      
      The problem was discovered while using Debian's binutils
      (2.21.0.20110302-1) and experimenting with binutils from
      upstream.
      
      Thanks Alexander and H.J. for the vital help.
      Signed-off-by: NSedat Dilek <sedat.dilek@gmail.com>
      Cc: Alexander van Heukelum <heukelum@fastmail.fm>
      Cc: H.J. Lu <hjl.tools@gmail.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      LKML-Reference: <1299620364-21644-1-git-send-email-sedat.dilek@gmail.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      2ae9d293
    • J
      kprobes: Disabling optimized kprobes for entry text section · 2a8247a2
      Jiri Olsa 提交于
      You can crash the kernel (with root/admin privileges) using kprobe tracer by running:
      
       echo "p system_call_after_swapgs" > ./kprobe_events
       echo 1 > ./events/kprobes/enable
      
      The reason is that at the system_call_after_swapgs label, the
      kernel stack is not set up. If optimized kprobes are enabled,
      the user space stack is being used in this case (see optimized
      kprobe template) and this might result in a crash.
      
      There are several places like this over the entry code
      (entry_$BIT). As it seems there's no any reasonable/maintainable
      way to disable only those places where the stack is not ready, I
      switched off the whole entry code from kprobe optimizing.
      Signed-off-by: NJiri Olsa <jolsa@redhat.com>
      Acked-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: acme@redhat.com
      Cc: fweisbec@gmail.com
      Cc: ananth@in.ibm.com
      Cc: davem@davemloft.net
      Cc: a.p.zijlstra@chello.nl
      Cc: eric.dumazet@gmail.com
      Cc: 2nddept-manager@sdl.hitachi.co.jp
      LKML-Reference: <1298298313-5980-3-git-send-email-jolsa@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      2a8247a2
    • J
      x86: Separate out entry text section · ea714547
      Jiri Olsa 提交于
      Put x86 entry code into a separate link section: .entry.text.
      
      Separating the entry text section seems to have performance
      benefits - caused by more efficient instruction cache usage.
      
      Running hackbench with perf stat --repeat showed that the change
      compresses the icache footprint. The icache load miss rate went
      down by about 15%:
      
       before patch:
               19417627  L1-icache-load-misses      ( +-   0.147% )
      
       after patch:
               16490788  L1-icache-load-misses      ( +-   0.180% )
      
      The motivation of the patch was to fix a particular kprobes
      bug that relates to the entry text section, the performance
      advantage was discovered accidentally.
      
      Whole perf output follows:
      
       - results for current tip tree:
      
        Performance counter stats for './hackbench/hackbench 10' (500 runs):
      
               19417627  L1-icache-load-misses      ( +-   0.147% )
             2676914223  instructions             #      0.497 IPC     ( +- 0.079% )
             5389516026  cycles                     ( +-   0.144% )
      
            0.206267711  seconds time elapsed   ( +-   0.138% )
      
       - results for current tip tree with the patch applied:
      
        Performance counter stats for './hackbench/hackbench 10' (500 runs):
      
               16490788  L1-icache-load-misses      ( +-   0.180% )
             2717734941  instructions             #      0.502 IPC     ( +- 0.079% )
             5414756975  cycles                     ( +-   0.148% )
      
            0.206747566  seconds time elapsed   ( +-   0.137% )
      Signed-off-by: NJiri Olsa <jolsa@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: masami.hiramatsu.pt@hitachi.com
      Cc: ananth@in.ibm.com
      Cc: davem@davemloft.net
      Cc: 2nddept-manager@sdl.hitachi.co.jp
      LKML-Reference: <20110307181039.GB15197@jolsa.redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      ea714547
  10. 05 3月, 2011 2 次提交
  11. 04 3月, 2011 4 次提交
    • A
      perf: Fix LLC-* events on Intel Nehalem/Westmere · e994d7d2
      Andi Kleen 提交于
      On Intel Nehalem and Westmere CPUs the generic perf LLC-* events count the
      L2 caches, not the real L3 LLC - this was inconsistent with behavior on
      other CPUs.
      
      Fixing this requires the use of the special OFFCORE_RESPONSE
      events which need a separate mask register.
      
      This has been implemented by the previous patch, now use this infrastructure
      to set correct events for the LLC-* on Nehalem and Westmere.
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NLin Ming <ming.m.lin@intel.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1299119690-13991-3-git-send-email-ming.m.lin@intel.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      e994d7d2
    • A
      perf: Add support for supplementary event registers · a7e3ed1e
      Andi Kleen 提交于
      Change logs against Andi's original version:
      
      - Extends perf_event_attr:config to config{,1,2} (Peter Zijlstra)
      - Fixed a major event scheduling issue. There cannot be a ref++ on an
        event that has already done ref++ once and without calling
        put_constraint() in between. (Stephane Eranian)
      - Use thread_cpumask for percore allocation. (Lin Ming)
      - Use MSR names in the extra reg lists. (Lin Ming)
      - Remove redundant "c = NULL" in intel_percore_constraints
      - Fix comment of perf_event_attr::config1
      
      Intel Nehalem/Westmere have a special OFFCORE_RESPONSE event
      that can be used to monitor any offcore accesses from a core.
      This is a very useful event for various tunings, and it's
      also needed to implement the generic LLC-* events correctly.
      
      Unfortunately this event requires programming a mask in a separate
      register. And worse this separate register is per core, not per
      CPU thread.
      
      This patch:
      
      - Teaches perf_events that OFFCORE_RESPONSE needs extra parameters.
        The extra parameters are passed by user space in the
        perf_event_attr::config1 field.
      
      - Adds support to the Intel perf_event core to schedule per
        core resources. This adds fairly generic infrastructure that
        can be also used for other per core resources.
        The basic code has is patterned after the similar AMD northbridge
        constraints code.
      
      Thanks to Stephane Eranian who pointed out some problems
      in the original version and suggested improvements.
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NLin Ming <ming.m.lin@intel.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1299119690-13991-2-git-send-email-ming.m.lin@intel.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      a7e3ed1e
    • S
      perf_events: Update PEBS event constraints · 17e31629
      Stephane Eranian 提交于
      This patch updates PEBS event constraints for Intel Atom, Nehalem, Westmere.
      
      This patch also reorganizes the PEBS format/constraint detection code. It is
      now based on processor model and not PEBS format. Two processors may use the
      same PEBS format without have the same list of PEBS events.
      
      In this second version, we simplified the initialization of the PEBS
      constraints by leveraging the existing switch() statement in perf_event_intel.c.
      We also renamed the constraint tables to be more consistent with regular
      constraints.
      
      In this 3rd version, we drop BR_INST_RETIRED.MISPRED from Intel Atom as it does
      not seem to work. Use MISPREDICTED_BRANCH_RETIRED instead. Also add FP_ASSIST.*
      o both Intel Nehalem and Westmere. I misssed those in the earlier patches.
      Events were tested using libpfm4 perf_examples.
      Signed-off-by: NStephane Eranian <eranian@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <4d6e6b02.815bdf0a.637b.07a7@mx.google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      17e31629
    • T
      x86-64, NUMA: Revert NUMA affine page table allocation · f8911250
      Tejun Heo 提交于
      This patch reverts NUMA affine page table allocation added by commit
      1411e0ec (x86-64, numa: Put pgtable to local node memory).
      
      The commit made an undocumented change where the kernel linear mapping
      strictly follows intersection of e820 memory map and NUMA
      configuration.  If the physical memory configuration has holes or NUMA
      nodes are not properly aligned, this leads to using unnecessarily
      smaller mapping size which leads to increased TLB pressure.  For
      details,
      
        http://thread.gmane.org/gmane.linux.kernel/1104672
      
      Patches to fix the problem have been proposed but the underlying code
      needs more cleanup and the approach itself seems a bit heavy handed
      and it has been determined to revert the feature for now and come back
      to it in the next developement cycle.
      
        http://thread.gmane.org/gmane.linux.kernel/1105959
      
      As init_memory_mapping_high() callsites have been consolidated since
      the commit, reverting is done manually.  Also, the RED-PEN comment in
      arch/x86/mm/init.c is not restored as the problem no longer exists
      with memblock based top-down early memory allocation.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      f8911250
  12. 03 3月, 2011 1 次提交
  13. 02 3月, 2011 2 次提交