1. 01 Nov 2011, 1 commit
    • Cross Memory Attach · fcf63409
      Committed by Christopher Yeoh
      The basic idea behind cross memory attach is to allow MPI programs doing
      intra-node communication to do a single copy of the message rather than a
      double copy of the message via shared memory.
      
      The following patch attempts to achieve this by allowing a destination
      process, given an address and size from a source process, to copy memory
      directly from the source process into its own address space via a system
      call.  There is also a symmetrical ability to copy from the current
      process's address space into a destination process's address space.
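
      As a rough illustration of the intended usage (not part of this patch),
      a destination process could pull a buffer out of a source process like
      this; the sketch assumes __NR_process_vm_readv is defined for the
      architecture and calls it via syscall(), since no libc wrapper exists
      yet:

        #include <sys/types.h>
        #include <sys/uio.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        /* Copy 'len' bytes from 'remote_addr' in process 'pid' into
         * 'local_buf'.  Illustrative sketch only. */
        static ssize_t pull_from_peer(pid_t pid, void *local_buf,
                                      void *remote_addr, size_t len)
        {
                struct iovec local  = { .iov_base = local_buf,   .iov_len = len };
                struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

                return syscall(__NR_process_vm_readv, pid,
                               &local, 1UL,    /* local iovec array, 1 entry  */
                               &remote, 1UL,   /* remote iovec array, 1 entry */
                               0UL);           /* flags, currently unused     */
        }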
      
      - Use of /proc/pid/mem has been considered, but there are issues with
        using it:
        - Does not allow specifying iovecs for both src and dest; even
          assuming preadv or pwritev were implemented, either the area read
          from or the area written to would need to be contiguous.
        - Currently mem_read only allows processes that are currently
          ptrace'ing the target (and are still able to ptrace it) to read
          from the target.  This check could possibly be moved to the open
          call, but it's not clear exactly what race this restriction is
          stopping (the reason appears to have been lost).
        - Having to send the fd of /proc/self/mem via SCM_RIGHTS on a unix
          domain socket is a bit ugly from a userspace point of view,
          especially when you may have hundreds if not (eventually) thousands
          of processes that all need to do this with each other.
        - Doesn't allow for some future uses of the interface we would like
          to add later (see below).
        - Interestingly, reading from /proc/pid/mem currently actually
          involves two copies!  (But this could be fixed pretty easily.)
      
      As mentioned previously, use of vmsplice instead was considered, but it
      has problems.  Since you need the reader and writer working
      co-operatively, if the pipe is not drained then you block, which
      requires some wrapping to do non-blocking sends or polling on the
      receive side.  In all-to-all communication it requires ordering,
      otherwise you can deadlock.  And in the example of many MPI tasks
      writing to one MPI task, vmsplice serialises the copying.
      
      There are some cases of MPI collectives where even a single-copy
      interface does not get us all the performance gain we could have.  For
      example, in an MPI_Reduce, rather than copy the data from the source we
      would like to use it directly in a math op (say the reduce is doing a
      sum), as this would save us doing a copy.  We don't need to keep a copy
      of the data from the source.  I haven't implemented this, but I think
      this interface could do all of that in the future through the use of
      the flags - e.g. you could specify the math operation and type, and the
      kernel, rather than just copying the data, would apply the specified
      operation between the source and destination and store the result in
      the destination.
      
      Although we don't have a "second user" of the interface (though I've
      had some nibbles from people who may be interested in using it for
      intra-process messaging which is not MPI), this interface is something
      which hardware vendors are already implementing in their custom drivers
      for fast local communication.  So in addition to being useful for
      OpenMPI, it would mean the driver maintainers don't have to fix things
      up when the mm changes.
      
      There was some discussion about how much faster a true zero copy would
      go. Here's a link back to the email with some testing I did on that:
      
      http://marc.info/?l=linux-mm&m=130105930902915&w=2
      
      There is a basic man page for the proposed interface here:
      
      http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt
      
      This has been implemented for x86 and powerpc; other architectures
      should mainly (I think) just need to add syscall numbers for
      process_vm_readv and process_vm_writev.  There are 32-bit compatibility
      versions for 64-bit kernels.
      
      For arch maintainers there are some simple tests to be able to quickly
      verify that the syscalls are working correctly here:
      
      http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz
      Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: <linux-man@vger.kernel.org>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 28 Oct 2011, 1 commit
    • compat: sync compat_stats with statfs. · 1448c721
      Committed by Eric W. Biederman
      This was found by inspection while tracking a similar
      bug in compat_statfs64 that has been fixed in mainline
      since December.
      
      - This fixes a bug where not all of the f_spare fields
        were cleared on mips and s390.
      - Add the f_flags field to struct compat_statfs.
      - Copy f_flags to userspace in case someone cares.
      - Use __clear_user to clear the f_spare field in userspace,
        ensuring that all of its elements are cleared.  On some
        architectures f_spare has 5 ints and on others it has only
        4, which makes the previous technique of clearing each int
        individually broken (see the sketch after this list).
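
      A hedged sketch of that last point (the helper name is made up; only
      the f_flags/f_spare handling is shown):

        #include <linux/compat.h>
        #include <linux/statfs.h>
        #include <linux/uaccess.h>

        /* Clear the whole f_spare[] array in the user buffer with a single
         * __clear_user() call, so the code is correct whether the
         * architecture defines 4 or 5 spare ints. */
        static int put_spare_and_flags(struct compat_statfs __user *ubuf,
                                       const struct kstatfs *st)
        {
                if (__put_user(st->f_flags, &ubuf->f_flags) ||
                    __clear_user(ubuf->f_spare, sizeof(ubuf->f_spare)))
                        return -EFAULT;
                return 0;
        }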
      
      I don't expect anyone actually uses the old statfs system
      call anymore, but if they do, let them benefit from having
      the compat and native versions working the same.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  3. 24 Oct 2011, 1 commit
  4. 15 Oct 2011, 1 commit
  5. 14 Oct 2011, 1 commit
  6. 13 Oct 2011, 1 commit
  7. 10 Oct 2011, 6 commits
    • perf, x86: Implement IBS initialization · b7169166
      Committed by Robert Richter
      This patch implements IBS feature detection and initialization.  The
      code is shared between perf and oprofile.  If IBS is available on the
      system for perf, a pmu is set up.
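
      Roughly, the detection boils down to a CPUID check like the following
      (a simplified sketch, not the exact code in this patch; CPUID leaf
      0x8000001b holds the IBS capability bits):

        #include <linux/types.h>
        #include <asm/cpufeature.h>
        #include <asm/processor.h>

        /* Return the IBS capability word, or 0 if IBS is not present.
         * Sketch only: quirk handling from the real code is omitted. */
        static u32 ibs_detect(void)
        {
                if (!boot_cpu_has(X86_FEATURE_IBS))
                        return 0;

                return cpuid_eax(0x8000001b);   /* IBS capabilities */
        }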
      Signed-off-by: Robert Richter <robert.richter@amd.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1316597423-25723-3-git-send-email-robert.richter@amd.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf, x86: Share IBS macros between perf and oprofile · ee5789db
      Committed by Robert Richter
      Moving the IBS macros from oprofile to <asm/perf_event.h> to make them
      available to perf.  No additional changes.
      Signed-off-by: Robert Richter <robert.richter@amd.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1316597423-25723-2-git-send-email-robert.richter@amd.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86, nmi: Add in logic to handle multiple events and unknown NMIs · b227e233
      Committed by Don Zickus
      Previous patches allow the NMI subsystem to process multiple NMI events
      in one NMI.  As previously discussed, this can cause issues when an event
      triggered another NMI but is processed in the current NMI.  This causes the
      next NMI to go unprocessed and become an 'unknown' NMI.
      
      To handle this, we first have to flag whether the NMI handler handled
      more than one event.  If it did, then there exists a chance that
      the next NMI might already have been processed.  Once the NMI is flagged
      as a candidate to be swallowed, we next look for a back-to-back NMI
      condition.
      
      This is determined by looking at the %rip from pt_regs.  If it is the same
      as the previous NMI, it is assumed the cpu did not have a chance to jump
      back into a non-NMI context and execute code and instead handled another NMI.
      
      If both of those conditions are true then we will swallow any unknown NMI.
      
      There still exists a chance that we accidentally swallow a real unknown NMI,
      but for now things seem better.
      
      An optimization has also been added to the nmi notifier routine.  Because
      x86 can latch up to one NMI while currently processing an NMI, we don't
      have to worry about executing _all_ the handlers in a standalone NMI.
      The idea is that if multiple NMIs come in, the second NMI will represent
      them.  For those back-to-back NMI cases, we have the potential to drop
      NMIs.  Therefore only execute all the handlers in the second half of a
      detected back-to-back NMI.
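
      A simplified sketch of that swallowing heuristic (variable and helper
      names are illustrative, not necessarily those used in the patch):

        #include <linux/percpu.h>
        #include <linux/printk.h>
        #include <asm/ptrace.h>

        static DEFINE_PER_CPU(unsigned long, last_nmi_rip);
        static DEFINE_PER_CPU(bool, swallow_nmi); /* previous NMI handled >1 event */

        /* Called on every NMI: report whether we are back-to-back, i.e. the
         * NMI hit at the same %rip as the previous one, so the cpu never
         * left NMI context in between. */
        static bool nmi_is_back_to_back(struct pt_regs *regs)
        {
                bool b2b = (regs->ip == __this_cpu_read(last_nmi_rip));

                __this_cpu_write(last_nmi_rip, regs->ip);
                return b2b;
        }

        /* Called when no handler claimed the NMI. */
        static void handle_unknown_nmi(struct pt_regs *regs, bool b2b)
        {
                if (b2b && __this_cpu_read(swallow_nmi))
                        return;         /* assume it was already handled */

                pr_emerg("Uhhuh. NMI received for unknown reason\n");
        }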
      Signed-off-by: Don Zickus <dzickus@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1317409584-23662-5-git-send-email-dzickus@redhat.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86, nmi: Wire up NMI handlers to new routines · 9c48f1c6
      Committed by Don Zickus
      Just convert all the files that have an nmi handler to the new routines.
      Most of it is straightforward conversion.  A couple of places needed some
      tweaking, like kgdb, which separates the debug notifier from the nmi
      handler, and mce, which removes a call to notify_die.
      
      [Thanks to Ying for finding out the history behind that mce call
      
      https://lkml.org/lkml/2010/5/27/114
      
      And Boris responding that he would like to remove that call because of it
      
      https://lkml.org/lkml/2011/9/21/163]
      
      The things that get converted are the registration/unregistration
      routines, and the nmi handler itself has its args changed along with
      removal of the code that checks which list it is on (most are on one NMI
      list except for kgdb, which has both an NMI routine and an NMI Unknown
      routine).
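
      For reference, a converted user ends up looking roughly like this (a
      sketch; the exact register_nmi_handler() signature and return-value
      convention in this series may differ slightly, and the device check is
      hypothetical):

        #include <linux/module.h>
        #include <asm/nmi.h>
        #include <asm/ptrace.h>

        /* Hypothetical hardware check: a real driver would poll its device
         * for a pending NMI condition and clear it. */
        static bool mydrv_check_and_clear_nmi(void)
        {
                return false;
        }

        /* New-style handler: called for every NMI on its list, returns the
         * number of events it recognised and handled. */
        static int mydrv_nmi_handler(unsigned int type, struct pt_regs *regs)
        {
                if (!mydrv_check_and_clear_nmi())
                        return 0;       /* not ours */
                return 1;               /* handled one event */
        }

        static int __init mydrv_init(void)
        {
                return register_nmi_handler(NMI_LOCAL, mydrv_nmi_handler,
                                            0, "mydrv");
        }

        static void __exit mydrv_exit(void)
        {
                unregister_nmi_handler(NMI_LOCAL, "mydrv");
        }

        module_init(mydrv_init);
        module_exit(mydrv_exit);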
      Signed-off-by: Don Zickus <dzickus@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Corey Minyard <minyard@acm.org>
      Cc: Jason Wessel <jason.wessel@windriver.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Corey Minyard <minyard@acm.org>
      Cc: Jack Steiner <steiner@sgi.com>
      Link: http://lkml.kernel.org/r/1317409584-23662-4-git-send-email-dzickus@redhat.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86, nmi: Create new NMI handler routines · c9126b2e
      Committed by Don Zickus
      The NMI handlers used to rely on the notifier infrastructure.  This worked
      great until we wanted to support handling multiple events better.
      
      One of the key ideas of the nmi handling is to process _all_ the handlers
      for each NMI.  The reason behind this switch is that NMIs are edge
      triggered.  If enough NMIs are triggered, then they could be lost because
      the cpu can only latch at most one NMI (besides the one currently being
      processed).
      
      In order to deal with this we have decided to process all the NMI handlers
      for each NMI.  This allows the handlers to determine if they received an
      event or not (the ones that cannot determine this will be left to fend
      for themselves on the unknown NMI list).
      
      As a result of this change it is now possible to have an extra NMI that
      was destined to be received for an already processed event.  Because the
      event was processed in the previous NMI, this NMI gets dropped and becomes
      an 'unknown' NMI.  This of course will cause printks that scare people.
      
      However, we prefer to have extra NMIs as opposed to losing NMIs, and as
      such have developed a basic mechanism to catch most of them.  That will
      be a later patch.
      
      To accomplish this idea, I unhooked the nmi handlers from the notifier
      routines and created a new mechanism loosely based on doIRQ.  The reason
      for this is that the notifier routines have a couple of shortcomings.
      One, we couldn't guarantee all future NMI handlers used NOTIFY_OK instead
      of NOTIFY_STOP.  Second, we couldn't keep track of the number of events
      being handled in each routine (most only handle one, perf can handle more
      than one).  Third, I wanted to eventually display which nmi handlers are
      registered in the system in /proc/interrupts to help see who is
      generating NMIs.
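
      Very roughly, the bookkeeping looks like this (struct layout and names
      are illustrative; the real patch also keeps flags and per-handler
      event counts):

        #include <linux/list.h>
        #include <linux/rculist.h>
        #include <asm/ptrace.h>

        struct nmiaction {
                struct list_head list;
                int (*handler)(unsigned int type, struct pt_regs *regs);
                const char *name;
        };

        /* Call every handler registered for this NMI type and return the
         * total number of events they claimed (0 means "unknown NMI"). */
        static int nmi_handle(struct list_head *handlers, unsigned int type,
                              struct pt_regs *regs)
        {
                struct nmiaction *a;
                int handled = 0;

                rcu_read_lock();
                list_for_each_entry_rcu(a, handlers, list)
                        handled += a->handler(type, regs);
                rcu_read_unlock();

                return handled;
        }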
      
      The patch below just implements the new infrastructure but doesn't wire it up
      yet (that is the next patch).  Its design is based on doIRQ structs and the
      atomic notifier routines.  So the rcu stuff in the patch isn't entirely untested
      (as the notifier routines have soaked it) but it should be double checked in
      case I copied the code wrong.
      Signed-off-by: Don Zickus <dzickus@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1317409584-23662-3-git-send-email-dzickus@redhat.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf, intel: Use GO/HO bits in perf-ctr · 144d31e6
      Committed by Gleb Natapov
      Intel does not have a guest/host-only bit in perf counters like AMD
      does.  To support GO/HO bits, KVM needs to switch EVENTSELn values
      (or PERF_GLOBAL_CTRL if available) at guest entry.  If a counter is
      configured to count only in guest mode, it stays disabled in the host,
      but VMX is configured to switch it to the enabled value during guest
      entry.

      This patch adds GO/HO tracking to the Intel perf code and provides an
      interface for KVM to get a list of MSRs that need to be switched at
      guest entry.
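
      On the KVM side the new interface can be consumed roughly like this at
      guest entry (a sketch; vmx_add_autoload_msr() is a hypothetical
      stand-in for however the MSR switching is actually programmed):

        #include <linux/types.h>
        #include <asm/perf_event.h>

        /* Hypothetical stub: a real implementation would fill the VMX
         * MSR-load/store areas so 'msr' holds 'guest_val' while in the
         * guest and 'host_val' again after VM-exit. */
        static void vmx_add_autoload_msr(unsigned int msr, u64 guest_val,
                                         u64 host_val)
        {
        }

        static void switch_perf_msrs_for_guest(void)
        {
                int i, nr;
                struct perf_guest_switch_msr *msrs = perf_guest_get_msrs(&nr);

                for (i = 0; i < nr; i++)
                        vmx_add_autoload_msr(msrs[i].msr, msrs[i].guest,
                                             msrs[i].host);
        }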
      
      Only cpus with an architectural PMU (v1 or later) are supported by this
      patch.  To my knowledge there are no p6 models with VMX but without an
      architectural PMU, p4s with VMX are rare, and the interface is general
      enough to support them if the need arises.
      Signed-off-by: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1317816084-18026-7-git-send-email-gleb@redhat.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  8. 06 Oct 2011, 1 commit
  9. 05 Oct 2011, 1 commit
  10. 30 Sep 2011, 1 commit
  11. 29 Sep 2011, 3 commits
    • xen: modify kernel mappings corresponding to granted pages · 0930bba6
      Committed by Stefano Stabellini
      If we want to use granted pages for AIO, changing the mappings of a user
      vma and the corresponding p2m is not enough, we also need to update the
      kernel mappings accordingly.
      Currently this is only needed for pages that are created for user use
      through /dev/xen/gntdev.  That is, pages that have been in use by the
      kernel and use the P2M will not need this special mapping.
      However there are no guarantees that in the future the kernel won't
      start accessing pages through the 1:1 mapping even for internal usage.

      In order to avoid the complexity of dealing with highmem, we allocate
      the pages in lowmem.
      We issue a HYPERVISOR_grant_table_op right away in
      m2p_add_override and we remove the mappings using another
      HYPERVISOR_grant_table_op in m2p_remove_override.
      Considering that m2p_add_override and m2p_remove_override are called
      once per page we use multicalls and hypercall batching.
      
      Use the kmap_op pointer directly as the argument to do the mapping, as it
      is guaranteed to be present up until the unmapping is done.
      Before issuing any unmapping multicalls, we need to make sure that the
      mapping has already been done, because we need the kmap->handle to be
      set correctly.
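
      A rough sketch of that batching pattern (helper names follow the
      existing xen multicall code; whether this exact wrapper is what the
      patch uses is an assumption):

        #include <xen/interface/grant_table.h>
        #include <asm/xen/hypercall.h>
        #include "multicalls.h"   /* local header under arch/x86/xen/ */

        /* Queue the grant map as a multicall entry instead of issuing a
         * synchronous hypercall, so the once-per-page calls can be batched. */
        static void queue_grant_map(struct gnttab_map_grant_ref *kmap_op)
        {
                struct multicall_space mcs = xen_mc_entry(0);

                MULTI_grant_table_op(mcs.mc, GNTTABOP_map_grant_ref,
                                     kmap_op, 1);
                xen_mc_issue(PARAVIRT_LAZY_MMU);
        }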
      Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      [v1: Removed GRANT_FRAME_BIT usage]
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    • x86-64: Fix CFI data for interrupt frames · eab9e613
      Committed by Jan Beulich
      The patch titled "x86: Don't use frame pointer to save old stack
      on irq entry" did not properly adjust CFI directives, so this
      patch is a follow-up to that one.
      
      With the old stack pointer no longer stored in a callee-saved
      register (plus some offset), we now have to use a CFA expression
      to describe the memory location where it is being found. This
      requires the use of .cfi_escape (allowing arbitrary byte streams
      to be emitted into .eh_frame), as there is no
      .cfi_def_cfa_expression (which also cannot reasonably be
      expected, as it would require a full expression parser).
      Signed-off-by: Jan Beulich <jbeulich@suse.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Link: http://lkml.kernel.org/r/4E8360200200007800058467@nat28.tlf.novell.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • apic, i386/bigsmp: Fix false warnings regarding logical APIC ID mismatches · 838312be
      Committed by Jan Beulich
      These warnings (generally one per CPU) are a result of
      initializing x86_cpu_to_logical_apicid while apic_default is
      still in use, whereas the check in setup_local_APIC() is done
      after apic_bigsmp has already been used as an override in
      default_setup_apic_routing():
      
       Overriding APIC driver with bigsmp
       Enabling APIC mode:  Physflat.  Using 5 I/O APICs
       ------------[ cut here ]------------
       WARNING: at .../arch/x86/kernel/apic/apic.c:1239
       ...
       CPU 1 irqstacks, hard=f1c9a000 soft=f1c9c000
       Booting Node   0, Processors  #1
       smpboot cpu 1: start_ip = 9e000
       Initializing CPU#1
       ------------[ cut here ]------------
       WARNING: at .../arch/x86/kernel/apic/apic.c:1239
       setup_local_APIC+0x137/0x46b() Hardware name: ...
       CPU1 logical APIC ID: 2 != 8
       ...
      
      Fix this (for the time being, i.e. until
      x86_32_early_logical_apicid() gets removed again, as Tejun
      says ought to be possible) by overriding the previously stored
      values at the point where the APIC driver gets overridden.
      
      v2: Move this and the pre-existing override logic into
          arch/x86/kernel/apic/bigsmp_32.c.
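
      A minimal sketch of that override step (helper and field names are
      illustrative, not necessarily those in the final patch):

        #include <linux/cpumask.h>
        #include <asm/apic.h>

        /* When bigsmp replaces apic_default, redo the per-cpu logical APIC
         * IDs that were computed under the default driver's scheme, so the
         * later setup_local_APIC() check compares against the right values. */
        static void bigsmp_fixup_logical_apicids(void)
        {
                int cpu;

                for_each_possible_cpu(cpu)
                        early_per_cpu(x86_cpu_to_logical_apicid, cpu) =
                                apic_bigsmp.x86_32_early_logical_apicid(cpu);
        }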
      Signed-off-by: Jan Beulich <jbeulich@suse.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: <stable@kernel.org> (2.6.39 and onwards)
      Link: http://lkml.kernel.org/r/4E835D16020000780005844C@nat28.tlf.novell.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  12. 28 Sep 2011, 2 commits
  13. 27 Sep 2011, 2 commits
  14. 26 Sep 2011, 9 commits
  15. 24 Sep 2011, 1 commit
  16. 21 Sep 2011, 3 commits
  17. 16 Sep 2011, 1 commit
    • asm alternatives: remove incorrect alignment notes · a7f934d4
      Committed by Linus Torvalds
      On x86-64, they were just wasteful: with the explicitly added (now
      unnecessary) padding, the size of the alternatives structure was 16
      bytes, and an alignment of 8 bytes didn't hurt much.
      
      However, it was still silly, since the natural size and alignment for
      the structure is actually just 12 bytes, 4-byte aligned since commit
      59e97e4d ("x86: Make alternative instruction pointers relative").
      So removing the padding, and removing the extra alignment is just a good
      idea.
      
      On x86-32, the alignment of 4 bytes was correct, but was incorrectly
      hardcoded as 8 bytes in <asm/alternative-asm.h>.  That header file used
      to be an x86-64-only header file, but various unification efforts have
      made it be used for x86-32 too (i.e. the unification of rwlock and
      rwsem).
      
      That in turn caused x86-32 boot failures, because the extra alignment
      would result in random zero-filled words in the altinstructions section,
      causing oopses early at boot when doing alternative instruction
      replacement.
      
      So just remove all the alignment noise entirely.  It's wrong, and it's
      unnecessary.  The section itself is already properly aligned by the
      linker scripts, and all additions to the section had better be of the
      proper 12-byte format, keeping it aligned.  So if the align directive
      were to ever make a difference, that would be an indication of a serious
      bug to begin with.
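
      For reference, the structure in question looks roughly like this after
      commit 59e97e4d (12 bytes, naturally 4-byte aligned; the field comments
      are paraphrased):

        #include <linux/types.h>

        struct alt_instr {
                s32 instr_offset;       /* original instruction (relative) */
                s32 repl_offset;        /* replacement instruction (relative) */
                u16 cpuid;              /* cpuid bit set for replacement */
                u8  instrlen;           /* length of original instruction */
                u8  replacementlen;     /* length of replacement instruction */
        };                              /* 4 + 4 + 2 + 1 + 1 = 12 bytes */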
      Reported-by: Werner Landgraf <w.landgraf@ru.r>
      Acked-by: Andrew Lutomirski <luto@mit.edu>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. 13 Sep 2011, 1 commit
  19. 30 Aug 2011, 3 commits