1. 20 July 2007 (8 commits)
    • PM: introduce hibernation and suspend notifiers · b10d9117
      Committed by Rafael J. Wysocki
      Make it possible to register hibernation and suspend notifiers, so that
      subsystems can perform hibernation-related or suspend-related operations that
      should not be carried out by device drivers' .suspend() and .resume()
      routines.
      
      [akpm@linux-foundation.org: build fixes]
      [akpm@linux-foundation.org: cleanups]
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Acked-by: Pavel Machek <pavel@ucw.cz>
      Cc: Nigel Cunningham <nigel@nigel.suspend2.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
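      A minimal usage sketch, assuming the register_pm_notifier() helper and
      the PM_HIBERNATION_PREPARE/PM_POST_HIBERNATION events this interface
      exposes; the subsystem callback and its name are illustrative only:

        #include <linux/notifier.h>
        #include <linux/suspend.h>

        /* illustrative subsystem callback */
        static int foo_pm_callback(struct notifier_block *nb,
                                   unsigned long action, void *data)
        {
                switch (action) {
                case PM_HIBERNATION_PREPARE:
                        /* quiesce what .suspend()/.resume() must not touch */
                        return NOTIFY_OK;
                case PM_POST_HIBERNATION:
                        /* undo the preparation once hibernation is over */
                        return NOTIFY_OK;
                }
                return NOTIFY_DONE;
        }

        static struct notifier_block foo_pm_nb = {
                .notifier_call = foo_pm_callback,
        };

        /* at init time: register_pm_notifier(&foo_pm_nb); */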
    • Freezer: avoid freezing kernel threads prematurely · 0c1eecfb
      Committed by Rafael J. Wysocki
      Kernel threads should not have TIF_FREEZE set when user space processes are
      being frozen, since otherwise some of them might be frozen prematurely.
      To prevent this from happening we can (1) make exit_mm() unset TIF_FREEZE
      unconditionally just after clearing tsk->mm and (2) make try_to_freeze_tasks()
      check that p->mm is not NULL and that PF_BORROWED_MM is unset in p->flags
      when user space processes are to be frozen.

      Namely, when user space processes are being frozen, we should only set
      TIF_FREEZE for tasks that have a non-NULL p->mm and do not have
      PF_BORROWED_MM set in p->flags.  For this reason task_lock() must be used to
      prevent try_to_freeze_tasks() from racing with use_mm()/unuse_mm(), which
      change p->mm and the PF_BORROWED_MM bit of p->flags under task_lock(p).
      Also, we need to prevent the following scenario from happening:
      
      * daemonize() is called by a task spawned from a user space code path
      * freezer checks if the task has p->mm set and the result is positive
      * task enters exit_mm() and clears its TIF_FREEZE
      * freezer sets TIF_FREEZE for the task
      * task calls try_to_freeze() and goes to the refrigerator, which is wrong at
        that point
      
      This requires us to acquire task_lock(p) before p->mm and the PF_BORROWED_MM
      bit of p->flags are examined, and to release it after TIF_FREEZE has been set
      for p (or it turns out that TIF_FREEZE should not be set).
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Gautham R Shenoy <ego@in.ibm.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Nigel Cunningham <nigel@nigel.suspend2.net>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
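      A sketch of the check described above (the helper name is made up;
      task_lock()/task_unlock(), PF_BORROWED_MM and TIF_FREEZE are the real
      interfaces being discussed):

        #include <linux/sched.h>

        /*
         * task_lock() serialises against use_mm()/unuse_mm(), which change
         * p->mm and PF_BORROWED_MM under the same lock, and TIF_FREEZE is
         * set before the lock is dropped so that exit_mm() cannot clear it
         * in between.
         */
        static void maybe_freeze_user_space_task(struct task_struct *p)
        {
                task_lock(p);
                if (p->mm && !(p->flags & PF_BORROWED_MM))
                        set_tsk_thread_flag(p, TIF_FREEZE);
                task_unlock(p);
        }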
    • swsusp: introduce restore platform operations · a634cc10
      Committed by Rafael J. Wysocki
      At least on some machines it is necessary to prepare the ACPI firmware for the
      restoration of the system memory state from the hibernation image if the
      "platform" mode of hibernation has been used.  Namely, in such cases we need
      to disable the GPEs before replacing the "boot" kernel with the "frozen"
      kernel (cf.  http://bugzilla.kernel.org/show_bug.cgi?id=7887).  After the
      restore they will be re-enabled by hibernation_ops->finish(), but if the
      restore fails, they have to be re-enabled by the restore code explicitly.
      
      For this purpose we can introduce two additional hibernation operations,
      called pre_restore() and restore_cleanup() and call them from the restore code
      path.  Still, they should only be called if the "platform" mode of hibernation
      has been used, so we need to pass the information about the hibernation mode
      from the "frozen" kernel to the "boot" kernel in the image header.
      
      Apparently, we can't drop the disabling of GPEs before the restore because of
      Bug #7887.  We also can't do it unconditionally, because the GPEs wouldn't
      have been enabled after a successful restore if the suspend had been done in
      the 'shutdown' or 'reboot' mode.
      
      In principle we could (and probably should) unconditionally disable the GPEs
      before each snapshot creation *and* before the restore, but then we'd have to
      unconditionally enable them after the snapshot creation as well as after the
      restore (or restore failure).  Still, for this purpose we'd need to modify
      acpi_enter_sleep_state_prep() and acpi_leave_sleep_state() and we'd have to
      introduce some mechanism synchronizing the disabling/enabling of the GPEs with
      the device drivers' .suspend()/.resume() routines and with
      disable_/enable_nonboot_cpus().  However, this would have affected the
      suspend (i.e. s2ram) code as well as the hibernation, which I'd like to avoid
      in this patch series.
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Nigel Cunningham <nigel@nigel.suspend2.net>
      Cc: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
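      Roughly, the new callbacks extend the platform hibernation operations
      and bracket the image restore; a sketch (the exact layout of the
      structure at the time may differ, and the usage is shown only in
      outline):

        /* platform hooks consulted by the hibernation core */
        struct hibernation_ops {
                int  (*prepare)(void);
                int  (*enter)(void);
                void (*finish)(void);
                /* added by this patch: */
                int  (*pre_restore)(void);      /* e.g. disable GPEs before restoring */
                void (*restore_cleanup)(void);  /* re-enable them if the restore fails */
        };

        /*
         * In the restore path of the "boot" kernel, roughly:
         *
         *      if (platform_mode)
         *              error = hibernation_ops->pre_restore();
         *      if (!error)
         *              error = restore_image();
         *      if (error && platform_mode)
         *              hibernation_ops->restore_cleanup();
         */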
    • Remove alloc_zeroed_user_highpage() · bb2d5ce1
      Committed by Mel Gorman
      alloc_zeroed_user_highpage() has no in-tree users and it is not exported.
      As it is not exported, it can simply be removed.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fault feedback #2 · 83c54070
      Committed by Nick Piggin
      This patch completes Linus's wish that the fault return codes be made into
      bit flags, which I agree makes everything nicer.  This requires
      all handle_mm_fault callers to be modified (possibly the modifications
      should go further and do things like fault accounting in handle_mm_fault --
      however that would be for another patch).
      
      [akpm@linux-foundation.org: fix alpha build]
      [akpm@linux-foundation.org: fix s390 build]
      [akpm@linux-foundation.org: fix sparc build]
      [akpm@linux-foundation.org: fix sparc64 build]
      [akpm@linux-foundation.org: fix ia64 build]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Ian Molton <spyro@f2s.com>
      Cc: Bryan Wu <bryan.wu@analog.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Roman Zippel <zippel@linux-m68k.org>
      Cc: Greg Ungerer <gerg@uclinux.org>
      Cc: Matthew Wilcox <willy@debian.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
      Cc: Richard Curnow <rc@rc0.org.uk>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
      Cc: Chris Zankel <chris@zankel.net>
      Acked-by: Kyle McMartin <kyle@mcmartin.ca>
      Acked-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
      Acked-by: Ralf Baechle <ralf@linux-mips.org>
      Acked-by: Andi Kleen <ak@muc.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      [ Still apparently needs some ARM and PPC loving - Linus ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
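      After this change an arch fault handler tests bits in the return value
      rather than comparing it against an enumeration; schematically (a
      sketch assuming the VM_FAULT_* flags this series introduces and the
      handle_mm_fault() signature of the time; the wrapper function is
      illustrative):

        #include <linux/errno.h>
        #include <linux/kernel.h>
        #include <linux/mm.h>
        #include <linux/sched.h>

        static int demo_fault_path(struct mm_struct *mm,
                                   struct vm_area_struct *vma,
                                   unsigned long address, int write_access,
                                   struct task_struct *tsk)
        {
                int fault = handle_mm_fault(mm, vma, address, write_access);

                if (unlikely(fault & VM_FAULT_ERROR)) {
                        if (fault & VM_FAULT_OOM)
                                return -ENOMEM;   /* arch code takes its OOM path */
                        if (fault & VM_FAULT_SIGBUS)
                                return -EFAULT;   /* arch code raises SIGBUS */
                        BUG();
                }
                if (fault & VM_FAULT_MAJOR)
                        tsk->maj_flt++;           /* I/O was needed */
                else
                        tsk->min_flt++;
                return 0;
        }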
    • mm: fault feedback #1 · d0217ac0
      Committed by Nick Piggin
      Change ->fault prototype.  We now return an int, which contains
      VM_FAULT_xxx code in the low byte, and FAULT_RET_xxx code in the next byte.
       FAULT_RET_ code tells the VM whether a page was found, whether it has been
      locked, and potentially other things.  This is not quite the way Linus wanted
      it yet, but that's changed in the next patch (which requires changes to
      arch code).
      
      This means we no longer set VM_CAN_INVALIDATE in the vma in order to say
      that a page is locked, which requires filemap_nopage to go away (because we
      can no longer remain backward compatible without that flag), but we were
      going to do that anyway.
      
      struct fault_data is renamed to struct vm_fault as Linus asked. address
      is now a void __user * that we should firmly encourage drivers not to use
      without really good reason.
      
      The page is now returned via a page pointer in the vm_fault struct.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
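      A skeletal ->fault implementation under the new prototype might look
      like this (a sketch: the handler name is illustrative, the real lookup
      and read-in logic is elided, and handlers whose pages can be
      invalidated are expected to hand the page back locked):

        #include <linux/mm.h>
        #include <linux/pagemap.h>

        static int example_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
        {
                struct address_space *mapping = vma->vm_file->f_mapping;
                struct page *page;

                page = find_get_page(mapping, vmf->pgoff);  /* offset now comes from the VM */
                if (!page)
                        return VM_FAULT_SIGBUS;             /* real code would read it in */

                vmf->page = page;                           /* returned via the struct */
                return 0;                                   /* plus FAULT_RET_xxx bits if needed */
        }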
    • mm: merge populate and nopage into fault (fixes nonlinear) · 54cb8821
      Committed by Nick Piggin
      Nonlinear mappings are (AFAIKS) simply a virtual memory concept that encodes
      the virtual address -> file offset differently from linear mappings.
      
      ->populate is a layering violation because the filesystem/pagecache code
      should not need to know anything about the virtual memory mapping.  The hitch here
      is that the ->nopage handler didn't pass down enough information (i.e. pgoff).
       But it is more logical to pass pgoff rather than have the ->nopage function
      calculate it itself anyway (because that's a similar layering violation).
      
      Having the populate handler install the pte itself is likewise a nasty thing
      to be doing.
      
      This patch introduces a new fault handler that replaces ->nopage and
      ->populate and (later) ->nopfn.  Most of the old mechanism is still in place
      so there is a lot of duplication and nice cleanups that can be removed if
      everyone switches over.
      
      The rationale for doing this in the first place is that nonlinear mappings are
      subject to the pagefault vs invalidate/truncate race too, and it seemed stupid
      to duplicate the synchronisation logic rather than just consolidate the two.
      
      After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in
      pagecache.  Seems like a fringe functionality anyway.
      
      NOPAGE_REFAULT is removed.  This should be implemented with ->fault, and no
      users have hit mainline yet.
      
      [akpm@linux-foundation.org: cleanup]
      [randy.dunlap@oracle.com: doc. fixes for readahead]
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Mark Fasheh <mark.fasheh@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
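      For a driver or filesystem, switching over is then mostly a matter of
      replacing the old hooks in its vm_operations_struct with the single
      .fault hook (a sketch; example_fault is the illustrative handler from
      the sketch under the previous entry, and the VM_CAN_NONLINEAR bit is
      only set if the mapping really supports nonlinear use):

        #include <linux/fs.h>
        #include <linux/mm.h>

        /* the ->fault handler sketched under the previous entry */
        static int example_fault(struct vm_area_struct *vma, struct vm_fault *vmf);

        static struct vm_operations_struct example_vm_ops = {
                /* previously: .nopage = example_nopage, .populate = example_populate */
                .fault = example_fault,
        };

        static int example_mmap(struct file *file, struct vm_area_struct *vma)
        {
                vma->vm_ops = &example_vm_ops;
                vma->vm_flags |= VM_CAN_NONLINEAR;  /* remap_file_pages() now works too */
                return 0;
        }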
    • mm: fix fault vs invalidate race for linear mappings · d00806b1
      Committed by Nick Piggin
      Fix the race between invalidate_inode_pages and do_no_page.
      
      Andrea Arcangeli identified a subtle race between invalidation of pages from
      pagecache with userspace mappings, and do_no_page.
      
      The issue is that invalidation has to shoot down all mappings to the page,
      before it can be discarded from the pagecache.  Between shooting down ptes to
      a particular page, and actually dropping the struct page from the pagecache,
      do_no_page from any process might fault on that page and establish a new
      mapping to the page just before it gets discarded from the pagecache.
      
      The most common case where such invalidation is used is in file truncation.
      This case was catered for by doing a sort of open-coded seqlock between the
      file's i_size, and its truncate_count.
      
      Truncation will decrease i_size, then increment truncate_count before
      unmapping userspace pages; do_no_page will read truncate_count, then find the
      page if it is within i_size, and then check truncate_count under the page
      table lock and back out and retry if it had subsequently been changed (ptl
      will serialise against unmapping, and ensure a potentially updated
      truncate_count is actually visible).
      
      Complexity and documentation issues aside, the locking protocol fails in the
      case where we would like to invalidate pagecache inside i_size.  do_no_page
      can come in anytime and filemap_nopage is not aware of the invalidation in
      progress (as it is when it is outside i_size).  The end result is that
      dangling (->mapping == NULL) pages that appear to be from a particular file
      may be mapped into userspace with nonsense data.  Valid mappings to the same
      place will see a different page.
      
      Andrea implemented two working fixes, one using a real seqlock, another using
      a page->flags bit.  He also proposed using the page lock in do_no_page, but
      that was initially considered too heavyweight.  However, it is not a global or
      per-file lock, and the page cacheline is modified in do_no_page to increment
      _count and _mapcount anyway, so a further modification should not be a large
      performance hit.  Scalability is not an issue.
      
      This patch implements this latter approach.  ->nopage implementations return
      with the page locked if it is possible for their underlying file to be
      invalidated (in that case, they must set a special vm_flags bit to indicate
      so).  do_no_page only unlocks the page after setting up the mapping
      completely.  Invalidation is excluded because it holds the page lock during
      invalidation of each page (and ensures that the page is not mapped while
      holding the lock).
      
      This also allows significant simplifications in do_no_page, because we have
      the page locked in the right place in the pagecache from the start.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
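      Under this scheme a ->nopage implementation whose pages can be
      invalidated returns the page locked and advertises that via a vm_flags
      bit; schematically (a sketch using the old ->nopage prototype; the
      VM_CAN_INVALIDATE flag named here is the one referred to by the
      follow-up patches above):

        #include <linux/mm.h>
        #include <linux/pagemap.h>

        static struct page *example_nopage(struct vm_area_struct *vma,
                                           unsigned long address, int *type)
        {
                struct address_space *mapping = vma->vm_file->f_mapping;
                pgoff_t pgoff = ((address - vma->vm_start) >> PAGE_SHIFT)
                                        + vma->vm_pgoff;
                struct page *page;

                page = find_lock_page(mapping, pgoff);  /* returned locked ... */
                if (!page)
                        return NOPAGE_SIGBUS;
                if (type)
                        *type = VM_FAULT_MINOR;
                return page;  /* ... do_no_page() unlocks it once the pte is in place */
        }

        /* at mmap time: vma->vm_flags |= VM_CAN_INVALIDATE; */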
  2. 19 July 2007 (20 commits)
  3. 18 July 2007 (12 commits)
    • xen: Place vcpu_info structure into per-cpu memory · 60223a32
      Committed by Jeremy Fitzhardinge
      An experimental patch for Xen allows guests to place their vcpu_info
      structs anywhere.  We try to use this to place the vcpu_info into the
      PDA, which allows direct access.
      
      If this works, we then switch to using direct-access operations for
      irq_enable, irq_disable, save_fl and restore_fl.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Keir Fraser <keir@xensource.com>
    • xen: add virtual block device driver. · 9f27ee59
      Committed by Jeremy Fitzhardinge
      The block device frontend driver allows the kernel to access block
      devices exported by a virtual machine containing a physical
      block device driver.
      Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
      Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Greg KH <greg@kroah.com>
      Cc: Jens Axboe <axboe@kernel.dk>
    • xen: add the Xenbus sysfs and virtual device hotplug driver · 4bac07c9
      Committed by Jeremy Fitzhardinge
      This communicates with the machine control software via a registry
      residing in a controlling virtual machine. This allows dynamic
      creation, destruction and modification of virtual device
      configurations (network devices, block devices and CPUs, to name some
      examples).
      
      [ Greg, would you mind giving this a review?  Thanks -J ]
      Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
      Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Cc: Greg KH <greg@kroah.com>
    • xen: Add grant table support · ad9a8612
      Committed by Jeremy Fitzhardinge
      Add a Xen 'grant table' driver, which allows other virtual machines to
      be granted access to selected local memory pages and, symmetrically,
      allows the mapping of remote memory pages to which other virtual
      machines have granted access.
      
      This driver is a prerequisite for many of the Xen virtual device
      drivers, which grant the 'device driver domain' restricted and
      temporary access to only those memory pages that are currently
      involved in I/O operations.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
      Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
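      Typical frontend use is to grant the backend domain access to a page
      for the duration of an I/O and to revoke the grant when it completes;
      a sketch, assuming the gnttab_* helpers exported by this driver and
      the Xen virt_to_mfn() conversion (the caller-side names are
      illustrative, error handling is trimmed):

        #include <xen/grant_table.h>

        static int share_page_with_backend(domid_t backend, void *page_addr,
                                           int readonly, grant_ref_t *ref)
        {
                int r = gnttab_grant_foreign_access(backend,
                                                    virt_to_mfn(page_addr),
                                                    readonly);
                if (r < 0)
                        return r;       /* no grant references available */
                *ref = r;               /* handed to the backend, e.g. via xenbus */
                return 0;
        }

        /* when the I/O is done: gnttab_end_foreign_access(*ref, readonly, 0); */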
    • xen: use the hvc console infrastructure for Xen console · b536b4b9
      Committed by Jeremy Fitzhardinge
      Implement a Xen back-end for hvc console.
      
      * * *
      Add early printk support via the hvc console, enabled by passing
      "earlyprintk=xen" on the kernel command line.
      
      From: Gerd Hoffmann <kraxel@suse.de>
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Olof Johansson <olof@lixom.net>
    • xen: SMP guest support · f87e4cac
      Committed by Jeremy Fitzhardinge
      This is a fairly straightforward Xen implementation of smp_ops.
      
      Xen has its own IPI mechanisms, and has no dependency on any
      APIC-based IPI.  The smp_ops hooks and the flush_tlb_others pv_op
      allow a Xen guest to avoid all APIC code in arch/i386 (the only apic
      operation is a single apic_read for the apic version number).
      
      One subtle point which needs to be addressed is unpinning pagetables
      when another cpu may have a lazy tlb reference to the pagetable. Xen
      will not allow an in-use pagetable to be unpinned, so we must find any
      other cpus with a reference to the pagetable and get them to shoot
      down their references.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Andi Kleen <ak@suse.de>
    • xen: add pinned page flag · c85b04c3
      Committed by Jeremy Fitzhardinge
      Add a new definition for PG_owner_priv_1 to define PG_pinned on Xen
      pagetable pages.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
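      The definition itself is tiny; in the page-flags style of the time it
      amounts to aliasing the spare bit and giving it the usual accessors
      (sketch):

        /* include/linux/page-flags.h, following the existing macro style */
        #define PG_pinned               PG_owner_priv_1

        #define PagePinned(page)        test_bit(PG_pinned, &(page)->flags)
        #define SetPagePinned(page)     set_bit(PG_pinned, &(page)->flags)
        #define ClearPagePinned(page)   clear_bit(PG_pinned, &(page)->flags)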
    • xen: event channels · e46cdb66
      Committed by Jeremy Fitzhardinge
      Xen implements interrupts in terms of event channels.  Each guest
      domain gets 1024 event channels which can be used for a variety of
      purposes, such as Xen timer events, inter-domain events,
      inter-processor events (IPIs) or real hardware IRQs.
      
      Within the kernel, we map the event channels to IRQs, and implement
      the whole interrupt handling using a Xen irq_chip.
      
      Rather than setting NR_IRQ to 1024 under PARAVIRT in order to
      accommodate Xen, we create a dynamic mapping between event channels and
      IRQs.  Ideally, Linux will eventually move towards dynamically
      allocating per-irq structures, and we can use a 1:1 mapping between
      event channels and irqs.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
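      From a driver's point of view the dynamic mapping means an event
      channel simply becomes a Linux irq handled through the usual
      request_irq-style interface; a sketch, assuming the
      bind_evtchn_to_irqhandler() helper this layer provides (handler and
      device names are illustrative):

        #include <linux/interrupt.h>
        #include <xen/events.h>

        static irqreturn_t example_interrupt(int irq, void *dev_id)
        {
                /* process the event for this device */
                return IRQ_HANDLED;
        }

        static int example_bind(unsigned int evtchn, void *dev)
        {
                /* allocates an irq, binds the event channel to it and
                   installs the handler */
                int irq = bind_evtchn_to_irqhandler(evtchn, example_interrupt,
                                                    0, "example", dev);
                return irq < 0 ? irq : 0;
        }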
    • xen: Core Xen implementation · 5ead97c8
      Committed by Jeremy Fitzhardinge
      This patch is a rollup of all the core pieces of the Xen
      implementation, including:
       - booting and setup
       - pagetable setup
       - privileged instructions
       - segmentation
       - interrupt flags
       - upcalls
       - multicall batching
      
      BOOTING AND SETUP
      
      The vmlinux image is decorated with ELF notes which tell the Xen
      domain builder what the kernel's requirements are; the domain builder
      then constructs the address space accordingly and starts the kernel.
      
      Xen has its own entrypoint for the kernel (contained in an ELF note).
      The ELF notes are set up by xen-head.S, which is included into head.S.
      In principle it could be linked separately, but it seems to provoke
      lots of binutils bugs.
      
      Because the domain builder starts the kernel in a fairly sane state
      (32-bit protected mode, paging enabled, flat segments set up), there's
      not a lot of setup needed before starting the kernel proper.  The main
      steps are:
        1. Install the Xen paravirt_ops, which is simply a matter of a
           structure assignment.
        2. Set init_mm to use the Xen-supplied pagetables (analogous to the
           head.S generated pagetables in a native boot).
        3. Reserve address space for Xen, since it takes a chunk at the top
           of the address space for its own use.
        4. Call start_kernel()
      
      PAGETABLE SETUP
      
      Once we hit the main kernel boot sequence, it will end up calling back
      via paravirt_ops to set up various pieces of Xen specific state.  One
      of the critical things which requires a bit of extra care is the
      construction of the initial init_mm pagetable.  Because Xen places
      tight constraints on pagetables (an active pagetable must always be
      valid, and must always be mapped read-only to the guest domain), we
      need to be careful when constructing the new pagetable to keep these
      constraints in mind.  It turns out that the easiest way to do this is to
      use the initial Xen-provided pagetable as a template, and then just
      insert new mappings for memory where a mapping doesn't already exist.
      
      This means that during pagetable setup, it uses a special version of
      xen_set_pte which ignores any attempt to remap a read-only page as
      read-write (since Xen will map its own initial pagetable as RO), but
      lets other changes to the ptes happen, so that things like NX are set
      properly.
      
      PRIVILEGED INSTRUCTIONS AND SEGMENTATION
      
      When the kernel runs under Xen, it runs in ring 1 rather than ring 0.
      This means that it is more privileged than user-mode in ring 3, but it
      still can't run privileged instructions directly.  Non-performance
      critical instructions are dealt with by taking a privilege exception
      and trapping into the hypervisor and emulating the instruction, but
      more performance-critical instructions have their own specific
      paravirt_ops.  In many cases we can avoid having to do any hypercalls
      for these instructions, or the Xen implementation is quite different
      from the normal native version.
      
      The privileged instructions fall into the broad classes of:
        Segmentation: setting up the GDT and the GDT entries, LDT,
           TLS and so on.  Xen doesn't allow the GDT to be directly
           modified; all GDT updates are done via hypercalls where the new
           entries can be validated.  This is important because Xen uses
           segment limits to prevent the guest kernel from damaging the
           hypervisor itself.
        Traps and exceptions: Xen uses a special format for trap entrypoints,
           so when the kernel wants to set an IDT entry, it needs to be
           converted to the form Xen expects.  Xen sets int 0x80 up specially
           so that the trap goes straight from userspace into the guest kernel
           without going via the hypervisor.  sysenter isn't supported.
        Kernel stack: The esp0 entry is extracted from the tss and provided to
           Xen.
        TLB operations: the various TLB calls are mapped into corresponding
           Xen hypercalls.
        Control registers: all the control registers are privileged.  The most
           important is cr3, which points to the base of the current pagetable,
           and we handle it specially.
      
      Another instruction we treat specially is CPUID, even though it's not
      privileged.  We want to control what CPU features are visible to the
      rest of the kernel, and so CPUID ends up going into a paravirt_op.
      Xen implements this mainly to disable the ACPI and APIC subsystems.
      
      INTERRUPT FLAGS
      
      Xen maintains its own separate flag for masking events, which is
      contained within the per-cpu vcpu_info structure.  Because the guest
      kernel runs in ring 1 and not 0, the IF flag in EFLAGS is completely
      ignored (and must be, because even if a guest domain disables
      interrupts for itself, it can't disable them overall).
      
      (A note on terminology: "events" and interrupts are effectively
      synonymous.  However, rather than using an "enable flag", Xen uses a
      "mask flag", which blocks event delivery when it is non-zero.)
      
      There are paravirt_ops for each of cli/sti/save_fl/restore_fl, which
      are implemented to manage the Xen event mask state.  The only thing
      worth noting is that when events are unmasked, we need to explicitly
      see if there's a pending event and call into the hypervisor to make
      sure it gets delivered.
      
      UPCALLS
      
      Xen needs a couple of upcall (or callback) functions to be implemented
      by each guest.  One is the event upcall, which is how events
      (interrupts, effectively) are delivered to the guests.  The other is
      the failsafe callback, which is used to report errors caused either by
      reloading a segment register or by iret.  These are
      implemented in i386/kernel/entry.S so they can jump into the normal
      iret_exc path when necessary.
      
      MULTICALL BATCHING
      
      Xen provides a multicall mechanism, which allows multiple hypercalls
      to be issued at once in order to mitigate the cost of trapping into
      the hypervisor.  This is particularly useful for context switches,
      since the 4-5 hypercalls they would normally need (reload cr3, update
      TLS, maybe update LDT) can be reduced to one.  This patch implements a
      generic batching mechanism for hypercalls, which gets used in many
      places in the Xen code.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Cc: Ian Pratt <ian.pratt@xensource.com>
      Cc: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
      Cc: Adrian Bunk <bunk@stusta.de>
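      To illustrate the interrupt-flag handling described above: the cli/sti
      replacements reduce to flipping the per-vcpu mask and, on unmask,
      checking whether anything is pending (a sketch; the vcpu_info field
      names come from the Xen interface headers, while
      xen_force_evtchn_callback() stands in for however a pending event is
      actually kicked into delivery):

        #include <linux/compiler.h>
        #include <xen/interface/xen.h>      /* struct vcpu_info */

        /* hypothetical helper: hypercall so the pending event gets delivered */
        void xen_force_evtchn_callback(void);

        static void xen_events_disable(struct vcpu_info *vcpu)
        {
                vcpu->evtchn_upcall_mask = 1;     /* block event delivery */
                barrier();
        }

        static void xen_events_enable(struct vcpu_info *vcpu)
        {
                vcpu->evtchn_upcall_mask = 0;     /* allow delivery again ... */
                barrier();
                if (vcpu->evtchn_upcall_pending)  /* ... and flush anything queued */
                        xen_force_evtchn_callback();
        }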
    • xen: Add Xen interface header files · a42089dd
      Committed by Jeremy Fitzhardinge
      Add Xen interface header files. These are taken fairly directly from
      the Xen tree, but somewhat rearranged to suit the kernel's conventions.
      
      Define macros and inline functions for doing hypercalls into the
      hypervisor.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
      Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
    • Add a sched_clock paravirt_op · 688340ea
      Committed by Jeremy Fitzhardinge
      The tsc-based get_scheduled_cycles interface is not a good match for
      Xen's runstate accounting, which reports everything in nanoseconds.
      
      This patch replaces this interface with a sched_clock interface, which
      matches both Xen and VMI's requirements.
      
      In order to do this, we:
         1. replace get_scheduled_cycles with sched_clock
         2. hoist cycles_2_ns into a common header
         3. update vmi accordingly
      
      One thing to note: because sched_clock is implemented as a weak
      function in kernel/sched.c, we must define a real function in order to
      override this weak binding.  This means the usual paravirt_ops
      technique of using an inline function won't work in this case.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Zachary Amsden <zach@vmware.com>
      Cc: Dan Hecht <dhecht@vmware.com>
      Cc: john stultz <johnstul@us.ibm.com>
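      The cycles-to-nanoseconds conversion being hoisted into a common header
      is a simple fixed-point multiply, roughly (a sketch; in the real code
      the scale is a per-file variable derived from cpu_khz rather than a
      parameter, and the exact names may differ):

        #define CYC2NS_SCALE_FACTOR 10

        /*
         * cyc2ns_scale is precomputed as
         * (1000000 << CYC2NS_SCALE_FACTOR) / cpu_khz, so that the
         * multiply-and-shift below yields nanoseconds.
         */
        static inline unsigned long long cycles_2_ns(unsigned long long cyc,
                                                     unsigned long cyc2ns_scale)
        {
                return (cyc * cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
        }

      Because sched_clock() itself is only a weak symbol in kernel/sched.c,
      a paravirt backend overrides it simply by providing an ordinary
      (strong) out-of-line definition with the same signature in its own
      object file, which is why the usual inline paravirt_ops wrapper cannot
      be used here.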
    • paravirt: helper to disable all IO space · d572929c
      Committed by Jeremy Fitzhardinge
      In a virtual environment, device drivers such as legacy IDE will waste
      quite a lot of time probing for their devices which will never appear.
      This helper function allows a paravirt implementation to lay claim to
      the whole iomem and ioport space, thereby disabling all device drivers
      trying to claim IO resources.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
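      A sketch of such a helper using the standard resource API: claiming
      the whole port and memory ranges up front makes later request_region()
      and request_mem_region() calls from legacy drivers fail, so their
      probes bail out quickly (the resource names are illustrative):

        #include <linux/init.h>
        #include <linux/io.h>
        #include <linux/ioport.h>

        static struct resource reserved_ioports = {
                .start = 0,
                .end   = IO_SPACE_LIMIT,
                .name  = "paravirt: reserved ioports",
                .flags = IORESOURCE_IO | IORESOURCE_BUSY,
        };

        static struct resource reserved_iomem = {
                .start = 0,
                .end   = -1,
                .name  = "paravirt: reserved iomem",
                .flags = IORESOURCE_MEM | IORESOURCE_BUSY,
        };

        /* claim everything so that no device driver can */
        static int __init disable_all_io_space(void)
        {
                int ret;

                ret = request_resource(&ioport_resource, &reserved_ioports);
                if (ret == 0)
                        ret = request_resource(&iomem_resource, &reserved_iomem);
                return ret;
        }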