1. 19 7月, 2007 2 次提交
  2. 18 7月, 2007 19 次提交
    • J
      xen: use iret directly when possible · 9ec2b804
      Jeremy Fitzhardinge 提交于
      Most of the time we can simply use the iret instruction to exit the
      kernel, rather than having to use the iret hypercall - the only
      exception is if we're returning into vm86 mode, or from delivering an
      NMI (which we don't support yet).
      
      When running native, iret has the behaviour of testing for a pending
      interrupt atomically with re-enabling interrupts.  Unfortunately
      there's no way to do this with Xen, so there's a window in which we
      could get a recursive exception after enabling events but before
      actually returning to userspace.
      
      This causes a problem: if the nested interrupt causes one of the
      task's TIF_WORK_MASK flags to be set, they will not be checked again
      before returning to userspace.  This means that pending work may be
      left pending indefinitely, until the process enters and leaves the
      kernel again.  The net effect is that a pending signal or reschedule
      event could be delayed for an unbounded amount of time.
      
      To deal with this, the xen event upcall handler checks to see if the
      EIP is within the critical section of the iret code, after events
      are (potentially) enabled up to the iret itself.  If its within this
      range, it calls the iret critical section fixup, which adjusts the
      stack to deal with any unrestored registers, and then shifts the
      stack frame up to replace the previous invocation.
      Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com>
      9ec2b804
    • J
      xen: Attempt to patch inline versions of common operations · 6487673b
      Jeremy Fitzhardinge 提交于
      This patchs adds the mechanism to allow us to patch inline versions of
      common operations.
      
      The implementations of the direct-access versions save_fl, restore_fl,
      irq_enable and irq_disable are now in assembler, and the same code is
      used for both out of line and inline uses.
      Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Keir Fraser <keir@xensource.com>
      6487673b
    • J
      xen: Core Xen implementation · 5ead97c8
      Jeremy Fitzhardinge 提交于
      This patch is a rollup of all the core pieces of the Xen
      implementation, including:
       - booting and setup
       - pagetable setup
       - privileged instructions
       - segmentation
       - interrupt flags
       - upcalls
       - multicall batching
      
      BOOTING AND SETUP
      
      The vmlinux image is decorated with ELF notes which tell the Xen
      domain builder what the kernel's requirements are; the domain builder
      then constructs the address space accordingly and starts the kernel.
      
      Xen has its own entrypoint for the kernel (contained in an ELF note).
      The ELF notes are set up by xen-head.S, which is included into head.S.
      In principle it could be linked separately, but it seems to provoke
      lots of binutils bugs.
      
      Because the domain builder starts the kernel in a fairly sane state
      (32-bit protected mode, paging enabled, flat segments set up), there's
      not a lot of setup needed before starting the kernel proper.  The main
      steps are:
        1. Install the Xen paravirt_ops, which is simply a matter of a
           structure assignment.
        2. Set init_mm to use the Xen-supplied pagetables (analogous to the
           head.S generated pagetables in a native boot).
        3. Reserve address space for Xen, since it takes a chunk at the top
           of the address space for its own use.
        4. Call start_kernel()
      
      PAGETABLE SETUP
      
      Once we hit the main kernel boot sequence, it will end up calling back
      via paravirt_ops to set up various pieces of Xen specific state.  One
      of the critical things which requires a bit of extra care is the
      construction of the initial init_mm pagetable.  Because Xen places
      tight constraints on pagetables (an active pagetable must always be
      valid, and must always be mapped read-only to the guest domain), we
      need to be careful when constructing the new pagetable to keep these
      constraints in mind.  It turns out that the easiest way to do this is
      use the initial Xen-provided pagetable as a template, and then just
      insert new mappings for memory where a mapping doesn't already exist.
      
      This means that during pagetable setup, it uses a special version of
      xen_set_pte which ignores any attempt to remap a read-only page as
      read-write (since Xen will map its own initial pagetable as RO), but
      lets other changes to the ptes happen, so that things like NX are set
      properly.
      
      PRIVILEGED INSTRUCTIONS AND SEGMENTATION
      
      When the kernel runs under Xen, it runs in ring 1 rather than ring 0.
      This means that it is more privileged than user-mode in ring 3, but it
      still can't run privileged instructions directly.  Non-performance
      critical instructions are dealt with by taking a privilege exception
      and trapping into the hypervisor and emulating the instruction, but
      more performance-critical instructions have their own specific
      paravirt_ops.  In many cases we can avoid having to do any hypercalls
      for these instructions, or the Xen implementation is quite different
      from the normal native version.
      
      The privileged instructions fall into the broad classes of:
        Segmentation: setting up the GDT and the GDT entries, LDT,
           TLS and so on.  Xen doesn't allow the GDT to be directly
           modified; all GDT updates are done via hypercalls where the new
           entries can be validated.  This is important because Xen uses
           segment limits to prevent the guest kernel from damaging the
           hypervisor itself.
        Traps and exceptions: Xen uses a special format for trap entrypoints,
           so when the kernel wants to set an IDT entry, it needs to be
           converted to the form Xen expects.  Xen sets int 0x80 up specially
           so that the trap goes straight from userspace into the guest kernel
           without going via the hypervisor.  sysenter isn't supported.
        Kernel stack: The esp0 entry is extracted from the tss and provided to
           Xen.
        TLB operations: the various TLB calls are mapped into corresponding
           Xen hypercalls.
        Control registers: all the control registers are privileged.  The most
           important is cr3, which points to the base of the current pagetable,
           and we handle it specially.
      
      Another instruction we treat specially is CPUID, even though its not
      privileged.  We want to control what CPU features are visible to the
      rest of the kernel, and so CPUID ends up going into a paravirt_op.
      Xen implements this mainly to disable the ACPI and APIC subsystems.
      
      INTERRUPT FLAGS
      
      Xen maintains its own separate flag for masking events, which is
      contained within the per-cpu vcpu_info structure.  Because the guest
      kernel runs in ring 1 and not 0, the IF flag in EFLAGS is completely
      ignored (and must be, because even if a guest domain disables
      interrupts for itself, it can't disable them overall).
      
      (A note on terminology: "events" and interrupts are effectively
      synonymous.  However, rather than using an "enable flag", Xen uses a
      "mask flag", which blocks event delivery when it is non-zero.)
      
      There are paravirt_ops for each of cli/sti/save_fl/restore_fl, which
      are implemented to manage the Xen event mask state.  The only thing
      worth noting is that when events are unmasked, we need to explicitly
      see if there's a pending event and call into the hypervisor to make
      sure it gets delivered.
      
      UPCALLS
      
      Xen needs a couple of upcall (or callback) functions to be implemented
      by each guest.  One is the event upcalls, which is how events
      (interrupts, effectively) are delivered to the guests.  The other is
      the failsafe callback, which is used to report errors in either
      reloading a segment register, or caused by iret.  These are
      implemented in i386/kernel/entry.S so they can jump into the normal
      iret_exc path when necessary.
      
      MULTICALL BATCHING
      
      Xen provides a multicall mechanism, which allows multiple hypercalls
      to be issued at once in order to mitigate the cost of trapping into
      the hypervisor.  This is particularly useful for context switches,
      since the 4-5 hypercalls they would normally need (reload cr3, update
      TLS, maybe update LDT) can be reduced to one.  This patch implements a
      generic batching mechanism for hypercalls, which gets used in many
      places in the Xen code.
      Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: NChris Wright <chrisw@sous-sol.org>
      Cc: Ian Pratt <ian.pratt@xensource.com>
      Cc: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
      Cc: Adrian Bunk <bunk@stusta.de>
      5ead97c8
    • J
      Add nosegneg capability to the vsyscall page notes · 24037a8b
      Jeremy Fitzhardinge 提交于
      Add the "nosegneg" fake capabilty to the vsyscall page notes. This is
      used by the runtime linker to select a glibc version which then
      disables negative-offset accesses to the thread-local segment via
      %gs. These accesses require emulation in Xen (because segments are
      truncated to protect the hypervisor address space) and avoiding them
      provides a measurable performance boost.
      Signed-off-by: NIan Pratt <ian.pratt@xensource.com>
      Signed-off-by: NChristian Limpach <Christian.Limpach@cl.cam.ac.uk>
      Signed-off-by: NChris Wright <chrisw@sous-sol.org>
      Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com>
      Acked-by: NZachary Amsden <zach@vmware.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Ulrich Drepper <drepper@redhat.com>
      24037a8b
    • J
      Add a sched_clock paravirt_op · 688340ea
      Jeremy Fitzhardinge 提交于
      The tsc-based get_scheduled_cycles interface is not a good match for
      Xen's runstate accounting, which reports everything in nanoseconds.
      
      This patch replaces this interface with a sched_clock interface, which
      matches both Xen and VMI's requirements.
      
      In order to do this, we:
         1. replace get_scheduled_cycles with sched_clock
         2. hoist cycles_2_ns into a common header
         3. update vmi accordingly
      
      One thing to note: because sched_clock is implemented as a weak
      function in kernel/sched.c, we must define a real function in order to
      override this weak binding.  This means the usual paravirt_ops
      technique of using an inline function won't work in this case.
      Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Zachary Amsden <zach@vmware.com>
      Cc: Dan Hecht <dhecht@vmware.com>
      Cc: john stultz <johnstul@us.ibm.com>
      688340ea
    • J
      paravirt: helper to disable all IO space · d572929c
      Jeremy Fitzhardinge 提交于
      In a virtual environment, device drivers such as legacy IDE will waste
      quite a lot of time probing for their devices which will never appear.
      This helper function allows a paravirt implementation to lay claim to
      the whole iomem and ioport space, thereby disabling all device drivers
      trying to claim IO resources.
      Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: NChris Wright <chrisw@sous-sol.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      d572929c
    • J
      paravirt: make siblingmap functions visible · c70df743
      Jeremy Fitzhardinge 提交于
      Paravirt implementations need to set the sibling map on new cpus.
      Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com>
      c70df743
    • J
      paravirt: unstatic smp_store_cpu_info · 724faa89
      Jeremy Fitzhardinge 提交于
      Paravirt implementations need to store cpu info when bringing up cpus.
      Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com>
      724faa89
    • J
      paravirt: unstatic leave_mm · 53787013
      Jeremy Fitzhardinge 提交于
      Make globally leave_mm visible, specifically so that Xen can use it to
      shoot-down lazy uses of cr3.
      Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: NChris Wright <chrisw@sous-sol.org>
      53787013
    • J
      paravirt: add a hook for once the allocator is ready · 6996d3b6
      Jeremy Fitzhardinge 提交于
      Add a hook so that the paravirt backend knows when the allocator is
      ready.  This is useful for the obvious reason that the allocator is
      available, but the other side-effect of having the bootmem allocator
      available is that each page now has an associated "struct page".
      Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com>
      6996d3b6
    • J
      paravirt: add an "mm" argument to alloc_pt · fdb4c338
      Jeremy Fitzhardinge 提交于
      It's useful to know which mm is allocating a pagetable.  Xen uses this
      to determine whether the pagetable being added to is pinned or not.
      Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com>
      fdb4c338
    • J
      use elfnote.h to generate vsyscall notes. · 810bab44
      Jeremy Fitzhardinge 提交于
      Use existing elfnote.h to generate vsyscall notes, rather than doing
      it locally.  Changes elfnote.h a bit to suit, since this is the first
      asm user, and it wasn't quite right.
      Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.com>
      810bab44
    • A
      sys_fallocate() implementation on i386, x86_64 and powerpc · 97ac7350
      Amit Arora 提交于
      fallocate() is a new system call being proposed here which will allow
      applications to preallocate space to any file(s) in a file system.
      Each file system implementation that wants to use this feature will need
      to support an inode operation called ->fallocate().
      Applications can use this feature to avoid fragmentation to certain
      level and thus get faster access speed. With preallocation, applications
      also get a guarantee of space for particular file(s) - even if later the
      the system becomes full.
      
      Currently, glibc provides an interface called posix_fallocate() which
      can be used for similar cause. Though this has the advantage of working
      on all file systems, but it is quite slow (since it writes zeroes to
      each block that has to be preallocated). Without a doubt, file systems
      can do this more efficiently within the kernel, by implementing
      the proposed fallocate() system call. It is expected that
      posix_fallocate() will be modified to call this new system call first
      and incase the kernel/filesystem does not implement it, it should fall
      back to the current implementation of writing zeroes to the new blocks.
      ToDos:
      1. Implementation on other architectures (other than i386, x86_64,
         and ppc). Patches for s390(x) and ia64 are already available from
         previous posts, but it was decided that they should be added later
         once fallocate is in the mainline. Hence not including those patches
         in this take.
      2. Changes to glibc,
         a) to support fallocate() system call
         b) to make posix_fallocate() and posix_fallocate64() call fallocate()
      Signed-off-by: NAmit Arora <aarora@in.ibm.com>
      97ac7350
    • J
      arch/i386/* fs/* ipc/*: mark variables with uninitialized_var() · 8e1c091c
      Jeff Garzik 提交于
      Mark variables with uninitialized_var() if such a warning appears,
      and analysis proves that the var is initialized properly on all paths
      it is used.
      Signed-off-by: NJeff Garzik <jeff@garzik.org>
      8e1c091c
    • A
      i386: speedup touch_nmi_watchdog · f2890255
      Andrew Morton 提交于
      Avoid dirtying remote cpu's memory if it already has the correct value.
      
      Cc: Andi Kleen <ak@suse.de>
      Cc: Konrad Rzeszutek <konrad@darnok.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f2890255
    • A
      PTRACE_POKEDATA consolidation · f284ce72
      Alexey Dobriyan 提交于
      Identical implementations of PTRACE_POKEDATA go into generic_ptrace_pokedata()
      function.
      
      AFAICS, fix bug on xtensa where successful PTRACE_POKEDATA will nevertheless
      return EPERM.
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f284ce72
    • A
      PTRACE_PEEKDATA consolidation · 76647323
      Alexey Dobriyan 提交于
      Identical implementations of PTRACE_PEEKDATA go into generic_ptrace_peekdata()
      function.
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      76647323
    • P
      Report that kernel is tainted if there was an OOPS · bcdcd8e7
      Pavel Emelianov 提交于
      If the kernel OOPSed or BUGed then it probably should be considered as
      tainted.  Thus, all subsequent OOPSes and SysRq dumps will report the
      tainted kernel.  This saves a lot of time explaining oddities in the
      calltraces.
      Signed-off-by: NPavel Emelianov <xemul@openvz.org>
      Acked-by: NRandy Dunlap <randy.dunlap@oracle.com>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      [ Added parisc patch from Matthew Wilson  -Linus ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bcdcd8e7
    • R
      Freezer: make kernel threads nonfreezable by default · 83144186
      Rafael J. Wysocki 提交于
      Currently, the freezer treats all tasks as freezable, except for the kernel
      threads that explicitly set the PF_NOFREEZE flag for themselves.  This
      approach is problematic, since it requires every kernel thread to either
      set PF_NOFREEZE explicitly, or call try_to_freeze(), even if it doesn't
      care for the freezing of tasks at all.
      
      It seems better to only require the kernel threads that want to or need to
      be frozen to use some freezer-related code and to remove any
      freezer-related code from the other (nonfreezable) kernel threads, which is
      done in this patch.
      
      The patch causes all kernel threads to be nonfreezable by default (ie.  to
      have PF_NOFREEZE set by default) and introduces the set_freezable()
      function that should be called by the freezable kernel threads in order to
      unset PF_NOFREEZE.  It also makes all of the currently freezable kernel
      threads call set_freezable(), so it shouldn't cause any (intentional)
      change of behaviour to appear.  Additionally, it updates documentation to
      describe the freezing of tasks more accurately.
      
      [akpm@linux-foundation.org: build fixes]
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      Acked-by: NNigel Cunningham <nigel@nigel.suspend2.net>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Gautham R Shenoy <ego@in.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      83144186
  3. 17 7月, 2007 3 次提交
  4. 16 7月, 2007 2 次提交
  5. 14 7月, 2007 1 次提交
  6. 13 7月, 2007 10 次提交
  7. 12 7月, 2007 1 次提交
    • A
      PCI: Change all drivers to use pci_device->revision · 44c10138
      Auke Kok 提交于
      Instead of all drivers reading pci config space to get the revision
      ID, they can now use the pci_device->revision member.
      
      This exposes some issues where drivers where reading a word or a dword
      for the revision number, and adding useless error-handling around the
      read. Some drivers even just read it for no purpose of all.
      
      In devices where the revision ID is being copied over and used in what
      appears to be the equivalent of hotpath, I have left the copy code
      and the cached copy as not to influence the driver's performance.
      
      Compile tested with make all{yes,mod}config on x86_64 and i386.
      Signed-off-by: NAuke Kok <auke-jan.h.kok@intel.com>
      Acked-by: NDave Jones <davej@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>
      44c10138
  8. 10 7月, 2007 2 次提交
    • I
      sched: x86, track TSC-unstable events · bb29ab26
      Ingo Molnar 提交于
      track TSC-unstable events and propagate it to the scheduler code.
      Also allow sched_clock() to be used when the TSC is unstable,
      the rq_clock() wrapper creates a reliable clock out of it.
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      bb29ab26
    • I
      sched: zap the migration init / cache-hot balancing code · 0437e109
      Ingo Molnar 提交于
      the SMP load-balancer uses the boot-time migration-cost estimation
      code to attempt to improve the quality of balancing. The reason for
      this code is that the discrete priority queues do not preserve
      the order of scheduling accurately, so the load-balancer skips
      tasks that were running on a CPU 'recently'.
      
      this code is fundamental fragile: the boot-time migration cost detector
      doesnt really work on systems that had large L3 caches, it caused boot
      delays on large systems and the whole cache-hot concept made the
      balancing code pretty undeterministic as well.
      
      (and hey, i wrote most of it, so i can say it out loud that it sucks ;-)
      
      under CFS the same purpose of cache affinity can be achieved without
      any special cache-hot special-case: tasks are sorted in the 'timeline'
      tree and the SMP balancer picks tasks from the left side of the
      tree, thus the most cache-cold task is balanced automatically.
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      0437e109