1. 17 Dec, 2008 1 commit
    • M
      x86 smp: modify send_IPI_mask interface to accept cpumask_t pointers · e7986739
      Mike Travis committed
      Impact: cleanup, change parameter passing
      
        * Change genapic interfaces to accept cpumask_t pointers where possible.
      
        * Modify external callers to use cpumask_t pointers in function calls.
      
        * Create new send_IPI_mask_allbutself which is the same as the
          send_IPI_mask functions but removes smp_processor_id() from list.
          This removes another common need for a temporary cpumask_t variable.
      
        * Functions that used a temp cpumask_t variable for:
      
      	cpumask_t allbutme = cpu_online_map;
      
      	cpu_clear(smp_processor_id(), allbutme);
      	if (!cpus_empty(allbutme))
      		...
      
          become:
      
      	if (!cpus_equal(cpu_online_map, cpumask_of_cpu(cpu)))
      		...
      
        * Other minor code optimizations (like using cpus_clear instead of
          CPU_MASK_NONE, etc.)
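
      The before/after pattern in the bullet above can be sketched in user
      space. This is only an illustration: toy_mask_t and both helper names
      are hypothetical stand-ins (a 64-bit word instead of cpumask_t), not
      kernel API. The two forms agree whenever the calling CPU is itself in
      the online map, and self is assumed to be below 64.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for cpumask_t: one bit per CPU, at most 64 CPUs. */
typedef uint64_t toy_mask_t;

/* Old pattern: copy the online map, clear our own bit, test for empty. */
static bool toy_others_online_copy(toy_mask_t online, unsigned int self)
{
    toy_mask_t allbutme = online;            /* temporary copy of the map */

    allbutme &= ~((toy_mask_t)1 << self);    /* like cpu_clear(self, allbutme) */
    return allbutme != 0;                    /* like !cpus_empty(allbutme) */
}

/* New pattern: compare the online map with the single-CPU mask, no copy. */
static bool toy_others_online_compare(toy_mask_t online, unsigned int self)
{
    return online != ((toy_mask_t)1 << self); /* like !cpus_equal(...) */
}
```

      The second form avoids the temporary mask, which matters when the mask
      type grows with the configured CPU count.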
      
      Applies to linux-2.6.tip/master.
      Signed-off-by: Mike Travis <travis@sgi.com>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Acked-by: Ingo Molnar <mingo@elte.hu>
  2. 09 Dec, 2008 1 commit
  3. 08 Dec, 2008 3 commits
    • Y
      x86: MSI start irq numbering from nr_irqs_gsi · be5d5350
      Yinghai Lu committed
      Impact: sanitize MSI irq number ordering from top-down to bottom-up
      
      Increase new MSI IRQs starting from nr_irqs_gsi (which is somewhere below
      256), instead of decreasing from NR_IRQS. (The latter method can result
      in confusingly high IRQ numbers - if NR_CPUS is set to a high value and
      NR_IRQS scales up to a high value.)
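
      A minimal user-space sketch of the bottom-up numbering described above,
      assuming a tiny hypothetical IRQ space. The names toy_irq_used,
      TOY_NR_IRQS and toy_create_irq_from are invented for illustration and
      are not the kernel's allocator.

```c
#include <stdbool.h>

#define TOY_NR_IRQS 64              /* hypothetical, deliberately tiny */

/* true when the IRQ number is already taken */
static bool toy_irq_used[TOY_NR_IRQS];

/* Bottom-up search: claim the lowest free IRQ number >= start, or -1. */
static int toy_create_irq_from(unsigned int start)
{
    for (unsigned int irq = start; irq < TOY_NR_IRQS; irq++) {
        if (!toy_irq_used[irq]) {
            toy_irq_used[irq] = true;
            return (int)irq;
        }
    }
    return -1;                      /* no free number at or above start */
}
```

      Passing the first number above the GSI range as start yields small,
      consecutive MSI numbers instead of values near the top of the range.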
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • Y
      x86: use NR_IRQS_LEGACY · 99d093d1
      Yinghai Lu committed
      Impact: cleanup
      
      Introduce NR_IRQS_LEGACY instead of a hard-coded number.
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • Y
      sparse irq_desc[] array: core kernel and x86 changes · 0b8f1efa
      Yinghai Lu committed
      Impact: new feature
      
      Problem on distro kernels: irq_desc[NR_IRQS] takes megabytes of RAM with
      NR_CPUS set to large values. The goal is to be able to scale up to much
      larger NR_IRQS value without impacting the (important) common case.
      
      To solve this, we generalize irq_desc[NR_IRQS] to an (optional) array of
      irq_desc pointers.
      
      When CONFIG_SPARSE_IRQ=y is used, we use kzalloc_node() to allocate each
      irq_desc; this also makes the IRQ descriptors NUMA-local (to the site
      that calls request_irq()).
      
      This gets rid of the irq_cfg[] static array on x86 as well: irq_cfg now
      uses desc->chip_data for x86 to store irq_cfg.
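
      The idea of replacing a static descriptor array with an array of
      pointers filled on demand can be sketched in user space as follows.
      This is an assumption-laden toy: struct toy_irq_desc, toy_desc_ptrs and
      toy_irq_to_desc_alloc are hypothetical names, and calloc() stands in
      for the node-aware kernel allocation.

```c
#include <stdlib.h>

#define TOY_NR_IRQS 4096            /* hypothetical upper bound */

/* Hypothetical, much reduced descriptor. */
struct toy_irq_desc {
    unsigned int irq;
    void *chip_data;                /* slot for per-arch data such as irq_cfg */
};

/* Only pointers are static; descriptors exist once an IRQ is used. */
static struct toy_irq_desc *toy_desc_ptrs[TOY_NR_IRQS];

/* Return the descriptor for irq, allocating it on first use. */
static struct toy_irq_desc *toy_irq_to_desc_alloc(unsigned int irq)
{
    if (irq >= TOY_NR_IRQS)
        return NULL;

    if (!toy_desc_ptrs[irq]) {
        struct toy_irq_desc *desc = calloc(1, sizeof(*desc));

        if (!desc)
            return NULL;            /* out of memory */
        desc->irq = irq;
        toy_desc_ptrs[irq] = desc;
    }
    return toy_desc_ptrs[irq];
}
```

      With this layout the static cost is one pointer per possible IRQ, and
      full descriptors are paid for only by IRQs that are actually requested.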
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  4. 03 Dec, 2008 1 commit
  5. 01 Dec, 2008 2 commits
  6. 30 Nov, 2008 1 commit
  7. 27 Nov, 2008 1 commit
    • J
      x86: always define DECLARE_PCI_UNMAP* macros · b627c8b1
      Joerg Roedel committed
      Impact: fix boot crash on AMD IOMMU if CONFIG_GART_IOMMU is off
      
      Currently these macros evaluate to a no-op unless the kernel is compiled
      with GART or Calgary support. But we also need these macros when we have
      SWIOTLB, VT-d or AMD IOMMU in the kernel. Since we always compile with at
      least SWIOTLB, we can always define these macros.
      
      This patch is also for stable backport for the same reason the SWIOTLB
      default selection patch is.
      Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  8. 26 Nov, 2008 4 commits
  9. 23 Nov, 2008 1 commit
  10. 20 Nov, 2008 1 commit
    • U
      reintroduce accept4 · de11defe
      Ulrich Drepper committed
      Introduce a new accept4() system call.  The addition of this system call
      matches analogous changes in 2.6.27 (dup3(), eventfd2(), signalfd4(),
      inotify_init1(), epoll_create1(), pipe2()) which added new system calls
      that differed from analogous traditional system calls in adding a flags
      argument that can be used to access additional functionality.
      
      The accept4() system call is exactly the same as accept(), except that
      it adds a flags bit-mask argument.  Two flags are initially implemented.
      (Most of the new system calls in 2.6.27 also had both of these flags.)
      
      SOCK_CLOEXEC causes the close-on-exec (FD_CLOEXEC) flag to be enabled
      for the new file descriptor returned by accept4().  This is a useful
      security feature to avoid leaking information in a multithreaded
      program where one thread is doing an accept() at the same time as
      another thread is doing a fork() plus exec().  More details here:
      http://udrepper.livejournal.com/20407.html ("Secure File Descriptor
      Handling", Ulrich Drepper).
      
      The other flag is SOCK_NONBLOCK, which causes the O_NONBLOCK flag
      to be enabled on the new open file description created by accept4().
      (This flag is merely a convenience, saving the use of additional calls
      to fcntl(F_GETFL) and fcntl(F_SETFL) to achieve the same result.)
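
      For comparison, the two-call sequence that SOCK_NONBLOCK makes
      unnecessary might look like the sketch below. The helper name
      toy_set_nonblock is hypothetical; the fcntl() commands are the standard
      ones named in the paragraph above.

```c
#include <fcntl.h>

/*
 * Enable O_NONBLOCK on an already-open descriptor.
 * Returns 0 on success, -1 on error (errno is set by fcntl()).
 */
static int toy_set_nonblock(int fd)
{
    int flags = fcntl(fd, F_GETFL);     /* read the current status flags */

    if (flags == -1)
        return -1;
    if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
        return -1;
    return 0;
}
```

      With accept(), a caller would run this on the returned descriptor as a
      separate step; accept4() with SOCK_NONBLOCK sets the flag as part of
      the same system call.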
      
      Here's a test program.  Works on x86-32.  Should work on x86-64, but
      I (mtk) don't have a system to hand to test with.
      
      It tests accept4() with each of the four possible combinations of
      SOCK_CLOEXEC and SOCK_NONBLOCK set/clear in 'flags', and verifies
      that the appropriate flags are set on the file descriptor/open file
      description returned by accept4().
      
      I tested Ulrich's patch in this thread by applying against 2.6.28-rc2,
      and it passes according to my test program.
      
      /* test_accept4.c
      
        Copyright (C) 2008, Linux Foundation, written by Michael Kerrisk
             <mtk.manpages@gmail.com>
      
        Licensed under the GNU GPLv2 or later.
      */
      #define _GNU_SOURCE
      #include <unistd.h>
      #include <sys/syscall.h>
      #include <sys/socket.h>
      #include <netinet/in.h>
      #include <stdlib.h>
      #include <fcntl.h>
      #include <stdio.h>
      #include <string.h>
      
      #define PORT_NUM 33333
      
      #define die(msg) do { perror(msg); exit(EXIT_FAILURE); } while (0)
      
      /**********************************************************************/
      
      /* The following is what we need until glibc gets a wrapper for
        accept4() */
      
      /* Flags for socket(), socketpair(), accept4() */
      #ifndef SOCK_CLOEXEC
      #define SOCK_CLOEXEC    O_CLOEXEC
      #endif
      #ifndef SOCK_NONBLOCK
      #define SOCK_NONBLOCK   O_NONBLOCK
      #endif
      
      #ifdef __x86_64__
      #define SYS_accept4 288
      #elif __i386__
      #define USE_SOCKETCALL 1
      #define SYS_ACCEPT4 18
      #else
      #error "Sorry -- don't know the syscall # on this architecture"
      #endif
      
      static int
      accept4(int fd, struct sockaddr *sockaddr, socklen_t *addrlen, int flags)
      {
         printf("Calling accept4(): flags = %x", flags);
         if (flags != 0) {
             printf(" (");
             if (flags & SOCK_CLOEXEC)
                 printf("SOCK_CLOEXEC");
             if ((flags & SOCK_CLOEXEC) && (flags & SOCK_NONBLOCK))
                 printf(" ");
             if (flags & SOCK_NONBLOCK)
                 printf("SOCK_NONBLOCK");
             printf(")");
         }
         printf("\n");
      
      #if USE_SOCKETCALL
         long args[6];
      
         args[0] = fd;
         args[1] = (long) sockaddr;
         args[2] = (long) addrlen;
         args[3] = flags;
      
         return syscall(SYS_socketcall, SYS_ACCEPT4, args);
      #else
         return syscall(SYS_accept4, fd, sockaddr, addrlen, flags);
      #endif
      }
      
      /**********************************************************************/
      
      static int
      do_test(int lfd, struct sockaddr_in *conn_addr,
             int closeonexec_flag, int nonblock_flag)
      {
         int connfd, acceptfd;
         int fdf, flf, fdf_pass, flf_pass;
         struct sockaddr_in claddr;
         socklen_t addrlen;
      
         printf("=======================================\n");
      
         connfd = socket(AF_INET, SOCK_STREAM, 0);
         if (connfd == -1)
             die("socket");
         if (connect(connfd, (struct sockaddr *) conn_addr,
                     sizeof(struct sockaddr_in)) == -1)
             die("connect");
      
         addrlen = sizeof(struct sockaddr_in);
         acceptfd = accept4(lfd, (struct sockaddr *) &claddr, &addrlen,
                            closeonexec_flag | nonblock_flag);
         if (acceptfd == -1) {
             perror("accept4()");
             close(connfd);
             return 0;
         }
      
         fdf = fcntl(acceptfd, F_GETFD);
         if (fdf == -1)
             die("fcntl:F_GETFD");
         fdf_pass = ((fdf & FD_CLOEXEC) != 0) ==
                    ((closeonexec_flag & SOCK_CLOEXEC) != 0);
         printf("Close-on-exec flag is %sset (%s); ",
                 (fdf & FD_CLOEXEC) ? "" : "not ",
                 fdf_pass ? "OK" : "failed");
      
         flf = fcntl(acceptfd, F_GETFL);
         if (flf == -1)
             die("fcntl:F_GETFL");
         flf_pass = ((flf & O_NONBLOCK) != 0) ==
                    ((nonblock_flag & SOCK_NONBLOCK) != 0);
         printf("nonblock flag is %sset (%s)\n",
                 (flf & O_NONBLOCK) ? "" : "not ",
                 flf_pass ? "OK" : "failed");
      
         close(acceptfd);
         close(connfd);
      
         printf("Test result: %s\n", (fdf_pass && flf_pass) ? "PASS" : "FAIL");
         return fdf_pass && flf_pass;
      }
      
      static int
      create_listening_socket(int port_num)
      {
         struct sockaddr_in svaddr;
         int lfd;
         int optval;
      
         memset(&svaddr, 0, sizeof(struct sockaddr_in));
         svaddr.sin_family = AF_INET;
         svaddr.sin_addr.s_addr = htonl(INADDR_ANY);
         svaddr.sin_port = htons(port_num);
      
         lfd = socket(AF_INET, SOCK_STREAM, 0);
         if (lfd == -1)
             die("socket");
      
         optval = 1;
         if (setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &optval,
                        sizeof(optval)) == -1)
             die("setsockopt");
      
         if (bind(lfd, (struct sockaddr *) &svaddr,
                  sizeof(struct sockaddr_in)) == -1)
             die("bind");
      
         if (listen(lfd, 5) == -1)
             die("listen");
      
         return lfd;
      }
      
      int
      main(int argc, char *argv[])
      {
         struct sockaddr_in conn_addr;
         int lfd;
         int port_num;
         int passed;
      
         passed = 1;
      
         port_num = (argc > 1) ? atoi(argv[1]) : PORT_NUM;
      
         memset(&conn_addr, 0, sizeof(struct sockaddr_in));
         conn_addr.sin_family = AF_INET;
         conn_addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
         conn_addr.sin_port = htons(port_num);
      
         lfd = create_listening_socket(port_num);
      
         if (!do_test(lfd, &conn_addr, 0, 0))
             passed = 0;
         if (!do_test(lfd, &conn_addr, SOCK_CLOEXEC, 0))
             passed = 0;
         if (!do_test(lfd, &conn_addr, 0, SOCK_NONBLOCK))
             passed = 0;
         if (!do_test(lfd, &conn_addr, SOCK_CLOEXEC, SOCK_NONBLOCK))
             passed = 0;
      
         close(lfd);
      
         exit(passed ? EXIT_SUCCESS : EXIT_FAILURE);
      }
      
      [mtk.manpages@gmail.com: rewrote changelog, updated test program]
      Signed-off-by: Ulrich Drepper <drepper@redhat.com>
      Tested-by: Michael Kerrisk <mtk.manpages@gmail.com>
      Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: <linux-api@vger.kernel.org>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 19 Nov, 2008 2 commits
  12. 18 Nov, 2008 3 commits
    • F
      tracing/function-return-tracer: add the overrun field · 0231022c
      Frederic Weisbecker committed
      Impact: help find a better trace depth
      
      We arbitrarily defined the depth of the function return trace as
      "20". Perhaps this is not enough. To help find an optimal depth, we
      now measure the overrun: the number of functions that have been missed
      for the current thread. By default this is not displayed; we have to
      set a particular flag on the return tracer:
      
      	echo overrun > /debug/tracing/trace_options
      
      and the overrun will be printed on the right.
      
      As the trace below shows, the current depth of 20 is not enough.
      
      update_wall_time+0x37f/0x8c0 -> update_xtime_cache (345 ns) (Overruns: 2838)
      update_wall_time+0x384/0x8c0 -> clocksource_get_next (1141 ns) (Overruns: 2838)
      do_timer+0x23/0x100 -> update_wall_time (3882 ns) (Overruns: 2838)
      tick_do_update_jiffies64+0xbf/0x160 -> do_timer (5339 ns) (Overruns: 2838)
      tick_sched_timer+0x6a/0xf0 -> tick_do_update_jiffies64 (7209 ns) (Overruns: 2838)
      vgacon_set_cursor_size+0x98/0x120 -> native_io_delay (2613 ns) (Overruns: 274)
      vgacon_cursor+0x16e/0x1d0 -> vgacon_set_cursor_size (33151 ns) (Overruns: 274)
      set_cursor+0x5f/0x80 -> vgacon_cursor (36432 ns) (Overruns: 274)
      con_flush_chars+0x34/0x40 -> set_cursor (38790 ns) (Overruns: 274)
      release_console_sem+0x1ec/0x230 -> up (721 ns) (Overruns: 274)
      release_console_sem+0x225/0x230 -> wake_up_klogd (316 ns) (Overruns: 274)
      con_flush_chars+0x39/0x40 -> release_console_sem (2996 ns) (Overruns: 274)
      con_write+0x22/0x30 -> con_flush_chars (46067 ns) (Overruns: 274)
      n_tty_write+0x1cc/0x360 -> con_write (292670 ns) (Overruns: 274)
      smp_apic_timer_interrupt+0x2a/0x90 -> native_apic_mem_write (330 ns) (Overruns: 274)
      irq_enter+0x17/0x70 -> idle_cpu (413 ns) (Overruns: 274)
      smp_apic_timer_interrupt+0x2f/0x90 -> irq_enter (1525 ns) (Overruns: 274)
      ktime_get_ts+0x40/0x70 -> getnstimeofday (465 ns) (Overruns: 274)
      ktime_get_ts+0x60/0x70 -> set_normalized_timespec (436 ns) (Overruns: 274)
      ktime_get+0x16/0x30 -> ktime_get_ts (2501 ns) (Overruns: 274)
      hrtimer_interrupt+0x77/0x1a0 -> ktime_get (3439 ns) (Overruns: 274)
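
      The overrun counter described above can be sketched as a fixed-depth
      stack that counts the pushes it has to refuse. All names here
      (toy_ret_stack, toy_push_return, toy_pop_return) are hypothetical, and
      this is a user-space illustration rather than the tracer's code.

```c
#define TOY_RET_DEPTH 20            /* same arbitrary depth as in the text */

/* Hypothetical per-thread return stack. */
struct toy_ret_stack {
    int depth;                      /* number of entries currently stored */
    unsigned long overrun;          /* calls missed because the stack was full */
    unsigned long ret[TOY_RET_DEPTH];
};

/* Record a return address; count an overrun when the stack is full. */
static int toy_push_return(struct toy_ret_stack *s, unsigned long ret_addr)
{
    if (s->depth >= TOY_RET_DEPTH) {
        s->overrun++;
        return -1;                  /* this call is not traced */
    }
    s->ret[s->depth++] = ret_addr;
    return 0;
}

/* Pop the most recent return address; the caller must have pushed one. */
static unsigned long toy_pop_return(struct toy_ret_stack *s)
{
    return s->ret[--s->depth];
}
```

      A steadily growing overrun value, as in the trace above, suggests that
      the chosen depth is too small for the workload.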
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • Y
      x86: fix wakeup_cpu with numaq/es7000, v2, fix · 54ac14a8
      Yinghai Lu committed
      Impact: fix wakeup_secondary_cpu with hotplug
      
      We cannot put wakeup_cpu into x86_quirks, because x86_quirks is
      __initdata. So move it to genapic, and add update_genapic to x86_quirks.
      
      Later we could even use that stub to:
      
       1. autodetect CONFIG_ES7000_CLUSTERED_APIC
       2. make inquire_remote_apic more correct with the apic_verbosity setting.
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • Y
      x86: fix wakeup_cpu with numaq/es7000, v2 · 569712b2
      Yinghai Lu committed
      Impact: fix secondary-CPU wakeup/init path with numaq and es7000
      
      While looking at wakeup_secondary_cpu for WAKE_SECONDARY_VIA_NMI:
      
      |#ifdef WAKE_SECONDARY_VIA_NMI
      |/*
      | * Poke the other CPU in the eye via NMI to wake it up. Remember that the normal
      | * INIT, INIT, STARTUP sequence will reset the chip hard for us, and this
      | * won't ... remember to clear down the APIC, etc later.
      | */
      |static int __devinit
      |wakeup_secondary_cpu(int logical_apicid, unsigned long start_eip)
      |{
      |        unsigned long send_status, accept_status = 0;
      |        int maxlvt;
      |...
      |        if (APIC_INTEGRATED(apic_version[phys_apicid])) {
      |                maxlvt = lapic_get_maxlvt();
      
      I noticed that there is no warning about undefined phys_apicid...
      
      because WAKE_SECONDARY_VIA_NMI and WAKE_SECONDARY_VIA_INIT can not be
      defined at the same time. So NUMAQ is using wrong wakeup_secondary_cpu.
      
      WAKE_SECONDARY_VIA_NMI, WAKE_SECONDARY_VIA_INIT and
      WAKE_SECONDARY_VIA_MIP are variants of a weird and fragile
      preprocessor-driven "HAL" mechanism to specify the kind of secondary-CPU
      wakeup strategy a given x86 kernel will use.
      
      The vast majority of systems want to use INIT for secondary wakeup - NUMAQ
      uses an NMI, (old-style-) ES7000 uses 'MIP' (a firmware driven in-memory
      flag to let secondaries continue).
      
      So convert these mechanisms to x86_quirks and add a
      ->wakeup_secondary_cpu() method to specify the rare exception
      to the sane default.
      
      Extend genapic accordingly as well, for 32-bit.
      
      While looking further, I noticed that the functions in wakecpu.h for numaq
      and es7000 are different from the default in mach_wakecpu.h - but smpboot.c
      will only use the default mach_wakecpu.h with smphook.h.
      
      So we need to add mach_wakecpu.h for mach_generic, to properly support
      numaq and es7000, and vectorize the following SMP init methods:
      
      	int trampoline_phys_low;
      	int trampoline_phys_high;
      	void (*wait_for_init_deassert)(atomic_t *deassert);
      	void (*smp_callin_clear_local_apic)(void);
      	void (*store_NMI_vector)(unsigned short *high, unsigned short *low);
      	void (*restore_NMI_vector)(unsigned short *high, unsigned short *low);
      	void (*inquire_remote_apic)(int apicid);
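
      The "method with a sane default" idea can be sketched as an optional
      function pointer that falls back to the common path when unset. The
      struct, enum and function names below are hypothetical and the bodies
      only report which path was taken; they do not wake any CPU.

```c
#include <stddef.h>

/* Hypothetical tags telling the caller which path ran. */
enum toy_wakeup_kind { TOY_WAKE_INIT, TOY_WAKE_NMI };

/* Hypothetical quirk table: NULL means "use the default INIT wakeup". */
struct toy_quirks {
    int (*wakeup_secondary_cpu)(int apicid, unsigned long start_eip);
};

/* Default path, used by the vast majority of systems. */
static int toy_wakeup_via_init(int apicid, unsigned long start_eip)
{
    (void)apicid;
    (void)start_eip;
    return TOY_WAKE_INIT;
}

/* Rare exception, e.g. a platform that wakes secondaries with an NMI. */
static int toy_wakeup_via_nmi(int apicid, unsigned long start_eip)
{
    (void)apicid;
    (void)start_eip;
    return TOY_WAKE_NMI;
}

/* Dispatch at run time instead of choosing with #ifdef at build time. */
static int toy_wakeup_cpu(const struct toy_quirks *q, int apicid,
                          unsigned long start_eip)
{
    if (q && q->wakeup_secondary_cpu)
        return q->wakeup_secondary_cpu(apicid, start_eip);
    return toy_wakeup_via_init(apicid, start_eip);
}
```

      Because the choice is made through a pointer, one kernel image can
      support several wakeup strategies, which the mutually exclusive
      preprocessor symbols could not.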
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  13. 16 Nov, 2008 2 commits
    • S
      ftrace: pass module struct to arch dynamic ftrace functions · 31e88909
      Steven Rostedt committed
      Impact: allow archs more flexibility on dynamic ftrace implementations
      
      Dynamic ftrace has largely been developed on x86. Since x86 does not
      have the same limitations as other architectures, the ftrace interaction
      between the generic code and the architecture specific code was not
      flexible enough to handle some of the issues that other architectures
      have.
      
      Most notably, module trampolines. Due to the limited branch distance
      that archs make in calling kernel core code from modules, the module
      load code must create a trampoline to jump to what will make the
      larger jump into core kernel code.
      
      The problem arises when this happens to a call to mcount. Ftrace checks
      all code before modifying it and makes sure the current code is what
      it expects. Right now, there is not enough information to handle modifying
      module trampolines.
      
      This patch changes the API between generic dynamic ftrace code and
      the arch dependent code. There are now two functions for modifying code:
      
        ftrace_make_nop(mod, rec, addr) - convert the code at rec->ip into
             a nop, where the original text is calling addr. (mod is the
             module struct if called by module init)
      
        ftrace_make_caller(rec, addr) - convert the code rec->ip that should
             be a nop into a caller to addr.
      
      The record "rec" now has a new field called "arch" where the architecture
      can add any special attributes to each call site record.
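
      The "check before modifying" rule mentioned above can be illustrated
      with a small user-space sketch that patches a byte buffer only when its
      current contents match what the caller expects. toy_patch_checked is a
      hypothetical helper, not one of the arch functions listed above.

```c
#include <stddef.h>
#include <string.h>

/*
 * Replace len bytes at code with repl, but only if they currently equal
 * expect. Returns 0 on success, -1 if the existing bytes differ.
 */
static int toy_patch_checked(unsigned char *code,
                             const unsigned char *expect,
                             const unsigned char *repl, size_t len)
{
    if (memcmp(code, expect, len) != 0)
        return -1;                  /* unexpected contents: leave untouched */

    memcpy(code, repl, len);
    return 0;
}
```

      A call site that goes through a module trampoline does not contain the
      bytes the generic code would expect for a direct call, which is one way
      to see why the arch code needs the extra module information.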
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • D
      Revert "x86: blacklist DMAR on Intel G31/G33 chipsets" · 52168e60
      David Woodhouse committed
      This reverts commit e51af663, which was
      wrongly hoovered up and submitted about a month after a better fix had
      already been merged.
      
      The better fix is commit cbda1ba8
      ("PCI/iommu: blacklist DMAR on Intel G31/G33 chipsets"), where we do
      this blacklisting based on the DMI identification for the offending
      motherboard, since sometimes this chipset (or at least a chipset with
      the same PCI ID) apparently _does_ actually have an IOMMU.
      Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 13 Nov, 2008 1 commit
    • R
      x86, hibernate: fix breakage on x86_32 with CONFIG_NUMA set · 97a70e54
      Rafael J. Wysocki committed
      Impact: fix crash during hibernation on 32-bit NUMA
      
      The NUMA code on x86_32 creates special memory mapping that allows
      each node's pgdat to be located in this node's memory.  For this
      purpose it allocates a memory area at the end of each node's memory
      and maps this area so that it is accessible with virtual addresses
      belonging to low memory.  As a result, if there is high memory,
      these NUMA-allocated areas are physically located in high memory,
      although they are mapped to low memory addresses.
      
      Our hibernation code does not take that into account and for this
      reason hibernation fails on all x86_32 systems with CONFIG_NUMA=y and
      with high memory present.  Fix this by adding a special mapping for
      the NUMA-allocated memory areas to the temporary page tables created
      during the last phase of resume.
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  15. 12 Nov, 2008 2 commits
    • B
      ACPI: pci_link: remove acpi_irq_balance_set() interface · 32836259
      Bjorn Helgaas committed
      This removes the acpi_irq_balance_set() interface from the PCI
      interrupt link driver.
      
      x86 used acpi_irq_balance_set() to tell the PCI interrupt link
      driver to configure links to minimize IRQ sharing.  But the link
      driver can easily figure out whether to turn on IRQ balancing
      based on the IRQ model (PIC/IOAPIC/etc), so we can get rid of
      that external interface.
      
      It's better for the driver to figure this out at init-time.  If
      we set it externally via the x86 code, the interface reduces
      modularity, and we depend on the fact that acpi_process_madt()
      happens before we process the kernel command line.
      Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
      Signed-off-by: Len Brown <len.brown@intel.com>
    • H
      x86: attempt reboot via port CF9 if we have standard PCI ports · 14d7ca5c
      H. Peter Anvin committed
      Impact: Changes reboot behavior.
      
      If port CF9 seems to be safe to touch, attempt it before trying the
      keyboard controller.  Port CF9 is not available on all chipsets (a
      significant but decreasing number of modern chipsets don't implement
      it), but port CF9 itself should in general be safe to poke (no ill
      effects if unimplemented) on any system which has PCI Configuration
      Method #1 or #2, as it falls inside the PCI configuration port range
      in both cases.  No chipset without PCI is known to have port CF9,
      either, although an explicit "pci=bios" would mean we miss this and
      therefore don't use port CF9.  An explicit "reboot=pci" can be used to
      force the use of port CF9.
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
  16. 11 Nov, 2008 2 commits
    • I
      x86: call machine_shutdown and stop all CPUs in native_machine_halt · d3ec5cae
      Ivan Vecera committed
      Impact: really halt all CPUs on halt
      
      Function machine_halt (resp. native_machine_halt) is empty for x86
      architectures. When command 'halt -f' is invoked, the message "System
      halted." is displayed but this is not really true because all CPUs are
      still running.
      
      There are also similar inconsistencies for other arches (some use
      power-off for halt, or an endless loop with IRQs enabled/disabled).
      
      IMO the same approach should be used for all architectures; otherwise,
      what does the message "System halted" really mean?
      
      This patch fixes it for x86.
      Signed-off-by: Ivan Vecera <ivecera@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • F
      tracing, x86: add low level support for ftrace return tracing · caf4b323
      Frederic Weisbecker committed
      Impact: add infrastructure for function-return tracing
      
      Add low level support for ftrace return tracing.
      
      This plug-in stores return addresses on the thread_info structure of
      the current task.
      
      The index of the current return address is initialized when the task
      is the first one (init) and when a process forks (the child). It is
      not needed when a task does a sys_execve because after this syscall,
      it still needs to return to the kernel functions it called.
      
      Note that the code of return_to_handler was suggested by Steven
      Rostedt, as were almost all of the improvement ideas in this V3.
      
      As a precaution, arch/x86/kernel/process_32.c is not traced
      because __switch_to() changes the current task during its execution.
      That could cause inconsistency in the stored return address of this
      function, even though I did not see any crash when testing with
      tracing enabled on this function.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  17. 10 Nov, 2008 1 commit
  18. 08 Nov, 2008 1 commit
    • I
      sched: improve sched_clock() performance · 0d12cdd5
      Ingo Molnar committed
      In scheduler-intense workloads native_read_tsc() overhead accounts for
      20% of the system overhead:
      
       659567 system_call                              41222.9375
       686796 schedule                                 435.7843
       718382 __switch_to                              665.1685
       823875 switch_mm                                4526.7857
       1883122 native_read_tsc                          55385.9412
       9761990 total                                      2.8468
      
      This is in large part due to the rdtsc_barrier() that is done before
      and after reading the TSC.
      
      But sched_clock() is not a precise clock in the GTOD sense, so using such
      barriers is completely pointless. So remove the barriers and only use
      them in vget_cycles().
      
      This improves lat_ctx performance by about 5%.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  19. 06 Nov, 2008 3 commits
    • Y
      x86: remove VISWS and PARAVIRT around NR_IRQS puzzle · 7db282fa
      Yinghai Lu committed
      Impact: fix warning message when PARAVIRT is set in config
      
      Remove stale #ifdef components from our IRQ sizing logic.
      x86/Voyager is the only holdout.
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • Y
      x86: size NR_IRQS on 32-bit systems the same way as 64-bit · 1b489768
      Yinghai Lu committed
      Impact: make NR_IRQS big enough for system with lots of apic/pins
      
      If lots of IO_APIC's are there (or can be there), size the same way
      as 64-bit, depending on MAX_IO_APICS and NR_CPUS.
      
      This fixes the boot problem reported by Ben Hutchings on a 32-bit
      server with 5 IO-APICs and 240 IO-APIC pins.
      Signed-off-by: Yinghai <yinghai@kernel.org>
      Tested-by: Ben Hutchings <bhutchings@solarflare.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • I
      sched: re-tune balancing · 9fcd18c9
      Ingo Molnar committed
      Impact: improve wakeup affinity on NUMA systems, tweak SMP systems
      
      Given the fixes+tweaks to the wakeup-buddy code, re-tweak the domain
      balancing defaults on NUMA and SMP systems.
      
      Turn on SD_WAKE_AFFINE which was off on x86 NUMA - there's no reason
      why we would not want to have wakeup affinity across nodes as well.
      (we already do this in the standard NUMA template.)
      
      lat_ctx on a NUMA box is particularly happy about this change:
      
      before:
      
       |   phoenix:~/l> ./lat_ctx -s 0 2
       |   "size=0k ovr=2.60
       |   2 5.70
      
      after:
      
       |   phoenix:~/l> ./lat_ctx -s 0 2
       |   "size=0k ovr=2.65
       |   2 2.07
      
      a 2.75x speedup.
      
      pipe-test is similarly happy about it too:
      
       |  phoenix:~/sched-tests> ./pipe-test
       |   18.26 usecs/loop.
       |   14.70 usecs/loop.
       |   14.38 usecs/loop.
       |   10.55 usecs/loop.              # +WAKE_AFFINE on domain0+domain1
       |   8.63 usecs/loop.
       |   8.59 usecs/loop.
       |   9.03 usecs/loop.
       |   8.94 usecs/loop.
       |   8.96 usecs/loop.
       |   8.63 usecs/loop.
      
      Also:
      
       - disable SD_BALANCE_NEWIDLE on NUMA and SMP domains (keep it for siblings)
       - enable SD_WAKE_BALANCE on SMP domains
      
      Sysbench+postgresql improves across the board, quite significantly:
      
                 .28-rc3-11474e2c  .28-rc3-11474e2c-tune
      -------------------------------------------------
          1:             571              688    +17.08%
          2:            1236             1206    -2.55%
          4:            2381             2642    +9.89%
          8:            4958             5164    +3.99%
         16:            9580             9574    -0.07%
         32:            7128             8118    +12.20%
         64:            7342             8266    +11.18%
        128:            7342             8064    +8.95%
        256:            7519             7884    +4.62%
        512:            7350             7731    +4.93%
      -------------------------------------------------
        SUM:           55412            59341    +6.62%
      
      So it's a win both for the runup portion, the peak area and the tail.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  20. 03 Nov, 2008 2 commits
  21. 31 Oct, 2008 5 commits