1. 24 6月, 2005 21 次提交
    • A
      [PATCH] make various thing static · 52c1da39
      Adrian Bunk 提交于
      Another rollup of patches which give various symbols static scope
      Signed-off-by: NAdrian Bunk <bunk@stusta.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      52c1da39
    • M
      [PATCH] modules: add version and srcversion to sysfs · c988d2b2
      Matt Domsch 提交于
      This patch adds version and srcversion files to
      /sys/module/${modulename} containing the version and srcversion fields
      of the module's modinfo section (if present).
      
      /sys/module/e1000
      |-- srcversion
      `-- version
      
      This patch differs slightly from the version posted in January, as it
      now uses the new kstrdup() call in -mm.
      
      Why put this in sysfs?
      
      a) Tools like DKMS, which deal with changing out individual kernel
         modules without replacing the whole kernel, can behave smarter if they
         can tell the version of a given module.  The autoinstaller feature, for
         example, which determines if your system has a "good" version of a
         driver (i.e.  if the one provided by DKMS has a newer verson than that
         provided by the kernel package installed), and to automatically compile
         and install a newer version if DKMS has it but your kernel doesn't yet
         have that version.
      
      b) Because sysadmins manually, or with tools like DKMS, can switch out
         modules on the file system, you can't count on 'modinfo foo.ko', which
         looks at /lib/modules/${kernelver}/...  actually matching what is loaded
         into the kernel already.  Hence asking sysfs for this.
      
      c) as the unbind-driver-from-device work takes shape, it will be
         possible to rebind a driver that's built-in (no .ko to modinfo for the
         version) to a newly loaded module.  sysfs will have the
         currently-built-in version info, for comparison.
      
      d) tech support scripts can then easily grab the version info for what's
         running presently - a question I get often.
      
      There has been renewed interest in this patch on linux-scsi by driver
      authors.
      
      As the idea originated from GregKH, I leave his Signed-off-by: intact,
      though the implementation is nearly completely new.  Compiled and run on
      x86 and x86_64.
      
      From: Matthew Dobson <colpatch@us.ibm.com>
      
            build fix
      
      From: Thierry Vignaud <tvignaud@mandriva.com>
      
            build fix
      
      From: Matthew Dobson <colpatch@us.ibm.com>
      
            warning fix
      Signed-off-by: NGreg Kroah-Hartman <greg@kroah.com>
      Signed-off-by: NMatt Domsch <Matt_Domsch@dell.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      c988d2b2
    • D
      [PATCH] Keys: Make request-key create an authorisation key · 3e30148c
      David Howells 提交于
      The attached patch makes the following changes:
      
       (1) There's a new special key type called ".request_key_auth".
      
           This is an authorisation key for when one process requests a key and
           another process is started to construct it. This type of key cannot be
           created by the user; nor can it be requested by kernel services.
      
           Authorisation keys hold two references:
      
           (a) Each refers to a key being constructed. When the key being
           	 constructed is instantiated the authorisation key is revoked,
           	 rendering it of no further use.
      
           (b) The "authorising process". This is either:
      
           	 (i) the process that called request_key(), or:
      
           	 (ii) if the process that called request_key() itself had an
           	      authorisation key in its session keyring, then the authorising
           	      process referred to by that authorisation key will also be
           	      referred to by the new authorisation key.
      
      	 This means that the process that initiated a chain of key requests
      	 will authorise the lot of them, and will, by default, wind up with
      	 the keys obtained from them in its keyrings.
      
       (2) request_key() creates an authorisation key which is then passed to
           /sbin/request-key in as part of a new session keyring.
      
       (3) When request_key() is searching for a key to hand back to the caller, if
           it comes across an authorisation key in the session keyring of the
           calling process, it will also search the keyrings of the process
           specified therein and it will use the specified process's credentials
           (fsuid, fsgid, groups) to do that rather than the calling process's
           credentials.
      
           This allows a process started by /sbin/request-key to find keys belonging
           to the authorising process.
      
       (4) A key can be read, even if the process executing KEYCTL_READ doesn't have
           direct read or search permission if that key is contained within the
           keyrings of a process specified by an authorisation key found within the
           calling process's session keyring, and is searchable using the
           credentials of the authorising process.
      
           This allows a process started by /sbin/request-key to read keys belonging
           to the authorising process.
      
       (5) The magic KEY_SPEC_*_KEYRING key IDs when passed to KEYCTL_INSTANTIATE or
           KEYCTL_NEGATE will specify a keyring of the authorising process, rather
           than the process doing the instantiation.
      
       (6) One of the process keyrings can be nominated as the default to which
           request_key() should attach new keys if not otherwise specified. This is
           done with KEYCTL_SET_REQKEY_KEYRING and one of the KEY_REQKEY_DEFL_*
           constants. The current setting can also be read using this call.
      
       (7) request_key() is partially interruptible. If it is waiting for another
           process to finish constructing a key, it can be interrupted. This permits
           a request-key cycle to be broken without recourse to rebooting.
      Signed-Off-By: NDavid Howells <dhowells@redhat.com>
      Signed-Off-By: NBenoit Boissinot <benoit.boissinot@ens-lyon.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      3e30148c
    • D
      [PATCH] Keys: Pass session keyring to call_usermodehelper() · 7888e7ff
      David Howells 提交于
      The attached patch makes it possible to pass a session keyring through to the
      process spawned by call_usermodehelper().  This allows patch 3/3 to pass an
      authorisation key through to /sbin/request-key, thus permitting better access
      controls when doing just-in-time key creation.
      Signed-Off-By: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      7888e7ff
    • B
      [PATCH] aio: make wait_queue ->task ->private · c43dc2fd
      Benjamin LaHaise 提交于
      In the upcoming aio_down patch, it is useful to store a private data
      pointer in the kiocb's wait_queue.  Since we provide our own wake up
      function and do not require the task_struct pointer, it makes sense to
      convert the task pointer into a generic private pointer.
      Signed-off-by: NBenjamin LaHaise <benjamin.c.lahaise@intel.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      c43dc2fd
    • C
      [PATCH] Optimize sys_times for a single thread process · 71a2224d
      Christoph Lameter 提交于
      Avoid taking the tasklist_lock in sys_times if the process is single
      threaded.  In a NUMA system taking the tasklist_lock may cause a bouncing
      cacheline if multiple independent processes continually call sys_times to
      measure their performance.
      Signed-off-by: NChristoph Lameter <christoph@lameter.com>
      Signed-off-by: NShai Fultheim <shai@scalex86.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      71a2224d
    • K
      [PATCH] Software suspend and recalc sigpending bug fix · 4fea2838
      Kirill Korotaev 提交于
      This patch fixes recalc_sigpending() to work correctly with tasks which are
      being freezed.
      
      The problem is that freeze_processes() sets PF_FREEZE and TIF_SIGPENDING
      flags on tasks, but recalc_sigpending() called from e.g.
      sys_rt_sigtimedwait or any other kernel place will clear TIF_SIGPENDING due
      to no pending signals queued and the tasks won't be freezed until it
      recieves a real signal or freezed_processes() fail due to timeout.
      Signed-Off-By: NKirill Korotaev <dev@sw.ru>
      Signed-Off-By: NAlexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      4fea2838
    • A
      [PATCH] setuid core dump · d6e71144
      Alan Cox 提交于
      Add a new `suid_dumpable' sysctl:
      
      This value can be used to query and set the core dump mode for setuid
      or otherwise protected/tainted binaries. The modes are
      
      0 - (default) - traditional behaviour.  Any process which has changed
          privilege levels or is execute only will not be dumped
      
      1 - (debug) - all processes dump core when possible.  The core dump is
          owned by the current user and no security is applied.  This is intended
          for system debugging situations only.  Ptrace is unchecked.
      
      2 - (suidsafe) - any binary which normally would not be dumped is dumped
          readable by root only.  This allows the end user to remove such a dump but
          not access it directly.  For security reasons core dumps in this mode will
          not overwrite one another or other files.  This mode is appropriate when
          adminstrators are attempting to debug problems in a normal environment.
      
      (akpm:
      
      > > +EXPORT_SYMBOL(suid_dumpable);
      >
      > EXPORT_SYMBOL_GPL?
      
      No problem to me.
      
      > >  	if (current->euid == current->uid && current->egid == current->gid)
      > >  		current->mm->dumpable = 1;
      >
      > Should this be SUID_DUMP_USER?
      
      Actually the feedback I had from last time was that the SUID_ defines
      should go because its clearer to follow the numbers. They can go
      everywhere (and there are lots of places where dumpable is tested/used
      as a bool in untouched code)
      
      > Maybe this should be renamed to `dump_policy' or something.  Doing that
      > would help us catch any code which isn't using the #defines, too.
      
      Fair comment. The patch was designed to be easy to maintain for Red Hat
      rather than for merging. Changing that field would create a gigantic
      diff because it is used all over the place.
      
      )
      Signed-off-by: NAlan Cox <alan@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      d6e71144
    • P
      [PATCH] jprobes: allow a jprobe to coexist with muliple kprobes · 8b0914ea
      Prasanna S Panchamukhi 提交于
      Presently either multiple kprobes or only one jprobe could be inserted.
      This patch removes the above limitation and allows one jprobe and multiple
      kprobes to coexist at the same address.  However multiple jprobes cannot
      coexist with multiple kprobes.  Currently I am working on the prototype to
      allow multiple jprobes coexist with multiple kprobes.
      Signed-off-by: NAnanth N Mavinakayanhalli <amavin@redhat.com>
      Signed-off-by: NPrasanna S Panchamukhi <prasanna@in.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      8b0914ea
    • P
      [PATCH] kprobes: Temporary disarming of reentrant probe · ea32c65c
      Prasanna S Panchamukhi 提交于
      In situations where a kprobes handler calls a routine which has a probe on it,
      then kprobes_handler() disarms the new probe forever.  This patch removes the
      above limitation by temporarily disarming the new probe.  When the another
      probe hits while handling the old probe, the kprobes_handler() saves previous
      kprobes state and handles the new probe without calling the new kprobes
      registered handlers.  kprobe_post_handler() restores back the previous kprobes
      state and the normal execution continues.
      
      However on x86_64 architecture, re-rentrancy is provided only through
      pre_handler().  If a routine having probe is referenced through
      post_handler(), then the probes on that routine are disarmed forever, since
      the exception stack is gets changed after the processor single steps the
      instruction of the new probe.
      
      This patch includes generic changes to support temporary disarming on
      reentrancy of probes.
      Signed-of-by: NPrasanna S Panchamukhi <prasanna@in.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      ea32c65c
    • H
      [PATCH] kprobes: moves lock-unlock to non-arch kprobe_flush_task · 0aa55e4d
      Hien Nguyen 提交于
      This patch moves the lock/unlock of the arch specific kprobe_flush_task()
      to the non-arch specific kprobe_flusk_task().
      Signed-off-by: NHien Nguyen <hien@us.ibm.com>
      Acked-by: NPrasanna S Panchamukhi <prasanna@in.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      0aa55e4d
    • R
      [PATCH] Move kprobe [dis]arming into arch specific code · 7e1048b1
      Rusty Lynch 提交于
      The architecture independent code of the current kprobes implementation is
      arming and disarming kprobes at registration time.  The problem is that the
      code is assuming that arming and disarming is a just done by a simple write
      of some magic value to an address.  This is problematic for ia64 where our
      instructions look more like structures, and we can not insert break points
      by just doing something like:
      
      *p->addr = BREAKPOINT_INSTRUCTION;
      
      The following patch to 2.6.12-rc4-mm2 adds two new architecture dependent
      functions:
      
           * void arch_arm_kprobe(struct kprobe *p)
           * void arch_disarm_kprobe(struct kprobe *p)
      
      and then adds the new functions for each of the architectures that already
      implement kprobes (spar64/ppc64/i386/x86_64).
      
      I thought arch_[dis]arm_kprobe was the most descriptive of what was really
      happening, but each of the architectures already had a disarm_kprobe()
      function that was really a "disarm and do some other clean-up items as
      needed when you stumble across a recursive kprobe." So...  I took the
      liberty of changing the code that was calling disarm_kprobe() to call
      arch_disarm_kprobe(), and then do the cleanup in the block of code dealing
      with the recursive kprobe case.
      
      So far this patch as been tested on i386, x86_64, and ppc64, but still
      needs to be tested in sparc64.
      Signed-off-by: NRusty Lynch <rusty.lynch@intel.com>
      Signed-off-by: NAnil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      7e1048b1
    • H
      [PATCH] kprobes: function-return probes · b94cce92
      Hien Nguyen 提交于
      This patch adds function-return probes to kprobes for the i386
      architecture.  This enables you to establish a handler to be run when a
      function returns.
      
      1. API
      
      Two new functions are added to kprobes:
      
      	int register_kretprobe(struct kretprobe *rp);
      	void unregister_kretprobe(struct kretprobe *rp);
      
      2. Registration and unregistration
      
      2.1 Register
      
        To register a function-return probe, the user populates the following
        fields in a kretprobe object and calls register_kretprobe() with the
        kretprobe address as an argument:
      
        kp.addr - the function's address
      
        handler - this function is run after the ret instruction executes, but
        before control returns to the return address in the caller.
      
        maxactive - The maximum number of instances of the probed function that
        can be active concurrently.  For example, if the function is non-
        recursive and is called with a spinlock or mutex held, maxactive = 1
        should be enough.  If the function is non-recursive and can never
        relinquish the CPU (e.g., via a semaphore or preemption), NR_CPUS should
        be enough.  maxactive is used to determine how many kretprobe_instance
        objects to allocate for this particular probed function.  If maxactive <=
        0, it is set to a default value (if CONFIG_PREEMPT maxactive=max(10, 2 *
        NR_CPUS) else maxactive=NR_CPUS)
      
        For example:
      
          struct kretprobe rp;
          rp.kp.addr = /* entrypoint address */
          rp.handler = /*return probe handler */
          rp.maxactive = /* e.g., 1 or NR_CPUS or 0, see the above explanation */
          register_kretprobe(&rp);
      
        The following field may also be of interest:
      
        nmissed - Initialized to zero when the function-return probe is
        registered, and incremented every time the probed function is entered but
        there is no kretprobe_instance object available for establishing the
        function-return probe (i.e., because maxactive was set too low).
      
      2.2 Unregister
      
        To unregiter a function-return probe, the user calls
        unregister_kretprobe() with the same kretprobe object as registered
        previously.  If a probed function is running when the return probe is
        unregistered, the function will return as expected, but the handler won't
        be run.
      
      3. Limitations
      
      3.1 This patch supports only the i386 architecture, but patches for
          x86_64 and ppc64 are anticipated soon.
      
      3.2 Return probes operates by replacing the return address in the stack
          (or in a known register, such as the lr register for ppc).  This may
          cause __builtin_return_address(0), when invoked from the return-probed
          function, to return the address of the return-probes trampoline.
      
      3.3 This implementation uses the "Multiprobes at an address" feature in
          2.6.12-rc3-mm3.
      
      3.4 Due to a limitation in multi-probes, you cannot currently establish
          a return probe and a jprobe on the same function.  A patch to remove
          this limitation is being tested.
      
      This feature is required by SystemTap (http://sourceware.org/systemtap),
      and reflects ideas contributed by several SystemTap developers, including
      Will Cohen and Ananth Mavinakayanahalli.
      Signed-off-by: NHien Nguyen <hien@us.ibm.com>
      Signed-off-by: NPrasanna S Panchamukhi <prasanna@in.ibm.com>
      Signed-off-by: NFrederik Deweerdt <frederik.deweerdt@laposte.net>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      b94cce92
    • A
      [PATCH] avoid resursive oopses · df164db5
      Alexander Nyberg 提交于
      Prevent recursive faults in do_exit() by leaving the task alone and wait
      for reboot.  This may allow a more graceful shutdown and possibly save the
      original oops.
      Signed-off-by: NAlexander Nyberg <alexn@telia.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      df164db5
    • C
      [PATCH] remove duplicate get_dentry functions in various places · 5f45f1a7
      Christoph Hellwig 提交于
      Various filesystem drivers have grown a get_dentry() function that's a
      duplicate of lookup_one_len, except that it doesn't take a maximum length
      argument and doesn't check for \0 or / in the passed in filename.
      
      Switch all these places to use lookup_one_len.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Cc: Greg KH <greg@kroah.com>
      Cc: Paul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      5f45f1a7
    • J
      [PATCH] preempt_count is int - remove cast and don't assign to unsigned type · be5b4fbd
      Jesper Juhl 提交于
      In kernel/sched.c the return value from preempt_count() is cast to an int.
      That made sense when preempt_count was defined as different types on is not
      needed and should go away.  The patch removes the cast.
      
      In kernel/timer.c the return value from preempt_count() is assigned to a
      variable of type u32 and then that unsigned value is later compared to
      preempt_count().  Since preempt_count() returns an int, an int is what
      should be used to store its return value.  Storing the result in an
      unsigned 32bit integer made a tiny bit of sense back when preempt_count was
      different types on different archs, but no more - let's not play signed vs
      unsigned comparison games when we don't have to.  The patch modifies the
      code to use an int to hold the value.  While I was around that bit of code
      I also made two changes to a nearby (related) printk() - I modified it to
      specify the loglevel explicitly and also broke the line into a few pieces
      to avoid it being longer than 80 chars and clarified the text a bit.
      Signed-off-by: NJesper Juhl <juhl-lkml@dif.dk>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      be5b4fbd
    • G
      [PATCH] CON_CONSDEV bit not set correctly on last console · ab4af03a
      Greg Edwards 提交于
      According to include/linux/console.h, CON_CONSDEV flag should be set on
      the last console specified on the boot command line:
      
           86 #define CON_PRINTBUFFER (1)
           87 #define CON_CONSDEV     (2) /* Last on the command line */
           88 #define CON_ENABLED     (4)
           89 #define CON_BOOT        (8)
      
      This does not currently happen if there is more than one console specified
      on the boot commandline.  Instead, it gets set on the first console on the
      command line.  This can cause problems for things like kdb that look for
      the CON_CONSDEV flag to see if the console is valid.
      
      Additionaly, it doesn't look like CON_CONSDEV is reassigned to the next
      preferred console at unregister time if the console being unregistered
      currently has that bit set.
      
      Example (from sn2 ia64):
      
      elilo vmlinuz root=<dev> console=ttyS0 console=ttySG0
      
      in this case, the flags on ttySG console struct will be 0x4 (should be
      0x6).
      
      Attached patch against bk fixes both issues for the cases I looked at.  It
      uses selected_console (which gets incremented for each console specified on
      the command line) as the indicator of which console to set CON_CONSDEV on.
      When adding the console to the list, if the previous one had CON_CONSDEV
      set, it masks it out.  Tested on ia64 and x86.
      
      The problem with the current behavior is it breaks overriding the default from
      the boot line.  In the ia64 case, there may be a global append line defining
      console=a in elilo.conf.  Then you want to boot your kernel, and want to
      override the default by passing console=b on the boot line.  elilo constructs
      the kernel cmdline by starting with the value of the global append line, then
      tacks on whatever else you specify, which puts console=b last.
      Signed-off-by: NGreg Edwards <edwardsg@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      ab4af03a
    • O
      [PATCH] posix-timers: use try_to_del_timer_sync() · f972be33
      Oleg Nesterov 提交于
      sys_timer_settime/sys_timer_delete needs to delete k_itimer->real.timer
      synchronously while holding ->it_lock, which is also locked in
      posix_timer_fn.
      
      This patch removes timer_active/set_timer_inactive which plays with
      timer_list's internals in favour of using try_to_del_timer_sync(), which
      was introduced in the previous patch.
      Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      f972be33
    • O
      [PATCH] timers: introduce try_to_del_timer_sync() · fd450b73
      Oleg Nesterov 提交于
      This patch splits del_timer_sync() into 2 functions.  The new one,
      try_to_del_timer_sync(), returns -1 when it hits executing timer.
      
      It can be used in interrupt context, or when the caller hold locks which
      can prevent completion of the timer's handler.
      
      NOTE.  Currently it can't be used in interrupt context in UP case, because
      ->running_timer is used only with CONFIG_SMP.
      
      Should the need arise, it is possible to kill #ifdef CONFIG_SMP in
      set_running_timer(), it is cheap.
      Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      fd450b73
    • O
      [PATCH] timers fixes/improvements · 55c888d6
      Oleg Nesterov 提交于
      This patch tries to solve following problems:
      
      1. del_timer_sync() is racy. The timer can be fired again after
         del_timer_sync have checked all cpus and before it will recheck
         timer_pending().
      
      2. It has scalability problems. All cpus are scanned to determine
         if the timer is running on that cpu.
      
         With this patch del_timer_sync is O(1) and no slower than plain
         del_timer(pending_timer), unless it has to actually wait for
         completion of the currently running timer.
      
         The only restriction is that the recurring timer should not use
         add_timer_on().
      
      3. The timers are not serialized wrt to itself.
      
         If CPU_0 does mod_timer(jiffies+1) while the timer is currently
         running on CPU 1, it is quite possible that local interrupt on
         CPU_0 will start that timer before it finished on CPU_1.
      
      4. The timers locking is suboptimal. __mod_timer() takes 3 locks
         at once and still requires wmb() in del_timer/run_timers.
      
         The new implementation takes 2 locks sequentially and does not
         need memory barriers.
      
      Currently ->base != NULL means that the timer is pending. In that case
      ->base.lock is used to lock the timer. __mod_timer also takes timer->lock
      because ->base can be == NULL.
      
      This patch uses timer->entry.next != NULL as indication that the timer is
      pending. So it does __list_del(), entry->next = NULL instead of list_del()
      when the timer is deleted.
      
      The ->base field is used for hashed locking only, it is initialized
      in init_timer() which sets ->base = per_cpu(tvec_bases). When the
      tvec_bases.lock is locked, it means that all timers which are tied
      to this base via timer->base are locked, and the base itself is locked
      too.
      
      So __run_timers/migrate_timers can safely modify all timers which could
      be found on ->tvX lists (pending timers).
      
      When the timer's base is locked, and the timer removed from ->entry list
      (which means that _run_timers/migrate_timers can't see this timer), it is
      possible to set timer->base = NULL and drop the lock: the timer remains
      locked.
      
      This patch adds lock_timer_base() helper, which waits for ->base != NULL,
      locks the ->base, and checks it is still the same.
      
      __mod_timer() schedules the timer on the local CPU and changes it's base.
      However, it does not lock both old and new bases at once. It locks the
      timer via lock_timer_base(), deletes the timer, sets ->base = NULL, and
      unlocks old base. Then __mod_timer() locks new_base, sets ->base = new_base,
      and adds this timer. This simplifies the code, because AB-BA deadlock is not
      possible. __mod_timer() also ensures that the timer's base is not changed
      while the timer's handler is running on the old base.
      
      __run_timers(), del_timer() do not change ->base anymore, they only clear
      pending flag.
      
      So del_timer_sync() can test timer->base->running_timer == timer to detect
      whether it is running or not.
      
      We don't need timer_list->lock anymore, this patch kills it.
      
      We also don't need barriers. del_timer() and __run_timers() used smp_wmb()
      before clearing timer's pending flag. It was needed because __mod_timer()
      did not lock old_base if the timer is not pending, so __mod_timer()->list_add()
      could race with del_timer()->list_del(). With this patch these functions are
      serialized through base->lock.
      
      One problem. TIMER_INITIALIZER can't use per_cpu(tvec_bases). So this patch
      adds global
      
              struct timer_base_s {
                      spinlock_t lock;
                      struct timer_list *running_timer;
              } __init_timer_base;
      
      which is used by TIMER_INITIALIZER. The corresponding fields in tvec_t_base_s
      struct are replaced by struct timer_base_s t_base.
      
      It is indeed ugly. But this can't have scalability problems. The global
      __init_timer_base.lock is used only when __mod_timer() is called for the first
      time AND the timer was compile time initialized. After that the timer migrates
      to the local CPU.
      Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NRenaud Lienhart <renaud.lienhart@free.fr>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      55c888d6
    • C
      [PATCH] i386: Selectable Frequency of the Timer Interrupt · 59121003
      Christoph Lameter 提交于
      Make the timer frequency selectable. The timer interrupt may cause bus
      and memory contention in large NUMA systems since the interrupt occurs
      on each processor HZ times per second.
      Signed-off-by: NChristoph Lameter <christoph@lameter.com>
      Signed-off-by: NShai Fultheim <shai@scalex86.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      59121003
  2. 22 6月, 2005 6 次提交
    • P
      [PATCH] uml: make hw_controller_type->release exist only for archs needing it · b77d6adc
      Paolo 'Blaisorblade' Giarrusso 提交于
      With Chris Wedgwood <cw@f00f.org>
      
      As suggested by Chris, we can make the "just added" method ->release
      conditional to UML only (better: to archs requesting it, i.e.  only UML
      currently), so that other archs don't get this unneeded crud, and if UML
      won't need it any more we can kill this.
      Signed-off-by: NPaolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      CC: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      b77d6adc
    • P
      [PATCH] uml: add and use generic hw_controller_type->release · dbce706e
      Paolo 'Blaisorblade' Giarrusso 提交于
      With Chris Wedgwood <cw@f00f.org>
      
      Currently UML must explicitly call the UML-specific
      free_irq_by_irq_and_dev() for each free_irq call it's done.
      
      This is needed because ->shutdown and/or ->disable are only called when the
      last "action" for that irq is removed.
      
      Instead, for UML shared IRQs (UML IRQs are very often, if not always,
      shared), for each dev_id some setup is done, which must be cleared on the
      release of that fd.  For instance, for each open console a new instance
      (i.e.  new dev_id) of the same IRQ is requested().
      
      Exactly, a fd is stored in an array (pollfds), which is after read by a
      host thread and passed to poll().  Each event registered by poll() triggers
      an interrupt.  So, for each free_irq() we must remove the corresponding
      host fd from the table, which we do via this -release() method.
      
      In this patch we add an appropriate hook for this, and remove all uses of
      it by pointing the hook to the said procedure; this is safe to do since the
      said procedure.
      
      Also some cosmetic improvements are included.
      
      This is heavily based on some work by Chris Wedgwood, which however didn't
      get the patch merged for something I'd call a "misunderstanding" (the need
      for this patch wasn't cleanly explained, thus adding the generic hook was
      felt as undesirable).
      Signed-off-by: NPaolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      CC: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      dbce706e
    • H
      [PATCH] dup_mmap: update comment on new vma · 45918e1a
      Hugh Dickins 提交于
      Remove part of comment on linking new vma in dup_mmap: since anon_vma rmap
      came in, try_to_unmap_one knows the vma without needing find_vma.  But add
      a comment to note that here vma is inserted without mmap_sem.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      45918e1a
    • W
      [PATCH] Avoiding mmap fragmentation · 1363c3cd
      Wolfgang Wander 提交于
      Ingo recently introduced a great speedup for allocating new mmaps using the
      free_area_cache pointer which boosts the specweb SSL benchmark by 4-5% and
      causes huge performance increases in thread creation.
      
      The downside of this patch is that it does lead to fragmentation in the
      mmap-ed areas (visible via /proc/self/maps), such that some applications
      that work fine under 2.4 kernels quickly run out of memory on any 2.6
      kernel.
      
      The problem is twofold:
      
        1) the free_area_cache is used to continue a search for memory where
           the last search ended.  Before the change new areas were always
           searched from the base address on.
      
           So now new small areas are cluttering holes of all sizes
           throughout the whole mmap-able region whereas before small holes
           tended to close holes near the base leaving holes far from the base
           large and available for larger requests.
      
        2) the free_area_cache also is set to the location of the last
           munmap-ed area so in scenarios where we allocate e.g.  five regions of
           1K each, then free regions 4 2 3 in this order the next request for 1K
           will be placed in the position of the old region 3, whereas before we
           appended it to the still active region 1, placing it at the location
           of the old region 2.  Before we had 1 free region of 2K, now we only
           get two free regions of 1K -> fragmentation.
      
      The patch addresses thes issues by introducing yet another cache descriptor
      cached_hole_size that contains the largest known hole size below the
      current free_area_cache.  If a new request comes in the size is compared
      against the cached_hole_size and if the request can be filled with a hole
      below free_area_cache the search is started from the base instead.
      
      The results look promising: Whereas 2.6.12-rc4 fragments quickly and my
      (earlier posted) leakme.c test program terminates after 50000+ iterations
      with 96 distinct and fragmented maps in /proc/self/maps it performs nicely
      (as expected) with thread creation, Ingo's test_str02 with 20000 threads
      requires 0.7s system time.
      
      Taking out Ingo's patch (un-patch available per request) by basically
      deleting all mentions of free_area_cache from the kernel and starting the
      search for new memory always at the respective bases we observe: leakme
      terminates successfully with 11 distinctive hardly fragmented areas in
      /proc/self/maps but thread creating is gringdingly slow: 30+s(!) system
      time for Ingo's test_str02 with 20000 threads.
      
      Now - drumroll ;-) the appended patch works fine with leakme: it ends with
      only 7 distinct areas in /proc/self/maps and also thread creation seems
      sufficiently fast with 0.71s for 20000 threads.
      Signed-off-by: NWolfgang Wander <wwc@rentec.com>
      Credit-to: "Richard Purdie" <rpurdie@rpsys.net>
      Signed-off-by: NKen Chen <kenneth.w.chen@intel.com>
      Acked-by: Ingo Molnar <mingo@elte.hu> (partly)
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      1363c3cd
    • M
      [PATCH] VM: early zone reclaim · 753ee728
      Martin Hicks 提交于
      This is the core of the (much simplified) early reclaim.  The goal of this
      patch is to reclaim some easily-freed pages from a zone before falling back
      onto another zone.
      
      One of the major uses of this is NUMA machines.  With the default allocator
      behavior the allocator would look for memory in another zone, which might be
      off-node, before trying to reclaim from the current zone.
      
      This adds a zone tuneable to enable early zone reclaim.  It is selected on a
      per-zone basis and is turned on/off via syscall.
      
      Adding some extra throttling on the reclaim was also required (patch
      4/4).  Without the machine would grind to a crawl when doing a "make -j"
      kernel build.  Even with this patch the System Time is higher on
      average, but it seems tolerable.  Here are some numbers for kernbench
      runs on a 2-node, 4cpu, 8Gig RAM Altix in the "make -j" run:
      
      			wall  user   sys   %cpu  ctx sw.  sleeps
      			----  ----   ---   ----   ------  ------
      No patch		1009  1384   847   258   298170   504402
      w/patch, no reclaim     880   1376   667   288   254064   396745
      w/patch & reclaim       1079  1385   926   252   291625   548873
      
      These numbers are the average of 2 runs of 3 "make -j" runs done right
      after system boot.  Run-to-run variability for "make -j" is huge, so
      these numbers aren't terribly useful except to seee that with reclaim
      the benchmark still finishes in a reasonable amount of time.
      
      I also looked at the NUMA hit/miss stats for the "make -j" runs and the
      reclaim doesn't make any difference when the machine is thrashing away.
      
      Doing a "make -j8" on a single node that is filled with page cache pages
      takes 700 seconds with reclaim turned on and 735 seconds without reclaim
      (due to remote memory accesses).
      
      The simple zone_reclaim syscall program is at
      http://www.bork.org/~mort/sgi/zone_reclaim.cSigned-off-by: NMartin Hicks <mort@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      753ee728
    • I
      [PATCH] smp_processor_id() cleanup · 39c715b7
      Ingo Molnar 提交于
      This patch implements a number of smp_processor_id() cleanup ideas that
      Arjan van de Ven and I came up with.
      
      The previous __smp_processor_id/_smp_processor_id/smp_processor_id API
      spaghetti was hard to follow both on the implementational and on the
      usage side.
      
      Some of the complexity arose from picking wrong names, some of the
      complexity comes from the fact that not all architectures defined
      __smp_processor_id.
      
      In the new code, there are two externally visible symbols:
      
       - smp_processor_id(): debug variant.
      
       - raw_smp_processor_id(): nondebug variant. Replaces all existing
         uses of _smp_processor_id() and __smp_processor_id(). Defined
         by every SMP architecture in include/asm-*/smp.h.
      
      There is one new internal symbol, dependent on DEBUG_PREEMPT:
      
       - debug_smp_processor_id(): internal debug variant, mapped to
                                   smp_processor_id().
      
      Also, i moved debug_smp_processor_id() from lib/kernel_lock.c into a new
      lib/smp_processor_id.c file.  All related comments got updated and/or
      clarified.
      
      I have build/boot tested the following 8 .config combinations on x86:
      
       {SMP,UP} x {PREEMPT,!PREEMPT} x {DEBUG_PREEMPT,!DEBUG_PREEMPT}
      
      I have also build/boot tested x64 on UP/PREEMPT/DEBUG_PREEMPT.  (Other
      architectures are untested, but should work just fine.)
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NArjan van de Ven <arjan@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      39c715b7
  3. 21 6月, 2005 1 次提交
  4. 18 6月, 2005 1 次提交
  5. 14 6月, 2005 1 次提交
  6. 01 6月, 2005 1 次提交
    • R
      [PATCH] flush icache in correct context · ae92ef8a
      Roman Zippel 提交于
      flush_icache_range() is used in two different situation - in binfmt_elf.c &
      co for user space mappings and module.c for kernel modules.  On m68k
      flush_icache_range() doesn't know which data to flush, as it has separate
      address spaces and the pointer argument can be valid in either address
      space.
      
      First I considered splitting flush_icache_range(), but this patch is
      simpler.  Setting the correct context gives flush_icache_range() enough
      information to flush the correct data.
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      ae92ef8a
  7. 29 5月, 2005 1 次提交
    • J
      [PATCH] drop note_interrupt() for per-CPU for proper scaling · b60c1f6f
      John Hawkes 提交于
      The "unhandled interrupts" catcher, note_interrupt(), increments a global
      desc->irq_count and grossly damages scaling of very large systems, e.g.,
      >192p ia64 Altix, because of this highly contented cacheline, especially
      for timer interrupts.  384p is severely crippled, and 512p is unuseable.
      
      All calls to note_interrupt() can be disabled by booting with "noirqdebug",
      but this disables the useful interrupt checking for all interrupts.
      
      I propose eliminating note_interrupt() for all per-CPU interrupts.  This
      was the behavior of linux-2.6.10 and earlier, but in 2.6.11 a code
      restructuring added a call to note_interrupt() for per-CPU interrupts.
      Besides, note_interrupt() is a bit racy for concurrent CPU calls anyway, as
      the desc->irq_count++ increment isn't atomic (which, if done, would make
      scaling even worse).
      Signed-off-by: NJohn Hawkes <hawkes@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      b60c1f6f
  8. 27 5月, 2005 2 次提交
    • P
      [PATCH] cpuset exit NULL dereference fix · 2efe86b8
      Paul Jackson 提交于
      There is a race in the kernel cpuset code, between the code
      to handle notify_on_release, and the code to remove a cpuset.
      The notify_on_release code can end up trying to access a
      cpuset that has been removed.  In the most common case, this
      causes a NULL pointer dereference from the routine cpuset_path.
      However all manner of bad things are possible, in theory at least.
      
      The existing code decrements the cpuset use count, and if the
      count goes to zero, processes the notify_on_release request,
      if appropriate.  However, once the count goes to zero, unless we
      are holding the global cpuset_sem semaphore, there is nothing to
      stop another task from immediately removing the cpuset entirely,
      and recycling its memory.
      
      The obvious fix would be to always hold the cpuset_sem
      semaphore while decrementing the use count and dealing with
      notify_on_release.  However we don't want to force a global
      semaphore into the mainline task exit path, as that might create
      a scaling problem.
      
      The actual fix is almost as easy - since this is only an issue
      for cpusets using notify_on_release, which the top level big
      cpusets don't normally need to use, only take the cpuset_sem
      for cpusets using notify_on_release.
      
      This code has been run for hours without a hiccup, while running
      a cpuset create/destroy stress test that could crash the existing
      kernel in seconds.  This patch applies to the current -linus
      git kernel.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Acked-by: NSimon Derr <simon.derr@bull.net>
      Acked-by: NDinakar Guniguntala <dino@in.ibm.com>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      2efe86b8
    • D
  9. 26 5月, 2005 1 次提交
    • D
      AUDIT: Defer freeing aux items until audit_free_context() · 7551ced3
      David Woodhouse 提交于
      While they were all just simple blobs it made sense to just free them
      as we walked through and logged them. Now that there are pointers to
      other objects which need refcounting, we might as well revert to
      _only_ logging them in audit_log_exit(), and put the code to free them
      properly in only one place -- in audit_free_aux().
      Signed-off-by: NDavid Woodhouse <dwmw2@infradead.org>
      ----------------------------------------------------------
      7551ced3
  10. 25 5月, 2005 1 次提交
  11. 24 5月, 2005 2 次提交
  12. 22 5月, 2005 2 次提交
    • D
      AUDIT: Assign serial number to non-syscall messages · bfb4496e
      David Woodhouse 提交于
      Move audit_serial() into audit.c and use it to generate serial numbers 
      on messages even when there is no audit context from syscall auditing.  
      This allows us to disambiguate audit records when more than one is 
      generated in the same millisecond.
      
      Based on a patch by Steve Grubb after he observed the problem.
      Signed-off-by: NDavid Woodhouse <dwmw2@infradead.org>
      bfb4496e
    • S
      [PATCH] spin_unlock_bh() and preempt_check_resched() · 10f02d1c
      Samuel Thibault 提交于
      In _spin_unlock_bh(lock):
      	do { \
      		_raw_spin_unlock(lock); \
      		preempt_enable(); \
      		local_bh_enable(); \
      		__release(lock); \
      	} while (0)
      
      there is no reason for using preempt_enable() instead of a simple
      preempt_enable_no_resched()
      
      Since we know bottom halves are disabled, preempt_schedule() will always
      return at once (preempt_count!=0), and hence preempt_check_resched() is
      useless here...
      
      This fixes it by using "preempt_enable_no_resched()" instead of the
      "preempt_enable()", and thus avoids the useless preempt_check_resched()
      just before re-enabling bottom halves.
      Signed-off-by: NSamuel Thibault <samuel.thibault@ens-lyon.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      10f02d1c