1. 24 Jun, 2005 · 29 commits
    • [PATCH] remove <linux/xattr_acl.h> · 9a59f452
      Committed by Christoph Hellwig
      This file duplicates <linux/posix_acl_xattr.h>, using slightly different
      names.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] acl endianess annotations · f9fd27a2
      Committed by Christoph Hellwig
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Remove f_error field from struct file · 45778ca8
      Committed by Christoph Lameter
      The following patch removes the f_error field and all checks of f_error.
      
      Trond said:
      
        f_error was introduced for NFS, and made sense when we were guaranteed
        always to have a file pointer around when write errors occurred.  Since
        then, we have (for various reasons) had to introduce the nfs_open_context in
        order to track the file read/write state, and it made sense to move our
        f_error tracking there too.
      Signed-off-by: Christoph Lameter <christoph@lameter.com>
      Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] block: add unlocked_ioctl support for block devices · bb93e3a5
      Committed by Arnd Bergmann
      This patch allows block device drivers to convert their ioctl functions to
      unlocked_ioctl() like character devices and other subsystems.  All
      functions that were called with the BKL held before are still used that
      way, but I would not be surprised if it could be removed from the ioctl
      functions in drivers/block/ioctl.c themselves.
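
      The dispatch described above can be modelled in userspace roughly as
      follows.  This is only a sketch under stated assumptions: the
      sketch_block_ops table, sketch_driver_ioctl() and the sample handlers are
      hypothetical names, and a plain integer stands in for the BKL.

```c
#include <errno.h>
#include <stddef.h>

/* Stand-in for the Big Kernel Lock: nonzero while "held". */
int sketch_bkl_held;

/* Hypothetical, simplified operations table (not the kernel's structure). */
struct sketch_block_ops {
	long (*unlocked_ioctl)(unsigned int cmd, unsigned long arg);
	int (*ioctl)(unsigned int cmd, unsigned long arg);
};

/* Prefer the lock-free entry point; fall back to the old one under the lock. */
long sketch_driver_ioctl(const struct sketch_block_ops *ops,
			 unsigned int cmd, unsigned long arg)
{
	long ret;

	if (ops->unlocked_ioctl)
		return ops->unlocked_ioctl(cmd, arg);

	if (ops->ioctl) {
		sketch_bkl_held = 1;	/* where a kernel would take the BKL */
		ret = ops->ioctl(cmd, arg);
		sketch_bkl_held = 0;	/* and release it again */
		return ret;
	}

	return -ENOTTY;
}

/* Sample handlers that report whether the lock stand-in was held. */
long sample_unlocked_ioctl(unsigned int cmd, unsigned long arg)
{
	(void)cmd;
	(void)arg;
	return sketch_bkl_held;
}

int sample_legacy_ioctl(unsigned int cmd, unsigned long arg)
{
	(void)cmd;
	(void)arg;
	return sketch_bkl_held;
}
```

      A driver that only fills in ioctl keeps running under the lock, while one
      that fills in unlocked_ioctl is called without it.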
      
      As a side note, I found that compat_blkdev_ioctl() acquires the BKL as
      well, which looks like a bug.  I have checked that every user of
      disk->fops->compat_ioctl() in the current git tree gets the BKL itself, so
      it could easily be removed from compat_blkdev_ioctl().
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Improve CD/DVD packet driver write performance · 46c271be
      Committed by Peter Osterlund
      This patch improves write performance for the CD/DVD packet writing driver.
      The logic for switching between reading and writing has been changed so
      that streaming writes are no longer interrupted by read requests.
      Signed-off-by: Peter Osterlund <petero2@telia.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Don't force O_LARGEFILE for 32 bit processes on ia64 · ef3daeda
      Committed by Yoav Zach
      In the ia64 kernel, the O_LARGEFILE flag is forced when opening a file.  This
      is problematic for 32 bit processes, which are not largefile aware, whether
      they run under SW emulation or HW execution.
      
      For such processes, the problem is two-fold:
      
      1) Opening a file that is larger than 4G should fail,
         but it does not
      2) Writing to an offset larger than 4G should fail, but
         it does not
      
      The proposed patch takes advantage of the way 32 bit processes are
      identified in ia64 systems.  Such processes have PER_LINUX32 for their
      personality.  With the patch, the ia64 kernel will not enforce the
      O_LARGEFILE flag if the current process has PER_LINUX32 set.  The behavior
      for all other architectures remains unchanged.
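
      The rule in the paragraph above can be sketched as a pure function.  The
      flag and personality values below are illustrative placeholders rather
      than the kernel's real constants, and sketch_effective_open_flags() is a
      hypothetical name.

```c
/* Illustrative bit values only; real kernels define their own. */
#define SKETCH_O_LARGEFILE 0x1000
#define SKETCH_PER_LINUX32 0x0008u

/*
 * Return the flags an open would proceed with.  arch_forces_largefile
 * models an architecture (such as ia64) that normally forces the flag.
 */
int sketch_effective_open_flags(int flags, unsigned int personality,
				int arch_forces_largefile)
{
	if (arch_forces_largefile && personality != SKETCH_PER_LINUX32)
		flags |= SKETCH_O_LARGEFILE;
	return flags;
}
```

      A flag that the caller passed explicitly is left alone; only the implicit
      forcing is skipped for the 32 bit personality.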
      Signed-off-by: Yoav Zach <yoav.zach@intel.com>
      Acked-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] setuid core dump · d6e71144
      Committed by Alan Cox
      Add a new `suid_dumpable' sysctl:
      
      This value can be used to query and set the core dump mode for setuid
      or otherwise protected/tainted binaries. The modes are
      
      0 - (default) - traditional behaviour.  Any process which has changed
          privilege levels or is execute only will not be dumped
      
      1 - (debug) - all processes dump core when possible.  The core dump is
          owned by the current user and no security is applied.  This is intended
          for system debugging situations only.  Ptrace is unchecked.
      
      2 - (suidsafe) - any binary which normally would not be dumped is dumped
          readable by root only.  This allows the end user to remove such a dump but
          not access it directly.  For security reasons core dumps in this mode will
          not overwrite one another or other files.  This mode is appropriate when
          administrators are attempting to debug problems in a normal environment.
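
      One way to read the three modes is as a small decision table.  The sketch
      below is an interpretation of the text, not the kernel's code: the enum
      names, struct dump_decision and decide_core_dump() are hypothetical, and
      it assumes that an ordinary (unprotected) process dumps in every mode.

```c
#include <stdbool.h>

/* Illustrative names; the text above identifies the modes by number. */
enum sketch_dump_mode {
	SKETCH_DUMP_DEFAULT  = 0,	/* protected processes are not dumped */
	SKETCH_DUMP_DEBUG    = 1,	/* everything dumps, no security applied */
	SKETCH_DUMP_SUIDSAFE = 2	/* protected processes dump, root only */
};

struct dump_decision {
	bool write_core;	/* produce a core file at all? */
	bool root_only;		/* make the file readable by root only? */
};

/* is_protected: the process changed privilege levels or is execute only. */
struct dump_decision decide_core_dump(int mode, bool is_protected)
{
	struct dump_decision d = { true, false };

	if (!is_protected)
		return d;	/* assumption: ordinary processes always dump */

	switch (mode) {
	case SKETCH_DUMP_DEBUG:
		break;		/* dump as the current user */
	case SKETCH_DUMP_SUIDSAFE:
		d.root_only = true;
		break;
	default:
		d.write_core = false;
		break;
	}
	return d;
}
```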
      
      (akpm:
      
      > > +EXPORT_SYMBOL(suid_dumpable);
      >
      > EXPORT_SYMBOL_GPL?
      
      No problem to me.
      
      > >  	if (current->euid == current->uid && current->egid == current->gid)
      > >  		current->mm->dumpable = 1;
      >
      > Should this be SUID_DUMP_USER?
      
      Actually the feedback I had from last time was that the SUID_ defines
      should go because it's clearer to follow the numbers. They can go
      everywhere (and there are lots of places where dumpable is tested/used
      as a bool in untouched code)
      
      > Maybe this should be renamed to `dump_policy' or something.  Doing that
      > would help us catch any code which isn't using the #defines, too.
      
      Fair comment. The patch was designed to be easy to maintain for Red Hat
      rather than for merging. Changing that field would create a gigantic
      diff because it is used all over the place.
      
      )
      Signed-off-by: Alan Cox <alan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] kprobes: Temporary disarming of reentrant probe · ea32c65c
      Committed by Prasanna S Panchamukhi
      When a kprobes handler calls a routine which itself has a probe on it,
      kprobes_handler() disarms the new probe forever.  This patch removes that
      limitation by disarming the new probe only temporarily.  When another
      probe hits while the old probe is being handled, kprobes_handler() saves the
      previous kprobes state and handles the new probe without calling the new
      probe's registered handlers.  kprobe_post_handler() restores the previous
      kprobes state and normal execution continues.

      However, on the x86_64 architecture, reentrancy is provided only through
      pre_handler().  If a routine having a probe is referenced through
      post_handler(), then the probes on that routine are disarmed forever, since
      the exception stack gets changed after the processor single steps the
      instruction of the new probe.
      
      This patch includes generic changes to support temporary disarming on
      reentrancy of probes.
      Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] kprobes: moves lock-unlock to non-arch kprobe_flush_task · 0aa55e4d
      Committed by Hien Nguyen
      This patch moves the lock/unlock of the arch specific kprobe_flush_task()
      to the non-arch specific kprobe_flush_task().
      Signed-off-by: Hien Nguyen <hien@us.ibm.com>
      Acked-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Move kprobe [dis]arming into arch specific code · 7e1048b1
      Committed by Rusty Lynch
      The architecture independent code of the current kprobes implementation
      arms and disarms kprobes at registration time.  The problem is that the
      code assumes that arming and disarming is done by a simple write
      of some magic value to an address.  This is problematic for ia64, where our
      instructions look more like structures, and we cannot insert breakpoints
      by just doing something like:
      
      *p->addr = BREAKPOINT_INSTRUCTION;
      
      The following patch to 2.6.12-rc4-mm2 adds two new architecture dependent
      functions:
      
           * void arch_arm_kprobe(struct kprobe *p)
           * void arch_disarm_kprobe(struct kprobe *p)
      
      and then adds the new functions for each of the architectures that already
      implement kprobes (sparc64/ppc64/i386/x86_64).
      
      I thought arch_[dis]arm_kprobe was the most descriptive of what was really
      happening, but each of the architectures already had a disarm_kprobe()
      function that was really a "disarm and do some other clean-up items as
      needed when you stumble across a recursive kprobe." So...  I took the
      liberty of changing the code that was calling disarm_kprobe() to call
      arch_disarm_kprobe(), and then do the cleanup in the block of code dealing
      with the recursive kprobe case.
      
      So far this patch has been tested on i386, x86_64, and ppc64, but still
      needs to be tested on sparc64.
      Signed-off-by: Rusty Lynch <rusty.lynch@intel.com>
      Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] kprobes: function-return probes · b94cce92
      Committed by Hien Nguyen
      This patch adds function-return probes to kprobes for the i386
      architecture.  This enables you to establish a handler to be run when a
      function returns.
      
      1. API
      
      Two new functions are added to kprobes:
      
      	int register_kretprobe(struct kretprobe *rp);
      	void unregister_kretprobe(struct kretprobe *rp);
      
      2. Registration and unregistration
      
      2.1 Register
      
        To register a function-return probe, the user populates the following
        fields in a kretprobe object and calls register_kretprobe() with the
        kretprobe address as an argument:
      
        kp.addr - the function's address
      
        handler - this function is run after the ret instruction executes, but
        before control returns to the return address in the caller.
      
        maxactive - The maximum number of instances of the probed function that
        can be active concurrently.  For example, if the function is non-
        recursive and is called with a spinlock or mutex held, maxactive = 1
        should be enough.  If the function is non-recursive and can never
        relinquish the CPU (e.g., via a semaphore or preemption), NR_CPUS should
        be enough.  maxactive is used to determine how many kretprobe_instance
        objects to allocate for this particular probed function.  If maxactive <=
        0, it is set to a default value (if CONFIG_PREEMPT maxactive=max(10, 2 *
        NR_CPUS) else maxactive=NR_CPUS)
      
        For example:
      
          struct kretprobe rp;
          rp.kp.addr = /* entrypoint address */;
          rp.handler = /* return probe handler */;
          rp.maxactive = /* e.g., 1 or NR_CPUS or 0, see the above explanation */;
          register_kretprobe(&rp);
      
        The following field may also be of interest:
      
        nmissed - Initialized to zero when the function-return probe is
        registered, and incremented every time the probed function is entered but
        there is no kretprobe_instance object available for establishing the
        function-return probe (i.e., because maxactive was set too low).
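
        The default described for maxactive <= 0 can be written out as a small
        userspace function.  This is a sketch only: effective_maxactive() is a
        hypothetical helper, and nr_cpus and preempt are ordinary parameters
        standing in for NR_CPUS and CONFIG_PREEMPT.

```c
/*
 * Return the maxactive value that would take effect for a requested one.
 * A positive request is kept; otherwise the default from the text applies.
 */
int effective_maxactive(int requested, int nr_cpus, int preempt)
{
	if (requested > 0)
		return requested;
	if (preempt)
		return (2 * nr_cpus > 10) ? 2 * nr_cpus : 10;	/* max(10, 2 * NR_CPUS) */
	return nr_cpus;
}
```

        For instance, with 4 CPUs and preemption the default is 10, and with 8
        CPUs and preemption it is 16.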
      
      2.2 Unregister
      
        To unregister a function-return probe, the user calls
        unregister_kretprobe() with the same kretprobe object as registered
        previously.  If a probed function is running when the return probe is
        unregistered, the function will return as expected, but the handler won't
        be run.
      
      3. Limitations
      
      3.1 This patch supports only the i386 architecture, but patches for
          x86_64 and ppc64 are anticipated soon.
      
      3.2 Return probes operate by replacing the return address in the stack
          (or in a known register, such as the lr register for ppc).  This may
          cause __builtin_return_address(0), when invoked from the return-probed
          function, to return the address of the return-probes trampoline.
      
      3.3 This implementation uses the "Multiprobes at an address" feature in
          2.6.12-rc3-mm3.
      
      3.4 Due to a limitation in multi-probes, you cannot currently establish
          a return probe and a jprobe on the same function.  A patch to remove
          this limitation is being tested.
      
      This feature is required by SystemTap (http://sourceware.org/systemtap),
      and reflects ideas contributed by several SystemTap developers, including
      Will Cohen and Ananth Mavinakayanahalli.
      Signed-off-by: Hien Nguyen <hien@us.ibm.com>
      Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
      Signed-off-by: Frederik Deweerdt <frederik.deweerdt@laposte.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] quota: consolidate code surrounding vfs_quota_on_mount · 84de856e
      Committed by Christoph Hellwig
      Move some code duplicated in both callers into vfs_quota_on_mount().
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Jan Kara <jack@ucw.cz>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] add check to /proc/devices read routines · ac20427e
      Committed by Neil Horman
      Add a check to get_chrdev_list() and get_blkdev_list() to prevent reads
      of /proc/devices from spilling over the provided page when the registered
      character and block devices in a system generate more than 4096 bytes of
      string data.
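
      The kind of bound this describes can be illustrated in userspace.  The
      sketch below is not the kernel routine: format_device_list() and its
      parameters are hypothetical, and it simply stops before an entry would
      run past the end of the buffer.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Append one "major name" line per device to page, never writing past
 * page_size bytes.  An entry that does not fit completely is dropped.
 * Returns the number of bytes stored, not counting the terminating NUL.
 */
size_t format_device_list(char *page, size_t page_size, const int *majors,
			  const char *const *names, size_t count)
{
	size_t len = 0;
	size_t i;

	if (page_size == 0)
		return 0;
	page[0] = '\0';

	for (i = 0; i < count; i++) {
		int n = snprintf(page + len, page_size - len, "%3d %s\n",
				 majors[i], names[i]);

		if (n < 0 || (size_t)n >= page_size - len) {
			page[len] = '\0';	/* discard the partial entry */
			break;
		}
		len += (size_t)n;
	}
	return len;
}
```

      With a 16 byte buffer and 8 byte entries, only the first entry is kept,
      because the second would leave no room for the terminating NUL.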
      Signed-off-by: Neil Horman <nhorman@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: <viro@parcelfarce.linux.theplanet.co.uk>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] optimise loop driver a bit · 35a82d1a
      Committed by Nick Piggin
      Looks like locking can be optimised quite a lot.  Increase lock widths
      slightly so lo_lock is taken fewer times per request.  Also it was quite
      trivial to cover lo_pending with that lock, and remove the atomic
      requirement.  This also makes memory ordering explicitly correct, which is
      nice (not that I particularly saw any mem ordering bugs).
      
      Test was reading 4 250MB files in parallel on ext2-on-tmpfs filesystem (1K
      block size, 4K page size).  System is 2 socket Xeon with HT (4 thread).
      
      intel:/home/npiggin# umount /dev/loop0 ; mount /dev/loop0 /mnt/loop ; /usr/bin/time ./mtloop.sh
      
      Before:
      0.24user 5.51system 0:02.84elapsed 202%CPU (0avgtext+0avgdata 0maxresident)k
      0.19user 5.52system 0:02.88elapsed 198%CPU (0avgtext+0avgdata 0maxresident)k
      0.19user 5.57system 0:02.89elapsed 198%CPU (0avgtext+0avgdata 0maxresident)k
      0.22user 5.51system 0:02.90elapsed 197%CPU (0avgtext+0avgdata 0maxresident)k
      0.19user 5.44system 0:02.91elapsed 193%CPU (0avgtext+0avgdata 0maxresident)k
      
      After:
      0.07user 2.34system 0:01.68elapsed 143%CPU (0avgtext+0avgdata 0maxresident)k
      0.06user 2.37system 0:01.68elapsed 144%CPU (0avgtext+0avgdata 0maxresident)k
      0.06user 2.39system 0:01.68elapsed 145%CPU (0avgtext+0avgdata 0maxresident)k
      0.06user 2.36system 0:01.68elapsed 144%CPU (0avgtext+0avgdata 0maxresident)k
      0.06user 2.42system 0:01.68elapsed 147%CPU (0avgtext+0avgdata 0maxresident)k
      Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] create a kstrdup library function · 543537bd
      Committed by Paulo Marques
      This patch creates a new kstrdup library function and changes the "local"
      implementations in several places to use this function.
      
      Most of the changes come from the sound and net subsystems.  The sound part
      had already been acknowledged by Takashi Iwai and the net part by David S.
      Miller.
      
      I left UML alone for now because I would need more time to read the code
      carefully before making changes there.
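
      For readers unfamiliar with the pattern being consolidated, a userspace
      stand-in looks like this.  It is a sketch only: sketch_strdup() is a
      hypothetical name and uses malloc(), whereas the kernel function also
      takes an allocation-flags argument.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Duplicate s into freshly allocated memory.  Returns NULL when s is NULL
 * or when the allocation fails; the caller frees the result.
 */
char *sketch_strdup(const char *s)
{
	size_t len;
	char *copy;

	if (!s)
		return NULL;

	len = strlen(s) + 1;		/* include the terminating NUL */
	copy = malloc(len);
	if (copy)
		memcpy(copy, s, len);
	return copy;
}
```

      Each "local" implementation being replaced is essentially a copy of this
      length, allocate, copy sequence.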
      Signed-off-by: Paulo Marques <pmarques@grupopie.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] fix for prune_icache()/forced final iput() races · 991114c6
      Committed by Alexander Viro
      Based on analysis and a patch from Russ Weight <rweight@us.ibm.com>
      
      There is a race condition that can occur if an inode is allocated and then
      released (using iput) during the ->fill_super functions.  The race
      condition is between kswapd and mount.
      
      For most filesystems this can only happen in an error path when kswapd is
      running concurrently.  For isofs, however, the error can occur in a more
      common code path (which is how the bug was found).
      
      The logic here is "we want final iput() to free inode *now* instead of
      letting it sit in cache if fs is going down or had not quite come up".  The
      problem is with kswapd seeing such inodes in the middle of being killed and
      happily taking over.
      
      The clean solution would be to tell kswapd to leave those inodes alone and
      let our final iput deal with them.  I.e.  add a new flag
      (I_FORCED_FREEING), set it before write_inode_now() there and make
      prune_icache() leave those alone.
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] timers: introduce try_to_del_timer_sync() · fd450b73
      Committed by Oleg Nesterov
      This patch splits del_timer_sync() into 2 functions.  The new one,
      try_to_del_timer_sync(), returns -1 when it hits a timer that is currently
      executing.

      It can be used in interrupt context, or when the caller holds locks which
      can prevent completion of the timer's handler.

      NOTE.  Currently it can't be used in interrupt context in the UP case, because
      ->running_timer is used only with CONFIG_SMP.
      
      Should the need arise, it is possible to kill #ifdef CONFIG_SMP in
      set_running_timer(), it is cheap.
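
      The split suggests that the synchronous variant can be expressed as a
      retry loop around the new function.  The toy model below illustrates that
      contract in userspace; every name in it is hypothetical, the "running"
      handler is simulated with a counter, and the 0/1 results follow the usual
      del_timer() convention, which the text above does not spell out.

```c
/* Toy timer: handler_ticks_left > 0 models "the handler is running". */
struct sketch_timer {
	int pending;
	int handler_ticks_left;
};

/*
 * Returns -1 if the handler is currently running, otherwise 1 if a
 * pending timer was deactivated and 0 if it was not pending.
 */
int sketch_try_to_del_timer_sync(struct sketch_timer *t)
{
	int was_pending;

	if (t->handler_ticks_left > 0)
		return -1;

	was_pending = t->pending;
	t->pending = 0;
	return was_pending;
}

/* Simulates the handler making progress on another CPU. */
void sketch_handler_progress(struct sketch_timer *t)
{
	if (t->handler_ticks_left > 0)
		t->handler_ticks_left--;
}

/* Keep retrying until the handler is no longer running. */
int sketch_del_timer_sync(struct sketch_timer *t)
{
	int ret;

	while ((ret = sketch_try_to_del_timer_sync(t)) < 0)
		sketch_handler_progress(t);	/* a real caller would only wait */
	return ret;
}
```

      A caller that must not wait, for example because it holds a lock the
      handler needs, can call the try variant once and handle the -1 itself.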
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] timers fixes/improvements · 55c888d6
      Committed by Oleg Nesterov
      This patch tries to solve the following problems:

      1. del_timer_sync() is racy. The timer can be fired again after
         del_timer_sync() has checked all CPUs and before it rechecks
         timer_pending().
      
      2. It has scalability problems. All cpus are scanned to determine
         if the timer is running on that cpu.
      
         With this patch del_timer_sync is O(1) and no slower than plain
         del_timer(pending_timer), unless it has to actually wait for
         completion of the currently running timer.
      
         The only restriction is that the recurring timer should not use
         add_timer_on().
      
      3. The timers are not serialized with respect to themselves.

         If CPU_0 does mod_timer(jiffies+1) while the timer is currently
         running on CPU_1, it is quite possible that a local interrupt on
         CPU_0 will start that timer before it has finished on CPU_1.
      
      4. The timers locking is suboptimal. __mod_timer() takes 3 locks
         at once and still requires wmb() in del_timer/run_timers.
      
         The new implementation takes 2 locks sequentially and does not
         need memory barriers.
      
      Currently ->base != NULL means that the timer is pending. In that case
      ->base.lock is used to lock the timer. __mod_timer also takes timer->lock
      because ->base can be == NULL.
      
      This patch uses timer->entry.next != NULL as indication that the timer is
      pending. So it does __list_del(), entry->next = NULL instead of list_del()
      when the timer is deleted.
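
      The "next == NULL means not pending" convention can be shown with a
      minimal doubly linked list.  This is a userspace sketch with hypothetical
      names, not the kernel's list or timer code.

```c
#include <stddef.h>

struct list_node {
	struct list_node *next, *prev;
};

struct sketch_timer {
	struct list_node entry;
};

void list_init(struct list_node *head)
{
	head->next = head->prev = head;
}

/* Insert n right after head. */
void list_add_node(struct list_node *n, struct list_node *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* A timer is pending exactly when it is linked into a list. */
int sketch_timer_pending(const struct sketch_timer *t)
{
	return t->entry.next != NULL;
}

/* Unlink the timer, then clear next to mark it as no longer pending. */
void sketch_detach_timer(struct sketch_timer *t)
{
	t->entry.prev->next = t->entry.next;
	t->entry.next->prev = t->entry.prev;
	t->entry.next = NULL;
}
```

      Because pending state now lives in the list linkage, the ->base field is
      free to serve purely as the lock selector described next.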
      
      The ->base field is used for hashed locking only, it is initialized
      in init_timer() which sets ->base = per_cpu(tvec_bases). When the
      tvec_bases.lock is locked, it means that all timers which are tied
      to this base via timer->base are locked, and the base itself is locked
      too.
      
      So __run_timers/migrate_timers can safely modify all timers which could
      be found on ->tvX lists (pending timers).
      
      When the timer's base is locked, and the timer removed from ->entry list
      (which means that __run_timers/migrate_timers can't see this timer), it is
      possible to set timer->base = NULL and drop the lock: the timer remains
      locked.
      
      This patch adds lock_timer_base() helper, which waits for ->base != NULL,
      locks the ->base, and checks it is still the same.
      
      __mod_timer() schedules the timer on the local CPU and changes its base.
      However, it does not lock both old and new bases at once. It locks the
      timer via lock_timer_base(), deletes the timer, sets ->base = NULL, and
      unlocks old base. Then __mod_timer() locks new_base, sets ->base = new_base,
      and adds this timer. This simplifies the code, because AB-BA deadlock is not
      possible. __mod_timer() also ensures that the timer's base is not changed
      while the timer's handler is running on the old base.
      
      __run_timers(), del_timer() do not change ->base anymore, they only clear
      pending flag.
      
      So del_timer_sync() can test timer->base->running_timer == timer to detect
      whether it is running or not.
      
      We don't need timer_list->lock anymore, this patch kills it.
      
      We also don't need barriers. del_timer() and __run_timers() used smp_wmb()
      before clearing timer's pending flag. It was needed because __mod_timer()
      did not lock old_base if the timer is not pending, so __mod_timer()->list_add()
      could race with del_timer()->list_del(). With this patch these functions are
      serialized through base->lock.
      
      One problem. TIMER_INITIALIZER can't use per_cpu(tvec_bases). So this patch
      adds a global
      
              struct timer_base_s {
                      spinlock_t lock;
                      struct timer_list *running_timer;
              } __init_timer_base;
      
      which is used by TIMER_INITIALIZER. The corresponding fields in tvec_t_base_s
      struct are replaced by struct timer_base_s t_base.
      
      It is indeed ugly. But this can't have scalability problems. The global
      __init_timer_base.lock is used only when __mod_timer() is called for the first
      time AND the timer was compile time initialized. After that the timer migrates
      to the local CPU.
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Renaud Lienhart <renaud.lienhart@free.fr>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] blk: remove BLK_TAGS_{PER_LONG|MASK} · f7d37d02
      Committed by Tejun Heo
      Replace BLK_TAGS_PER_LONG with BITS_PER_LONG and remove unused BLK_TAGS_MASK.
      Signed-off-by: Tejun Heo <htejun@gmail.com>
      Acked-by: Jens Axboe <axboe@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] blk: remove blk_queue_tag->real_max_depth optimization · fa72b903
      Committed by Tejun Heo
      blk_queue_tag->real_max_depth was used to optimize out unnecessary
      allocations/frees on tag resize.  However, the whole thing was very broken -
      tag_map was never allocated to real_max_depth resulting in access beyond the
      end of the map, bits in [max_depth..real_max_depth] were set when initializing
      a map and copied when resizing resulting in pre-occupied tags.
      
      As the gain of the optimization is very small, well, almost nil, remove the
      whole thing.
      Signed-off-by: Tejun Heo <htejun@gmail.com>
      Acked-by: Jens Axboe <axboe@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] NUMA aware block device control structure allocation · 1946089a
      Committed by Christoph Lameter
      Patch to allocate the control structures for IDE devices on the node of
      the device itself (for NUMA systems).  The patch depends on the Slab API
      change patch by Manfred and me (in mm) and the pcidev_to_node patch that I
      posted today.
      
      Does some realignment too.
      Signed-off-by: Justin M. Forbes <jmforbes@linuxtx.org>
      Signed-off-by: Christoph Lameter <christoph@lameter.com>
      Signed-off-by: Pravin Shelar <pravin@calsoftinc.com>
      Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] sparsemem hotplug base · 29751f69
      Committed by Andy Whitcroft
      Make sparse's initialization be accessible at runtime.  This allows sparse
      mappings to be created after boot in a hotplug situation.
      
      This patch is separated from the previous one just to give an indication how
      much of the sparse infrastructure is *just* for hotplug memory.
      
      The section_mem_map doesn't really store a pointer.  It stores something that
      is convenient to do some math against to get a pointer.  It isn't valid to
      just do *section_mem_map, so I don't think it should be stored as a pointer.
      
      There are a couple of things I'd like to store about a section.  First of all,
      the fact that it is !NULL does not mean that it is present.  There could be
      such a combination where section_mem_map *is* NULL, but the math gets you
      properly to a real mem_map.  So, I don't think that check is safe.
      
      Since we're storing 32-bit-aligned structures, we have a few bits in the
      bottom of the pointer to play with.  Use one bit to encode whether there's
      really a mem_map there, and the other one to tell whether there's a valid
      section there.  We need to distinguish between the two because sometimes
      there's a gap between when a section is discovered to be present and when we
      can get the mem_map for it.
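
      The low-bit encoding can be sketched with plain integers.  The flag
      names, the choice of which bit means what, and the helper functions are
      all assumptions made for illustration; only the idea of using the two
      spare low bits comes from the text.

```c
#include <stdint.h>

/* Hypothetical names for the two spare low bits. */
#define SKETCH_SECTION_PRESENT ((uintptr_t)1 << 0)	/* a valid section exists */
#define SKETCH_SECTION_HAS_MAP ((uintptr_t)1 << 1)	/* a mem_map is attached */
#define SKETCH_SECTION_FLAGS   (SKETCH_SECTION_PRESENT | SKETCH_SECTION_HAS_MAP)

/* Pack the (at least 4-byte aligned) map value together with the flags. */
uintptr_t sketch_encode_section(uintptr_t map_value, int present, int has_map)
{
	uintptr_t v = map_value & ~SKETCH_SECTION_FLAGS;

	if (present)
		v |= SKETCH_SECTION_PRESENT;
	if (has_map)
		v |= SKETCH_SECTION_HAS_MAP;
	return v;
}

int sketch_section_present(uintptr_t v)
{
	return (v & SKETCH_SECTION_PRESENT) != 0;
}

int sketch_section_has_map(uintptr_t v)
{
	return (v & SKETCH_SECTION_HAS_MAP) != 0;
}

/* Strip the flag bits to recover the value used for the pointer math. */
uintptr_t sketch_section_map_value(uintptr_t v)
{
	return v & ~SKETCH_SECTION_FLAGS;
}
```

      With explicit flag bits, a section whose encoded map value happens to be
      zero can still be reported as present, which a bare NULL test cannot do.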
      Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Jack Steiner <steiner@sgi.com>
      Signed-off-by: Bob Picco <bob.picco@hp.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] sparsemem swiss cheese numa layouts · 641c7673
      Committed by Andy Whitcroft
      The part of the sparsemem patch which modifies memmap_init_zone() has recently
      become a problem.  It changes behavior so that there is a call to
      pfn_to_page() for each individual page inside of a node's range:
      node_start_pfn through node_end_pfn.  It used to simply do this once, at the
      beginning of the node, but having sparsemem's non-contiguous mem_map[]s inside
      of a node made it necessary to change.
      
      Mike Kravetz recently wrote a patch which made the NUMA code accept some new
      kinds of layouts.  The system's memory was laid out like this, with node 0's
      memory in two pieces: one before and one after node 1's memory:
      
      	Node 0: +++++     +++++
      	Node 1:      +++++
      
      Previous behavior before Mike's patch was to assign nodes like this:
      
      	Node 0: 00000     XXXXX
      	Node 1:      11111
      
      Where the 'X' areas were simply thrown away.  The new behavior was to make the
      pg_data_t span node 0 across all of its areas, including areas that are really
      node 1's:

      	Node 0: 000000000000000
      	Node 1:      11111
      
      This wastes a little bit of mem_map space, but ends up being OK, and more
      fully utilizes the system's memory.  memmap_init_zone() initializes all of the
      "struct page"s for node 0, even for the "hole", but those never get used,
      because there is no pfn_to_page() that resolves to those pages.  However,
      since it calls pfn_to_page() only once, memmap_init_zone() always uses the
      pages that were allocated for node0->node_mem_map:
      
      	struct page *start = pfn_to_page(start_pfn);
      	// effectively start = &node->node_mem_map[0]
      	for (page = start; page < (start + size); page++) {
      		init_page_here();...
      	}
      
      Slow, and wasteful, but generally harmless.
      
      But, modify that to call pfn_to_page() for each loop iteration (like sparsemem
      does):
      
      	for (pfn = start_pfn; pfn < (start_pfn + size); pfn++) {
      		page = pfn_to_page(pfn);
      	}
      
      And you end up trying to initialize node 1's pages too early, along with bogus
      data from node 0.  This patch checks for those weird layouts and declines to
      touch the pages, making the more frequent pfn_to_page() calls OK to do.
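
      The effect of such a check can be modelled with a toy layout.  In the
      sketch below the pfn ranges, sketch_pfn_to_nid() and
      sketch_init_node_pages() are all hypothetical; the loop skips any pfn
      that belongs to a different node than the one being initialized.

```c
#include <stddef.h>

/* Toy layout: pfns 0-4 and 10-14 belong to node 0, pfns 5-9 to node 1. */
int sketch_pfn_to_nid(unsigned long pfn)
{
	return (pfn >= 5 && pfn < 10) ? 1 : 0;
}

/*
 * "Initialize" the pages of node nid within [start_pfn, start_pfn + size)
 * by recording the owner, leaving other nodes' pages untouched.
 * Returns the number of pages touched.
 */
size_t sketch_init_node_pages(int nid, unsigned long start_pfn,
			      unsigned long size, int *owner)
{
	size_t touched = 0;
	unsigned long pfn;

	for (pfn = start_pfn; pfn < start_pfn + size; pfn++) {
		if (sketch_pfn_to_nid(pfn) != nid)
			continue;	/* inside the span, but another node's page */
		owner[pfn] = nid;
		touched++;
	}
	return touched;
}
```

      Initializing node 0 over its full 15 page span touches only its own 10
      pages and leaves the 5 pages in the middle for node 1.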
      Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] sparsemem memory model · d41dee36
      Committed by Andy Whitcroft
      Sparsemem abstracts the use of discontiguous mem_maps[].  This kind of
      mem_map[] is needed by discontiguous memory machines (like in the old
      CONFIG_DISCONTIGMEM case) as well as memory hotplug systems.  Sparsemem
      replaces DISCONTIGMEM when enabled, and it is hoped that it can eventually
      become a complete replacement.
      
      A significant advantage over DISCONTIGMEM is that it's completely separated
      from CONFIG_NUMA.  When producing this patch, it became apparent that NUMA
      and DISCONTIG are often confused.
      
      Another advantage is that sparse doesn't require each NUMA node's ranges to be
      contiguous.  It can handle overlapping ranges between nodes with no problems,
      where DISCONTIGMEM currently throws away that memory.
      
      Sparsemem uses an array to provide different pfn_to_page() translations for
      each SECTION_SIZE area of physical memory.  This is what allows the mem_map[]
      to be chopped up.
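
      A per-section lookup of this kind might look as follows in userspace.
      The section size, the table and all names are illustrative assumptions;
      the sketch stores a pointer to each section's first page and indexes it
      with the low bits of the pfn.

```c
#include <stddef.h>

#define SKETCH_SECTION_SHIFT     4	/* 16 pages per section, illustrative */
#define SKETCH_PAGES_PER_SECTION (1UL << SKETCH_SECTION_SHIFT)
#define SKETCH_NR_SECTIONS       8

struct sketch_page {
	unsigned long flags;
};

/* One entry per section; NULL means the section has no mem_map. */
static struct sketch_page *sketch_sections[SKETCH_NR_SECTIONS];

void sketch_set_section(unsigned long nr, struct sketch_page *map)
{
	if (nr < SKETCH_NR_SECTIONS)
		sketch_sections[nr] = map;
}

/* Translate a pfn by picking its section, then indexing within it. */
struct sketch_page *sketch_pfn_to_page(unsigned long pfn)
{
	unsigned long nr = pfn >> SKETCH_SECTION_SHIFT;

	if (nr >= SKETCH_NR_SECTIONS || !sketch_sections[nr])
		return NULL;
	return &sketch_sections[nr][pfn & (SKETCH_PAGES_PER_SECTION - 1)];
}
```

      Because each section carries its own base, the backing arrays need not be
      adjacent in memory, and sections without memory cost only a NULL entry.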
      
      In order to do quick pfn_to_page() operations, the section number of the page
      is encoded in page->flags.  Part of the sparsemem infrastructure enables
      sharing of these bits more dynamically (at compile-time) between the
      page_zone() and sparsemem operations.  However, on 32-bit architectures, the
      number of bits is quite limited, and may require growing the size of the
      page->flags type in certain conditions.  Several things might force this to
      occur: a decrease in the SECTION_SIZE (if you want to hotplug smaller areas of
      memory), an increase in the physical address space, or an increase in the
      number of used page->flags.
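
The bit sharing described above can be sketched as follows. The field widths and the helper names set_field and get_field are hypothetical, and a real kernel derives the widths at compile time from its configuration; this is only meant to show why a wider section field leaves fewer bits for everything else in the same word.

```c
#include <limits.h>

/* Hypothetical field widths, for illustration only. */
#define FLAGS_BITS       (sizeof(unsigned long) * CHAR_BIT)
#define SECTIONS_WIDTH   6
#define ZONES_WIDTH      2

/* Fields are packed downwards from the top of the word. */
#define SECTIONS_PGSHIFT (FLAGS_BITS - SECTIONS_WIDTH)
#define ZONES_PGSHIFT    (SECTIONS_PGSHIFT - ZONES_WIDTH)
#define SECTIONS_MASK    ((1UL << SECTIONS_WIDTH) - 1)
#define ZONES_MASK       ((1UL << ZONES_WIDTH) - 1)

/* Store 'value' in the field described by mask/shift, leaving other bits alone. */
static unsigned long set_field(unsigned long flags, unsigned long value,
			       unsigned long mask, unsigned int shift)
{
	flags &= ~(mask << shift);             /* clear the old field */
	return flags | ((value & mask) << shift);
}

/* Read back the field described by mask/shift. */
static unsigned long get_field(unsigned long flags, unsigned long mask,
			       unsigned int shift)
{
	return (flags >> shift) & mask;
}
```

On a 32-bit unsigned long, the two fields above already consume 8 of the 32 bits, which is the pressure the message refers to.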
      
      One thing to note is that, once sparsemem is present, the NUMA node
      information no longer needs to be stored in the page->flags.  It might provide
      speed increases on certain platforms and will be stored there if there is
      room.  But, if out of room, an alternate (theoretically slower) mechanism is
      used.
      
      This patch introduces CONFIG_FLATMEM.  It is used in almost all cases where
      there used to be an #ifndef DISCONTIG, because SPARSEMEM and DISCONTIGMEM
      often have to compile out the same areas of code.
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Martin Bligh <mbligh@aracnet.com>
      Signed-off-by: Adrian Bunk <bunk@stusta.de>
      Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: Bob Picco <bob.picco@hp.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      d41dee36
    • A
      [PATCH] generify early_pfn_to_nid · b159d43f
      Andy Whitcroft authored
      Provide a default implementation for early_pfn_to_nid returning node 0.  Allow
      architectures to override this with their own implementation out of
      asm/mmzone.h.
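
One common way to express an overridable default in C is an #ifndef guard. The sketch below assumes that pattern and is not copied from the patch: an architecture header that wants its own version would define early_pfn_to_nid before this point, and everyone else gets node 0.

```c
/*
 * Default used when no architecture-specific version exists.  An
 * architecture header could provide its own function and then
 * "#define early_pfn_to_nid early_pfn_to_nid" to suppress this one.
 */
#ifndef early_pfn_to_nid
static inline int early_pfn_to_nid(unsigned long pfn)
{
	(void)pfn;      /* every pfn belongs to node 0 in the default case */
	return 0;
}
#endif
```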
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Martin Bligh <mbligh@aracnet.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      b159d43f
    • D
      [PATCH] Introduce new Kconfig option for NUMA or DISCONTIG · 93b7504e
      Dave Hansen authored
      Some confusion arose while working on the SPARSEMEM patch about what is
      needed for DISCONTIG vs. NUMA.
      
      Multiple pg_data_t's are needed for DISCONTIGMEM or NUMA, independently.
      All of the current NUMA implementations require an implementation of
      DISCONTIG.  Because of this, quite a lot of code which is really needed for
      NUMA is actually under DISCONTIG #ifdefs.  For SPARSEMEM, we changed some
      of these #ifdefs to CONFIG_NUMA, but that broke the DISCONTIG=y and NUMA=n
      case.
      
      Introducing this new NEED_MULTIPLE_NODES config option allows code that is
      needed for either NUMA or DISCONTIG to be separated out from code that is
      specific to DISCONTIG.
      
      One great advantage of this approach is that it doesn't require every
      architecture to be converted over.  All of the current implementations
      should "just work"; only the ones implementing SPARSEMEM will have to be
      fixed up.
      
      The change to free_area_init() makes it work both inside and outside of the
      new config option.
      Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      93b7504e
    • D
      [PATCH] sparsemem base: reorganize page->flags bit operations · 348f8b6c
      Dave Hansen authored
      Generify the value fields in the page_flags.  The aim is to allow the location
      and size of these fields to be varied.  Additionally we want to move away from
      fixed allocations per field whilst still enforcing the overall bit utilisation
      limits.  We rely on the compiler to spot and optimise the accessor functions.
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      348f8b6c
    • D
      [PATCH] sparsemem base: simple NUMA remap space allocator · 6f167ec7
      Dave Hansen authored
      Introduce a simple allocator for the NUMA remap space.  This space is very
      scarce, used for structures which are best allocated node local.
      
      This mechanism is also used on non-NUMA ia64 systems with a vmem_map to keep
      the pgdat->node_mem_map initialized in a consistent place for all
      architectures.
      
      Issues:
      o alloc_remap takes a node_id where we might expect a pgdat.  This was
        intended to allow us to allocate the pgdats themselves using this
        mechanism, which we do not yet do.  Could have alloc_remap_node() and
        alloc_remap_nid() for this purpose.
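
A "simple allocator" for a scarce, fixed region is often a bump allocator: hand out the next free bytes and never free them. The sketch below assumes that design and uses hypothetical names (remap_region, toy_alloc_remap, TOY_ALIGN); it takes a node id rather than a pgdat, mirroring the issue noted above.

```c
#include <stddef.h>

#define TOY_MAX_NODES 2
#define TOY_ALIGN     64UL    /* hypothetical cache-line alignment */

struct remap_region {
	char *next;   /* first free byte, NULL if the node has no remap space */
	char *end;    /* one past the last usable byte */
};

static struct remap_region remap[TOY_MAX_NODES];

/* Hand out 'size' bytes from node 'nid', or NULL if the space is used up. */
static void *toy_alloc_remap(int nid, size_t size)
{
	struct remap_region *r;
	void *p;

	if (nid < 0 || nid >= TOY_MAX_NODES)
		return NULL;
	r = &remap[nid];
	size = (size + TOY_ALIGN - 1) & ~(TOY_ALIGN - 1);   /* round up */
	if (r->next == NULL || size > (size_t)(r->end - r->next))
		return NULL;
	p = r->next;
	r->next += size;          /* bump the pointer; nothing is ever freed */
	return p;
}
```

A caller that gets NULL back would be expected to fall back to an ordinary allocation, since the remap space is only a preference for node-local placement.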
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      6f167ec7
    • D
      [PATCH] remove non-DISCONTIG use of pgdat->node_mem_map · 408fde81
      Dave Hansen authored
      This patch effectively eliminates direct use of pgdat->node_mem_map outside
      of the DISCONTIG code.  On a flat memory system, these fields aren't
      currently used, nor are they on a sparsemem system.
      
      There was also a node_mem_map(nid) macro on many architectures.  Its use
      along with the use of ->node_mem_map itself was not consistent.  It has
      been removed in favor of two new, more explicit, arch-independent macros:
      
      	pgdat_page_nr(pgdat, pagenr)
      	nid_page_nr(nid, pagenr)
      
      I called them "pgdat" and "nid" because we overload the term "node" to mean
      "NUMA node", "DISCONTIG node" or "pg_data_t" in very confusing ways.  I
      believe the newer names are much clearer.
      
      These macros can be overridden in the sparsemem case with a theoretically
      slower operation using node_start_pfn and pfn_to_page(), instead.  We could
      make this the only behavior if people want, but I don't want to change too
      much at once.  One thing at a time.
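
For the flat per-node mem_map case, the two macros could look roughly like the sketch below. The surrounding types (demo_page, demo_pgdat, demo_node_data, DEMO_NODE_DATA) are stand-ins invented so the fragment compiles on its own; only the macro names and argument order come from the message above.

```c
struct demo_page { unsigned long flags; };

struct demo_pgdat {
	struct demo_page *node_mem_map;   /* first struct page of this node */
};

#define DEMO_MAX_NODES 2
static struct demo_pgdat demo_node_data[DEMO_MAX_NODES];
#define DEMO_NODE_DATA(nid) (&demo_node_data[(nid)])

/* Page number 'pagenr' is counted from the start of the node, not from pfn 0. */
#define pgdat_page_nr(pgdat, pagenr) ((pgdat)->node_mem_map + (pagenr))
#define nid_page_nr(nid, pagenr)     pgdat_page_nr(DEMO_NODE_DATA(nid), (pagenr))
```

The sparsemem override mentioned above would replace the pointer arithmetic with a lookup based on node_start_pfn and pfn_to_page(), without changing any caller.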
      
      This patch removes more code than it adds.
      
      Compile tested on alpha, alpha discontig, arm, arm-discontig, i386, i386
      generic, NUMAQ, Summit, ppc64, ppc64 discontig, and x86_64.  Full list
      here: http://sr71.net/patches/2.6.12/2.6.12-rc1-mhp2/configs/
      
      Boot tested on NUMAQ, x86 SMP and ppc64 power4/5 LPARs.
      Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Martin J. Bligh <mbligh@aracnet.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      408fde81
  2. 23 Jun, 2005 11 commits
    • S
      [X25]: Fast select with no restriction on response · ebc3f64b
      Shaun Pereira authored
      This patch is a follow up to patch 1 regarding "Selective Sub Address
      matching with call user data".  It allows use of the Fast-Select-Acceptance
      optional user facility for X.25.
      
      This patch just implements fast select with no restriction on response
      (NRR).  What this means (according to ITU-T Recommendation 10/96 section
      6.16) is that if in an incoming call packet, the relevant facility bits are
      set for fast-select-NRR, then the called DTE can issue a direct response to
      the incoming packet using a call-accepted packet that contains
      call-user-data.  This patch allows such a response.  
      
      The called DTE can also respond with a clear-request packet that contains
      call-user-data.  However, this feature is currently not implemented by the
      patch.
      
      How is Fast Select Acceptance used?
      By default, the system does not allow fast select acceptance (as before).
      To enable a response to fast select acceptance, after a listen socket is
      created and bound as follows
      	call_soc = socket(AF_X25, SOCK_SEQPACKET, 0);
      	bind(call_soc, (struct sockaddr *)&locl_addr, sizeof(locl_addr));
      but before a listen system call is made, the following ioctl should be used.
      	ioctl(call_soc, SIOCX25CALLACCPTAPPRV);
      Now the listen system call can be made
      	listen(call_soc, 4);
      After this, an incoming-call packet will be accepted, but no call-accepted 
      packet will be sent back until the following system call is made on the socket
      that accepts the call
      	ioctl(vc_soc,SIOCX25SENDCALLACCPT);
      The network (or cisco xot router used for testing here) will allow the 
      application server's call-user-data in the call-accepted packet, 
      provided the call-request was made with Fast-select NRR.
      Signed-off-by: Shaun Pereira <spereira@tusc.com.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ebc3f64b
    • S
      [X25]: Selective sub-address matching with call user data. · cb65d506
      Shaun Pereira authored
      From: Shaun Pereira <spereira@tusc.com.au>
      
      This is the first (independent of the second) patch of two that I am
      working on with x25 on linux (tested with xot on a cisco router).  Details
      are as follows.
      
      Current state of module:
      
      A server using the current implementation (2.6.11.7) of the x25 module will
      accept a call request/ incoming call packet at the listening x.25 address,
      from all callers to that address, as long as NO call user data is present
      in the packet header.
      
      If the server needs to choose to accept a particular call request/ incoming
      call packet arriving at its listening x25 address, then the kernel has to
      allow a match of call user data present in the call request packet with its
      own.  This is required when multiple servers listen at the same x25 address
      and device interface.  The kernel currently matches ALL call user data, if
      present.
      
      Current Changes:
      
      This patch is a follow up to the patch submitted previously by Andrew
      Hendry, and allows the user to selectively control the number of octets of
      call user data in the call request packet, that the kernel will match.  By
      default no call user data is matched, even if call user data is present. 
      To allow call user data matching, a cudmatchlength > 0 has to be passed
      into the kernel after which the passed number of octets will be matched. 
      Otherwise the kernel behavior is exactly as the original implementation.
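
The matching rule in the paragraph above can be sketched in plain C. This is an illustration under assumptions, not the module's code: toy_calluserdata, toy_cud_matches and TOY_MAX_CUD_LEN are hypothetical names, and treating a call with fewer octets than the match length as a non-match is a choice made for this sketch.

```c
#include <stdbool.h>
#include <string.h>

#define TOY_MAX_CUD_LEN 128

struct toy_calluserdata {
	unsigned int  cudlength;
	unsigned char cuddata[TOY_MAX_CUD_LEN];
};

/*
 * Does an incoming call match a listener?  Only the first 'cudmatchlength'
 * octets are compared; a length of 0 matches every call, even one that
 * carries call user data.
 */
static bool toy_cud_matches(const struct toy_calluserdata *listener,
			    const struct toy_calluserdata *incoming,
			    unsigned int cudmatchlength)
{
	if (cudmatchlength == 0)
		return true;
	if (cudmatchlength > TOY_MAX_CUD_LEN ||
	    incoming->cudlength < cudmatchlength ||
	    listener->cudlength < cudmatchlength)
		return false;       /* not enough octets to compare */
	return memcmp(listener->cuddata, incoming->cuddata,
		      cudmatchlength) == 0;
}
```

With several servers bound to the same address, each one would pick a match length and prefix that identifies the calls it wants.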
      
      This patch also ensures that, as is normally the case, no call user data
      will be present in the call accepted / call connected packet sent back to
      the caller.
      
      Future Changes on next patch:
      
      There are cases however when call user data may be present in the call
      accepted packet.  According to the X.25 recommendation (ITU-T 10/96)
      section 5.2.3.2 call user data may be present in the call accepted packet
      provided the fast select facility is used.  My next patch will include this
      fast select utility and the ability to send up to 128 octets call user data
      in the call accepted packet provided the fast select facility is used.  I
      am currently testing this, again with xot on linux and cisco.  
      Signed-off-by: Shaun Pereira <spereira@tusc.com.au>
      
      (With a fix from Alexey Dobriyan <adobriyan@gmail.com>)
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cb65d506
    • J
      [NETPOLL]: allow multiple netpoll_clients to register against one interface · fbeec2e1
      Jeff Moyer authored
      This patch provides support for registering multiple netpoll clients to the
      same network device.  Only one of these clients may register an rx_hook,
      however.  In practice, this restriction has not been problematic.  It is
      worth mentioning, though, that the current design can be easily extended to
      allow for the registration of multiple rx_hooks.
      
      The basic idea of the patch is that the rx_np pointer in the netpoll_info
      structure points to the struct netpoll that has rx_hook filled in.  Aside
      from this one case, there is no need for a pointer from the struct
      net_device to an individual struct netpoll.
      
      A lock is introduced to protect the setting and clearing of the rx_np
      pointer.  The pointer will only be cleared upon netpoll client module
      removal, and the lock should be uncontested.
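
The relationship between the per-device structure and its clients can be modelled in userspace as below. Everything prefixed toy_ is hypothetical, a C11 atomic_flag stands in for the kernel spinlock, and rejecting a second rx_hook with -1 is an assumption about how the "only one rx_hook" restriction might be enforced; the message above does not say what the patch does in that case.

```c
#include <stdatomic.h>
#include <stddef.h>

struct toy_netpoll {
	const char *name;
	void (*rx_hook)(struct toy_netpoll *np);   /* NULL for transmit-only clients */
};

/* Per-device state shared by every client registered on that device. */
struct toy_netpoll_info {
	atomic_flag rx_lock;            /* protects rx_np */
	struct toy_netpoll *rx_np;      /* the one client that has an rx_hook */
};

static void toy_lock(atomic_flag *l)   { while (atomic_flag_test_and_set(l)) ; }
static void toy_unlock(atomic_flag *l) { atomic_flag_clear(l); }

/* Example hook that discards whatever it is given. */
static void toy_rx_ignore(struct toy_netpoll *np) { (void)np; }

/* Returns 0 on success, -1 if a second client tries to claim receive. */
static int toy_netpoll_setup(struct toy_netpoll_info *npinfo,
			     struct toy_netpoll *np)
{
	int ret = 0;

	if (np->rx_hook == NULL)
		return 0;               /* no pointer from the device is needed */

	toy_lock(&npinfo->rx_lock);
	if (npinfo->rx_np != NULL && npinfo->rx_np != np)
		ret = -1;               /* assumption: refuse a second rx_hook */
	else
		npinfo->rx_np = np;
	toy_unlock(&npinfo->rx_lock);
	return ret;
}
```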
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fbeec2e1
    • J
      [NETPOLL]: Introduce a netpoll_info struct · 115c1d6e
      Jeff Moyer authored
      This patch introduces a netpoll_info structure, which the struct net_device
      will now point to instead of pointing to a struct netpoll.  The reason for
      this is two-fold: 1) fields such as the rx_flags, poll_owner, and poll_lock
      should be maintained per net_device, not per netpoll;  and 2) this is a first
      step in providing support for multiple netpoll clients to register against the
      same net_device.
      
      The struct netpoll is now pointed to by the netpoll_info structure.  As
      such, the previous behaviour of the code is preserved.
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      115c1d6e
    • J
      [NETPOLL]: Set poll_owner to -1 before unlocking in netpoll_poll_unlock() · 6ca4f65e
      Jeff Moyer authored
      This trivial patch moves the assignment of poll_owner to -1 inside of
      the lock.  This fixes a potential SMP race in the code.
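
The ordering matters because, once the lock is released, another CPU may take it and record itself as owner; a late store of -1 would then wipe out that new owner. A minimal userspace model of the corrected order follows, with hypothetical toy_ names and a C11 atomic_flag standing in for the kernel spinlock.

```c
#include <stdatomic.h>

struct toy_poll_state {
	atomic_flag poll_lock;
	int poll_owner;          /* CPU id of the lock holder, or -1 */
};

static void toy_poll_lock(struct toy_poll_state *s, int cpu)
{
	while (atomic_flag_test_and_set(&s->poll_lock))
		;                /* spin until the lock is free */
	s->poll_owner = cpu;     /* written only while holding the lock */
}

static void toy_poll_unlock(struct toy_poll_state *s)
{
	s->poll_owner = -1;      /* reset while we still hold the lock */
	atomic_flag_clear(&s->poll_lock);
}
```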
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6ca4f65e
    • T
      [PATCH] NFSv4: Clean up nfs4 lock state accounting · 8d0a8a9d
      Trond Myklebust authored
       Ensure that lock owner structures are not released prematurely.
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      8d0a8a9d
    • T
      [PATCH] NLM: fix a client-side race on blocking locks. · ecdbf769
      Trond Myklebust authored
       If the lock blocks, the server may send us a GRANTED message that
       races with the reply to our LOCK request. Make sure that we catch
       the GRANTED by queueing up our request on the nlm_blocked list
       before we send off the first LOCK rpc call.
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      ecdbf769
    • T
    • T
      [PATCH] NFS: Make searching and waiting on busy writeback requests more efficient. · c6a556b8
      Trond Myklebust authored
       Basically copies the VFS's method for tracking writebacks and applies
       it to the struct nfs_page.
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      c6a556b8
    • T
      [PATCH] NFS: Ensure that fstat() always returns the correct mtime · fe51beec
      Trond Myklebust authored
       Even if the file is open for writes.
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      fe51beec
    • T
      [PATCH] NFS: Cleanup of caching code, and slight optimization of writes. · 7d52e862
      Trond Myklebust authored
       Unless we're doing O_APPEND writes, we really don't care about revalidating
       the file length. Just make sure that we catch any page cache invalidations.
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      7d52e862