1. 13 12月, 2007 2 次提交
  2. 08 12月, 2007 1 次提交
    • J
      bonding: Add new layer2+3 hash for xor/802.3ad modes · 6f6652be
      Jay Vosburgh 提交于
       	Add new hash for balance-xor and 802.3ad modes.  Originally
       submitted by "Glenn Griffin" <ggriffin.kernel@gmail.com>; modified by
       Jay Vosburgh to move setting of hash policy out of line, tweak the
       documentation update and add version update to 3.2.2.
      
      	Glenn's original comment follows:
      
      Included is a patch for a new xmit_hash_policy for the bonding driver
      that selects slaves based on MAC and IP information.  This is a middle
      ground between what currently exists in the layer2 only policy and the
      layer3+4 policy.  This policy strives to be fully 802.3ad compliant by
      transmitting every packet of any particular flow over the same link.
      As documented the layer3+4 policy is not fully compliant for extreme
      cases such as ip fragmentation, so this policy is a nice compromise
      for environments that require full compliance but desire more than the
      layer2 only policy.
      Signed-off-by: N"Glenn Griffin" <ggriffin.kernel@gmail.com>
      Signed-off-by: NJay Vosburgh <fubar@us.ibm.com>
      Signed-off-by: NJeff Garzik <jeff@garzik.org>
      6f6652be
  3. 07 12月, 2007 1 次提交
  4. 06 12月, 2007 2 次提交
    • A
      proc: fix proc_dir_entry refcounting · 5a622f2d
      Alexey Dobriyan 提交于
      Creating PDEs with refcount 0 and "deleted" flag has problems (see below).
      Switch to usual scheme:
      * PDE is created with refcount 1
      * every de_get does +1
      * every de_put() and remove_proc_entry() do -1
      * once refcount reaches 0, PDE is freed.
      
      This elegantly fixes at least two following races (both observed) without
      introducing new locks, without abusing old locks, without spreading
      lock_kernel():
      
      1) PDE leak
      
      remove_proc_entry			de_put
      -----------------			------
      			[refcnt = 1]
      if (atomic_read(&de->count) == 0)
      					if (atomic_dec_and_test(&de->count))
      						if (de->deleted)
      							/* also not taken! */
      							free_proc_entry(de);
      else
      	de->deleted = 1;
      		[refcount=0, deleted=1]
      
      2) use after free
      
      remove_proc_entry			de_put
      -----------------			------
      			[refcnt = 1]
      
      					if (atomic_dec_and_test(&de->count))
      if (atomic_read(&de->count) == 0)
      	free_proc_entry(de);
      						/* boom! */
      						if (de->deleted)
      							free_proc_entry(de);
      
      BUG: unable to handle kernel paging request at virtual address 6b6b6b6b
      printing eip: c10acdda *pdpt = 00000000338f8001 *pde = 0000000000000000
      Oops: 0000 [#1] PREEMPT SMP
      Modules linked in: af_packet ipv6 cpufreq_ondemand loop serio_raw psmouse k8temp hwmon sr_mod cdrom
      Pid: 23161, comm: cat Not tainted (2.6.24-rc2-8c086340 #4)
      EIP: 0060:[<c10acdda>] EFLAGS: 00210097 CPU: 1
      EIP is at strnlen+0x6/0x18
      EAX: 6b6b6b6b EBX: 6b6b6b6b ECX: 6b6b6b6b EDX: fffffffe
      ESI: c128fa3b EDI: f380bf34 EBP: ffffffff ESP: f380be44
       DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
      Process cat (pid: 23161, ti=f380b000 task=f38f2570 task.ti=f380b000)
      Stack: c10ac4f0 00000278 c12ce000 f43cd2a8 00000163 00000000 7da86067 00000400
             c128fa20 00896b18 f38325a8 c128fe20 ffffffff 00000000 c11f291e 00000400
             f75be300 c128fa20 f769c9a0 c10ac779 f380bf34 f7bfee70 c1018e6b f380bf34
      Call Trace:
       [<c10ac4f0>] vsnprintf+0x2ad/0x49b
       [<c10ac779>] vscnprintf+0x14/0x1f
       [<c1018e6b>] vprintk+0xc5/0x2f9
       [<c10379f1>] handle_fasteoi_irq+0x0/0xab
       [<c1004f44>] do_IRQ+0x9f/0xb7
       [<c117db3b>] preempt_schedule_irq+0x3f/0x5b
       [<c100264e>] need_resched+0x1f/0x21
       [<c10190ba>] printk+0x1b/0x1f
       [<c107c8ad>] de_put+0x3d/0x50
       [<c107c8f8>] proc_delete_inode+0x38/0x41
       [<c107c8c0>] proc_delete_inode+0x0/0x41
       [<c1066298>] generic_delete_inode+0x5e/0xc6
       [<c1065aa9>] iput+0x60/0x62
       [<c1063c8e>] d_kill+0x2d/0x46
       [<c1063fa9>] dput+0xdc/0xe4
       [<c10571a1>] __fput+0xb0/0xcd
       [<c1054e49>] filp_close+0x48/0x4f
       [<c1055ee9>] sys_close+0x67/0xa5
       [<c10026b6>] sysenter_past_esp+0x5f/0x85
      =======================
      Code: c9 74 0c f2 ae 74 05 bf 01 00 00 00 4f 89 fa 5f 89 d0 c3 85 c9 57 89 c7 89 d0 74 05 f2 ae 75 01 4f 89 f8 5f c3 89 c1 89 c8 eb 06 <80> 38 00 74 07 40 4a 83 fa ff 75 f4 29 c8 c3 90 90 90 57 83 c9
      EIP: [<c10acdda>] strnlen+0x6/0x18 SS:ESP 0068:f380be44
      
      Also, remove broken usage of ->deleted from reiserfs: if sget() succeeds,
      module is already pinned and remove_proc_entry() can't happen => nobody
      can mark PDE deleted.
      
      Dummy proc root in netns code is not marked with refcount 1. AFAICS, we
      never get it, it's just for proper /proc/net removal. I double checked
      CLONE_NETNS continues to work.
      
      Patch survives many hours of modprobe/rmmod/cat loops without new bugs
      which can be attributed to refcounting.
      Signed-off-by: NAlexey Dobriyan <adobriyan@sw.ru>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5a622f2d
    • J
      jbd: Fix assertion failure in fs/jbd/checkpoint.c · d4beaf4a
      Jan Kara 提交于
      Before we start committing a transaction, we call
      __journal_clean_checkpoint_list() to cleanup transaction's written-back
      buffers.
      
      If this call happens to remove all of them (and there were already some
      buffers), __journal_remove_checkpoint() will decide to free the transaction
      because it isn't (yet) a committing transaction and soon we fail some
      assertion - the transaction really isn't ready to be freed :).
      
      We change the check in __journal_remove_checkpoint() to free only a
      transaction in T_FINISHED state.  The locking there is subtle though (as
      everywhere in JBD ;().  We use j_list_lock to protect the check and a
      subsequent call to __journal_drop_transaction() and do the same in the end
      of journal_commit_transaction() which is the only place where a transaction
      can get to T_FINISHED state.
      
      Probably I'm too paranoid here and such locking is not really necessary -
      checkpoint lists are processed only from log_do_checkpoint() where a
      transaction must be already committed to be processed or from
      __journal_clean_checkpoint_list() where kjournald itself calls it and thus
      transaction cannot change state either.  Better be safe if something
      changes in future...
      Signed-off-by: NJan Kara <jack@suse.cz>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d4beaf4a
  5. 05 12月, 2007 4 次提交
    • S
      futex: fix for futex_wait signal stack corruption · ce6bd420
      Steven Rostedt 提交于
      David Holmes found a bug in the -rt tree with respect to
      pthread_cond_timedwait. After trying his test program on the latest git
      from mainline, I found the bug was there too.  The bug he was seeing
      that his test program showed, was that if one were to do a "Ctrl-Z" on a
      process that was in the pthread_cond_timedwait, and then did a "bg" on
      that process, it would return with a "-ETIMEDOUT" but early. That is,
      the timer would go off early.
      
      Looking into this, I found the source of the problem. And it is a rather
      nasty bug at that.
      
      Here's the relevant code from kernel/futex.c: (not in order in the file)
      
      [...]
      smlinkage long sys_futex(u32 __user *uaddr, int op, u32 val,
                                struct timespec __user *utime, u32 __user *uaddr2,
                                u32 val3)
      {
              struct timespec ts;
              ktime_t t, *tp = NULL;
              u32 val2 = 0;
              int cmd = op & FUTEX_CMD_MASK;
      
              if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI)) {
                      if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
                              return -EFAULT;
                      if (!timespec_valid(&ts))
                              return -EINVAL;
      
                      t = timespec_to_ktime(ts);
                      if (cmd == FUTEX_WAIT)
                              t = ktime_add(ktime_get(), t);
                      tp = &t;
              }
      [...]
              return do_futex(uaddr, op, val, tp, uaddr2, val2, val3);
      }
      
      [...]
      
      long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
                      u32 __user *uaddr2, u32 val2, u32 val3)
      {
              int ret;
              int cmd = op & FUTEX_CMD_MASK;
              struct rw_semaphore *fshared = NULL;
      
              if (!(op & FUTEX_PRIVATE_FLAG))
                      fshared = &current->mm->mmap_sem;
      
              switch (cmd) {
              case FUTEX_WAIT:
                      ret = futex_wait(uaddr, fshared, val, timeout);
      
      [...]
      
      static int futex_wait(u32 __user *uaddr, struct rw_semaphore *fshared,
                            u32 val, ktime_t *abs_time)
      {
      [...]
                     struct restart_block *restart;
                      restart = &current_thread_info()->restart_block;
                      restart->fn = futex_wait_restart;
                      restart->arg0 = (unsigned long)uaddr;
                      restart->arg1 = (unsigned long)val;
                      restart->arg2 = (unsigned long)abs_time;
                      restart->arg3 = 0;
                      if (fshared)
                              restart->arg3 |= ARG3_SHARED;
                      return -ERESTART_RESTARTBLOCK;
      [...]
      
      static long futex_wait_restart(struct restart_block *restart)
      {
              u32 __user *uaddr = (u32 __user *)restart->arg0;
              u32 val = (u32)restart->arg1;
              ktime_t *abs_time = (ktime_t *)restart->arg2;
              struct rw_semaphore *fshared = NULL;
      
              restart->fn = do_no_restart_syscall;
              if (restart->arg3 & ARG3_SHARED)
                      fshared = &current->mm->mmap_sem;
              return (long)futex_wait(uaddr, fshared, val, abs_time);
      }
      
      So when the futex_wait is interrupt by a signal we break out of the
      hrtimer code and set up or return from signal. This code does not return
      back to userspace, so we set up a RESTARTBLOCK.  The bug here is that we
      save the "abs_time" which is a pointer to the stack variable "ktime_t t"
      from sys_futex.
      
      This returns and unwinds the stack before we get to call our signal. On
      return from the signal we go to futex_wait_restart, where we update all
      the parameters for futex_wait and call it. But here we have a problem
      where abs_time is no longer valid.
      
      I verified this with print statements, and sure enough, what abs_time
      was set to ends up being garbage when we get to futex_wait_restart.
      
      The solution I did to solve this (with input from Linus Torvalds)
      was to add unions to the restart_block to allow system calls to
      use the restart with specific parameters.  This way the futex code now
      saves the time in a 64bit value in the restart block instead of storing
      it on the stack.
      
      Note: I'm a bit nervious to add "linux/types.h" and use u32 and u64
      in thread_info.h, when there's a #ifdef __KERNEL__ just below that.
      Not sure what that is there for.  If this turns out to be a problem, I've
      tested this with using "unsigned int" for u32 and "unsigned long long" for
      u64 and it worked just the same. I'm using u32 and u64 just to be
      consistent with what the futex code uses.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ce6bd420
    • A
      [LRO]: fix lro_gen_skb() alignment · 621544eb
      Andrew Gallatin 提交于
      Add a field to the lro_mgr struct so that drivers can specify how much
      padding is required to align layer 3 headers when a packet is copied
      into a freshly allocated skb by inet_lro.c:lro_gen_skb().  Without
      padding, skbs generated by LRO will cause alignment warnings on
      architectures which require strict alignment (seen on sparc64).
      
      Myri10GE is updated to use this field.
      Signed-off-by: NAndrew Gallatin <gallatin@myri.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      621544eb
    • E
      Security: round mmap hint address above mmap_min_addr · 7cd94146
      Eric Paris 提交于
      If mmap_min_addr is set and a process attempts to mmap (not fixed) with a
      non-null hint address less than mmap_min_addr the mapping will fail the
      security checks.  Since this is just a hint address this patch will round
      such a hint address above mmap_min_addr.
      
      gcj was found to try to be very frugal with vm usage and give hint addresses
      in the 8k-32k range.  Without this patch all such programs failed and with
      the patch they happily get a higher address.
      
      This patch is wrappad in CONFIG_SECURITY since mmap_min_addr doesn't exist
      without it and there would be no security check possible no matter what.  So
      we should not bother compiling in this rounding if it is just a waste of
      time.
      Signed-off-by: NEric Paris <eparis@redhat.com>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      7cd94146
    • A
      PHY: Add the phy_device_release device method. · 6f4a7f41
      Anton Vorontsov 提交于
      Lately I've got this nice badness on mdio bus removal:
      
      Device 'e0103120:06' does not have a release() function, it is broken and must be fixed.
      ------------[ cut here ]------------
      Badness at drivers/base/core.c:107
      NIP: c015c1a8 LR: c015c1a8 CTR: c0157488
      REGS: c34bdcf0 TRAP: 0700   Not tainted  (2.6.23-rc5-g9ebadfbb-dirty)
      MSR: 00029032 <EE,ME,IR,DR>  CR: 24088422  XER: 00000000
      ...
      [c34bdda0] [c015c1a8] device_release+0x78/0x80 (unreliable)
      [c34bddb0] [c01354cc] kobject_cleanup+0x80/0xbc
      [c34bddd0] [c01365f0] kref_put+0x54/0x6c
      [c34bdde0] [c013543c] kobject_put+0x24/0x34
      [c34bddf0] [c015c384] put_device+0x1c/0x2c
      [c34bde00] [c0180e84] mdiobus_unregister+0x2c/0x58
      ...
      
      Though actually there is nothing broken, it just device
      subsystem core expects another "pattern" of resource managment.
      
      This patch implement phy device's release function, thus
      we're getting rid of this badness.
      
      Also small hidden bug fixed, hope none other introduced. ;-)
      Signed-off-by: NAnton Vorontsov <avorontsov@ru.mvista.com>
      Acked-by: NAndy Fleming <afleming@freescale.com>
      Signed-off-by: NJeff Garzik <jeff@garzik.org>
      6f4a7f41
  6. 03 12月, 2007 1 次提交
    • S
      sched: cpu accounting controller (V2) · d842de87
      Srivatsa Vaddagiri 提交于
      Commit cfb52856 removed a useful feature for
      us, which provided a cpu accounting resource controller.  This feature would be
      useful if someone wants to group tasks only for accounting purpose and doesnt
      really want to exercise any control over their cpu consumption.
      
      The patch below reintroduces the feature. It is based on Paul Menage's
      original patch (Commit 62d0df64), with
      these differences:
      
              - Removed load average information. I felt it needs more thought (esp
      	  to deal with SMP and virtualized platforms) and can be added for
      	  2.6.25 after more discussions.
              - Convert group cpu usage to be nanosecond accurate (as rest of the cfs
      	  stats are) and invoke cpuacct_charge() from the respective scheduler
      	  classes
      	- Make accounting scalable on SMP systems by splitting the usage
      	  counter to be per-cpu
      	- Move the code from kernel/cpu_acct.c to kernel/sched.c (since the
      	  code is not big enough to warrant a new file and also this rightly
      	  needs to live inside the scheduler. Also things like accessing
      	  rq->lock while reading cpu usage becomes easier if the code lived in
      	  kernel/sched.c)
      
      The patch also modifies the cpu controller not to provide the same accounting
      information.
      Tested-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      
       Tested the patches on top of 2.6.24-rc3. The patches work fine. Ran
       some simple tests like cpuspin (spin on the cpu), ran several tasks in
       the same group and timed them. Compared their time stamps with
       cpuacct.usage.
      Signed-off-by: NSrivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      d842de87
  7. 02 12月, 2007 1 次提交
  8. 01 12月, 2007 1 次提交
    • E
      [NETNS]: Fix /proc/net breakage · 2b1e300a
      Eric W. Biederman 提交于
      Well I clearly goofed when I added the initial network namespace support
      for /proc/net.  Currently things work but there are odd details visible to
      user space, even when we have a single network namespace.
      
      Since we do not cache proc_dir_entry dentries at the moment we can just
      modify ->lookup to return a different directory inode depending on the
      network namespace of the process looking at /proc/net, replacing the
      current technique of using a magic and fragile follow_link method.
      
      To accomplish that this patch:
      - introduces a shadow_proc method to allow different dentries to
        be returned from proc_lookup.
      - Removes the old /proc/net follow_link magic
      - Fixes a weakness in our not caching of proc generic dentries.
      
      As shadow_proc uses a task struct to decided which dentry to return we can
      go back later and fix the proc generic caching without modifying any code
      that uses the shadow_proc method.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      2b1e300a
  9. 30 11月, 2007 7 次提交
  10. 29 11月, 2007 2 次提交
  11. 28 11月, 2007 1 次提交
  12. 27 11月, 2007 7 次提交
  13. 26 11月, 2007 1 次提交
    • H
      [SKBUFF]: Free old skb properly in skb_morph · 2d4baff8
      Herbert Xu 提交于
      The skb_morph function only freed the data part of the dst skb, but leaked
      the auxiliary data such as the netfilter fields.  This patch fixes this by
      moving the relevant parts from __kfree_skb to skb_release_all and calling
      it in skb_morph.
      
      It also makes kfree_skbmem static since it's no longer called anywhere else
      and it now no longer does skb_release_data.
      
      Thanks to Yasuyuki KOZAKAI for finding this problem and posting a patch for
      it.
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      2d4baff8
  14. 24 11月, 2007 1 次提交
  15. 22 11月, 2007 1 次提交
  16. 20 11月, 2007 4 次提交
  17. 19 11月, 2007 2 次提交
  18. 17 11月, 2007 1 次提交