1. 05 12月, 2007 2 次提交
    • S
      futex: fix for futex_wait signal stack corruption · ce6bd420
      Steven Rostedt 提交于
      David Holmes found a bug in the -rt tree with respect to
      pthread_cond_timedwait. After trying his test program on the latest git
      from mainline, I found the bug was there too.  The bug he was seeing
      that his test program showed, was that if one were to do a "Ctrl-Z" on a
      process that was in the pthread_cond_timedwait, and then did a "bg" on
      that process, it would return with a "-ETIMEDOUT" but early. That is,
      the timer would go off early.
      
      Looking into this, I found the source of the problem. And it is a rather
      nasty bug at that.
      
      Here's the relevant code from kernel/futex.c: (not in order in the file)
      
      [...]
      smlinkage long sys_futex(u32 __user *uaddr, int op, u32 val,
                                struct timespec __user *utime, u32 __user *uaddr2,
                                u32 val3)
      {
              struct timespec ts;
              ktime_t t, *tp = NULL;
              u32 val2 = 0;
              int cmd = op & FUTEX_CMD_MASK;
      
              if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI)) {
                      if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
                              return -EFAULT;
                      if (!timespec_valid(&ts))
                              return -EINVAL;
      
                      t = timespec_to_ktime(ts);
                      if (cmd == FUTEX_WAIT)
                              t = ktime_add(ktime_get(), t);
                      tp = &t;
              }
      [...]
              return do_futex(uaddr, op, val, tp, uaddr2, val2, val3);
      }
      
      [...]
      
      long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
                      u32 __user *uaddr2, u32 val2, u32 val3)
      {
              int ret;
              int cmd = op & FUTEX_CMD_MASK;
              struct rw_semaphore *fshared = NULL;
      
              if (!(op & FUTEX_PRIVATE_FLAG))
                      fshared = &current->mm->mmap_sem;
      
              switch (cmd) {
              case FUTEX_WAIT:
                      ret = futex_wait(uaddr, fshared, val, timeout);
      
      [...]
      
      static int futex_wait(u32 __user *uaddr, struct rw_semaphore *fshared,
                            u32 val, ktime_t *abs_time)
      {
      [...]
                     struct restart_block *restart;
                      restart = &current_thread_info()->restart_block;
                      restart->fn = futex_wait_restart;
                      restart->arg0 = (unsigned long)uaddr;
                      restart->arg1 = (unsigned long)val;
                      restart->arg2 = (unsigned long)abs_time;
                      restart->arg3 = 0;
                      if (fshared)
                              restart->arg3 |= ARG3_SHARED;
                      return -ERESTART_RESTARTBLOCK;
      [...]
      
      static long futex_wait_restart(struct restart_block *restart)
      {
              u32 __user *uaddr = (u32 __user *)restart->arg0;
              u32 val = (u32)restart->arg1;
              ktime_t *abs_time = (ktime_t *)restart->arg2;
              struct rw_semaphore *fshared = NULL;
      
              restart->fn = do_no_restart_syscall;
              if (restart->arg3 & ARG3_SHARED)
                      fshared = &current->mm->mmap_sem;
              return (long)futex_wait(uaddr, fshared, val, abs_time);
      }
      
      So when the futex_wait is interrupt by a signal we break out of the
      hrtimer code and set up or return from signal. This code does not return
      back to userspace, so we set up a RESTARTBLOCK.  The bug here is that we
      save the "abs_time" which is a pointer to the stack variable "ktime_t t"
      from sys_futex.
      
      This returns and unwinds the stack before we get to call our signal. On
      return from the signal we go to futex_wait_restart, where we update all
      the parameters for futex_wait and call it. But here we have a problem
      where abs_time is no longer valid.
      
      I verified this with print statements, and sure enough, what abs_time
      was set to ends up being garbage when we get to futex_wait_restart.
      
      The solution I did to solve this (with input from Linus Torvalds)
      was to add unions to the restart_block to allow system calls to
      use the restart with specific parameters.  This way the futex code now
      saves the time in a 64bit value in the restart block instead of storing
      it on the stack.
      
      Note: I'm a bit nervious to add "linux/types.h" and use u32 and u64
      in thread_info.h, when there's a #ifdef __KERNEL__ just below that.
      Not sure what that is there for.  If this turns out to be a problem, I've
      tested this with using "unsigned int" for u32 and "unsigned long long" for
      u64 and it worked just the same. I'm using u32 and u64 just to be
      consistent with what the futex code uses.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ce6bd420
    • A
      PHY: Add the phy_device_release device method. · 6f4a7f41
      Anton Vorontsov 提交于
      Lately I've got this nice badness on mdio bus removal:
      
      Device 'e0103120:06' does not have a release() function, it is broken and must be fixed.
      ------------[ cut here ]------------
      Badness at drivers/base/core.c:107
      NIP: c015c1a8 LR: c015c1a8 CTR: c0157488
      REGS: c34bdcf0 TRAP: 0700   Not tainted  (2.6.23-rc5-g9ebadfbb-dirty)
      MSR: 00029032 <EE,ME,IR,DR>  CR: 24088422  XER: 00000000
      ...
      [c34bdda0] [c015c1a8] device_release+0x78/0x80 (unreliable)
      [c34bddb0] [c01354cc] kobject_cleanup+0x80/0xbc
      [c34bddd0] [c01365f0] kref_put+0x54/0x6c
      [c34bdde0] [c013543c] kobject_put+0x24/0x34
      [c34bddf0] [c015c384] put_device+0x1c/0x2c
      [c34bde00] [c0180e84] mdiobus_unregister+0x2c/0x58
      ...
      
      Though actually there is nothing broken, it just device
      subsystem core expects another "pattern" of resource managment.
      
      This patch implement phy device's release function, thus
      we're getting rid of this badness.
      
      Also small hidden bug fixed, hope none other introduced. ;-)
      Signed-off-by: NAnton Vorontsov <avorontsov@ru.mvista.com>
      Acked-by: NAndy Fleming <afleming@freescale.com>
      Signed-off-by: NJeff Garzik <jeff@garzik.org>
      6f4a7f41
  2. 03 12月, 2007 1 次提交
    • S
      sched: cpu accounting controller (V2) · d842de87
      Srivatsa Vaddagiri 提交于
      Commit cfb52856 removed a useful feature for
      us, which provided a cpu accounting resource controller.  This feature would be
      useful if someone wants to group tasks only for accounting purpose and doesnt
      really want to exercise any control over their cpu consumption.
      
      The patch below reintroduces the feature. It is based on Paul Menage's
      original patch (Commit 62d0df64), with
      these differences:
      
              - Removed load average information. I felt it needs more thought (esp
      	  to deal with SMP and virtualized platforms) and can be added for
      	  2.6.25 after more discussions.
              - Convert group cpu usage to be nanosecond accurate (as rest of the cfs
      	  stats are) and invoke cpuacct_charge() from the respective scheduler
      	  classes
      	- Make accounting scalable on SMP systems by splitting the usage
      	  counter to be per-cpu
      	- Move the code from kernel/cpu_acct.c to kernel/sched.c (since the
      	  code is not big enough to warrant a new file and also this rightly
      	  needs to live inside the scheduler. Also things like accessing
      	  rq->lock while reading cpu usage becomes easier if the code lived in
      	  kernel/sched.c)
      
      The patch also modifies the cpu controller not to provide the same accounting
      information.
      Tested-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      
       Tested the patches on top of 2.6.24-rc3. The patches work fine. Ran
       some simple tests like cpuspin (spin on the cpu), ran several tasks in
       the same group and timed them. Compared their time stamps with
       cpuacct.usage.
      Signed-off-by: NSrivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      d842de87
  3. 02 12月, 2007 1 次提交
  4. 01 12月, 2007 1 次提交
    • E
      [NETNS]: Fix /proc/net breakage · 2b1e300a
      Eric W. Biederman 提交于
      Well I clearly goofed when I added the initial network namespace support
      for /proc/net.  Currently things work but there are odd details visible to
      user space, even when we have a single network namespace.
      
      Since we do not cache proc_dir_entry dentries at the moment we can just
      modify ->lookup to return a different directory inode depending on the
      network namespace of the process looking at /proc/net, replacing the
      current technique of using a magic and fragile follow_link method.
      
      To accomplish that this patch:
      - introduces a shadow_proc method to allow different dentries to
        be returned from proc_lookup.
      - Removes the old /proc/net follow_link magic
      - Fixes a weakness in our not caching of proc generic dentries.
      
      As shadow_proc uses a task struct to decided which dentry to return we can
      go back later and fix the proc generic caching without modifying any code
      that uses the shadow_proc method.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      2b1e300a
  5. 30 11月, 2007 7 次提交
  6. 29 11月, 2007 2 次提交
  7. 28 11月, 2007 1 次提交
  8. 27 11月, 2007 7 次提交
  9. 26 11月, 2007 1 次提交
    • H
      [SKBUFF]: Free old skb properly in skb_morph · 2d4baff8
      Herbert Xu 提交于
      The skb_morph function only freed the data part of the dst skb, but leaked
      the auxiliary data such as the netfilter fields.  This patch fixes this by
      moving the relevant parts from __kfree_skb to skb_release_all and calling
      it in skb_morph.
      
      It also makes kfree_skbmem static since it's no longer called anywhere else
      and it now no longer does skb_release_data.
      
      Thanks to Yasuyuki KOZAKAI for finding this problem and posting a patch for
      it.
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      2d4baff8
  10. 24 11月, 2007 1 次提交
  11. 22 11月, 2007 1 次提交
  12. 20 11月, 2007 4 次提交
  13. 19 11月, 2007 2 次提交
  14. 17 11月, 2007 1 次提交
  15. 16 11月, 2007 1 次提交
  16. 15 11月, 2007 7 次提交
    • J
      Fix 64KB blocksize in ext3 directories · 7c06a8dc
      Jan Kara 提交于
      With 64KB blocksize, a directory entry can have size 64KB which does not
      fit into 16 bits we have for entry lenght.  So we store 0xffff instead and
      convert value when read from / written to disk.  The patch also converts
      some places to use ext3_next_entry() when we are changing them anyway.
      
      [akpm@linux-foundation.org: coding-style cleanups]
      Signed-off-by: NJan Kara <jack@suse.cz>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7c06a8dc
    • E
      pidns: Place under CONFIG_EXPERIMENTAL · 57d5f66b
      Eric W. Biederman 提交于
      This is my trivial patch to swat innumerable little bugs with a single
      blow.
      
      After some intensive review (my apologies for not having gotten to this
      sooner) what we have looks like a good base to build on with the current
      pid namespace code but it is not complete, and it is still much to simple
      to find issues where the kernel does the wrong thing outside of the initial
      pid namespace.
      
      Until the dust settles and we are certain we have the ABI and the
      implementation is as correct as humanly possible let's keep process ID
      namespaces behind CONFIG_EXPERIMENTAL.
      
      Allowing us the option of fixing any ABI or other bugs we find as long as
      they are minor.
      
      Allowing users of the kernel to avoid those bugs simply by ensuring their
      kernel does not have support for multiple pid namespaces.
      
      [akpm@linux-foundation.org: coding-style cleanups]
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Adrian Bunk <bunk@kernel.org>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Kir Kolyshkin <kir@swsoft.com>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      57d5f66b
    • B
      rtc: fall back to requesting only the ports we actually use · 9626f1f1
      Bjorn Helgaas 提交于
      Firmware like PNPBIOS or ACPI can report the address space consumed by the
      RTC.  The actual space consumed may be less than the size (RTC_IO_EXTENT)
      assumed by the RTC driver.
      
      The PNP core doesn't request resources yet, but I'd like to make it do so.
      If/when it does, the RTC_IO_EXTENT request may fail, which prevents the RTC
      driver from loading.
      
      Since we only use the RTC index and data registers at RTC_PORT(0) and
      RTC_PORT(1), we can fall back to requesting just enough space for those.
      
      If the PNP core requests resources, this results in typical I/O port usage
      like this:
      
          0070-0073 : 00:06		<-- PNP device 00:06 responds to 70-73
            0070-0071 : rtc		<-- RTC driver uses only 70-71
      
      instead of the current:
      
          0070-0077 : rtc		<-- RTC_IO_EXTENT == 8
      Signed-off-by: NBjorn Helgaas <bjorn.helgaas@hp.com>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: David Brownell <david-b@pacbell.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9626f1f1
    • S
      I/OAT: Add support for version 2 of ioatdma device · 7bb67c14
      Shannon Nelson 提交于
      Add support for version 2 of the ioatdma device.  This device handles
      the descriptor chain and DCA services slightly differently:
       - Instead of moving the dma descriptors between a busy and an idle chain,
         this new version uses a single circular chain so that we don't have
         rewrite the next_descriptor pointers as we add new requests, and the
         device doesn't need to re-read the last descriptor.
       - The new device has the DCA tags defined internally instead of needing
         them defined statically.
      Signed-off-by: NShannon Nelson <shannon.nelson@intel.com>
      Cc: "Williams, Dan J" <dan.j.williams@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7bb67c14
    • A
      revert "Task Control Groups: example CPU accounting subsystem" · cfb52856
      Andrew Morton 提交于
      Revert 62d0df64.
      
      This was originally intended as a simple initial example of how to create a
      control groups subsystem; it wasn't intended for mainline, but I didn't make
      this clear enough to Andrew.
      
      The CFS cgroup subsystem now has better functionality for the per-cgroup usage
      accounting (based directly on CFS stats) than the "usage" status file in this
      patch, and the "load" status file is rather simplistic - although having a
      per-cgroup load average report would be a useful feature, I don't believe this
      patch actually provides it.  If it gets into the final 2.6.24 we'd probably
      have to support this interface for ever.
      
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cfb52856
    • K
      hugetlb: fix i_blocks accounting · 45c682a6
      Ken Chen 提交于
      For administrative purpose, we want to query actual block usage for
      hugetlbfs file via fstat.  Currently, hugetlbfs always return 0.  Fix that
      up since kernel already has all the information to track it properly.
      Signed-off-by: NKen Chen <kenchen@google.com>
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      45c682a6
    • A
      hugetlb: allow bulk updating in hugetlb_*_quota() · 9a119c05
      Adam Litke 提交于
      Add a second parameter 'delta' to hugetlb_get_quota and hugetlb_put_quota to
      allow bulk updating of the sbinfo->free_blocks counter.  This will be used by
      the next patch in the series.
      Signed-off-by: NAdam Litke <agl@us.ibm.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: David Gibson <hermes@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9a119c05