1. 17 Jan, 2008 1 commit
  2. 16 Jan, 2008 1 commit
  3. 21 Dec, 2007 2 commits
  4. 19 Dec, 2007 2 commits
  5. 18 Dec, 2007 14 commits
    • block: let elv_register() return void · 2fdd82bd
      Adrian Bunk authored
      elv_register() always returns 0, and there isn't anything it does where
      it should return an error (the only error condition is so grave that
      it's handled with a BUG_ON).
      Signed-off-by: Adrian Bunk <bunk@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • Revert "hugetlb: Add hugetlb_dynamic_pool sysctl" · 368d2c63
      Nishanth Aravamudan authored
      This reverts commit 54f9f80d ("hugetlb:
      Add hugetlb_dynamic_pool sysctl")
      
      Given the new sysctl nr_overcommit_hugepages, the boolean dynamic pool
      sysctl is not needed, as its semantics can be expressed by 0 in the
      overcommit sysctl (no dynamic pool) and non-0 in the overcommit sysctl
      (pool enabled).
      
      (Needed in 2.6.24 since it reverts a post-2.6.23 userspace-visible change)
      Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
      Acked-by: Adam Litke <agl@us.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: introduce nr_overcommit_hugepages sysctl · d1c3fb1f
      Nishanth Aravamudan authored
      While examining the code to support /proc/sys/vm/hugetlb_dynamic_pool, I
      became convinced that having a boolean sysctl was insufficient:
      
      1) To support per-node control of hugepages, I have previously submitted
      patches to add a sysfs attribute related to nr_hugepages. However, with
      a boolean global value and per-mount quota enforcement constraining the
      dynamic pool, adding corresponding control of the dynamic pool on a
      per-node basis seems inconsistent to me.
      
      2) Administration of the hugetlb dynamic pool with multiple hugetlbfs
      mount points is, arguably, more arduous than it needs to be. Each quota
      would need to be set separately, and the sum would need to be monitored.
      
      To ease the administration, and to help pave the way for per-node
      control of the static & dynamic hugepage pool, I added a separate
      sysctl, nr_overcommit_hugepages. This value serves as a high watermark
      for the overall hugepage pool, while nr_hugepages serves as a low
      watermark. The boolean sysctl can then be removed, as the condition
      
      	nr_overcommit_hugepages > 0
      
      indicates the same administrative setting as
      
      	hugetlb_dynamic_pool == 1
      
      Quotas still serve as local enforcement of the size of the pool on a
      per-mount basis.
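
As a rough illustration of the equivalence described above (not code from the patch), the two sysctls can be modelled as plain counters. The struct and function names here are hypothetical, and the admission rule (compare the current surplus count against the overcommit value) is an assumption inferred from the caveats that follow.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the global counters; not a kernel structure. */
struct pool_sketch {
    unsigned long nr_hugepages;            /* static pool size: low watermark */
    unsigned long nr_overcommit_hugepages; /* surplus pages allowed on top */
    unsigned long surplus_huge_pages;      /* surplus pages currently in use */
};

/* Equivalent of the old boolean setting hugetlb_dynamic_pool == 1. */
static bool dynamic_pool_enabled(const struct pool_sketch *p)
{
    return p->nr_overcommit_hugepages > 0;
}

/* Assumed policy: refuse a new surplus page once the surplus count has
 * reached (or already exceeds) the overcommit value. */
static bool may_add_surplus_page(const struct pool_sketch *p)
{
    return p->surplus_huge_pages < p->nr_overcommit_hugepages;
}
```

With an overcommit value of 0 the dynamic pool is off and no surplus page is ever admitted, which is the behaviour the removed boolean expressed.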
      
      A few caveats:
      
      1) There is a race whereby the global surplus huge page counter is
      incremented before a hugepage has been allocated. Another process could then
      try to grow the pool, fail to convert a surplus huge page to a normal
      huge page, and instead allocate a fresh huge page. I believe this is
      benign, as no memory is leaked (the actual pages are still tracked
      correctly) and the counters won't go out of sync.
      
      2) Shrinking the static pool while a surplus is in effect will allow the
      number of surplus huge pages to exceed the overcommit value. As long as
      this condition holds, however, no more surplus huge pages will be
      allowed on the system until one of the two sysctls is increased
      sufficiently, or the surplus huge pages go out of use and are freed.
      
      Successfully tested on x86_64 with the current libhugetlbfs snapshot,
      modified to use the new sysctl.
      Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
      Acked-by: Adam Litke <agl@us.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • apm_event{,info}_t are userspace types · 8d936626
      Adam Jackson authored
      These types define the size of data read from /dev/apm_bios.  They should
      not be hidden behind #ifdef __KERNEL__.

      This is killing my xserver compile: apm_event_t is used in the xserver
      source.
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fix headers_install · 75527135
      Andrew Morton authored
      make[3]: *** No rule to make target `/usr/src/devel/include/linux/ticable.h', needed by `/usr/src/devel/usr/include/linux/ticable.h'.  Stop.
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • libata: fix ATAPI draining · 140b5e59
      Tejun Heo authored
      With ATAPI transfer chunk size properly programmed, libata PIO HSM
      should be able to handle full spurious data chunks.  Also, it's a good
      idea to suppress trailing data warning for misc ATAPI commands as
      there can be many of them per command - for example, if the chunk size
      is 16 and the drive tries to transfer 510 bytes, there can be 31
      trailing data messages.
      
      This patch makes the following updates to libata ATAPI PIO HSM
      implementation.
      
      * Make it drain full spurious chunks.
      
      * Suppress trailing data warning message for misc commands.
      
      * Put a limit on how many bytes can be drained.
      
      * If odd, round up consumed bytes and the number of bytes to be
        drained.  This gets the number of bytes to drain right for drivers
        which do 16bit PIO.
      
      This patch is a partial backport of the improve-ATAPI-data-xfer patchset
      pending for #upstream.
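
The arithmetic in this message can be sketched as follows. This is an illustration only, not the libata code: the helper names and the `max_drain` cap are hypothetical. A 510-byte transfer with a 16-byte chunk size contains 31 full chunks plus 14 leftover bytes, which is consistent with the 31 messages quoted above if one message is printed per full chunk (that reading is an assumption).

```c
/* Round an odd byte count up to the next even value, since 16-bit PIO
 * moves two bytes at a time. */
static unsigned int round_up_even(unsigned int bytes)
{
    return bytes + (bytes & 1u);
}

/* Number of completely filled chunks in a transfer of `bytes` bytes. */
static unsigned int full_chunks(unsigned int bytes, unsigned int chunk_size)
{
    return bytes / chunk_size;
}

/* Bytes to drain: rounded up to even, then capped at a hypothetical limit. */
static unsigned int drain_len(unsigned int trailing, unsigned int max_drain)
{
    unsigned int len = round_up_even(trailing);

    return len < max_drain ? len : max_drain;
}
```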
      Signed-off-by: Tejun Heo <htejun@gmail.com>
      Signed-off-by: Jeff Garzik <jeff@garzik.org>
    • libata-acpi: implement dev->gtf_cache and evaluate _GTF right after _STM during resume · 398e0782
      Tejun Heo authored
      On certain implementations, _GTF evaluation depends on preceding _STM
      and both can be pretty picky about the configuration.  Using _GTM
      result cached during controller initialization satisfies the most
      neurotic _STM implementation.  However, libata evaluates _GTF after
      reset during device configuration and the hardware state can be
      different from what _GTF expects and can cause evaluation failure.
      
      This patch adds dev->gtf_cache and updates ata_dev_get_GTF() such that
      it uses the cached value if available.  Cache is cleared with a call
      to ata_acpi_clear_gtf().
      
      Because for SATA ACPI nodes _GTF must be evaluated after _SDD which
      can't be done till IDENTIFY is complete, _GTF caching from
      ata_acpi_on_resume() is used only for IDE ACPI nodes.
      Signed-off-by: Tejun Heo <htejun@gmail.com>
      Signed-off-by: Jeff Garzik <jeff@garzik.org>
    • libata-acpi: implement and use ata_acpi_init_gtm() · c05e6ff0
      Tejun Heo authored
      _GTM fetches currently configured transfer mode while _STM configures
      controller according to _GTM parameter and prepares transfer mode
      configuration TFs for _GTF.  In many cases _GTM and _STM
      implementations are quite brittle and can't cope with configuration
      changed by libata.
      
      libata does not depend on ATA ACPI to configure devices.  The only
      reason libata performs _GTM and _STM is to make _GTF evaluation
      succeed, and libata also doesn't care about how _GTF TFs configure
      transfer mode.  It overrides that configuration anyway, so from
      libata's POV, it doesn't matter what value is fed to _STM as long
      as evaluation succeeds for _STM and the following _GTF.
      
      This patch adds dev->__acpi_init_gtm and stores the initial _GTM values on
      host initialization, before they are modified by reset and mode configuration.
      If the field is valid, ata_acpi_init_gtm() returns a pointer to the
      saved _GTM structure; otherwise, NULL.

      This saved value is used for _STM during resume and to peek at the
      BIOS/firmware-programmed initial timing for later use.  The accessor
      is there to make building w/o ACPI easy, as dev->__acpi_init_gtm doesn't
      exist if ACPI is not enabled.
      
      On driver detach, the initial BIOS configuration is restored by
      executing _STM with the initial _GTM values such that the next driver
      can also use the initial BIOS configured values.
      Signed-off-by: Tejun Heo <htejun@gmail.com>
      Signed-off-by: Jeff Garzik <jeff@garzik.org>
    • libata: add more opcodes to ata.h · ce2e0abb
      Tejun Heo authored
      Add constants for DEVICE CONFIGURATION OVERLAY and SET_MAX to
      include/linux/ata.h.
      Signed-off-by: Tejun Heo <htejun@gmail.com>
      Signed-off-by: Jeff Garzik <jeff@garzik.org>
    • libata: update ata_*_printk() macros such that level can be a variable · c2e366a1
      Tejun Heo authored
      Make printk helpers format @lv together rather than prepending it to the
      format string as a constant.
      Signed-off-by: Tejun Heo <htejun@gmail.com>
      Signed-off-by: Jeff Garzik <jeff@garzik.org>
    • libata-acpi: adjust constness in ata_acpi_gtm/stm() parameters · 0d02f0b2
      Tejun Heo authored
      * No internal function uses const ata_port.  Drop const from @ap.

      * Make ata_acpi_stm() copy @stm before using it and change @stm to
        const.
      Signed-off-by: Tejun Heo <htejun@gmail.com>
      Signed-off-by: Jeff Garzik <jeff@garzik.org>
    • usb.h: fix kernel-doc warning · f88ed90d
      Randy Dunlap authored
      Fix kernel-doc warning in usb.h:
      Warning(linux-2.6.24-rc3-git7//include/linux/usb.h:166): No description found for parameter 'sysfs_files_created'
      Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • usb-storage: Fix devices that cannot handle 32k transfers · 33abc04f
      Doug Maxey authored
      When a device cannot handle the smallest previously limited transfer
      size (64 blocks) without stalling, limit the device to the number of
      packets that fit in a platform native page.
      
      The lowest possible limit is PAGE_CACHE_SIZE, so if the device is ever
      used on a platform that has larger than 8K pages, you lose unless you
      can convince the device firmware folks to fix the issue.
      
      Cc: Mathew Dharm <mdharm-scsi@one-eyed-alien.net>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Pete Zaitcev <zaitcev@redhat.com>
      Signed-off-by: Doug Maxey <dwm@austin.ibm.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • tipar: remove obsolete module · cb8c9b6d
      Romain Liévin authored
      The tipar character driver was used to implement bit-banging access
      to Texas Instruments parallel link cable. A user-land method now
      exists through PPDEV & PARPORT.
      Signed-off-by: Romain Liévin <roms@lpg.ticalc.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  6. 15 Dec, 2007 1 commit
  7. 13 Dec, 2007 3 commits
    • ide: DMA reporting and validity checking fixes (take 3) · 3ab7efe8
      Bartlomiej Zolnierkiewicz authored
      * ide_xfer_verbose() fixups:
        - beautify returned mode names
        - fix PIO5 reporting
        - make it return 'const char *'
      
      * Change printk() level from KERN_DEBUG to KERN_INFO in ide_find_dma_mode().
      
      * Add ide_id_dma_bug() helper based on ide_dma_verbose() to check for invalid
        DMA info in identify block.
      
      * Use ide_id_dma_bug() in ide_tune_dma() and ide_driveid_update().
      
        As a result DMA won't be tuned or will be disabled after tuning if device
        reports inconsistent info about enabled DMA mode (ide_dma_verbose() does the
        same checks while the IDE device is probed by ide-{cd,disk} device driver).
      
      * Remove no longer needed ide_dma_verbose().
      
      This patch should fix the following problem with out-of-sync IDE messages
      reported by Nick Warne:
      
             hdd: ATAPI 48X DVD-ROM DVD-R-RAM CD-R/RW drive, 2048kB Cache<7>hdd:
             skipping word 93 validity check
              , UDMA(66)
      
      and later debugged by Mark Lord to be caused by:
      
              ide_dma_verbose()
                      printk( ... "2048kB Cache");
              eighty_ninty_three()
                      printk(KERN_DEBUG "%s: skipping word 93 validity check\n");
              ide_dma_verbose()
                      printk(", UDMA(66)"
      
      Please note that as a result ide-{cd,disk} device drivers won't report the
      DMA speed used but this is intended since now DMA mode being used is always
      reported by IDE core code.
      
      v2:
      * fixes suggested by Randy:
        - use KERN_CONT for printk()-s in ide-{cd,disk}.c
        - don't remove argument name from ide_xfer_verbose() declaration
      
      v3:
      * Remove incorrect check for (id->field_valid & 1) from ide_id_dma_bug()
        (spotted by Sergei).
      
      * "XFER SLOW" -> "PIO SLOW" in ide_xfer_verbose() (suggested by Sergei).
      
      * Fix ide_find_dma_mode() to report the correct mode ('mode' after being
        limited by 'req_mode').
      
      Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com>
      Cc: Nick Warne <nick@ukfsn.org>
      Cc: Mark Lord <lkml@rtr.ca>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
    • mmc: remove unused 'mode' from the mmc_host structure · cc3000e4
      Nicolas Pitre authored
      This field and corresponding defines are simply never used anywhere
      in the code.  But its mere presence is enough to confuse some host
      driver authors who attempt to rely on it.  Let's eliminate the
      possibility for confusion and remove it entirely.
      Signed-off-by: Nicolas Pitre <nico@cam.org>
      Signed-off-by: Pierre Ossman <drzeus@drzeus.cx>
    • sdhci: support JMicron JMB38x chips · 84c46a53
      Pierre Ossman authored
      The JMicron JMB38x chip doesn't support transfers that aren't 32-bit
      aligned (both size and start address). It also doesn't like switching
      between PIO and DMA mode, so it needs to be reset after each request.
      Signed-off-by: Pierre Ossman <drzeus@drzeus.cx>
  8. 11 Dec, 2007 1 commit
  9. 08 Dec, 2007 1 commit
    • bonding: Add new layer2+3 hash for xor/802.3ad modes · 6f6652be
      Jay Vosburgh authored
      Add new hash for balance-xor and 802.3ad modes.  Originally
      submitted by "Glenn Griffin" <ggriffin.kernel@gmail.com>; modified by
      Jay Vosburgh to move setting of hash policy out of line, tweak the
      documentation update and add version update to 3.2.2.

      Glenn's original comment follows:
      
      Included is a patch for a new xmit_hash_policy for the bonding driver
      that selects slaves based on MAC and IP information.  This is a middle
      ground between what currently exists in the layer2 only policy and the
      layer3+4 policy.  This policy strives to be fully 802.3ad compliant by
      transmitting every packet of any particular flow over the same link.
      As documented, the layer3+4 policy is not fully compliant for extreme
      cases such as IP fragmentation, so this policy is a nice compromise
      for environments that require full compliance but desire more than the
      layer2 only policy.
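
A minimal sketch of a hash that selects a slave from MAC and IP information is shown below. The exact combination (low 16 bits of the XORed IPv4 addresses, XORed with the last byte of each MAC address, modulo the slave count) is an assumption made for illustration, and `layer23_hash` is a hypothetical name; the patch itself is the reference for what the driver computes. IP addresses are taken in host byte order here.

```c
#include <stdint.h>

/* Pick a slave index from layer 2 and layer 3 header fields. Every packet
 * of a flow carries the same MACs and IPs, so it maps to the same slave. */
static unsigned int layer23_hash(const uint8_t src_mac[6],
                                 const uint8_t dst_mac[6],
                                 uint32_t src_ip, uint32_t dst_ip,
                                 unsigned int slave_count)
{
    uint32_t l3 = (src_ip ^ dst_ip) & 0xffffu;          /* IP contribution */
    uint32_t l2 = (uint32_t)(src_mac[5] ^ dst_mac[5]);  /* MAC contribution */

    return (l3 ^ l2) % slave_count;
}
```

Because no layer 4 port enters the hash, fragments of one IP datagram cannot be split across links, which is the compliance property the message describes.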
      Signed-off-by: "Glenn Griffin" <ggriffin.kernel@gmail.com>
      Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
      Signed-off-by: Jeff Garzik <jeff@garzik.org>
  10. 07 Dec, 2007 1 commit
  11. 06 Dec, 2007 2 commits
    • proc: fix proc_dir_entry refcounting · 5a622f2d
      Alexey Dobriyan authored
      Creating PDEs with refcount 0 and a "deleted" flag has problems (see below).
      Switch to the usual scheme:
      * PDE is created with refcount 1
      * every de_get() does +1
      * every de_put() and remove_proc_entry() do -1
      * once refcount reaches 0, PDE is freed.
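
The scheme in the list above can be sketched in userspace C with C11 atomics. This is an illustration under stated assumptions, not the procfs code: the `entry_*` names are hypothetical, and a `freed` flag stands in for actually releasing the memory so the behaviour can be observed.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted entry; `freed` stands in for a real free(). */
struct entry {
    atomic_int count;
    bool freed;
};

/* An entry is created holding one reference, owned by the registration. */
static void entry_init(struct entry *e)
{
    atomic_init(&e->count, 1);
    e->freed = false;
}

/* Every get adds one reference. */
static void entry_get(struct entry *e)
{
    atomic_fetch_add(&e->count, 1);
}

/* Every put drops one reference; whoever drops the last one frees. */
static void entry_put(struct entry *e)
{
    if (atomic_fetch_sub(&e->count, 1) == 1)
        e->freed = true;
}

/* Removal drops the reference taken at creation, nothing more. */
static void entry_remove(struct entry *e)
{
    entry_put(e);
}
```

Because the decrement and the "was this the last reference" test are one atomic operation, the entry is released exactly once regardless of whether the final put or the removal runs last.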
      
      This elegantly fixes at least the following two races (both observed) without
      introducing new locks, without abusing old locks, without spreading
      lock_kernel():
      
      1) PDE leak
      
      remove_proc_entry			de_put
      -----------------			------
      			[refcnt = 1]
      if (atomic_read(&de->count) == 0)
      					if (atomic_dec_and_test(&de->count))
      						if (de->deleted)
      							/* also not taken! */
      							free_proc_entry(de);
      else
      	de->deleted = 1;
      		[refcount=0, deleted=1]
      
      2) use after free
      
      remove_proc_entry			de_put
      -----------------			------
      			[refcnt = 1]
      
      					if (atomic_dec_and_test(&de->count))
      if (atomic_read(&de->count) == 0)
      	free_proc_entry(de);
      						/* boom! */
      						if (de->deleted)
      							free_proc_entry(de);
      
      BUG: unable to handle kernel paging request at virtual address 6b6b6b6b
      printing eip: c10acdda *pdpt = 00000000338f8001 *pde = 0000000000000000
      Oops: 0000 [#1] PREEMPT SMP
      Modules linked in: af_packet ipv6 cpufreq_ondemand loop serio_raw psmouse k8temp hwmon sr_mod cdrom
      Pid: 23161, comm: cat Not tainted (2.6.24-rc2-8c086340 #4)
      EIP: 0060:[<c10acdda>] EFLAGS: 00210097 CPU: 1
      EIP is at strnlen+0x6/0x18
      EAX: 6b6b6b6b EBX: 6b6b6b6b ECX: 6b6b6b6b EDX: fffffffe
      ESI: c128fa3b EDI: f380bf34 EBP: ffffffff ESP: f380be44
       DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
      Process cat (pid: 23161, ti=f380b000 task=f38f2570 task.ti=f380b000)
      Stack: c10ac4f0 00000278 c12ce000 f43cd2a8 00000163 00000000 7da86067 00000400
             c128fa20 00896b18 f38325a8 c128fe20 ffffffff 00000000 c11f291e 00000400
             f75be300 c128fa20 f769c9a0 c10ac779 f380bf34 f7bfee70 c1018e6b f380bf34
      Call Trace:
       [<c10ac4f0>] vsnprintf+0x2ad/0x49b
       [<c10ac779>] vscnprintf+0x14/0x1f
       [<c1018e6b>] vprintk+0xc5/0x2f9
       [<c10379f1>] handle_fasteoi_irq+0x0/0xab
       [<c1004f44>] do_IRQ+0x9f/0xb7
       [<c117db3b>] preempt_schedule_irq+0x3f/0x5b
       [<c100264e>] need_resched+0x1f/0x21
       [<c10190ba>] printk+0x1b/0x1f
       [<c107c8ad>] de_put+0x3d/0x50
       [<c107c8f8>] proc_delete_inode+0x38/0x41
       [<c107c8c0>] proc_delete_inode+0x0/0x41
       [<c1066298>] generic_delete_inode+0x5e/0xc6
       [<c1065aa9>] iput+0x60/0x62
       [<c1063c8e>] d_kill+0x2d/0x46
       [<c1063fa9>] dput+0xdc/0xe4
       [<c10571a1>] __fput+0xb0/0xcd
       [<c1054e49>] filp_close+0x48/0x4f
       [<c1055ee9>] sys_close+0x67/0xa5
       [<c10026b6>] sysenter_past_esp+0x5f/0x85
      =======================
      Code: c9 74 0c f2 ae 74 05 bf 01 00 00 00 4f 89 fa 5f 89 d0 c3 85 c9 57 89 c7 89 d0 74 05 f2 ae 75 01 4f 89 f8 5f c3 89 c1 89 c8 eb 06 <80> 38 00 74 07 40 4a 83 fa ff 75 f4 29 c8 c3 90 90 90 57 83 c9
      EIP: [<c10acdda>] strnlen+0x6/0x18 SS:ESP 0068:f380be44
      
      Also, remove broken usage of ->deleted from reiserfs: if sget() succeeds,
      module is already pinned and remove_proc_entry() can't happen => nobody
      can mark PDE deleted.
      
      Dummy proc root in netns code is not marked with refcount 1. AFAICS, we
      never get it, it's just for proper /proc/net removal. I double checked
      CLONE_NETNS continues to work.
      
      Patch survives many hours of modprobe/rmmod/cat loops without new bugs
      which can be attributed to refcounting.
      Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • jbd: Fix assertion failure in fs/jbd/checkpoint.c · d4beaf4a
      Jan Kara authored
      Before we start committing a transaction, we call
      __journal_clean_checkpoint_list() to clean up the transaction's written-back
      buffers.
      
      If this call happens to remove all of them (and there were already some
      buffers), __journal_remove_checkpoint() will decide to free the transaction
      because it isn't (yet) a committing transaction and soon we fail some
      assertion - the transaction really isn't ready to be freed :).
      
      We change the check in __journal_remove_checkpoint() to free only a
      transaction in T_FINISHED state.  The locking there is subtle though (as
      everywhere in JBD ;().  We use j_list_lock to protect the check and a
      subsequent call to __journal_drop_transaction() and do the same in the end
      of journal_commit_transaction() which is the only place where a transaction
      can get to T_FINISHED state.
      
      Probably I'm too paranoid here and such locking is not really necessary -
      checkpoint lists are processed only from log_do_checkpoint() where a
      transaction must be already committed to be processed or from
      __journal_clean_checkpoint_list() where kjournald itself calls it and thus
      transaction cannot change state either.  Better be safe if something
      changes in future...
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 05 Dec, 2007 4 commits
    • futex: fix for futex_wait signal stack corruption · ce6bd420
      Steven Rostedt authored
      David Holmes found a bug in the -rt tree with respect to
      pthread_cond_timedwait. After trying his test program on the latest git
      from mainline, I found the bug was there too.  The bug he was seeing
      that his test program showed, was that if one were to do a "Ctrl-Z" on a
      process that was in the pthread_cond_timedwait, and then did a "bg" on
      that process, it would return with a "-ETIMEDOUT" but early. That is,
      the timer would go off early.
      
      Looking into this, I found the source of the problem. And it is a rather
      nasty bug at that.
      
      Here's the relevant code from kernel/futex.c: (not in order in the file)
      
      [...]
      asmlinkage long sys_futex(u32 __user *uaddr, int op, u32 val,
                                struct timespec __user *utime, u32 __user *uaddr2,
                                u32 val3)
      {
              struct timespec ts;
              ktime_t t, *tp = NULL;
              u32 val2 = 0;
              int cmd = op & FUTEX_CMD_MASK;
      
              if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI)) {
                      if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
                              return -EFAULT;
                      if (!timespec_valid(&ts))
                              return -EINVAL;
      
                      t = timespec_to_ktime(ts);
                      if (cmd == FUTEX_WAIT)
                              t = ktime_add(ktime_get(), t);
                      tp = &t;
              }
      [...]
              return do_futex(uaddr, op, val, tp, uaddr2, val2, val3);
      }
      
      [...]
      
      long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
                      u32 __user *uaddr2, u32 val2, u32 val3)
      {
              int ret;
              int cmd = op & FUTEX_CMD_MASK;
              struct rw_semaphore *fshared = NULL;
      
              if (!(op & FUTEX_PRIVATE_FLAG))
                      fshared = &current->mm->mmap_sem;
      
              switch (cmd) {
              case FUTEX_WAIT:
                      ret = futex_wait(uaddr, fshared, val, timeout);
      
      [...]
      
      static int futex_wait(u32 __user *uaddr, struct rw_semaphore *fshared,
                            u32 val, ktime_t *abs_time)
      {
      [...]
                     struct restart_block *restart;
                      restart = &current_thread_info()->restart_block;
                      restart->fn = futex_wait_restart;
                      restart->arg0 = (unsigned long)uaddr;
                      restart->arg1 = (unsigned long)val;
                      restart->arg2 = (unsigned long)abs_time;
                      restart->arg3 = 0;
                      if (fshared)
                              restart->arg3 |= ARG3_SHARED;
                      return -ERESTART_RESTARTBLOCK;
      [...]
      
      static long futex_wait_restart(struct restart_block *restart)
      {
              u32 __user *uaddr = (u32 __user *)restart->arg0;
              u32 val = (u32)restart->arg1;
              ktime_t *abs_time = (ktime_t *)restart->arg2;
              struct rw_semaphore *fshared = NULL;
      
              restart->fn = do_no_restart_syscall;
              if (restart->arg3 & ARG3_SHARED)
                      fshared = &current->mm->mmap_sem;
              return (long)futex_wait(uaddr, fshared, val, abs_time);
      }
      
      So when futex_wait is interrupted by a signal, we break out of the
      hrtimer code and set up the return from the signal. This code does not return
      back to userspace, so we set up a RESTARTBLOCK.  The bug here is that we
      save "abs_time", which is a pointer to the stack variable "ktime_t t"
      from sys_futex.
      
      This returns and unwinds the stack before we get to call our signal. On
      return from the signal we go to futex_wait_restart, where we update all
      the parameters for futex_wait and call it. But here we have a problem
      where abs_time is no longer valid.
      
      I verified this with print statements, and sure enough, what abs_time
      was set to ends up being garbage when we get to futex_wait_restart.
      
      The solution I did to solve this (with input from Linus Torvalds)
      was to add unions to the restart_block to allow system calls to
      use the restart with specific parameters.  This way the futex code now
      saves the time in a 64bit value in the restart block instead of storing
      it on the stack.
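
The idea of the fix can be sketched as follows: keep the timeout in the restart block by value, so nothing points back into a stack frame that has already been unwound. The layout and names below are a simplified, hypothetical model for illustration and are not copied from the patch.

```c
#include <stdint.h>

/* Simplified model of a restart block: generic arguments share storage
 * with a futex-specific view that holds the timeout as a 64-bit value. */
struct restart_block_sketch {
    union {
        struct {
            unsigned long arg0, arg1, arg2, arg3;
        } generic;
        struct {
            uint32_t *uaddr;
            uint32_t  val;
            uint32_t  flags;
            uint64_t  time;   /* absolute timeout, stored by value */
        } futex;
    } u;
};

/* Record the wait parameters. The timeout is copied out of *abs_time, so
 * the caller's variable may go out of scope afterwards. */
static void save_futex_wait(struct restart_block_sketch *rb, uint32_t *uaddr,
                            uint32_t val, const uint64_t *abs_time)
{
    rb->u.futex.uaddr = uaddr;
    rb->u.futex.val   = val;
    rb->u.futex.flags = 0;
    rb->u.futex.time  = *abs_time;
}

/* On restart, read the timeout back from the block itself. */
static uint64_t restart_timeout(const struct restart_block_sketch *rb)
{
    return rb->u.futex.time;
}
```

Storing `(unsigned long)abs_time` in a generic argument slot, as the buggy code did, would preserve only the address, which is meaningless once the frame that owned the variable has returned.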
      
      Note: I'm a bit nervous about adding "linux/types.h" and using u32 and u64
      in thread_info.h, when there's a #ifdef __KERNEL__ just below that.
      Not sure what that is there for.  If this turns out to be a problem, I've
      tested this with using "unsigned int" for u32 and "unsigned long long" for
      u64 and it worked just the same. I'm using u32 and u64 just to be
      consistent with what the futex code uses.
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    • [LRO]: fix lro_gen_skb() alignment · 621544eb
      Andrew Gallatin authored
      Add a field to the lro_mgr struct so that drivers can specify how much
      padding is required to align layer 3 headers when a packet is copied
      into a freshly allocated skb by inet_lro.c:lro_gen_skb().  Without
      padding, skbs generated by LRO will cause alignment warnings on
      architectures which require strict alignment (seen on sparc64).

      Myri10GE is updated to use this field.
      Signed-off-by: Andrew Gallatin <gallatin@myri.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Security: round mmap hint address above mmap_min_addr · 7cd94146
      Eric Paris authored
      If mmap_min_addr is set and a process attempts to mmap (not fixed) with a
      non-null hint address less than mmap_min_addr the mapping will fail the
      security checks.  Since this is just a hint address this patch will round
      such a hint address above mmap_min_addr.
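
The rounding described above can be sketched as a small pure function. This is an illustration, not the patch: the 4 KiB page size, the helper names, and passing the minimum address as a parameter are all assumptions made to keep the example self-contained.

```c
/* Assumed page size for the sketch. */
#define SKETCH_PAGE_SIZE 4096UL

/* Round an address up to the next page boundary. */
static unsigned long page_align_up(unsigned long addr)
{
    return (addr + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/* A non-null hint below min_addr is replaced by the first page-aligned
 * address at or above min_addr; any other hint is only page-truncated. */
static unsigned long round_hint(unsigned long hint, unsigned long min_addr)
{
    hint &= ~(SKETCH_PAGE_SIZE - 1);
    if (hint != 0 && hint < min_addr)
        return page_align_up(min_addr);
    return hint;
}
```

A null hint is left alone, so callers that express no preference are unaffected, while a hint in the 8k-32k range is moved up instead of failing the security check.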
      
      gcj was found to try to be very frugal with vm usage and give hint addresses
      in the 8k-32k range.  Without this patch all such programs failed and with
      the patch they happily get a higher address.
      
      This patch is wrapped in CONFIG_SECURITY since mmap_min_addr doesn't exist
      without it and there would be no security check possible no matter what.  So
      we should not bother compiling in this rounding if it is just a waste of
      time.
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Signed-off-by: James Morris <jmorris@namei.org>
    • PHY: Add the phy_device_release device method. · 6f4a7f41
      Anton Vorontsov authored
      Lately I've got this nice badness on mdio bus removal:
      
      Device 'e0103120:06' does not have a release() function, it is broken and must be fixed.
      ------------[ cut here ]------------
      Badness at drivers/base/core.c:107
      NIP: c015c1a8 LR: c015c1a8 CTR: c0157488
      REGS: c34bdcf0 TRAP: 0700   Not tainted  (2.6.23-rc5-g9ebadfbb-dirty)
      MSR: 00029032 <EE,ME,IR,DR>  CR: 24088422  XER: 00000000
      ...
      [c34bdda0] [c015c1a8] device_release+0x78/0x80 (unreliable)
      [c34bddb0] [c01354cc] kobject_cleanup+0x80/0xbc
      [c34bddd0] [c01365f0] kref_put+0x54/0x6c
      [c34bdde0] [c013543c] kobject_put+0x24/0x34
      [c34bddf0] [c015c384] put_device+0x1c/0x2c
      [c34bde00] [c0180e84] mdiobus_unregister+0x2c/0x58
      ...
      
      Though actually nothing is broken; the device subsystem core just
      expects another "pattern" of resource management.

      This patch implements the phy device's release function, thus
      we're getting rid of this badness.

      Also a small hidden bug is fixed; hopefully no others are introduced. ;-)
      Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com>
      Acked-by: Andy Fleming <afleming@freescale.com>
      Signed-off-by: Jeff Garzik <jeff@garzik.org>
  13. 03 Dec, 2007 1 commit
    • sched: cpu accounting controller (V2) · d842de87
      Srivatsa Vaddagiri authored
      Commit cfb52856 removed a feature that was useful for us: a cpu
      accounting resource controller.  This feature is useful if someone
      wants to group tasks only for accounting purposes and doesn't
      really want to exercise any control over their cpu consumption.
      
      The patch below reintroduces the feature. It is based on Paul Menage's
      original patch (Commit 62d0df64), with
      these differences:
      
              - Removed load average information. I felt it needs more thought (esp
      	  to deal with SMP and virtualized platforms) and can be added for
      	  2.6.25 after more discussions.
              - Convert group cpu usage to be nanosecond accurate (as rest of the cfs
      	  stats are) and invoke cpuacct_charge() from the respective scheduler
      	  classes
      	- Make accounting scalable on SMP systems by splitting the usage
      	  counter to be per-cpu
      	- Move the code from kernel/cpu_acct.c to kernel/sched.c (since the
      	  code is not big enough to warrant a new file and also this rightly
      	  needs to live inside the scheduler. Also things like accessing
      	  rq->lock while reading cpu usage becomes easier if the code lived in
      	  kernel/sched.c)
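
Splitting the usage counter per CPU, as the list above describes, can be sketched like this. The names, the fixed CPU count, and the absence of locking are simplifications assumed for illustration; they are not taken from the patch, which invokes cpuacct_charge() from the scheduler classes.

```c
#include <stdint.h>

/* Hypothetical CPU count for the sketch. */
#define SKETCH_NR_CPUS 4

/* One nanosecond counter per CPU, so charging on one CPU never touches
 * another CPU's counter. */
struct cpuacct_sketch {
    uint64_t usage_ns[SKETCH_NR_CPUS];
};

/* Charge delta_ns of execution time to the group on the given CPU. */
static void sketch_charge(struct cpuacct_sketch *ca, unsigned int cpu,
                          uint64_t delta_ns)
{
    ca->usage_ns[cpu] += delta_ns;
}

/* Reading the group's usage sums the per-CPU counters. */
static uint64_t sketch_usage(const struct cpuacct_sketch *ca)
{
    uint64_t total = 0;

    for (unsigned int i = 0; i < SKETCH_NR_CPUS; i++)
        total += ca->usage_ns[i];
    return total;
}
```

The cost moves from the frequent path (charging) to the rare path (reading), which is why the split scales better on SMP systems.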
      
      The patch also modifies the cpu controller not to provide the same accounting
      information.
      Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      
       Tested the patches on top of 2.6.24-rc3. The patches work fine. Ran
       some simple tests like cpuspin (spin on the cpu), ran several tasks in
       the same group and timed them. Compared their time stamps with
       cpuacct.usage.
      Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  14. 02 Dec, 2007 1 commit
  15. 01 Dec, 2007 1 commit
    • [NETNS]: Fix /proc/net breakage · 2b1e300a
      Eric W. Biederman authored
      Well I clearly goofed when I added the initial network namespace support
      for /proc/net.  Currently things work but there are odd details visible to
      user space, even when we have a single network namespace.
      
      Since we do not cache proc_dir_entry dentries at the moment we can just
      modify ->lookup to return a different directory inode depending on the
      network namespace of the process looking at /proc/net, replacing the
      current technique of using a magic and fragile follow_link method.
      
      To accomplish that this patch:
      - introduces a shadow_proc method to allow different dentries to
        be returned from proc_lookup.
      - Removes the old /proc/net follow_link magic
      - Fixes a weakness in our not caching of proc generic dentries.
      
      As shadow_proc uses a task struct to decide which dentry to return we can
      go back later and fix the proc generic caching without modifying any code
      that uses the shadow_proc method.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
  16. 30 Nov, 2007 4 commits