1. 10 5月, 2007 40 次提交
    • L
      Revert "ACPICA: fix AML mutex re-entrancy" · 262a7a28
      Len Brown 提交于
      This reverts commit c0d127b5.
      
      These changes to AML locking were made to allow
      Notify handlers to be called on the stack
      and not deadlock.  However, that scheme turns
      out to be flawed and was reverted by the previous commit,
      so this commit restores the locking to it previous design.
      Signed-off-by: NLen Brown <len.brown@intel.com>
      262a7a28
    • L
      Revert "ACPICA: revert "acpi_serialize" changes" · 4d2acd9e
      Len Brown 提交于
      This reverts commit a8f4af6d.
      Thus restoring ACPICA's new acpi_serialize code.
      
      This commit by itself may cause a regression, but
      it is reverted in this order so that subsequent
      reverts reverts under this one can be made
      without conflict.
      Signed-off-by: NLen Brown <len.brown@intel.com>
      4d2acd9e
    • L
      Revert "md: improve partition detection in md array" · 44ce6294
      Linus Torvalds 提交于
      This reverts commit 5b479c91.
      
      Quoth Neil Brown:
      
        "It causes an oops when auto-detecting raid arrays, and it doesn't
         seem easy to fix.
      
         The array may not be 'open' when do_md_run is called, so
         bdev->bd_disk might be NULL, so bd_set_size can oops.
      
         This whole approach of opening an md device before it has been
         assembled just seems to get more and more painful.  I think I'm going
         to have to come up with something clever to provide both backward
         comparability with usage expectation, and sane integration into the
         rest of the kernel."
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      44ce6294
    • B
      ide: legacy PCI bus order probing fixes · 6d208b39
      Bartlomiej Zolnierkiewicz 提交于
      IDE PCI host drivers should register themselves with IDE core only when
      IDE driver is built-in, otherwise (IDE driver is modular and thus IDE PCI
      host drivers are also modular) the code has no effect and just complicates
      the probing.
      
      Fix it by adding new config option CONFIG_IDEPCI_PCIBUS (defined only when
      needed and invisible to the user) and covering by #ifdef/#endif the code
      in question.  It turned out that "ide=reverse" was silently accepted but did
      nothing in case when IDE driver was modular, this is fixed now.
      Signed-off-by: NBartlomiej Zolnierkiewicz <bzolnier@gmail.com>
      6d208b39
    • B
      ide: add ide_proc_register_port() · 5cbf79cd
      Bartlomiej Zolnierkiewicz 提交于
      * create_proc_ide_interfaces() tries to add /proc entries for every probed
        and initialized IDE port, replace it by ide_proc_register_port() which does
        it only for the given port (also rename destroy_proc_ide_interface() to
        ide_proc_unregister_port() for consistency)
        
      * convert {create,destroy}_proc_ide_interface[s]() users to use new functions
      
      * pmac driver depended on proc_ide_create() to add /proc port entries, fix it
        
      * au1xxx-ide, swarm and cs5520 drivers depended indirectly on ide-generic
        driver (CONFIG_IDE_GENERIC=y) to add port /proc entries, fix them
      
      * there is now no need to add /proc entries for IDE ports in proc_ide_create()
        so don't do it
      
      * proc_ide_create() needs now to be called before drivers are probed - fix it,
        while at it make proc_ide_create() create /proc "ide" directory
      Signed-off-by: NBartlomiej Zolnierkiewicz <bzolnier@gmail.com>
      5cbf79cd
    • B
      ide: add "initializing" argument to ide_register_hw() · 869c56ee
      Bartlomiej Zolnierkiewicz 提交于
      Add "initializing" argument to ide_register_hw() and use it instead of ide.c
      wide variable of the same name.  Update all users of ide_register_hw()
      accordingly.
      Signed-off-by: NBartlomiej Zolnierkiewicz <bzolnier@gmail.com>
      869c56ee
    • B
      ide: cable detection fixes (take 2) · 7f8f48af
      Bartlomiej Zolnierkiewicz 提交于
      Tejun's recent eighty_ninty_three() fix has inspired me to do more thorough
      review of the cable detection code...
      
      * print user-friendly warning about limiting the maximum transfer speed
        to UDMA33 (and the reason behind it) when 80-wire cable is not detected,
        also while at it cleanup eighty_ninty_three() a bit
      
      * use eighty_ninty_three() in ide_ata66_check(), this actually fixes 3 bugs:
        - bit 14 (word 93 validity check) == 1 && bit 13 (80-wire cable test) == 1
          were used as 80-wire cable present test for CONFIG_IDEDMA_IVB=n case
          (please see FIXME comment in eighty_ninty_three() for more details)
        - CONFIG_IDEDMA_IVB=y/n cases were interchanged
        - check for SATA devices was missing
      
      * remove private cable warnings from pdc_202xx{old,new} drivers now that core
        code provides this functionality (plus, in pdc202xx_new case the test could
        give false warnings for ATAPI devices because pdc202xx_new driver doesn't
        even support ATAPI DMA)
      
      Cc: Tejun Heo <htejun@gmail.com>
      Signed-off-by: NBartlomiej Zolnierkiewicz <bzolnier@gmail.com>
      7f8f48af
    • B
      ide: move IDE settings handling to ide-proc.c · 7662d046
      Bartlomiej Zolnierkiewicz 提交于
      * move
      	__ide_add_setting()
      	ide_add_setting()
      	__ide_remove_setting()
      	auto_remove_settings()
      	ide_find_setting_by_name()
      	ide_read_setting()
      	ide_write_setting()
      	set_xfer_rate()
      	ide_add_generic_settings()
      	ide_register_subdriver()
      	ide_unregister_subdriver()
      
        from ide.c to ide-proc.c
      
      * set_{io_32bit,pio_mode,using_dma}() cannot be marked static now, fix it
      
      * rename ide_[un]register_subdriver() to ide_proc_[un]register_driver(),
        update device drivers to use new names
      
      * add CONFIG_IDE_PROC_FS=n versions of ide_proc_[un]register_driver()
        and ide_add_generic_settings()
      
      * make ide_find_setting_by_name(), ide_{read,write}_setting()
        and ide_{add,remove}_proc_entries() static
      
      * cover IDE settings code in device drivers with CONFIG_IDE_PROC_FS #ifdef,
        also while at it cover with CONFIG_IDE_PROC_FS #ifdef ide_driver_t.proc
      
      * remove bogus comment from ide.h
      
      * cover with CONFIG_IDE_PROC_FS #ifdef .proc and .settings in ide_drive_t
      
      Besides saner code this patch results in the IDE core smaller by ~2 kB
      (on x86-32) and IDE disk driver by ~1 kB (ditto) when CONFIG_IDE_PROC_FS=n.
      Signed-off-by: NBartlomiej Zolnierkiewicz <bzolnier@gmail.com>
      7662d046
    • B
      ide: split off ioctl handling from IDE settings (v2) · 1497943e
      Bartlomiej Zolnierkiewicz 提交于
      * do write permission and min/max checks in ide_procset_t functions
      
      * ide-disk.c: drive->id is always available so cleanup "multcount" setting
        accordingly
      
      * ide-disk.c: "address" setting was incorrectly defined as type TYPE_INTA,
        fix it by using type TYPE_BYTE and updating ide_drive_t->adressing field,
        the bug didn't trigger because this IDE setting uses custom ->set function
      
      * ide.c: add set_ksettings() for handling HDIO_SET_KEEPSETTINGS ioctl
      
      * ide.c: add set_unmaskirq() for handling HDIO_SET_UNMASKINTR ioctl
      
      * handle ioctls directly in generic_ide_ioclt() and idedisk_ioctl()
        instead of using IDE settings to deal with them
      
      * remove no longer needed ide_find_setting_by_ioctl() and {read,write}_ioctl
        fields from ide_settings_t, also remove now unused TYPE_INTA handling
      
      v2:
      * add missing EXPORT_SYMBOL_GPL(ide_setting_sem) needed now for ide-disk
      Signed-off-by: NBartlomiej Zolnierkiewicz <bzolnier@gmail.com>
      1497943e
    • B
      ide: make /proc/ide/ optional · ecfd80e4
      Bartlomiej Zolnierkiewicz 提交于
      All important information/features should be already available through
      sysfs and ioctl interfaces.
      
      Add CONFIG_IDE_PROC_FS (CONFIG_SCSI_PROC_FS rip-off) config option,
      disabling it makes IDE driver ~5 kB smaller (on x86-32).
      
      While at it add CONFIG_PROC_FS=n versions of proc_ide_{create,destroy}()
      and remove no longer needed #ifdefs.
      Signed-off-by: NBartlomiej Zolnierkiewicz <bzolnier@gmail.com>
      ecfd80e4
    • B
      ide: add ide_tune_dma() helper · 29e744d0
      Bartlomiej Zolnierkiewicz 提交于
      After reworking the code responsible for selecting the best DMA
      transfer mode it is now possible to add generic ide_tune_dma() helper.
      
      Convert some IDE PCI host drivers to use it (the ones left need more work).
      Signed-off-by: NBartlomiej Zolnierkiewicz <bzolnier@gmail.com>
      29e744d0
    • B
      ide: rework the code for selecting the best DMA transfer mode (v3) · 2d5eaa6d
      Bartlomiej Zolnierkiewicz 提交于
      Depends on the "ide: fix UDMA/MWDMA/SWDMA masks" patch.
      
      * add ide_hwif_t.udma_filter hook for filtering UDMA mask
        (use it in alim15x3, hpt366, siimage and serverworks drivers)
      * add ide_max_dma_mode() for finding best DMA mode for the device
        (loosely based on some older libata-core.c code)
      * convert ide_dma_speed() users to use ide_max_dma_mode()
      * make ide_rate_filter() take "ide_drive_t *drive" as an argument instead
        of "u8 mode" and teach it to how to use UDMA mask to do filtering
      * use ide_rate_filter() in hpt366 driver
      * remove no longer needed ide_dma_speed() and *_ratemask()
      * unexport eighty_ninty_three()
      
      v2:
      * rename ->filter_udma_mask to ->udma_filter
        [ Suggested by Sergei Shtylyov <sshtylyov@ru.mvista.com>. ]
      
      v3:
      * updated for scc_pata driver (fixes XFER_UDMA_6 filtering for user-space
        originated transfer mode change requests when 100MHz clock is used)
      Signed-off-by: NBartlomiej Zolnierkiewicz <bzolnier@gmail.com>
      2d5eaa6d
    • B
      ide: fix UDMA/MWDMA/SWDMA masks (v3) · 18137207
      Bartlomiej Zolnierkiewicz 提交于
      * use 0x00 instead of 0x80 to disable ->{ultra,mwdma,swdma}_mask
      * add udma_mask field to ide_pci_device_t and use it to initialize
        ->ultra_mask in aec62xx, cmd64x, pdc202xx_{new,old} and piix drivers
      * fix UDMA masks to match with chipset specific *_ratemask()
        (alim15x3, hpt366, serverworks and siimage drivers need UDMA mask
         filtering method - done in the next patch)
      
      v2:
      * piix: fix cable detection for 82801AA_1 and 82372FB_1
        [ Noticed by Sergei Shtylyov <sshtylyov@ru.mvista.com>. ]
      * cmd64x: use hwif->cds->udma_mask
        [ Suggested by Sergei Shtylyov <sshtylyov@ru.mvista.com>. ]
      * aec62xx: fix newly introduced bug - check DMA status not command register
        [ Noticed by Sergei Shtylyov <sshtylyov@ru.mvista.com>. ]
      
      v3:
      * piix: use hwif->cds->udma_mask
        [ Suggested by Sergei Shtylyov <sshtylyov@ru.mvista.com>. ]
      Signed-off-by: NBartlomiej Zolnierkiewicz <bzolnier@gmail.com>
      18137207
    • H
      i386: msr.h: be paranoid about types and parentheses · b0b73cb4
      H. Peter Anvin 提交于
      When implementing things as macros, make sure we use typecasts and
      parentheses where needed.  The macros as defined were vulnerable to
      surreptitious promotion causing problems.
      
      Avoid macros where practical; e.g. wrmsr() can be an inline instead.
      Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b0b73cb4
    • H
      i386: remove unused rdtsc() macro · 29bd4433
      H. Peter Anvin 提交于
      All users to the two-part rdtsc() macro have already switched to using
      rdtscl() or rdtscll().  Remove the now-obsolete macro.
      Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      29bd4433
    • N
      md: improve partition detection in md array · 5b479c91
      NeilBrown 提交于
      md currently uses ->media_changed to make sure rescan_partitions
      is call on md array after they are assembled.
      
      However that doesn't happen until the array is opened, which is later
      than some people would like.
      
      So use blkdev_ioctl to do the rescan immediately that the
      array has been assembled.
      
      This means we can remove all the ->change infrastructure as it was only used
      to trigger a partition rescan.
      Signed-off-by: NNeil Brown <neilb@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5b479c91
    • H
      fbdev: add support for AVR32 · 880169dd
      Haavard Skinnemoen 提交于
      Provide framebuffer page protection flags and definitions of
      fb_readl/fb_writel for AVR32.
      Signed-off-by: NHaavard Skinnemoen <hskinnemoen@atmel.com>
      Cc: "Antonino A. Daplas" <adaplas@pol.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      880169dd
    • A
      svgalib: move fb_get_caps to svgalib · 5a87ede9
      Antonino A. Daplas 提交于
      Move fb_get_caps() method to svgalib.c as svga_get_caps() so it can be used by
      s3fb, arkfb and vt8623fb.
      Signed-off-by: NAntonino Daplas <adaplas@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5a87ede9
    • D
      i386 mmzone: use __maybe_unused · c3c117f0
      David Rientjes 提交于
      Replace automatic variable instances of __attribute__ ((unused)) with
      __maybe_unused.
      
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c3c117f0
    • D
      sh: dma: use __maybe_unused · d16aaffa
      David Rientjes 提交于
      There is no such thing as labeling a variable as __attribute__((used)).  Since
      ts_shift is not referenced in inline assembly, we assume that we're simply
      suppressing a warning here if the variable is declared but unreferenced.
      
      Cc: Paul Mundt <lethal@linux-sh.org>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d16aaffa
    • D
      compiler: introduce __used and __maybe_unused · 0d7ebbbc
      David Rientjes 提交于
      __used is defined to be __attribute__((unused)) for all pre-3.3 gcc
      compilers to suppress warnings for unused functions because perhaps they
      are referenced only in inline assembly.  It is defined to be
      __attribute__((used)) for gcc 3.3 and later so that the code is still
      emitted for such functions.
      
      __maybe_unused is defined to be __attribute__((unused)) for both function
      and variable use if it could possibly be unreferenced due to the evaluation
      of preprocessor macros.  Function prototypes shall be marked with
      __maybe_unused if the actual definition of the function is dependant on
      preprocessor macros.
      
      No update to compiler-intel.h is necessary because ICC supports both
      __attribute__((used)) and __attribute__((unused)) as specified by the gcc
      manual.
      
      __attribute_used__ is deprecated and will be removed once all current
      code is converted to using __used.
      
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Adrian Bunk <bunk@stusta.de>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0d7ebbbc
    • R
      rename thread_info to stack · f7e4217b
      Roman Zippel 提交于
      This finally renames the thread_info field in task structure to stack, so that
      the assumptions about this field are gone and archs have more freedom about
      placing the thread_info structure.
      
      Nonbroken archs which have a proper thread pointer can do the access to both
      current thread and task structure via a single pointer.
      
      It'll allow for a few more cleanups of the fork code, from which e.g.  ia64
      could benefit.
      Signed-off-by: NRoman Zippel <zippel@linux-m68k.org>
      [akpm@linux-foundation.org: build fix]
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Ian Molton <spyro@f2s.com>
      Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Roman Zippel <zippel@linux-m68k.org>
      Cc: Greg Ungerer <gerg@uclinux.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
      Cc: Richard Curnow <rc@rc0.org.uk>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
      Cc: Andi Kleen <ak@muc.de>
      Cc: Chris Zankel <chris@zankel.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f7e4217b
    • R
      wrap access to thread_info · c9f4f06d
      Roman Zippel 提交于
      Recently a few direct accesses to the thread_info in the task structure snuck
      back, so this wraps them with the appropriate wrapper.
      Signed-off-by: NRoman Zippel <zippel@linux-m68k.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c9f4f06d
    • R
      Allow arch to initialize arch field of the module structure · e61a1c1c
      Roman Zippel 提交于
      This will later allow an arch to add module specific information via linker
      generated tables instead of poking directly in the module object structure.
      Signed-off-by: NRoman Zippel <zippel@linux-m68k.org>
      Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e61a1c1c
    • T
      clocksource: fix resume logic · b52f52a0
      Thomas Gleixner 提交于
      We need to make sure that the clocksources are resumed, when timekeeping is
      resumed.  The current resume logic does not guarantee this.
      
      Add a resume function pointer to the clocksource struct, so clocksource
      drivers which need to reinitialize the clocksource can provide a resume
      function.
      
      Add a resume function, which calls the maybe available clocksource resume
      functions and resets the watchdog function, so a stable TSC can be used
      accross suspend/resume.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: john stultz <johnstul@us.ibm.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b52f52a0
    • C
      Move remote node draining out of slab allocators · 4037d452
      Christoph Lameter 提交于
      Currently the slab allocators contain callbacks into the page allocator to
      perform the draining of pagesets on remote nodes.  This requires SLUB to have
      a whole subsystem in order to be compatible with SLAB.  Moving node draining
      out of the slab allocators avoids a section of code in SLUB.
      
      Move the node draining so that is is done when the vm statistics are updated.
      At that point we are already touching all the cachelines with the pagesets of
      a processor.
      
      Add a expire counter there.  If we have to update per zone or global vm
      statistics then assume that the pageset will require subsequent draining.
      
      The expire counter will be decremented on each vm stats update pass until it
      reaches zero.  Then we will drain one batch from the pageset.  The draining
      will cause vm counter updates which will then cause another expiration until
      the pcp is empty.  So we will drain a batch every 3 seconds.
      
      Note that remote node draining is a somewhat esoteric feature that is required
      on large NUMA systems because otherwise significant portions of system memory
      can become trapped in pcp queues.  The number of pcp is determined by the
      number of processors and nodes in a system.  A system with 4 processors and 2
      nodes has 8 pcps which is okay.  But a system with 1024 processors and 512
      nodes has 512k pcps with a high potential for large amount of memory being
      caught in them.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4037d452
    • C
      vmstat: use our own timer events · d1187ed2
      Christoph Lameter 提交于
      vmstat is currently using the cache reaper to periodically bring the
      statistics up to date.  The cache reaper does only exists in SLUB as a way to
      provide compatibility with SLAB.  This patch removes the vmstat calls from the
      slab allocators and provides its own handling.
      
      The advantage is also that we can use a different frequency for the updates.
      Refreshing vm stats is a pretty fast job so we can run this every second and
      stagger this by only one tick.  This will lead to some overlap in large
      systems.  F.e a system running at 250 HZ with 1024 processors will have 4 vm
      updates occurring at once.
      
      However, the vm stats update only accesses per node information.  It is only
      necessary to stagger the vm statistics updates per processor in each node.  Vm
      counter updates occurring on distant nodes will not cause cacheline
      contention.
      
      We could implement an alternate approach that runs the first processor on each
      node at the second and then each of the other processor on a node on a
      subsequent tick.  That may be useful to keep a large amount of the second free
      of timer activity.  Maybe the timer folks will have some feedback on this one?
      
      [jirislaby@gmail.com: add missing break]
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NJiri Slaby <jirislaby@gmail.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d1187ed2
    • R
      Add suspend-related notifications for CPU hotplug · 8bb78442
      Rafael J. Wysocki 提交于
      Since nonboot CPUs are now disabled after tasks and devices have been
      frozen and the CPU hotplug infrastructure is used for this purpose, we need
      special CPU hotplug notifications that will help the CPU-hotplug-aware
      subsystems distinguish normal CPU hotplug events from CPU hotplug events
      related to a system-wide suspend or resume operation in progress.  This
      patch introduces such notifications and causes them to be used during
      suspend and resume transitions.  It also changes all of the
      CPU-hotplug-aware subsystems to take these notifications into consideration
      (for now they are handled in the same way as the corresponding "normal"
      ones).
      
      [oleg@tv-sign.ru: cleanups]
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      Cc: Gautham R Shenoy <ego@in.ibm.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8bb78442
    • N
      fs: deprecate memclear_highpage_flush · f37bc271
      Nate Diller 提交于
      Now that all the in-tree users are converted over to zero_user_page(),
      deprecate the old memclear_highpage_flush() call.
      Signed-off-by: NNate Diller <nate.diller@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f37bc271
    • N
      fs: convert core functions to zero_user_page · 01f2705d
      Nate Diller 提交于
      It's very common for file systems to need to zero part or all of a page,
      the simplist way is just to use kmap_atomic() and memset().  There's
      actually a library function in include/linux/highmem.h that does exactly
      that, but it's confusingly named memclear_highpage_flush(), which is
      descriptive of *how* it does the work rather than what the *purpose* is.
      So this patchset renames the function to zero_user_page(), and calls it
      from the various places that currently open code it.
      
      This first patch introduces the new function call, and converts all the
      core kernel callsites, both the open-coded ones and the old
      memclear_highpage_flush() ones.  Following this patch is a series of
      conversions for each file system individually, per AKPM, and finally a
      patch deprecating the old call.  The diffstat below shows the entire
      patchset.
      
      [akpm@linux-foundation.org: fix a few things]
      Signed-off-by: NNate Diller <nate.diller@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      01f2705d
    • E
      FUTEX: new PRIVATE futexes · 34f01cc1
      Eric Dumazet 提交于
        Analysis of current linux futex code :
        --------------------------------------
      
      A central hash table futex_queues[] holds all contexts (futex_q) of waiting
      threads.
      
      Each futex_wait()/futex_wait() has to obtain a spinlock on a hash slot to
      perform lookups or insert/deletion of a futex_q.
      
      When a futex_wait() is done, calling thread has to :
      
      1) - Obtain a read lock on mmap_sem to be able to validate the user pointer
           (calling find_vma()). This validation tells us if the futex uses
           an inode based store (mapped file), or mm based store (anonymous mem)
      
      2) - compute a hash key
      
      3) - Atomic increment of reference counter on an inode or a mm_struct
      
      4) - lock part of futex_queues[] hash table
      
      5) - perform the test on value of futex.
      	(rollback is value != expected_value, returns EWOULDBLOCK)
      	(various loops if test triggers mm faults)
      
      6) queue the context into hash table, release the lock got in 4)
      
      7) - release the read_lock on mmap_sem
      
         <block>
      
      8) Eventually unqueue the context (but rarely, as this part  may be done
         by the futex_wake())
      
      Futexes were designed to improve scalability but current implementation has
      various problems :
      
      - Central hashtable :
      
        This means scalability problems if many processes/threads want to use
        futexes at the same time.
        This means NUMA unbalance because this hashtable is located on one node.
      
      - Using mmap_sem on every futex() syscall :
      
        Even if mmap_sem is a rw_semaphore, up_read()/down_read() are doing atomic
        ops on mmap_sem, dirtying cache line :
          - lot of cache line ping pongs on SMP configurations.
      
        mmap_sem is also extensively used by mm code (page faults, mmap()/munmap())
        Highly threaded processes might suffer from mmap_sem contention.
      
        mmap_sem is also used by oprofile code. Enabling oprofile hurts threaded
        programs because of contention on the mmap_sem cache line.
      
      - Using an atomic_inc()/atomic_dec() on inode ref counter or mm ref counter:
        It's also a cache line ping pong on SMP. It also increases mmap_sem hold time
        because of cache misses.
      
      Most of these scalability problems come from the fact that futexes are in
      one global namespace.  As we use a central hash table, we must make sure
      they are all using the same reference (given by the mm subsystem).  We
      chose to force all futexes be 'shared'.  This has a cost.
      
      But fact is POSIX defined PRIVATE and SHARED, allowing clear separation,
      and optimal performance if carefuly implemented.  Time has come for linux
      to have better threading performance.
      
      The goal is to permit new futex commands to avoid :
       - Taking the mmap_sem semaphore, conflicting with other subsystems.
       - Modifying a ref_count on mm or an inode, still conflicting with mm or fs.
      
      This is possible because, for one process using PTHREAD_PROCESS_PRIVATE
      futexes, we only need to distinguish futexes by their virtual address, no
      matter the underlying mm storage is.
      
      If glibc wants to exploit this new infrastructure, it should use new
      _PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes.  And be
      prepared to fallback on old subcommands for old kernels.  Using one global
      variable with the FUTEX_PRIVATE_FLAG or 0 value should be OK.
      
      PTHREAD_PROCESS_SHARED futexes should still use the old subcommands.
      
      Compatibility with old applications is preserved, they still hit the
      scalability problems, but new applications can fly :)
      
      Note : the same SHARED futex (mapped on a file) can be used by old binaries
      *and* new binaries, because both binaries will use the old subcommands.
      
      Note : Vast majority of futexes should be using PROCESS_PRIVATE semantic,
      as this is the default semantic. Almost all applications should benefit
      of this changes (new kernel and updated libc)
      
      Some bench results on a Pentium M 1.6 GHz (SMP kernel on a UP machine)
      
      /* calling futex_wait(addr, value) with value != *addr */
      433 cycles per futex(FUTEX_WAIT) call (mixing 2 futexes)
      424 cycles per futex(FUTEX_WAIT) call (using one futex)
      334 cycles per futex(FUTEX_WAIT_PRIVATE) call (mixing 2 futexes)
      334 cycles per futex(FUTEX_WAIT_PRIVATE) call (using one futex)
      For reference :
      187 cycles per getppid() call
      188 cycles per umask() call
      181 cycles per ni_syscall() call
      Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
      Pierre Peiffer <pierre.peiffer@bull.net>
      Cc: "Ulrich Drepper" <drepper@gmail.com>
      Cc: "Nick Piggin" <nickpiggin@yahoo.com.au>
      Cc: "Ingo Molnar" <mingo@elte.hu>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      34f01cc1
    • P
      futex_requeue_pi optimization · d0aa7a70
      Pierre Peiffer 提交于
      This patch provides the futex_requeue_pi functionality, which allows some
      threads waiting on a normal futex to be requeued on the wait-queue of a
      PI-futex.
      
      This provides an optimization, already used for (normal) futexes, to be used
      with the PI-futexes.
      
      This optimization is currently used by the glibc in pthread_broadcast, when
      using "normal" mutexes.  With futex_requeue_pi, it can be used with
      PRIO_INHERIT mutexes too.
      Signed-off-by: NPierre Peiffer <pierre.peiffer@bull.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0aa7a70
    • P
      Make futex_wait() use an hrtimer for timeout · c19384b5
      Pierre Peiffer 提交于
      This patch modifies futex_wait() to use an hrtimer + schedule() in place of
      schedule_timeout().
      
      schedule_timeout() is tick based, therefore the timeout granularity is the
      tick (1 ms, 4 ms or 10 ms depending on HZ).  By using a high resolution timer
      for timeout wakeup, we can attain a much finer timeout granularity (in the
      microsecond range).  This parallels what is already done for futex_lock_pi().
      
      The timeout passed to the syscall is no longer converted to jiffies and is
      therefore passed to do_futex() and futex_wait() as an absolute ktime_t
      therefore keeping nanosecond resolution.
      
      Also this removes the need to pass the nanoseconds timeout part to
      futex_lock_pi() in val2.
      
      In futex_wait(), if there is no timeout then a regular schedule() is
      performed.  Otherwise, an hrtimer is fired before schedule() is called.
      
      [akpm@linux-foundation.org: fix `make headers_check']
      Signed-off-by: NSebastien Dugue <sebastien.dugue@bull.net>
      Signed-off-by: NPierre Peiffer <pierre.peiffer@bull.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c19384b5
    • A
      declare struct ktime · f34c506b
      Andrew Morton 提交于
      Some smarty went and inflicted ktime_t as a typedef upon us, so we cannot
      forward declare it.
      
      Create a new `union ktime', map ktime_t onto that.  Now we need to kill off
      this ktime_t thing.
      
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: john stultz <johnstul@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f34c506b
    • A
      aio is unlikely · b8522ead
      Andrew Morton 提交于
      Stick an unlikely() around is_aio(): I assert that most IO is synchronous.
      
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b8522ead
    • J
      RPC: add wrapper for svc_reserve to account for checksum · cd123012
      Jeff Layton 提交于
      When the kernel calls svc_reserve to downsize the expected size of an RPC
      reply, it fails to account for the possibility of a checksum at the end of
      the packet.  If a client mounts a NFSv2/3 with sec=krb5i/p, and does I/O
      then you'll generally see messages similar to this in the server's ring
      buffer:
      
      RPC request reserved 164 but used 208
      
      While I was never able to verify it, I suspect that this problem is also
      the root cause of some oopses I've seen under these conditions:
      
      https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=227726
      
      This is probably also a problem for other sec= types and for NFSv4.  The
      large reserved size for NFSv4 compound packets seems to generally paper
      over the problem, however.
      
      This patch adds a wrapper for svc_reserve that accounts for the possibility
      of a checksum.  It also fixes up the appropriate callers of svc_reserve to
      call the wrapper.  For now, it just uses a hardcoded value that I
      determined via testing.  That value may need to be revised upward as things
      change, or we may want to eventually add a new auth_op that attempts to
      calculate this somehow.
      
      Unfortunately, there doesn't seem to be a good way to reliably determine
      the expected checksum length prior to actually calculating it, particularly
      with schemes like spkm3.
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      Acked-by: NNeil Brown <neilb@suse.de>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Acked-by: NJ. Bruce Fields <bfields@citi.umich.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cd123012
    • N
      knfsd: rename sk_defer_lock to sk_lock · 7ac1bea5
      NeilBrown 提交于
      Now that sk_defer_lock protects two different things, make the name more
      generic.
      
      Also don't bother with disabling _bh as the lock is only ever taken from
      process context.
      Signed-off-by: NNeil Brown <neilb@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7ac1bea5
    • A
      remove nfs4_acl_add_ace() · 8842c965
      Adrian Bunk 提交于
      nfs4_acl_add_ace() can now be removed.
      Signed-off-by: NAdrian Bunk <bunk@stusta.de>
      Acked-by: NNeil Brown <neilb@cse.unsw.edu.au>
      Acked-by: NJ. Bruce Fields <bfields@citi.umich.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8842c965
    • O
      change kernel threads to ignore signals instead of blocking them · 10ab825b
      Oleg Nesterov 提交于
      Currently kernel threads use sigprocmask(SIG_BLOCK) to protect against
      signals.  This doesn't prevent the signal delivery, this only blocks
      signal_wake_up().  Every "killall -33 kthreadd" means a "struct siginfo"
      leak.
      
      Change kthreadd_setup() to set all handlers to SIG_IGN instead of blocking
      them (make a new helper ignore_signals() for that).  If the kernel thread
      needs some signal, it should use allow_signal() anyway, and in that case it
      should not use CLONE_SIGHAND.
      
      Note that we can't change daemonize() (should die!) in the same way,
      because it can be used along with CLONE_SIGHAND.  This means that
      allow_signal() still should unblock the signal to work correctly with
      daemonize()ed threads.
      
      However, disallow_signal() doesn't block the signal any longer but ignores
      it.
      
      NOTE: with or without this patch the kernel threads are not protected from
      handle_stop_signal(), this seems harmless, but not good.
      Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
      Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      10ab825b
    • E
      kthread: don't depend on work queues · 73c27992
      Eric W. Biederman 提交于
      Currently there is a circular reference between work queue initialization
      and kthread initialization.  This prevents the kthread infrastructure from
      initializing until after work queues have been initialized.
      
      We want the properties of tasks created with kthread_create to be as close
      as possible to the init_task and to not be contaminated by user processes.
      The later we start our kthreadd that creates these tasks the harder it is
      to avoid contamination from user processes and the more of a mess we have
      to clean up because the defaults have changed on us.
      
      So this patch modifies the kthread support to not use work queues but to
      instead use a simple list of structures, and to have kthreadd start from
      init_task immediately after our kernel thread that execs /sbin/init.
      
      By being a true child of init_task we only have to change those process
      settings that we want to have different from init_task, such as our process
      name, the cpus that are allowed, blocking all signals and setting SIGCHLD
      to SIG_IGN so that all of our children are reaped automatically.
      
      By being a true child of init_task we also naturally get our ppid set to 0
      and do not wind up as a child of PID == 1.  Ensuring that tasks generated
      by kthread_create will not slow down the functioning of the wait family of
      functions.
      
      [akpm@linux-foundation.org: use interruptible sleeps]
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      73c27992