1. 11 May 2007, 10 commits
  2. 10 May 2007, 30 commits
    • [POWERPC] pmu_sys_suspended is only defined for PPC32 · 49d687b6
      Committed by Stephen Rothwell
      thus we get a link error on ppc64 with CONFIG_PM=y.  This fixes it.
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
    • acpi,msi-laptop: Fall back to EC polling mode for MSI laptop specific EC commands · 00eb43a1
      Committed by Lennart Poettering
      The ACPI EC that is used in MSI laptops knows some non-standard
      commands for changing the screen brightness and a few other things,
      which are used by the msi-laptop.c driver. Unfortunately, no GPE
      events for IBF and OBF are triggered for these commands. Since the
      EC code nowadays uses the ec_intr=1 mode by default, these
      operations time out, although they don't fail. As a result, every
      operation that you can do with the msi-laptop.c driver takes roughly
      1s to complete, which is awfully slow.
      
      In one of the more recent kernels (2.6.20?) the EC subsystem has been
      revamped. With that change the EC timeout has been increased. Before
      that increase the MSI EC accesses were slow -- but not *that* slow,
      hence I noticed this limitation of the MSI EC hardware only very
      recently.
      
      The standard EC operations on the MSI EC as defined in the ACPI spec
      support GPE events properly.
      
      The following patch adds a new argument "force_poll" to the
      ec_transaction() function (and friends). If set to 1, the function
      will poll for IBF/OBF even if ec_intr=1 is enabled. If set to 0 the
      current behaviour is used. The msi-laptop driver is modified to make
      use of this new flag, so that OBF/IBF is polled for the special MSI EC
      transactions -- but only for them.
      Signed-off-by: Lennart Poettering <mzxreary@0pointer.de>
      Acked-by: Alexey Starikovskiy <aystarik@gmail.com>
      Signed-off-by: Len Brown <len.brown@intel.com>
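A minimal sketch of the force_poll decision described in the commit above. The enum and the function name ec_pick_wait_mode() are hypothetical stand-ins, not the kernel's EC API; only the rule "poll if force_poll is set, even when ec_intr=1" comes from the commit message.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the EC driver state; not the kernel's real API. */
enum ec_wait_mode { EC_WAIT_INTERRUPT, EC_WAIT_POLL };

/*
 * Decide how to wait for IBF/OBF during one transaction.
 * ec_intr:    global default (true = rely on GPE interrupts).
 * force_poll: per-transaction override requested by the caller.
 */
static enum ec_wait_mode ec_pick_wait_mode(bool ec_intr, bool force_poll)
{
    if (force_poll || !ec_intr)
        return EC_WAIT_POLL;
    return EC_WAIT_INTERRUPT;
}
```

Under this sketch a driver such as msi-laptop would pass force_poll = true only for its non-standard commands, so standard EC operations keep the interrupt-driven path.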
    • Revert "md: improve partition detection in md array" · 44ce6294
      Committed by Linus Torvalds
      This reverts commit 5b479c91.
      
      Quoth Neil Brown:
      
        "It causes an oops when auto-detecting raid arrays, and it doesn't
         seem easy to fix.
      
         The array may not be 'open' when do_md_run is called, so
         bdev->bd_disk might be NULL, so bd_set_size can oops.
      
         This whole approach of opening an md device before it has been
         assembled just seems to get more and more painful.  I think I'm going
         to have to come up with something clever to provide both backward
         compatibility with usage expectations, and sane integration into the
         rest of the kernel."
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ide: legacy PCI bus order probing fixes · 6d208b39
      Committed by Bartlomiej Zolnierkiewicz
      IDE PCI host drivers should register themselves with IDE core only when
      IDE driver is built-in, otherwise (IDE driver is modular and thus IDE PCI
      host drivers are also modular) the code has no effect and just complicates
      the probing.
      
      Fix it by adding new config option CONFIG_IDEPCI_PCIBUS (defined only when
      needed and invisible to the user) and covering by #ifdef/#endif the code
      in question.  It turned out that "ide=reverse" was silently accepted but did
      nothing in case when IDE driver was modular, this is fixed now.
      Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
    • ide: add ide_proc_register_port() · 5cbf79cd
      Committed by Bartlomiej Zolnierkiewicz
      * create_proc_ide_interfaces() tries to add /proc entries for every probed
        and initialized IDE port, replace it by ide_proc_register_port() which does
        it only for the given port (also rename destroy_proc_ide_interface() to
        ide_proc_unregister_port() for consistency)
        
      * convert {create,destroy}_proc_ide_interface[s]() users to use new functions
      
      * pmac driver depended on proc_ide_create() to add /proc port entries, fix it
        
      * au1xxx-ide, swarm and cs5520 drivers depended indirectly on ide-generic
        driver (CONFIG_IDE_GENERIC=y) to add port /proc entries, fix them
      
      * there is now no need to add /proc entries for IDE ports in proc_ide_create()
        so don't do it
      
      * proc_ide_create() needs now to be called before drivers are probed - fix it,
        while at it make proc_ide_create() create /proc "ide" directory
      Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
    • ide: add "initializing" argument to ide_register_hw() · 869c56ee
      Committed by Bartlomiej Zolnierkiewicz
      Add an "initializing" argument to ide_register_hw() and use it instead of the
      ide.c-wide variable of the same name.  Update all users of ide_register_hw()
      accordingly.
      Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
    • ide: cable detection fixes (take 2) · 7f8f48af
      Committed by Bartlomiej Zolnierkiewicz
      Tejun's recent eighty_ninty_three() fix has inspired me to do more thorough
      review of the cable detection code...
      
      * print user-friendly warning about limiting the maximum transfer speed
        to UDMA33 (and the reason behind it) when 80-wire cable is not detected,
        also while at it cleanup eighty_ninty_three() a bit
      
      * use eighty_ninty_three() in ide_ata66_check(), this actually fixes 3 bugs:
        - bit 14 (word 93 validity check) == 1 && bit 13 (80-wire cable test) == 1
          were used as 80-wire cable present test for CONFIG_IDEDMA_IVB=n case
          (please see FIXME comment in eighty_ninty_three() for more details)
        - CONFIG_IDEDMA_IVB=y/n cases were interchanged
        - check for SATA devices was missing
      
      * remove private cable warnings from pdc_202xx{old,new} drivers now that core
        code provides this functionality (plus, in pdc202xx_new case the test could
        give false warnings for ATAPI devices because pdc202xx_new driver doesn't
        even support ATAPI DMA)
      
      Cc: Tejun Heo <htejun@gmail.com>
      Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
    • ide: move IDE settings handling to ide-proc.c · 7662d046
      Committed by Bartlomiej Zolnierkiewicz
      * move
      	__ide_add_setting()
      	ide_add_setting()
      	__ide_remove_setting()
      	auto_remove_settings()
      	ide_find_setting_by_name()
      	ide_read_setting()
      	ide_write_setting()
      	set_xfer_rate()
      	ide_add_generic_settings()
      	ide_register_subdriver()
      	ide_unregister_subdriver()
      
        from ide.c to ide-proc.c
      
      * set_{io_32bit,pio_mode,using_dma}() cannot be marked static now, fix it
      
      * rename ide_[un]register_subdriver() to ide_proc_[un]register_driver(),
        update device drivers to use new names
      
      * add CONFIG_IDE_PROC_FS=n versions of ide_proc_[un]register_driver()
        and ide_add_generic_settings()
      
      * make ide_find_setting_by_name(), ide_{read,write}_setting()
        and ide_{add,remove}_proc_entries() static
      
      * cover IDE settings code in device drivers with CONFIG_IDE_PROC_FS #ifdef,
        also while at it cover with CONFIG_IDE_PROC_FS #ifdef ide_driver_t.proc
      
      * remove bogus comment from ide.h
      
      * cover with CONFIG_IDE_PROC_FS #ifdef .proc and .settings in ide_drive_t
      
      Besides saner code this patch results in the IDE core smaller by ~2 kB
      (on x86-32) and IDE disk driver by ~1 kB (ditto) when CONFIG_IDE_PROC_FS=n.
      Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
    • ide: split off ioctl handling from IDE settings (v2) · 1497943e
      Committed by Bartlomiej Zolnierkiewicz
      * do write permission and min/max checks in ide_procset_t functions
      
      * ide-disk.c: drive->id is always available so cleanup "multcount" setting
        accordingly
      
      * ide-disk.c: "address" setting was incorrectly defined as type TYPE_INTA,
        fix it by using type TYPE_BYTE and updating ide_drive_t->addressing field,
        the bug didn't trigger because this IDE setting uses custom ->set function
      
      * ide.c: add set_ksettings() for handling HDIO_SET_KEEPSETTINGS ioctl
      
      * ide.c: add set_unmaskirq() for handling HDIO_SET_UNMASKINTR ioctl
      
      * handle ioctls directly in generic_ide_ioctl() and idedisk_ioctl()
        instead of using IDE settings to deal with them
      
      * remove no longer needed ide_find_setting_by_ioctl() and {read,write}_ioctl
        fields from ide_settings_t, also remove now unused TYPE_INTA handling
      
      v2:
      * add missing EXPORT_SYMBOL_GPL(ide_setting_sem) needed now for ide-disk
      Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
    • ide: make /proc/ide/ optional · ecfd80e4
      Committed by Bartlomiej Zolnierkiewicz
      All important information/features should be already available through
      sysfs and ioctl interfaces.
      
      Add CONFIG_IDE_PROC_FS (CONFIG_SCSI_PROC_FS rip-off) config option,
      disabling it makes IDE driver ~5 kB smaller (on x86-32).
      
      While at it add CONFIG_PROC_FS=n versions of proc_ide_{create,destroy}()
      and remove no longer needed #ifdefs.
      Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
    • ide: add ide_tune_dma() helper · 29e744d0
      Committed by Bartlomiej Zolnierkiewicz
      After reworking the code responsible for selecting the best DMA
      transfer mode it is now possible to add a generic ide_tune_dma() helper.
      
      Convert some IDE PCI host drivers to use it (the ones left need more work).
      Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
    • ide: rework the code for selecting the best DMA transfer mode (v3) · 2d5eaa6d
      Committed by Bartlomiej Zolnierkiewicz
      Depends on the "ide: fix UDMA/MWDMA/SWDMA masks" patch.
      
      * add ide_hwif_t.udma_filter hook for filtering UDMA mask
        (use it in alim15x3, hpt366, siimage and serverworks drivers)
      * add ide_max_dma_mode() for finding best DMA mode for the device
        (loosely based on some older libata-core.c code)
      * convert ide_dma_speed() users to use ide_max_dma_mode()
      * make ide_rate_filter() take "ide_drive_t *drive" as an argument instead
        of "u8 mode" and teach it how to use the UDMA mask to do filtering
      * use ide_rate_filter() in hpt366 driver
      * remove no longer needed ide_dma_speed() and *_ratemask()
      * unexport eighty_ninty_three()
      
      v2:
      * rename ->filter_udma_mask to ->udma_filter
        [ Suggested by Sergei Shtylyov <sshtylyov@ru.mvista.com>. ]
      
      v3:
      * updated for scc_pata driver (fixes XFER_UDMA_6 filtering for user-space
        originated transfer mode change requests when 100MHz clock is used)
      Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
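The mask-based selection described in the commit above can be sketched as follows. This is an illustration only: best_udma_mode(), udma_filter_fn and cap_at_udma2() are hypothetical names, and the real ide_max_dma_mode() and ->udma_filter hook have different signatures. The sketch assumes that bit n of a mask means "UDMA mode n is supported".

```c
#include <stddef.h>

/* Hypothetical filter hook type: takes a UDMA mask, returns a narrowed mask. */
typedef unsigned char (*udma_filter_fn)(unsigned char mask);

/*
 * Return the highest UDMA mode number (0..7) permitted by the drive,
 * the host and the optional filter, or -1 if no UDMA mode is left.
 */
static int best_udma_mode(unsigned char drive_mask, unsigned char host_mask,
                          udma_filter_fn filter)
{
    unsigned char mask = drive_mask & host_mask;

    if (filter != NULL)
        mask = filter(mask);
    for (int mode = 7; mode >= 0; mode--)
        if (mask & (1u << mode))
            return mode;
    return -1;
}

/* Example filter: cap at UDMA2 (UDMA33), e.g. when no 80-wire cable is found. */
static unsigned char cap_at_udma2(unsigned char mask)
{
    return mask & 0x07;
}
```

Keeping the chipset quirk in a filter callback means the generic "pick the highest remaining bit" loop stays the same for every host driver.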
    • ide: fix UDMA/MWDMA/SWDMA masks (v3) · 18137207
      Committed by Bartlomiej Zolnierkiewicz
      * use 0x00 instead of 0x80 to disable ->{ultra,mwdma,swdma}_mask
      * add udma_mask field to ide_pci_device_t and use it to initialize
        ->ultra_mask in aec62xx, cmd64x, pdc202xx_{new,old} and piix drivers
      * fix UDMA masks to match with chipset specific *_ratemask()
        (alim15x3, hpt366, serverworks and siimage drivers need UDMA mask
         filtering method - done in the next patch)
      
      v2:
      * piix: fix cable detection for 82801AA_1 and 82372FB_1
        [ Noticed by Sergei Shtylyov <sshtylyov@ru.mvista.com>. ]
      * cmd64x: use hwif->cds->udma_mask
        [ Suggested by Sergei Shtylyov <sshtylyov@ru.mvista.com>. ]
      * aec62xx: fix newly introduced bug - check DMA status not command register
        [ Noticed by Sergei Shtylyov <sshtylyov@ru.mvista.com>. ]
      
      v3:
      * piix: use hwif->cds->udma_mask
        [ Suggested by Sergei Shtylyov <sshtylyov@ru.mvista.com>. ]
      Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
    • md: improve partition detection in md array · 5b479c91
      Committed by NeilBrown
      md currently uses ->media_changed to make sure rescan_partitions
      is called on md arrays after they are assembled.
      
      However that doesn't happen until the array is opened, which is later
      than some people would like.
      
      So use blkdev_ioctl to do the rescan as soon as the
      array has been assembled.
      
      This means we can remove all the ->change infrastructure as it was only used
      to trigger a partition rescan.
      Signed-off-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fbdev: add support for AVR32 · 880169dd
      Committed by Haavard Skinnemoen
      Provide framebuffer page protection flags and definitions of
      fb_readl/fb_writel for AVR32.
      Signed-off-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
      Cc: "Antonino A. Daplas" <adaplas@pol.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • svgalib: move fb_get_caps to svgalib · 5a87ede9
      Committed by Antonino A. Daplas
      Move fb_get_caps() method to svgalib.c as svga_get_caps() so it can be used by
      s3fb, arkfb and vt8623fb.
      Signed-off-by: Antonino Daplas <adaplas@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • compiler: introduce __used and __maybe_unused · 0d7ebbbc
      Committed by David Rientjes
      __used is defined to be __attribute__((unused)) for all pre-3.3 gcc
      compilers to suppress warnings for unused functions because perhaps they
      are referenced only in inline assembly.  It is defined to be
      __attribute__((used)) for gcc 3.3 and later so that the code is still
      emitted for such functions.
      
      __maybe_unused is defined to be __attribute__((unused)) for both function
      and variable use if it could possibly be unreferenced due to the evaluation
      of preprocessor macros.  Function prototypes shall be marked with
      __maybe_unused if the actual definition of the function is dependent on
      preprocessor macros.
      
      No update to compiler-intel.h is necessary because ICC supports both
      __attribute__((used)) and __attribute__((unused)) as specified by the gcc
      manual.
      
      __attribute_used__ is deprecated and will be removed once all current
      code is converted to using __used.
      
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Adrian Bunk <bunk@stusta.de>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
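The two annotations described in the commit above can be re-created roughly as below. This is a sketch for gcc-compatible compilers, not the kernel's header: my_used, my_maybe_unused, FEATURE_X and the helper functions are hypothetical names chosen for illustration.

```c
/* Illustrative re-creation; my_used / my_maybe_unused are hypothetical names. */
#if defined(__GNUC__) && (__GNUC__ > 3 || (__GNUC__ == 3 && __GNUC_MINOR__ >= 3))
# define my_used __attribute__((used))     /* emit the symbol even if unreferenced */
#else
# define my_used __attribute__((unused))   /* older gcc: only silence the warning */
#endif
#define my_maybe_unused __attribute__((unused))

/* In real code this would be referenced only from inline assembly. */
static int my_used asm_only_helper(void)
{
    return 42;
}

/* Only called when FEATURE_X is defined, so it may end up unreferenced. */
static int my_maybe_unused feature_x_helper(int v)
{
    return v + 1;
}

int run_feature(int v)
{
#ifdef FEATURE_X
    return feature_x_helper(v);
#else
    return v;
#endif
}
```

With FEATURE_X undefined, feature_x_helper() has no callers, and the annotation keeps the build free of an unused-function warning without forcing the code to be emitted.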
    • rename thread_info to stack · f7e4217b
      Committed by Roman Zippel
      This finally renames the thread_info field in task structure to stack, so that
      the assumptions about this field are gone and archs have more freedom about
      placing the thread_info structure.
      
      Nonbroken archs which have a proper thread pointer can do the access to both
      current thread and task structure via a single pointer.
      
      It'll allow for a few more cleanups of the fork code, from which e.g.  ia64
      could benefit.
      Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
      [akpm@linux-foundation.org: build fix]
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Ian Molton <spyro@f2s.com>
      Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Roman Zippel <zippel@linux-m68k.org>
      Cc: Greg Ungerer <gerg@uclinux.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
      Cc: Richard Curnow <rc@rc0.org.uk>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
      Cc: Andi Kleen <ak@muc.de>
      Cc: Chris Zankel <chris@zankel.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Allow arch to initialize arch field of the module structure · e61a1c1c
      Committed by Roman Zippel
      This will later allow an arch to add module specific information via linker
      generated tables instead of poking directly in the module object structure.
      Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
      Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • clocksource: fix resume logic · b52f52a0
      Committed by Thomas Gleixner
      We need to make sure that the clocksources are resumed, when timekeeping is
      resumed.  The current resume logic does not guarantee this.
      
      Add a resume function pointer to the clocksource struct, so clocksource
      drivers which need to reinitialize the clocksource can provide a resume
      function.
      
      Add a resume function, which calls the clocksource resume functions
      where they are provided and resets the watchdog function, so a stable
      TSC can be used across suspend/resume.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: john stultz <johnstul@us.ibm.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
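The optional resume hook described in the commit above can be modelled in a few lines. The struct and function names here (clocksource_sketch, clocksource_resume_all, hpet_like_resume) are hypothetical and much simpler than the kernel's struct clocksource; the watchdog reset is left out.

```c
#include <stddef.h>

/* Simplified, hypothetical model of a clocksource with an optional resume hook. */
struct clocksource_sketch {
    const char *name;
    void (*resume)(void);   /* NULL when the driver needs no reinitialisation */
};

/* Counts how often the example hook ran, so the effect is observable. */
static int resumed_count;

static void hpet_like_resume(void)
{
    resumed_count++;
}

/* Walk all registered clocksources and call the hooks that exist. */
static void clocksource_resume_all(struct clocksource_sketch *cs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (cs[i].resume != NULL)
            cs[i].resume();
}
```

Making the pointer optional means drivers whose hardware keeps its state across suspend need no changes at all.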
    • Move remote node draining out of slab allocators · 4037d452
      Committed by Christoph Lameter
      Currently the slab allocators contain callbacks into the page allocator to
      perform the draining of pagesets on remote nodes.  This requires SLUB to have
      a whole subsystem in order to be compatible with SLAB.  Moving node draining
      out of the slab allocators avoids a section of code in SLUB.
      
      Move the node draining so that it is done when the vm statistics are updated.
      At that point we are already touching all the cachelines with the pagesets of
      a processor.
      
      Add an expire counter there.  If we have to update per zone or global vm
      statistics then assume that the pageset will require subsequent draining.
      
      The expire counter will be decremented on each vm stats update pass until it
      reaches zero.  Then we will drain one batch from the pageset.  The draining
      will cause vm counter updates which will then cause another expiration until
      the pcp is empty.  So we will drain a batch every 3 seconds.
      
      Note that remote node draining is a somewhat esoteric feature that is required
      on large NUMA systems because otherwise significant portions of system memory
      can become trapped in pcp queues.  The number of pcp is determined by the
      number of processors and nodes in a system.  A system with 4 processors and 2
      nodes has 8 pcps which is okay.  But a system with 1024 processors and 512
      nodes has 512k pcps with a high potential for a large amount of memory being
      caught in them.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
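A toy model of the expire-counter draining and of the pcp count arithmetic from the commit above. Everything here is an assumption made for illustration: the struct layout, the names, and EXPIRE_PASSES = 3 (chosen so that one batch is drained every third pass). It is not the kernel's implementation.

```c
/* Hypothetical model: names and the batch/expire values are illustrative. */
struct pcp_sketch {
    int count;    /* pages currently held in the per-cpu pageset */
    int batch;    /* pages freed per drain */
    int expire;   /* vm stats passes left before the next drain */
};

#define EXPIRE_PASSES 3   /* assumption: one drain every 3 update passes */

/* One vm statistics update pass; returns the number of pages drained. */
static int vmstat_pass(struct pcp_sketch *p, int stats_changed)
{
    int drained;

    if (stats_changed)
        p->expire = EXPIRE_PASSES;   /* activity: assume draining will be needed */
    if (p->expire == 0 || p->count == 0)
        return 0;
    if (--p->expire > 0)
        return 0;                    /* not expired yet */
    drained = p->count < p->batch ? p->count : p->batch;
    p->count -= drained;
    if (p->count > 0)
        p->expire = EXPIRE_PASSES;   /* the drain changes counters: re-arm */
    return drained;
}

/* Number of per-cpu pagesets: one per (processor, node) pair. */
static long pcp_total(long processors, long nodes)
{
    return processors * nodes;
}
```

The second helper reproduces the figures quoted in the message: 4 processors and 2 nodes give 8 pcps, while 1024 processors and 512 nodes give 524288 (512k).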
    • vmstat: use our own timer events · d1187ed2
      Committed by Christoph Lameter
      vmstat is currently using the cache reaper to periodically bring the
      statistics up to date.  The cache reaper only exists in SLUB as a way to
      provide compatibility with SLAB.  This patch removes the vmstat calls from the
      slab allocators and provides its own handling.
      
      The advantage is also that we can use a different frequency for the updates.
      Refreshing vm stats is a pretty fast job so we can run this every second and
      stagger this by only one tick.  This will lead to some overlap in large
      systems.  E.g. a system running at 250 HZ with 1024 processors will have 4 vm
      updates occurring at once.
      
      However, the vm stats update only accesses per node information.  It is only
      necessary to stagger the vm statistics updates per processor in each node.  Vm
      counter updates occurring on distant nodes will not cause cacheline
      contention.
      
      We could implement an alternate approach that runs the first processor on each
      node at the second and then each of the other processors on a node on a
      subsequent tick.  That may be useful to keep a large amount of the second free
      of timer activity.  Maybe the timer folks will have some feedback on this one?
      
      [jirislaby@gmail.com: add missing break]
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
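The overlap figure in the commit above follows from simple arithmetic, sketched here under the assumption that each CPU is offset by one tick and the offsets wrap around once per second. The function names are hypothetical.

```c
/* Tick offset within a one-second period at which a given CPU runs its update. */
static unsigned int vmstat_tick_offset(unsigned int cpu, unsigned int hz)
{
    return cpu % hz;
}

/* How many of ncpus CPUs share the tick slot `slot`. */
static unsigned int cpus_in_slot(unsigned int slot, unsigned int ncpus,
                                 unsigned int hz)
{
    unsigned int n = 0;

    for (unsigned int cpu = 0; cpu < ncpus; cpu++)
        if (vmstat_tick_offset(cpu, hz) == slot)
            n++;
    return n;
}
```

With 1024 CPUs and HZ = 250 there are 250 slots for 1024 updates, so most slots carry 4 updates; because 1024 = 4 * 250 + 24, the first 24 slots carry 5 in this model.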
    • Add suspend-related notifications for CPU hotplug · 8bb78442
      Committed by Rafael J. Wysocki
      Since nonboot CPUs are now disabled after tasks and devices have been
      frozen and the CPU hotplug infrastructure is used for this purpose, we need
      special CPU hotplug notifications that will help the CPU-hotplug-aware
      subsystems distinguish normal CPU hotplug events from CPU hotplug events
      related to a system-wide suspend or resume operation in progress.  This
      patch introduces such notifications and causes them to be used during
      suspend and resume transitions.  It also changes all of the
      CPU-hotplug-aware subsystems to take these notifications into consideration
      (for now they are handled in the same way as the corresponding "normal"
      ones).
      
      [oleg@tv-sign.ru: cleanups]
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Gautham R Shenoy <ego@in.ibm.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs: deprecate memclear_highpage_flush · f37bc271
      Committed by Nate Diller
      Now that all the in-tree users are converted over to zero_user_page(),
      deprecate the old memclear_highpage_flush() call.
      Signed-off-by: Nate Diller <nate.diller@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs: convert core functions to zero_user_page · 01f2705d
      Committed by Nate Diller
      It's very common for file systems to need to zero part or all of a page;
      the simplest way is just to use kmap_atomic() and memset().  There's
      actually a library function in include/linux/highmem.h that does exactly
      that, but it's confusingly named memclear_highpage_flush(), which is
      descriptive of *how* it does the work rather than what the *purpose* is.
      So this patchset renames the function to zero_user_page(), and calls it
      from the various places that currently open code it.
      
      This first patch introduces the new function call, and converts all the
      core kernel callsites, both the open-coded ones and the old
      memclear_highpage_flush() ones.  Following this patch is a series of
      conversions for each file system individually, per AKPM, and finally a
      patch deprecating the old call.  The diffstat below shows the entire
      patchset.
      
      [akpm@linux-foundation.org: fix a few things]
      Signed-off-by: Nate Diller <nate.diller@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
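A user-space analogue of the "zero part of a page" operation discussed in the commit above. In the kernel the page would first be mapped (the message mentions kmap_atomic()) before memset() is applied; here the page is plain memory. The name zero_page_range(), the 4 KiB page size and the bounds check are assumptions of this sketch, not the zero_user_page() interface.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumption: 4 KiB pages */

/*
 * Zero `size` bytes starting at `offset` inside one page-sized buffer.
 * Returns 0 on success, -1 if the range does not fit in the page.
 */
static int zero_page_range(unsigned char *page, size_t offset, size_t size)
{
    if (offset > SKETCH_PAGE_SIZE || size > SKETCH_PAGE_SIZE - offset)
        return -1;
    memset(page + offset, 0, size);
    return 0;
}
```

Naming the helper after its purpose rather than its mechanism is the same point the commit makes about memclear_highpage_flush().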
    • FUTEX: new PRIVATE futexes · 34f01cc1
      Committed by Eric Dumazet
        Analysis of current linux futex code :
        --------------------------------------
      
      A central hash table futex_queues[] holds all contexts (futex_q) of waiting
      threads.
      
      Each futex_wait()/futex_wake() has to obtain a spinlock on a hash slot to
      perform lookups or insert/deletion of a futex_q.
      
      When a futex_wait() is done, calling thread has to :
      
      1) - Obtain a read lock on mmap_sem to be able to validate the user pointer
           (calling find_vma()). This validation tells us if the futex uses
           an inode based store (mapped file), or mm based store (anonymous mem)
      
      2) - compute a hash key
      
      3) - Atomic increment of reference counter on an inode or a mm_struct
      
      4) - lock part of futex_queues[] hash table
      
      5) - perform the test on value of futex.
      	(rollback if value != expected_value, returns EWOULDBLOCK)
      	(various loops if test triggers mm faults)
      
      6) queue the context into hash table, release the lock got in 4)
      
      7) - release the read_lock on mmap_sem
      
         <block>
      
      8) Eventually unqueue the context (but rarely, as this part  may be done
         by the futex_wake())
      
      Futexes were designed to improve scalability but current implementation has
      various problems :
      
      - Central hashtable :
      
        This means scalability problems if many processes/threads want to use
        futexes at the same time.
        This means NUMA unbalance because this hashtable is located on one node.
      
      - Using mmap_sem on every futex() syscall :
      
        Even if mmap_sem is a rw_semaphore, up_read()/down_read() are doing atomic
        ops on mmap_sem, dirtying cache line :
          - lot of cache line ping pongs on SMP configurations.
      
        mmap_sem is also extensively used by mm code (page faults, mmap()/munmap())
        Highly threaded processes might suffer from mmap_sem contention.
      
        mmap_sem is also used by oprofile code. Enabling oprofile hurts threaded
        programs because of contention on the mmap_sem cache line.
      
      - Using an atomic_inc()/atomic_dec() on inode ref counter or mm ref counter:
        It's also a cache line ping pong on SMP. It also increases mmap_sem hold time
        because of cache misses.
      
      Most of these scalability problems come from the fact that futexes are in
      one global namespace.  As we use a central hash table, we must make sure
      they are all using the same reference (given by the mm subsystem).  We
      chose to force all futexes to be 'shared'.  This has a cost.
      
      But fact is POSIX defined PRIVATE and SHARED, allowing clear separation,
      and optimal performance if carefully implemented.  Time has come for linux
      to have better threading performance.
      
      The goal is to permit new futex commands to avoid :
       - Taking the mmap_sem semaphore, conflicting with other subsystems.
       - Modifying a ref_count on mm or an inode, still conflicting with mm or fs.
      
      This is possible because, for one process using PTHREAD_PROCESS_PRIVATE
      futexes, we only need to distinguish futexes by their virtual address, no
      matter what the underlying mm storage is.
      
      If glibc wants to exploit this new infrastructure, it should use new
      _PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes.  And be
      prepared to fallback on old subcommands for old kernels.  Using one global
      variable with the FUTEX_PRIVATE_FLAG or 0 value should be OK.
      
      PTHREAD_PROCESS_SHARED futexes should still use the old subcommands.
      
      Compatibility with old applications is preserved, they still hit the
      scalability problems, but new applications can fly :)
      
      Note : the same SHARED futex (mapped on a file) can be used by old binaries
      *and* new binaries, because both binaries will use the old subcommands.
      
      Note : Vast majority of futexes should be using PROCESS_PRIVATE semantic,
      as this is the default semantic. Almost all applications should benefit
      from these changes (new kernel and updated libc)
      
      Some bench results on a Pentium M 1.6 GHz (SMP kernel on a UP machine)
      
      /* calling futex_wait(addr, value) with value != *addr */
      433 cycles per futex(FUTEX_WAIT) call (mixing 2 futexes)
      424 cycles per futex(FUTEX_WAIT) call (using one futex)
      334 cycles per futex(FUTEX_WAIT_PRIVATE) call (mixing 2 futexes)
      334 cycles per futex(FUTEX_WAIT_PRIVATE) call (using one futex)
      For reference :
      187 cycles per getppid() call
      188 cycles per umask() call
      181 cycles per ni_syscall() call
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Pierre Peiffer <pierre.peiffer@bull.net>
      Cc: "Ulrich Drepper" <drepper@gmail.com>
      Cc: "Nick Piggin" <nickpiggin@yahoo.com.au>
      Cc: "Ingo Molnar" <mingo@elte.hu>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
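The fallback strategy suggested in the commit above ("one global variable with the FUTEX_PRIVATE_FLAG or 0 value") might look like this from user space. This is a Linux-only sketch, not glibc's code: private_futex_wait() and futex_private_flag are hypothetical names, and it assumes an old kernel rejects the unknown command with ENOSYS.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Either FUTEX_PRIVATE_FLAG or 0, as the commit message suggests. */
static int futex_private_flag = FUTEX_PRIVATE_FLAG;

/*
 * Wait on a process-private futex word while *addr == expected.
 * Falls back to the shared FUTEX_WAIT command on kernels that do not
 * know the _PRIVATE variants, and remembers that for later calls.
 */
static long private_futex_wait(int *addr, int expected)
{
    long ret = syscall(SYS_futex, addr, FUTEX_WAIT | futex_private_flag,
                       expected, NULL, NULL, 0);

    if (ret == -1 && errno == ENOSYS && futex_private_flag != 0) {
        futex_private_flag = 0;   /* kernel predates the _PRIVATE commands */
        ret = syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
    }
    return ret;
}
```

Calling it with a value that differs from the futex word returns immediately with EWOULDBLOCK, which is the case measured in the benchmark figures above.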
    • futex_requeue_pi optimization · d0aa7a70
      Committed by Pierre Peiffer
      This patch provides the futex_requeue_pi functionality, which allows some
      threads waiting on a normal futex to be requeued on the wait-queue of a
      PI-futex.
      
      This provides an optimization, already used for (normal) futexes, to be used
      with the PI-futexes.
      
      This optimization is currently used by the glibc in pthread_broadcast, when
      using "normal" mutexes.  With futex_requeue_pi, it can be used with
      PRIO_INHERIT mutexes too.
      Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Make futex_wait() use an hrtimer for timeout · c19384b5
      Committed by Pierre Peiffer
      This patch modifies futex_wait() to use an hrtimer + schedule() in place of
      schedule_timeout().
      
      schedule_timeout() is tick based, therefore the timeout granularity is the
      tick (1 ms, 4 ms or 10 ms depending on HZ).  By using a high resolution timer
      for timeout wakeup, we can attain a much finer timeout granularity (in the
      microsecond range).  This parallels what is already done for futex_lock_pi().
      
      The timeout passed to the syscall is no longer converted to jiffies and is
      therefore passed to do_futex() and futex_wait() as an absolute ktime_t
      therefore keeping nanosecond resolution.
      
      Also this removes the need to pass the nanoseconds timeout part to
      futex_lock_pi() in val2.
      
      In futex_wait(), if there is no timeout then a regular schedule() is
      performed.  Otherwise, an hrtimer is started before schedule() is called.
      
      [akpm@linux-foundation.org: fix `make headers_check']
      Signed-off-by: Sebastien Dugue <sebastien.dugue@bull.net>
      Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
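The granularity argument in the commit above can be checked with a little arithmetic. This is a simplified model with hypothetical function names: it only rounds a timeout up to whole ticks and ignores any extra tick a real tick-based timer may add.

```c
/* Length of one scheduler tick in nanoseconds for a given HZ. */
static long long tick_ns(int hz)
{
    return 1000000000LL / hz;
}

/* Timeout as a tick-based timer would see it: rounded up to whole ticks. */
static long long round_up_to_tick(long long timeout_ns, int hz)
{
    long long t = tick_ns(hz);

    return (timeout_ns + t - 1) / t * t;
}
```

For HZ values of 1000, 250 and 100 the tick is 1 ms, 4 ms and 10 ms, so a 150 microsecond timeout becomes 4 ms at HZ = 250 in this model, whereas an hrtimer can honour it at close to the requested resolution.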
    • declare struct ktime · f34c506b
      Committed by Andrew Morton
      Some smarty went and inflicted ktime_t as a typedef upon us, so we cannot
      forward declare it.
      
      Create a new `union ktime', map ktime_t onto that.  Now we need to kill off
      this ktime_t thing.
      
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: john stultz <johnstul@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
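Why a tagged union helps can be shown in a few lines of C: a tag can be forward declared, a bare typedef of an anonymous union cannot. The single tv64 member and the function ktime_sketch_to_ns() are simplifications assumed for this sketch; the kernel's union had more to it.

```c
/* A header that only passes pointers around can forward declare the tag. */
union ktime;
long long ktime_sketch_to_ns(const union ktime *kt);

/* Full definition, which would normally live in its own header. */
union ktime {
    long long tv64;   /* simplified: a single 64-bit nanosecond value */
};
typedef union ktime ktime_t;   /* ktime_t is mapped onto the tagged union */

long long ktime_sketch_to_ns(const union ktime *kt)
{
    return kt->tv64;
}
```

Existing users of ktime_t keep compiling, while new headers can mention `union ktime *` without pulling in the full definition.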
    • aio is unlikely · b8522ead
      Committed by Andrew Morton
      Stick an unlikely() around is_aio(): I assert that most IO is synchronous.
      
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
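A branch hint of the kind described in the commit above can be sketched with gcc's __builtin_expect(). The macro name sketch_unlikely, the struct and dispatch() are hypothetical; only the idea of marking the asynchronous path as the rare one comes from the commit message.

```c
/* Branch hint in the style of the kernel's unlikely(); gcc/clang builtin. */
#define sketch_unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical request descriptor with an "is asynchronous" flag. */
struct kiocb_sketch {
    int is_async;
};

/* Returns 1 when the rare asynchronous path was taken, 0 otherwise. */
static int dispatch(const struct kiocb_sketch *req)
{
    if (sketch_unlikely(req->is_async))
        return 1;   /* rare path: asynchronous IO */
    return 0;       /* common path: synchronous IO */
}
```

The hint does not change the result of the test; it only tells the compiler which side of the branch to lay out as the fall-through path.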