提交 · 88db5e1489f23876a226f5393fd978ddc09dc5f9 · openeuler / raspberrypi-kernel

10 5月, 2007 40 次提交

Revert "ACPICA: fix AML mutex re-entrancy" · 262a7a28

由 Len Brown 提交于 5月 09, 2007

This reverts commit c0d127b5.

These changes to AML locking were made to allow
Notify handlers to be called on the stack
and not deadlock.  However, that scheme turns
out to be flawed and was reverted by the previous commit,
so this commit restores the locking to it previous design.
Signed-off-by: NLen Brown <len.brown@intel.com>

262a7a28

Revert "ACPICA: revert "acpi_serialize" changes" · 4d2acd9e

由 Len Brown 提交于 5月 09, 2007

This reverts commit a8f4af6d.
Thus restoring ACPICA's new acpi_serialize code.

This commit by itself may cause a regression, but
it is reverted in this order so that subsequent
reverts reverts under this one can be made
without conflict.
Signed-off-by: NLen Brown <len.brown@intel.com>

4d2acd9e

Revert "md: improve partition detection in md array" · 44ce6294

由 Linus Torvalds 提交于 5月 09, 2007

This reverts commit 5b479c91.

Quoth Neil Brown:

  "It causes an oops when auto-detecting raid arrays, and it doesn't
   seem easy to fix.

   The array may not be 'open' when do_md_run is called, so
   bdev->bd_disk might be NULL, so bd_set_size can oops.

   This whole approach of opening an md device before it has been
   assembled just seems to get more and more painful.  I think I'm going
   to have to come up with something clever to provide both backward
   comparability with usage expectation, and sane integration into the
   rest of the kernel."
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

44ce6294

ide: legacy PCI bus order probing fixes · 6d208b39

由 Bartlomiej Zolnierkiewicz 提交于 5月 10, 2007

IDE PCI host drivers should register themselves with IDE core only when
IDE driver is built-in, otherwise (IDE driver is modular and thus IDE PCI
host drivers are also modular) the code has no effect and just complicates
the probing.

Fix it by adding new config option CONFIG_IDEPCI_PCIBUS (defined only when
needed and invisible to the user) and covering by #ifdef/#endif the code
in question. It turned out that "ide=reverse" was silently accepted but did
nothing in case when IDE driver was modular, this is fixed now.
Signed-off-by: NBartlomiej Zolnierkiewicz <bzolnier@gmail.com>

6d208b39

ide: add ide_proc_register_port() · 5cbf79cd

由 Bartlomiej Zolnierkiewicz 提交于 5月 10, 2007

* create_proc_ide_interfaces() tries to add /proc entries for every probed
  and initialized IDE port, replace it by ide_proc_register_port() which does
  it only for the given port (also rename destroy_proc_ide_interface() to
  ide_proc_unregister_port() for consistency)
  
* convert {create,destroy}_proc_ide_interface[s]() users to use new functions

* pmac driver depended on proc_ide_create() to add /proc port entries, fix it
  
* au1xxx-ide, swarm and cs5520 drivers depended indirectly on ide-generic
  driver (CONFIG_IDE_GENERIC=y) to add port /proc entries, fix them

* there is now no need to add /proc entries for IDE ports in proc_ide_create()
  so don't do it

* proc_ide_create() needs now to be called before drivers are probed - fix it,
  while at it make proc_ide_create() create /proc "ide" directory
Signed-off-by: NBartlomiej Zolnierkiewicz <bzolnier@gmail.com>

5cbf79cd

ide: add "initializing" argument to ide_register_hw() · 869c56ee

由 Bartlomiej Zolnierkiewicz 提交于 5月 10, 2007

Add "initializing" argument to ide_register_hw() and use it instead of ide.c
wide variable of the same name.  Update all users of ide_register_hw()
accordingly.
Signed-off-by: NBartlomiej Zolnierkiewicz <bzolnier@gmail.com>

869c56ee

ide: cable detection fixes (take 2) · 7f8f48af

由 Bartlomiej Zolnierkiewicz 提交于 5月 10, 2007

Tejun's recent eighty_ninty_three() fix has inspired me to do more thorough
review of the cable detection code...

* print user-friendly warning about limiting the maximum transfer speed
  to UDMA33 (and the reason behind it) when 80-wire cable is not detected,
  also while at it cleanup eighty_ninty_three() a bit

* use eighty_ninty_three() in ide_ata66_check(), this actually fixes 3 bugs:
  - bit 14 (word 93 validity check) == 1 && bit 13 (80-wire cable test) == 1
    were used as 80-wire cable present test for CONFIG_IDEDMA_IVB=n case
    (please see FIXME comment in eighty_ninty_three() for more details)
  - CONFIG_IDEDMA_IVB=y/n cases were interchanged
  - check for SATA devices was missing

* remove private cable warnings from pdc_202xx{old,new} drivers now that core
  code provides this functionality (plus, in pdc202xx_new case the test could
  give false warnings for ATAPI devices because pdc202xx_new driver doesn't
  even support ATAPI DMA)

Cc: Tejun Heo <htejun@gmail.com>
Signed-off-by: NBartlomiej Zolnierkiewicz <bzolnier@gmail.com>

7f8f48af

ide: move IDE settings handling to ide-proc.c · 7662d046

由 Bartlomiej Zolnierkiewicz 提交于 5月 10, 2007

* move
	__ide_add_setting()
	ide_add_setting()
	__ide_remove_setting()
	auto_remove_settings()
	ide_find_setting_by_name()
	ide_read_setting()
	ide_write_setting()
	set_xfer_rate()
	ide_add_generic_settings()
	ide_register_subdriver()
	ide_unregister_subdriver()

  from ide.c to ide-proc.c

* set_{io_32bit,pio_mode,using_dma}() cannot be marked static now, fix it

* rename ide_[un]register_subdriver() to ide_proc_[un]register_driver(),
  update device drivers to use new names

* add CONFIG_IDE_PROC_FS=n versions of ide_proc_[un]register_driver()
  and ide_add_generic_settings()

* make ide_find_setting_by_name(), ide_{read,write}_setting()
  and ide_{add,remove}_proc_entries() static

* cover IDE settings code in device drivers with CONFIG_IDE_PROC_FS #ifdef,
  also while at it cover with CONFIG_IDE_PROC_FS #ifdef ide_driver_t.proc

* remove bogus comment from ide.h

* cover with CONFIG_IDE_PROC_FS #ifdef .proc and .settings in ide_drive_t

Besides saner code this patch results in the IDE core smaller by ~2 kB
(on x86-32) and IDE disk driver by ~1 kB (ditto) when CONFIG_IDE_PROC_FS=n.
Signed-off-by: NBartlomiej Zolnierkiewicz <bzolnier@gmail.com>

7662d046

ide: split off ioctl handling from IDE settings (v2) · 1497943e

由 Bartlomiej Zolnierkiewicz 提交于 5月 10, 2007

* do write permission and min/max checks in ide_procset_t functions

* ide-disk.c: drive->id is always available so cleanup "multcount" setting
  accordingly

* ide-disk.c: "address" setting was incorrectly defined as type TYPE_INTA,
  fix it by using type TYPE_BYTE and updating ide_drive_t->adressing field,
  the bug didn't trigger because this IDE setting uses custom ->set function

* ide.c: add set_ksettings() for handling HDIO_SET_KEEPSETTINGS ioctl

* ide.c: add set_unmaskirq() for handling HDIO_SET_UNMASKINTR ioctl

* handle ioctls directly in generic_ide_ioclt() and idedisk_ioctl()
  instead of using IDE settings to deal with them

* remove no longer needed ide_find_setting_by_ioctl() and {read,write}_ioctl
  fields from ide_settings_t, also remove now unused TYPE_INTA handling

v2:
* add missing EXPORT_SYMBOL_GPL(ide_setting_sem) needed now for ide-disk
Signed-off-by: NBartlomiej Zolnierkiewicz <bzolnier@gmail.com>

1497943e

ide: make /proc/ide/ optional · ecfd80e4

由 Bartlomiej Zolnierkiewicz 提交于 5月 10, 2007

All important information/features should be already available through
sysfs and ioctl interfaces.

Add CONFIG_IDE_PROC_FS (CONFIG_SCSI_PROC_FS rip-off) config option,
disabling it makes IDE driver ~5 kB smaller (on x86-32).

While at it add CONFIG_PROC_FS=n versions of proc_ide_{create,destroy}()
and remove no longer needed #ifdefs.
Signed-off-by: NBartlomiej Zolnierkiewicz <bzolnier@gmail.com>

ecfd80e4

ide: add ide_tune_dma() helper · 29e744d0

由 Bartlomiej Zolnierkiewicz 提交于 5月 10, 2007

After reworking the code responsible for selecting the best DMA
transfer mode it is now possible to add generic ide_tune_dma() helper.

Convert some IDE PCI host drivers to use it (the ones left need more work).
Signed-off-by: NBartlomiej Zolnierkiewicz <bzolnier@gmail.com>

29e744d0

ide: rework the code for selecting the best DMA transfer mode (v3) · 2d5eaa6d

由 Bartlomiej Zolnierkiewicz 提交于 5月 10, 2007

Depends on the "ide: fix UDMA/MWDMA/SWDMA masks" patch.

* add ide_hwif_t.udma_filter hook for filtering UDMA mask
  (use it in alim15x3, hpt366, siimage and serverworks drivers)
* add ide_max_dma_mode() for finding best DMA mode for the device
  (loosely based on some older libata-core.c code)
* convert ide_dma_speed() users to use ide_max_dma_mode()
* make ide_rate_filter() take "ide_drive_t *drive" as an argument instead
  of "u8 mode" and teach it to how to use UDMA mask to do filtering
* use ide_rate_filter() in hpt366 driver
* remove no longer needed ide_dma_speed() and *_ratemask()
* unexport eighty_ninty_three()

v2:
* rename ->filter_udma_mask to ->udma_filter
  [ Suggested by Sergei Shtylyov <sshtylyov@ru.mvista.com>. ]

v3:
* updated for scc_pata driver (fixes XFER_UDMA_6 filtering for user-space
  originated transfer mode change requests when 100MHz clock is used)
Signed-off-by: NBartlomiej Zolnierkiewicz <bzolnier@gmail.com>

2d5eaa6d

ide: fix UDMA/MWDMA/SWDMA masks (v3) · 18137207

由 Bartlomiej Zolnierkiewicz 提交于 5月 10, 2007

* use 0x00 instead of 0x80 to disable ->{ultra,mwdma,swdma}_mask
* add udma_mask field to ide_pci_device_t and use it to initialize
  ->ultra_mask in aec62xx, cmd64x, pdc202xx_{new,old} and piix drivers
* fix UDMA masks to match with chipset specific *_ratemask()
  (alim15x3, hpt366, serverworks and siimage drivers need UDMA mask
   filtering method - done in the next patch)

v2:
* piix: fix cable detection for 82801AA_1 and 82372FB_1
  [ Noticed by Sergei Shtylyov <sshtylyov@ru.mvista.com>. ]
* cmd64x: use hwif->cds->udma_mask
  [ Suggested by Sergei Shtylyov <sshtylyov@ru.mvista.com>. ]
* aec62xx: fix newly introduced bug - check DMA status not command register
  [ Noticed by Sergei Shtylyov <sshtylyov@ru.mvista.com>. ]

v3:
* piix: use hwif->cds->udma_mask
  [ Suggested by Sergei Shtylyov <sshtylyov@ru.mvista.com>. ]
Signed-off-by: NBartlomiej Zolnierkiewicz <bzolnier@gmail.com>

18137207

i386: msr.h: be paranoid about types and parentheses · b0b73cb4

由 H. Peter Anvin 提交于 5月 09, 2007

When implementing things as macros, make sure we use typecasts and
parentheses where needed.  The macros as defined were vulnerable to
surreptitious promotion causing problems.

Avoid macros where practical; e.g. wrmsr() can be an inline instead.
Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b0b73cb4

i386: remove unused rdtsc() macro · 29bd4433

由 H. Peter Anvin 提交于 5月 09, 2007

All users to the two-part rdtsc() macro have already switched to using
rdtscl() or rdtscll().  Remove the now-obsolete macro.
Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

29bd4433

md: improve partition detection in md array · 5b479c91

由 NeilBrown 提交于 5月 09, 2007

md currently uses ->media_changed to make sure rescan_partitions
is call on md array after they are assembled.

However that doesn't happen until the array is opened, which is later
than some people would like.

So use blkdev_ioctl to do the rescan immediately that the
array has been assembled.

This means we can remove all the ->change infrastructure as it was only used
to trigger a partition rescan.
Signed-off-by: NNeil Brown <neilb@suse.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5b479c91

fbdev: add support for AVR32 · 880169dd

由 Haavard Skinnemoen 提交于 5月 09, 2007

Provide framebuffer page protection flags and definitions of
fb_readl/fb_writel for AVR32.
Signed-off-by: NHaavard Skinnemoen <hskinnemoen@atmel.com>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

880169dd

svgalib: move fb_get_caps to svgalib · 5a87ede9

由 Antonino A. Daplas 提交于 5月 09, 2007

Move fb_get_caps() method to svgalib.c as svga_get_caps() so it can be used by
s3fb, arkfb and vt8623fb.
Signed-off-by: NAntonino Daplas <adaplas@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5a87ede9

i386 mmzone: use __maybe_unused · c3c117f0

由 David Rientjes 提交于 5月 09, 2007

Replace automatic variable instances of __attribute__ ((unused)) with
__maybe_unused.

Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c3c117f0

sh: dma: use __maybe_unused · d16aaffa

由 David Rientjes 提交于 5月 09, 2007

There is no such thing as labeling a variable as __attribute__((used)). Since
ts_shift is not referenced in inline assembly, we assume that we're simply
suppressing a warning here if the variable is declared but unreferenced.

Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d16aaffa

compiler: introduce __used and __maybe_unused · 0d7ebbbc

由 David Rientjes 提交于 5月 09, 2007

__used is defined to be __attribute__((unused)) for all pre-3.3 gcc
compilers to suppress warnings for unused functions because perhaps they
are referenced only in inline assembly.  It is defined to be
__attribute__((used)) for gcc 3.3 and later so that the code is still
emitted for such functions.

__maybe_unused is defined to be __attribute__((unused)) for both function
and variable use if it could possibly be unreferenced due to the evaluation
of preprocessor macros.  Function prototypes shall be marked with
__maybe_unused if the actual definition of the function is dependant on
preprocessor macros.

No update to compiler-intel.h is necessary because ICC supports both
__attribute__((used)) and __attribute__((unused)) as specified by the gcc
manual.

__attribute_used__ is deprecated and will be removed once all current
code is converted to using __used.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Adrian Bunk <bunk@stusta.de>
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

0d7ebbbc

rename thread_info to stack · f7e4217b

由 Roman Zippel 提交于 5月 09, 2007

This finally renames the thread_info field in task structure to stack, so that
the assumptions about this field are gone and archs have more freedom about
placing the thread_info structure.

Nonbroken archs which have a proper thread pointer can do the access to both
current thread and task structure via a single pointer.

It'll allow for a few more cleanups of the fork code, from which e.g.  ia64
could benefit.
Signed-off-by: NRoman Zippel <zippel@linux-m68k.org>
[akpm@linux-foundation.org: build fix]
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Andi Kleen <ak@muc.de>
Cc: Chris Zankel <chris@zankel.net>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f7e4217b

wrap access to thread_info · c9f4f06d

由 Roman Zippel 提交于 5月 09, 2007

Recently a few direct accesses to the thread_info in the task structure snuck
back, so this wraps them with the appropriate wrapper.
Signed-off-by: NRoman Zippel <zippel@linux-m68k.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c9f4f06d

Allow arch to initialize arch field of the module structure · e61a1c1c

由 Roman Zippel 提交于 5月 09, 2007

This will later allow an arch to add module specific information via linker
generated tables instead of poking directly in the module object structure.
Signed-off-by: NRoman Zippel <zippel@linux-m68k.org>
Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e61a1c1c

clocksource: fix resume logic · b52f52a0

由 Thomas Gleixner 提交于 5月 09, 2007

We need to make sure that the clocksources are resumed, when timekeeping is
resumed.  The current resume logic does not guarantee this.

Add a resume function pointer to the clocksource struct, so clocksource
drivers which need to reinitialize the clocksource can provide a resume
function.

Add a resume function, which calls the maybe available clocksource resume
functions and resets the watchdog function, so a stable TSC can be used
accross suspend/resume.
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b52f52a0

Move remote node draining out of slab allocators · 4037d452

由 Christoph Lameter 提交于 5月 09, 2007

Currently the slab allocators contain callbacks into the page allocator to
perform the draining of pagesets on remote nodes. This requires SLUB to have
a whole subsystem in order to be compatible with SLAB. Moving node draining
out of the slab allocators avoids a section of code in SLUB.

Move the node draining so that is is done when the vm statistics are updated.
At that point we are already touching all the cachelines with the pagesets of
a processor.

Add a expire counter there. If we have to update per zone or global vm
statistics then assume that the pageset will require subsequent draining.

The expire counter will be decremented on each vm stats update pass until it
reaches zero. Then we will drain one batch from the pageset. The draining
will cause vm counter updates which will then cause another expiration until
the pcp is empty. So we will drain a batch every 3 seconds.

Note that remote node draining is a somewhat esoteric feature that is required
on large NUMA systems because otherwise significant portions of system memory
can become trapped in pcp queues. The number of pcp is determined by the
number of processors and nodes in a system. A system with 4 processors and 2
nodes has 8 pcps which is okay. But a system with 1024 processors and 512
nodes has 512k pcps with a high potential for large amount of memory being
caught in them.
Signed-off-by: NChristoph Lameter <clameter@sgi.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4037d452

vmstat: use our own timer events · d1187ed2

由 Christoph Lameter 提交于 5月 09, 2007

vmstat is currently using the cache reaper to periodically bring the
statistics up to date. The cache reaper does only exists in SLUB as a way to
provide compatibility with SLAB. This patch removes the vmstat calls from the
slab allocators and provides its own handling.

The advantage is also that we can use a different frequency for the updates.
Refreshing vm stats is a pretty fast job so we can run this every second and
stagger this by only one tick. This will lead to some overlap in large
systems. F.e a system running at 250 HZ with 1024 processors will have 4 vm
updates occurring at once.

However, the vm stats update only accesses per node information. It is only
necessary to stagger the vm statistics updates per processor in each node. Vm
counter updates occurring on distant nodes will not cause cacheline
contention.

We could implement an alternate approach that runs the first processor on each
node at the second and then each of the other processor on a node on a
subsequent tick. That may be useful to keep a large amount of the second free
of timer activity. Maybe the timer folks will have some feedback on this one?

[jirislaby@gmail.com: add missing break]
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: NChristoph Lameter <clameter@sgi.com>
Signed-off-by: NJiri Slaby <jirislaby@gmail.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d1187ed2

由 Rafael J. Wysocki 提交于 5月 09, 2007

Since nonboot CPUs are now disabled after tasks and devices have been
frozen and the CPU hotplug infrastructure is used for this purpose, we need
special CPU hotplug notifications that will help the CPU-hotplug-aware
subsystems distinguish normal CPU hotplug events from CPU hotplug events
related to a system-wide suspend or resume operation in progress.  This
patch introduces such notifications and causes them to be used during
suspend and resume transitions.  It also changes all of the
CPU-hotplug-aware subsystems to take these notifications into consideration
(for now they are handled in the same way as the corresponding "normal"
ones).

[oleg@tv-sign.ru: cleanups]
Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8bb78442

fs: deprecate memclear_highpage_flush · f37bc271

由 Nate Diller 提交于 5月 09, 2007

Now that all the in-tree users are converted over to zero_user_page(),
deprecate the old memclear_highpage_flush() call.
Signed-off-by: NNate Diller <nate.diller@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f37bc271

fs: convert core functions to zero_user_page · 01f2705d

由 Nate Diller 提交于 5月 09, 2007

It's very common for file systems to need to zero part or all of a page,
the simplist way is just to use kmap_atomic() and memset().  There's
actually a library function in include/linux/highmem.h that does exactly
that, but it's confusingly named memclear_highpage_flush(), which is
descriptive of *how* it does the work rather than what the *purpose* is.
So this patchset renames the function to zero_user_page(), and calls it
from the various places that currently open code it.

This first patch introduces the new function call, and converts all the
core kernel callsites, both the open-coded ones and the old
memclear_highpage_flush() ones.  Following this patch is a series of
conversions for each file system individually, per AKPM, and finally a
patch deprecating the old call.  The diffstat below shows the entire
patchset.

[akpm@linux-foundation.org: fix a few things]
Signed-off-by: NNate Diller <nate.diller@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

01f2705d

FUTEX: new PRIVATE futexes · 34f01cc1

由 Eric Dumazet 提交于 5月 09, 2007

  Analysis of current linux futex code :
  --------------------------------------

A central hash table futex_queues[] holds all contexts (futex_q) of waiting
threads.

Each futex_wait()/futex_wait() has to obtain a spinlock on a hash slot to
perform lookups or insert/deletion of a futex_q.

When a futex_wait() is done, calling thread has to :

1) - Obtain a read lock on mmap_sem to be able to validate the user pointer
     (calling find_vma()). This validation tells us if the futex uses
     an inode based store (mapped file), or mm based store (anonymous mem)

2) - compute a hash key

3) - Atomic increment of reference counter on an inode or a mm_struct

4) - lock part of futex_queues[] hash table

5) - perform the test on value of futex.
	(rollback is value != expected_value, returns EWOULDBLOCK)
	(various loops if test triggers mm faults)

6) queue the context into hash table, release the lock got in 4)

7) - release the read_lock on mmap_sem

   <block>

8) Eventually unqueue the context (but rarely, as this part  may be done
   by the futex_wake())

Futexes were designed to improve scalability but current implementation has
various problems :

- Central hashtable :

  This means scalability problems if many processes/threads want to use
  futexes at the same time.
  This means NUMA unbalance because this hashtable is located on one node.

- Using mmap_sem on every futex() syscall :

  Even if mmap_sem is a rw_semaphore, up_read()/down_read() are doing atomic
  ops on mmap_sem, dirtying cache line :
    - lot of cache line ping pongs on SMP configurations.

  mmap_sem is also extensively used by mm code (page faults, mmap()/munmap())
  Highly threaded processes might suffer from mmap_sem contention.

  mmap_sem is also used by oprofile code. Enabling oprofile hurts threaded
  programs because of contention on the mmap_sem cache line.

- Using an atomic_inc()/atomic_dec() on inode ref counter or mm ref counter:
  It's also a cache line ping pong on SMP. It also increases mmap_sem hold time
  because of cache misses.

Most of these scalability problems come from the fact that futexes are in
one global namespace.  As we use a central hash table, we must make sure
they are all using the same reference (given by the mm subsystem).  We
chose to force all futexes be 'shared'.  This has a cost.

But fact is POSIX defined PRIVATE and SHARED, allowing clear separation,
and optimal performance if carefuly implemented.  Time has come for linux
to have better threading performance.

The goal is to permit new futex commands to avoid :
 - Taking the mmap_sem semaphore, conflicting with other subsystems.
 - Modifying a ref_count on mm or an inode, still conflicting with mm or fs.

This is possible because, for one process using PTHREAD_PROCESS_PRIVATE
futexes, we only need to distinguish futexes by their virtual address, no
matter the underlying mm storage is.

If glibc wants to exploit this new infrastructure, it should use new
_PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes.  And be
prepared to fallback on old subcommands for old kernels.  Using one global
variable with the FUTEX_PRIVATE_FLAG or 0 value should be OK.

PTHREAD_PROCESS_SHARED futexes should still use the old subcommands.

Compatibility with old applications is preserved, they still hit the
scalability problems, but new applications can fly :)

Note : the same SHARED futex (mapped on a file) can be used by old binaries
*and* new binaries, because both binaries will use the old subcommands.

Note : Vast majority of futexes should be using PROCESS_PRIVATE semantic,
as this is the default semantic. Almost all applications should benefit
of this changes (new kernel and updated libc)

Some bench results on a Pentium M 1.6 GHz (SMP kernel on a UP machine)

/* calling futex_wait(addr, value) with value != *addr */
433 cycles per futex(FUTEX_WAIT) call (mixing 2 futexes)
424 cycles per futex(FUTEX_WAIT) call (using one futex)
334 cycles per futex(FUTEX_WAIT_PRIVATE) call (mixing 2 futexes)
334 cycles per futex(FUTEX_WAIT_PRIVATE) call (using one futex)
For reference :
187 cycles per getppid() call
188 cycles per umask() call
181 cycles per ni_syscall() call
Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
Pierre Peiffer <pierre.peiffer@bull.net>
Cc: "Ulrich Drepper" <drepper@gmail.com>
Cc: "Nick Piggin" <nickpiggin@yahoo.com.au>
Cc: "Ingo Molnar" <mingo@elte.hu>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

34f01cc1

futex_requeue_pi optimization · d0aa7a70

由 Pierre Peiffer 提交于 5月 09, 2007

This patch provides the futex_requeue_pi functionality, which allows some
threads waiting on a normal futex to be requeued on the wait-queue of a
PI-futex.

This provides an optimization, already used for (normal) futexes, to be used
with the PI-futexes.

This optimization is currently used by the glibc in pthread_broadcast, when
using "normal" mutexes.  With futex_requeue_pi, it can be used with
PRIO_INHERIT mutexes too.
Signed-off-by: NPierre Peiffer <pierre.peiffer@bull.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d0aa7a70

Make futex_wait() use an hrtimer for timeout · c19384b5

由 Pierre Peiffer 提交于 5月 09, 2007

This patch modifies futex_wait() to use an hrtimer + schedule() in place of
schedule_timeout().

schedule_timeout() is tick based, therefore the timeout granularity is the
tick (1 ms, 4 ms or 10 ms depending on HZ).  By using a high resolution timer
for timeout wakeup, we can attain a much finer timeout granularity (in the
microsecond range).  This parallels what is already done for futex_lock_pi().

The timeout passed to the syscall is no longer converted to jiffies and is
therefore passed to do_futex() and futex_wait() as an absolute ktime_t
therefore keeping nanosecond resolution.

Also this removes the need to pass the nanoseconds timeout part to
futex_lock_pi() in val2.

In futex_wait(), if there is no timeout then a regular schedule() is
performed.  Otherwise, an hrtimer is fired before schedule() is called.

[akpm@linux-foundation.org: fix `make headers_check']
Signed-off-by: NSebastien Dugue <sebastien.dugue@bull.net>
Signed-off-by: NPierre Peiffer <pierre.peiffer@bull.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c19384b5

declare struct ktime · f34c506b

由 Andrew Morton 提交于 5月 09, 2007

Some smarty went and inflicted ktime_t as a typedef upon us, so we cannot
forward declare it.

Create a new `union ktime', map ktime_t onto that.  Now we need to kill off
this ktime_t thing.

Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f34c506b

aio is unlikely · b8522ead

由 Andrew Morton 提交于 5月 09, 2007

Stick an unlikely() around is_aio(): I assert that most IO is synchronous.

Cc: Suparna Bhattacharya <suparna@in.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b8522ead

RPC: add wrapper for svc_reserve to account for checksum · cd123012

由 Jeff Layton 提交于 5月 09, 2007

When the kernel calls svc_reserve to downsize the expected size of an RPC
reply, it fails to account for the possibility of a checksum at the end of
the packet.  If a client mounts a NFSv2/3 with sec=krb5i/p, and does I/O
then you'll generally see messages similar to this in the server's ring
buffer:

RPC request reserved 164 but used 208

While I was never able to verify it, I suspect that this problem is also
the root cause of some oopses I've seen under these conditions:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=227726

This is probably also a problem for other sec= types and for NFSv4.  The
large reserved size for NFSv4 compound packets seems to generally paper
over the problem, however.

This patch adds a wrapper for svc_reserve that accounts for the possibility
of a checksum.  It also fixes up the appropriate callers of svc_reserve to
call the wrapper.  For now, it just uses a hardcoded value that I
determined via testing.  That value may need to be revised upward as things
change, or we may want to eventually add a new auth_op that attempts to
calculate this somehow.

Unfortunately, there doesn't seem to be a good way to reliably determine
the expected checksum length prior to actually calculating it, particularly
with schemes like spkm3.
Signed-off-by: NJeff Layton <jlayton@redhat.com>
Acked-by: NNeil Brown <neilb@suse.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Acked-by: NJ. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

cd123012

knfsd: rename sk_defer_lock to sk_lock · 7ac1bea5

由 NeilBrown 提交于 5月 09, 2007

Now that sk_defer_lock protects two different things, make the name more
generic.

Also don't bother with disabling _bh as the lock is only ever taken from
process context.
Signed-off-by: NNeil Brown <neilb@suse.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

7ac1bea5

remove nfs4_acl_add_ace() · 8842c965

由 Adrian Bunk 提交于 5月 09, 2007

nfs4_acl_add_ace() can now be removed.
Signed-off-by: NAdrian Bunk <bunk@stusta.de>
Acked-by: NNeil Brown <neilb@cse.unsw.edu.au>
Acked-by: NJ. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8842c965

change kernel threads to ignore signals instead of blocking them · 10ab825b

由 Oleg Nesterov 提交于 5月 09, 2007

Currently kernel threads use sigprocmask(SIG_BLOCK) to protect against
signals.  This doesn't prevent the signal delivery, this only blocks
signal_wake_up().  Every "killall -33 kthreadd" means a "struct siginfo"
leak.

Change kthreadd_setup() to set all handlers to SIG_IGN instead of blocking
them (make a new helper ignore_signals() for that).  If the kernel thread
needs some signal, it should use allow_signal() anyway, and in that case it
should not use CLONE_SIGHAND.

Note that we can't change daemonize() (should die!) in the same way,
because it can be used along with CLONE_SIGHAND.  This means that
allow_signal() still should unblock the signal to work correctly with
daemonize()ed threads.

However, disallow_signal() doesn't block the signal any longer but ignores
it.

NOTE: with or without this patch the kernel threads are not protected from
handle_stop_signal(), this seems harmless, but not good.
Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

10ab825b

kthread: don't depend on work queues · 73c27992

由 Eric W. Biederman 提交于 5月 09, 2007

Currently there is a circular reference between work queue initialization
and kthread initialization. This prevents the kthread infrastructure from
initializing until after work queues have been initialized.

We want the properties of tasks created with kthread_create to be as close
as possible to the init_task and to not be contaminated by user processes.
The later we start our kthreadd that creates these tasks the harder it is
to avoid contamination from user processes and the more of a mess we have
to clean up because the defaults have changed on us.

So this patch modifies the kthread support to not use work queues but to
instead use a simple list of structures, and to have kthreadd start from
init_task immediately after our kernel thread that execs /sbin/init.

By being a true child of init_task we only have to change those process
settings that we want to have different from init_task, such as our process
name, the cpus that are allowed, blocking all signals and setting SIGCHLD
to SIG_IGN so that all of our children are reaped automatically.

By being a true child of init_task we also naturally get our ppid set to 0
and do not wind up as a child of PID == 1. Ensuring that tasks generated
by kthread_create will not slow down the functioning of the wait family of
functions.

[akpm@linux-foundation.org: use interruptible sleeps]
Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

73c27992