1. 20 July 2007, 16 commits
    • F
      readahead: on-demand readahead logic · 122a21d1
      Committed by Fengguang Wu
      This is a minimal readahead algorithm that aims to replace the current one.
      It is more flexible and reliable, while maintaining almost the same behavior
      and performance.  It is also fully integrated with adaptive readahead.
      
      It is designed to be called on demand:
      	- on a missing page, to do synchronous readahead
      	- on a lookahead page, to do asynchronous readahead
      
      In this way it eliminates the awkward workarounds for cache hit/miss,
      readahead thrashing, retried reads, and unaligned reads.  It also adopts the
      data structure introduced by adaptive readahead, parameterizes readahead
      pipelining with `lookahead_index', and reduces the current/ahead windows to
      a single window.
      
      HEURISTICS
      
      The logic deals with four cases:
      
      	- sequential-next
      		found a consistent readahead window, so push it forward
      
      	- random
      		standalone small read, so read as is
      
      	- sequential-first
      		create a new readahead window for a sequential/oversize request
      
      	- lookahead-clueless
      		hit a lookahead page not associated with the readahead window,
      		so create a new readahead window and ramp it up
      
      In each case, three parameters are determined:
      
      	- readahead index: where the next readahead begins
      	- readahead size:  how much to readahead
      	- lookahead size:  when to do the next readahead (for pipelining)
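
      As a rough illustration of how these cases and parameters fit together
      (a simplified sketch only, not the code in this patch; the field names, the
      capped() helper and the specific size-ramping multipliers are assumptions
      chosen for clarity):

      	struct ra_window {
      		unsigned long start;		/* readahead index */
      		unsigned long size;		/* readahead size */
      		unsigned long lookahead;	/* lookahead size */
      	};

      	static unsigned long capped(unsigned long v, unsigned long max)
      	{
      		return v < max ? v : max;
      	}

      	static void ondemand_readahead_sketch(struct ra_window *ra,
      				unsigned long index, unsigned long req_size,
      				unsigned long max, int hit_lookahead)
      	{
      		if (index == ra->start + ra->size - ra->lookahead ||
      		    index == ra->start + ra->size) {
      			/* sequential-next: push the window forward */
      			ra->start += ra->size;
      			ra->size = capped(2 * ra->size, max);
      			ra->lookahead = ra->size;
      		} else if (hit_lookahead) {
      			/* lookahead-clueless: new window, ramped up */
      			ra->start = index;
      			ra->size = capped(4 * req_size, max);
      			ra->lookahead = ra->size;
      		} else if (index == 0 || req_size > max) {
      			/* sequential-first: new window for a sequential or
      			 * oversize request */
      			ra->start = index;
      			ra->size = capped(2 * req_size, max);
      			ra->lookahead = ra->size > req_size ?
      					ra->size - req_size : 0;
      		} else {
      			/* random: standalone small read, read as is */
      			ra->start = index;
      			ra->size = req_size;
      			ra->lookahead = 0;
      		}
      	}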
      
      BEHAVIORS
      
      The old behaviors are maximally preserved for trivial sequential/random reads.
      Notable changes are:
      
      	- It no longer imposes strict sequential checks.
      	  It might help some interleaved cases, and clustered random reads.
      	  It does introduce the risk of a random lookahead hit triggering an
      	  unexpected readahead, but in general it is more likely to do good
      	  than to do evil.
      
      	- Interleaved reads are supported in a minimal way.
      	  Their chances of being detected and properly handled are still low.
      
      	- Readahead thrashing is better handled.
      	  The current readahead leads to tiny average I/O sizes, because it
      	  never turns back for the thrashed pages.  They have to be faulted in
      	  by do_generic_mapping_read() one by one, whereas the on-demand
      	  readahead will redo readahead for them.
      
      OVERHEADS
      
      The new code reduces the overhead of
      
      	- excessively calling the readahead routine on small sized reads
      	  (the current readahead code insists on seeing all requests)
      
      	- doing a lot of pointless page-cache lookups for small cached files
      	  (the current readahead only turns itself off after 256 cache hits;
      	  unfortunately most files are < 1MB, so they never get that chance)
      
      That accounts for speedups of
      	- 0.3% on 1-page sequential reads on sparse file
      	- 1.2% on 1-page cache hot sequential reads
      	- 3.2% on 256-page cache hot sequential reads
      	- 1.3% on cache hot `tar /lib`
      
      However, it does introduce one extra page-cache lookup per cache miss, which
      impacts random reads slightly.  That's a 1% overhead for 1-page random reads
      on a sparse file.
      
      PERFORMANCE
      
      The basic benchmark setup is
      	- 2.6.20 kernel with on-demand readahead
      	- 1MB max readahead size
      	- 2.9GHz Intel Core 2 CPU
      	- 2GB memory
      	- 160G/8M Hitachi SATA II 7200 RPM disk
      
      The benchmarks show that
      	- it maintains the same performance for trivial sequential/random reads
      	- sysbench/OLTP performance on MySQL gains up to 8%
      	- performance under readahead thrashing improves by up to 3 times
      
      iozone throughput (KB/s): roughly the same
      ==========================================
      iozone -c -t1 -s 4096m -r 64k
      
      			       2.6.20          on-demand      gain
      first run
      	  "  Initial write "   61437.27        64521.53      +5.0%
      	  "        Rewrite "   47893.02        48335.20      +0.9%
      	  "           Read "   62111.84        62141.49      +0.0%
      	  "        Re-read "   62242.66        62193.17      -0.1%
      	  "   Reverse Read "   50031.46        49989.79      -0.1%
      	  "    Stride read "    8657.61         8652.81      -0.1%
      	  "    Random read "   13914.28        13898.23      -0.1%
      	  " Mixed workload "   19069.27        19033.32      -0.2%
      	  "   Random write "   14849.80        14104.38      -5.0%
      	  "         Pwrite "   62955.30        65701.57      +4.4%
      	  "          Pread "   62209.99        62256.26      +0.1%
      
      second run
      	  "  Initial write "   60810.31        66258.69      +9.0%
      	  "        Rewrite "   49373.89        57833.66     +17.1%
      	  "           Read "   62059.39        62251.28      +0.3%
      	  "        Re-read "   62264.32        62256.82      -0.0%
      	  "   Reverse Read "   49970.96        50565.72      +1.2%
      	  "    Stride read "    8654.81         8638.45      -0.2%
      	  "    Random read "   13901.44        13949.91      +0.3%
      	  " Mixed workload "   19041.32        19092.04      +0.3%
      	  "   Random write "   14019.99        14161.72      +1.0%
      	  "         Pwrite "   64121.67        68224.17      +6.4%
      	  "          Pread "   62225.08        62274.28      +0.1%
      
      In summary, writes are unstable, reads are pretty close on average:
      
      			  access pattern  2.6.20  on-demand   gain
      				   Read  62085.61  62196.38  +0.2%
      				Re-read  62253.49  62224.99  -0.0%
      			   Reverse Read  50001.21  50277.75  +0.6%
      			    Stride read   8656.21   8645.63  -0.1%
      			    Random read  13907.86  13924.07  +0.1%
      	 		 Mixed workload  19055.29  19062.68  +0.0%
      				  Pread  62217.53  62265.27  +0.1%
      
      aio-stress: roughly the same
      ============================
      aio-stress -l -s4096 -r128 -t1 -o1 knoppix511-dvd-cn.iso
      aio-stress -l -s4096 -r128 -t1 -o3 knoppix511-dvd-cn.iso
      
      					2.6.20      on-demand  delta
      			sequential	 92.57s      92.54s    -0.0%
      			random		311.87s     312.15s    +0.1%
      
      sysbench fileio: roughly the same
      =================================
      sysbench --test=fileio --file-io-mode=async --file-test-mode=rndrw \
      	 --file-total-size=4G --file-block-size=64K \
      	 --num-threads=001 --max-requests=10000 --max-time=900 run
      
      				threads    2.6.20   on-demand    delta
      		first run
      				      1   59.1974s    59.2262s  +0.0%
      				      2   58.0575s    58.2269s  +0.3%
      				      4   48.0545s    47.1164s  -2.0%
      				      8   41.0684s    41.2229s  +0.4%
      				     16   35.8817s    36.4448s  +1.6%
      				     32   32.6614s    32.8240s  +0.5%
      				     64   23.7601s    24.1481s  +1.6%
      				    128   24.3719s    23.8225s  -2.3%
      				    256   23.2366s    22.0488s  -5.1%
      
      		second run
      				      1   59.6720s    59.5671s  -0.2%
      				      8   41.5158s    41.9541s  +1.1%
      				     64   25.0200s    23.9634s  -4.2%
      				    256   22.5491s    20.9486s  -7.1%
      
      Note that the numbers are not very stable because of the writes.
      The overall performance is close when we sum all seconds up:
      
                      sum all up               495.046s    491.514s   -0.7%
      
      sysbench oltp (trans/sec): up to 8% gain
      ========================================
      sysbench --test=oltp --oltp-table-size=10000000 --oltp-read-only \
      	 --mysql-socket=/var/run/mysqld/mysqld.sock \
      	 --mysql-user=root --mysql-password=readahead \
      	 --num-threads=064 --max-requests=10000 --max-time=900 run
      
      	10000-transactions run
      				threads    2.6.20   on-demand    gain
      				      1     62.81       64.56   +2.8%
      				      2     67.97       70.93   +4.4%
      				      4     81.81       85.87   +5.0%
      				      8     94.60       97.89   +3.5%
      				     16     99.07      104.68   +5.7%
      				     32     95.93      104.28   +8.7%
      				     64     96.48      103.68   +7.5%
      	5000-transactions run
      				      1     48.21       48.65   +0.9%
      				      8     68.60       70.19   +2.3%
      				     64     70.57       74.72   +5.9%
      	2000-transactions run
      				      1     37.57       38.04   +1.3%
      				      2     38.43       38.99   +1.5%
      				      4     45.39       46.45   +2.3%
      				      8     51.64       52.36   +1.4%
      				     16     54.39       55.18   +1.5%
      				     32     52.13       54.49   +4.5%
      				     64     54.13       54.61   +0.9%
      
      These are interesting results.  Some investigation shows that
      	- MySQL is accessing the db file non-uniformly: some parts are
      	  hotter than others
      	- It is mostly doing 4-page random reads, and sometimes doing two
      	  reads in a row; the latter triggers a 16-page readahead.
      	- The on-demand readahead leaves many lookahead pages (flagged
      	  PG_readahead) there.  Many of them will be hit, and trigger
      	  more readahead pages, which might save more seeks.
      	- Naturally, the readahead windows tend to lie in hot areas,
      	  and the lookahead pages in hot areas are more likely to be hit.
      	- The higher the overall read density, the greater the possible gain.
      
      That also explains the adaptive readahead tricks for clustered random reads.
      
      readahead thrashing: 3 times better
      ===================================
      We boot the kernel with "mem=128m single", and start a 100KB/s stream every
      second, until reaching 200 streams.
      
      			      max throughput     min avg I/O size
      		2.6.20:            5MB/s               16KB
      		on-demand:        15MB/s              140KB
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Steven Pratt <slpratt@austin.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      122a21d1
    • F
      readahead: data structure and routines · 5ce1110b
      Committed by Fengguang Wu
      Extend struct file_ra_state to support the on-demand readahead logic.  Also
      define some helpers for it.
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Steven Pratt <slpratt@austin.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5ce1110b
    • F
      readahead: introduce PG_readahead · d77c2d7c
      Committed by Fengguang Wu
      Introduce a new page flag: PG_readahead.
      
      It acts as a look-ahead mark, which tells the page reader: Hey, it's time to
      invoke the read-ahead logic.  For the sake of I/O pipelining, don't wait until
      it runs out of cached pages!
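
      As a rough illustration (a sketch, not the exact code in this series,
      assuming the usual PageReadahead()/ClearPageReadahead() accessors that come
      with a new page flag, and a hypothetical do_async_readahead() helper), the
      read path can react to the mark like this:

      	if (PageReadahead(page)) {
      		ClearPageReadahead(page);
      		/* We hit the look-ahead mark: start the next asynchronous
      		 * readahead now, while cached pages are still plentiful. */
      		do_async_readahead(mapping, ra, filp, page->index);
      	}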
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Steven Pratt <slpratt@austin.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d77c2d7c
    • D
      AIO sparse fix (type of ki_flags) · 2ba2d003
      Committed by David Brownell
      Fix a type issue reported by the latest 'sparse': kiocb.ki_flags should be
      "unsigned long" (not "long"), to match the bitop type signature.
      Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2ba2d003
    • A
      unregister_chrdev() return void · e53252d9
      Committed by Akinobu Mita
      unregister_chrdev() does not return a meaningful value.  This patch makes it
      return void, like most unregister_* functions.
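
      Callers simply drop the error check (illustrative fragment; MY_MAJOR and the
      device name are placeholders):

      	/* before: if (unregister_chrdev(MY_MAJOR, "mydev")) warn(...); */
      	unregister_chrdev(MY_MAJOR, "mydev");	/* cannot fail any more */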
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e53252d9
    • P
      PM: Integrate beeping flag with existing acpi_sleep flags · 77afcf78
      Committed by Pavel Machek
      Move "debug during resume from s2ram" into the variable we already use
      for real-mode flags to simplify code. It also closes nasty trap for
      the user in acpi_sleep_setup; order of parameters actually mattered there,
      acpi_sleep=s3_bios,s3_mode doing something different from
      acpi_sleep=s3_mode,s3_bios.
      Signed-off-by: Pavel Machek <pavel@suse.cz>
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      77afcf78
    • N
      PM: Optional beeping during resume from suspend to RAM · 5a60d623
      Committed by Nigel Cunningham
      Add a feature allowing the user to make the system beep during a resume from
      suspend to RAM, on x86_64 and i386.
      
      This is useful for users with broken resume from RAM, so that they can
      verify whether control reaches the kernel after a wake-up event.
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5a60d623
    • R
      PM: Introduce pm_power_off_prepare · bd804eba
      Committed by Rafael J. Wysocki
      Introduce the pm_power_off_prepare() callback that can be registered by the
      interested platforms in analogy with pm_idle() and pm_power_off(), used for
      preparing the system to power off (needed by ACPI).
      
      This allows us to drop acpi_sysclass and device_acpi, which were only defined
      in order to register the ACPI power-off preparation callback, needed by
      pm_power_off(), itself registered in a quite different way.
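
      A minimal sketch of how a platform might hook the new callback, by analogy
      with pm_power_off (the function name and its body are hypothetical):

      	#include <linux/pm.h>

      	static void my_power_off_prepare(void)
      	{
      		/* put the firmware/platform into its "about to power off" state */
      	}

      	static int __init my_platform_pm_init(void)
      	{
      		pm_power_off_prepare = my_power_off_prepare;
      		return 0;
      	}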
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Acked-by: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bd804eba
    • R
      PM: introduce hibernation and suspend notifiers · b10d9117
      Committed by Rafael J. Wysocki
      Make it possible to register hibernation and suspend notifiers, so that
      subsystems can perform hibernation-related or suspend-related operations that
      should not be carried out by device drivers' .suspend() and .resume()
      routines.
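
      A minimal sketch of a subsystem using the new notifiers (the callback name
      and the work done in it are made up for illustration):

      	#include <linux/notifier.h>
      	#include <linux/suspend.h>

      	static int my_pm_callback(struct notifier_block *nb,
      				  unsigned long action, void *data)
      	{
      		switch (action) {
      		case PM_HIBERNATION_PREPARE:
      			/* grab resources before drivers are suspended */
      			break;
      		case PM_POST_HIBERNATION:
      			/* undo the above once hibernation is over */
      			break;
      		}
      		return NOTIFY_DONE;
      	}

      	static struct notifier_block my_pm_nb = {
      		.notifier_call = my_pm_callback,
      	};

      	static int __init my_subsys_pm_init(void)
      	{
      		return register_pm_notifier(&my_pm_nb);
      	}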
      
      [akpm@linux-foundation.org: build fixes]
      [akpm@linux-foundation.org: cleanups]
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Acked-by: Pavel Machek <pavel@ucw.cz>
      Cc: Nigel Cunningham <nigel@nigel.suspend2.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b10d9117
    • R
      Freezer: avoid freezing kernel threads prematurely · 0c1eecfb
      Committed by Rafael J. Wysocki
      Kernel threads should not have TIF_FREEZE set when user space processes are
      being frozen, since otherwise some of them might be frozen prematurely.
      To prevent this from happening we can (1) make exit_mm() unset TIF_FREEZE
      unconditionally just after clearing tsk->mm and (2) make try_to_freeze_tasks()
      check if p->mm is different from zero and PF_BORROWED_MM is unset in p->flags
      when user space processes are to be frozen.
      
      Namely, when user space processes are being frozen, we should only set
      TIF_FREEZE for tasks that have p->mm different from NULL and don't have
      PF_BORROWED_MM set in p->flags.  For this reason task_lock() must be used to
      prevent try_to_freeze_tasks() from racing with use_mm()/unuse_mm(), in which
      p->mm and p->flags.PF_BORROWED_MM are changed under task_lock(p).  Also, we
      need to prevent the following scenario from happening:
      
      * daemonize() is called by a task spawned from a user space code path
      * freezer checks if the task has p->mm set and the result is positive
      * task enters exit_mm() and clears its TIF_FREEZE
      * freezer sets TIF_FREEZE for the task
      * task calls try_to_freeze() and goes to the refrigerator, which is wrong at
        that point
      
      This requires us to acquire task_lock(p) before p->flags.PF_BORROWED_MM and
      p->mm are examined and release it after TIF_FREEZE is set for p (or it turns
      out that TIF_FREEZE should not be set).
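
      Schematically (a simplified sketch of the check described above, not the
      exact patch; the function name is illustrative):

      	static int freeze_userspace_task(struct task_struct *p)
      	{
      		int freeze;

      		task_lock(p);
      		/* only tasks with a real mm, and without PF_BORROWED_MM,
      		 * count as user space here */
      		freeze = p->mm && !(p->flags & PF_BORROWED_MM);
      		if (freeze)
      			set_tsk_thread_flag(p, TIF_FREEZE);
      		task_unlock(p);

      		return freeze;
      	}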
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Gautham R Shenoy <ego@in.ibm.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Nigel Cunningham <nigel@nigel.suspend2.net>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0c1eecfb
    • R
      swsusp: introduce restore platform operations · a634cc10
      Committed by Rafael J. Wysocki
      At least on some machines it is necessary to prepare the ACPI firmware for the
      restoration of the system memory state from the hibernation image if the
      "platform" mode of hibernation has been used.  Namely, in that cases we need
      to disable the GPEs before replacing the "boot" kernel with the "frozen"
      kernel (cf.  http://bugzilla.kernel.org/show_bug.cgi?id=7887).  After the
      restore they will be re-enabled by hibernation_ops->finish(), but if the
      restore fails, they have to be re-enabled by the restore code explicitly.
      
      For this purpose we can introduce two additional hibernation operations,
      called pre_restore() and restore_cleanup() and call them from the restore code
      path.  Still, they should be called if the "platform" mode of hibernation has
      been used, so we need to pass the information about the hibernation mode from
      the "frozen" kernel to the "boot" kernel in the image header.
      
      Apparently, we can't drop the disabling of GPEs before the restore because of
      Bug #7887.  We also can't do it unconditionally, because the GPEs wouldn't
      have been enabled after a successful restore if the suspend had been done in
      the 'shutdown' or 'reboot' mode.
      
      In principle we could (and probably should) unconditionally disable the GPEs
      before each snapshot creation *and* before the restore, but then we'd have to
      unconditionally enable them after the snapshot creation as well as after the
      restore (or restore failure).  Still, for this purpose we'd need to modify
      acpi_enter_sleep_state_prep() and acpi_leave_sleep_state(), and we'd have to
      introduce some mechanism synchronizing the disabling/enabling of the GPEs with
      the device drivers' .suspend()/.resume() routines and with
      disable_/enable_nonboot_cpus().  However, this would have affected the
      suspend (i.e. s2ram) code as well as the hibernation, which I'd like to avoid
      in this patch series.
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Nigel Cunningham <nigel@nigel.suspend2.net>
      Cc: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a634cc10
    • M
      Remove alloc_zeroed_user_highpage() · bb2d5ce1
      Committed by Mel Gorman
      alloc_zeroed_user_highpage() has no in-tree users and it is not exported.
      As it is not exported, it can simply be removed.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb2d5ce1
    • N
      mm: fault feedback #2 · 83c54070
      Committed by Nick Piggin
      This patch completes Linus's wish that the fault return codes be made into
      bit flags, which I agree makes everything nicer.  This requires all
      handle_mm_fault callers to be modified (possibly the modifications should go
      further and do things like fault accounting in handle_mm_fault -- however
      that would be for another patch).
      
      [akpm@linux-foundation.org: fix alpha build]
      [akpm@linux-foundation.org: fix s390 build]
      [akpm@linux-foundation.org: fix sparc build]
      [akpm@linux-foundation.org: fix sparc64 build]
      [akpm@linux-foundation.org: fix ia64 build]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Ian Molton <spyro@f2s.com>
      Cc: Bryan Wu <bryan.wu@analog.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Roman Zippel <zippel@linux-m68k.org>
      Cc: Greg Ungerer <gerg@uclinux.org>
      Cc: Matthew Wilcox <willy@debian.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
      Cc: Richard Curnow <rc@rc0.org.uk>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
      Cc: Chris Zankel <chris@zankel.net>
      Acked-by: Kyle McMartin <kyle@mcmartin.ca>
      Acked-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
      Acked-by: Ralf Baechle <ralf@linux-mips.org>
      Acked-by: Andi Kleen <ak@muc.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      [ Still apparently needs some ARM and PPC loving - Linus ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      83c54070
    • N
      mm: fault feedback #1 · d0217ac0
      Committed by Nick Piggin
      Change the ->fault prototype.  We now return an int, which contains the
      VM_FAULT_xxx code in the low byte, and the FAULT_RET_xxx code in the next
      byte.  The FAULT_RET_xxx code tells the VM whether a page was found, whether
      it has been locked, and potentially other things.  This is not quite the way
      Linus wanted it yet, but that's changed in the next patch (which requires
      changes to arch code).
      
      This means we no longer set VM_CAN_INVALIDATE in the vma in order to say
      that a page is locked which requires filemap_nopage to go away (because we
      can no longer remain backward compatible without that flag), but we were
      going to do that anyway.
      
      struct fault_data is renamed to struct vm_fault as Linus asked. address
      is now a void __user * that we should firmly encourage drivers not to use
      without really good reason.
      
      The page is now returned via a page pointer in the vm_fault struct.
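
      Roughly, the structure described above looks like this (a sketch; field order
      and any fields beyond those mentioned here may differ from the patch):

      	struct vm_fault {
      		unsigned int flags;		/* FAULT_FLAG_xxx */
      		pgoff_t pgoff;			/* logical offset of the fault */
      		void __user *virtual_address;	/* faulting address; drivers
      						 * should rarely need this */
      		struct page *page;		/* set by ->fault on success */
      	};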
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d0217ac0
    • N
      mm: merge populate and nopage into fault (fixes nonlinear) · 54cb8821
      Committed by Nick Piggin
      Nonlinear mappings are (AFAIKS) simply a virtual memory concept that encodes
      the virtual address -> file offset differently from linear mappings.
      
      ->populate is a layering violation, because the filesystem/pagecache code
      shouldn't need to know anything about the virtual memory mapping.  The hitch
      here is that the ->nopage handler didn't pass down enough information
      (i.e. pgoff).  But it is more logical to pass pgoff rather than have the
      ->nopage function calculate it itself anyway (because that's a similar
      layering violation).
      
      Having the populate handler install the pte itself is likewise a nasty thing
      to be doing.
      
      This patch introduces a new fault handler that replaces ->nopage and
      ->populate and (later) ->nopfn.  Most of the old mechanism is still in place
      so there is a lot of duplication and nice cleanups that can be removed if
      everyone switches over.
      
      The rationale for doing this in the first place is that nonlinear mappings are
      subject to the pagefault vs invalidate/truncate race too, and it seemed stupid
      to duplicate the synchronisation logic rather than just consolidate the two.
      
      After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in
      pagecache.  That seems like fringe functionality anyway.
      
      NOPAGE_REFAULT is removed.  This should be implemented with ->fault, and no
      users have hit mainline yet.
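
      A minimal sketch of the new style (illustrative only; a real pagecache-backed
      implementation also handles readahead and the page locking discussed in the
      related patches in this series):

      	static int example_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
      	{
      		struct page *page;

      		page = find_get_page(vma->vm_file->f_mapping, vmf->pgoff);
      		if (!page)
      			return VM_FAULT_SIGBUS;	/* or read it in first */

      		vmf->page = page;	/* hand the page back via vm_fault */
      		return 0;
      	}

      	static struct vm_operations_struct example_vm_ops = {
      		.fault	= example_fault,	/* replaces .nopage/.populate */
      	};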
      
      [akpm@linux-foundation.org: cleanup]
      [randy.dunlap@oracle.com: doc. fixes for readahead]
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Mark Fasheh <mark.fasheh@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      54cb8821
    • N
      mm: fix fault vs invalidate race for linear mappings · d00806b1
      Committed by Nick Piggin
      Fix the race between invalidate_inode_pages and do_no_page.
      
      Andrea Arcangeli identified a subtle race between invalidation of pages from
      pagecache with userspace mappings, and do_no_page.
      
      The issue is that invalidation has to shoot down all mappings to the page,
      before it can be discarded from the pagecache.  Between shooting down ptes to
      a particular page, and actually dropping the struct page from the pagecache,
      do_no_page from any process might fault on that page and establish a new
      mapping to the page just before it gets discarded from the pagecache.
      
      The most common case where such invalidation is used is in file truncation.
      This case was catered for by doing a sort of open-coded seqlock between the
      file's i_size, and its truncate_count.
      
      Truncation will decrease i_size, then increment truncate_count before
      unmapping userspace pages; do_no_page will read truncate_count, then find the
      page if it is within i_size, and then check truncate_count under the page
      table lock and back out and retry if it had subsequently been changed (ptl
      will serialise against unmapping, and ensure a potentially updated
      truncate_count is actually visible).
      
      Complexity and documentation issues aside, the locking protocol fails in the
      case where we would like to invalidate pagecache inside i_size.  do_no_page
      can come in anytime and filemap_nopage is not aware of the invalidation in
      progress (as it is when it is outside i_size).  The end result is that
      dangling (->mapping == NULL) pages that appear to be from a particular file
      may be mapped into userspace with nonsense data.  Valid mappings to the same
      place will see a different page.
      
      Andrea implemented two working fixes, one using a real seqlock, another using
      a page->flags bit.  He also proposed using the page lock in do_no_page, but
      that was initially considered too heavyweight.  However, it is not a global or
      per-file lock, and the page cacheline is modified in do_no_page to increment
      _count and _mapcount anyway, so a further modification should not be a large
      performance hit.  Scalability is not an issue.
      
      This patch implements this latter approach.  ->nopage implementations return
      with the page locked if it is possible for their underlying file to be
      invalidated (in that case, they must set a special vm_flags bit to indicate
      so).  do_no_page only unlocks the page after setting up the mapping
      completely.  Invalidation is excluded because it holds the page lock during
      invalidation of each page (and ensures that the page is not mapped while
      holding the lock).
      
      This also allows significant simplifications in do_no_page, because we have
      the page locked in the right place in the pagecache from the start.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d00806b1
  2. 19 July 2007, 9 commits
  3. 18 July 2007, 15 commits
    • J
      xen: add virtual block device driver. · 9f27ee59
      Committed by Jeremy Fitzhardinge
      The block device frontend driver allows the kernel to access block
      devices exported by a virtual machine containing a physical
      block device driver.
      Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
      Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Greg KH <greg@kroah.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      9f27ee59
    • J
      xen: add pinned page flag · c85b04c3
      Committed by Jeremy Fitzhardinge
      Add a new definition for PG_owner_priv_1 to define PG_pinned on Xen
      pagetable pages.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      c85b04c3
    • J
      Allocate and free vmalloc areas · 5f4352fb
      Committed by Jeremy Fitzhardinge
      Allocate/release a chunk of vmalloc address space:
       alloc_vm_area reserves a chunk of address space, and makes sure all
       the pagetables are constructed for that address range - but no pages.
      
       free_vm_area releases the address space range.
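
      A minimal usage sketch (illustrative only; error handling trimmed):

      	#include <linux/vmalloc.h>

      	static struct vm_struct *area;

      	static int example_init(void)
      	{
      		area = alloc_vm_area(PAGE_SIZE);	/* pagetables, no pages */
      		if (!area)
      			return -ENOMEM;
      		/* area->addr can now be handed to code (e.g. a hypervisor
      		 * mapping call) that supplies the actual pages */
      		return 0;
      	}

      	static void example_exit(void)
      	{
      		free_vm_area(area);			/* release the range */
      	}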
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
      Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Cc: "Jan Beulich" <JBeulich@novell.com>
      Cc: "Andi Kleen" <ak@muc.de>
      5f4352fb
    • J
      use elfnote.h to generate vsyscall notes. · 810bab44
      Committed by Jeremy Fitzhardinge
      Use existing elfnote.h to generate vsyscall notes, rather than doing
      it locally.  Changes elfnote.h a bit to suit, since this is the first
      asm user, and it wasn't quite right.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.com>
      810bab44
    • J
      usermodehelper: Tidy up waiting · 86313c48
      Committed by Jeremy Fitzhardinge
      Rather than using a tri-state integer for the wait flag in
      call_usermodehelper_exec, define a proper enum, and use that.  I've
      preserved the integer values so that any callers I've missed should
      still work OK.
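
      The enum looks roughly like this (the old tri-state integer values are kept,
      so this is a sketch of the shape rather than a guaranteed-verbatim copy):

      	enum umh_wait {
      		UMH_NO_WAIT	= -1,	/* don't wait at all */
      		UMH_WAIT_EXEC	= 0,	/* wait for the exec, not the process */
      		UMH_WAIT_PROC	= 1,	/* wait for the process to complete */
      	};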
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: David Howells <dhowells@redhat.com>
      86313c48
    • J
      Add common orderly_poweroff() · 10a0a8d4
      Committed by Jeremy Fitzhardinge
      Various pieces of code around the kernel want to be able to trigger an
      orderly poweroff.  This pulls them together into a single
      implementation.
      
      By default the poweroff command is /sbin/poweroff, but it can be set
      via sysctl: kernel/poweroff_cmd.  This is split at whitespace, so it
      can include command-line arguments.
      
      This patch replaces four other instances of invoking either "poweroff"
      or "shutdown -h now": two sbus drivers, and acpi thermal
      management.
      
      sparc64 has its own "powerd"; we still need to determine whether it should
      be replaced by orderly_poweroff().
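
      A typical caller then reduces to something like this (illustrative fragment;
      the surrounding driver context is made up):

      	/* e.g. in a thermal driver, on a critical over-temperature event */
      	printk(KERN_EMERG "Critical temperature reached, shutting down.\n");
      	orderly_poweroff(true);	/* fall back to a forced poweroff on failure */

      The command itself can be changed at run time through the new sysctl, e.g.
      by writing "/sbin/mypoweroff" (an example name) to
      /proc/sys/kernel/poweroff_cmd.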
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Acked-by: Len Brown <lenb@kernel.org>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: David S. Miller <davem@davemloft.net>
      10a0a8d4
    • J
      usermodehelper: split setup from execution · 0ab4dc92
      Committed by Jeremy Fitzhardinge
      Rather than having hundreds of variations of call_usermodehelper for
      various pieces of usermode state which could be set up, split the
      info allocation and initialization from the actual process execution.
      
      This means the general pattern becomes:
       info = call_usermodehelper_setup(path, argv, envp); /* basic state */
       call_usermodehelper_<SET EXTRA STATE>(info, stuff...);	/* extra state */
       call_usermodehelper_exec(info, wait);	/* run process and free info */
      
      This patch introduces wrappers for all the existing calling styles for
      call_usermodehelper_*, but folds their implementations into one.
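
      A minimal concrete use of the split API, following the pattern above (a
      sketch; no extra state is set and error handling is kept trivial):

      	static int run_poweroff_helper(void)
      	{
      		char *argv[] = { "/sbin/poweroff", NULL };
      		char *envp[] = { "HOME=/",
      				 "PATH=/sbin:/bin:/usr/sbin:/usr/bin", NULL };
      		struct subprocess_info *info;

      		info = call_usermodehelper_setup(argv[0], argv, envp);
      		if (!info)
      			return -ENOMEM;

      		return call_usermodehelper_exec(info, UMH_WAIT_EXEC);
      	}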
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Björn Steinbrink <B.Steinbrink@gmx.de>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      0ab4dc92
    • J
      add argv_split() · d84d1cc7
      Committed by Jeremy Fitzhardinge
      argv_split() is a helper function which takes a string, splits it at
      whitespace, and returns a NULL-terminated argv vector.  This is
      deliberately simple - it does no quote processing of any kind.
      
      [ Seems to me that this is something which is already being done in
        the kernel, but I couldn't find any other implementations, either to
        steal or replace.  Keep an eye out. ]
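
      A small usage sketch (illustrative; the command string and GFP choice are
      examples):

      	int argc;
      	char **argv = argv_split(GFP_KERNEL, "poweroff -f", &argc);

      	if (!argv)
      		return -ENOMEM;
      	/* argv[0] = "poweroff", argv[1] = "-f", argv[2] = NULL, argc = 2 */
      	argv_free(argv);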
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      d84d1cc7
    • J
      add kstrndup · 1e66df3e
      Committed by Jeremy Fitzhardinge
      Add a kstrndup function, modelled on strndup.  Like strndup this
      returns a string copied into its own allocated memory, but it copies
      no more than the specified number of bytes from the source.
      
      Remove private strndup() from irda code.
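
      Usage is analogous to kstrdup() (illustrative fragment; the length cap and
      error handling are examples):

      	char *name = kstrndup(user_supplied, 16, GFP_KERNEL);

      	if (!name)
      		return -ENOMEM;
      	/* at most 16 bytes copied, always NUL-terminated; kfree() when done */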
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Arnaldo Carvalho de Melo <acme@mandriva.com>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Cc: Panagiotis Issaris <takis@issaris.org>
      Cc: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
      1e66df3e
    • M
      zs: move to the serial subsystem · 8b4a4080
      Committed by Maciej W. Rozycki
      This is a reimplementation of the zs driver for the serial subsystem.  Any
      resemblance to the old driver is purely coincidental.  ;-) I do hope I got
      the handling of modem lines right -- better do not tackle me about the
      issue unless you feel too good...
      
      Any users of the old driver: please note the numbers of the serial lines
      have now been swapped, i.e.  ttyS0 <-> ttyS1 and ttyS2 <-> ttyS3.  It has
      to do with the modem lines mentioned above; basically the port A in a given
      chip has to be initialised before the port B if you want to use the latter
      as the serial console (which is usually the case), as operations on modem
      lines of the serial line associated with the port B access both ports (see
      the comment at the top of the driver for the details of wiring used).
      Please update your scripts.
      
      This is also the reason each SCC now requests an IRQ once only (as seen in
      "/proc/interrupts") -- the handler takes care of both ports at once as the
      line associated with the port B has to take status update interrupts from
      both ports (and yet the line of the port A takes its own for itself too).
      The old driver never got it right...
      Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8b4a4080
    • Y
      serial: add early_serial_setup() back to header file · b187f180
      Committed by Yinghai Lu
      early_serial_setup() was removed from serial.h, but we forgot to put it in
      serial_8250.h.
      Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b187f180
    • A
      ext4: Remove 65000 subdirectory limit · f8628a14
      Committed by Andreas Dilger
      This patch adds support to ext4 for allowing more than 65000
      subdirectories. Currently the maximum number of subdirectories is capped
      at 32000.
      
      If an htree directory exceeds 65000 subdirectories, its inode link count is
      set to 1 and subdirectories are no longer counted.  The
      directory link count is not actually used when determining if a
      directory is empty, as that only counts subdirectories and not regular
      files that might be in there.

      An EXT4_FEATURE_RO_COMPAT_DIR_NLINK flag has been added and is set if
      the subdir count for any directory crosses 65000.  A later fsck will clear
      EXT4_FEATURE_RO_COMPAT_DIR_NLINK if there are no longer any directories
      with >65000 subdirs.
      Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
      Signed-off-by: Kalpak Shah <kalpak@clusterfs.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      
      f8628a14
    • K
      ext4: Expand extra_inodes space per the s_{want,min}_extra_isize fields · 6dd4ee7c
      Committed by Kalpak Shah
      We need to make sure that existing ext3 filesystems can also take advantage
      of the new fields that have been added to the ext4 inode.  We use
      s_want_extra_isize and s_min_extra_isize to decide by how much we should
      expand the inode.  If the EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE feature is set
      then we expand the inode by max(s_want_extra_isize, s_min_extra_isize,
      sizeof(ext4_inode) - EXT4_GOOD_OLD_INODE_SIZE) bytes.  It is actually
      still an open question whether users should be able to set
      s_*_extra_isize smaller than the known fields or not.
      
      This patch also adds the functionality to expand inodes to include the
      newly added fields.  We start by trying to expand by s_want_extra_isize
      bytes and if that fails we try to expand by s_min_extra_isize bytes.  This
      is done by changing the i_extra_isize if enough space is available in
      the inode and no EAs are present. If EAs are present and there is enough
      space in the inode then the EAs in the inode are shifted to make space.
      If enough space is not available in the inode due to the EAs then 1 or
      more EAs are shifted to the external EA block. In the worst case when
      even the external EA block does not have enough space we inform the user
      that some EA would need to be deleted or s_min_extra_isize would have to
      be reduced.
      Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
      Signed-off-by: Kalpak Shah <kalpak@clusterfs.com>
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      6dd4ee7c
    • K
      ext4: Add nanosecond timestamps · ef7f3835
      Committed by Kalpak Shah
      This patch adds nanosecond timestamps for ext4. This involves adding
      *time_extra fields to the ext4_inode to extend the timestamps to
      64-bits.  Creation time is also added by this patch.
      
      These extended fields will fit into an inode if the filesystem was
      formatted with large inodes (-I 256 or larger) and there are currently
      no EAs consuming all of the available space. For new inodes we always
      reserve enough space for the kernel's known extended fields, but for
      inodes created with an old kernel this might not have been the case.  So
      this patch also adds the EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE feature
      flag (ro-compat, so that older kernels can't create inodes with a smaller
      extra_isize), which indicates whether the fields fitting inside
      s_min_extra_isize are available or not.  If the expansion of inodes is
      unsuccessful then this feature will be disabled.  This feature is only
      enabled if requested by the sysadmin.
      
      None of the extended inode fields is critical for correct filesystem
      operation.
      Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
      Signed-off-by: Kalpak Shah <kalpak@clusterfs.com>
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      ef7f3835
    • J
      jbd2: Move jbd2-debug file to debugfs · 0f49d5d0
      Committed by Jose R. Santos
      The jbd2-debug file used to be located in /proc/sys/fs/jbd2-debug, but it
      incorrectly used create_proc_entry() instead of the sysctl routines, and
      no proc entry was ever created.
      
      Instead of fixing this we might as well move the jbd2-debug file to
      debugfs which would be the preferred location for this kind of tunable.
      The new location is now /sys/kernel/debug/jbd2/jbd2-debug.
      Signed-off-by: Jose R. Santos <jrs@us.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      0f49d5d0