1. 22 Jun, 2005 19 commits
    • H
      [PATCH] shmem: restore superblock info · 0edd73b3
      Hugh Dickins committed
      To improve shmem scalability, we allowed tmpfs instances which don't need
      their blocks or inodes limited not to count them, and not to allocate any
      sbinfo.  Which was okay when the only use for the sbinfo was accounting
      blocks and inodes; but since then a couple of unrelated projects extending
      tmpfs want to store other data in the sbinfo.  Whether either extension
      reaches mainline is beside the point: I'm guilty of a bad design decision,
      and should restore sbinfo to make any such future extensions easier.
      
      So, once again allocate a shmem_sb_info for every shmem/tmpfs instance, and
      now let max_blocks 0 indicate unlimited blocks, and max_inodes 0 unlimited
      inodes.  Brent Casavant verified (many months ago) that this does not
      perceptibly impact the scalability (since the unlimited sbinfo cacheline is
      repeatedly accessed but only once dirtied).
      
      And merge shmem_set_size into its sole caller shmem_remount_fs.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      0edd73b3
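The "0 means unlimited" convention from the shmem commit above can be sketched in user-space C. This is only an illustration: `sb_info_sketch` and `sketch_acct_block` are hypothetical names, not the kernel's `shmem_sb_info` or its accounting helpers, and the real code also handles locking and inodes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a per-instance superblock info. */
struct sb_info_sketch {
    unsigned long max_blocks;   /* 0 means "no block limit" */
    unsigned long free_blocks;  /* only meaningful when max_blocks != 0 */
};

/* Try to account one block; returns true if the allocation may proceed. */
static bool sketch_acct_block(struct sb_info_sketch *sbi)
{
    assert(sbi != NULL);
    if (sbi->max_blocks == 0)
        return true;            /* unlimited: read-only check, nothing is written */
    if (sbi->free_blocks == 0)
        return false;           /* limit reached */
    sbi->free_blocks--;
    return true;
}
```

In the unlimited case the structure is only read, which is consistent with the commit's remark that the cacheline is repeatedly accessed but not repeatedly dirtied.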
    • C
      [PATCH] Reduce size of huge boot per_cpu_pageset · 2caaad41
      Christoph Lameter committed
      Reduce size of the huge per_cpu_pageset structure in __initdata introduced
      into mm1 with the pageset localization patchset.  Use one specially
      configured pageset per cpu for all zones and nodes during bootup.
      
      - Avoid duplication of pageset initialization code.
      - do the adding to the pageset list before potential free_pages_bulk
        in free_hot_cold_page (otherwise we would have to hold a page
        in a pageset during the period that the boot pagesets are in use).
      - remove mistaken __cpuinitdata attribute and revert back to __initdata
        for the boot pageset. A boot pageset is not necessary for cpu hotplug.
      
      Tested for UP SMP NUMA on x86_64 (2.6.12-rc6-mm1): UP SMP NUMA
      Tested on IA64 (2.6.12-rc5-mm2): NUMA (2.6.12-rc6-mm1 broken for IA64
      because of sparsemem patches)
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      2caaad41
    • C
      [PATCH] Periodically drain non local pagesets · 4ae7c039
      Christoph Lameter committed
      The pageset array can potentially acquire a huge amount of memory on large
      NUMA systems.  For example, on a system with 512 processors and 256 nodes
      there will be 256*512 pagesets.  If each pageset only holds 5 pages then we
      are talking about 655360 pages.  With a 16K page size on IA64 this results in
      potentially 10 Gigabytes of memory being trapped in pagesets.  The typical
      cases are much less for smaller systems but there is still the potential of
      memory being trapped in off node pagesets.  Off node memory may be rarely
      used if local memory is available and so we may potentially have memory in
      seldom used pagesets without this patch.
      
      The slab allocator flushes its per cpu caches every 2 seconds.  The
      following patch flushes the off node pageset caches in the same way by
      tying into the slab flush.
      
      The patch also changes /proc/zoneinfo to include the number of pages
      currently in each pageset.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      4ae7c039
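The arithmetic in the pageset commit above can be checked with a few lines of C. The helper names are made up for this note; the inputs (256 nodes, 512 CPUs, 5 pages per pageset, 16K pages) are the ones quoted in the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Worst-case pages held in pagesets: one pageset per (node, cpu) pair. */
static uint64_t pageset_pages(uint64_t nodes, uint64_t cpus,
                              uint64_t pages_per_set)
{
    return nodes * cpus * pages_per_set;
}

/* Convert a page count to bytes for a given page size. */
static uint64_t pages_to_bytes(uint64_t pages, uint64_t page_size)
{
    assert(page_size != 0);
    return pages * page_size;
}
```

With the quoted inputs, `pageset_pages(256, 512, 5)` gives 655360 pages, and `pages_to_bytes(655360, 16 * 1024)` gives 10 * 2^30 bytes, which matches the "10 Gigabytes" figure.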
    • J
      [PATCH] add OOM debug · 578c2fd6
      Janet Morgan committed
      This patch provides more debug info when the system is OOM.  It displays
      memory stats (basically sysrq-m info) from __alloc_pages() when page
      allocation fails and during OOM kill.
      
      Thanks to Dave Jones for coming up with the idea.
      Signed-off-by: Janet Morgan <janetmor@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      578c2fd6
    • B
      [PATCH] __read_page_state(): pass unsigned long instead of unsigned · c2f29ea1
      Benjamin LaHaise committed
      By making the offset argument of __read_page_state an unsigned long instead of
      unsigned, we can avoid forcing the compiler to sign extend a usually constant
      argument.  This saves 1 instruction on x86-64.
      Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      c2f29ea1
    • B
      [PATCH] __mod_page_state(): pass unsigned long instead of unsigned · 83e5d8f7
      Benjamin LaHaise committed
      By making the offset argument of __mod_page_state an unsigned long instead
      of unsigned, we can avoid forcing the compiler to sign extend a usually
      constant argument.  This saves 1 instruction on x86-64.
      Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      83e5d8f7
    • D
      [PATCH] vm: try_to_free_pages unused argument · 1ad539b2
      Darren Hart committed
      try_to_free_pages accepts a third argument, order, but hasn't used it since
      before 2.6.0.  The following patch removes the argument and updates all the
      calls to try_to_free_pages.
      Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      1ad539b2
    • C
      [PATCH] mmap topdown fix for large stack limit, large allocation · 73219d17
      Chris Wright committed
      The topdown changes in 2.6.12-rc1 can cause large allocations with large
      stack limit to fail, despite there being space available.  The
      mmap_base-len is only valid when mmap_base >= len.  However, nothing in
      the topdown allocator checks this.  It's only (now) caught at a higher
      level, which will cause allocation to simply fail.  The following change
      restores the fallback to bottom-up path, which will allow large allocations
      with large stack limit to potentially still succeed.
      Signed-off-by: Chris Wright <chrisw@osdl.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      73219d17
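The check that the topdown commit above says was missing can be sketched as follows. This is a user-space illustration under stated assumptions: `topdown_candidate` is a hypothetical helper, not a kernel function, and it only models the arithmetic guard, not the search through existing mappings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Returns true and stores the candidate address when a top-down placement
 * just below mmap_base is arithmetically possible; returns false when the
 * caller should fall back to a bottom-up search instead.
 */
static bool topdown_candidate(unsigned long mmap_base, unsigned long len,
                              unsigned long *addr)
{
    assert(addr != NULL);
    if (len > mmap_base)
        return false;           /* mmap_base - len would wrap around */
    *addr = mmap_base - len;
    return true;
}
```

Returning `false` rather than an error mirrors the idea in the commit message: a request that cannot fit below `mmap_base` may still succeed on the bottom-up path.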
    • W
      [PATCH] Avoiding mmap fragmentation · 1363c3cd
      Wolfgang Wander committed
      Ingo recently introduced a great speedup for allocating new mmaps using the
      free_area_cache pointer which boosts the specweb SSL benchmark by 4-5% and
      causes huge performance increases in thread creation.
      
      The downside of this patch is that it does lead to fragmentation in the
      mmap-ed areas (visible via /proc/self/maps), such that some applications
      that work fine under 2.4 kernels quickly run out of memory on any 2.6
      kernel.
      
      The problem is twofold:
      
        1) the free_area_cache is used to continue a search for memory where
           the last search ended.  Before the change new areas were always
           searched from the base address on.
      
           So now new small areas are cluttering holes of all sizes
           throughout the whole mmap-able region whereas before small areas
           tended to close holes near the base leaving holes far from the base
           large and available for larger requests.
      
        2) the free_area_cache also is set to the location of the last
           munmap-ed area so in scenarios where we allocate e.g.  five regions of
           1K each, then free regions 4 2 3 in this order the next request for 1K
           will be placed in the position of the old region 3, whereas before we
           appended it to the still active region 1, placing it at the location
           of the old region 2.  Before we had 1 free region of 2K, now we only
           get two free regions of 1K -> fragmentation.
      
      The patch addresses these issues by introducing yet another cache descriptor
      cached_hole_size that contains the largest known hole size below the
      current free_area_cache.  If a new request comes in the size is compared
      against the cached_hole_size and if the request can be filled with a hole
      below free_area_cache the search is started from the base instead.
      
      The results look promising.  2.6.12-rc4 fragments quickly, and my
      (earlier posted) leakme.c test program terminates after 50000+ iterations
      with 96 distinct and fragmented maps in /proc/self/maps; it does, however,
      perform nicely (as expected) with thread creation: Ingo's test_str02 with
      20000 threads requires 0.7s system time.
      
      Taking out Ingo's patch (un-patch available on request) by basically
      deleting all mentions of free_area_cache from the kernel and always starting
      the search for new memory at the respective bases, we observe: leakme
      terminates successfully with 11 distinct, hardly fragmented areas in
      /proc/self/maps, but thread creation is grindingly slow: 30+s(!) system
      time for Ingo's test_str02 with 20000 threads.
      
      Now - drumroll ;-) the appended patch works fine with leakme: it ends with
      only 7 distinct areas in /proc/self/maps and also thread creation seems
      sufficiently fast with 0.71s for 20000 threads.
      Signed-off-by: Wolfgang Wander <wwc@rentec.com>
      Credit-to: "Richard Purdie" <rpurdie@rpsys.net>
      Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
      Acked-by: Ingo Molnar <mingo@elte.hu> (partly)
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      1363c3cd
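The decision described in the fragmentation commit above (compare the request size against cached_hole_size and restart from the base when a known hole is big enough) might look roughly like this. `mm_sketch` and `search_start` are hypothetical names, and resetting `cached_hole_size` before a rescan is an assumption of this sketch, not something the commit message states.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced view of the per-process allocator state. */
struct mm_sketch {
    unsigned long base;             /* start of the mmap-able region */
    unsigned long free_area_cache;  /* where the last search ended */
    unsigned long cached_hole_size; /* largest known hole below free_area_cache */
};

/* Pick the address at which a bottom-up search for `len` bytes should start. */
static unsigned long search_start(struct mm_sketch *mm, unsigned long len)
{
    assert(mm != NULL);
    if (len <= mm->cached_hole_size) {
        /* A known hole below the cache can hold the request: rescan from base. */
        mm->cached_hole_size = 0;   /* assumption: recomputed during the rescan */
        return mm->base;
    }
    /* No known hole is large enough: continue where the last search ended. */
    return mm->free_area_cache;
}
```

The fast path (continuing from `free_area_cache`) is kept for requests that no known hole can satisfy, which is what preserves the thread-creation speedup reported in the commit message.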
    • C
      [PATCH] node local per-cpu-pages · e7c8d5c9
      Christoph Lameter committed
      This patch modifies the way pagesets in struct zone are managed.
      
      Each zone has a per-cpu array of pagesets.  So any particular CPU has some
      memory in each zone structure which belongs to itself.  Even if that CPU is
      not local to that zone.
      
      So the patch relocates the pagesets for each cpu to the node that is nearest
      to the cpu instead of allocating the pagesets in the (possibly remote) target
      zone.  This means that the operations to manage pages on remote zone can be
      done with information available locally.
      
      We play a macro trick so that non-NUMA machines avoid the additional
      pointer chase on the page allocator fastpath.
      
      AIM7 benchmark on a 32 CPU SGI Altix
      
      w/o patches:
      Tasks    jobs/min  jti  jobs/min/task      real       cpu
          1      484.68  100       484.6769     12.01      1.97   Fri Mar 25 11:01:42 2005
        100    27140.46   89       271.4046     21.44    148.71   Fri Mar 25 11:02:04 2005
        200    30792.02   82       153.9601     37.80    296.72   Fri Mar 25 11:02:42 2005
        300    32209.27   81       107.3642     54.21    451.34   Fri Mar 25 11:03:37 2005
        400    34962.83   78        87.4071     66.59    588.97   Fri Mar 25 11:04:44 2005
        500    31676.92   75        63.3538     91.87    742.71   Fri Mar 25 11:06:16 2005
        600    36032.69   73        60.0545     96.91    885.44   Fri Mar 25 11:07:54 2005
        700    35540.43   77        50.7720    114.63   1024.28   Fri Mar 25 11:09:49 2005
        800    33906.70   74        42.3834    137.32   1181.65   Fri Mar 25 11:12:06 2005
        900    34120.67   73        37.9119    153.51   1325.26   Fri Mar 25 11:14:41 2005
       1000    34802.37   74        34.8024    167.23   1465.26   Fri Mar 25 11:17:28 2005
      
      with slab API changes and pageset patch:
      
      Tasks    jobs/min  jti  jobs/min/task      real       cpu
          1      485.00  100       485.0000     12.00      1.96   Fri Mar 25 11:46:18 2005
        100    28000.96   89       280.0096     20.79    150.45   Fri Mar 25 11:46:39 2005
        200    32285.80   79       161.4290     36.05    293.37   Fri Mar 25 11:47:16 2005
        300    40424.15   84       134.7472     43.19    438.42   Fri Mar 25 11:47:59 2005
        400    39155.01   79        97.8875     59.46    590.05   Fri Mar 25 11:48:59 2005
        500    37881.25   82        75.7625     76.82    730.19   Fri Mar 25 11:50:16 2005
        600    39083.14   78        65.1386     89.35    872.79   Fri Mar 25 11:51:46 2005
        700    38627.83   77        55.1826    105.47   1022.46   Fri Mar 25 11:53:32 2005
        800    39631.94   78        49.5399    117.48   1169.94   Fri Mar 25 11:55:30 2005
        900    36903.70   79        41.0041    141.94   1310.78   Fri Mar 25 11:57:53 2005
       1000    36201.23   77        36.2012    160.77   1458.31   Fri Mar 25 12:00:34 2005
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com>
      Signed-off-by: Shai Fultheim <Shai@Scalex86.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      e7c8d5c9
    • D
      [PATCH] Hugepage consolidation · 63551ae0
      David Gibson committed
      A lot of the code in arch/*/mm/hugetlbpage.c is quite similar.  This patch
      attempts to consolidate a lot of the code across the arch's, putting the
      combined version in mm/hugetlb.c.  There are a couple of uglyish hacks in
      order to convert all the hugepage archs, but the result is a very large
      reduction in the total amount of code.  It also means things like hugepage
      lazy allocation could be implemented in one place, instead of six.
      
      Tested, at least a little, on ppc64, i386 and x86_64.
      
      Notes:
      	- this patch changes the meaning of set_huge_pte() to be more
      	  analogous to set_pte()
      	- does SH4 need a special huge_ptep_get_and_clear()??
      Acked-by: William Lee Irwin <wli@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      63551ae0
    • M
      [PATCH] VM: rate limit early reclaim · 1e7e5a90
      Martin Hicks committed
      When early zone reclaim is turned on the LRU is scanned more frequently when a
      zone is low on memory.  This limits when the zone reclaim can be called by
      skipping the scan if another thread (either via kswapd or sync reclaim) is
      already reclaiming from the zone.
      Signed-off-by: Martin Hicks <mort@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      1e7e5a90
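The rate limiting described in the commit above (skip the scan when another thread is already reclaiming from the zone) can be sketched with C11 atomics. `zone_sketch`, `try_begin_reclaim` and `end_reclaim` are hypothetical names for this note, and using a compare-and-swap gate is a choice of the sketch, not necessarily how the kernel patch implements it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-zone state: nonzero while some thread is reclaiming. */
struct zone_sketch {
    atomic_int reclaim_in_progress;
};

/* Returns true if the caller may reclaim; false if it should skip the scan. */
static bool try_begin_reclaim(struct zone_sketch *z)
{
    int expected = 0;

    assert(z != NULL);
    /* Succeeds only for the first thread to flip the flag from 0 to 1. */
    return atomic_compare_exchange_strong(&z->reclaim_in_progress,
                                          &expected, 1);
}

/* Called by the thread that won try_begin_reclaim() once it is done. */
static void end_reclaim(struct zone_sketch *z)
{
    assert(z != NULL);
    atomic_store(&z->reclaim_in_progress, 0);
}
```

A caller would wrap its scan in `if (try_begin_reclaim(&zone)) { ...; end_reclaim(&zone); }` and simply fall through to the next zone otherwise.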
    • M
      [PATCH] VM: add __GFP_NORECLAIM · 0c35bbad
      Martin Hicks committed
      When using the early zone reclaim, it was noticed that allocating new pages
      that should be spread across the whole system caused eviction of local pages.
      
      This adds a new GFP flag to prevent early reclaim from happening during
      certain allocation attempts.  The example that is implemented here is for page
      cache pages.  We want page cache pages to be spread across the whole system,
      and we don't want page cache pages to evict other pages to get local memory.
      Signed-off-by: Martin Hicks <mort@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      0c35bbad
    • M
      [PATCH] VM: early zone reclaim · 753ee728
      Martin Hicks committed
      This is the core of the (much simplified) early reclaim.  The goal of this
      patch is to reclaim some easily-freed pages from a zone before falling back
      onto another zone.
      
      One of the major uses of this is NUMA machines.  With the default allocator
      behavior the allocator would look for memory in another zone, which might be
      off-node, before trying to reclaim from the current zone.
      
      This adds a zone tuneable to enable early zone reclaim.  It is selected on a
      per-zone basis and is turned on/off via syscall.
      
      Adding some extra throttling on the reclaim was also required (patch
      4/4).  Without it the machine would grind to a crawl when doing a "make -j"
      kernel build.  Even with this patch the System Time is higher on
      average, but it seems tolerable.  Here are some numbers for kernbench
      runs on a 2-node, 4cpu, 8Gig RAM Altix in the "make -j" run:
      
                            wall  user   sys  %cpu  ctx sw.  sleeps
                            ----  ----   ---  ----  -------  ------
      No patch              1009  1384   847   258   298170  504402
      w/patch, no reclaim    880  1376   667   288   254064  396745
      w/patch & reclaim     1079  1385   926   252   291625  548873
      
      These numbers are the average of 2 runs of 3 "make -j" runs done right
      after system boot.  Run-to-run variability for "make -j" is huge, so
      these numbers aren't terribly useful except to see that with reclaim
      the benchmark still finishes in a reasonable amount of time.
      
      I also looked at the NUMA hit/miss stats for the "make -j" runs and the
      reclaim doesn't make any difference when the machine is thrashing away.
      
      Doing a "make -j8" on a single node that is filled with page cache pages
      takes 700 seconds with reclaim turned on and 735 seconds without reclaim
      (due to remote memory accesses).
      
      The simple zone_reclaim syscall program is at
      http://www.bork.org/~mort/sgi/zone_reclaim.c
      Signed-off-by: Martin Hicks <mort@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      753ee728
    • M
      [PATCH] VM: add may_swap flag to scan_control · bfbb38fb
      Martin Hicks committed
      Here's the next round of these patches.  These are totally different in
      an attempt to meet the "simpler" request after the last patches.  For
      reference the earlier threads are:
      
      http://marc.theaimsgroup.com/?l=linux-kernel&m=110839604924587&w=2
      http://marc.theaimsgroup.com/?l=linux-mm&m=111461480721249&w=2
      
      This set of patches replaces my other vm- patches that are currently in
      -mm.  So they're against 2.6.12-rc5-mm1 about half way through the -mm
      patchset.
      
      As I said already this patch is a lot simpler.  The reclaim is turned on
      or off on a per-zone basis using a syscall.  I haven't tested the x86
      syscall, so it might be wrong.  It uses the existing reclaim/pageout
      code with the small addition of a may_swap flag to scan_control
      (patch 1/4).
      
      I also added __GFP_NORECLAIM (patch 3/4) so that certain allocation
      types can be flagged to never cause reclaim.  This was a deficiency
      that was in all of my earlier patch sets.  Previously, doing a big
      buffered read would fill one zone with page cache and then start to
      reclaim from that same zone, leaving the other zones untouched.
      
      Adding some extra throttling on the reclaim was also required (patch
      4/4).  Without it the machine would grind to a crawl when doing a "make -j"
      kernel build.  Even with this patch the System Time is higher on
      average, but it seems tolerable.  Here are some numbers for kernbench
      runs on a 2-node, 4cpu, 8Gig RAM Altix in the "make -j" run:
      
                            wall  user   sys  %cpu  ctx sw.  sleeps
                            ----  ----   ---  ----  -------  ------
      No patch              1009  1384   847   258   298170  504402
      w/patch, no reclaim    880  1376   667   288   254064  396745
      w/patch & reclaim     1079  1385   926   252   291625  548873
      
      These numbers are the average of 2 runs of 3 "make -j" runs done right
      after system boot.  Run-to-run variability for "make -j" is huge, so
      these numbers aren't terribly useful except to see that with reclaim
      the benchmark still finishes in a reasonable amount of time.
      
      I also looked at the NUMA hit/miss stats for the "make -j" runs and the
      reclaim doesn't make any difference when the machine is thrashing away.
      
      Doing a "make -j8" on a single node that is filled with page cache pages
      takes 700 seconds with reclaim turned on and 735 seconds without reclaim
      (due to remote memory accesses).
      
      The simple zone_reclaim syscall program is at
      http://www.bork.org/~mort/sgi/zone_reclaim.c
      
      This patch:
      
      This adds an extra switch to the scan_control struct.  It simply lets the
      reclaim code know if it's allowed to swap pages out.
      
      This was required for a simple per-zone reclaimer.  Without this addition
      pages would be swapped out as soon as a zone ran out of memory and the early
      reclaim kicked in.
      Signed-off-by: Martin Hicks <mort@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      bfbb38fb
    • N
      [PATCH] mm: add /proc/zoneinfo · 295ab934
      Nikita Danilov committed
      Add /proc/zoneinfo file to display information about memory zones.  Useful
      to analyze VM behaviour.
      Signed-off-by: Nikita Danilov <nikita@clusterfs.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      295ab934
    • P
      [PATCH] madvise: merge the maps · 05b74384
      Prasanna Meda committed
      This attempts to merge back the split maps.  This code is mostly copied
      from Chrisw's mlock merging from post 2.6.11 trees.  The only difference is
      in munmapped_error handling.  Also passed prev to willneed/dontneed,
      even though they do not handle it now, since I felt it would be cleaner
      than handling prev in madvise_vma in some cases and in a subfunction in
      other cases.
      Signed-off-by: Prasanna Meda <pmeda@akamai.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      05b74384
    • P
      [PATCH] madvise: do not split the maps · e798c6e8
      Prasanna Meda committed
      This attempts to avoid splitting when it is not needed, that is when
      vm_flags are the same as the new flags.  The idea is from the <2.6.11
      mlock_fixup and others.  This will provide the base for the next madvise
      merging patch.
      Signed-off-by: Prasanna Meda <pmeda@akamai.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      e798c6e8
    • A
      [PATCH] vmscan: notice slab shrinking · b15e0905
      akpm@osdl.org committed
      Fix a problem identified by Andrea Arcangeli <andrea@suse.de>
      
      kswapd will set a zone into all_unreclaimable state if it sees that we're not
      successfully reclaiming LRU pages.  But that fails to notice that we're
      successfully reclaiming slab objects, so we can set all_unreclaimable too soon.
      
      So change shrink_slab() to return a success indication if it actually
      reclaimed some objects, and don't assume that the zone is all_unreclaimable if
      that is true.  This means that we won't enter all_unreclaimable state if we
      are successfully freeing slab objects but we're not yet actually freeing slab
      pages, due to internal fragmentation.
      
      (hm, this has a shortcoming.  We could be successfully freeing ZONE_NORMAL
      slab objects while being really oom on ZONE_DMA.  If that happens then kswapd
      might burn a lot of CPU.  But given that there might be some slab objects in
      ZONE_DMA, perhaps that is appropriate.)
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      b15e0905
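The idea in the vmscan commit above (do not declare a zone all_unreclaimable while slab objects are still being freed) can be reduced to a small predicate. This is a deliberately simplified sketch: `may_mark_all_unreclaimable` is a hypothetical helper, and the real kswapd decision also depends on how many pages have been scanned, which is not modelled here.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether a zone may be treated as all_unreclaimable after one pass.
 * Progress on either front (LRU pages or slab objects) means it is not stuck.
 */
static bool may_mark_all_unreclaimable(unsigned long lru_pages_reclaimed,
                                       unsigned long slab_objects_reclaimed)
{
    return lru_pages_reclaimed == 0 && slab_objects_reclaimed == 0;
}
```

The second parameter stands in for the "success indication" that the commit says shrink_slab() now returns.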
  2. 19 Jun, 2005 1 commit
  3. 07 Jun, 2005 1 commit
  4. 25 May, 2005 1 commit
    • W
      [PATCH] try_to_unmap_cluster() passes out-of-bounds pte to pte_unmap() · cafdd8ba
      William Lee Irwin III committed
      try_to_unmap_cluster() does:
              for (pte = pte_offset_map(pmd, address);
                              address < end; pte++, address += PAGE_SIZE) {
                      ...
              }

              pte_unmap(pte);
      
      It may take a little staring to notice, but pte can actually fall off the
      end of the pte page in this iteration, which makes life difficult for
      kmap_atomic() and the users not expecting it to BUG().  Of course, we're
      somewhat lucky in that arithmetic elsewhere in the function guarantees that
      at least one iteration is made, lest this force larger rearrangements to be
      made.  This issue and patch also apply to non-mm mainline and with trivial
      adjustments, at least two related kernels.
      
      Discovered during internal testing at Oracle.
      Signed-off-by: William Irwin <wli@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      cafdd8ba
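The off-by-one described in the commit above can be reproduced in plain C: after a loop of the form `for (p = start; ...; p++)`, the cursor points one element past the last entry visited. The sketch below uses an `int` array as a stand-in for a pte page, and `last_visited` is a hypothetical helper. Stepping the cursor back by one is one way to hand a valid pointer to the unmap step; it is shown here as an illustration, not as a quotation of the actual patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Walk `n` entries (n >= 1) and return the pointer that should be handed to
 * the unmap step: the last entry visited, not the one-past-the-end cursor.
 */
static const int *last_visited(const int *table, size_t n)
{
    const int *p;
    size_t i;

    assert(table != NULL && n >= 1);
    for (p = table, i = 0; i < n; p++, i++)
        ;                       /* per-entry work elided */
    /* Here p == table + n, which lies outside the table. */
    return p - 1;
}
```

The `n >= 1` precondition corresponds to the commit's remark that at least one iteration is guaranteed; without it, `p - 1` would itself be out of bounds.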
  5. 22 May, 2005 1 commit
    • S
      [PATCH] fix for __generic_file_aio_read() to return 0 on EOF · b5c44c21
      Suparna Bhattacharya committed
      I came across the following problem while running ltp-aiodio testcases from
      ltp-full-20050405 on linux-2.6.12-rc3-mm3.  I tried running the tests with
      EXT3 as well as JFS filesystems.
      
      One or two fsx-linux testcases were hung after some time.  These testcases
      were hanging at wait_for_all_aios().
      
      Debugging shows that there were some iocbs which were not getting completed
      even though the last retry for those returned -EIOCBQUEUED.  Also all such
      pending iocbs represented READ operations.
      
      Further debugging revealed that all such iocbs hit EOF in the DIO layer.
      To be more precise, the "pos" from which they were trying to read was
      greater than the "size" of the file.  So the generic_file_direct_IO
      returned 0.
      
      This happens rarely as there is already a check in
      __generic_file_aio_read(), for whether "pos" < "size" before calling direct
      IO routine.
      
      >size = i_size_read(inode);
      >if (pos < size) {
      >	  retval = generic_file_direct_IO(READ, iocb,
      >                               iov, pos, nr_segs);
      
      But for READ, we are taking the inode->i_sem only in the DIO layer.  So it
      is possible that some other process can change the size of the file before
      we take the i_sem.  In such a case (when "pos" > "size"), the
      __generic_file_aio_read() would return -EIOCBQUEUED even though there were
      no I/O requests submitted by the DIO layer.  This would cause the AIO layer
      to expect aio_complete() for the iocb, which does not happen.  And thus the
      test hangs forever, waiting for an I/O completion, where there are no
      requests submitted at all.
      
      The following patch makes __generic_file_aio_read() return 0 (instead of
      returning -EIOCBQUEUED), on getting 0 from generic_file_direct_IO(), so
      that the AIO layer does the aio_complete().
      
      Testing:
      
      I have tested the patch on an SMP machine (with 2 Pentium 4 (HT)) running
      linux-2.6.12-rc3-mm3.  I ran the ltp-aiodio testcases and none of the
      fsx-linux tests hung.  Also the aio-stress tests ran without any problem.
      Signed-off-by: Suzuki K P <suzuki@in.ibm.com>
      Signed-off-by: Suparna Bhattacharya <suparna@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      b5c44c21
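The behaviour described in the AIO commit above can be modelled as a small mapping from the direct-I/O result to the value the read path returns. Everything here is a sketch: `aio_read_result` and `SKETCH_EIOCBQUEUED` are hypothetical names, the constant's value is only illustrative, and the `is_sync` parameter is an assumption about how synchronous callers are distinguished.

```c
#include <assert.h>

/* Illustrative stand-in for the kernel-internal EIOCBQUEUED error code. */
#define SKETCH_EIOCBQUEUED 529

/*
 * dio_ret: bytes submitted by the direct-I/O layer, 0 at EOF, or a negative
 * error.  is_sync: nonzero when the caller waits for completion itself.
 */
static long aio_read_result(long dio_ret, int is_sync)
{
    if (dio_ret > 0 && !is_sync)
        return -SKETCH_EIOCBQUEUED; /* I/O was submitted; completion comes later */
    return dio_ret;                 /* 0 (EOF) or an error is reported directly */
}
```

The key case is `dio_ret == 0`: nothing was submitted, so returning 0 lets the caller complete the request instead of waiting for a completion event that will never arrive.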
  6. 21 May, 2005 1 commit
  7. 20 May, 2005 1 commit
    • L
      Fix get_unmapped_area sanity tests · 07ab67c8
      Linus Torvalds committed
      As noted by Chris Wright, we need to do the full range of tests regardless
      of whether MAP_FIXED is set or not, so re-organize get_unmapped_area()
      slightly to do the sanity checks unconditionally.
      07ab67c8
  8. 19 May, 2005 1 commit
    • L
      [PATCH] prevent NULL mmap in topdown model · 49a43876
      Linus Torvalds committed
      Prevent the topdown allocator from allocating mmap areas all the way
      down to address zero.
      
      We still allow a MAP_FIXED mapping of page 0 (needed for various things,
      ranging from Wine and DOSEMU to people who want to allow speculative
      loads off a NULL pointer).
      
      Tested by Chris Wright.
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      49a43876
  9. 17 May, 2005 5 commits
  10. 06 May, 2005 1 commit
  11. 04 May, 2005 1 commit
  12. 01 May, 2005 7 commits
    • M
      [PATCH] DocBook: fix some descriptions · 67be2dd1
      Martin Waitz committed
      Some KernelDoc descriptions are updated to match the current code.
      No code changes.
      Signed-off-by: Martin Waitz <tali@admingilde.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      67be2dd1
    • P
      [PATCH] DocBook: changes and extensions to the kernel documentation · 4dc3b16b
      Pavel Pisa committed
      I have recompiled the Linux kernel 2.6.11.5 documentation for myself and our
      university students again.  The documentation could be extended to cover more
      sources which are equipped with structured comments in recent 2.6 kernels.  I
      have tried to proceed with that task.  I have done that several times since
      2.6.0 and it gets boring to make the same changes again and again.  The Linux
      kernel compiles after the changes for i386 and ARM targets.  I have added
      references to some more files to the kernel-api book, and I have added some
      section names as well.  So please check that the changes do not break
      anything and that the categories are not too skewed.
      
      I have changed kernel-doc to accept the "fastcall" and "asmlinkage" words
      reserved by kernel convention.  Most of the other changes are modifications
      to the comments to make kernel-doc happy, so that it accepts some parameter
      descriptions and does not bail out on errors.  Changed <pid> to @pid in the
      description, moved some #ifdefs before comments to correct the
      function-to-comment bindings, etc.
      
      You can see result of the modified documentation build at
        http://cmp.felk.cvut.cz/~pisa/linux/lkdb-2.6.11.tar.gz
      
      Some more sources are ready to be included in the kernel-doc generated
      documentation.  Sources have been added to kernel-api for now.  Some more
      section names were added, and probably some more chaos was introduced as a
      result of quick cleanup work.
      Signed-off-by: Pavel Pisa <pisa@cmp.felk.cvut.cz>
      Signed-off-by: Martin Waitz <tali@admingilde.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      4dc3b16b
    • P
      [PATCH] Change synchronize_kernel to _rcu and _sched · fbd568a3
      Paul E. McKenney committed
      This patch changes calls to synchronize_kernel(), deprecated in the earlier
      "Deprecate synchronize_kernel, GPL replacement" patch, to instead call the new
      synchronize_rcu() and synchronize_sched() APIs.
      Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      fbd568a3
    • M
      [PATCH] Exterminate PAGE_BUG · cd7619d6
      Matt Mackall committed
      Remove PAGE_BUG - replace it with BUG and BUG_ON.
      Signed-off-by: Matt Mackall <mpm@selenic.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      cd7619d6
    • A
      [PATCH] use smp_mb/wmb/rmb where possible · d59dd462
      akpm@osdl.org committed
      Replace a number of memory barriers with smp_ variants.  This means we won't
      take the unnecessary hit on UP machines.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      d59dd462
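A minimal userspace sketch of why the smp_ variants avoid a cost on UP builds: with CONFIG_SMP unset they can shrink to a compiler-only barrier, while SMP builds keep a real hardware barrier. The macro bodies below are simplified stand-ins built on GCC/Clang builtins, not the kernel's per-architecture definitions, and `payload`, `ready`, `publish()` and `consume()` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins (assumptions): the kernel defines these per architecture. */
#define barrier() __asm__ __volatile__("" ::: "memory") /* compiler-only barrier */
#define mb()      __sync_synchronize()                  /* full hardware barrier */

#ifdef CONFIG_SMP
/* SMP build: other CPUs can observe our stores, so order them in hardware. */
#define smp_mb()  mb()
#define smp_wmb() mb()  /* simplification: a full barrier is stronger than needed */
#define smp_rmb() mb()
#else
/* UP build: no other CPU exists, so stopping compiler reordering is enough. */
#define smp_mb()  barrier()
#define smp_wmb() barrier()
#define smp_rmb() barrier()
#endif

static int payload;          /* hypothetical shared data */
static volatile bool ready;  /* hypothetical "data is valid" flag */

/* Writer: store the data first, then set the flag. */
void publish(int value)
{
    payload = value;
    smp_wmb();               /* keep the payload store ahead of the flag store */
    ready = true;
}

/* Reader: check the flag first, then load the data. */
bool consume(int *out)
{
    assert(out != NULL);
    if (!ready)
        return false;
    smp_rmb();               /* keep the flag load ahead of the payload load */
    *out = payload;
    return true;
}
```

Building with -DCONFIG_SMP selects the hardware barriers; without it, the same source compiles to plain stores and loads with only compiler ordering.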
    • M
      [PATCH] add kmalloc_node, inline cleanup · 97e2bde4
      Manfred Spraul committed
      The patch makes the following function calls available to allocate memory
      on a specific node without changing the basic operation of the slab
      allocator:
      
       kmem_cache_alloc_node(kmem_cache_t *cachep, unsigned int flags, int node);
       kmalloc_node(size_t size, unsigned int flags, int node);
      
      in a similar way to the existing node-blind functions:
      
       kmem_cache_alloc(kmem_cache_t *cachep, unsigned int flags);
       kmalloc(size, flags);
      
      kmem_cache_alloc_node was changed to pass flags and the node information
      through the existing layers of the slab allocator (which led to some minor
      rearrangements).  The functions at the lowest layer (kmem_getpages,
      cache_grow) are already node aware.  Also __alloc_percpu can call
      kmalloc_node now.
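
The relationship between the node-aware and node-blind calls can be illustrated with a toy userspace mock. This is only a sketch of the API shape: `mock_kmalloc_node()`, `mock_kmalloc()`, `MAX_NODES`, `local_node` and `node_bytes` are hypothetical names, malloc() stands in for the slab allocator, and the kernel's kmalloc() is not claimed to be implemented as a wrapper like this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_NODES 4                   /* hypothetical number of NUMA nodes */

static size_t node_bytes[MAX_NODES];  /* bytes handed out per node (bookkeeping only) */
static int local_node;                /* pretend "node of the current CPU" */

/* Node-aware variant: the caller names the node the memory should come from. */
void *mock_kmalloc_node(size_t size, unsigned int flags, int node)
{
    (void)flags;                      /* allocation flags are ignored in this mock */
    assert(node >= 0 && node < MAX_NODES);

    void *p = malloc(size);
    if (p)
        node_bytes[node] += size;     /* record which node "served" the request */
    return p;
}

/* Node-blind variant: same signature minus the node, uses the local node. */
void *mock_kmalloc(size_t size, unsigned int flags)
{
    return mock_kmalloc_node(size, flags, local_node);
}
```

A caller that does not care about placement keeps using the two-argument form; a caller such as a per-CPU allocator would pass the node of the CPU the data belongs to.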
      
      Performance measurements (using the pageset localization patch) yield:
      
      w/o patches:
      Tasks    jobs/min  jti  jobs/min/task      real       cpu
          1      484.27  100       484.2736     12.02      1.97   Wed Mar 30 20:50:43 2005
        100    25170.83   91       251.7083     23.12    150.10   Wed Mar 30 20:51:06 2005
        200    34601.66   84       173.0083     33.64    294.14   Wed Mar 30 20:51:40 2005
        300    37154.47   86       123.8482     46.99    436.56   Wed Mar 30 20:52:28 2005
        400    39839.82   80        99.5995     58.43    580.46   Wed Mar 30 20:53:27 2005
        500    40036.32   79        80.0726     72.68    728.60   Wed Mar 30 20:54:40 2005
        600    44074.21   79        73.4570     79.23    872.10   Wed Mar 30 20:55:59 2005
        700    44016.60   78        62.8809     92.56   1015.84   Wed Mar 30 20:57:32 2005
        800    40411.05   80        50.5138    115.22   1161.13   Wed Mar 30 20:59:28 2005
        900    42298.56   79        46.9984    123.83   1303.42   Wed Mar 30 21:01:33 2005
       1000    40955.05   80        40.9551    142.11   1441.92   Wed Mar 30 21:03:55 2005
      
      with pageset localization and slab API patches:
      Tasks    jobs/min  jti  jobs/min/task      real       cpu
          1      484.19  100       484.1930     12.02      1.98   Wed Mar 30 21:10:18 2005
        100    27428.25   92       274.2825     21.22    149.79   Wed Mar 30 21:10:40 2005
        200    37228.94   86       186.1447     31.27    293.49   Wed Mar 30 21:11:12 2005
        300    41725.42   85       139.0847     41.84    434.10   Wed Mar 30 21:11:54 2005
        400    43032.22   82       107.5805     54.10    582.06   Wed Mar 30 21:12:48 2005
        500    42211.23   83        84.4225     68.94    722.61   Wed Mar 30 21:13:58 2005
        600    40084.49   82        66.8075     87.12    873.11   Wed Mar 30 21:15:25 2005
        700    44169.30   79        63.0990     92.24   1008.77   Wed Mar 30 21:16:58 2005
        800    43097.94   79        53.8724    108.03   1155.88   Wed Mar 30 21:18:47 2005
        900    41846.75   79        46.4964    125.17   1303.38   Wed Mar 30 21:20:52 2005
       1000    40247.85   79        40.2478    144.60   1442.21   Wed Mar 30 21:23:17 2005
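
To compare the two runs row by row, the relative change in jobs/min can be computed with a small helper. This is a sketch for reading the tables above, not part of the patch, and `pct_change()` is a hypothetical name. Applied to the figures above it gives roughly +9% at 100 tasks (25170.83 to 27428.25) and roughly -9% at 600 tasks (44074.21 to 40084.49), so the effect is not uniform across load levels.

```c
#include <assert.h>

/* Relative change from `before` to `after`, in percent. */
double pct_change(double before, double after)
{
    assert(before != 0.0);            /* a zero baseline has no relative change */
    return (after - before) / before * 100.0;
}
```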
      Signed-off-by: Christoph Lameter <christoph@lameter.com>
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      97e2bde4
    • W
      [PATCH] sync_page() smp_mb() comment · dd1d5afc
      William Lee Irwin III committed
      The smp_mb() is needed because sync_page() doesn't hold PG_locked while it
      accesses page_mapping(page).  The comments in the patch (the entire patch is
      the addition of this comment) try to explain further how and why smp_mb() is
      used.
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      dd1d5afc