1. 17 December 2006 (2 commits)
    • Fix up mm/mincore.c error value cases · 4fb23e43
      Authored by Linus Torvalds
      Hugh Dickins correctly points out that mincore() is actually _supposed_
      to fail on an unmapped hole in the user address space, rather than
      return valid ("empty") information about the hole.  This just simplifies
      the problem further (I had been misled by our previous confusing and
      complicated way of doing mincore()).
      
      Also, in the unlikely situation that we can't allocate a temporary
      kernel buffer, we should actually return EAGAIN, not ENOMEM, to keep the
      "unmapped hole" and "allocation failure" error cases separate.
      
      Finally, add a comment about our stupid historical lack of support for
      anonymous mappings.  I'll fix that if somebody reminds me after 2.6.20
      is out.
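      
      A tiny userspace sketch of the intended semantics (assuming a modern
      kernel where anonymous mappings are supported): mincore() over a range
      containing an unmapped hole should fail with ENOMEM rather than report
      the hole as "not resident".
      
      #include <errno.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/mman.h>
      #include <unistd.h>
      
      int main(void)
      {
              long page = sysconf(_SC_PAGESIZE);
              unsigned char vec[2];
      
              /* Map two pages, then punch a hole in the second one. */
              char *p = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
              if (p == MAP_FAILED)
                      return 1;
              munmap(p + page, page);
      
              /* The range covers an unmapped hole: expect ENOMEM, not data. */
              if (mincore(p, 2 * page, vec) < 0)
                      printf("mincore failed as expected: %s\n", strerror(errno));
      
              munmap(p, page);
              return 0;
      }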
      Acked-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • Fix incorrect user space access locking in mincore() · 2f77d107
      Authored by Linus Torvalds
      Doug Chapman noticed that mincore() will do a "copy_to_user()" of the
      result while holding the mmap semaphore for reading, which is a big
      no-no.  While a recursive read-lock on a semaphore in the case of a page
      fault happens to work, we don't actually allow them, due to deadlock
      scenarios with writers that arise from fairness issues.
      
      Doug and Marcel sent in a patch to fix it, but I decided to just rewrite
      the mess instead - not just fixing the locking problem, but making the
      code smaller and (imho) much easier to understand.
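      
      A userspace analogue of the new shape of the code (hypothetical names,
      with a pthread rwlock standing in for mmap_sem): fill a small private
      buffer while holding the lock, drop the lock, and only then copy to the
      destination that might fault.
      
      #include <pthread.h>
      #include <stdio.h>
      #include <string.h>
      
      static pthread_rwlock_t maplock = PTHREAD_RWLOCK_INITIALIZER;
      
      static void report_pages(unsigned char *dest, size_t n)
      {
              unsigned char tmp[4096];
      
              while (n > 0) {
                      size_t chunk = n < sizeof(tmp) ? n : sizeof(tmp);
      
                      pthread_rwlock_rdlock(&maplock);
                      memset(tmp, 1, chunk);    /* stand-in for the page walk */
                      pthread_rwlock_unlock(&maplock);
      
                      memcpy(dest, tmp, chunk); /* "copy_to_user", lock dropped */
                      dest += chunk;
                      n -= chunk;
              }
      }
      
      int main(void)
      {
              unsigned char vec[8192];
      
              report_pages(vec, sizeof(vec));
              printf("%d\n", vec[0]);
              return 0;
      }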
      
      Cc: Doug Chapman <dchapman@redhat.com>
      Cc: Marcel Holtmann <holtmann@redhat.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  2. 14 December 2006 (6 commits)
    • [PATCH] Pass vma argument to copy_user_highpage(). · 9de455b2
      Authored by Atsushi Nemoto
      To allow a more effective copy_user_highpage() on certain architectures,
      a vma argument is added to the function and to cow_user_page(), allowing
      the implementations of these functions to check for the VM_EXEC bit.
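      
      A minimal mock (invented types and an assumed flag value) of what the
      extra argument enables: the copy routine can now see the mapping's flags
      and, for example, skip an instruction-cache flush on non-executable
      mappings.
      
      #include <string.h>
      
      #define VM_EXEC 0x4UL   /* assumed bit value, for illustration only */
      
      struct vm_area_struct { unsigned long vm_flags; };
      
      static void copy_user_highpage(void *to, const void *from,
                                     unsigned long vaddr,
                                     struct vm_area_struct *vma)
      {
              memcpy(to, from, 4096);
              (void)vaddr;
              if (vma->vm_flags & VM_EXEC) {
                      /* arch-specific instruction cache flush would go here */
              }
      }
      
      int main(void)
      {
              static char src[4096], dst[4096];
              struct vm_area_struct vma = { .vm_flags = VM_EXEC };
      
              copy_user_highpage(dst, src, 0, &vma);
              return dst[0];
      }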
      
      The main part of this patch was originally written by Ralf Baechle;
      Atsushi Nemoto did the debugging.
      Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] SLAB: use a multiply instead of a divide in obj_to_index() · 6a2d7a95
      Authored by Eric Dumazet
      When some objects are allocated by one CPU but freed by another CPU we can
      consume a lot of cycles doing divides in obj_to_index().
      
      (A typical load is a dual processor machine where network interrupts are
      handled by one particular CPU (allocating skbufs), and the other CPU is
      running the application (consuming and freeing skbufs).)
      
      Here on one production server (dual-core AMD Opteron 285), I noticed this
      divide took 1.20 % of CPU_CLK_UNHALTED events in the kernel.  But Opterons
      are quite modern CPUs, and the divide is much more expensive on older
      architectures:
      
      On a 200 MHz sparcv9 machine, the division takes 64 cycles instead of 1
      cycle for a multiply.
      
      Doing some math, we can use a reciprocal multiplication instead of a divide.
      
      If we want to compute V = (A / B) (A and B being u32 quantities),
      we can instead use:
      
      V = ((u64)A * RECIPROCAL(B)) >> 32;
      
      where RECIPROCAL(B) is precalculated to ((1LL << 32) + (B - 1)) / B
      
      Note:
      
      I wrote pure C code for clarity.  gcc output for i386 is not optimal but
      acceptable:
      
      mull   0x14(%ebx)
      mov    %edx,%eax // part of the >> 32
      xor     %edx,%edx // useless
      mov    %eax,(%esp) // could be avoided
      mov    %edx,0x4(%esp) // useless
      mov    (%esp),%ebx
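      
      A standalone sketch of the trick (plain C, outside the kernel; the helper
      names are illustrative), with a brute-force check over a small range:
      
      #include <stdint.h>
      #include <stdio.h>
      
      /* Precomputed once per divisor: RECIPROCAL(B) = ((1LL << 32) + B - 1) / B */
      static uint32_t reciprocal_value(uint32_t b)
      {
              return (uint32_t)(((1ULL << 32) + b - 1) / b);
      }
      
      /* The divide becomes a multiply and a shift. */
      static uint32_t reciprocal_divide(uint32_t a, uint32_t r)
      {
              return (uint32_t)(((uint64_t)a * r) >> 32);
      }
      
      int main(void)
      {
              uint32_t b = 192;               /* e.g. a slab object size */
              uint32_t r = reciprocal_value(b);
      
              for (uint32_t a = 0; a < 1000000; a++)
                      if (reciprocal_divide(a, r) != a / b)
                              return 1;
              puts("all quotients match");
              return 0;
      }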
      
      [akpm@osdl.org: small cleanups]
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] cpuset: rework cpuset_zone_allowed api · 02a0e53d
      Authored by Paul Jackson
      Elaborate the API for calling cpuset_zone_allowed(), so that users have to
      explicitly choose between the two variants:
      
        cpuset_zone_allowed_hardwall()
        cpuset_zone_allowed_softwall()
      
      Until now, whether or not you got the hardwall flavor depended solely on
      whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask
      argument.
      
      If you didn't specify __GFP_HARDWALL, you implicitly got the softwall
      version.
      
      Unfortunately, this meant that users would end up with the softwall version
      without thinking about it.  Since only the softwall version might sleep,
      this led to bugs with possible sleeping in interrupt context on more than
      one occasion.
      
      The hardwall version requires that the current task's mems_allowed allows
      the node of the specified zone (or that you're in interrupt, or that
      __GFP_THISNODE is set, or that you're on a one-cpuset system).
      
      The softwall version, depending on the gfp_mask, might allow a node if it
      was allowed in the nearest enclosing cpuset marked mem_exclusive (which
      requires taking the cpuset lock 'callback_mutex' to evaluate).
      
      This patch removes the cpuset_zone_allowed() call, and forces the caller to
      explicitly choose between the hardwall and the softwall case.
      
      If the caller wants the gfp_mask to determine this choice, they should (1)
      be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the
      cpuset_zone_allowed_softwall() routine.
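      
      In caller terms the change looks roughly like this sketch (invented stubs;
      only the control flow is the point): the gfp-driven guessing is gone, and
      a softwall call is only made where sleeping is known to be safe.
      
      #include <stdbool.h>
      #include <stdio.h>
      
      /* Stubs standing in for the two explicit variants. */
      static bool cpuset_zone_allowed_hardwall(int node) /* never sleeps */
      { (void)node; return true; }
      static bool cpuset_zone_allowed_softwall(int node) /* may sleep */
      { (void)node; return true; }
      
      static bool try_alloc_on(int node, bool may_sleep)
      {
              if (!may_sleep)
                      return cpuset_zone_allowed_hardwall(node);
              /* Safe to sleep here, so the wider (softwall) check is legal. */
              return cpuset_zone_allowed_softwall(node);
      }
      
      int main(void)
      {
              printf("%d %d\n", try_alloc_on(0, false), try_alloc_on(0, true));
              return 0;
      }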
      
      This adds another 100 or 200 bytes to the kernel text space, due to the few
      lines of nearly duplicate code at the top of both cpuset_zone_allowed_*
      routines.  It should save a few instructions executed for the calls that
      turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to
      set (before the call) then check (within the call) the __GFP_HARDWALL flag.
      
      For the most critical call, from get_page_from_freelist(), the same
      instructions are executed as before -- the old cpuset_zone_allowed()
      routine it used to call is the same code as the
      cpuset_zone_allowed_softwall() routine that it calls now.
      
      Not a perfect win, but it seems worth it, to reduce the chance of hitting a
      sleeping-with-irqs-off complaint again.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] More slab.h cleanups · 55935a34
      Authored by Christoph Lameter
      More cleanups for slab.h
      
      1. Remove tabs from weird locations as suggested by Pekka
      
      2. Drop the check for NUMA and SLAB_DEBUG from the fallback section
         as suggested by Pekka.
      
      3. Use static inline for the fallback defs, as also suggested by Pekka.
      
      4. Make kmem_ptr_valid take a const * argument.
      
      5. Separate the NUMA fallback definitions from the kmalloc_track fallback
         definitions.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Cleanup slab headers / API to allow easy addition of new slab allocators · 2e892f43
      Authored by Christoph Lameter
      This is a response to an earlier discussion on linux-mm about splitting
      slab.h components per allocator.  Patch is against 2.6.19-git11.  See
      http://marc.theaimsgroup.com/?l=linux-mm&m=116469577431008&w=2
      
      This patch cleans up the slab header definitions.  We define the common
      functions of slob and slab in slab.h and put the extra definitions needed
      for slab's kmalloc implementations in <linux/slab_def.h>.  In order to get
      a greater set of common functions we add several empty functions to slob.c
      and also rename slob's kmalloc to __kmalloc.
      
      Slob does not need any special definitions since we introduce a fallback
      case.  If there is no need for a slab implementation to provide its own
      kmalloc mess^H^H^Hacros then we simply fall back to __kmalloc functions.
      That is sufficient for SLOB.
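      
      The fallback arrangement, in miniature (illustrative guard macro; not the
      kernel's actual one): an allocator-specific header may define its own
      kmalloc, otherwise slab.h supplies a static inline that just calls
      __kmalloc.
      
      #include <stddef.h>
      #include <stdlib.h>
      
      /* What SLOB now provides: a plain function. */
      void *__kmalloc(size_t size, unsigned flags)
      {
              (void)flags;
              return malloc(size);
      }
      
      /* An allocator's <linux/slab_def.h> would define this guard along with
         its own optimized kmalloc; SLOB defines nothing and falls back. */
      #ifndef ALLOCATOR_HAS_KMALLOC
      static inline void *kmalloc(size_t size, unsigned flags)
      {
              return __kmalloc(size, flags);
      }
      #endif
      
      int main(void)
      {
              void *p = kmalloc(64, 0);
      
              free(p);
              return p == NULL;
      }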
      
      Sort the functions in slab.h according to their functionality.  First the
      functions operating on struct kmem_cache *, then the kmalloc-related
      functions, followed by special debug and fallback definitions.
      
      Also redo a lot of comments.
      
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] slab: fix sleeping in atomic bug · dd47ea75
      Authored by Christoph Lameter
      fallback_alloc() does not do the check for GFP_WAIT that is done in
      cache_grow().  Thus interrupts are disabled when we call kmem_getpages(),
      which results in the failure.
      
      Duplicate the handling of GFP_WAIT in cache_grow().
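      
      The duplicated logic, modelled in a few lines (stub helpers, assumed flag
      value): a potentially sleeping page allocation must run with interrupts
      enabled, so fallback_alloc() now toggles them around kmem_getpages() the
      way cache_grow() does.
      
      #include <stdio.h>
      
      #define __GFP_WAIT 0x10u        /* assumed value, illustration only */
      
      static void local_irq_enable(void)  { puts("irqs enabled");  }
      static void local_irq_disable(void) { puts("irqs disabled"); }
      static void *kmem_getpages(unsigned flags) { (void)flags; return (void *)1; }
      
      static void *grow(unsigned flags)
      {
              void *page;
      
              if (flags & __GFP_WAIT)
                      local_irq_enable();     /* allocation may sleep */
              page = kmem_getpages(flags);
              if (flags & __GFP_WAIT)
                      local_irq_disable();
              return page;
      }
      
      int main(void)
      {
              return grow(__GFP_WAIT) ? 0 : 1;
      }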
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Jay Cliburn <jacliburn@bellsouth.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  3. 11 December 2006 (7 commits)
    • [PATCH] user of the jiffies rounding patch: Slab · 2b284214
      Authored by Arjan van de Ven
      This patch introduces users of the round_jiffies() function in the slab code.
      
      The slab code has a few "run every second" timers for background work; these
      are obviously not timing critical as long as they happen roughly at the right
      frequency.
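      
      A minimal model of the rounding (assumed HZ value; the real
      round_jiffies() also applies per-CPU skew): timers that only need to fire
      about once a second get aligned to whole-second boundaries so they batch
      with each other.
      
      #include <stdio.h>
      
      #define HZ 250                  /* assumed tick rate */
      
      /* Round a jiffies value up to the next whole-second boundary. */
      static unsigned long round_jiffies_model(unsigned long j)
      {
              return (j + HZ - 1) / HZ * HZ;
      }
      
      int main(void)
      {
              printf("%lu -> %lu\n", 1237UL, round_jiffies_model(1237));
              return 0;
      }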
      Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] dio: only call aio_complete() after returning -EIOCBQUEUED · 8459d86a
      Authored by Zach Brown
      The only time it is safe to call aio_complete() is when the ->ki_retry
      function returns -EIOCBQUEUED to the AIO core.  direct_io_worker() has
      historically done this by relying on its caller to translate positive return
      codes into -EIOCBQUEUED for the aio case.  It did this by trying to keep
      conditionals in sync.  direct_io_worker() knew when finished_one_bio() was
      going to call aio_complete().  It would reverse the test and wait and free the
      dio in the cases it thought that finished_one_bio() wasn't going to.
      
      Not surprisingly, it ended up getting it wrong.  'ret' could be a negative
      errno from the submission path, but it failed to communicate this to
      finished_one_bio().  direct_io_worker() would return < 0, its callers
      wouldn't raise -EIOCBQUEUED, and aio_complete() would be called.  Later,
      finished_one_bio()'s tests wouldn't reflect this and aio_complete() would
      be called for a second time, which can manifest as an oops.
      
      The previous cleanups have whittled the sync and async completion paths down
      to the point where we can collapse them and clearly reassert the invariant
      that we must only call aio_complete() after returning -EIOCBQUEUED.
      direct_io_worker() will only return -EIOCBQUEUED when it is not the last to
      drop the dio refcount and the aio bio completion path will only call
      aio_complete() when it is the last to drop the dio refcount.
      direct_io_worker() can ensure that it is the last to drop the reference count
      by waiting for bios to drain.  It does this for sync ops, of course, for
      partial dio writes that must fall back to buffered, and for aio ops that saw
      errors during submission.
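      
      The invariant in miniature (userspace model, invented names): every path
      drops one reference, and only the last drop completes the iocb, so
      aio_complete() cannot run twice.
      
      #include <stdatomic.h>
      #include <stdio.h>
      
      struct dio { atomic_int refs; };
      
      static void dio_complete(struct dio *d)
      {
              (void)d;
              puts("completed exactly once");
      }
      
      /* Both the submitter and the bio completion path call this. */
      static void dio_put(struct dio *d)
      {
              if (atomic_fetch_sub(&d->refs, 1) == 1)
                      dio_complete(d);
      }
      
      int main(void)
      {
              struct dio d = { .refs = 2 };   /* submitter + one in-flight bio */
      
              dio_put(&d);                    /* bio completion */
              dio_put(&d);                    /* submitter's final drop */
              return 0;
      }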
      
      This means that operations that end up waiting, even if they were issued as
      aio ops, will not call aio_complete() from dio.  Instead we return the return
      code of the operation and let the aio core call aio_complete().  This is
      purposely done to fix a bug where AIO DIO file extensions would call
      aio_complete() before their callers have a chance to update i_size.
      
      Now that direct_io_worker() is explicitly returning -EIOCBQUEUED its callers
      no longer have to translate for it.  XFS needs to be careful not to free
      resources that will be used during AIO completion if -EIOCBQUEUED is
      returned.  We maintain the previous behaviour of trying to write fs metadata
      for O_SYNC aio+dio writes.
      Signed-off-by: Zach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: <xfs-masters@oss.sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] io-accounting-read-accounting nfs fix · 8bde37f0
      Authored by Andrew Morton
      nfs's ->readpages uses read_cache_pages().  Wire it up there.
      
      [wfg@mail.ustc.edu.cn: account only successful nfs/fuse reads]
      Cc: Jay Lan <jlan@sgi.com>
      Cc: Shailabh Nagar <nagar@watson.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Chris Sturtivant <csturtiv@sgi.com>
      Cc: Tony Ernst <tee@sgi.com>
      Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
      Cc: David Wright <daw@sgi.com>
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] io-accounting: write-cancel accounting · e08748ce
      Authored by Andrew Morton
      Account for the number of bytes of writes which this process caused to not
      happen after all.
      
      Cc: Jay Lan <jlan@sgi.com>
      Cc: Shailabh Nagar <nagar@watson.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Chris Sturtivant <csturtiv@sgi.com>
      Cc: Tony Ernst <tee@sgi.com>
      Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
      Cc: David Wright <daw@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] io-accounting: write accounting · 55e829af
      Authored by Andrew Morton
      Accounting writes is fairly simple: whenever a process flips a page from clean
      to dirty, we accuse it of having caused a write to underlying storage of
      PAGE_CACHE_SIZE bytes.
      
      This may overestimate the amount of writing: the page-dirtying may cause only
      one buffer_head's worth of writeout.  Fixing that is possible, but probably a
      bit messy and isn't obviously important.
      
      Cc: Jay Lan <jlan@sgi.com>
      Cc: Shailabh Nagar <nagar@watson.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Chris Sturtivant <csturtiv@sgi.com>
      Cc: Tony Ernst <tee@sgi.com>
      Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
      Cc: David Wright <daw@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] clean up __set_page_dirty_nobuffers() · 8c08540f
      Authored by Andrew Morton
      Save a tabstop in __set_page_dirty_nobuffers() and __set_page_dirty_buffers()
      and a few other places.  No functional changes.
      
      Cc: Jay Lan <jlan@sgi.com>
      Cc: Shailabh Nagar <nagar@watson.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Chris Sturtivant <csturtiv@sgi.com>
      Cc: Tony Ernst <tee@sgi.com>
      Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
      Cc: David Wright <daw@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] read_zero_pagealigned() locking fix · 5fcf7bb7
      Authored by Hugh Dickins
      Ramiro Voicu hits the BUG_ON(!pte_none(*pte)) in zeromap_pte_range: kernel
      bugzilla 7645.  Right: read_zero_pagealigned uses down_read of mmap_sem,
      but another thread's racing read of /dev/zero, or a normal fault, can
      easily set that pte again, in between zap_page_range and zeromap_page_range
      getting there.  It's been wrong ever since 2.4.3.
      
      The simple fix is to use down_write instead, but that would serialize reads
      of /dev/zero more than at present: perhaps some app would be badly
      affected.  So instead let zeromap_page_range return the error instead of
      BUG_ON, and read_zero_pagealigned break to the slower clear_user loop in
      that case - there's no need to optimize for it.
      
      Use -EEXIST for when a pte is found: BUG_ON in mmap_zero (the other user of
      zeromap_page_range), though it really isn't interesting there.  And since
      mmap_zero wants -EAGAIN for out-of-memory, the zeromaps had better return
      that than -ENOMEM.
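      
      A toy model of the new control flow (invented stand-ins; memset plays
      both the zeromap and the clear_user roles): the fast path reports -EEXIST
      instead of BUGging, and the caller quietly falls back to the slow clear
      loop.
      
      #include <errno.h>
      #include <stdio.h>
      #include <string.h>
      
      /* Fast path: may fail with -EEXIST if a racing fault repopulated a pte. */
      static int zeromap_range(char *buf, size_t len, int raced)
      {
              if (raced)
                      return -EEXIST;
              memset(buf, 0, len);    /* stand-in for zero-page remapping */
              return 0;
      }
      
      static void read_zero(char *buf, size_t len)
      {
              if (zeromap_range(buf, len, 1) < 0)
                      memset(buf, 0, len);    /* slower clear_user-style loop */
      }
      
      int main(void)
      {
              char buf[64] = { 1 };
      
              read_zero(buf, sizeof(buf));
              printf("%d\n", buf[0]);
              return 0;
      }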
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Ramiro Voicu <Ramiro.Voicu@cern.ch>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  4. 09 December 2006 (7 commits)
  5. 08 December 2006 (18 commits)
    • [PATCH] struct seq_operations and struct file_operations constification · 15ad7cdc
      Authored by Helge Deller
       - move some file_operations structs into the .rodata section
      
       - move static strings from policy_types[] array into the .rodata section
      
       - fix generic seq_operations usages, so that those structs may be defined
         as "const" as well
      
      [akpm@osdl.org: couple of fixes]
      Signed-off-by: Helge Deller <deller@gmx.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] remove EXPORT_UNUSED_SYMBOL'ed symbols · 045f147f
      Authored by Adrian Bunk
      In time for 2.6.20, we can get rid of this junk.
      Signed-off-by: Adrian Bunk <bunk@stusta.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • 4668edc3
    • 1f370a23
    • [PATCH] hotplug CPU: clean up hotcpu_notifier() use · 02316067
      Authored by Ingo Molnar
      There was lots of #ifdef noise in the kernel due to hotcpu_notifier(fn,
      prio) not correctly marking 'fn' as used in the !HOTPLUG_CPU case, and thus
      generating compiler warnings of unused symbols, hence forcing people to add
      #ifdefs.
      
      The compiler can skip truly unused functions just fine:
      
          text    data     bss     dec     hex filename
       1624412  728710 3674856 6027978  5bfaca vmlinux.before
       1624412  728710 3674856 6027978  5bfaca vmlinux.after
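      
      A sketch of the stub's shape in the !HOTPLUG_CPU case (simplified; the
      real macro's body may differ): mentioning 'fn' in a dead expression
      counts as a use, so there is no warning and no emitted code.
      
      #include <stdio.h>
      
      /* Stub for kernels without CPU hotplug: 'fn' is referenced but never
         called, so gcc neither warns about it nor keeps it in the image. */
      #define hotcpu_notifier(fn, pri) do { (void)(fn); } while (0)
      
      static int my_cpu_callback(int cpu) /* would otherwise be "unused" */
      {
              (void)cpu;
              return 0;
      }
      
      int main(void)
      {
              hotcpu_notifier(my_cpu_callback, 0);
              puts("no #ifdef needed around the callback");
              return 0;
      }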
      
      [akpm@osdl.org: topology.c fix]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] remove HASH_HIGHMEM · 04903664
      Authored by Andrew Morton
      It has no users and it's doubtful that we'll need it again.
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] read_cache_pages() cleanup · 38da288b
      Authored by OGAWA Hirofumi
      Use put_pages_list() instead of opencoding it.
      Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] slab: use probe_kernel_address() · 138ae663
      Authored by Andrew Morton
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Add include/linux/freezer.h and move definitions from sched.h · 7dfb7103
      Authored by Nigel Cunningham
      Move process freezing functions from include/linux/sched.h to freezer.h, so
      that modifications to the freezer or the kernel configuration don't require
      recompiling just about everything.
      
      [akpm@osdl.org: fix ueagle driver]
      Signed-off-by: Nigel Cunningham <nigel@suspend2.net>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] swsusp: Improve handling of highmem · 8357376d
      Authored by Rafael J. Wysocki
      Currently swsusp saves the contents of highmem pages by copying them to the
      normal zone which is quite inefficient (eg.  it requires two normal pages
      to be used for saving one highmem page).  This may be improved by using
      highmem for saving the contents of saveable highmem pages.
      
      Namely, during the suspend phase of the suspend-resume cycle we try to
      allocate as many free highmem pages as there are saveable highmem pages.
      If there are not enough highmem image pages to store the contents of all of
      the saveable highmem pages, some of them will be stored in the "normal"
      memory.  Next, we allocate as many free "normal" pages as needed to store
      the (remaining) image data.  We use a memory bitmap to mark the allocated
      free pages (ie.  highmem as well as "normal" image pages).
      
      Now, we use another memory bitmap to mark all of the saveable pages
      (highmem as well as "normal") and the contents of the saveable pages are
      copied into the image pages.  Then, the second bitmap is used to save the
      pfns corresponding to the saveable pages and the first one is used to save
      their data.
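      
      A minimal model of such a memory bitmap (fixed size here for brevity):
      one instance marks allocated image pages, a second marks the saveable
      pages whose pfns are written out.
      
      #include <stdint.h>
      #include <stdio.h>
      
      #define NPAGES 1024
      
      struct mem_bitmap { uint8_t bits[NPAGES / 8]; };
      
      static void bm_set(struct mem_bitmap *bm, unsigned pfn)
      {
              bm->bits[pfn / 8] |= 1u << (pfn % 8);
      }
      
      static int bm_test(const struct mem_bitmap *bm, unsigned pfn)
      {
              return bm->bits[pfn / 8] >> (pfn % 8) & 1;
      }
      
      int main(void)
      {
              struct mem_bitmap allocated = { 0 }, saveable = { 0 };
      
              bm_set(&allocated, 3);  /* a page we got from the allocator */
              bm_set(&saveable, 42);  /* a highmem page to be saved */
              printf("%d %d\n", bm_test(&allocated, 3), bm_test(&saveable, 42));
              return 0;
      }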
      
      During the resume phase the pfns of the pages that were saveable during the
      suspend are loaded from the image and used to mark the "unsafe" page
      frames.  Next, we try to allocate as many free highmem page frames as are
      needed to load all of the image data that had been in highmem before the
      suspend, and we allocate enough free "normal" page frames that the total
      number of allocated free pages (highmem and "normal") is equal to the size
      of the image.  While doing this we have to make sure that there will be
      some extra free "normal" and "safe" page frames for the two lists of PBEs
      constructed later.
      
      Now, the image data are loaded, if possible, into their "original" page
      frames.  The image data that cannot be written into their "original" page
      frames are loaded into "safe" page frames and their "original" kernel
      virtual addresses, as well as the addresses of the "safe" pages containing
      their copies, are stored in one of two lists of PBEs.
      
      One list of PBEs is for the copies of "normal" suspend pages (ie.  "normal"
      pages that were saveable during the suspend) and it is used in the same way
      as previously (ie.  by the architecture-dependent parts of swsusp).  The
      other list of PBEs is for the copies of highmem suspend pages.  The pages
      in this list are restored (in a reversible way) right before the
      arch-dependent code is called.
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] swsusp: use block device offsets to identify swap locations · 3aef83e0
      Authored by Rafael J. Wysocki
      Make swsusp use block device offsets instead of swap offsets to identify swap
      locations and make it use the same code paths for writing as well as for
      reading data.
      
      This allows us to use the same code for handling swap files and swap
      partitions and to simplify the code, eg.  by dropping rw_swap_page_sync().
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] swsusp: use partition device and offset to identify swap areas · 915bae9e
      Authored by Rafael J. Wysocki
      The Linux kernel handles swap files almost in the same way as it handles swap
      partitions and there are only two differences between these two types of swap
      areas:
      
      (1) swap files need not be contiguous,
      
      (2) the header of a swap file is not in the first block of the partition
          that holds it.  From swsusp's point of view (1) is not a problem,
          because it is already taken care of by the swap-handling code, but (2)
          has to be taken into consideration.
      
      In principle the location of a swap file's header may be determined with the
      help of the appropriate filesystem driver.  Unfortunately, however, that
      requires the filesystem holding the swap file to be mounted, and if this
      filesystem is journaled, it cannot be mounted during a resume from disk.  For
      this reason we need some other means by which swap areas can be identified.
      
      For example, to identify a swap area we can use the partition that holds the
      area and the offset from the beginning of this partition at which the swap
      header is located.
      
      The following patch allows swsusp to identify swap areas this way.  It changes
      swap_type_of() so that it takes an additional argument representing an offset
      of the swap header within the partition represented by its first argument.
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Acked-by: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] radix-tree: RCU lockless readside · 7cf9c2c7
      Authored by Nick Piggin
      Make radix tree lookups safe to be performed without locks.  Readers are
      protected against nodes being deleted by using RCU based freeing.  Readers
      are protected against new node insertion by using memory barriers to ensure
      the node itself will be properly written before it is visible in the radix
      tree.
      
      Each radix tree node keeps a record of its height (above the leaf nodes).
      This height does not change after insertion -- when the radix tree is
      extended, higher nodes are only inserted at the top.  So a lookup can take
      the pointer to what is *now* the root node, and traverse down it even if
      the tree is concurrently extended and this node becomes a subtree of a new
      root.
      
      "Direct" pointers (tree height of 0, where root->rnode points directly to
      the data item) are handled by using the low bit of the pointer to signal
      whether rnode is a direct pointer or a pointer to a radix tree node.
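      
      The low-bit tagging in isolation (illustrative names; the kernel's masks
      differ in detail): node pointers are at least 2-byte aligned, so the
      bottom bit is free to distinguish "internal node" from "data item".
      
      #include <stdint.h>
      #include <stdio.h>
      
      #define INDIRECT_BIT 1UL
      
      static int   is_node(void *p) { return (uintptr_t)p & INDIRECT_BIT; }
      static void *untag(void *p)   { return (void *)((uintptr_t)p & ~INDIRECT_BIT); }
      static void *tag(void *p)     { return (void *)((uintptr_t)p |  INDIRECT_BIT); }
      
      int main(void)
      {
              static int item = 42;   /* a data item: low bit clear */
              void *rnode;
      
              rnode = &item;          /* height 0: root points at the data */
              printf("node? %d\n", is_node(rnode));
      
              rnode = tag(&item);     /* pretend it's an internal node */
              printf("node? %d  untagged ok? %d\n",
                     is_node(rnode), untag(rnode) == (void *)&item);
              return 0;
      }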
      
      When a reader wants to traverse the next branch, they will take a copy of
      the pointer.  This pointer will be either NULL (and the branch is empty) or
      non-NULL (and will point to a valid node).
      
      [akpm@osdl.org: cleanups]
      [Lee.Schermerhorn@hp.com: bugfixes, comments, simplifications]
      [clameter@sgi.com: build fix]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: make compound page destructor handling explicit · 33f2ef89
      Authored by Andy Whitcroft
      Currently we use the lru head link of the second page of a compound page
      to hold its destructor.  This was ok when it was purely an internal
      implementation detail.  However, hugetlbfs overrides this destructor,
      violating the layering.  Abstract this out as explicit calls, and also
      introduce a type for the callback function, allowing it to be type
      checked.  For each callback we pre-declare the function, causing a type
      error on definition rather than on use elsewhere.
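      
      The type-checking point, reduced to a toy (mock struct page): with a
      named callback type and a setter, a destructor with the wrong signature
      fails to compile at the assignment.
      
      #include <stdio.h>
      
      struct page;
      typedef void (*compound_page_dtor)(struct page *);
      
      struct page { compound_page_dtor dtor; };
      
      static void free_compound_page(struct page *p)
      {
              (void)p;
              puts("destructor ran");
      }
      
      static void set_compound_page_dtor(struct page *p, compound_page_dtor d)
      {
              p->dtor = d;    /* a wrong-signature fn is a compile error */
      }
      
      int main(void)
      {
              struct page pg;
      
              set_compound_page_dtor(&pg, free_compound_page);
              pg.dtor(&pg);
              return 0;
      }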
      
      [akpm@osdl.org: cleanups]
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] slab: better fallback allocation behavior · 3c517a61
      Authored by Christoph Lameter
      Currently we simply attempt to allocate from all allowed nodes using
      GFP_THISNODE.  However, GFP_THISNODE does not do reclaim (it won't do any
      at all if the recent GFP_THISNODE patch is accepted).  If we truly run out
      of memory in the whole system then fallback_alloc may return NULL although
      memory may still be available if we were to perform more thorough reclaim.
      
      This patch changes fallback_alloc() so that we first only inspect all the
      per node queues for available slabs.  If we find any then we allocate from
      those.  This avoids slab fragmentation by first getting rid of all partial
      allocated slabs on every node before allocating new memory.
      
      If we cannot satisfy the allocation from any per node queue then we extend
      a slab.  We now call into the page allocator without specifying
      GFP_THISNODE.  The page allocator will then implement its own fallback (in
      the given cpuset context), perform necessary reclaim (again considering not
      a single node but the whole set of allowed nodes) and then return pages for
      a new slab.
      
      We identify from which node the pages were allocated and then insert the
      pages into the corresponding per node structure.  In order to do so we need
      to modify cache_grow() to take a parameter that specifies the new slab.
      kmem_getpages() can no longer set the GFP_THISNODE flag since we need to be
      able to use kmem_getpages() to allocate from an arbitrary node.  GFP_THISNODE
      needs to be specified when calling cache_grow().
      
      One key advantage is that the decision from which node to allocate new
      memory is removed from slab fallback processing.  The patch allows us to go
      back to using the page allocator's fallback/reclaim logic.
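      
      The reworked flow, condensed (stub helpers; the phase boundaries are what
      matters): partial slabs on all allowed nodes first, then one call into
      the page allocator without GFP_THISNODE.
      
      #include <stdlib.h>
      
      /* Stubs: per-node partial-slab queues, and a page-allocator call that
         is free to pick any allowed node and to reclaim across all of them. */
      static void *get_partial_slab(int node) { (void)node; return NULL; }
      static void *alloc_fresh_slab(void)     { return malloc(4096); }
      
      static void *fallback_alloc_model(const int *allowed, int n)
      {
              void *obj;
      
              /* Phase 1: consume partially allocated slabs everywhere first,
                 which also works against slab fragmentation. */
              for (int i = 0; i < n; i++)
                      if ((obj = get_partial_slab(allowed[i])) != NULL)
                              return obj;
      
              /* Phase 2: extend a slab; no GFP_THISNODE, so the page
                 allocator's own fallback and reclaim logic applies. */
              return alloc_fresh_slab();
      }
      
      int main(void)
      {
              int nodes[] = { 0, 1 };
              void *p = fallback_alloc_model(nodes, 2);
      
              free(p);
              return 0;
      }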
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] GFP_THISNODE must not trigger global reclaim · 952f3b51
      Authored by Christoph Lameter
      The intent of GFP_THISNODE is to make sure that an allocation occurs on a
      particular node.  If this is not possible then NULL needs to be returned so
      that the caller can choose what to do next on its own (the slab allocator
      depends on that).
      
      However, GFP_THISNODE currently triggers reclaim before returning a failure
      (GFP_THISNODE means GFP_NORETRY is set).  If we have over-allocated a node
      then we will currently do some reclaim before returning NULL.  The caller
      may want memory from other nodes before reclaim should be triggered.  (If
      the caller wants reclaim then he can directly use __GFP_THISNODE instead.)
      
      There is no flag to avoid reclaim in the page allocator and adding yet
      another GFP_xx flag would be difficult given that we are out of available
      flags.
      
      So just compare and see if all bits for GFP_THISNODE (__GFP_THISNODE,
      __GFP_NORETRY and __GFP_NOWARN) are set.  If so then we return NULL before
      waking up kswapd.
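      
      The test itself is just a mask comparison (assumed bit values below; the
      real ones live in gfp.h): only when all three bits are present does the
      allocator bail out before waking kswapd.
      
      #include <stdio.h>
      
      #define __GFP_THISNODE 0x1u     /* assumed values, illustration only */
      #define __GFP_NORETRY  0x2u
      #define __GFP_NOWARN   0x4u
      #define GFP_THISNODE   (__GFP_THISNODE | __GFP_NORETRY | __GFP_NOWARN)
      
      static int fail_without_reclaim(unsigned gfp_mask)
      {
              return (gfp_mask & GFP_THISNODE) == GFP_THISNODE;
      }
      
      int main(void)
      {
              printf("%d %d\n",
                     fail_without_reclaim(GFP_THISNODE),    /* 1: return NULL */
                     fail_without_reclaim(__GFP_THISNODE)); /* 0: normal path */
              return 0;
      }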
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] slab: fix two issues in kmalloc_node / __cache_alloc_node · 5bcd234d
      Authored by Christoph Lameter
      This addresses two issues:
      
      1. kmalloc_node() may intermittently return NULL if we are allocating
         from the current node and are unable to obtain memory for the current
         node from the page allocator.  This is because we call ___cache_alloc()
         if nodeid == numa_node_id() and ____cache_alloc() is not able to fall
         back to other nodes.
      
         This was introduced in the 2.6.19 development cycle.  Kernels <= 2.6.18
         did not do a restricted allocation in that case and blindly trusted the
         page allocator to have given us memory from the indicated node.  They
         inserted the page, regardless of the node it came from, into the queues
         for the current node.
      
      2. If kmalloc_node() is used on a node that has not been bootstrapped
         yet then we may try to pass an invalid node number to
         ____cache_alloc_node() triggering a BUG().
      
         Change the function to call fallback_alloc() instead.  Only call
         fallback_alloc() if we are allowed to fallback at all.  The need to
         handle a node not bootstrapped yet also first surfaced in the 2.6.19
         cycle.
      
      Update the comments since they were still describing the old kmalloc_node
      from 2.6.12.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] slab: remove SLAB_DMA · 441e143e
      Authored by Christoph Lameter
      SLAB_DMA is an alias of GFP_DMA.  This is the last such alias, so we
      remove the leftover comment too.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>