1. 08 12月, 2006 40 次提交
    • S
      [PATCH] swsusp: fix platform mode · 8a05aac2
      Stefan Seyfried 提交于
      At some point after 2.6.13, in-kernel software suspend got "incomplete" for
      the so-called "platform" mode.  pm_ops->prepare() is never called.  A
      visible sign of this is the "moon" light on thinkpads not flashing during
      suspend.  Fix by readding the pm_ops->prepare call during suspend.
      Signed-off-by: NStefan Seyfried <seife@suse.de>
      Acked-by: N"Rafael J. Wysocki" <rjw@sisk.pl>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      8a05aac2
    • R
      [PATCH] swsusp: use __GFP_WAIT · 85949121
      Rafael J. Wysocki 提交于
      swsusp uses GFP_ATOMIC, but it can afford to use __GFP_WAIT, which will
      permit it to reclaim clean pagecache instead of emitting scary
      page-allocation-failure messages.
      
      Cc: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      85949121
    • R
      [PATCH] swsusp: Improve handling of highmem · 8357376d
      Rafael J. Wysocki 提交于
      Currently swsusp saves the contents of highmem pages by copying them to the
      normal zone which is quite inefficient (eg.  it requires two normal pages
      to be used for saving one highmem page).  This may be improved by using
      highmem for saving the contents of saveable highmem pages.
      
      Namely, during the suspend phase of the suspend-resume cycle we try to
      allocate as many free highmem pages as there are saveable highmem pages.
      If there are not enough highmem image pages to store the contents of all of
      the saveable highmem pages, some of them will be stored in the "normal"
      memory.  Next, we allocate as many free "normal" pages as needed to store
      the (remaining) image data.  We use a memory bitmap to mark the allocated
      free pages (ie.  highmem as well as "normal" image pages).
      
      Now, we use another memory bitmap to mark all of the saveable pages
      (highmem as well as "normal") and the contents of the saveable pages are
      copied into the image pages.  Then, the second bitmap is used to save the
      pfns corresponding to the saveable pages and the first one is used to save
      their data.
      
      During the resume phase the pfns of the pages that were saveable during the
      suspend are loaded from the image and used to mark the "unsafe" page
      frames.  Next, we try to allocate as many free highmem page frames as to
      load all of the image data that had been in the highmem before the suspend
      and we allocate so many free "normal" page frames that the total number of
      allocated free pages (highmem and "normal") is equal to the size of the
      image.  While doing this we have to make sure that there will be some extra
      free "normal" and "safe" page frames for two lists of PBEs constructed
      later.
      
      Now, the image data are loaded, if possible, into their "original" page
      frames.  The image data that cannot be written into their "original" page
      frames are loaded into "safe" page frames and their "original" kernel
      virtual addresses, as well as the addresses of the "safe" pages containing
      their copies, are stored in one of two lists of PBEs.
      
      One list of PBEs is for the copies of "normal" suspend pages (ie.  "normal"
      pages that were saveable during the suspend) and it is used in the same way
      as previously (ie.  by the architecture-dependent parts of swsusp).  The
      other list of PBEs is for the copies of highmem suspend pages.  The pages
      in this list are restored (in a reversible way) right before the
      arch-dependent code is called.
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      Cc: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      8357376d
    • R
      [PATCH] swsusp: update userland interface documentation · bf73bae6
      Rafael J. Wysocki 提交于
      The swsusp userland interface has recently changed for a couple of times, but
      the changes have not been documented.  Fix this, and document the
      SNAPSHOT_SET_SWAP_AREA ioctl().
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      Acked-by: NPavel Machek <pavel@ucw.cz>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      bf73bae6
    • R
      [PATCH] swsusp: add ioctl for swap files support · 37b2ba12
      Rafael J. Wysocki 提交于
      To be able to use swap files as suspend storage from the userland suspend
      tools we need an additional ioctl() that will allow us to provide the kernel
      with both the swap header's offset and the identification of the resume
      partition.
      
      The new ioctl() should be regarded as a replacement for the
      SNAPSHOT_SET_SWAP_FILE ioctl() that from now on will be considered as
      obsolete, but has to stay for backwards compatibility of the interface.
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      Acked-by: NPavel Machek <pavel@ucw.cz>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      37b2ba12
    • R
      [PATCH] swsusp: document support for swap files · ecbd0da1
      Rafael J. Wysocki 提交于
      Document the "resume_offset=" command line parameter as well as the way in
      which swap files are supported by swsusp.
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      Cc: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      ecbd0da1
    • R
      [PATCH] swsusp: add resume_offset command line parameter · 9a154d9d
      Rafael J. Wysocki 提交于
      Add the kernel command line parameter "resume_offset=" allowing us to specify
      the offset, in <PAGE_SIZE> units, from the beginning of the partition pointed
      to by the "resume=" parameter at which the swap header is located.
      
      This offset can be determined, for example, by an application using the FIBMAP
      ioctl to obtain the swap header's block number for given file.
      
      [akpm@osdl.org: we don't know what type sector_t is]
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      Cc: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      9a154d9d
    • R
      [PATCH] swsusp: use block device offsets to identify swap locations · 3aef83e0
      Rafael J. Wysocki 提交于
      Make swsusp use block device offsets instead of swap offsets to identify swap
      locations and make it use the same code paths for writing as well as for
      reading data.
      
      This allows us to use the same code for handling swap files and swap
      partitions and to simplify the code, eg.  by dropping rw_swap_page_sync().
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      Cc: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      3aef83e0
    • R
      [PATCH] swsusp: rearrange swap-handling code · 3fc6b34f
      Rafael J. Wysocki 提交于
      Rearrange the code in kernel/power/swap.c so that the next patch is more
      readable.
      
      [This patch only moves the existing code.]
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      Acked-by: NPavel Machek <pavel@ucw.cz>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      3fc6b34f
    • R
      [PATCH] swsusp: use partition device and offset to identify swap areas · 915bae9e
      Rafael J. Wysocki 提交于
      The Linux kernel handles swap files almost in the same way as it handles swap
      partitions and there are only two differences between these two types of swap
      areas:
      
      (1) swap files need not be contiguous,
      
      (2) the header of a swap file is not in the first block of the partition
          that holds it.  From the swsusp's point of view (1) is not a problem,
          because it is already taken care of by the swap-handling code, but (2) has
          to be taken into consideration.
      
      In principle the location of a swap file's header may be determined with the
      help of appropriate filesystem driver.  Unfortunately, however, it requires
      the filesystem holding the swap file to be mounted, and if this filesystem is
      journaled, it cannot be mounted during a resume from disk.  For this reason we
      need some other means by which swap areas can be identified.
      
      For example, to identify a swap area we can use the partition that holds the
      area and the offset from the beginning of this partition at which the swap
      header is located.
      
      The following patch allows swsusp to identify swap areas this way.  It changes
      swap_type_of() so that it takes an additional argument representing an offset
      of the swap header within the partition represented by its first argument.
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      Acked-by: NPavel Machek <pavel@ucw.cz>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      915bae9e
    • S
      [PATCH] uswsusp: add pmops->{prepare,enter,finish} support (aka "platform mode") · 3592695c
      Stefan Seyfried 提交于
      Add an ioctl to the userspace swsusp code that enables the usage of the
      pmops->prepare, pmops->enter and pmops->finish methods (the in-kernel
      suspend knows these as "platform method").  These are needed on many
      machines to (among others) speed up resuming by letting the BIOS skip some
      steps or let my hp nx5000 recognise the correct ac_adapter state after
      resume again.
      
      It also ensures on many machines, that changed hardware (unplugged AC
      adapters) gets correctly detected and that kacpid does not run wild after
      resume.
      Signed-off-by: NStefan Seyfried <seife@suse.de>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      3592695c
    • A
      [PATCH] alpha: switch to pci_get API · 074cec54
      Alan Cox 提交于
      Now that we have pci_get_bus_and_slot we can do the job correctly.  Note that
      some of these calls intentionally leak a device - this is because the device
      in question is always needed from boot to reboot.
      Signed-off-by: NAlan Cox <alan@redhat.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      074cec54
    • M
      [PATCH] h8300 stray bracket fix · 3869aa29
      Mariusz Kozlowski 提交于
      Signed-off-by: NMariusz Kozlowski <m.kozlowski@tuxland.pl>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      3869aa29
    • P
      [PATCH] avr32: fixup kprobes preemption handling · 4d3eeeac
      Paul Mundt 提交于
      While working on SH kprobes, I noticed that avr32 got the preemption
      handling wrong in the no probe case.  The idea is that upon entry of
      kprobe_handler() preemption is disabled outright across the life of the
      kprobe, only to be re-enabled in post_kprobe_handler().
      
      However, in the event that the probe is never activated, there's never any
      chance of hitting the post probe handler, which allows for the current
      avr32 implementation to disable preemption indefinitely, as it's currently
      missing a re-enable when no probe is activated.
      Signed-off-by: NPaul Mundt <lethal@linux-sh.org>
      Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      4d3eeeac
    • A
      [PATCH] arch/frv/kernel/futex.c must #include <linux/uaccess.h> · e9c1528a
      Adrian Bunk 提交于
      This patch fixes the following compile error with
      -Werror-implicit-function-declaration
      (without -Werror-implicit-function-declaration it's a link error):
      
          ...
            CC      arch/frv/kernel/futex.o
          /home/bunk/linux/kernel-2.6/linux-2.6.19-rc6-mm2/arch/frv/kernel/futex.c:
          In function 'futex_atomic_op_inuser':
          /home/bunk/linux/kernel-2.6/linux-2.6.19-rc6-mm2/arch/frv/kernel/futex.c:203:
          error: implicit declaration of function 'pagefault_disable'
          /home/bunk/linux/kernel-2.6/linux-2.6.19-rc6-mm2/arch/frv/kernel/futex.c:226:
          error: implicit declaration of function 'pagefault_enable'
          make[2]: *** [arch/frv/kernel/futex.o] Error 1
          ...
      Signed-off-by: NAdrian Bunk <bunk@stusta.de>
      Acked-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      e9c1528a
    • E
      48ad504e
    • N
      [PATCH] radix-tree: RCU lockless readside · 7cf9c2c7
      Nick Piggin 提交于
      Make radix tree lookups safe to be performed without locks.  Readers are
      protected against nodes being deleted by using RCU based freeing.  Readers
      are protected against new node insertion by using memory barriers to ensure
      the node itself will be properly written before it is visible in the radix
      tree.
      
      Each radix tree node keeps a record of their height (above leaf nodes).
      This height does not change after insertion -- when the radix tree is
      extended, higher nodes are only inserted in the top.  So a lookup can take
      the pointer to what is *now* the root node, and traverse down it even if
      the tree is concurrently extended and this node becomes a subtree of a new
      root.
      
      "Direct" pointers (tree height of 0, where root->rnode points directly to
      the data item) are handled by using the low bit of the pointer to signal
      whether rnode is a direct pointer or a pointer to a radix tree node.
      
      When a reader wants to traverse the next branch, they will take a copy of
      the pointer.  This pointer will be either NULL (and the branch is empty) or
      non-NULL (and will point to a valid node).
      
      [akpm@osdl.org: cleanups]
      [Lee.Schermerhorn@hp.com: bugfixes, comments, simplifications]
      [clameter@sgi.com: build fix]
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      7cf9c2c7
    • A
      [PATCH] Save some bytes in struct mm_struct · 36de6437
      Arnaldo Carvalho de Melo 提交于
      Before:
      [acme@newtoy net-2.6.20]$ pahole --cacheline 32 kernel/sched.o mm_struct
      
      /* include2/asm/processor.h:542 */
      struct mm_struct {
              struct vm_area_struct *    mmap;                 /*     0     4 */
              struct rb_root             mm_rb;                /*     4     4 */
              struct vm_area_struct *    mmap_cache;           /*     8     4 */
              long unsigned int          (*get_unmapped_area)(); /*    12     4 */
              void                       (*unmap_area)();      /*    16     4 */
              long unsigned int          mmap_base;            /*    20     4 */
              long unsigned int          task_size;            /*    24     4 */
              long unsigned int          cached_hole_size;     /*    28     4 */
              /* ---------- cacheline 1 boundary ---------- */
              long unsigned int          free_area_cache;      /*    32     4 */
              pgd_t *                    pgd;                  /*    36     4 */
              atomic_t                   mm_users;             /*    40     4 */
              atomic_t                   mm_count;             /*    44     4 */
              int                        map_count;            /*    48     4 */
              struct rw_semaphore        mmap_sem;             /*    52    64 */
              spinlock_t                 page_table_lock;      /*   116    40 */
              struct list_head           mmlist;               /*   156     8 */
              mm_counter_t               _file_rss;            /*   164     4 */
              mm_counter_t               _anon_rss;            /*   168     4 */
              long unsigned int          hiwater_rss;          /*   172     4 */
              long unsigned int          hiwater_vm;           /*   176     4 */
              long unsigned int          total_vm;             /*   180     4 */
              long unsigned int          locked_vm;            /*   184     4 */
              long unsigned int          shared_vm;            /*   188     4 */
              /* ---------- cacheline 6 boundary ---------- */
              long unsigned int          exec_vm;              /*   192     4 */
              long unsigned int          stack_vm;             /*   196     4 */
              long unsigned int          reserved_vm;          /*   200     4 */
              long unsigned int          def_flags;            /*   204     4 */
              long unsigned int          nr_ptes;              /*   208     4 */
              long unsigned int          start_code;           /*   212     4 */
              long unsigned int          end_code;             /*   216     4 */
              long unsigned int          start_data;           /*   220     4 */
              /* ---------- cacheline 7 boundary ---------- */
              long unsigned int          end_data;             /*   224     4 */
              long unsigned int          start_brk;            /*   228     4 */
              long unsigned int          brk;                  /*   232     4 */
              long unsigned int          start_stack;          /*   236     4 */
              long unsigned int          arg_start;            /*   240     4 */
              long unsigned int          arg_end;              /*   244     4 */
              long unsigned int          env_start;            /*   248     4 */
              long unsigned int          env_end;              /*   252     4 */
              /* ---------- cacheline 8 boundary ---------- */
              long unsigned int          saved_auxv[44];       /*   256   176 */
              unsigned int               dumpable:2;           /*   432     4 */
              cpumask_t                  cpu_vm_mask;          /*   436     4 */
              mm_context_t               context;              /*   440    68 */
              long unsigned int          swap_token_time;      /*   508     4 */
              /* ---------- cacheline 16 boundary ---------- */
              char                       recent_pagein;        /*   512     1 */
      
              /* XXX 3 bytes hole, try to pack */
      
              int                        core_waiters;         /*   516     4 */
              struct completion *        core_startup_done;    /*   520     4 */
              struct completion          core_done;            /*   524    52 */
              rwlock_t                   ioctx_list_lock;      /*   576    36 */
              struct kioctx *            ioctx_list;           /*   612     4 */
      }; /* size: 616, sum members: 613, holes: 1, sum holes: 3, cachelines: 20,
            last cacheline: 8 bytes */
      
      After:
      
      [acme@newtoy net-2.6.20]$ pahole --cacheline 32 kernel/sched.o mm_struct
      /* include2/asm/processor.h:542 */
      struct mm_struct {
              struct vm_area_struct *    mmap;                 /*     0     4 */
              struct rb_root             mm_rb;                /*     4     4 */
              struct vm_area_struct *    mmap_cache;           /*     8     4 */
              long unsigned int          (*get_unmapped_area)(); /*    12     4 */
              void                       (*unmap_area)();      /*    16     4 */
              long unsigned int          mmap_base;            /*    20     4 */
              long unsigned int          task_size;            /*    24     4 */
              long unsigned int          cached_hole_size;     /*    28     4 */
              /* ---------- cacheline 1 boundary ---------- */
              long unsigned int          free_area_cache;      /*    32     4 */
              pgd_t *                    pgd;                  /*    36     4 */
              atomic_t                   mm_users;             /*    40     4 */
              atomic_t                   mm_count;             /*    44     4 */
              int                        map_count;            /*    48     4 */
              struct rw_semaphore        mmap_sem;             /*    52    64 */
              spinlock_t                 page_table_lock;      /*   116    40 */
              struct list_head           mmlist;               /*   156     8 */
              mm_counter_t               _file_rss;            /*   164     4 */
              mm_counter_t               _anon_rss;            /*   168     4 */
              long unsigned int          hiwater_rss;          /*   172     4 */
              long unsigned int          hiwater_vm;           /*   176     4 */
              long unsigned int          total_vm;             /*   180     4 */
              long unsigned int          locked_vm;            /*   184     4 */
              long unsigned int          shared_vm;            /*   188     4 */
              /* ---------- cacheline 6 boundary ---------- */
              long unsigned int          exec_vm;              /*   192     4 */
              long unsigned int          stack_vm;             /*   196     4 */
              long unsigned int          reserved_vm;          /*   200     4 */
              long unsigned int          def_flags;            /*   204     4 */
              long unsigned int          nr_ptes;              /*   208     4 */
              long unsigned int          start_code;           /*   212     4 */
              long unsigned int          end_code;             /*   216     4 */
              long unsigned int          start_data;           /*   220     4 */
              /* ---------- cacheline 7 boundary ---------- */
              long unsigned int          end_data;             /*   224     4 */
              long unsigned int          start_brk;            /*   228     4 */
              long unsigned int          brk;                  /*   232     4 */
              long unsigned int          start_stack;          /*   236     4 */
              long unsigned int          arg_start;            /*   240     4 */
              long unsigned int          arg_end;              /*   244     4 */
              long unsigned int          env_start;            /*   248     4 */
              long unsigned int          env_end;              /*   252     4 */
              /* ---------- cacheline 8 boundary ---------- */
              long unsigned int          saved_auxv[44];       /*   256   176 */
              cpumask_t                  cpu_vm_mask;          /*   432     4 */
              mm_context_t               context;              /*   436    68 */
              long unsigned int          swap_token_time;      /*   504     4 */
              char                       recent_pagein;        /*   508     1 */
              unsigned char              dumpable:2;           /*   509     1 */
      
              /* XXX 2 bytes hole, try to pack */
      
              int                        core_waiters;         /*   512     4 */
              struct completion *        core_startup_done;    /*   516     4 */
              struct completion          core_done;            /*   520    52 */
              rwlock_t                   ioctx_list_lock;      /*   572    36 */
              struct kioctx *            ioctx_list;           /*   608     4 */
      }; /* size: 612, sum members: 610, holes: 1, sum holes: 2, cachelines: 20,
            last cacheline: 4 bytes */
      
      [acme@newtoy net-2.6.20]$ codiff -V /tmp/sched.o.before kernel/sched.o
      /pub/scm/linux/kernel/git/acme/net-2.6.20/kernel/sched.c:
        struct mm_struct |   -4
          dumpable:2;
           from: unsigned int          /*   432(30)    4(2) */
           to:   unsigned char         /*   509(6)     1(2) */
      < SNIP other offset changes >
       1 struct changed
      [acme@newtoy net-2.6.20]$
      
      I'm not aware of any problem about using 2 byte wide bitfields where
      previously a 4 byte wide one was, holler if there is any, I wouldn't be
      surprised, bitfields are things from hell.
      
      For the curious, 432(30) means: at offset 432 from the struct start, at
      offset 30 in the bitfield (yeah, it comes backwards, hellish, huh?) ditto
      for 509(6), while 4(2) and 1(2) means "struct field size(bitfield size)".
      
      Now we have a 2 bytes hole and are using only 4 bytes of the last 32
      bytes cacheline, any takers? :-)
      Signed-off-by: NArnaldo Carvalho de Melo <acme@mandriva.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      36de6437
    • A
      [PATCH] mm: make compound page destructor handling explicit · 33f2ef89
      Andy Whitcroft 提交于
      Currently we we use the lru head link of the second page of a compound page
      to hold its destructor.  This was ok when it was purely an internal
      implmentation detail.  However, hugetlbfs overrides this destructor
      violating the layering.  Abstract this out as explicit calls, also
      introduce a type for the callback function allowing them to be type
      checked.  For each callback we pre-declare the function, causing a type
      error on definition rather than on use elsewhere.
      
      [akpm@osdl.org: cleanups]
      Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      33f2ef89
    • C
      [PATCH] slab: better fallback allocation behavior · 3c517a61
      Christoph Lameter 提交于
      Currently we simply attempt to allocate from all allowed nodes using
      GFP_THISNODE.  However, GFP_THISNODE does not do reclaim (it wont do any at
      all if the recent GFP_THISNODE patch is accepted).  If we truly run out of
      memory in the whole system then fallback_alloc may return NULL although
      memory may still be available if we would perform more thorough reclaim.
      
      This patch changes fallback_alloc() so that we first only inspect all the
      per node queues for available slabs.  If we find any then we allocate from
      those.  This avoids slab fragmentation by first getting rid of all partial
      allocated slabs on every node before allocating new memory.
      
      If we cannot satisfy the allocation from any per node queue then we extend
      a slab.  We now call into the page allocator without specifying
      GFP_THISNODE.  The page allocator will then implement its own fallback (in
      the given cpuset context), perform necessary reclaim (again considering not
      a single node but the whole set of allowed nodes) and then return pages for
      a new slab.
      
      We identify from which node the pages were allocated and then insert the
      pages into the corresponding per node structure.  In order to do so we need
      to modify cache_grow() to take a parameter that specifies the new slab.
      kmem_getpages() can no longer set the GFP_THISNODE flag since we need to be
      able to use kmem_getpage to allocate from an arbitrary node.  GFP_THISNODE
      needs to be specified when calling cache_grow().
      
      One key advantage is that the decision from which node to allocate new
      memory is removed from slab fallback processing.  The patch allows to go
      back to use of the page allocators fallback/reclaim logic.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      3c517a61
    • C
      [PATCH] GFP_THISNODE must not trigger global reclaim · 952f3b51
      Christoph Lameter 提交于
      The intent of GFP_THISNODE is to make sure that an allocation occurs on a
      particular node.  If this is not possible then NULL needs to be returned so
      that the caller can choose what to do next on its own (the slab allocator
      depends on that).
      
      However, GFP_THISNODE currently triggers reclaim before returning a failure
      (GFP_THISNODE means GFP_NORETRY is set).  If we have over allocated a node
      then we will currently do some reclaim before returning NULL.  The caller
      may want memory from other nodes before reclaim should be triggered.  (If
      the caller wants reclaim then he can directly use __GFP_THISNODE instead).
      
      There is no flag to avoid reclaim in the page allocator and adding yet
      another GFP_xx flag would be difficult given that we are out of available
      flags.
      
      So just compare and see if all bits for GFP_THISNODE (__GFP_THISNODE,
      __GFP_NORETRY and __GFP_NOWARN) are set.  If so then we return NULL before
      waking up kswapd.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      952f3b51
    • C
      [PATCH] slab: fix two issues in kmalloc_node / __cache_alloc_node · 5bcd234d
      Christoph Lameter 提交于
      This addresses two issues:
      
      1. Kmalloc_node() may intermittently return NULL if we are allocating
         from the current node and are unable to obtain memory for the current
         node from the page allocator.  This is because we call ___cache_alloc()
         if nodeid == numa_node_id() and ____cache_alloc is not able to fallback
         to other nodes.
      
         This was introduced in the 2.6.19 development cycle.  <= 2.6.18 in
         that case does not do a restricted allocation and blindly trusts the
         page allocator to have given us memory from the indicated node.  It
         inserts the page regardless of the node it came from into the queues for
         the current node.
      
      2. If kmalloc_node() is used on a node that has not been bootstrapped
         yet then we may try to pass an invalid node number to
         ____cache_alloc_node() triggering a BUG().
      
         Change the function to call fallback_alloc() instead.  Only call
         fallback_alloc() if we are allowed to fallback at all.  The need to
         handle a node not bootstrapped yet also first surfaced in the 2.6.19
         cycle.
      
      Update the comments since they were still describing the old kmalloc_node
      from 2.6.12.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      5bcd234d
    • A
      [PATCH] slab: deprecate kmem_cache_t · 1b1cec4b
      Andrew Morton 提交于
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      1b1cec4b
    • C
      [PATCH] slab: remove kmem_cache_t · e18b890b
      Christoph Lameter 提交于
      Replace all uses of kmem_cache_t with struct kmem_cache.
      
      The patch was generated using the following script:
      
      	#!/bin/sh
      	#
      	# Replace one string by another in all the kernel sources.
      	#
      
      	set -e
      
      	for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
      		quilt add $file
      		sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
      		mv /tmp/$$ $file
      		quilt refresh
      	done
      
      The script was run like this
      
      	sh replace kmem_cache_t "struct kmem_cache"
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      e18b890b
    • C
      [PATCH] slab: remove SLAB_DMA · 441e143e
      Christoph Lameter 提交于
      SLAB_DMA is an alias of GFP_DMA. This is the last one so we
      remove the leftover comment too.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      441e143e
    • C
      [PATCH] slab: remove SLAB_KERNEL · e94b1766
      Christoph Lameter 提交于
      SLAB_KERNEL is an alias of GFP_KERNEL.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      e94b1766
    • C
      [PATCH] slab: remove SLAB_ATOMIC · 54e6ecb2
      Christoph Lameter 提交于
      SLAB_ATOMIC is an alias of GFP_ATOMIC
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      54e6ecb2
    • C
      [PATCH] slab: remove SLAB_USER · f7267c0c
      Christoph Lameter 提交于
      SLAB_USER is an alias of GFP_USER
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      f7267c0c
    • C
      [PATCH] slab: remove SLAB_NOFS · e6b4f8da
      Christoph Lameter 提交于
      SLAB_NOFS is an alias of GFP_NOFS.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      e6b4f8da
    • C
      [PATCH] slab: remove SLAB_NOIO · 55acbda0
      Christoph Lameter 提交于
      SLAB_NOIO is an alias of GFP_NOIO with a single instance of use.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      55acbda0
    • C
      [PATCH] slab: remove SLAB_LEVEL_MASK · a06d72c1
      Christoph Lameter 提交于
      SLAB_LEVEL_MASK is only used internally to the slab and is
      and alias of GFP_LEVEL_MASK.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      a06d72c1
    • C
      [PATCH] slab: remove SLAB_NO_GROW · 6e0eaa4b
      Christoph Lameter 提交于
      It is only used internally in the slab.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      6e0eaa4b
    • H
      [PATCH] kill install_file_pte's pte_val · 2d4d862f
      Hugh Dickins 提交于
      David Binderman and his Intel C compiler rightly observe that
      install_file_pte no longer has any use for its pte_val.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: d binderman <dcb314@hotmail.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      2d4d862f
    • A
      [PATCH] mm: cleanup indentation on switch for CPU operations · ce421c79
      Andy Whitcroft 提交于
      These patches introduced new switch statements which are indented contrary
      to the concensus in mm/*.c.  Fix them up to match that concensus.
      
          [PATCH] node local per-cpu-pages
          [PATCH] ZVC: Scale thresholds depending on the size of the system
          commit e7c8d5c9
          commit df9ecabaSigned-off-by: NAndy Whitcroft <apw@shadowen.org>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      ce421c79
    • E
      [PATCH] reject corrupt swapfiles earlier · 5d1854e1
      Eric Sandeen 提交于
      The fsfuzzer found this; with a corrupt small swapfile that claims to have
      many pages:
      
        [root]# file swap.741.img
        swap.741.img: Linux/i386 swap file (new style) 1 (4K pages) size 1040191487 pages
        [root]# ls -l swap.741.img
        -rw-r--r-- 1 root root 16777216 Nov 22 05:18 swap.741.img
      
      sys_swapon() will try to vmalloc all those pages, and -then- check to see if
      the file is actually that large:
      
                      if (!(p->swap_map = vmalloc(maxpages * sizeof(short)))) {
        <snip>
              if (swapfilesize && maxpages > swapfilesize) {
                      printk(KERN_WARNING
                             "Swap area shorter than signature indicates\n");
      
      It seems to me that it would make more sense to move this test up before
      the vmalloc, with the other checks, to avoid the OOM-killer in this
      situation...
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      5d1854e1
    • A
      [PATCH] silence unused pgdat warning from alloc_bootmem_node and friends · 4af2bfc1
      Andy Whitcroft 提交于
      x86 NUMA systems only define bootmem for node 0.  alloc_bootmem_node() and
      friends therefore ignore the passed pgdat and use NODE_DATA(0) in all
      cases.  This leads to the following warnings as we are not using the passed
      parameter:
      
        .../mm/page_alloc.c: In function 'zone_wait_table_init':
        .../mm/page_alloc.c:2259: warning: unused variable 'pgdat'
      
      One option would be to define all variables used with these macros
      __attribute__ ((unused)), but this would leave us exposed should these
      become genuinely unused.
      
      The key here is that we _are_ using the value, we ignore it but that is a
      deliberate action.  This patch adds a nested local variable within the
      alloc_bootmem_node helper to which the pgdat parameter is assigned making
      it 'used'.  The nested local is marked __attribute__ ((unused)) to silence
      this same warning for it.
      Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      4af2bfc1
    • A
      [PATCH] numa node ids are int, page_to_nid and zone_to_nid should return int · 25ba77c1
      Andy Whitcroft 提交于
      NUMA node ids are passed as either int or unsigned int almost exclusivly
      page_to_nid and zone_to_nid both return unsigned long.  This is a throw
      back to when page_to_nid was a #define and was thus exposing the real type
      of the page flags field.
      
      In addition to fixing up the definitions of page_to_nid and zone_to_nid I
      audited the users of these functions identifying the following incorrect
      uses:
      
      1) mm/page_alloc.c show_node() -- printk dumping the node id,
      2) include/asm-ia64/pgalloc.h pgtable_quicklist_free() -- comparison
         against numa_node_id() which returns an int from cpu_to_node(), and
      3) mm/mpolicy.c check_pte_range -- used as an index in node_isset which
         uses bit_set which in generic code takes an int.
      Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      25ba77c1
    • C
      [PATCH] drain_node_page(): Drain pages in batch units · bc4ba393
      Christoph Lameter 提交于
      drain_node_pages() currently drains the complete pageset of all pages.  If
      there are a large number of pages in the queues then we may hold off
      interrupts for too long.
      
      Duplicate the method used in free_hot_cold_page.  Only drain pcp->batch
      pages at one time.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      bc4ba393
    • C
      [PATCH] Remove uses of kmem_cache_t from mm/* and include/linux/slab.h · ebe29738
      Christoph Lameter 提交于
      Remove all uses of kmem_cache_t (the most were left in slab.h).  The
      typedef for kmem_cache_t is then only necessary for other kernel
      subsystems.  Add a comment to that effect.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      ebe29738
    • C
      [PATCH] Move names_cachep to linux/fs.h · b86c089b
      Christoph Lameter 提交于
      The names_cachep is used for getname() and putname().  So lets put it into
      fs.h near those two definitions.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      b86c089b