1. 08 December 2006, 38 commits
    • [PATCH] swsusp: Improve handling of highmem · 8357376d
      Committed by Rafael J. Wysocki
      Currently swsusp saves the contents of highmem pages by copying them to the
      normal zone which is quite inefficient (eg.  it requires two normal pages
      to be used for saving one highmem page).  This may be improved by using
      highmem for saving the contents of saveable highmem pages.
      
      Namely, during the suspend phase of the suspend-resume cycle we try to
      allocate as many free highmem pages as there are saveable highmem pages.
      If there are not enough highmem image pages to store the contents of all of
      the saveable highmem pages, some of them will be stored in the "normal"
      memory.  Next, we allocate as many free "normal" pages as needed to store
      the (remaining) image data.  We use a memory bitmap to mark the allocated
      free pages (ie.  highmem as well as "normal" image pages).
      
      Now, we use another memory bitmap to mark all of the saveable pages
      (highmem as well as "normal") and the contents of the saveable pages are
      copied into the image pages.  Then, the second bitmap is used to save the
      pfns corresponding to the saveable pages and the first one is used to save
      their data.
      
      During the resume phase the pfns of the pages that were saveable during the
      suspend are loaded from the image and used to mark the "unsafe" page
      frames.  Next, we try to allocate as many free highmem page frames as are
      needed to load all of the image data that had been in highmem before the
      suspend, and we allocate enough free "normal" page frames that the total
      number of allocated free pages (highmem and "normal") equals the size of the
      image.  While doing this we have to make sure that there will be some extra
      free "normal" and "safe" page frames for two lists of PBEs constructed
      later.
      
      Now, the image data are loaded, if possible, into their "original" page
      frames.  The image data that cannot be written into their "original" page
      frames are loaded into "safe" page frames and their "original" kernel
      virtual addresses, as well as the addresses of the "safe" pages containing
      their copies, are stored in one of two lists of PBEs.
      
      One list of PBEs is for the copies of "normal" suspend pages (ie.  "normal"
      pages that were saveable during the suspend) and it is used in the same way
      as previously (ie.  by the architecture-dependent parts of swsusp).  The
      other list of PBEs is for the copies of highmem suspend pages.  The pages
      in this list are restored (in a reversible way) right before the
      arch-dependent code is called.
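      
      For illustration, a minimal sketch of how such a list of PBEs could be
      consumed on resume, assuming a pbe layout along the lines of the one swsusp
      uses (the field and function names below are illustrative, not taken from
      the patch):
      
      	struct pbe {
      		void *address;		/* "safe" page holding the copy */
      		void *orig_address;	/* where the data finally belongs */
      		struct pbe *next;
      	};
      
      	/* right before the arch-dependent code runs, copy every highmem
      	 * page back into its original page frame */
      	static void restore_highmem_pbes(struct pbe *list)
      	{
      		struct pbe *p;
      
      		for (p = list; p; p = p->next)
      			memcpy(p->orig_address, p->address, PAGE_SIZE);
      	}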
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      8357376d
    • [PATCH] swsusp: use block device offsets to identify swap locations · 3aef83e0
      Committed by Rafael J. Wysocki
      Make swsusp use block device offsets instead of swap offsets to identify swap
      locations and make it use the same code paths for writing as well as for
      reading data.
      
      This allows us to use the same code for handling swap files and swap
      partitions and to simplify the code, eg.  by dropping rw_swap_page_sync().
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      3aef83e0
    • [PATCH] swsusp: use partition device and offset to identify swap areas · 915bae9e
      Committed by Rafael J. Wysocki
      The Linux kernel handles swap files almost in the same way as it handles swap
      partitions and there are only two differences between these two types of swap
      areas:
      
      (1) swap files need not be contiguous,
      
      (2) the header of a swap file is not in the first block of the partition
          that holds it.  From the swsusp's point of view (1) is not a problem,
          because it is already taken care of by the swap-handling code, but (2) has
          to be taken into consideration.
      
      In principle the location of a swap file's header may be determined with the
      help of the appropriate filesystem driver.  Unfortunately, however, it requires
      the filesystem holding the swap file to be mounted, and if this filesystem is
      journaled, it cannot be mounted during a resume from disk.  For this reason we
      need some other means by which swap areas can be identified.
      
      For example, to identify a swap area we can use the partition that holds the
      area and the offset from the beginning of this partition at which the swap
      header is located.
      
      The following patch allows swsusp to identify swap areas this way.  It changes
      swap_type_of() so that it takes an additional argument representing an offset
      of the swap header within the partition represented by its first argument.
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Acked-by: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      915bae9e
    • [PATCH] radix-tree: RCU lockless readside · 7cf9c2c7
      Committed by Nick Piggin
      Make radix tree lookups safe to be performed without locks.  Readers are
      protected against nodes being deleted by using RCU based freeing.  Readers
      are protected against new node insertion by using memory barriers to ensure
      the node itself will be properly written before it is visible in the radix
      tree.
      
      Each radix tree node keeps a record of its height (above leaf nodes).
      This height does not change after insertion -- when the radix tree is
      extended, higher nodes are only inserted at the top.  So a lookup can take
      the pointer to what is *now* the root node, and traverse down it even if
      the tree is concurrently extended and this node becomes a subtree of a new
      root.
      
      "Direct" pointers (tree height of 0, where root->rnode points directly to
      the data item) are handled by using the low bit of the pointer to signal
      whether rnode is a direct pointer or a pointer to a radix tree node.
      
      When a reader wants to traverse the next branch, they will take a copy of
      the pointer.  This pointer will be either NULL (and the branch is empty) or
      non-NULL (and will point to a valid node).
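      
      A minimal sketch of the low-bit tagging idea (the helper names below are
      illustrative, not necessarily the ones used in the patch):
      
      	/* bit 0 of root->rnode distinguishes a direct data pointer from a
      	 * pointer to a radix_tree_node; data items are at least word
      	 * aligned, so the bit is otherwise always clear */
      	static inline void *radix_tree_ptr_to_direct(void *ptr)
      	{
      		return (void *)((unsigned long)ptr | 1UL);
      	}
      
      	static inline int radix_tree_is_direct_ptr(void *ptr)
      	{
      		return (unsigned long)ptr & 1UL;
      	}
      
      	static inline void *radix_tree_direct_to_ptr(void *ptr)
      	{
      		return (void *)((unsigned long)ptr & ~1UL);
      	}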
      
      [akpm@osdl.org: cleanups]
      [Lee.Schermerhorn@hp.com: bugfixes, comments, simplifications]
      [clameter@sgi.com: build fix]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      7cf9c2c7
    • [PATCH] Save some bytes in struct mm_struct · 36de6437
      Committed by Arnaldo Carvalho de Melo
      Before:
      [acme@newtoy net-2.6.20]$ pahole --cacheline 32 kernel/sched.o mm_struct
      
      /* include2/asm/processor.h:542 */
      struct mm_struct {
              struct vm_area_struct *    mmap;                 /*     0     4 */
              struct rb_root             mm_rb;                /*     4     4 */
              struct vm_area_struct *    mmap_cache;           /*     8     4 */
              long unsigned int          (*get_unmapped_area)(); /*    12     4 */
              void                       (*unmap_area)();      /*    16     4 */
              long unsigned int          mmap_base;            /*    20     4 */
              long unsigned int          task_size;            /*    24     4 */
              long unsigned int          cached_hole_size;     /*    28     4 */
              /* ---------- cacheline 1 boundary ---------- */
              long unsigned int          free_area_cache;      /*    32     4 */
              pgd_t *                    pgd;                  /*    36     4 */
              atomic_t                   mm_users;             /*    40     4 */
              atomic_t                   mm_count;             /*    44     4 */
              int                        map_count;            /*    48     4 */
              struct rw_semaphore        mmap_sem;             /*    52    64 */
              spinlock_t                 page_table_lock;      /*   116    40 */
              struct list_head           mmlist;               /*   156     8 */
              mm_counter_t               _file_rss;            /*   164     4 */
              mm_counter_t               _anon_rss;            /*   168     4 */
              long unsigned int          hiwater_rss;          /*   172     4 */
              long unsigned int          hiwater_vm;           /*   176     4 */
              long unsigned int          total_vm;             /*   180     4 */
              long unsigned int          locked_vm;            /*   184     4 */
              long unsigned int          shared_vm;            /*   188     4 */
              /* ---------- cacheline 6 boundary ---------- */
              long unsigned int          exec_vm;              /*   192     4 */
              long unsigned int          stack_vm;             /*   196     4 */
              long unsigned int          reserved_vm;          /*   200     4 */
              long unsigned int          def_flags;            /*   204     4 */
              long unsigned int          nr_ptes;              /*   208     4 */
              long unsigned int          start_code;           /*   212     4 */
              long unsigned int          end_code;             /*   216     4 */
              long unsigned int          start_data;           /*   220     4 */
              /* ---------- cacheline 7 boundary ---------- */
              long unsigned int          end_data;             /*   224     4 */
              long unsigned int          start_brk;            /*   228     4 */
              long unsigned int          brk;                  /*   232     4 */
              long unsigned int          start_stack;          /*   236     4 */
              long unsigned int          arg_start;            /*   240     4 */
              long unsigned int          arg_end;              /*   244     4 */
              long unsigned int          env_start;            /*   248     4 */
              long unsigned int          env_end;              /*   252     4 */
              /* ---------- cacheline 8 boundary ---------- */
              long unsigned int          saved_auxv[44];       /*   256   176 */
              unsigned int               dumpable:2;           /*   432     4 */
              cpumask_t                  cpu_vm_mask;          /*   436     4 */
              mm_context_t               context;              /*   440    68 */
              long unsigned int          swap_token_time;      /*   508     4 */
              /* ---------- cacheline 16 boundary ---------- */
              char                       recent_pagein;        /*   512     1 */
      
              /* XXX 3 bytes hole, try to pack */
      
              int                        core_waiters;         /*   516     4 */
              struct completion *        core_startup_done;    /*   520     4 */
              struct completion          core_done;            /*   524    52 */
              rwlock_t                   ioctx_list_lock;      /*   576    36 */
              struct kioctx *            ioctx_list;           /*   612     4 */
      }; /* size: 616, sum members: 613, holes: 1, sum holes: 3, cachelines: 20,
            last cacheline: 8 bytes */
      
      After:
      
      [acme@newtoy net-2.6.20]$ pahole --cacheline 32 kernel/sched.o mm_struct
      /* include2/asm/processor.h:542 */
      struct mm_struct {
              struct vm_area_struct *    mmap;                 /*     0     4 */
              struct rb_root             mm_rb;                /*     4     4 */
              struct vm_area_struct *    mmap_cache;           /*     8     4 */
              long unsigned int          (*get_unmapped_area)(); /*    12     4 */
              void                       (*unmap_area)();      /*    16     4 */
              long unsigned int          mmap_base;            /*    20     4 */
              long unsigned int          task_size;            /*    24     4 */
              long unsigned int          cached_hole_size;     /*    28     4 */
              /* ---------- cacheline 1 boundary ---------- */
              long unsigned int          free_area_cache;      /*    32     4 */
              pgd_t *                    pgd;                  /*    36     4 */
              atomic_t                   mm_users;             /*    40     4 */
              atomic_t                   mm_count;             /*    44     4 */
              int                        map_count;            /*    48     4 */
              struct rw_semaphore        mmap_sem;             /*    52    64 */
              spinlock_t                 page_table_lock;      /*   116    40 */
              struct list_head           mmlist;               /*   156     8 */
              mm_counter_t               _file_rss;            /*   164     4 */
              mm_counter_t               _anon_rss;            /*   168     4 */
              long unsigned int          hiwater_rss;          /*   172     4 */
              long unsigned int          hiwater_vm;           /*   176     4 */
              long unsigned int          total_vm;             /*   180     4 */
              long unsigned int          locked_vm;            /*   184     4 */
              long unsigned int          shared_vm;            /*   188     4 */
              /* ---------- cacheline 6 boundary ---------- */
              long unsigned int          exec_vm;              /*   192     4 */
              long unsigned int          stack_vm;             /*   196     4 */
              long unsigned int          reserved_vm;          /*   200     4 */
              long unsigned int          def_flags;            /*   204     4 */
              long unsigned int          nr_ptes;              /*   208     4 */
              long unsigned int          start_code;           /*   212     4 */
              long unsigned int          end_code;             /*   216     4 */
              long unsigned int          start_data;           /*   220     4 */
              /* ---------- cacheline 7 boundary ---------- */
              long unsigned int          end_data;             /*   224     4 */
              long unsigned int          start_brk;            /*   228     4 */
              long unsigned int          brk;                  /*   232     4 */
              long unsigned int          start_stack;          /*   236     4 */
              long unsigned int          arg_start;            /*   240     4 */
              long unsigned int          arg_end;              /*   244     4 */
              long unsigned int          env_start;            /*   248     4 */
              long unsigned int          env_end;              /*   252     4 */
              /* ---------- cacheline 8 boundary ---------- */
              long unsigned int          saved_auxv[44];       /*   256   176 */
              cpumask_t                  cpu_vm_mask;          /*   432     4 */
              mm_context_t               context;              /*   436    68 */
              long unsigned int          swap_token_time;      /*   504     4 */
              char                       recent_pagein;        /*   508     1 */
              unsigned char              dumpable:2;           /*   509     1 */
      
              /* XXX 2 bytes hole, try to pack */
      
              int                        core_waiters;         /*   512     4 */
              struct completion *        core_startup_done;    /*   516     4 */
              struct completion          core_done;            /*   520    52 */
              rwlock_t                   ioctx_list_lock;      /*   572    36 */
              struct kioctx *            ioctx_list;           /*   608     4 */
      }; /* size: 612, sum members: 610, holes: 1, sum holes: 2, cachelines: 20,
            last cacheline: 4 bytes */
      
      [acme@newtoy net-2.6.20]$ codiff -V /tmp/sched.o.before kernel/sched.o
      /pub/scm/linux/kernel/git/acme/net-2.6.20/kernel/sched.c:
        struct mm_struct |   -4
          dumpable:2;
           from: unsigned int          /*   432(30)    4(2) */
           to:   unsigned char         /*   509(6)     1(2) */
      < SNIP other offset changes >
       1 struct changed
      [acme@newtoy net-2.6.20]$
      
      I'm not aware of any problem about using 2 byte wide bitfields where
      previously a 4 byte wide one was, holler if there is any, I wouldn't be
      surprised, bitfields are things from hell.
      
      For the curious, 432(30) means: at offset 432 from the struct start, at
      offset 30 in the bitfield (yeah, it comes backwards, hellish, huh?); ditto
      for 509(6), while 4(2) and 1(2) mean "struct field size(bitfield size)".
      
      Now we have a 2-byte hole and are using only 4 bytes of the last 32-byte
      cacheline; any takers? :-)
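      
      For reference, the change behind the numbers above boils down to narrowing
      the bitfield's storage unit and moving it next to the other small members
      (a sketch, not the patch itself):
      
      	/* before: dumpable lived in its own 4-byte storage unit */
      	unsigned int	dumpable:2;
      
      	/* after: a 1-byte unit, placed right after recent_pagein so the
      	 * two bytes share what used to be padding */
      	unsigned char	dumpable:2;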
      Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      36de6437
    • [PATCH] mm: make compound page destructor handling explicit · 33f2ef89
      Committed by Andy Whitcroft
      Currently we use the lru head link of the second page of a compound page
      to hold its destructor.  This was OK when it was purely an internal
      implementation detail.  However, hugetlbfs overrides this destructor,
      violating the layering.  Abstract this out as explicit calls, and also
      introduce a type for the callback function, allowing it to be type
      checked.  For each callback we pre-declare the function, causing a type
      error on definition rather than on use elsewhere.
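      
      A sketch of the resulting abstraction, assuming the obvious helper names
      (the destructor still lives in the second page's lru link, as described
      above):
      
      	typedef void compound_page_dtor(struct page *);
      
      	static inline void set_compound_page_dtor(struct page *page,
      						  compound_page_dtor *dtor)
      	{
      		page[1].lru.next = (void *)dtor;
      	}
      
      	static inline compound_page_dtor *get_compound_page_dtor(struct page *page)
      	{
      		return (compound_page_dtor *)page[1].lru.next;
      	}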
      
      [akpm@osdl.org: cleanups]
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      33f2ef89
    • [PATCH] slab: deprecate kmem_cache_t · 1b1cec4b
      Committed by Andrew Morton
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      1b1cec4b
    • [PATCH] slab: remove kmem_cache_t · e18b890b
      Committed by Christoph Lameter
      Replace all uses of kmem_cache_t with struct kmem_cache.
      
      The patch was generated using the following script:
      
      	#!/bin/sh
      	#
      	# Replace one string by another in all the kernel sources.
      	#
      
      	set -e
      
      	for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
      		quilt add $file
      		sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
      		mv /tmp/$$ $file
      		quilt refresh
      	done
      
      The script was run like this
      
      	sh replace kmem_cache_t "struct kmem_cache"
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      e18b890b
    • [PATCH] slab: remove SLAB_DMA · 441e143e
      Committed by Christoph Lameter
      SLAB_DMA is an alias of GFP_DMA. This is the last one so we
      remove the leftover comment too.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      441e143e
    • [PATCH] slab: remove SLAB_KERNEL · e94b1766
      Committed by Christoph Lameter
      SLAB_KERNEL is an alias of GFP_KERNEL.
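      
      Since the flag values are identical, a typical conversion is purely
      mechanical, e.g.:
      
      	/* before */
      	obj = kmem_cache_alloc(cachep, SLAB_KERNEL);
      	/* after */
      	obj = kmem_cache_alloc(cachep, GFP_KERNEL);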
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      e94b1766
    • [PATCH] slab: remove SLAB_ATOMIC · 54e6ecb2
      Committed by Christoph Lameter
      SLAB_ATOMIC is an alias of GFP_ATOMIC
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      54e6ecb2
    • [PATCH] slab: remove SLAB_USER · f7267c0c
      Committed by Christoph Lameter
      SLAB_USER is an alias of GFP_USER
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      f7267c0c
    • [PATCH] slab: remove SLAB_NOFS · e6b4f8da
      Committed by Christoph Lameter
      SLAB_NOFS is an alias of GFP_NOFS.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      e6b4f8da
    • [PATCH] slab: remove SLAB_NOIO · 55acbda0
      Committed by Christoph Lameter
      SLAB_NOIO is an alias of GFP_NOIO with a single instance of use.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      55acbda0
    • [PATCH] slab: remove SLAB_LEVEL_MASK · a06d72c1
      Committed by Christoph Lameter
      SLAB_LEVEL_MASK is only used internally in the slab allocator and is
      an alias of GFP_LEVEL_MASK.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      a06d72c1
    • [PATCH] slab: remove SLAB_NO_GROW · 6e0eaa4b
      Committed by Christoph Lameter
      It is only used internally in the slab.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      6e0eaa4b
    • [PATCH] silence unused pgdat warning from alloc_bootmem_node and friends · 4af2bfc1
      Committed by Andy Whitcroft
      x86 NUMA systems only define bootmem for node 0.  alloc_bootmem_node() and
      friends therefore ignore the passed pgdat and use NODE_DATA(0) in all
      cases.  This leads to the following warnings as we are not using the passed
      parameter:
      
        .../mm/page_alloc.c: In function 'zone_wait_table_init':
        .../mm/page_alloc.c:2259: warning: unused variable 'pgdat'
      
      One option would be to define all variables used with these macros
      __attribute__ ((unused)), but this would leave us exposed should these
      become genuinely unused.
      
      The key here is that we _are_ using the value; we ignore it, but that is a
      deliberate action.  This patch adds a nested local variable within the
      alloc_bootmem_node helper to which the pgdat parameter is assigned, making
      it 'used'.  The nested local is marked __attribute__ ((unused)) to silence
      this same warning for it.
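      
      A sketch of the resulting macro shape (the local's name and the
      __alloc_bootmem call are assumptions; the __attribute__ ((unused)) nested
      local is the point):
      
      	#define alloc_bootmem_node(pgdat, x)					\
      	({									\
      		struct pglist_data __attribute__ ((unused)) *__unused_pgdat = (pgdat); \
      		__alloc_bootmem((x), SMP_CACHE_BYTES, __pa(MAX_DMA_ADDRESS));	\
      	})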
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      4af2bfc1
    • [PATCH] numa node ids are int, page_to_nid and zone_to_nid should return int · 25ba77c1
      Committed by Andy Whitcroft
      NUMA node ids are passed as either int or unsigned int almost exclusively,
      yet page_to_nid and zone_to_nid both return unsigned long.  This is a
      throwback to when page_to_nid was a #define and was thus exposing the real
      type of the page flags field.
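      
      The type change itself amounts to (illustrative prototypes):
      
      	/* before */
      	static inline unsigned long page_to_nid(struct page *page);
      	static inline unsigned long zone_to_nid(struct zone *zone);
      
      	/* after: node ids are plain ints everywhere else */
      	static inline int page_to_nid(struct page *page);
      	static inline int zone_to_nid(struct zone *zone);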
      
      In addition to fixing up the definitions of page_to_nid and zone_to_nid I
      audited the users of these functions identifying the following incorrect
      uses:
      
      1) mm/page_alloc.c show_node() -- printk dumping the node id,
      2) include/asm-ia64/pgalloc.h pgtable_quicklist_free() -- comparison
         against numa_node_id() which returns an int from cpu_to_node(), and
      3) mm/mempolicy.c check_pte_range -- used as an index in node_isset which
         uses bit_set which in generic code takes an int.
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      25ba77c1
    • [PATCH] Remove uses of kmem_cache_t from mm/* and include/linux/slab.h · ebe29738
      Committed by Christoph Lameter
      Remove all uses of kmem_cache_t (most were left in slab.h).  The
      typedef for kmem_cache_t is then only necessary for other kernel
      subsystems.  Add a comment to that effect.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      ebe29738
    • [PATCH] Move names_cachep to linux/fs.h · b86c089b
      Committed by Christoph Lameter
      The names_cachep is used for getname() and putname().  So let's put it into
      fs.h near those two definitions.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      b86c089b
    • [PATCH] Move fs_cachep to linux/fs_struct.h · aa362a83
      Committed by Christoph Lameter
      fs_cachep is only used in kernel/exit.c and in kernel/fork.c.
      
      It is used to store fs_struct items so it should be placed in linux/fs_struct.h
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      aa362a83
    • [PATCH] Move filep_cachep to include/file.h · 8b7d91eb
      Committed by Christoph Lameter
      filp_cachep is only used in fs/file_table.c and in fs/dcache.c where
      it is defined.
      
      Move it to related definitions in linux/file.h.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      8b7d91eb
    • [PATCH] Move files_cachep to include/file.h · 5d6538fc
      Committed by Christoph Lameter
      The proper place is in file.h since files_cachep uses are related to file I/O.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      5d6538fc
    • [PATCH] Move vm_area_cachep to include/mm.h · c43692e8
      Committed by Christoph Lameter
      vm_area_cachep is used to store vm_area_structs. So move to mm.h.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      c43692e8
    • [PATCH] Move sighand_cachep to include/signal.h · 298ec1e2
      Committed by Christoph Lameter
      Move the sighand_cachep definition to linux/signal.h
      
      The sighand cache is only used in fs/exec.c and kernel/fork.c.  It is defined
      in kernel/fork.c but only used in fs/exec.c.
      
      The sighand_cachep is related to signal processing.  So add the definition to
      signal.h.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      298ec1e2
    • [PATCH] Remove bio_cachep from slab.h · 54cc211c
      Committed by Christoph Lameter
      Remove bio_cachep from slab.h - it no longer exists.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      54cc211c
    • [PATCH] node-aware skb allocation · b30973f8
      Committed by Christoph Hellwig
      Node-aware allocation of skbs for the receive path.
      
      Details:
      
        - __alloc_skb gets a new node argument and calls the node-aware
          slab functions with it.
        - netdev_alloc_skb passes the node number it gets from dev_to_node
          to it; everyone else passes -1 (any node).  See the sketch below.
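      
      A simplified, hedged sketch of the idea (the __alloc_skb signature below is
      assumed from the description, not copied from the patch):
      
      	struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
      				    int fclone, int node);
      
      	/* receive path: ask the slab for memory on the device's node */
      	static struct sk_buff *rx_alloc_skb(struct device *dev, unsigned int len)
      	{
      		return __alloc_skb(len, GFP_ATOMIC, 0, dev_to_node(dev));
      	}
      
      	/* callers with no locality preference pass -1 ("any node") */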
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      b30973f8
    • [PATCH] add numa node information to struct device · 87348136
      Committed by Christoph Hellwig
      For node-aware skb allocations we need information about the node in struct
      net_device or struct device.  Davem suggested to put it into struct device
      which this patch does.
      
      In particular:
      
       - struct device gets a new int numa_node member if CONFIG_NUMA is set
       - there are two new helpers, dev_to_node and set_dev_node to
         transparently deal with the non-numa case
       - for pci devices the node-info is set to the value we get from
         pcibus_to_node.
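      
      A hedged sketch of what this looks like (member and helper shapes as
      described above; treat the exact code as illustrative):
      
      	struct device {
      		/* ... */
      	#ifdef CONFIG_NUMA
      		int	numa_node;	/* NUMA node this device is close to */
      	#endif
      		/* ... */
      	};
      
      	#ifdef CONFIG_NUMA
      	static inline int dev_to_node(struct device *dev)
      	{
      		return dev->numa_node;
      	}
      	static inline void set_dev_node(struct device *dev, int node)
      	{
      		dev->numa_node = node;
      	}
      	#else
      	static inline int dev_to_node(struct device *dev)
      	{
      		return -1;		/* "any node" */
      	}
      	static inline void set_dev_node(struct device *dev, int node)
      	{
      	}
      	#endif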
      
      Note that for some architectures pcibus_to_node doesn't work yet at the time
      we currently call it.  This is harmless and just means skb allocations
      aren't node-local on those architectures until the implementation of
      pcibus_to_node on them has been updated (there are patches for x86 and
      x86_64 floating around).
      
      [akpm@osdl.org: cleanup]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      87348136
    • [PATCH] leak tracking for kmalloc_node · 8b98c169
      Committed by Christoph Hellwig
      We have variants of kmalloc and kmem_cache_alloc that leave leak tracking to
      the caller.  This is used for subsystem-specific allocators like skb_alloc.
      
      To make skb_alloc node-aware we need similar routines for the node-aware slab
      allocator, which this patch adds.
      
      Note that the code is rather ugly, but it mirrors the non-node-aware code 1:1:
      
      [akpm@osdl.org: add module export]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      8b98c169
    • [PATCH] mm: k{,um}map_atomic() vs in_atomic() · ad76fb6b
      Committed by Peter Zijlstra
      Make kmap_atomic/kunmap_atomic denote a pagefault-disabled scope.  All
      non-trivial implementations already do this anyway.
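      
      For the trivial (non-highmem) case this amounts to something like the
      following sketch; pagefault_disable()/pagefault_enable() themselves are
      introduced by the pagefault_{disable,enable}() patch listed below:
      
      	static inline void *kmap_atomic(struct page *page, enum km_type idx)
      	{
      		pagefault_disable();
      		return page_address(page);
      	}
      
      	#define kunmap_atomic(addr, idx)	do { pagefault_enable(); } while (0)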
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      ad76fb6b
    • [PATCH] mm: pagefault_{disable,enable}() · a866374a
      Committed by Peter Zijlstra
      Introduce pagefault_{disable,enable}() and use these where previously we did
      manual preempt increments/decrements to make the pagefault handler do the
      atomic thing.
      
      Currently they still rely on the increased preempt count, but do not rely on
      the disabled preemption; this might go away in the future.
      
      (NOTE: the extra barrier() in pagefault_disable might fix some holes on
             machines which have too many registers for their own good)
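      
      A sketch of the helpers as described (close to the shape they take in
      linux/uaccess.h, but treat the details as illustrative):
      
      	static inline void pagefault_disable(void)
      	{
      		inc_preempt_count();
      		/* make sure the count is visible before a fault can hit */
      		barrier();
      	}
      
      	static inline void pagefault_enable(void)
      	{
      		/* finish the user-space accesses before faults are allowed again */
      		barrier();
      		dec_preempt_count();
      		barrier();
      		preempt_check_resched();
      	}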
      
      [heiko.carstens@de.ibm.com: s390 fix]
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Nick Piggin <npiggin@suse.de>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      a866374a
    • [PATCH] shared page table for hugetlb page · 39dde65c
      Committed by Chen, Kenneth W
      Following up on the shared page table work done by Dave McCracken, this
      set of patches targets shared page tables for hugetlb memory only.
      
      Shared page tables are particularly useful when a large number of
      independent processes share large shared memory segments.  In the normal
      page case, the amount of memory saved from the processes' page tables is
      quite significant.  For hugetlb, the saving on page table memory is not
      the primary objective (as hugetlb itself already cuts down page table
      overhead significantly); instead, the purpose of using shared page tables
      for hugetlb is to allow faster TLB refills and less cache pollution upon
      TLB misses.
      
      With PT sharing, pte entries are shared among hundreds of processes, so
      the cache footprint of all the page tables is smaller and, in return, the
      application gets a much higher cache hit ratio.  Another effect is that
      the hit ratio of a hardware page walker finding the pte in cache will be
      higher, which helps reduce TLB miss latency.  These two effects contribute
      to higher application performance.
      Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
      Acked-by: Hugh Dickins <hugh@veritas.com>
      Cc: Dave McCracken <dmccr@us.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      39dde65c
    • [PATCH] mm: add arch_alloc_page · cc102509
      Committed by Nick Piggin
      Add an arch_alloc_page to match arch_free_page.
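      
      Presumably a no-op by default, mirroring the arch_free_page pattern (the
      guard name below is an assumption):
      
      	#ifndef HAVE_ARCH_ALLOC_PAGE
      	static inline void arch_alloc_page(struct page *page, int order) { }
      	#endif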
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      cc102509
    • [PATCH] new scheme to preempt swap token · 7602bdf2
      Committed by Ashwin Chaugule
      The new swap token patches replace the current token traversal algo.  The old
      algo had a crude timeout parameter that was used to hand the token over from
      one task to another.  This algo transfers the token to the tasks that are in
      need of the token.  The urgency for the token is based on the number of times
      a task is required to swap in pages.  Accordingly, the priority of a task is
      incremented if it has been badly affected by swap-outs.  To ensure that the
      token doesn't bounce around rapidly, the token holders are given a priority
      boost.  The priority of tasks is also decremented if their rate of swap-ins
      keeps falling.  This way, whether to preempt the swap token comes down to
      comparing two tasks' priority fields.
      
      [akpm@osdl.org: cleanups]
      Signed-off-by: Ashwin Chaugule <ashwin.chaugule@celunite.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      7602bdf2
    • [PATCH] memory page_alloc zonelist caching reorder structure · 7253f4ef
      Committed by Paul Jackson
      Rearrange the struct members in the 'struct zonelist_cache' structure, so
      as to put the readonly (once initialized) z_to_n[] array first, where it
      will come right after the zones[] array in struct zonelist.
      
      This pretty much eliminates the chance that the two frequently written
      elements of 'struct zonelist_cache', the fullzones bitmap and last_full_zap
      times, will end up on the same cache line as the performance sensitive,
      frequently read, never (after init) written zones[] array.
      
      Keeping frequently written data off frequently read cache lines is good for
      performance.
      
      Thanks to Rohit Seth for the suggestion.
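      
      The resulting layout, sketched with the member names mentioned above (the
      array bound macro is an assumption):
      
      	struct zonelist_cache {
      		unsigned short z_to_n[MAX_ZONES_PER_ZONELIST];	/* read-only after init */
      		DECLARE_BITMAP(fullzones, MAX_ZONES_PER_ZONELIST);	/* written often */
      		unsigned long last_full_zap;			/* written often */
      	};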
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Cc: Rohit Seth <rohitseth@google.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      7253f4ef
    • [PATCH] memory page_alloc zonelist caching speedup · 9276b1bc
      Committed by Paul Jackson
      Optimize the critical zonelist scanning for free pages in the kernel memory
      allocator by caching the zones that were found to be full recently, and
      skipping them.
      
      It remembers the zones in a zonelist that were short of free memory in the
      last second, and it stashes a zone-to-node table in the zonelist struct to
      optimize that conversion (minimizing its cache footprint).
      
      Recent changes:
      
          This differs in a significant way from a similar patch that I
          posted a week ago.  Now, instead of having a nodemask_t of
          recently full nodes, I have a bitmask of recently full zones.
          This solves a problem that last week's patch had, which on
          systems with multiple zones per node (such as DMA zone) would
          take seeing any of these zones full as meaning that all zones
          on that node were full.
      
          Also I changed names - from "zonelist faster" to "zonelist cache",
          as that seemed to better convey what we're doing here - caching
          some of the key zonelist state (for faster access.)
      
          See below for some performance benchmark results.  After all that
          discussion with David on why I didn't need them, I went and got
          some ;).  I wanted to verify that I had not hurt the normal case
          of memory allocation noticeably.  At least for my one little
          microbenchmark, I found (1) the normal case wasn't affected, and
          (2) workloads that forced scanning across multiple nodes for
          memory improved up to 10% fewer System CPU cycles and lower
          elapsed clock time ('sys' and 'real').  Good.  See details, below.
      
          I didn't have the logic in get_page_from_freelist() for various
          full nodes and zone reclaim failures correct.  That should be
          fixed up now - notice the new goto labels zonelist_scan,
          this_zone_full, and try_next_zone, in get_page_from_freelist().
      
      There are two reasons I pursued this alternative over some earlier
      proposals that would have focused on optimizing the fake numa
      emulation case by caching the last useful zone:
      
       1) Contrary to what I said before, we (SGI, on large ia64 sn2 systems)
          have seen real customer loads where the cost to scan the zonelist
          was a problem, due to many nodes being full of memory before
          we got to a node we could use.  Or at least, I think we have.
          This was related to me by another engineer, based on experiences
          from some time past.  So this is not guaranteed.  Most likely, though.
      
          The following approach should help such real numa systems just as
          much as it helps fake numa systems, or any combination thereof.
      
       2) The effort to distinguish fake from real numa, using node_distance,
          so that we could cache a fake numa node and optimize choosing
          it over equivalent distance fake nodes, while continuing to
          properly scan all real nodes in distance order, was going to
          require a nasty blob of zonelist and node distance munging.
      
          The following approach has no new dependency on node distances or
          zone sorting.
      
      See comment in the patch below for a description of what it actually does.
      
      Technical details of note (or controversy):
      
       - See the use of "zlc_active" and "did_zlc_setup" below, to delay
         adding any work for this new mechanism until we've looked at the
         first zone in zonelist.  I figured the odds of the first zone
         having the memory we needed were high enough that we should just
         look there, first, then get fancy only if we need to keep looking.
      
       - Some odd hackery was needed to add items to struct zonelist, while
         not tripping up the custom zonelists built by the mm/mempolicy.c
         code for MPOL_BIND.  My usual wordy comments below explain this.
         Search for "MPOL_BIND".
      
       - Some per-node data in the struct zonelist is now modified frequently,
         with no locking.  Multiple CPU cores on a node could hit and mangle
         this data.  The theory is that this is just performance hint data,
         and the memory allocator will work just fine despite any such mangling.
         The fields at risk are the struct 'zonelist_cache' fields 'fullzones'
         (a bitmask) and 'last_full_zap' (unsigned long jiffies).  It should
         all be self correcting after at most a one second delay.
      
       - This still does a linear scan of the same lengths as before.  All
         I've optimized is making the scan faster, not algorithmically
         shorter.  It is now able to scan a compact array of 'unsigned
         short' in the case of many full nodes, so one cache line should
         cover quite a few nodes, rather than each node hitting another
         one or two new and distinct cache lines.
      
       - If both Andi and Nick don't find this too complicated, I will be
         (pleasantly) flabbergasted.
      
       - I removed the comment claiming we only use one cache line's worth of
         zonelist.  We seem, at least in the fake numa case, to have put the
         lie to that claim.
      
       - I pay no attention to the various watermarks and such in this performance
         hint.  A node could be marked full for one watermark, and then skipped
         over when searching for a page using a different watermark.  I think
         that's actually quite ok, as it will tend to slightly increase the
         spreading of memory over other nodes, away from a memory stressed node.
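      
      Putting the notes above together, the scan in get_page_from_freelist() ends
      up shaped roughly as below.  The labels and the zlc_active/did_zlc_setup
      flags are the ones named above; the zlc_* helper names and the omitted
      watermark/reclaim details are assumptions:
      
      zonelist_scan:
      	for (z = zonelist->zones; *z; z++) {
      		if (NUMA_BUILD && zlc_active &&
      		    !zlc_zone_worth_trying(zonelist, z, allowednodes))
      			continue;		/* recently seen full: skip cheaply */
      
      		if (!zone_watermark_ok(*z, order, mark, classzone_idx, alloc_flags))
      			goto this_zone_full;
      
      		page = buffered_rmqueue(zonelist, *z, order, gfp_mask);
      		if (page)
      			break;
      this_zone_full:
      		if (NUMA_BUILD)
      			zlc_mark_zone_full(zonelist, z);
      try_next_zone:
      		if (NUMA_BUILD && !did_zlc_setup) {
      			/* pay for the cache only after the first zone failed */
      			allowednodes = zlc_setup(zonelist, alloc_flags);
      			zlc_active = 1;
      			did_zlc_setup = 1;
      		}
      	}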
      
      ===============
      
      Performance - some benchmark results and analysis:
      
      This benchmark runs a memory hog program that uses multiple
      threads to touch a lot of memory as quickly as it can.
      
      Multiple runs were made, touching 12, 38, 64 or 90 GBytes out of
      the total 96 GBytes on the system, and using 1, 19, 37, or 55
      threads (on a 56 CPU system.)  System, user and real (elapsed)
      timings were recorded for each run, shown in units of seconds,
      in the table below.
      
      Two kernels were tested - 2.6.18-mm3 and the same kernel with
      this zonelist caching patch added.  The table also shows the
      percentage improvement the zonelist caching sys time is over
      (lower than) the stock *-mm kernel.
      
            number     2.6.18-mm3	   zonelist-cache    delta (< 0 good)	percent
       GBs    N  	------------	   --------------    ----------------	systime
       mem threads   sys user  real	  sys  user  real     sys  user  real	 better
        12	 1     153   24   177	  151	 24   176      -2     0    -1	   1%
        12	19	99   22     8	   99	 22	8	0     0     0	   0%
        12	37     111   25     6	  112	 25	6	1     0     0	  -0%
        12	55     115   25     5	  110	 23	5      -5    -2     0	   4%
        38	 1     502   74   576	  497	 73   570      -5    -1    -6	   0%
        38	19     426   78    48	  373	 76    39     -53    -2    -9	  12%
        38	37     544   83    36	  547	 82    36	3    -1     0	  -0%
        38	55     501   77    23	  511	 80    24      10     3     1	  -1%
        64	 1     917  125  1042	  890	124  1014     -27    -1   -28	   2%
        64	19    1118  138   119	  965	141   103    -153     3   -16	  13%
        64	37    1202  151    94	 1136	150    81     -66    -1   -13	   5%
        64	55    1118  141    61	 1072	140    58     -46    -1    -3	   4%
        90	 1    1342  177  1519	 1275	174  1450     -67    -3   -69	   4%
        90	19    2392  199   192	 2116	189   176    -276   -10   -16	  11%
        90	37    3313  238   175	 2972	225   145    -341   -13   -30	  10%
        90	55    1948  210   104	 1843	213   100    -105     3    -4	   5%
      
      Notes:
       1) This test ran a memory hog program that started a specified number N of
          threads, and had each thread allocate and touch 1/N'th of
          the total memory to be used in the test run in a single loop,
          writing a constant word to memory, one store every 4096 bytes.
          Watching this test during some earlier trial runs, I would see
          each of these threads sit down on one CPU and stay there, for
          the remainder of the pass, a different CPU for each thread.
      
       2) The 'real' column is not comparable to the 'sys' or 'user' columns.
          The 'real' column is seconds wall clock time elapsed, from beginning
          to end of that test pass.  The 'sys' and 'user' columns are total
          CPU seconds spent on that test pass.  For a 19 thread test run,
          for example, the sum of 'sys' and 'user' could be up to 19 times the
          number of 'real' elapsed wall clock seconds.
      
       3) Tests were run on a fresh, single-user boot, to minimize the amount
          of memory already in use at the start of the test, and to minimize
          the amount of background activity that might interfere.
      
       4) Tests were done on a 56 CPU, 28 Node system with 96 GBytes of RAM.
      
       5) Notice that the 'real' time gets large for the single thread runs, even
          though the measured 'sys' and 'user' times are modest.  I'm not sure what
          that means - probably something to do with it being slow for one thread to
          be accessing memory a long way away.  Perhaps the fake numa system, running
          ostensibly the same workload, would not show this substantial degradation
          of 'real' time for one thread on many nodes -- let's hope not.
      
       6) The high thread count passes (one thread per CPU - on 55 of 56 CPUs)
          ran quite efficiently, as one might expect.  Each pair of threads needed
          to allocate and touch the memory on the node the two threads shared, a
          pleasantly parallelizable workload.
      
       7) The intermediate thread count passes, where asking for a lot of memory forced
          them to go to a few neighboring nodes, improved the most with this zonelist
          caching patch.
      
      Conclusions:
       * This zonelist cache patch probably makes little difference one way or the
         other for most workloads on real numa hardware, if those workloads avoid
         heavy off node allocations.
       * For memory intensive workloads requiring substantial off-node allocations
         on real numa hardware, this patch improves both kernel and elapsed timings
         up to ten per-cent.
       * For fake numa systems, I'm optimistic, but will have to leave that up to
         Rohit Seth to actually test (once I get him a 2.6.18 backport.)
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Cc: Rohit Seth <rohitseth@google.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: David Rientjes <rientjes@cs.washington.edu>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      9276b1bc
    • [PATCH] Get rid of zone_table[] · 89689ae7
      Committed by Christoph Lameter
      The zone table is mostly not needed.  If we have a node in the page flags
      then we can get to the zone via NODE_DATA() which is much more likely to be
      already in the cpu cache.
      
      In the case of SMP and UP, NODE_DATA() is a constant pointer which allows us
      to access an exact replica of the zone table in the node_zones field.  In all
      of the above cases there will be no need at all for the zone table.
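      
      In other words, something along these lines (a sketch; the shift/mask
      details are elided):
      
      	static inline struct zone *page_zone(struct page *page)
      	{
      		/* node id from page->flags, zone from that node's pgdat --
      		 * no global zone_table[] lookup needed */
      		return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
      	}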
      
      The only remaining case is if in a NUMA system the node numbers do not fit
      into the page flags.  In that case we make sparse generate a table that
      maps sections to nodes and use that table to figure out the node number.
       This table is sized to fit in a single cache line for the known 32 bit
      NUMA platform which makes it very likely that the information can be
      obtained without a cache miss.
      
      For sparsemem the zone table seems to have been fairly large, based on
      the maximum possible number of sections and the number of zones per node.
      There is some memory saving by removing zone_table.  The main benefit is to
      reduce the cache footprint of the VM from the frequent lookups of zones.
      Plus it simplifies the page allocator.
      
      [akpm@osdl.org: build fix]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      89689ae7
    • [PATCH] add bottom_half.h · 676dcb8b
      Committed by Andrew Morton
      With CONFIG_SMP=n:
      
        drivers/input/ff-memless.c:384: warning: implicit declaration of function 'local_bh_disable'
        drivers/input/ff-memless.c:393: warning: implicit declaration of function 'local_bh_enable'
      
      Really linux/spinlock.h should include linux/interrupt.h.  But interrupt.h
      includes sched.h which will need spinlock.h.
      
      So the patch breaks the _bh declarations out into a separate header and
      includes it in both interrupt.h and spinlock.h.
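      
      So the new header presumably looks roughly like this (the exact declaration
      list is an assumption), with both interrupt.h and spinlock.h gaining an
      #include <linux/bottom_half.h>:
      
      	/* linux/bottom_half.h */
      	#ifndef _LINUX_BH_H
      	#define _LINUX_BH_H
      
      	extern void local_bh_disable(void);
      	extern void local_bh_enable(void);
      
      	#endif /* _LINUX_BH_H */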
      
      Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
      Cc: Andi Kleen <ak@suse.de>
      Cc: <stable@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      676dcb8b
  2. 07 December 2006, 2 commits