1. 26 8月, 2011 1 次提交
    • J
      lockdep: Add helper function for dir vs file i_mutex annotation · e096d0c7
      Josh Boyer 提交于
      Purely in-memory filesystems do not use the inode hash as the dcache
      tells us if an entry already exists.  As a result, they do not call
      unlock_new_inode, and thus directory inodes do not get put into a
      different lockdep class for i_sem.
      
      We need the different lockdep classes, because the locking order for
      i_mutex is different for directory inodes and regular inodes.  Directory
      inodes can do "readdir()", which takes i_mutex *before* possibly taking
      mm->mmap_sem (due to a page fault while copying the directory entry to
      user space).
      
      In contrast, regular inodes can be mmap'ed, which takes mm->mmap_sem
      before accessing i_mutex.
      
      The two cases can never happen for the same inode, so no real deadlock
      can occur, but without the different lockdep classes, lockdep cannot
      understand that.  As a result, if CONFIG_DEBUG_LOCK_ALLOC is set, this
      can lead to false positives from lockdep like below:
      
          find/645 is trying to acquire lock:
           (&mm->mmap_sem){++++++}, at: [<ffffffff81109514>] might_fault+0x5c/0xac
      
          but task is already holding lock:
           (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffff81149f34>]
          vfs_readdir+0x5b/0xb4
      
          which lock already depends on the new lock.
      
          the existing dependency chain (in reverse order) is:
      
          -> #1 (&sb->s_type->i_mutex_key#15){+.+.+.}:
                [<ffffffff8108ac26>] lock_acquire+0xbf/0x103
                [<ffffffff814db822>] __mutex_lock_common+0x4c/0x361
                [<ffffffff814dbc46>] mutex_lock_nested+0x40/0x45
                [<ffffffff811daa87>] hugetlbfs_file_mmap+0x82/0x110
                [<ffffffff81111557>] mmap_region+0x258/0x432
                [<ffffffff811119dd>] do_mmap_pgoff+0x2ac/0x306
                [<ffffffff81111b4f>] sys_mmap_pgoff+0x118/0x16a
                [<ffffffff8100c858>] sys_mmap+0x22/0x24
                [<ffffffff814e3ec2>] system_call_fastpath+0x16/0x1b
      
          -> #0 (&mm->mmap_sem){++++++}:
                [<ffffffff8108a4bc>] __lock_acquire+0xa1a/0xcf7
                [<ffffffff8108ac26>] lock_acquire+0xbf/0x103
                [<ffffffff81109541>] might_fault+0x89/0xac
                [<ffffffff81149cff>] filldir+0x6f/0xc7
                [<ffffffff811586ea>] dcache_readdir+0x67/0x205
                [<ffffffff81149f54>] vfs_readdir+0x7b/0xb4
                [<ffffffff8114a073>] sys_getdents+0x7e/0xd1
                [<ffffffff814e3ec2>] system_call_fastpath+0x16/0x1b
      
      This patch moves the directory vs file lockdep annotation into a helper
      function that can be called by in-memory filesystems and has hugetlbfs
      call it.
      Signed-off-by: NJosh Boyer <jwboyer@redhat.com>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e096d0c7
  2. 26 7月, 2011 1 次提交
  3. 24 7月, 2011 1 次提交
    • T
      VFS : mount lock scalability for internal mounts · 423e0ab0
      Tim Chen 提交于
      For a number of file systems that don't have a mount point (e.g. sockfs
      and pipefs), they are not marked as long term. Therefore in
      mntput_no_expire, all locks in vfs_mount lock are taken instead of just
      local cpu's lock to aggregate reference counts when we release
      reference to file objects.  In fact, only local lock need to have been
      taken to update ref counts as these file systems are in no danger of
      going away until we are ready to unregister them.
      
      The attached patch marks file systems using kern_mount without
      mount point as long term.  The contentions of vfs_mount lock
      is now eliminated.  Before un-registering such file system,
      kern_unmount should be called to remove the long term flag and
      make the mount point ready to be freed.
      Signed-off-by: NTim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      423e0ab0
  4. 27 5月, 2011 1 次提交
  5. 25 5月, 2011 1 次提交
  6. 23 3月, 2011 1 次提交
  7. 07 1月, 2011 1 次提交
    • N
      fs: icache RCU free inodes · fa0d7e3d
      Nick Piggin 提交于
      RCU free the struct inode. This will allow:
      
      - Subsequent store-free path walking patch. The inode must be consulted for
        permissions when walking, so an RCU inode reference is a must.
      - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
        to take i_lock no longer need to take sb_inode_list_lock to walk the list in
        the first place. This will simplify and optimize locking.
      - Could remove some nested trylock loops in dcache code
      - Could potentially simplify things a bit in VM land. Do not need to take the
        page lock to follow page->mapping.
      
      The downsides of this is the performance cost of using RCU. In a simple
      creat/unlink microbenchmark, performance drops by about 10% due to inability to
      reuse cache-hot slab objects. As iterations increase and RCU freeing starts
      kicking over, this increases to about 20%.
      
      In cases where inode lifetimes are longer (ie. many inodes may be allocated
      during the average life span of a single inode), a lot of this cache reuse is
      not applicable, so the regression caused by this patch is smaller.
      
      The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
      however this adds some complexity to list walking and store-free path walking,
      so I prefer to implement this at a later date, if it is shown to be a win in
      real situations. I haven't found a regression in any non-micro benchmark so I
      doubt it will be a problem.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      fa0d7e3d
  8. 12 11月, 2010 1 次提交
  9. 29 10月, 2010 1 次提交
  10. 26 10月, 2010 1 次提交
    • C
      fs: do not assign default i_ino in new_inode · 85fe4025
      Christoph Hellwig 提交于
      Instead of always assigning an increasing inode number in new_inode
      move the call to assign it into those callers that actually need it.
      For now callers that need it is estimated conservatively, that is
      the call is added to all filesystems that do not assign an i_ino
      by themselves.  For a few more filesystems we can avoid assigning
      any inode number given that they aren't user visible, and for others
      it could be done lazily when an inode number is actually needed,
      but that's left for later patches.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      85fe4025
  11. 15 10月, 2010 1 次提交
    • A
      llseek: automatically add .llseek fop · 6038f373
      Arnd Bergmann 提交于
      All file_operations should get a .llseek operation so we can make
      nonseekable_open the default for future file operations without a
      .llseek pointer.
      
      The three cases that we can automatically detect are no_llseek, seq_lseek
      and default_llseek. For cases where we can we can automatically prove that
      the file offset is always ignored, we use noop_llseek, which maintains
      the current behavior of not returning an error from a seek.
      
      New drivers should normally not use noop_llseek but instead use no_llseek
      and call nonseekable_open at open time.  Existing drivers can be converted
      to do the same when the maintainer knows for certain that no user code
      relies on calling seek on the device file.
      
      The generated code is often incorrectly indented and right now contains
      comments that clarify for each added line why a specific variant was
      chosen. In the version that gets submitted upstream, the comments will
      be gone and I will manually fix the indentation, because there does not
      seem to be a way to do that using coccinelle.
      
      Some amount of new code is currently sitting in linux-next that should get
      the same modifications, which I will do at the end of the merge window.
      
      Many thanks to Julia Lawall for helping me learn to write a semantic
      patch that does all this.
      
      ===== begin semantic patch =====
      // This adds an llseek= method to all file operations,
      // as a preparation for making no_llseek the default.
      //
      // The rules are
      // - use no_llseek explicitly if we do nonseekable_open
      // - use seq_lseek for sequential files
      // - use default_llseek if we know we access f_pos
      // - use noop_llseek if we know we don't access f_pos,
      //   but we still want to allow users to call lseek
      //
      @ open1 exists @
      identifier nested_open;
      @@
      nested_open(...)
      {
      <+...
      nonseekable_open(...)
      ...+>
      }
      
      @ open exists@
      identifier open_f;
      identifier i, f;
      identifier open1.nested_open;
      @@
      int open_f(struct inode *i, struct file *f)
      {
      <+...
      (
      nonseekable_open(...)
      |
      nested_open(...)
      )
      ...+>
      }
      
      @ read disable optional_qualifier exists @
      identifier read_f;
      identifier f, p, s, off;
      type ssize_t, size_t, loff_t;
      expression E;
      identifier func;
      @@
      ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
      {
      <+...
      (
         *off = E
      |
         *off += E
      |
         func(..., off, ...)
      |
         E = *off
      )
      ...+>
      }
      
      @ read_no_fpos disable optional_qualifier exists @
      identifier read_f;
      identifier f, p, s, off;
      type ssize_t, size_t, loff_t;
      @@
      ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
      {
      ... when != off
      }
      
      @ write @
      identifier write_f;
      identifier f, p, s, off;
      type ssize_t, size_t, loff_t;
      expression E;
      identifier func;
      @@
      ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
      {
      <+...
      (
        *off = E
      |
        *off += E
      |
        func(..., off, ...)
      |
        E = *off
      )
      ...+>
      }
      
      @ write_no_fpos @
      identifier write_f;
      identifier f, p, s, off;
      type ssize_t, size_t, loff_t;
      @@
      ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
      {
      ... when != off
      }
      
      @ fops0 @
      identifier fops;
      @@
      struct file_operations fops = {
       ...
      };
      
      @ has_llseek depends on fops0 @
      identifier fops0.fops;
      identifier llseek_f;
      @@
      struct file_operations fops = {
      ...
       .llseek = llseek_f,
      ...
      };
      
      @ has_read depends on fops0 @
      identifier fops0.fops;
      identifier read_f;
      @@
      struct file_operations fops = {
      ...
       .read = read_f,
      ...
      };
      
      @ has_write depends on fops0 @
      identifier fops0.fops;
      identifier write_f;
      @@
      struct file_operations fops = {
      ...
       .write = write_f,
      ...
      };
      
      @ has_open depends on fops0 @
      identifier fops0.fops;
      identifier open_f;
      @@
      struct file_operations fops = {
      ...
       .open = open_f,
      ...
      };
      
      // use no_llseek if we call nonseekable_open
      ////////////////////////////////////////////
      @ nonseekable1 depends on !has_llseek && has_open @
      identifier fops0.fops;
      identifier nso ~= "nonseekable_open";
      @@
      struct file_operations fops = {
      ...  .open = nso, ...
      +.llseek = no_llseek, /* nonseekable */
      };
      
      @ nonseekable2 depends on !has_llseek @
      identifier fops0.fops;
      identifier open.open_f;
      @@
      struct file_operations fops = {
      ...  .open = open_f, ...
      +.llseek = no_llseek, /* open uses nonseekable */
      };
      
      // use seq_lseek for sequential files
      /////////////////////////////////////
      @ seq depends on !has_llseek @
      identifier fops0.fops;
      identifier sr ~= "seq_read";
      @@
      struct file_operations fops = {
      ...  .read = sr, ...
      +.llseek = seq_lseek, /* we have seq_read */
      };
      
      // use default_llseek if there is a readdir
      ///////////////////////////////////////////
      @ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier readdir_e;
      @@
      // any other fop is used that changes pos
      struct file_operations fops = {
      ... .readdir = readdir_e, ...
      +.llseek = default_llseek, /* readdir is present */
      };
      
      // use default_llseek if at least one of read/write touches f_pos
      /////////////////////////////////////////////////////////////////
      @ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier read.read_f;
      @@
      // read fops use offset
      struct file_operations fops = {
      ... .read = read_f, ...
      +.llseek = default_llseek, /* read accesses f_pos */
      };
      
      @ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier write.write_f;
      @@
      // write fops use offset
      struct file_operations fops = {
      ... .write = write_f, ...
      +	.llseek = default_llseek, /* write accesses f_pos */
      };
      
      // Use noop_llseek if neither read nor write accesses f_pos
      ///////////////////////////////////////////////////////////
      
      @ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier read_no_fpos.read_f;
      identifier write_no_fpos.write_f;
      @@
      // write fops use offset
      struct file_operations fops = {
      ...
       .write = write_f,
       .read = read_f,
      ...
      +.llseek = noop_llseek, /* read and write both use no f_pos */
      };
      
      @ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier write_no_fpos.write_f;
      @@
      struct file_operations fops = {
      ... .write = write_f, ...
      +.llseek = noop_llseek, /* write uses no f_pos */
      };
      
      @ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier read_no_fpos.read_f;
      @@
      struct file_operations fops = {
      ... .read = read_f, ...
      +.llseek = noop_llseek, /* read uses no f_pos */
      };
      
      @ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      @@
      struct file_operations fops = {
      ...
      +.llseek = noop_llseek, /* no read or write fn */
      };
      ===== End semantic patch =====
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Cc: Julia Lawall <julia@diku.dk>
      Cc: Christoph Hellwig <hch@infradead.org>
      6038f373
  12. 08 10月, 2010 1 次提交
    • N
      hugetlb: hugepage migration core · 290408d4
      Naoya Horiguchi 提交于
      This patch extends page migration code to support hugepage migration.
      One of the potential users of this feature is soft offlining which
      is triggered by memory corrected errors (added by the next patch.)
      
      Todo:
      - there are other users of page migration such as memory policy,
        memory hotplug and memocy compaction.
        They are not ready for hugepage support for now.
      
      ChangeLog since v4:
      - define migrate_huge_pages()
      - remove changes on isolation/putback_lru_page()
      
      ChangeLog since v2:
      - refactor isolate/putback_lru_page() to handle hugepage
      - add comment about race on unmap_and_move_huge_page()
      
      ChangeLog since v1:
      - divide migration code path for hugepage
      - define routine checking migration swap entry for hugetlb
      - replace "goto" with "if/else" in remove_migration_pte()
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NJun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      290408d4
  13. 10 8月, 2010 3 次提交
    • A
      new helper: end_writeback() · b0683aa6
      Al Viro 提交于
      Essentially, the minimal variant of ->evict_inode().  It's
      a trimmed-down clear_inode(), sans any fs callbacks.  Once
      it returns we know that no async writeback will be happening;
      every ->evict_inode() instance should do that once and do that
      before doing anything ->write_inode() could interfere with
      (e.g. freeing the on-disk inode).
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b0683aa6
    • A
      switch hugetlbfs to ->evict_inode() · 2bbbda30
      Al Viro 提交于
      The first spoils - hugetlb can use default ->drop_inode() now.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2bbbda30
    • C
      remove inode_setattr · 1025774c
      Christoph Hellwig 提交于
      Replace inode_setattr with opencoded variants of it in all callers.  This
      moves the remaining call to vmtruncate into the filesystem methods where it
      can be replaced with the proper truncate sequence.
      
      In a few cases it was obvious that we would never end up calling vmtruncate
      so it was left out in the opencoded variant:
      
       spufs: explicitly checks for ATTR_SIZE earlier
       btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier
       ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above
      
      In addition to that ncpfs called inode_setattr with handcrafted iattrs,
      which allowed to trim down the opencoded variant.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      1025774c
  14. 28 5月, 2010 1 次提交
    • C
      rename the generic fsync implementations · 1b061d92
      Christoph Hellwig 提交于
      We don't name our generic fsync implementations very well currently.
      The no-op implementation for in-memory filesystems currently is called
      simple_sync_file which doesn't make too much sense to start with,
      the the generic one for simple filesystems is called simple_fsync
      which can lead to some confusion.
      
      This patch renames the generic file fsync method to generic_file_fsync
      to match the other generic_file_* routines it is supposed to be used
      with, and the no-op implementation to noop_fsync to make it obvious
      what to expect.  In addition add some documentation for both methods.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      1b061d92
  15. 17 12月, 2009 2 次提交
  16. 24 9月, 2009 2 次提交
  17. 23 9月, 2009 1 次提交
  18. 22 9月, 2009 1 次提交
    • E
      hugetlbfs: allow the creation of files suitable for MAP_PRIVATE on the vfs internal mount · 6bfde05b
      Eric B Munson 提交于
      This patchset adds a flag to mmap that allows the user to request that an
      anonymous mapping be backed with huge pages.  This mapping will borrow
      functionality from the huge page shm code to create a file on the kernel
      internal mount and use it to approximate an anonymous mapping.  The
      MAP_HUGETLB flag is a modifier to MAP_ANONYMOUS and will not work without
      both flags being preset.
      
      A new flag is necessary because there is no other way to hook into huge
      pages without creating a file on a hugetlbfs mount which wouldn't be
      MAP_ANONYMOUS.
      
      To userspace, this mapping will behave just like an anonymous mapping
      because the file is not accessible outside of the kernel.
      
      This patchset is meant to simplify the programming model.  Presently there
      is a large chunk of boiler platecode, contained in libhugetlbfs, required
      to create private, hugepage backed mappings.  This patch set would allow
      use of hugepages without linking to libhugetlbfs or having hugetblfs
      mounted.
      
      Unification of the VM code would provide these same benefits, but it has
      been resisted each time that it has been suggested for several reasons: it
      would break PAGE_SIZE assumptions across the kernel, it makes page-table
      abstractions really expensive, and it does not provide any benefit on
      architectures that do not support huge pages, incurring fast path
      penalties without providing any benefit on these architectures.
      
      This patch:
      
      There are two means of creating mappings backed by huge pages:
      
              1. mmap() a file created on hugetlbfs
              2. Use shm which creates a file on an internal mount which essentially
                 maps it MAP_SHARED
      
      The internal mount is only used for shared mappings but there is very
      little that stops it being used for private mappings. This patch extends
      hugetlbfs_file_setup() to deal with the creation of files that will be
      mapped MAP_PRIVATE on the internal hugetlbfs mount. This extended API is
      used in a subsequent patch to implement the MAP_HUGETLB mmap() flag.
      Signed-off-by: NEric Munson <ebmunson@us.ibm.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6bfde05b
  19. 11 9月, 2009 1 次提交
  20. 25 8月, 2009 1 次提交
    • H
      mm: fix hugetlb bug due to user_shm_unlock call · 353d5c30
      Hugh Dickins 提交于
      2.6.30's commit 8a0bdec1 removed
      user_shm_lock() calls in hugetlb_file_setup() but left the
      user_shm_unlock call in shm_destroy().
      
      In detail:
      Assume that can_do_hugetlb_shm() returns true and hence user_shm_lock()
      is not called in hugetlb_file_setup(). However, user_shm_unlock() is
      called in any case in shm_destroy() and in the following
      atomic_dec_and_lock(&up->__count) in free_uid() is executed and if
      up->__count gets zero, also cleanup_user_struct() is scheduled.
      
      Note that sched_destroy_user() is empty if CONFIG_USER_SCHED is not set.
      However, the ref counter up->__count gets unexpectedly non-positive and
      the corresponding structs are freed even though there are live
      references to them, resulting in a kernel oops after a lots of
      shmget(SHM_HUGETLB)/shmctl(IPC_RMID) cycles and CONFIG_USER_SCHED set.
      
      Hugh changed Stefan's suggested patch: can_do_hugetlb_shm() at the
      time of shm_destroy() may give a different answer from at the time
      of hugetlb_file_setup().  And fixed newseg()'s no_id error path,
      which has missed user_shm_unlock() ever since it came in 2.6.9.
      Reported-by: NStefan Huber <shuber2@gmail.com>
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Tested-by: NStefan Huber <shuber2@gmail.com>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      353d5c30
  21. 22 5月, 2009 1 次提交
  22. 13 5月, 2009 1 次提交
    • M
      Remove implementation of readpage from the hugetlbfs_aops · f2deae9d
      Mel Gorman 提交于
      The core VM assumes the page size used by the address_space in
      inode->i_mapping is PAGE_SIZE but hugetlbfs breaks this assumption by
      inserting pages into the page cache at offsets the core VM considers
      unexpected.
      
      This would not be a problem except that hugetlbfs also provide a
      ->readpage implementation.  As it exists, the core VM can assume the
      base page size is being used, allocate pages on behalf of the
      filesystem, insert them into the page cache and call ->readpage to
      populate them.  These pages are the wrong size and at the wrong offset
      for hugetlbfs causing confusion.
      
      This patch deletes the ->readpage implementation for hugetlbfs on the
      grounds the core VM should not be allocating and populating pages on
      behalf of hugetlbfs.  There should be no existing users of the
      ->readpage implementation so it should not cause a regression.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f2deae9d
  23. 22 4月, 2009 1 次提交
  24. 01 4月, 2009 2 次提交
  25. 11 2月, 2009 1 次提交
    • M
      Do not account for the address space used by hugetlbfs using VM_ACCOUNT · 5a6fe125
      Mel Gorman 提交于
      When overcommit is disabled, the core VM accounts for pages used by anonymous
      shared, private mappings and special mappings. It keeps track of VMAs that
      should be accounted for with VM_ACCOUNT and VMAs that never had a reserve
      with VM_NORESERVE.
      
      Overcommit for hugetlbfs is much riskier than overcommit for base pages
      due to contiguity requirements. It avoids overcommiting on both shared and
      private mappings using reservation counters that are checked and updated
      during mmap(). This ensures (within limits) that hugepages exist in the
      future when faults occurs or it is too easy to applications to be SIGKILLed.
      
      As hugetlbfs makes its own reservations of a different unit to the base page
      size, VM_ACCOUNT should never be set. Even if the units were correct, we would
      double account for the usage in the core VM and hugetlbfs. VM_NORESERVE may
      be set because an application can request no reserves be made for hugetlbfs
      at the risk of getting killed later.
      
      With commit fc8744ad, VM_NORESERVE and
      VM_ACCOUNT are getting unconditionally set for hugetlbfs-backed mappings. This
      breaks the accounting for both the core VM and hugetlbfs, can trigger an
      OOM storm when hugepage pools are too small lockups and corrupted counters
      otherwise are used. This patch brings hugetlbfs more in line with how the
      core VM treats VM_NORESERVE but prevents VM_ACCOUNT being set.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5a6fe125
  26. 07 1月, 2009 1 次提交
  27. 06 1月, 2009 1 次提交
  28. 14 11月, 2008 3 次提交
  29. 14 10月, 2008 1 次提交
  30. 27 7月, 2008 1 次提交
  31. 25 7月, 2008 3 次提交
    • A
      hugetlbfs: per mount huge page sizes · a137e1cc
      Andi Kleen 提交于
      Add the ability to configure the hugetlb hstate used on a per mount basis.
      
      - Add a new pagesize= option to the hugetlbfs mount that allows setting
        the page size
      - This option causes the mount code to find the hstate corresponding to the
        specified size, and sets up a pointer to the hstate in the mount's
        superblock.
      - Change the hstate accessors to use this information rather than the
        global_hstate they were using (requires a slight change in mm/memory.c
        so we don't NULL deref in the error-unmap path -- see comments).
      
      [np: take hstate out of hugetlbfs inode and vma->vm_private_data]
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Acked-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a137e1cc
    • A
      hugetlb: modular state for hugetlb page size · a5516438
      Andi Kleen 提交于
      The goal of this patchset is to support multiple hugetlb page sizes.  This
      is achieved by introducing a new struct hstate structure, which
      encapsulates the important hugetlb state and constants (eg.  huge page
      size, number of huge pages currently allocated, etc).
      
      The hstate structure is then passed around the code which requires these
      fields, they will do the right thing regardless of the exact hstate they
      are operating on.
      
      This patch adds the hstate structure, with a single global instance of it
      (default_hstate), and does the basic work of converting hugetlb to use the
      hstate.
      
      Future patches will add more hstate structures to allow for different
      hugetlbfs mounts to have different page sizes.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Acked-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a5516438
    • M
      hugetlb: guarantee that COW faults for a process that called mmap(MAP_PRIVATE)... · 04f2cbe3
      Mel Gorman 提交于
      hugetlb: guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed
      
      After patch 2 in this series, a process that successfully calls mmap() for
      a MAP_PRIVATE mapping will be guaranteed to successfully fault until a
      process calls fork().  At that point, the next write fault from the parent
      could fail due to COW if the child still has a reference.
      
      We only reserve pages for the parent but a copy must be made to avoid
      leaking data from the parent to the child after fork().  Reserves could be
      taken for both parent and child at fork time to guarantee faults but if
      the mapping is large it is highly likely we will not have sufficient pages
      for the reservation, and it is common to fork only to exec() immediatly
      after.  A failure here would be very undesirable.
      
      Note that the current behaviour of mainline with MAP_PRIVATE pages is
      pretty bad.  The following situation is allowed to occur today.
      
      1. Process calls mmap(MAP_PRIVATE)
      2. Process calls mlock() to fault all pages and makes sure it succeeds
      3. Process forks()
      4. Process writes to MAP_PRIVATE mapping while child still exists
      5. If the COW fails at this point, the process gets SIGKILLed even though it
         had taken care to ensure the pages existed
      
      This patch improves the situation by guaranteeing the reliability of the
      process that successfully calls mmap().  When the parent performs COW, it
      will try to satisfy the allocation without using reserves.  If that fails
      the parent will steal the page leaving any children without a page.
      Faults from the child after that point will result in failure.  If the
      child COW happens first, an attempt will be made to allocate the page
      without reserves and the child will get SIGKILLed on failure.
      
      To summarise the new behaviour:
      
      1. If the original mapper performs COW on a private mapping with multiple
         references, it will attempt to allocate a hugepage from the pool or
         the buddy allocator without using the existing reserves. On fail, VMAs
         mapping the same area are traversed and the page being COW'd is unmapped
         where found. It will then steal the original page as the last mapper in
         the normal way.
      
      2. The VMAs the pages were unmapped from are flagged to note that pages
         with data no longer exist. Future no-page faults on those VMAs will
         terminate the process as otherwise it would appear that data was corrupted.
         A warning is printed to the console that this situation occured.
      
      2. If the child performs COW first, it will attempt to satisfy the COW
         from the pool if there are enough pages or via the buddy allocator if
         overcommit is allowed and the buddy allocator can satisfy the request. If
         it fails, the child will be killed.
      
      If the pool is large enough, existing applications will not notice that
      the reserves were a factor.  Existing applications depending on the
      no-reserves been set are unlikely to exist as for much of the history of
      hugetlbfs, pages were prefaulted at mmap(), allocating the pages at that
      point or failing the mmap().
      
      [npiggin@suse.de: fix CONFIG_HUGETLB=n build]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      04f2cbe3