提交 · ed391f4ebf8f701d3566423ce8f17e614cde9806 · openeuler / raspberrypi-kernel

14 1月, 2010 1 次提交

vfs: Fix vmtruncate() regression · cedabed4

由 OGAWA Hirofumi 提交于 1月 13, 2010

If __block_prepare_write() was failed in block_write_begin(), the
allocated blocks can be outside of ->i_size.

But new truncate_pagecache() in vmtuncate() does nothing if new < old.
It means the above usage is not working anymore.

So, this patch fixes it by removing "new < old" check. It would need
more cleanup/change. But, now -rc and truncate working is in progress,
so, this tried to fix it minimum change.
Acked-by: NNick Piggin <npiggin@suse.de>
Signed-off-by: NOGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

cedabed4

16 12月, 2009 1 次提交

memcg: coalesce uncharge during unmap/truncate · 569b846d

由 KAMEZAWA Hiroyuki 提交于 12月 15, 2009

In massive parallel enviroment, res_counter can be a performance
bottleneck.  One strong techinque to reduce lock contention is reducing
calls by coalescing some amount of calls into one.

Considering charge/uncharge chatacteristic,
	- charge is done one by one via demand-paging.
	- uncharge is done by
		- in chunk at munmap, truncate, exit, execve...
		- one by one via vmscan/paging.

It seems we have a chance to coalesce uncharges for improving scalability
at unmap/truncation.

This patch is a for coalescing uncharge.  For avoiding scattering memcg's
structure to functions under /mm, this patch adds memcg batch uncharge
information to the task.  A reason for per-task batching is for making use
of caller's context information.  We do batched uncharge (deleyed
uncharge) when truncation/unmap occurs but do direct uncharge when
uncharge is called by memory reclaim (vmscan.c).

The degree of coalescing depends on callers
  - at invalidate/trucate... pagevec size
  - at unmap ....ZAP_BLOCK_SIZE
(memory itself will be freed in this degree.)
Then, we'll not coalescing too much.

On x86-64 8cpu server, I tested overheads of memcg at page fault by
running a program which does map/fault/unmap in a loop. Running
a task per a cpu by taskset and see sum of the number of page faults
in 60secs.

[without memcg config]
  40156968  page-faults              #      0.085 M/sec   ( +-   0.046% )
  27.67 cache-miss/faults
[root cgroup]
  36659599  page-faults              #      0.077 M/sec   ( +-   0.247% )
  31.58 miss/faults
[in a child cgroup]
  18444157  page-faults              #      0.039 M/sec   ( +-   0.133% )
  69.96 miss/faults
[child with this patch]
  27133719  page-faults              #      0.057 M/sec   ( +-   0.155% )
  47.16 miss/faults

We can see some amounts of improvement.
(root cgroup doesn't affected by this patch)
Another patch for "charge" will follow this and above will be improved more.

Changelog(since 2009/10/02):
 - renamed filed of memcg_batch (as pages to bytes, memsw to memsw_bytes)
 - some clean up and commentary/description updates.
 - added initialize code to copy_process(). (possible bug fix)

Changelog(old):
 - fixed !CONFIG_MEM_CGROUP case.
 - rebased onto the latest mmotm + softlimit fix patches.
 - unified patch for callers
 - added commetns.
 - make ->do_batch as bool.
 - removed css_get() at el. We don't need it.
Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

569b846d

04 12月, 2009 1 次提交

mm: fix comments for invalidate_inode_pages2() · e9de25dd

由 Peng Tao 提交于 10月 19, 2009

invalidate_inode_pages2() returns -EBUSY *NOT* -EIO if any pages could not be
invalidated.
Signed-off-by: NPeng Tao <bergwolf@gmail.com>
Signed-off-by: NJiri Kosina <jkosina@suse.cz>

e9de25dd

24 9月, 2009 1 次提交

truncate: new helpers · 25d9e2d1

由 npiggin@suse.de 提交于 8月 21, 2009

Introduce new truncate helpers truncate_pagecache and inode_newsize_ok.
vmtruncate is also consolidated from mm/memory.c and mm/nommu.c and
into mm/truncate.c.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NNick Piggin <npiggin@suse.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

25d9e2d1

16 9月, 2009 3 次提交

HWPOISON: Define a new error_remove_page address space op for async truncation · 25718736

由 Andi Kleen 提交于 9月 16, 2009

Truncating metadata pages is not safe right now before
we haven't audited all file systems.

To enable truncation only for data address space define
a new address_space callback error_remove_page.

This is used for memory_failure.c memory error handling.

This can be then set to truncate_inode_page()

This patch just defines the new operation and adds documentation.

Callers and users come in followon patches.
Signed-off-by: NAndi Kleen <ak@linux.intel.com>

25718736

HWPOISON: Add invalidate_inode_page · 83f78668

由 Wu Fengguang 提交于 9月 16, 2009

Add a simple way to invalidate a single page
This is just a refactoring of the truncate.c code.
Originally from Fengguang, modified by Andi Kleen.
Signed-off-by: NAndi Kleen <ak@linux.intel.com>

83f78668

HWPOISON: Refactor truncate to allow direct truncating of page v2 · 750b4987

由 Nick Piggin 提交于 9月 16, 2009

Extract out truncate_inode_page() out of the truncate path so that
it can be used by memory-failure.c

[AK: description, headers, fix typos]
v2: Some white space changes from Fengguang Wu
Signed-off-by: NAndi Kleen <ak@linux.intel.com>

750b4987

17 6月, 2009 1 次提交

mm: remove __invalidate_mapping_pages variant · 28697355

由 Mike Waychison 提交于 6月 16, 2009

Remove __invalidate_mapping_pages atomic variant now that its sole caller
can sleep (fixed in eccb95ce ("vfs: fix
lock inversion in drop_pagecache_sb()")).

This fixes softlockups that can occur while in the drop_caches path.
Signed-off-by: NMike Waychison <mikew@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: NJan Kara <jack@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

28697355

29 5月, 2009 1 次提交

memcg: fix deadlock between lock_page_cgroup and mapping tree_lock · e767e056

由 Daisuke Nishimura 提交于 5月 28, 2009

mapping->tree_lock can be acquired from interrupt context.  Then,
following dead lock can occur.

Assume "A" as a page.

 CPU0:
       lock_page_cgroup(A)
		interrupted
			-> take mapping->tree_lock.
 CPU1:
       take mapping->tree_lock
		-> lock_page_cgroup(A)

This patch tries to fix above deadlock by moving memcg's hook to out of
mapping->tree_lock.  charge/uncharge of pagecache/swapcache is protected
by page lock, not tree_lock.

After this patch, lock_page_cgroup() is not called under mapping->tree_lock.
Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e767e056

03 4月, 2009 1 次提交

FS-Cache: Recruit a page flags for cache management · 266cf658

由 David Howells 提交于 4月 03, 2009

Recruit a page flag to aid in cache management.  The following extra flag is
defined:

 (1) PG_fscache (PG_private_2)

     The marked page is backed by a local cache and is pinning resources in the
     cache driver.

If PG_fscache is set, then things that checked for PG_private will now also
check for that.  This includes things like truncation and page invalidation.
The function page_has_private() had been added to make the checks for both
PG_private and PG_private_2 at the same time.
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Acked-by: NSteve Dickson <steved@redhat.com>
Acked-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: NRik van Riel <riel@redhat.com>
Acked-by: NAl Viro <viro@zeniv.linux.org.uk>
Tested-by: NDaire Byrne <Daire.Byrne@framestore.com>

266cf658

20 10月, 2008 1 次提交

mmap: handle mlocked pages during map, remap, unmap · ba470de4

由 Rik van Riel 提交于 10月 18, 2008

Originally by Nick Piggin <npiggin@suse.de>

Remove mlocked pages from the LRU using "unevictable infrastructure"
during mmap(), munmap(), mremap() and truncate().  Try to move back to
normal LRU lists on munmap() when last mlocked mapping removed.  Remove
PageMlocked() status when page truncated from file.

[akpm@linux-foundation.org: cleanup]
[kamezawa.hiroyu@jp.fujitsu.com: fix double unlock_page()]
[kosaki.motohiro@jp.fujitsu.com: split LRU: munlock rework]
[lee.schermerhorn@hp.com: mlock: fix __mlock_vma_pages_range comment block]
[akpm@linux-foundation.org: remove bogus kerneldoc token]
Signed-off-by: NNick Piggin <npiggin@suse.de>
Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: NRik van Riel <riel@redhat.com>
Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: NKAMEZAWA Hiroyuki <kamewzawa.hiroyu@jp.fujitsu.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

ba470de4

17 10月, 2008 1 次提交

Remove Andrew Morton's old email accounts · e1f8e874

由 Francois Cami 提交于 10月 15, 2008

People can use the real name an an index into MAINTAINERS to find the
current email address.
Signed-off-by: NFrancois Cami <francois.cami@free.fr>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e1f8e874

03 9月, 2008 1 次提交

VFS: fix dio write returning EIO when try_to_release_page fails · 6ccfa806

由 Hisashi Hifumi 提交于 9月 02, 2008

Dio write returns EIO when try_to_release_page fails because bh is
still referenced.

The patch

    commit 3f31fddf
    Author: Mingming Cao <cmm@us.ibm.com>
    Date:   Fri Jul 25 01:46:22 2008 -0700

        jbd: fix race between free buffer and commit transaction

was merged into 2.6.27-rc1, but I noticed that this patch is not enough
to fix the race.

I did fsstress test heavily to 2.6.27-rc1, and found that dio write still
sometimes got EIO through this test.

The patch above fixed race between freeing buffer(dio) and committing
transaction(jbd) but I discovered that there is another race, freeing
buffer(dio) and ext3/4_ordered_writepage.

: background_writeout()
     ->write_cache_pages()
       ->ext3_ordered_writepage()
     	   walk_page_buffers() -> take a bh ref
 	   block_write_full_page() -> unlock_page
		: <- end_page_writeback
                : <- race! (dio write->try_to_release_page fails)
      	   walk_page_buffers() ->release a bh ref

ext3_ordered_writepage holds bh ref and does unlock_page remaining
taking a bh ref, so this causes the race and failure of
try_to_release_page.

To fix this race, I used the approach of falling back to buffered
writes if try_to_release_page() fails on a page.

[akpm@linux-foundation.org: cleanups]
Signed-off-by: NHisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: Zach Brown <zach.brown@oracle.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

6ccfa806

05 8月, 2008 1 次提交

mm: rename page trylock · 529ae9aa

由 Nick Piggin 提交于 8月 02, 2008

Converting page lock to new locking bitops requires a change of page flag
operation naming, so we might as well convert it to something nicer
(!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked).

This also facilitates lockdeping of page lock.
Signed-off-by: NNick Piggin <npiggin@suse.de>
Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: NPeter Zijlstra <peterz@infradead.org>
Acked-by: NAndrew Morton <akpm@linux-foundation.org>
Acked-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

529ae9aa

03 8月, 2008 1 次提交

mm: dont clear PG_uptodate on truncate/invalidate · 84209e02

由 Miklos Szeredi 提交于 8月 01, 2008

Brian Wang reported that a FUSE filesystem exported through NFS could
return I/O errors on read.  This was traced to splice_direct_to_actor()
returning a short or zero count when racing with page invalidation.

However this is not FUSE or NFSD specific, other filesystems (notably
NFS) also call invalidate_inode_pages2() to purge stale data from the
cache.

If this happens while such pages are sitting in a pipe buffer, then
splice(2) from the pipe can return zero, and read(2) from the pipe can
return ENODATA.

The zero return is especially bad, since it implies end-of-file or
disconnected pipe/socket, and is documented as such for splice.  But
returning an error for read() is also nasty, when in fact there was no
error (data becoming stale is not an error).

The same problems can be triggered by "hole punching" with
madvise(MADV_REMOVE).

Fix this by not clearing the PG_uptodate flag on truncation and
invalidation.
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
Acked-by: NNick Piggin <nickpiggin@yahoo.com.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

84209e02

27 7月, 2008 1 次提交

mm: spinlock tree_lock · 19fd6231

由 Nick Piggin 提交于 7月 25, 2008

mapping->tree_lock has no read lockers.  convert the lock from an rwlock
to a spinlock.
Signed-off-by: NNick Piggin <npiggin@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Reviewed-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

19fd6231

28 4月, 2008 1 次提交

fix invalidate_inode_pages2_range() to not clear ret · 0dd1334f

由 Hisashi Hifumi 提交于 4月 28, 2008

DIO invalidates page cache through invalidate_inode_pages2_range().
invalidate_inode_pages2_range() sets ret=-EIO when
invalidate_complete_page2() fails, but this ret is cleared if
do_launder_page() succeed on a page of next index.

In this case, dio is carried out even if invalidate_complete_page2() fails
on some pages.

This can cause inconsistency between memory and blocks on HDD because the
page cache still exists.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: NHisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Ken Chen <kenchen@google.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Chuck Lever <cel@citi.umich.edu>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

0dd1334f

04 3月, 2008 1 次提交

docbook: fix kernel-api source files · 0643245f

由 Randy Dunlap 提交于 2月 29, 2008

Fix docbook problems in kernel-api.tmpl.
These cause the generated docbook to be incorrect.
Signed-off-by: NRandy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

0643245f

06 2月, 2008 3 次提交

page migraton: handle orphaned pages · 62e1c553

由 Shaohua Li 提交于 2月 04, 2008

Orphaned page might have fs-private metadata, the page is truncated.  As
the page hasn't mapping, page migration refuse to migrate the page.  It
appears the page is only freed in page reclaim and if zone watermark is
low, the page is never freed, as a result migration always fail.  I thought
we could free the metadata so such page can be freed in migration and make
migration more reliable.

[akpm@linux-foundation.org: go direct to try_to_free_buffers()]
Signed-off-by: NShaohua Li <shaohua.li@intel.com>
Acked-by: NNick Piggin <npiggin@suse.de>
Acked-by: NChristoph Lameter <clameter@sgi.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

62e1c553

Fix dirty page accounting leak with ext3 data=journal · a2b34564

由 Bjorn Steinbrink 提交于 2月 04, 2008

In 46d2277c ("Clean up and make
try_to_free_buffers() not race with dirty pages"), try_to_free_buffers
was changed to bail out if the page was dirty.

That in turn caused truncate_complete_page to leak massive amounts of
memory, because the dirty bit was only cleared after the call to
try_to_free_buffers.

So the call to cancel_dirty_page was moved up to have the dirty bit
cleared early in 3e67c098 ("truncate:
clear page dirtiness before running try_to_free_buffers()").

The problem with that fix is, that the page can be redirtied after
cancel_dirty_page was called, eg. like this:

truncate_complete_page()
  cancel_dirty_page() // PG_dirty cleared, decr. dirty pages
  do_invalidatepage()
    ext3_invalidatepage()
      journal_invalidatepage()
        journal_unmap_buffer()
          __dispose_buffer()
            __journal_unfile_buffer()
              __journal_temp_unlink_buffer()
                mark_buffer_dirty(); // PG_dirty set, incr. dirty pages

And then we end up with dirty pages being wrongly accounted.

As a result, in ecdfc978 ("Resurrect
'try_to_free_buffers()' VM hackery") the changes to try_to_free_buffers
were reverted, so the original reason for the massive memory leak is
gone, and we can also revert the move of the call to cancel_dirty_page
from truncate_complete_page and get the accounting right again.

I'm not sure if it matters, but opposed to the final check in
__remove_from_page_cache, this one also cares about the task io
accounting, so maybe we want to use this instead, although it's not
quite the clean fix either.
Signed-off-by: NBjörn Steinbrink <B.Steinbrink@gmx.de>
Tested-by: NKrzysztof Piotr Oledzki <ole@ans.pl>
Cc: Jan Kara <jack@ucw.cz>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Osterried <osterried@jesse.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a2b34564

Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user · eebd2aa3

由 Christoph Lameter 提交于 2月 04, 2008

Simplify page cache zeroing of segments of pages through 3 functions

zero_user_segments(page, start1, end1, start2, end2)

        Zeros two segments of the page. It takes the position where to
        start and end the zeroing which avoids length calculations and
	makes code clearer.

zero_user_segment(page, start, end)

        Same for a single segment.

zero_user(page, start, length)

        Length variant for the case where we know the length.

We remove the zero_user_page macro. Issues:

1. Its a macro. Inline functions are preferable.

2. The KM_USER0 macro is only defined for HIGHMEM.

   Having to treat this special case everywhere makes the
   code needlessly complex. The parameter for zeroing is always
   KM_USER0 except in one single case that we open code.

Avoiding KM_USER0 makes a lot of code not having to be dealing
with the special casing for HIGHMEM anymore. Dealing with
kmap is only necessary for HIGHMEM configurations. In those
configurations we use KM_USER0 like we do for a series of other
functions defined in highmem.h.

Since KM_USER0 is depends on HIGHMEM the existing zero_user_page
function could not be a macro. zero_user_* functions introduced
here can be be inline because that constant is not used when these
functions are called.

Also extract the flushing of the caches to be outside of the kmap.

[akpm@linux-foundation.org: fix nfs and ntfs build]
[akpm@linux-foundation.org: fix ntfs build some more]
Signed-off-by: NChristoph Lameter <clameter@sgi.com>
Cc: Steven French <sfrench@us.ibm.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: <linux-ext4@vger.kernel.org>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Steven French <sfrench@us.ibm.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

eebd2aa3

04 2月, 2008 1 次提交

do_invalidatepage() comment typo fix · 28bc44d7

由 Fengguang Wu 提交于 2月 03, 2008

Fix a typo in the comment for do_invalidatepage().
Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
Signed-off-by: NAdrian Bunk <bunk@kernel.org>

28bc44d7

17 10月, 2007 2 次提交

Drop some headers from mm.h · 4af3c9cc

由 Alexey Dobriyan 提交于 10月 16, 2007

mm.h doesn't use directly anything from mutex.h and backing-dev.h, so
remove them and add them back to files which need them.

Cross-compile tested on many configs and archs.
Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4af3c9cc

mm: count reclaimable pages per BDI · c9e51e41

由 Peter Zijlstra 提交于 10月 16, 2007

Count per BDI reclaimable pages; nr_reclaimable = nr_dirty + nr_unstable.
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c9e51e41

20 7月, 2007 2 次提交

mm: merge populate and nopage into fault (fixes nonlinear) · 54cb8821

由 Nick Piggin 提交于 7月 19, 2007

Nonlinear mappings are (AFAIKS) simply a virtual memory concept that encodes
the virtual address -> file offset differently from linear mappings.

->populate is a layering violation because the filesystem/pagecache code
should need to know anything about the virtual memory mapping.  The hitch here
is that the ->nopage handler didn't pass down enough information (ie.  pgoff).
 But it is more logical to pass pgoff rather than have the ->nopage function
calculate it itself anyway (because that's a similar layering violation).

Having the populate handler install the pte itself is likewise a nasty thing
to be doing.

This patch introduces a new fault handler that replaces ->nopage and
->populate and (later) ->nopfn.  Most of the old mechanism is still in place
so there is a lot of duplication and nice cleanups that can be removed if
everyone switches over.

The rationale for doing this in the first place is that nonlinear mappings are
subject to the pagefault vs invalidate/truncate race too, and it seemed stupid
to duplicate the synchronisation logic rather than just consolidate the two.

After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in
pagecache.  Seems like a fringe functionality anyway.

NOPAGE_REFAULT is removed.  This should be implemented with ->fault, and no
users have hit mainline yet.

[akpm@linux-foundation.org: cleanup]
[randy.dunlap@oracle.com: doc. fixes for readahead]
[akpm@linux-foundation.org: build fix]
Signed-off-by: NNick Piggin <npiggin@suse.de>
Signed-off-by: NRandy Dunlap <randy.dunlap@oracle.com>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

54cb8821

mm: fix fault vs invalidate race for linear mappings · d00806b1

由 Nick Piggin 提交于 7月 19, 2007

Fix the race between invalidate_inode_pages and do_no_page.

Andrea Arcangeli identified a subtle race between invalidation of pages from
pagecache with userspace mappings, and do_no_page.

The issue is that invalidation has to shoot down all mappings to the page,
before it can be discarded from the pagecache. Between shooting down ptes to
a particular page, and actually dropping the struct page from the pagecache,
do_no_page from any process might fault on that page and establish a new
mapping to the page just before it gets discarded from the pagecache.

The most common case where such invalidation is used is in file truncation.
This case was catered for by doing a sort of open-coded seqlock between the
file's i_size, and its truncate_count.

Truncation will decrease i_size, then increment truncate_count before
unmapping userspace pages; do_no_page will read truncate_count, then find the
page if it is within i_size, and then check truncate_count under the page
table lock and back out and retry if it had subsequently been changed (ptl
will serialise against unmapping, and ensure a potentially updated
truncate_count is actually visible).

Complexity and documentation issues aside, the locking protocol fails in the
case where we would like to invalidate pagecache inside i_size. do_no_page
can come in anytime and filemap_nopage is not aware of the invalidation in
progress (as it is when it is outside i_size). The end result is that
dangling (->mapping == NULL) pages that appear to be from a particular file
may be mapped into userspace with nonsense data. Valid mappings to the same
place will see a different page.

Andrea implemented two working fixes, one using a real seqlock, another using
a page->flags bit. He also proposed using the page lock in do_no_page, but
that was initially considered too heavyweight. However, it is not a global or
per-file lock, and the page cacheline is modified in do_no_page to increment
_count and _mapcount anyway, so a further modification should not be a large
performance hit. Scalability is not an issue.

This patch implements this latter approach. ->nopage implementations return
with the page locked if it is possible for their underlying file to be
invalidated (in that case, they must set a special vm_flags bit to indicate
so). do_no_page only unlocks the page after setting up the mapping
completely. invalidation is excluded because it holds the page lock during
invalidation of each page (and ensures that the page is not mapped while
holding the lock).

This also allows significant simplifications in do_no_page, because we have
the page locked in the right place in the pagecache from the start.
Signed-off-by: NNick Piggin <npiggin@suse.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d00806b1

18 7月, 2007 1 次提交

fs: introduce some page/buffer invariants · 787d2214

由 Nick Piggin 提交于 7月 17, 2007

It is a bug to set a page dirty if it is not uptodate unless it has
buffers. If the page has buffers, then the page may be dirty (some buffers
dirty) but not uptodate (some buffers not uptodate). The exception to this
rule is if the set_page_dirty caller is racing with truncate or invalidate.

A buffer can not be set dirty if it is not uptodate.

If either of these situations occurs, it indicates there could be some data
loss problem. Some of these warnings could be a harmless one where the
page or buffer is set uptodate immediately after it is dirtied, however we
should fix those up, and enforce this ordering.

Bring the order of operations for truncate into line with those of
invalidate. This will prevent a page from being able to go !uptodate while
we're holding the tree_lock, which is probably a good thing anyway.
Signed-off-by: NNick Piggin <npiggin@suse.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

787d2214

17 7月, 2007 2 次提交

invalidate_mapping_pages(): add cond_resched · fc9a07e7

由 Andrew Morton 提交于 7月 15, 2007

invalidate_mapping_pages() can sometimes take a long time (millions of pages
to free). Long enough for the softlockup detector to trigger.

We used to have a cond_resched() in there but I took it out because the
drop_caches code calls invalidate_mapping_pages() under inode_lock.

The patch adds a nasty flag and puts the cond_resched() back.
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

fc9a07e7

vmscan: fix comments related to shrink_list() · 2706a1b8

由 Anderson Briglia 提交于 7月 15, 2007

Fix the shrink_list name on some files under mm/ directory.
Signed-off-by: NAnderson Briglia <anderson.briglia@indt.org.br>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

2706a1b8

10 5月, 2007 1 次提交

fs: convert core functions to zero_user_page · 01f2705d

由 Nate Diller 提交于 5月 09, 2007

It's very common for file systems to need to zero part or all of a page,
the simplist way is just to use kmap_atomic() and memset().  There's
actually a library function in include/linux/highmem.h that does exactly
that, but it's confusingly named memclear_highpage_flush(), which is
descriptive of *how* it does the work rather than what the *purpose* is.
So this patchset renames the function to zero_user_page(), and calls it
from the various places that currently open code it.

This first patch introduces the new function call, and converts all the
core kernel callsites, both the open-coded ones and the old
memclear_highpage_flush() ones.  Following this patch is a series of
conversions for each file system individually, per AKPM, and finally a
patch deprecating the old call.  The diffstat below shows the entire
patchset.

[akpm@linux-foundation.org: fix a few things]
Signed-off-by: NNate Diller <nate.diller@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

01f2705d

02 3月, 2007 1 次提交

[PATCH] VM: invalidate_inode_pages2_range() should not exit early · 7b965e08

由 Trond Myklebust 提交于 2月 28, 2007

Fix invalidate_inode_pages2_range() so that it does not immediately exit
just because a single page in the specified range could not be removed.
Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

7b965e08

12 2月, 2007 2 次提交

[PATCH] remove invalidate_inode_pages() · fc0ecff6

由 Andrew Morton 提交于 2月 10, 2007

Convert all calls to invalidate_inode_pages() into open-coded calls to
invalidate_mapping_pages().

Leave the invalidate_inode_pages() wrapper in place for now, marked as
deprecated.
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

fc0ecff6

[PATCH] Export invalidate_mapping_pages() to modules · 54bc4855

由 Anton Altaparmakov 提交于 2月 10, 2007

It makes no sense to me to export invalidate_inode_pages() and not
invalidate_mapping_pages() and I actually need invalidate_mapping_pages()
because of its range specification ability...

akpm: also remove the export of invalidate_inode_pages() by making it an
inlined wrapper.
Signed-off-by: NAnton Altaparmakov <aia21@cantab.net>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

54bc4855

27 1月, 2007 2 次提交

[PATCH] MM: Remove [PATCH] invalidate_inode_pages2_range() debug · 569d3287

由 Trond Myklebust 提交于 1月 26, 2007

NFS can handle the case where invalidate_inode_pages2_range() fails, so the
premise behind commit 8258d4a5 is now gone.

Remove the WARN_ON_ONCE() which is causing users grief as we can see from
http://bugzilla.kernel.org/show_bug.cgi?id=7826Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

569d3287

Resurrect 'try_to_free_buffers()' VM hackery · ecdfc978

由 Linus Torvalds 提交于 1月 26, 2007

It's not pretty, but it appears that ext3 with data=journal will clean
pages without ever actually telling the VM that they are clean.  This,
in turn, will result in the VM (and balance_dirty_pages() in particular)
to never realize that the pages got cleaned, and wait forever for an
event that already happened.

Technically, this seems to be a problem with ext3 itself, but it used to
be hidden by 'try_to_free_buffers()' noticing this situation on its own,
and just working around the filesystem problem.

This commit re-instates that hack, in order to avoid a regression for
the 2.6.20 release. This fixes bugzilla 7844:

	http://bugzilla.kernel.org/show_bug.cgi?id=7844

Peter Zijlstra points out that we should probably retain the debugging
code that this removes from cancel_dirty_page(), and I agree, but for
the imminent release we might as well just silence the warning too
(since it's not a new bug: anything that triggers that warning has been
around forever).
Acked-by: NRandy Dunlap <rdunlap@xenotime.net>
Acked-by: NJens Axboe <jens.axboe@oracle.com>
Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

ecdfc978

12 1月, 2007 1 次提交

[PATCH] NFS: Fix race in nfs_release_page() · e3db7691

由 Trond Myklebust 提交于 1月 10, 2007

    NFS: Fix race in nfs_release_page()

    invalidate_inode_pages2() may find the dirty bit has been set on a page
    owing to the fact that the page may still be mapped after it was locked.
    Only after the call to unmap_mapping_range() are we sure that the page
    can no longer be dirtied.
    In order to fix this, NFS has hooked the releasepage() method and tries
    to write the page out between the call to unmap_mapping_range() and the
    call to remove_mapping(). This, however leads to deadlocks in the page
    reclaim code, where the page may be locked without holding a reference
    to the inode or dentry.

    Fix is to add a new address_space_operation, launder_page(), which will
    attempt to write out a dirty page without releasing the page lock.
Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>

    Also, the bare SetPageDirty() can skew all sort of accounting leading to
    other nasties.

[akpm@osdl.org: cleanup]
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

e3db7691

24 12月, 2006 1 次提交

Clean up and export cancel_dirty_page() to modules · 8368e328

由 Linus Torvalds 提交于 12月 23, 2006

Make cancel_dirty_page() act more like all the other dirty and writeback
accounting functions: test for "mapping" being NULL, and do the
NR_FILE_DIRY accounting purely based on mapping_cap_account_dirty()).

Also, add it to the exports, so that modular filesystems can use it.
Acked-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

8368e328

23 12月, 2006 1 次提交

[PATCH] truncate: dirty memory accounting fix · 5f2a105d

由 Andrew Morton 提交于 12月 22, 2006

Only (un)account for IO and page-dirtying for devices which have real backing
store (ie: not tmpfs or ramdisks).

Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

5f2a105d

22 12月, 2006 2 次提交

[PATCH] truncate: clear page dirtiness before running try_to_free_buffers() · 3e67c098

由 Andrew Morton 提交于 12月 21, 2006

truncate presently invalidates the dirty page's buffer_heads then shoots down
the page.  But try_to_free_buffers() will now bale out because the page is
dirty.

Net effect: the LRU gets filled with dirty pages which have invalidated
buffer_heads attached.  They have no ->mapping and hence cannot be cleaned.
The machine leaks memory at an enormous rate.

Fix this by cleaning the page before running try_to_free_buffers(), so
try_to_free_buffers() can do its work.

Also, remember to do dirty-page-acoounting in cancel_dirty_page() so the
machine won't wedge up trying to write non-existent dirty pages.

Probably still wrong, but now less so.
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

3e67c098

VM: Remove "clear_page_dirty()" and "test_clear_page_dirty()" functions · fba2591b

由 Linus Torvalds 提交于 12月 20, 2006

They were horribly easy to mis-use because of their tempting naming, and
they also did way more than any users of them generally wanted them to
do.

A dirty page can become clean under two circumstances:

 (a) when we write it out.  We have "clear_page_dirty_for_io()" for
     this, and that function remains unchanged.

     In the "for IO" case it is not sufficient to just clear the dirty
     bit, you also have to mark the page as being under writeback etc.

 (b) when we actually remove a page due to it becoming inaccessible to
     users, notably because it was truncate()'d away or the file (or
     metadata) no longer exists, and we thus want to cancel any
     outstanding dirty state.

For the (b) case, we now introduce "cancel_dirty_page()", which only
touches the page state itself, and verifies that the page is not mapped
(since cancelling writes on a mapped page would be actively wrong as it
is still accessible to users).

Some filesystems need to be fixed up for this: CIFS, FUSE, JFS,
ReiserFS, XFS all use the old confusing functions, and will be fixed
separately in subsequent commits (with some of them just removing the
offending logic, and others using clear_page_dirty_for_io()).

This was confirmed by Martin Michlmayr to fix the apt database
corruption on ARM.

Cc: Martin Michlmayr <tbm@cyrius.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Andrei Popa <andrei.popa@i-neo.ro>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Gordon Farquharson <gordonfarquharson@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

fba2591b