提交 · 7f5aa215088b817add9c71914b83650bdd49f8a9 · openeuler / raspberrypi-kernel

11 2月, 2009 1 次提交

jbd2: Avoid possible NULL dereference in jbd2_journal_begin_ordered_truncate() · 7f5aa215

由 Jan Kara 提交于 2月 10, 2009

If we race with commit code setting i_transaction to NULL, we could
possibly dereference it.  Proper locking requires the journal pointer
(to access journal->j_list_lock), which we don't have.  So we have to
change the prototype of the function so that filesystem passes us the
journal pointer.  Also add a more detailed comment about why the
function jbd2_journal_begin_ordered_truncate() does what it does and
how it should be used.

Thanks to Dan Carpenter <error27@gmail.com> for pointing to the
suspitious code.
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
Acked-by: NJoel Becker <joel.becker@oracle.com>
CC: linux-ext4@vger.kernel.org
CC: ocfs2-devel@oss.oracle.com
CC: mfasheh@suse.de
CC: Dan Carpenter <error27@gmail.com>

7f5aa215

30 1月, 2009 1 次提交

ext4: Remove bogus BUG() check in ext4_bmap() · b9ec63f7

由 Theodore Ts'o 提交于 1月 30, 2009

The code to support journal-less ext4 operation added a BUG to
ext4_bmap() which fired if there was no journal and the
EXT4_STATE_JDATA bit was set in the i_state field.  This caused
running the filefrag program (which uses the FIMBAP ioctl) to trigger
a BUG().

The EXT4_STATE_JDATA bit is only used for ext4_bmap(), and it's
harmless for the bit to be set.  We could add a check in
__ext4_journalled_writepage() and ext4_journalled_write_end() to only
set the EXT4_STATE_JDATA bit if the journal is present, but that adds
an extra test and jump instruction.  It's easier to simply remove the
BUG check.

http://bugzilla.kernel.org/show_bug.cgi?id=12568Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org

b9ec63f7

20 1月, 2009 1 次提交

ext4: Fix ext4_free_blocks() w/o a journal when files have indirect blocks · e7f07968

由 Theodore Ts'o 提交于 1月 20, 2009

When trying to unlink a file with indirect blocks on a filesystem
without a journal, the "circular indirect block" sanity test was
getting falsely triggered.
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

e7f07968

18 1月, 2009 1 次提交

ext4: only use i_size_high for regular files · 06a279d6

由 Theodore Ts'o 提交于 1月 17, 2009

Directories are not allowed to be bigger than 2GB, so don't use
i_size_high for anything other than regular files.  E2fsck should
complain about these inodes, but the simplest thing to do for the
kernel is to only use i_size_high for regular files.

This prevents an intentially corrupted filesystem from causing the
kernel to burn a huge amount of CPU and issuing error messages such
as:

EXT4-fs warning (device loop0): ext4_block_to_path: block 135090028 > max

Thanks to David Maciejak from Fortinet's FortiGuard Global Security
Research Team for reporting this issue.

http://bugzilla.kernel.org/show_bug.cgi?id=12375Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org

06a279d6

07 1月, 2009 1 次提交

percpu_counter: FBC_BATCH should be a variable · 179f7ebf

由 Eric Dumazet 提交于 1月 06, 2009

For NR_CPUS >= 16 values, FBC_BATCH is 2*NR_CPUS

Considering more and more distros are using high NR_CPUS values, it makes
sense to use a more sensible value for FBC_BATCH, and get rid of NR_CPUS.

A sensible value is 2*num_online_cpus(), with a minimum value of 32 (This
minimum value helps branch prediction in __percpu_counter_add())

We already have a hotcpu notifier, so we can adjust FBC_BATCH dynamically.

We rename FBC_BATCH to percpu_counter_batch since its not a constant
anymore.
Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

179f7ebf

05 1月, 2009 1 次提交

fs: symlink write_begin allocation context fix · 54566b2c

由 Nick Piggin 提交于 1月 04, 2009

With the write_begin/write_end aops, page_symlink was broken because it
could no longer pass a GFP_NOFS type mask into the point where the
allocations happened.  They are done in write_begin, which would always
assume that the filesystem can be entered from reclaim.  This bug could
cause filesystem deadlocks.

The funny thing with having a gfp_t mask there is that it doesn't really
allow the caller to arbitrarily tinker with the context in which it can be
called.  It couldn't ever be GFP_ATOMIC, for example, because it needs to
take the page lock.  The only thing any callers care about is __GFP_FS
anyway, so turn that into a single flag.

Add a new flag for write_begin, AOP_FLAG_NOFS.  Filesystems can now act on
this flag in their write_begin function.  Change __grab_cache_page to
accept a nofs argument as well, to honour that flag (while we're there,
change the name to grab_cache_page_write_begin which is more instructive
and does away with random leading underscores).

This is really a more flexible way to go in the end anyway -- if a
filesystem happens to want any extra allocations aside from the pagecache
ones in ints write_begin function, it may now use GFP_KERNEL (rather than
GFP_NOFS) for common case allocations (eg.  ocfs2_alloc_write_ctxt, for a
random example).

[kosaki.motohiro@jp.fujitsu.com: fix ubifs]
[kosaki.motohiro@jp.fujitsu.com: fix fuse]
Signed-off-by: NNick Piggin <npiggin@suse.de>
Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@kernel.org>		[2.6.28.x]
Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
[ Cleaned up the calling convention: just pass in the AOP flags
  untouched to the grab_cache_page_write_begin() function.  That
  just simplifies everybody, and may even allow future expansion of the
  logic.   - Linus ]
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

54566b2c

04 1月, 2009 1 次提交
- T
  ext4: Add markers for better debuggability · ba80b101
  由 Theodore Ts'o 提交于 1月 03, 2009
```
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
```
  ba80b101
06 1月, 2009 1 次提交

ext4: Use high 16 bits of the block group descriptor's free counts fields · 560671a0

由 Aneesh Kumar K.V 提交于 1月 05, 2009

Rename the lower bits with suffix _lo and add helper
to access the values. Also rename bg_itable_unused_hi
to bg_pad as in e2fsprogs.
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

560671a0

01 1月, 2009 1 次提交

ext4: ensure fast symlinks are NUL-terminated · e83c1397

由 Duane Griffin 提交于 12月 19, 2008

Ensure fast symlink targets are NUL-terminated, even if corrupted
on-disk.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: adilger@sun.com
Cc: linux-ext4@vger.kernel.org
Signed-off-by: NDuane Griffin <duaneg@dghda.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

e83c1397

07 11月, 2008 1 次提交

ext4: Mark the buffer_heads as dirty and uptodate after prepare_write · ed9b3e33

由 Aneesh Kumar K.V 提交于 11月 07, 2008

We need to make sure we mark the buffer_heads as dirty and uptodate
so that block_write_full_page write them correctly.

This fixes mmap corruptions that can occur in low memory situations.
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

ed9b3e33

23 11月, 2008 1 次提交

ext4: sparse fixes · 3a06d778

由 Aneesh Kumar K.V 提交于 11月 22, 2008

* Change EXT4_HAS_*_FEATURE to return a boolean
* Add a function prototype for ext4_fiemap() in ext4.h
* Make ext4_ext_fiemap_cb() and ext4_xattr_fiemap() be static functions
* Add lock annotations to mb_free_blocks()
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

3a06d778

07 11月, 2008 1 次提交

ext4: calculate journal credits correctly · ac51d837

由 Theodore Ts'o 提交于 11月 06, 2008

This fixes a 2.6.27 regression which was introduced in commit a02908f1.

We weren't passing the chunk parameter down to the two subections,
ext4_indirect_trans_blocks() and ext4_ext_index_trans_blocks(), with
the result that massively overestimate the amount of credits needed by
ext4_da_writepages, especially in the non-extents case. This causes
failures especially on /boot partitions, which tend to be small and
non-extent using since GRUB doesn't handle extents.

This patch fixes the bug reported by Joseph Fannin at:
http://bugzilla.kernel.org/show_bug.cgi?id=11964Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

ac51d837

05 11月, 2008 1 次提交

ext4: Change unsigned long to unsigned int · 498e5f24

由 Theodore Ts'o 提交于 11月 05, 2008

Convert the unsigned longs that are most responsible for bloating the
stack usage on 64-bit systems.

Nearly all places in the ext3/4 code which uses "unsigned long" is
probably a bug, since on 32-bit systems a ulong a 32-bits, which means
we are wasting stack space on 64-bit systems.
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

498e5f24

07 1月, 2009 1 次提交

ext4: Allow ext4 to run without a journal · 0390131b

由 Frank Mayhar 提交于 1月 07, 2009

A few weeks ago I posted a patch for discussion that allowed ext4 to run
without a journal.  Since that time I've integrated the excellent
comments from Andreas and fixed several serious bugs.  We're currently
running with this patch and generating some performance numbers against
both ext2 (with backported reservations code) and ext4 with and without
a journal.  It just so happens that running without a journal is
slightly faster for most everything.

We did
	iozone -T -t 4 s 2g -r 256k -T -I -i0 -i1 -i2

which creates 4 threads, each of which create and do reads and writes on
a 2G file, with a buffer size of 256K, using O_DIRECT for all file opens
to bypass the page cache.  Results:

                     ext2        ext4, default   ext4, no journal
  initial writes   13.0 MB/s        15.4 MB/s          15.7 MB/s
  rewrites         13.1 MB/s        15.6 MB/s          15.9 MB/s
  reads            15.2 MB/s        16.9 MB/s          17.2 MB/s
  re-reads         15.3 MB/s        16.9 MB/s          17.2 MB/s
  random readers    5.6 MB/s         5.6 MB/s           5.7 MB/s
  random writers    5.1 MB/s         5.3 MB/s           5.4 MB/s 

So it seems that, so far, this was a useful exercise.
Signed-off-by: NFrank Mayhar <fmayhar@google.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

0390131b

06 1月, 2009 1 次提交

ext4: Fix the delalloc writepages to allocate blocks at the right offset. · 791b7f08

由 Aneesh Kumar K.V 提交于 1月 05, 2009

When iterating through the pages which have mapped buffer_heads, we
failed to update the b_state value. This results in allocating blocks
at logical offset 0.
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org

791b7f08

05 11月, 2008 1 次提交

ext4: tone down ext4_da_writepages warnings · 2a21e37e

由 Theodore Ts'o 提交于 11月 05, 2008

If the filesystem has errors, ext4_da_writepages() will return a *lot*
of errors, including lots and lots of stack dumps. While it's true
that we are dropping user data on the floor, which is unfortunate, the
stack dumps aren't helpful, and they tend to obscure the true original
root cause of the problem. So in the case where the filesystem has
aborted, return an EROFS right away.
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

2a21e37e

02 1月, 2009 1 次提交

ext4: remove ext4_new_blocks() and call ext4_mb_new_blocks() directly · 815a1130

由 Theodore Ts'o 提交于 1月 01, 2009

There was only one caller of the compatibility function
ext4_new_blocks(), in balloc.c's ext4_alloc_blocks().  Change it to
call ext4_mb_new_blocks() directly, and remove ext4_new_blocks()
altogether.  This cleans up the code, by removing two extra functions
from the call chain, and hopefully saving some stack usage.
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

815a1130

30 10月, 2008 1 次提交

ext4: fix printk format warning · 8f72fbdf

由 Alexander Beregalov 提交于 10月 29, 2008

fs/ext4/balloc.c:607: warning: format '%lld' expects type 'long long int', but argument 2 has type 's64'
fs/ext4/inode.c:1822: warning: format '%lld' expects type 'long long int', but argument 2 has type 's64'
fs/ext4/inode.c:1824: warning: format '%lld' expects type 'long long int', but argument 2 has type 's64'
Signed-off-by: NAlexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

8f72fbdf

17 10月, 2008 1 次提交

ext4: Remove automatic enabling of the HUGE_FILE feature flag · f287a1a5

由 Theodore Ts'o 提交于 10月 16, 2008

If the HUGE_FILE feature flag is not set, don't allow the creation of
large files, instead of automatically enabling the feature flag.
Recent versions of mke2fs will set the HUGE_FILE flag automatically
anyway for ext4 filesystems.
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

f287a1a5

16 10月, 2008 1 次提交

ext4: Fix file fragmentation during large file write. · 22208ded

由 Aneesh Kumar K.V 提交于 10月 16, 2008

The range_cyclic writeback mode uses the address_space writeback_index
as the start index for writeback.  With delayed allocation we were
updating writeback_index wrongly resulting in highly fragmented file.
This patch reduces the number of extents reduced from 4000 to 27 for a
3GB file.
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

22208ded

14 10月, 2008 1 次提交

ext4: Use tag dirty lookup during mpage_da_submit_io · af6f029d

由 Aneesh Kumar K.V 提交于 10月 14, 2008

This enables us to drop the range_cont writeback mode
use from ext4_da_writepages.
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

af6f029d

11 10月, 2008 1 次提交

ext4: Rename ext4dev to ext4 · 03010a33

由 Theodore Ts'o 提交于 10月 10, 2008

The ext4 filesystem is getting stable enough that it's time to drop
the "dev" prefix.  Also remove the requirement for the TEST_FILESYS
flag.
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

03010a33

07 10月, 2008 1 次提交

Hook ext4 to the vfs fiemap interface. · 6873fa0d

由 Eric Sandeen 提交于 10月 07, 2008

ext4_ext_walk_space() was reinstated to be used for iterating over file
extents with a callback; it is used by the ext4 fiemap implementation.
Signed-off-by: NEric Sandeen <sandeen@redhat.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
Cc: linux-ext4@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org

6873fa0d

10 10月, 2008 2 次提交

T
ext4: Remove old legacy block allocator · c2ea3fde
由 Theodore Ts'o 提交于 10月 10, 2008
```
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
```
c2ea3fde

ext4: Use readahead when reading an inode from the inode table · 240799cd

由 Theodore Ts'o 提交于 10月 09, 2008

With modern hard drives, reading 64k takes roughly the same time as
reading a 4k block.  So request readahead for adjacent inode table
blocks to reduce the time it takes when iterating over directories
(especially when doing this in htree sort order) in a cold cache case.
With this patch, the time it takes to run "git status" on a kernel
tree after flushing the caches via "echo 3 > /proc/sys/vm/drop_caches"
is reduced by 21%.
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

240799cd

14 9月, 2008 2 次提交

ext4: Properly update i_disksize. · cf17fea6

由 Aneesh Kumar K.V 提交于 9月 13, 2008

With delayed allocation we use i_data_sem to update i_disksize. We need
to update i_disksize only if the new size specified is greater than the
current value and we need to make sure we don't race with other
i_disksize update. With delayed allocation we will switch to the
write_begin function for non-delayed allocation if we are low on free
blocks. This means the write_begin function for non-delayed allocation
also needs to use the same locking.

We also need to check and update i_disksize even if the new size is less
that inode.i_size because of delayed allocation.
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

cf17fea6

ext4: truncate block allocated on a failed ext4_write_begin · ae4d5372

由 Aneesh Kumar K.V 提交于 9月 13, 2008

For blocksize < pagesize we need to remove blocks that got allocated in
block_write_begin() if we fail with ENOSPC for later blocks.
block_write_begin() internally does this if it allocated pages locally.
This makes sure we don't have blocks outside inode.i_size during ENOSPC.
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

ae4d5372

09 9月, 2008 2 次提交

ext4: Retry block allocation if we have free blocks left · df22291f

由 Aneesh Kumar K.V 提交于 9月 08, 2008

When we truncate files, the meta-data blocks released are not reused
untill we commit the truncate transaction.  That means delayed get_block
request will return ENOSPC even if we have free blocks left.  Force a
journal commit and retry block allocation if we get ENOSPC with free
blocks left.
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: NMingming Cao <cmm@us.ibm.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

df22291f

ext4: Don't add the inode to journal handle until after the block is allocated · 166348dd

由 Aneesh Kumar K.V 提交于 9月 08, 2008

    
Make sure we don't add the inode to the journal handle until after the
block allocation, so that a journal commit will not include the inode in
case of block allocation failure.
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: NMingming Cao <cmm@us.ibm.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

166348dd

09 10月, 2008 1 次提交

ext4: Switch to non delalloc mode when we are low on free blocks count. · 79f0be8d

由 Aneesh Kumar K.V 提交于 10月 08, 2008

The delayed allocation code allocates blocks during writepages(), which
can not handle block allocation failures.  To deal with this, we switch
away from delayed allocation mode when we are running low on free
blocks.  This also allows us to avoid needing to reserve a large number
of meta-data blocks in case all of the requested blocks are
discontiguous.
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: NMingming Cao <cmm@us.ibm.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

79f0be8d

10 10月, 2008 1 次提交

ext4: Add percpu dirty block accounting. · 6bc6e63f

由 Aneesh Kumar K.V 提交于 10月 10, 2008

This patch adds dirty block accounting using percpu_counters.  Delayed
allocation block reservation is now done by updating dirty block
counter.  In a later patch we switch to non delalloc mode if the
filesystem free blocks is greater than 150% of total filesystem dirty
blocks
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao<cmm@us.ibm.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

6bc6e63f

09 9月, 2008 1 次提交

ext4: Retry block reservation · 030ba6bc

由 Aneesh Kumar K.V 提交于 9月 08, 2008

During block reservation if we don't have enough blocks left, retry
block reservation with smaller block counts. This makes sure we try
fallocate and DIO with smaller request size and don't fail early. The
delayed allocation reservation cannot try with smaller block count. So
retry block reservation to handle temporary disk full conditions. Also
print free blocks details if we fail block allocation during writepages.
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: NMingming Cao <cmm@us.ibm.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

030ba6bc

09 10月, 2008 1 次提交

ext4: Make sure all the block allocation paths reserve blocks · a30d542a

由 Aneesh Kumar K.V 提交于 10月 09, 2008

With delayed allocation we need to make sure block are reserved before
we attempt to allocate them. Otherwise we get block allocation failure
(ENOSPC) during writepages which cannot be handled. This would mean
silent data loss (We do a printk stating data will be lost). This patch
updates the DIO and fallocate code path to do block reservation before
block allocation. This is needed to make sure parallel DIO and fallocate
request doesn't take block out of delayed reserve space.

When free blocks count go below a threshold we switch to a slow patch
which looks at other CPU's accumulated percpu counter values.
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

a30d542a

20 8月, 2008 1 次提交

ext4: invalidate pages if delalloc block allocation fails. · c4a0c46e

由 Aneesh Kumar K.V 提交于 8月 19, 2008

We are a bit agressive in invalidating all the pages. But
it is ok because we really don't know why the block allocation
failed and it is better to come of the writeback path
so that user can look for more info.
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

c4a0c46e

09 9月, 2008 1 次提交
- T
  ext4: Fix whitespace checkpatch warnings/errors · af5bc92d
  由 Theodore Ts'o 提交于 9月 08, 2008
```
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
```
  af5bc92d
03 8月, 2008 1 次提交

ext4: remove write-only variables from ext4_ordered_write_end · 7d55992d

由 Eric Sandeen 提交于 8月 02, 2008

The variables 'from' and 'to' are not used anywhere.
Signed-off-by: NEric Sandeen <sandeen@redhat.com>
Acked-by: NMingming Cao <cmm@us.ibm.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

7d55992d

29 7月, 2008 1 次提交

vfs: pagecache usage optimization for pagesize!=blocksize · 8ab22b9a

由 Hisashi Hifumi 提交于 7月 28, 2008

When we read some part of a file through pagecache, if there is a
pagecache of corresponding index but this page is not uptodate, read IO
is issued and this page will be uptodate.

I think this is good for pagesize == blocksize environment but there is
room for improvement on pagesize != blocksize environment.  Because in
this case a page can have multiple buffers and even if a page is not
uptodate, some buffers can be uptodate.

So I suggest that when all buffers which correspond to a part of a file
that we want to read are uptodate, use this pagecache and copy data from
this pagecache to user buffer even if a page is not uptodate.  This can
reduce read IO and improve system throughput.

I wrote a benchmark program and got result number with this program.

This benchmark do:

  1: mount and open a test file.

  2: create a 512MB file.

  3: close a file and umount.

  4: mount and again open a test file.

  5: pwrite randomly 300000 times on a test file.  offset is aligned
     by IO size(1024bytes).

  6: measure time of preading randomly 100000 times on a test file.

The result was:
	2.6.26
        330 sec

	2.6.26-patched
        226 sec

Arch:i386
Filesystem:ext3
Blocksize:1024 bytes
Memory: 1GB

On ext3/4, a file is written through buffer/block.  So random read/write
mixed workloads or random read after random write workloads are optimized
with this patch under pagesize != blocksize environment.  This test result
showed this.

The benchmark program is as follows:

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>

#define LEN 1024
#define LOOP 1024*512 /* 512MB */

main(void)
{
	unsigned long i, offset, filesize;
	int fd;
	char buf[LEN];
	time_t t1, t2;

	if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
		perror("cannot mount\n");
		exit(1);
	}
	memset(buf, 0, LEN);
	fd = open("/root/test1/testfile", O_CREAT|O_RDWR|O_TRUNC);
	if (fd < 0) {
		perror("cannot open file\n");
		exit(1);
	}
	for (i = 0; i < LOOP; i++)
		write(fd, buf, LEN);
	close(fd);
	if (umount("/root/test1/") < 0) {
		perror("cannot umount\n");
		exit(1);
	}
	if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
		perror("cannot mount\n");
		exit(1);
	}
	fd = open("/root/test1/testfile", O_RDWR);
	if (fd < 0) {
		perror("cannot open file\n");
		exit(1);
	}

	filesize = LEN * LOOP;
	for (i = 0; i < 300000; i++){
		offset = (random() % filesize) & (~(LEN - 1));
		pwrite(fd, buf, LEN, offset);
	}
	printf("start test\n");
	time(&t1);
	for (i = 0; i < 100000; i++){
		offset = (random() % filesize) & (~(LEN - 1));
		pread(fd, buf, LEN, offset);
	}
	time(&t2);
	printf("%ld sec\n", t2-t1);
	close(fd);
	if (umount("/root/test1/") < 0) {
		perror("cannot umount\n");
		exit(1);
	}
}
Signed-off-by: NHisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jan Kara <jack@ucw.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8ab22b9a

19 8月, 2008 1 次提交

ext4: Fix small file fragmentation · 5e745b04

由 Aneesh Kumar K.V 提交于 8月 18, 2008

For small file block allocations, mballoc uses per cpu prealloc
space.  Use goal block when searching for the right prealloc
space.  Also make sure ext4_da_writepages tries to write
all the pages for small files in single attempt
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

5e745b04

20 8月, 2008 2 次提交

ext4: journal credit fix for the delayed allocation's writepages() function · 525f4ed8

由 Mingming Cao 提交于 8月 19, 2008

Previous delalloc writepages implementation started a new transaction
outside of a loop which called get_block() to do the block allocation.
Since we didn't know exactly how many blocks would need to be allocated,
the estimated journal credits required was very conservative and caused
many issues.

With the reworked delayed allocation, a new transaction is created for
each get_block(), thus we don't need to guess how many credits for the
multiple chunk of allocation.  We start every transaction with enough
credits for inserting a single exent.  When estimate the credits for
indirect blocks to allocate a chunk of blocks, we need to know the
number of data blocks to allocate.  We use the total number of reserved
delalloc datablocks; if that is too big, for non-extent files, we need
to limit the number of blocks to EXT4_MAX_TRANS_BLOCKS.

Code cleanup from Aneesh.
Signed-off-by: NMingming Cao <cmm@us.ibm.com>
Reviewed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

525f4ed8

ext4: Rework the ext4_da_writepages() function · a1d6cc56

由 Aneesh Kumar K.V 提交于 8月 19, 2008

With the below changes we reserve credit needed to insert only one
extent resulting from a call to single get_block. This makes sure we
don't take too much journal credits during writeout. We also don't
limit the pages to write. That means we loop through the dirty pages
building largest possible contiguous block request. Then we issue a
single get_block request. We may get less block that we requested. If
so we would end up not mapping some of the buffer_heads. That means
those buffer_heads are still marked delay. Later in the writepage
callback via __mpage_writepage we redirty those pages.

We should also not limit/throttle wbc->nr_to_write in the filesystem
writepages callback. That cause wrong behaviour in
generic_sync_sb_inodes caused by wbc->nr_to_write being <= 0
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: NMingming Cao <cmm@us.ibm.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

a1d6cc56