提交 · 7a4566b0b8fa67c7cd7be9f2969a085920356abb · openanolis / cloud-kernel

28 10月, 2010 40 次提交

fs/9p: Add xattr callbacks for POSIX ACL · 7a4566b0

由 Aneesh Kumar K.V 提交于 9月 28, 2010

This patch implement fetching POSIX ACL from the server
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: NVenkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: NEric Van Hensbergen <ericvh@gmail.com>

7a4566b0

fs/9p: Implement POSIX ACL permission checking function · 85ff872d

由 Aneesh Kumar K.V 提交于 9月 28, 2010

The ACL value is fetched as a part of inode initialization
from the server and the permission checking function use the
cached value of the ACL
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: NVenkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: NEric Van Hensbergen <ericvh@gmail.com>

85ff872d

fs/9p: Remove the redundant rsize calculation in v9fs_file_write() · 8d40fa24

由 jvrao 提交于 8月 30, 2010

the same calculation is done in p9_client_write
Signed-off-by: NVenkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: NBadari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: NEric Van Hensbergen <ericvh@gmail.com>

8d40fa24

9p: Add a Direct IO support for non-cached operations. · 3e24ad2f

由 jvrao 提交于 8月 24, 2010

The presence of v9fs_direct_IO() in the address space ops vector
allowes open() O_DIRECT flags which would have failed otherwise.

In the non-cached mode, we shunt off direct read and write requests before
the VFS gets them, so this method should never be called.

Direct IO is not 'yet' supported in the cached mode. Hence when
this routine is called through generic_file_aio_read(), the read/write fails
with an error.
Signed-off-by: NVenkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: NEric Van Hensbergen <ericvh@gmail.com>

3e24ad2f

fs/9p: mkdir fix for setting S_ISGID bit as per parent directory · 7c7298cf

由 Harsh Prateek Bora 提交于 8月 18, 2010

The current implementation of 9p client mkdir function does not
set the S_ISGID mode bit for the directory being created if the
parent directory has this bit set. This patch fixes this problem
so that the newly created directory inherits the gid from parent
directory and not from the process creating this directory, when
the S_ISGID bit is set in parent directory.
Signed-off-by: NHarsh Prateek Bora <harsh@linux.vnet.ibm.com>
Signed-off-by: NEric Van Hensbergen <ericvh@gmail.com>

7c7298cf

9p: Pass the correct end of buffer to p9dirent_read · 8812a3d5

由 Sripathi Kodi 提交于 8月 09, 2010

A patch was accepted recently for sending correct buffer size to p9stat_read.
We need a similar patch in v9fs_dir_readdir_dotl to send correct end of buffer
to p9dirent_read.
Signed-off-by: NSripathi Kodi <sripathik@in.ibm.com>
Signed-off-by: NEric Van Hensbergen <ericvh@gmail.com>

8812a3d5

fs/9p: setrlimit fix for 9p write · 3834b12a

由 Harsh Prateek Bora 提交于 8月 03, 2010

Current 9p client file write code does not check for RLIMIT_FSIZE resource.
This bug was found by running LTP test case for setrlimit. This bug is fixed
by calling generic_write_checks before sending the write request to the
server.
Without this patch: the write function is allowed to write above the
RLIMIT_FSIZE set by user.
With this patch: the write function checks for RLIMIT_SIZE and writes upto
the size limit.
Signed-off-by: NHarsh Prateek Bora <harsh@linux.vnet.ibm.com>
Signed-off-by: NEric Van Hensbergen <ericvh@gmail.com>

3834b12a

9p: remove unneeded checks · 57ee047b

由 Dan Carpenter 提交于 8月 04, 2010

git_t is unsigned an can never be less than zero.
Signed-off-by: NDan Carpenter <error27@gmail.com>
Signed-off-by: NEric Van Hensbergen <ericvh@gmail.com>

57ee047b

ext4: optimize orphan_list handling for ext4_setattr · 3d287de3

由 Dmitry Monakhov 提交于 10月 27, 2010

Surprisingly chown() on ext4 is not SMP scalable operation. 
Due to unconditional orphan_del(NULL, inode) in ext4_setattr()
result in significant performance overhead because of global orphan
mutex, especially in no-journal mode (where orphan_add() is noop).
It is possible to skip explicit orphan_del if possible.
Results of fchown() micro-benchmark in no-journal mode
while (1) {
   iteration++;
   fchown(fd, uid, gid);
   fchown(fd, uid + 1, gid + 1)
}
measured: iterations per millisecond
| nr_tasks | w/o patch | with patch |
|        1 |       142 |        185 |
|        4 |       109 |        642 |
Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

3d287de3

N
ext4: fix unbalanced mutex unlock in error path of ext4_li_request_new · beed5ecb
由 Nicolas Kaiser 提交于 10月 27, 2010
```
Signed-off-by: NNicolas Kaiser <nikai@nikai.net>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
```
beed5ecb

ext4: fix compile error in ext4_fallocate() · a6371b63

由 Kazuya Mio 提交于 10月 27, 2010

When I compiled 2.6.36-rc3 kernel with EXT4FS_DEBUG definition, I got
the following compile error.

  CC [M]  fs/ext4/extents.o
fs/ext4/extents.c: In function 'ext4_fallocate':
fs/ext4/extents.c:3772: error: 'block' undeclared (first use in this function)
fs/ext4/extents.c:3772: error: (Each undeclared identifier is reported only once
fs/ext4/extents.c:3772: error: for each function it appears in.)
make[2]: *** [fs/ext4/extents.o] Error 1

The patch fixes this problem.
Signed-off-by: NKazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

a6371b63

ext4: move ext4_mb_{get,put}_buddy_cache_lock and make them static · eee4adc7

由 Eric Sandeen 提交于 10月 27, 2010

These functions are only used within fs/ext4/mballoc.c, so move them
so they are used after they are defined, and then make them be static.
Signed-off-by: NEric Sandeen <sandeen@redhat.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

eee4adc7

T
ext4: rename mark_bitmap_end() to ext4_mark_bitmap_end() · 61d08673
由 Theodore Ts'o 提交于 10月 27, 2010
```
Fix a namespace leak from fs/ext4
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
```
61d08673

ext4: move flush_completed_IO to fs/ext4/fsync.c and make it static · 4a873a47

由 Theodore Ts'o 提交于 10月 27, 2010

Fix a namespace leak by moving the function to the file where it is
used and making it static.
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

4a873a47

ext4: rename {ext,idx}_pblock and inline small extent functions · bf89d16f

由 Theodore Ts'o 提交于 10月 27, 2010

Cleanup namespace leaks from fs/ext4 and the inline trivial functions
ext4_{ext,idx}_pblock() and ext4_{ext,idx}_store_pblock() since the
code size actually shrinks when we make these functions inline,
they're so trivial.
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

bf89d16f

ext4: make various ext4 functions be static · 1f109d5a

由 Theodore Ts'o 提交于 10月 27, 2010

These functions have no need to be exported beyond file context.

No functions needed to be moved for this commit; just some function
declarations changed to be static and removed from header files.

(A similar patch was submitted by Eric Sandeen, but I wanted to handle
code movement in separate patches to make sure code changes didn't
accidentally get dropped.)
Signed-off-by: NEric Sandeen <sandeen@redhat.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

1f109d5a

T
ext4: rename {exit,init}_ext4_*() to ext4_{exit,init}_*() · 5dabfc78
由 Theodore Ts'o 提交于 10月 27, 2010
```
This is a cleanup to avoid namespace leaks out of fs/ext4
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
```
5dabfc78

ext4: fix kernel oops if the journal superblock has a non-zero j_errno · 7f93cff9

由 Theodore Ts'o 提交于 10月 27, 2010

Commit 84061e07 fixed an accounting bug only to introduce the
possibility of a kernel OOPS if the journal has a non-zero j_errno
field indicating that the file system had detected a fs inconsistency.
After the journal replay, if the journal superblock indicates that the
file system has an error, this indication is transfered to the file
system and then ext4_commit_super() is called to write this to the
disk.

But since the percpu counters are now initialized after the journal
replay, the call to ext4_commit_super() will cause a kernel oops since
it needs to use the percpu counters the ext4 superblock structure.

The fix is to skip setting the ext4 free block and free inode fields
if the percpu counter has not been set.

Thanks to Ken Sumrall for reporting and analyzing the root causes of
this bug.

Addresses-Google-Bug: #3054080
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

7f93cff9

ext4: update writeback_index based on last page scanned · 72f84e65

由 Eric Sandeen 提交于 10月 27, 2010

As pointed out in a prior patch, updating the mapping's
writeback_index based on pages written isn't quite right;
what the writeback index is really supposed to reflect is
the next page which should be scanned for writeback during
periodic flush.

As in write_cache_pages(), write_cache_pages_da() does
this scanning for us as we assemble the mpd for later
writeout.  If we keep track of the next page after the
current scan, we can easily update writeback_index without
worrying about pages written vs. pages skipped, etc.

Without this, an fsync will reset writeback_index to
0 (its starting index) + however many pages it wrote, which
can mess up the progress of periodic flush.
Signed-off-by: NEric Sandeen <sandeen@redhat.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

72f84e65

ext4: implement writeback livelock avoidance using page tagging · 5b41d924

由 Eric Sandeen 提交于 10月 27, 2010

This is analogous to Jan Kara's commit,
f446daae
mm: implement writeback livelock avoidance using page tagging

but since we forked write_cache_pages, we need to reimplement
it there (and in ext4_da_writepages, since range_cyclic handling
was moved to there)

If you start a large buffered IO to a file, and then set
fsync after it, you'll find that fsync does not complete
until the other IO stops.

If you continue re-dirtying the file (say, putting dd
with conv=notrunc in a loop), when fsync finally completes
(after all IO is done), it reports via tracing that
it has written many more pages than the file contains;
in other words it has synced and re-synced pages in
the file multiple times.

This then leads to problems with our writeback_index
update, since it advances it by pages written, and
essentially sets writeback_index off the end of the
file...

With the following patch, we only sync as much as was
dirty at the time of the sync.
Signed-off-by: NEric Sandeen <sandeen@redhat.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

5b41d924

ext4: tidy up a void argument in inode.c · bbd08344

由 Eric Sandeen 提交于 10月 27, 2010

This doesn't fix anything at all, it just removes a vestige
of prior use from __mpage_da_writepage()

__mpage_da_writepage() had a *void argument leftover from
its previous life as a callback; make it reflect the actual type.

Fixing this up makes it slightly more obvious to read, and 
enables proper typechecking.
Signed-off-by: NEric Sandeen <sandeen@redhat.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

bbd08344

ext4: add batched_discard into ext4 feature list · 27ee40df

由 Lukas Czerner 提交于 10月 27, 2010

Should be applied on the top of "lazy inode table initialization"
and "batched discard support" patch-sets.
Signed-off-by: NLukas Czerner <lczerner@redhat.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

27ee40df

ext4: Add batched discard support for ext4 · 7360d173

由 Lukas Czerner 提交于 10月 27, 2010

Walk through allocation groups and trim all free extents. It can be
invoked through FITRIM ioctl on the file system. The main idea is to
provide a way to trim the whole file system if needed, since some SSD's
may suffer from performance loss after the whole device was filled (it
does not mean that fs is full!).

It search for free extents in allocation groups specified by Byte range
start -> start+len. When the free extent is within this range, blocks
are marked as used and then trimmed. Afterwards these blocks are marked
as free in per-group bitmap.

Since fstrim is a long operation it is good to have an ability to
interrupt it by a signal. This was added by Dmitry Monakhov.
Thanks Dimitry.
Signed-off-by: NLukas Czerner <lczerner@redhat.com>
Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: NJan Kara <jack@suse.cz>
Reviewed-by: NDmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

7360d173

fs: Add FITRIM ioctl · 367a51a3

由 Lukas Czerner 提交于 10月 27, 2010

Adds an filesystem independent ioctl to allow implementation of file
system batched discard support. I takes fstrim_range structure as an
argument. fstrim_range is definec in the include/fs.h and its
definition is as follows.

struct fstrim_range {
	start;
	len;
	minlen;
}

start	- first Byte to trim
len	- number of Bytes to trim from start
minlen	- minimum extent length to trim, free extents shorter than this
	  number of Bytes will be ignored. This will be rounded up to fs
	  block size.

It is also possible to specify NULL as an argument. In this case the
arguments will set itself as follows:

start = 0;
len = ULLONG_MAX;
minlen = 0;

So it will trim the whole file system at one run.

After the FITRIM is done, the number of actually discarded Bytes is stored
in fstrim_range.len to give the user better insight on how much storage
space has been really released for wear-leveling.
Signed-off-by: NLukas Czerner <lczerner@redhat.com>
Reviewed-by: NDmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

367a51a3

ext4: Use return value from sb_issue_discard() · 77ca6cdf

由 Lukas Czerner 提交于 10月 27, 2010

Use return value from sb_issue_discard() as return value in
ext4_issue_discard(). Since sb_issue_discard() may result in more
serious errors than just -EOPNOTSUPP it is worth to inform user of this
function about them to handle error cases properly.
Signed-off-by: NLukas Czerner <lczerner@redhat.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

77ca6cdf

ext4: Check return value of sb_getblk() and friends · 87783690

由 Namhyung Kim 提交于 10月 27, 2010

Fail block allocation if sb_getblk() returns NULL. In that case,
sb_find_get_block() also likely to fail so that it should skip
calling ext4_forget().
Signed-off-by: NNamhyung Kim <namhyung@gmail.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

87783690

ext4: use bio layer instead of buffer layer in mpage_da_submit_io · bd2d0210

由 Theodore Ts'o 提交于 10月 27, 2010

Call the block I/O layer directly instad of going through the buffer
layer. This should give us much better performance and scalability,
as well as lowering our CPU utilization when doing buffered writeback.
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

bd2d0210

ext4: move mpage_put_bnr_to_bhs()'s functionality to mpage_da_submit_io() · 1de3e3df

由 Theodore Ts'o 提交于 10月 27, 2010

This massively simplifies the ext4_da_writepages() code path by
completely removing mpage_put_bnr_bhs(), which is almost 100 lines of
code iterating over a set of pages using pagevec_lookup(), and folds
that functionality into mpage_da_submit_io()'s existing
pagevec_lookup() loop.
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

1de3e3df

ext4: inline walk_page_buffers() into mpage_da_submit_io · 3ecdb3a1

由 Theodore Ts'o 提交于 10月 27, 2010

Expand the call:

  if (walk_page_buffers(NULL, page_bufs, 0, len, NULL,
                        ext4_bh_delay_or_unwritten))
	goto redirty_page

into mpage_da_submit_io().

This will allow us to merge in mpage_put_bnr_to_bhs() in the next
patch.
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

3ecdb3a1

ext4: inline ext4_writepage() into mpage_da_submit_io() · cb20d518

由 Theodore Ts'o 提交于 10月 27, 2010

As a prepratory step to switching to bio_submit, inline
ext4_writepage() into mpage_da_submit() and then simplify things a
bit.  This makes it clearer what mpage_da_submit needs to do.

Also, move the ClearPageChecked(page) call into
__ext4_journalled_writepage(), as a minor bit of cleanup refactoring.

This also allows us to pull i_size_read() and
ext4_should_journal_data() out of the loop, which should be a very
minor CPU savings.
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

cb20d518

ext4: simplify ext4_writepage() · a42afc5f

由 Theodore Ts'o 提交于 10月 27, 2010

The actual code in ext4_writepage() is unnecessarily convoluted.
Simplify it so it is easier to understand, but otherwise logically
equivalent.
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

a42afc5f

ext4: call mpage_da_submit_io() from mpage_da_map_blocks() · 5a87b7a5

由 Theodore Ts'o 提交于 10月 27, 2010

Eventually we need to completely reorganize the ext4 writepage
callpath, but for now, we simplify things a little by calling
mpage_da_submit_io() from mpage_da_map_blocks(), since all of the
places where we call mpage_da_map_blocks() it is followed up by a call
to mpage_da_submit_io().

We're also a wee bit better with respect to error handling, but there
are still a number of issues where it's not clear what the right thing
is to do with ext4 functions deep in the writeback codepath fails.
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

5a87b7a5

ext4: use KMEM_CACHE instead of kmem_cache_create · 16828088

由 Theodore Ts'o 提交于 10月 27, 2010

Also remove the SLAB_RECLAIM_ACCOUNT flag from the system zone kmem
cache.  This slab tends to be fairly static, so it shouldn't be marked
as likely to have free pages that can be reclaimed.
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

16828088

ext4: use search_dirblock() in ext4_dx_find_entry() · 7845c049

由 Theodore Ts'o 提交于 10月 27, 2010

Use the search_dirblock() in ext4_dx_find_entry().  It makes the code
easier to read, and it takes advantage of common code.  It also saves
100 bytes or so of text space.
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
Cc: Brad Spengler <spender@grsecurity.net>

7845c049

ext4: avoid uninitialized memory references in ext3_htree_next_block() · 8941ec8b

由 Theodore Ts'o 提交于 10月 27, 2010

If the first block of htree directory is missing '.' or '..' but is
otherwise a valid directory, and we do a lookup for '.' or '..', it's
possible to dereference an uninitialized memory pointer in
ext4_htree_next_block().

We avoid this by moving the special case from ext4_dx_find_entry() to
ext4_find_entry(); this also means we can optimize ext4_find_entry()
slightly when NFS looks up "..".

Thanks to Brad Spengler for pointing a Clang warning that led me to
look more closely at this code.  The warning was harmless, but it was
useful in pointing out code that was too ugly to live.  This warning was
also reported by Roman Borisov.
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
Cc: Brad Spengler <spender@grsecurity.net>

8941ec8b

ext4: remove unused ext4_sb_info members · 640e9396

由 Eric Sandeen 提交于 10月 27, 2010

Not that these take up a lot of room, but the structure is long enough
as it is, and there's no need to confuse people with these various
undocumented & unused structure members...
Signed-off-by: NEric Sandeen <sandeen@redaht.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

640e9396

ext4: queue conversion after adding to inode's completed IO list · c999af2b

由 Eric Sandeen 提交于 10月 27, 2010

By queuing the io end on the unwritten workqueue before adding it
to our inode's list of completed IOs, I think we run the risk
of the work getting completed, and the IO freed, before we try
to add it to the inode's i_completed_io_list.

It should be safe to add it to the inode's list of completed
IOs, and -then- queue it for completion, I think.

Thanks to Dave Chinner for pointing out the race.
Signed-off-by: NEric Sandeen <sandeen@redhat.com>
Reviewed-by: NJiaying Zhang <jiayingz@google.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

c999af2b

ext4: don't use ext4_allocation_contexts for tracing · 3e1e5f50

由 Eric Sandeen 提交于 10月 27, 2010

Many tracepoints were populating an ext4_allocation_context
to pass in, but this requires a slab allocation even when
tracepoints are off.  In fact, 4 of 5 of these allocations
were only for tracing.  In addition, we were only using a
small fraction of the 144 bytes of this structure for this
purpose.

We can do away with all these alloc/frees of the ac and
simply pass in the bits we care about, instead.

I tested this by turning on tracing and running through
xfstests on x86_64.  I did not actually do anything with
the trace output, however.
Signed-off-by: NEric Sandeen <sandeen@redhat.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

3e1e5f50

ext4: fix potential infinite loop in ext4_da_writepages() · 0c9169cc

由 Toshiyuki Okajima 提交于 10月 27, 2010

On linux-2.6.36-rc2, if we execute the following script, we can hang
the system when the /bin/sync command is executed:

========================================================================
#!/bin/sh

echo -n "HANG UP TEST: "
/bin/dd if=/dev/zero of=/tmp/img bs=1k count=1 seek=1M 2> /dev/null
/sbin/mkfs.ext4 -Fq /tmp/img
/bin/mount -o loop -t ext4 /tmp/img /mnt
/bin/dd if=/dev/zero of=/mnt/file bs=1 count=1 \
seek=$((16*1024*1024*1024*1024-4096)) 2> /dev/null
/bin/sync
/bin/umount /mnt
echo "DONE"
exit 0
========================================================================

We can see the following backtrace if we get the kdump when this
hangup occurs:

======================================================================
kthread()
=> bdi_writeback_thread()
   => wb_do_writeback()
      => wb_writeback()
         => writeback_inodes_wb()
            => writeback_sb_inodes()
               => writeback_single_inode()
                  => ext4_da_writepages()  ---+ 
                                ^ infinite    |
                                |   loop      |
                                +-------------+
======================================================================

The reason why this hangup happens is described as follows:
1) We write the last extent block of the file whose size is the filesystem 
   maximum size.
2) "BH_Delay" flag is set on the buffer_head of its block.
3) - the member, "m_lblk" of struct mpage_da_data is 4294967295 (UINT_MAX)
   - the member, "m_len" of struct mpage_da_data is 1
  mpage_put_bnr_to_bhs() which is called via ext4_da_writepages()
  cannot clear "BH_Delay" flag of the buffer_head because the type of
  m_lblk is ext4_lblk_t and then m_lblk + m_len is overflow.

  Therefore an infinite loop occurs because ext4_da_writepages()
  cannot write the page (which corresponds to the block) since
  "BH_Delay" flag isn't cleared.
----------------------------------------------------------------------
static void mpage_put_bnr_to_bhs(struct mpage_da_data *mpd,
				struct ext4_map_blocks *map)
{
...
	int blocks = map->m_len;
...
		do {
			// cur_logical = 4294967295
			// map->m_lblk = 4294967295
			// blocks = 1
			// *** map->m_lblk + blocks == 0 (OVERFLOW!) ***
			// (cur_logical >= map->m_lblk + blocks) => true
			if (cur_logical >= map->m_lblk + blocks)
				break;
----------------------------------------------------------------------

NOTE: Mounting with the nodelalloc option will avoid this codepath,
and thus, avoid this hang
Signed-off-by: NToshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>

0c9169cc

ext4: improve llseek error handling for overly large seek offsets · e0d10bfa

由 Toshiyuki Okajima 提交于 10月 27, 2010

The llseek system call should return EINVAL if passed a seek offset
which results in a write error.  What this maximum offset should be
depends on whether or not the huge_file file system feature is set,
and whether or not the file is extent based or not.


If the file has no "EXT4_EXTENTS_FL" flag, the maximum size which can be 
written (write systemcall) is different from the maximum size which can be 
sought (lseek systemcall).

For example, the following 2 cases demonstrates the differences
between the maximum size which can be written, versus the seek offset
allowed by the llseek system call:

#1: mkfs.ext3 <dev>; mount -t ext4 <dev>
#2: mkfs.ext3 <dev>; tune2fs -Oextent,huge_file <dev>; mount -t ext4 <dev>

Table. the max file size which we can write or seek
       at each filesystem feature tuning and file flag setting
+============+===============================+===============================+
| \ File flag|                               |                               |
|      \     |     !EXT4_EXTENTS_FL          |        EXT4_EXTETNS_FL        |
|case       \|                               |                               |
+------------+-------------------------------+-------------------------------+
| #1         |   write:      2194719883264   | write:       --------------   |
|            |   seek:       2199023251456   | seek:        --------------   |
+------------+-------------------------------+-------------------------------+
| #2         |   write:      4402345721856   | write:       17592186044415   |
|            |   seek:      17592186044415   | seek:        17592186044415   |
+------------+-------------------------------+-------------------------------+

The differences exist because ext4 has 2 maxbytes which are sb->s_maxbytes
(= extent-mapped maxbytes) and EXT4_SB(sb)->s_bitmap_maxbytes (= block-mapped 
maxbytes).  Although generic_file_llseek uses only extent-mapped maxbytes.
(llseek of ext4_file_operations is generic_file_llseek which uses
sb->s_maxbytes.)

Therefore we create ext4 llseek function which uses 2 maxbytes.

The new own function originates from generic_file_llseek().
If the file flag, "EXT4_EXTENTS_FL" is not set, the function alters 
inode->i_sb->s_maxbytes into EXT4_SB(inode->i_sb)->s_bitmap_maxbytes.
Signed-off-by: NToshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>

e0d10bfa

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功