Commits · 695fd1ed3bcaae9fc032cbe47f0fe9a934bf1717 · openeuler / Kernel

27 Feb, 2014 5 commits

f2fs: use existing macro to clean up some codes · 695fd1ed

Committed by Chao Yu on Feb 27, 2014

This patch uses the existing macros F2FS_INODE and NEXT_FREE_BLKADDR to clean
up some code.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

f2fs: readahead contiguous SSA blocks for f2fs_gc · 81c1a0f1

Committed by Chao Yu on Feb 27, 2014

If there are multiple segments in one section, f2fs_gc reads their SSA blocks,
which have contiguous addresses, one by one. This may cost performance, so let's
read ahead the SSA blocks by merging the multiple read requests.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
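
The merging idea in the readahead commit above can be illustrated with a small Python sketch. This is a toy model, not the kernel code: the function name merge_contiguous and the sample block addresses are made up for illustration. It only shows how a set of block addresses collapses into a few contiguous runs, each of which could be served by a single read request.

```python
def merge_contiguous(block_addrs):
    """Group block addresses into (start, count) runs of consecutive addresses."""
    runs = []
    for addr in sorted(block_addrs):
        if runs and addr == runs[-1][0] + runs[-1][1]:
            # addr directly follows the last run, so extend that run by one block
            start, count = runs[-1]
            runs[-1] = (start, count + 1)
        else:
            # gap in the addresses: start a new run
            runs.append((addr, 1))
    return runs


# Four contiguous blocks plus one unrelated block (hypothetical addresses):
# five single-block reads collapse into two requests.
print(merge_contiguous([100, 101, 102, 103, 200]))
```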

f2fs: add an sysfs entry to control the directory level · ab9fa662

Committed by Jaegeuk Kim on Feb 27, 2014

This patch adds a sysfs entry to control the dir_level used for large directories.

The description of this entry is:

 dir_level                    This parameter controls the directory level to
			      support large directory. If a directory has a
			      number of files, it can reduce the file lookup
			      latency by increasing this dir_level value.
			      Otherwise, it needs to decrease this value to
			      reduce the space overhead. The default value is 0.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

f2fs: introduce large directory support · 38431545

Committed by Jaegeuk Kim on Feb 27, 2014

This patch introduces an i_dir_level field to support large directories.

Previously, f2fs maintained multi-level hash tables to find a dentry quickly
from a bunch of child dentries in a directory, and the hash tables form the
tree structure shown below.

In Documentation/filesystems/f2fs.txt,

----------------------
A : bucket
B : block
N : MAX_DIR_HASH_DEPTH
----------------------

level #0   | A(2B)
           |
level #1   | A(2B) - A(2B)
           |
level #2   | A(2B) - A(2B) - A(2B) - A(2B)
     .     |   .       .       .       .
level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B)
     .     |   .       .       .       .
level #N   | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B)

But, if we can guess that a directory will handle a number of child files,
we don't need to traverse the tree from level #0 to #N all the time.
Since the lower-level tables contain a relatively small number of dentries,
the miss ratio of the target dentry is likely to be high.

In order to avoid that, we can configure the hash tables sparsely from level #0
like this.

level #0   | A(2B) - A(2B) - A(2B) - A(2B)

level #1   | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B)
     .     |   .       .       .       .
level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B)
     .     |   .       .       .       .
level #N   | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B)

With this structure, we can skip the ineffective tree searches in lower level
hash tables.

This patch adds just a facility for this by introducing i_dir_level in
f2fs_inode.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
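
The effect of dir_level on the two diagrams above can be sketched in a few lines of Python. This is an illustration under assumptions, not the kernel implementation: the value 63 for MAX_DIR_HASH_DEPTH and the cap MAX_DIR_BUCKETS are assumed here, and dir_buckets is a name chosen for the sketch. The point is only that dir_level shifts the tree so that level #0 already starts with more buckets.

```python
MAX_DIR_HASH_DEPTH = 63  # assumed value of N, for illustration only
MAX_DIR_BUCKETS = 1 << (MAX_DIR_HASH_DEPTH // 2 - 1)  # assumed cap on buckets per level


def dir_buckets(level, dir_level=0):
    """Number of buckets at a hash level; dir_level widens the lowest levels."""
    if level + dir_level < MAX_DIR_HASH_DEPTH // 2:
        # bucket count doubles with each level until level #N/2
        return 1 << (level + dir_level)
    return MAX_DIR_BUCKETS


# dir_level = 0 gives the first diagram (1, 2, 4, ... buckets);
# dir_level = 2 gives the second one, where level #0 already has 4 buckets.
print([dir_buckets(level, 0) for level in range(4)])
print([dir_buckets(level, 2) for level in range(4)])
```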

f2fs: remove costly bit operations for f2fs_find_entry · 5d0c6671

Committed by Jaegeuk Kim on Feb 27, 2014

It turns out that a bit operation like find_next_bit is not always fast enough
for f2fs_find_entry.
Instead, it is simpler and faster to traverse the dentries one by one.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

24 Feb, 2014 6 commits

f2fs: implement a lock-free stat_show · 8b8343fa

Committed by Jaegeuk Kim on Feb 24, 2014

The stat_show just shows the current status of f2fs.
So, we can remove all the locks in it.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

f2fs: introduce a radix_tree for the free_nid list · 8a7ed66a

Committed by Jaegeuk Kim on Feb 21, 2014

This patch introduces a radix tree for the list of free_nids, which improves
the performance of free nid management.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
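
Why pairing the free nid list with a keyed index helps can be shown with a rough Python sketch. Python has no radix tree in its standard library, so an OrderedDict stands in here for the "list plus radix tree" pair: it keeps insertion order like the list and finds a nid by key without walking it. The class FreeNidList and its method names are hypothetical and do not come from the patch.

```python
from collections import OrderedDict


class FreeNidList:
    """Free nids kept in insertion order, with keyed lookup by nid."""

    def __init__(self):
        # Stand-in for the list + radix tree pair described in the commit.
        self._nids = OrderedDict()

    def add(self, nid):
        """Add a free nid; returns False if it is already tracked."""
        if nid in self._nids:  # keyed lookup, no list walk
            return False
        self._nids[nid] = True
        return True

    def remove(self, nid):
        """Drop a specific nid; returns False if it was not tracked."""
        return self._nids.pop(nid, None) is not None

    def alloc(self):
        """Hand out the oldest free nid (raises KeyError when empty)."""
        nid, _ = self._nids.popitem(last=False)
        return nid
```

With a plain list, both the duplicate check in add() and the lookup in remove() would have to scan every entry.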

f2fs: introduce help macro on_build_free_nids() · f978f5a0

Committed by Gu Zheng on Feb 21, 2014

Introduce the helper macro on_build_free_nids(), which just uses build_lock
to judge whether building free nids is in progress, so that we can remove
the on_build_free_nids field from f2fs_sb_info.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
[Jaegeuk Kim: remove an unnecessary white line removal]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
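
The idea of deriving the "build in progress" state from the lock itself, instead of keeping a separate flag field, can be sketched in Python with threading.Lock. This is an analogy only: the kernel uses a mutex and a C macro, and build_free_nids here is a hypothetical wrapper that takes a callable.

```python
import threading

build_lock = threading.Lock()


def on_build_free_nids():
    """True while some thread holds build_lock, i.e. a build is in progress."""
    return build_lock.locked()


def build_free_nids(scan):
    """Run the (hypothetical) scan callable while holding build_lock."""
    with build_lock:
        return scan()


# While the build runs, the lock state alone answers the question,
# so no separate flag needs to be set and cleared.
print(on_build_free_nids())
print(build_free_nids(on_build_free_nids))
```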

f2fs: fix to mark the checkpointed nat entry correctly · fffc2a00

Committed by Jaegeuk Kim on Feb 21, 2014

The nat cache entry maintains a status of whether it is checkpointed or not.
So, if a new cache entry is loaded from the last checkpoint,
nat_entry->checkpointed should be true.
If the cache entry is modified and becomes dirty, nat_entry->checkpointed should
be false.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

f2fs: fix to do build_stat prior to the recovery procedure · 6437d1b0

Committed by Jaegeuk Kim on Feb 19, 2014

At the end of the recovery procedure, write_checkpoint is called and updates
the cp count which is managed by f2fs stat.
But, previously, build_stat() was called after the recovery procedure, which
resulted in:

BUG: unable to handle kernel NULL pointer dereference at 000000000000012c
IP: [<ffffffffa03b1030>] write_checkpoint+0x720/0xbc0 [f2fs]
Call Trace:
 [<ffffffff810a6b44>] ? mark_held_locks+0x74/0x140
 [<ffffffff8109a3e0>] ? __init_waitqueue_head+0x60/0x60
 [<ffffffffa03bf036>] recover_fsync_data+0x656/0xf20 [f2fs]
 [<ffffffff812ee3eb>] ? security_d_instantiate+0x1b/0x30
 [<ffffffffa03aeb4d>] f2fs_fill_super+0x94d/0xa00 [f2fs]
 [<ffffffff811a9825>] mount_bdev+0x1a5/0x1f0
 [<ffffffff8114915e>] ? __get_free_pages+0xe/0x40
 [<ffffffffa03ae200>] ? f2fs_remount+0x130/0x130 [f2fs]
 [<ffffffffa03aa575>] f2fs_mount+0x15/0x20 [f2fs]
 [<ffffffff811aa713>] mount_fs+0x43/0x1b0
 [<ffffffff811c7124>] vfs_kern_mount+0x74/0x160
 [<ffffffff811c5cb1>] ? __get_fs_type+0x51/0x60
 [<ffffffff811c9727>] do_mount+0x237/0xb50
 [<ffffffff811c936a>] ? copy_mount_options+0x3a/0x170

So, this patch changes the order of recover_fsync_data() and
f2fs_build_stats().
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

f2fs: fix not to write data pages on the page reclaiming path · 8618b881

Committed by Jaegeuk Kim on Feb 17, 2014

Even if f2fs_write_data_page is called by the page reclaiming path, we should
not write the page, in order to keep enough free segments for the worst-case
scenario.
Otherwise, f2fs can run out of free segments while gc is conducted, resulting
in:

 ------------[ cut here ]------------
 kernel BUG at /home/zeus/f2fs_test/src/fs/f2fs/segment.c:565!
 RIP: 0010:[<ffffffffa02c3b11>]  [<ffffffffa02c3b11>] new_curseg+0x331/0x340 [f2fs]
 Call Trace:
  allocate_segment_by_default+0x204/0x280 [f2fs]
  allocate_data_block+0x108/0x210 [f2fs]
  write_data_page+0x8a/0xc0 [f2fs]
  do_write_data_page+0xe1/0x2a0 [f2fs]
  move_data_page+0x8a/0xf0 [f2fs]
  f2fs_gc+0x446/0x970 [f2fs]
  f2fs_balance_fs+0xb6/0xd0 [f2fs]
  f2fs_write_begin+0x50/0x350 [f2fs]
  ? unlock_page+0x27/0x30
  ? unlock_page+0x27/0x30
  generic_file_buffered_write+0x10a/0x280
  ? file_update_time+0xa3/0xf0
  __generic_file_aio_write+0x1c8/0x3d0
  ? generic_file_aio_write+0x52/0xb0
  ? generic_file_aio_write+0x52/0xb0
  generic_file_aio_write+0x65/0xb0
  do_sync_write+0x5a/0x90
  vfs_write+0xc5/0x1f0
  SyS_write+0x55/0xa0
  system_call_fastpath+0x16/0x1b
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

17 Feb, 2014 14 commits

f2fs: fix the calculation of max_nids · b63da15e

Committed by Jaegeuk Kim on Feb 17, 2014

The total number of nids that f2fs can use should not include 0, the nid of the
node inode, and the nid of the meta inode.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
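
The arithmetic behind the max_nids fix above is simple enough to write down. The sketch below is an illustration under assumptions: the nid values 1 and 2 for the node and meta inodes, the figure of 455 NAT entries per block, and the function name usable_nids are all assumed for the example and are not stated in the commit message.

```python
NODE_INO = 1  # assumed nid of the node inode
META_INO = 2  # assumed nid of the meta inode


def usable_nids(nat_blocks, entries_per_block):
    """Nids available for use: every NAT slot except 0, node inode, meta inode."""
    total = nat_blocks * entries_per_block
    reserved = {0, NODE_INO, META_INO}
    return total - len(reserved)


# Two NAT blocks with an assumed 455 entries each: 910 slots, 907 usable nids.
print(usable_nids(2, 455))
```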

f2fs: show counts of checkpoint in status · 942e0be6

Committed by Changman Lee on Feb 13, 2014

This patch shows the checkpoint count in the f2fs status.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

f2fs: introduce ra_meta_pages to readahead CP/NAT/SIT pages · 662befda

Committed by Chao Yu on Feb 07, 2014

This patch helps us clean up the readahead code by merging the
ra_{sit,nat}_pages functions into ra_meta_pages.
Additionally, the new function is used to read ahead cp blocks in
recover_orphan_inodes.

Change log from v1:
 o fix an infinite loop bug pointed out by Jaegeuk Kim.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

f2fs: use inode mutex to keep atomicity of f2fs_falloc · 3375f696

Committed by Chao Yu on Jan 28, 2014

Previously, without the protection of the inode mutex, f2fs_falloc and other
data-related operations could interfere with each other.
So let's use the inode mutex to keep f2fs_falloc atomic.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

f2fs: clean up redundant function call · 1fe54f9d

Committed by Jaegeuk Kim on Feb 07, 2014

This patch integrates inode_[inc|dec]_dirty_dents with inc_page_count to remove
redundant calls.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

f2fs: fix f2fs_write_meta_page at no checkpoint status · 203681f6

Committed by Jaegeuk Kim on Feb 05, 2014

If f2fs has entered an erroneous checkpoint status, it should skip writing meta
pages instead of redirtying them.
Otherwise, it cannot unmount the partition even though f2fs is in read-only
status.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

f2fs: fix to truncate dentry pages in the error case · bd859c65

Committed by Jaegeuk Kim on Feb 05, 2014

When a new directory is allocated, if an error occurs, we should truncate
preallocated dentry pages too.

This bug was reported by Andrey Tsyvarev after a while as follows.

mkdir()->
 f2fs_add_link()->
  init_inode_metadata()->
    f2fs_init_acl()->
      f2fs_get_acl()->
        f2fs_getxattr()->
          read_all_xattrs() fails.

Also there was a BUG_ON triggered after the fault in
mkdir()->
 f2fs_add_link()->
   init_inode_metadata()->
    remove_inode_page() ->
      f2fs_bug_on(inode->i_blocks != 0 && inode->i_blocks != 1);

But, previous patch wasn't perfect to resolve that bug, so the following bug
report was also submitted.

kernel BUG at fs/f2fs/inode.c:274!
Call Trace:
 [<ffffffff811fde03>] evict+0xa3/0x1a0
 [<ffffffff811fe615>] iput+0xf5/0x180
 [<ffffffffa01c7f63>] f2fs_mkdir+0xf3/0x150 [f2fs]
 [<ffffffff811f2a77>] vfs_mkdir+0xb7/0x160
 [<ffffffff811f36bf>] SyS_mkdir+0x5f/0xc0
 [<ffffffff81680769>] system_call_fastpath+0x16/0x1b

Finally, this patch resolves all the issues as follows.

If an error occurs after make_empty_dir(),
 1. truncate_inode_pages()
   The make_bad_inode() prior to iput() will change i_mode to S_IFREG, which
   means that f2fs will not decrement fi->dirty_dents during f2fs_evict_inode.
   But, by calling it here, we can do that.

 2. truncate_blocks()
   Preallocated dentry pages are truncated here to sync i_blocks.

 3. remove_dirty_dir_inode()
   Remove this directory inode from the list.
Reported-and-Tested-by: Andrey Tsyvarev <tsyvarev@ispras.ru>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

f2fs: fix a build warning · f6517cfc

Committed by Jaegeuk Kim on Jan 28, 2014

This patch modifies the flow a little bit to avoid the following build warnings.

src/fs/f2fs/recovery.c: In function ‘check_index_in_prev_nodes’:
src/fs/f2fs/recovery.c:288:51: warning: ‘sum.<U5390>.<U52f8>.ofs_in_node’ may
be used uninitialized in this function [-Wmaybe-uninitialized]
src/fs/f2fs/recovery.c:260:23: warning: ‘sum.nid’ may be used uninitialized
in this function [-Wmaybe-uninitialized]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

f2fs: clean up with a macro · 491c0854

Committed by Jaegeuk Kim on Feb 04, 2014

This patch adds GET_BLKOFF_FROM_SEG0 to clean up some code.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

f2fs: fix the potential mismatch between dir's i_size and i_blocks · 924a2ddb

Committed by Jaegeuk Kim on Feb 03, 2014

This is the erroneous scenario.

                             i_size    on-disk i_size    i_blocks
__f2fs_add_link()             4096           4096           2
 get_new_data_page            8192           4096           3
 -ENOSPC = init_inode_metadata
 checkpoint                     -            4096           3
 POR and reboot

__f2fs_add_link()             4096           4096           3
 page = get_new_data_page (page->index = 1 by NEW_ADDR)
 add a dentry to the page successfully

f2fs_rmdir()
 f2fs_empty_dir()             4096           4096           3
 f2fs_unlink() goes, since there is no valid dentry due to i_size = 4096.
 But, still there is one dentry in page->index = 1.

So this patch moves the code to write dir->i_size into on-disk i_size in order
to sync dir's i_size, on-disk i_size, and its i_blocks.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

f2fs: remove the ugly pointer conversion · 1b1f559f

Committed by Jaegeuk Kim on Feb 03, 2014

This patch modifies the use of bi_private to remove pointer chasing for sbi.
Previously, we had a bi_private structure, but it needed memory allocation.
So this patch stores the sbi pointer in bi_private and adds a completion pointer
to the sbi.
This avoids the memory allocation and makes cleaner use of bi_private.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

f2fs: fix to recover xattr node block · abb2366c

Committed by Jaegeuk Kim on Jan 28, 2014

If a new xattr node page was allocated and its inode is fsynced, we should
recover the xattr node page during the roll-forward process after a power cut.
But, previously, f2fs didn't handle that case, resulting in the kernel panic
below, as reported by Tom Li.

BUG: unable to handle kernel paging request at ffffc9001c861a98
IP: [<ffffffffa0295236>] check_index_in_prev_nodes+0x86/0x2d0 [f2fs]
Call Trace:
 [<ffffffff815ece9b>] ? printk+0x48/0x4a
 [<ffffffffa029626a>] recover_fsync_data+0xdca/0xf50 [f2fs]
 [<ffffffffa02873ae>] f2fs_fill_super+0x92e/0x970 [f2fs]
 [<ffffffff8112c9f8>] mount_bdev+0x1b8/0x200
 [<ffffffffa0286a80>] ? f2fs_remount+0x130/0x130 [f2fs]
 [<ffffffffa0285e40>] f2fs_mount+0x10/0x20 [f2fs]
 [<ffffffff8112d4de>] mount_fs+0x3e/0x1b0
 [<ffffffff810ef4eb>] ? __alloc_percpu+0xb/0x10
 [<ffffffff8114761f>] vfs_kern_mount+0x6f/0x120
 [<ffffffff811497b9>] do_mount+0x259/0xa90
 [<ffffffff810ead1d>] ? memdup_user+0x3d/0x80
 [<ffffffff810eadb3>] ? strndup_user+0x53/0x70
 [<ffffffff8114a2c9>] SyS_mount+0x89/0xd0
 [<ffffffff815feae2>] system_call_fastpath+0x16/0x1b

This patch adds a recovery function for xattr node pages.
Reported-by: Tom Li <biergaizi@members.fsf.org>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

f2fs: handle dirty segments inside refresh_sit_entry · 5e443818

Committed by Jaegeuk Kim on Jan 28, 2014

This patch cleans up refresh_sit_entry so that it handles locate_dirty_segments.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

f2fs: update_inode_page should be done all the time · 744602cf

Committed by Jaegeuk Kim on Jan 24, 2014

To keep the filesystem consistent, update_inode_page should never fail.
Otherwise, it is possible to lose some metadata in the inode, such as
the link count.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

16 Feb, 2014 2 commits

Btrfs: use right clone root offset for compressed extents · 93de4ba8

Committed by Filipe David Borba Manana on Feb 15, 2014

For non-compressed extents, iterate_extent_inodes() gives us offsets
that take into account the data offset from the file extent items, while
for compressed extents it doesn't. Therefore we have to adjust them before
placing them in a send clone instruction. Not doing this adjustment leads to
the receiving end requesting a wrong file range from the clone ioctl,
which results in file content different from that in the original send
root.

Issue reproducible with the following excerpt from the test I made for
xfstests:

  _scratch_mkfs
  _scratch_mount "-o compress-force=lzo"

  $XFS_IO_PROG -f -c "truncate 118811" $SCRATCH_MNT/foo
  $XFS_IO_PROG -c "pwrite -S 0x0d -b 39987 92267 39987" $SCRATCH_MNT/foo

  $BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap1

  $XFS_IO_PROG -c "pwrite -S 0x3e -b 80000 200000 80000" $SCRATCH_MNT/foo
  $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT
  $XFS_IO_PROG -c "pwrite -S 0xdc -b 10000 250000 10000" $SCRATCH_MNT/foo
  $XFS_IO_PROG -c "pwrite -S 0xff -b 10000 300000 10000" $SCRATCH_MNT/foo

  # will be used for incremental send to be able to issue clone operations
  $BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/clones_snap

  $BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap2

  $FSSUM_PROG -A -f -w $tmp/1.fssum $SCRATCH_MNT/mysnap1
  $FSSUM_PROG -A -f -w $tmp/2.fssum -x $SCRATCH_MNT/mysnap2/mysnap1 \
      -x $SCRATCH_MNT/mysnap2/clones_snap $SCRATCH_MNT/mysnap2
  $FSSUM_PROG -A -f -w $tmp/clones.fssum $SCRATCH_MNT/clones_snap \
      -x $SCRATCH_MNT/clones_snap/mysnap1 -x $SCRATCH_MNT/clones_snap/mysnap2

  $BTRFS_UTIL_PROG send $SCRATCH_MNT/mysnap1 -f $tmp/1.snap
  $BTRFS_UTIL_PROG send $SCRATCH_MNT/clones_snap -f $tmp/clones.snap
  $BTRFS_UTIL_PROG send -p $SCRATCH_MNT/mysnap1 \
      -c $SCRATCH_MNT/clones_snap $SCRATCH_MNT/mysnap2 -f $tmp/2.snap

  _scratch_unmount
  _scratch_mkfs
  _scratch_mount

  $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/1.snap
  $FSSUM_PROG -r $tmp/1.fssum $SCRATCH_MNT/mysnap1 2>> $seqres.full

  $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/clones.snap
  $FSSUM_PROG -r $tmp/clones.fssum $SCRATCH_MNT/clones_snap 2>> $seqres.full

  $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/2.snap
  $FSSUM_PROG -r $tmp/2.fssum $SCRATCH_MNT/mysnap2 2>> $seqres.full
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>

btrfs: fix null pointer deference at btrfs_sysfs_add_one+0x105 · f085381e

Committed by Anand Jain on Jan 15, 2014

bdev is NULL when a disk has disappeared and the filesystem is mounted with
the degrade option

stack trace
---------
btrfs_sysfs_add_one+0x105/0x1c0 [btrfs]
open_ctree+0x15f3/0x1fe0 [btrfs]
btrfs_mount+0x5db/0x790 [btrfs]
? alloc_pages_current+0xa4/0x160
mount_fs+0x34/0x1b0
vfs_kern_mount+0x62/0xf0
do_mount+0x22e/0xa80
? __get_free_pages+0x9/0x40
? copy_mount_options+0x31/0x170
SyS_mount+0x7e/0xc0
system_call_fastpath+0x16/0x1b
---------

reproducer:
-------
mkfs.btrfs -draid1 -mraid1 /dev/sdc /dev/sdd
(detach a disk)
devmgt detach /dev/sdc [1]
mount -o degrade /dev/sdd /btrfs
-------

[1] github.com/anajain/devmgt.git
Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
Tested-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>

15 Feb, 2014 4 commits

Btrfs: unset DCACHE_DISCONNECTED when mounting default subvol · 3a0dfa6a

Committed by Josef Bacik on Feb 14, 2014

A user was running into errors from an NFS export of a subvolume that had a
default subvol set. When we mount a default subvol we will use d_obtain_alias()
to find an existing dentry for the subvolume in the case that the root subvol
has already been mounted, or a dummy one is allocated in the case that the root
subvol has not already been mounted. This allows us to connect the dentry later
on if we wander into the path. However if we don't ever wander into the path we
will keep DCACHE_DISCONNECTED set for a long time, which angers NFS. It doesn't
appear to cause any problems but it is annoying nonetheless, so simply unset
DCACHE_DISCONNECTED in the get_default_root case and switch btrfs_lookup() to
use d_materialise_unique() instead which will make everything play nicely
together and reconnect stuff if we wander into the default subvol path from a
different way. With this patch I'm no longer getting the NFS errors when
exporting a volume that has been mounted with a default subvol set. Thanks,

cc: bfields@fieldses.org
cc: ebiederm@xmission.com
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Chris Mason <clm@fb.com>

Btrfs: fix max_inline mount option · feb5f965

Committed by Mitch Harder on Feb 13, 2014

Currently, the only mount option for max_inline that has any effect is
max_inline=0.  Any other value that is supplied to max_inline will be
adjusted to a minimum of 4k.  Since max_inline has an effective maximum
of ~3900 bytes due to page size limitations, the current behaviour
only has meaning for max_inline=0.

This patch will allow the max_inline mount option to accept non-zero
values as indicated in the documentation.
Signed-off-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Chris Mason <clm@fb.com>
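
The before-and-after behaviour of max_inline can be compared with a short Python sketch. The "before" function follows the commit message (non-zero values are raised to at least 4k). The "after" function is an assumption about what accepting non-zero values looks like, namely capping the value at the sector size; the commit message does not spell this out. The 4096-byte sector size and both function names are also assumptions.

```python
SECTOR_SIZE = 4096  # assumed 4k sector size


def max_inline_before(requested):
    """Old behaviour per the commit message: non-zero values become at least 4k."""
    if requested == 0:
        return 0
    return max(requested, SECTOR_SIZE)


def max_inline_after(requested):
    """Assumed new behaviour: keep the requested value, capped at the sector size."""
    return min(requested, SECTOR_SIZE)


# Compare both versions for a few requested values.
for value in (0, 2048, 8192):
    print(value, max_inline_before(value), max_inline_after(value))
```

Under the old behaviour, max_inline=2048 is indistinguishable from max_inline=4096, which is why only max_inline=0 had any effect.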

Btrfs: fix a lockdep warning when cleaning up aborted transaction · a9d2d4ad

Committed by Liu Bo on Feb 08, 2014

Given that we now have 2 spinlocks for the management of delayed refs,
CONFIG_DEBUG_SPINLOCK=y helped me find this:

[ 4723.413809] BUG: spinlock wrong CPU on CPU#1, btrfs-transacti/2258
[ 4723.414882]  lock: 0xffff880048377670, .magic: dead4ead, .owner: btrfs-transacti/2258, .owner_cpu: 2
[ 4723.417146] CPU: 1 PID: 2258 Comm: btrfs-transacti Tainted: G        W  O 3.12.0+ #4
[ 4723.421321] Call Trace:
[ 4723.421872]  [<ffffffff81680fe7>] dump_stack+0x54/0x74
[ 4723.422753]  [<ffffffff81681093>] spin_dump+0x8c/0x91
[ 4723.424979]  [<ffffffff816810b9>] spin_bug+0x21/0x26
[ 4723.425846]  [<ffffffff81323956>] do_raw_spin_unlock+0x66/0x90
[ 4723.434424]  [<ffffffff81689bf7>] _raw_spin_unlock+0x27/0x40
[ 4723.438747]  [<ffffffffa015da9e>] btrfs_cleanup_one_transaction+0x35e/0x710 [btrfs]
[ 4723.443321]  [<ffffffffa015df54>] btrfs_cleanup_transaction+0x104/0x570 [btrfs]
[ 4723.444692]  [<ffffffff810c1b5d>] ? trace_hardirqs_on_caller+0xfd/0x1c0
[ 4723.450336]  [<ffffffff810c1c2d>] ? trace_hardirqs_on+0xd/0x10
[ 4723.451332]  [<ffffffffa015e5ee>] transaction_kthread+0x22e/0x270 [btrfs]
[ 4723.452543]  [<ffffffffa015e3c0>] ? btrfs_cleanup_transaction+0x570/0x570 [btrfs]
[ 4723.457833]  [<ffffffff81079efa>] kthread+0xea/0xf0
[ 4723.458990]  [<ffffffff81079e10>] ? kthread_create_on_node+0x140/0x140
[ 4723.460133]  [<ffffffff81692aac>] ret_from_fork+0x7c/0xb0
[ 4723.460865]  [<ffffffff81079e10>] ? kthread_create_on_node+0x140/0x140
[ 4723.496521] ------------[ cut here ]------------

----------------------------------------------------------------------

The reason is that we get to call cond_resched_lock(&head_ref->lock) while
still holding @delayed_refs->lock.

This is different from __btrfs_run_delayed_refs(), where we do a drop-acquire
dance before and after actually processing delayed refs.

Here we don't drop the lock, so others are not able to add new delayed refs to
head_ref, and cond_resched_lock(&head_ref->lock) is not necessary here.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>

Revert "btrfs: add ioctl to export size of global metadata reservation" · 11bcac89

Committed by Chris Mason on Feb 14, 2014

This reverts commit 01e219e8.

David Sterba found a different way to provide these features without adding a new
ioctl.  We haven't released any progs with this ioctl yet, so I'm taking this out
for now until we finalize things.
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
CC: Jeff Mahoney <jeffm@suse.com>

14 Feb, 2014 1 commit

lockd: send correct lock when granting a delayed lock. · 2ec197db

Committed by NeilBrown on Feb 07, 2014

If an NFS client attempts to get a lock (using NLM) and the lock is
not available, the server will remember the request and when the lock
becomes available it will send a GRANT request to the client to
provide the lock.

If the client already held an adjacent lock, the GRANT callback will
report the union of the existing and new locks, which can confuse the
client.

This happens because __posix_lock_file (called by vfs_lock_file)
updates the passed-in file_lock structure when adjacent or
over-lapping locks are found.

To avoid this problem we take a copy of the two fields that can
be changed (fl_start and fl_end) before the call and restore them
afterwards.
An alternative would be to allocate a 'struct file_lock', initialise it,
use locks_copy_lock() to take a copy, then locks_release_private()
after the vfs_lock_file() call.  But that is a lot more work.
Reported-by: Olaf Kirch <okir@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

--
v1 had a couple of issues (large on-stack struct and didn't really work properly).
This version is much better tested.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
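
The save-and-restore approach described in the lockd commit can be modelled in Python. This is a toy model, not the lockd code: FileLock is a two-field stand-in for struct file_lock, vfs_lock_file_model is a hypothetical function that imitates the described side effect (widening the passed-in lock to cover adjacent or overlapping held locks), and grant_delayed_lock is a made-up name for the caller.

```python
from dataclasses import dataclass


@dataclass
class FileLock:
    fl_start: int
    fl_end: int


def vfs_lock_file_model(held, request):
    """Toy stand-in: grants the lock and widens `request` in place to cover
    any adjacent or overlapping lock in `held`."""
    for lock in held:
        touches = (lock.fl_start <= request.fl_end + 1
                   and request.fl_start <= lock.fl_end + 1)
        if touches:
            request.fl_start = min(request.fl_start, lock.fl_start)
            request.fl_end = max(request.fl_end, lock.fl_end)
    return 0


def grant_delayed_lock(held, request):
    """Save the two fields the call may change, then restore them afterwards."""
    start, end = request.fl_start, request.fl_end
    err = vfs_lock_file_model(held, request)
    request.fl_start, request.fl_end = start, end
    return err


held = [FileLock(0, 9)]
request = FileLock(10, 19)
grant_delayed_lock(held, request)
print(request)  # still describes only the range the client asked for
```

Without the restore step, the request would come back as the union of both ranges, which is what the GRANT callback used to report.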

12 Feb, 2014 1 commit

nfsd4: fix acl buffer overrun · 09bdc2d7

Committed by J. Bruce Fields on Feb 11, 2014

4ac7249e "nfsd: use get_acl and ->set_acl" forgets to set the size in the
case get_acl() succeeds, so _posix_to_nfsv4_one() can then write past the end
of its allocation.  Symptoms were slab corruption warnings.

Also, some minor cleanup while we're here.  (Among other things, note
that the first few lines guarantee that pacl is non-NULL.)
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

11 Feb, 2014 7 commits

block: Fix cloning of discard/write same bios · 8423ae3d

Committed by Kent Overstreet on Feb 10, 2014

Immutable biovecs changed the way bio segments are treated in such a way that
bio_for_each_segment() cannot now do what we want for discard/write same bios,
since bi_size means something completely different for them.

Fortunately discard and write same bios never have more than a single biovec, so
bio_for_each_segment() is unnecessary and not terribly meaningful for them, but
we still have to special case them in a few places.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

ocfs2: check existence of old dentry in ocfs2_link() · 0e048316

Committed by Xue jiufei on Feb 10, 2014

The linkat system call first calls user_path_at(), checks the existence of the
old dentry, and then calls vfs_link()->ocfs2_link() to do the actual
work.  A race may exist when node A creates a hard link for a file
while node B removes it.

         Node A                          Node B
user_path_at()
  ->ocfs2_lookup(),
find old dentry exist
                                rm file, add inode say inodeA
                                to orphan_dir

call ocfs2_link(),create a
hard link for inodeA.

                                rm the link, add inodeA to orphan_dir
                                again

When the orphan_scan work starts, it calls ocfs2_queue_orphans() to do the
main work.  It first traverses the entries in orphan_dir, linking all inodes
in this orphan_dir into a list that looks like this:

	inodeA->inodeB->...->inodeA

When traversing this list, it will fall into a loop, calling iput() again
and again, and finally trigger BUG_ON(inode->i_state & I_CLEAR).
Signed-off-by: joyce <xuejiufei@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
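
How queueing the same inode twice turns the orphan list into a loop can be reproduced with a small Python model. It assumes an intrusive singly linked list where each new entry is pushed onto the head; the field next_orphan and the functions queue_orphans and has_cycle are hypothetical names for this sketch, not the ocfs2 identifiers.

```python
class Inode:
    def __init__(self, name):
        self.name = name
        self.next_orphan = None  # hypothetical intrusive link field


def queue_orphans(entries):
    """Push each orphan_dir entry onto the head of an intrusive linked list."""
    head = None
    for inode in entries:
        inode.next_orphan = head  # overwrites any link set by an earlier push
        head = inode
    return head


def has_cycle(head):
    """Walk the list and report whether any inode is reached twice."""
    seen = set()
    node = head
    while node is not None:
        if id(node) in seen:
            return True
        seen.add(id(node))
        node = node.next_orphan
    return False


inode_a, inode_b = Inode("inodeA"), Inode("inodeB")
print(has_cycle(queue_orphans([inode_a, inode_b])))           # each inode listed once
print(has_cycle(queue_orphans([inode_a, inode_b, inode_a])))  # inodeA listed twice
```

Because the link lives inside the inode, the second push of inodeA rewrites its link to point at inodeB, which already points back at inodeA.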

ocfs2: update inode size after zeroing the hole · c7d2cbc3

Committed by Junxiao Bi on Feb 10, 2014

fs-writeback will release, without the page lock, the dirty pages whose offsets
are beyond the inode size; the release happens at
block_write_full_page_endio().  If the inode size is not updated, dirty pages
in file holes may be released before being flushed to disk, and the file holes
will then contain some non-zero data, which causes md5sum errors on sparse
files.

To reproduce the bug, find a big sparse file with many holes, like a vm
image file (its actual size should be bigger than the available memory size to
make writeback work more frequently), tar it with the -S option, then keep
untarring it and checking its md5sum again and again until you get a wrong
md5sum.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Younger Liu <younger.liu@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ocfs2: fix issue that ocfs2_setattr() does not deal with new_i_size==i_size · d62e74be

Committed by Younger Liu on Feb 10, 2014

The issue scenario is as follows:

- Create a small file and fallocate a large disk space for a file with
  FALLOC_FL_KEEP_SIZE option.

- ftruncate the file back to the original size again, but the disk free
  space is not changed back.  This is a real bug that is fixed in this
  patch.

In order to solve the issue above, we modified ocfs2_setattr(): if
attr->ia_size != i_size_read(inode), it calls ocfs2_truncate_file() and
truncates disk space to attr->ia_size.
Signed-off-by: Younger Liu <younger.liu@huawei.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Tested-by: Jie Liu <jeff.liu@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Reviewed-by: Jensen <shencanquan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: fix page leak at nfs_symlink() · a0b54add

Committed by Rafael Aquini on Feb 10, 2014

Changes in commit a0b8cab3 ("mm: remove lru parameter from
__pagevec_lru_add and remove parts of pagevec API") have introduced a
call to add_to_page_cache_lru() which causes a leak in nfs_symlink() as
now the page gets an extra refcount that is not dropped.

Jan Stancek observed and reported the leak effect while running test8
from Connectathon Testsuite.  After several iterations over the test
case, which creates several symlinks on a NFS mountpoint, the test
system was quickly getting into an out-of-memory scenario.

This patch fixes the page leak by dropping that extra refcount
add_to_page_cache_lru() is grabbing.
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Rafael Aquini <aquini@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Jeff Layton <jlayton@redhat.com>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: <stable@vger.kernel.org>	[3.11.x+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ocfs2: fix ocfs2_sync_file() if filesystem is readonly · a987c7ca

Committed by Younger Liu on Feb 10, 2014

If the filesystem is read-only, there is no need to flush the drive's caches or
force any uncommitted transactions.

[akpm@linux-foundation.org: return -EROFS, not 0]
Signed-off-by: Younger Liu <younger.liucn@gmail.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fs/file.c:fdtable: avoid triggering OOMs from alloc_fdmem · 96c7a2ff

Committed by Eric W. Biederman on Feb 10, 2014

Recently due to a spike in connections per second memcached on 3
separate boxes triggered the OOM killer from accept.  At the time the
OOM killer was triggered there was 4GB out of 36GB free in zone 1.  The
problem was that alloc_fdtable was allocating an order 3 page (32KiB) to
hold a bitmap, and there was sufficient fragmentation that the largest
page available was 8KiB.

I find the logic that PAGE_ALLOC_COSTLY_ORDER can't fail pretty dubious
but I do agree that order 3 allocations are very likely to succeed.

There are always pathologies where order > 0 allocations can fail when
there are copious amounts of free memory available.  Using the pigeon
hole principle it is easy to show that it requires 1 page more than 50%
of the pages being free to guarantee an order 1 (8KiB) allocation will
succeed, 1 page more than 75% of the pages being free to guarantee an
order 2 (16KiB) allocation will succeed and 1 page more than 87.5% of
the pages being free to guarantee an order 3 allocation will succeed.

A server churning memory with a lot of small requests and replies, like
memcached, is a common case that, if anything, will skew the odds
against large pages being available.

Therefore let's not give external applications a practical way to kill
linux server applications, and specify __GFP_NORETRY to the kmalloc in
alloc_fdmem.  Unless I am misreading the code, by the time the code
reaches should_alloc_retry in __alloc_pages_slowpath (where
__GFP_NORETRY becomes significant), we have already tried everything
reasonable to allocate a page and the only thing left to do is wait.  So
not waiting and falling back to vmalloc immediately seems like the
reasonable thing to do even if there wasn't a chance of triggering the
OOM killer.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Cong Wang <cwang@twopensource.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
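
The pigeonhole figures quoted in the alloc_fdmem commit (50%, 75%, 87.5%) can be checked with a few lines of Python. The sketch assumes a 4 KiB base page and uses made-up function names. The reasoning it encodes: in the worst case a single used page sits in every aligned block of 2^order pages, so a fraction of 1 - 1/2^order of all pages can be free without any whole block being free, and one more free page forces a free block to exist.

```python
PAGE_KIB = 4  # assumed 4 KiB base page size


def order_size_kib(order):
    """Size in KiB of an order-`order` allocation (2**order contiguous pages)."""
    return PAGE_KIB << order


def worst_case_free_fraction(order):
    """Largest free fraction at which an order-`order` allocation can still fail:
    one used page in every aligned block of 2**order pages."""
    return 1 - 1 / (1 << order)


# Print order, allocation size in KiB, and worst-case free percentage.
for order in (1, 2, 3):
    print(order, order_size_kib(order), worst_case_free_fraction(order) * 100)
```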

openeuler / Kernel synced successfully more than 1 year ago