Commits · 22ab9014bfdaf97449759aef946871aabe78bb40 · openeuler / raspberrypi-kernel

04 Jul, 2013 · 3 commits

fs/ocfs2/dlm/dlmrecovery.c:dlm_request_all_locks(): ret should be int instead of enum · 22ab9014

Authored by Joseph Qi on Jul 03, 2013

In dlm_request_all_locks(), ret is declared as an enum, but
o2net_send_message() returns an int.  As a result, the error branch
that follows the call is never taken.  So we should change the type
of ret from enum to int.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Sunil Mushran <sunil.mushran@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
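As a hedged illustration of why the type of ret matters, the sketch below uses hypothetical names (demo_status, demo_send) rather than the real dlm code. With compilers that give an enum without negative enumerators an unsigned underlying type, which GCC commonly does, a negative return value stored in such an enum can never compare below zero, so the error check silently becomes dead code.

```c
#include <stdbool.h>

/* Hypothetical status enum: every enumerator is non-negative. */
enum demo_status { DEMO_OK = 0, DEMO_BUSY = 1 };

/* Hypothetical stand-in for a call that reports failure as a negative int. */
int demo_send(bool fail)
{
	return fail ? -5 : 0;
}

/* Correct pattern: keep the result in an int so "ret < 0" can be true. */
bool send_failed_int(bool fail)
{
	int ret = demo_send(fail);

	return ret < 0;
}

/*
 * Buggy pattern: the int result is stored in an enum. If the compiler
 * picks an unsigned type for the enum, "ret < 0" is always false.
 */
bool send_failed_enum(bool fail)
{
	enum demo_status ret = demo_send(fail);

	return ret < 0;
}
```

Compilers typically flag the comparison in send_failed_enum() when type-limit warnings are enabled, which is one way this class of bug gets noticed.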

fs/ocfs2/dlm/dlmrecovery.c: remove duplicate declarations · 82d627cf

Authored by Joseph Qi on Jul 03, 2013

The following 3 functions are already declared in dlmcommon.h, so there is
no need to declare them again in dlmrecovery.c:

  dlm_complete_recovery_thread
  dlm_launch_recovery_thread
  dlm_kick_recovery_thread
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Sunil Mushran <sunil.mushran@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

configfs: use capped length for ->store_attribute() · 7121064b

Authored by Dan Carpenter on Jul 03, 2013

The difference between "count" and "len" is that "len" is capped at
4095.  Changing it like this makes it match how sysfs_write_file() is
implemented.

This is a static analysis patch.  I haven't found any store_attribute()
functions where this change makes a difference.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
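The difference between the raw "count" and the capped "len" can be sketched in user space as follows. This is only an illustration under assumptions: DEMO_PAGE_SIZE and demo_fill_write_buffer() are hypothetical names, and the 4096-byte buffer is assumed from the 4095 cap mentioned in the message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096u	/* assumed buffer size, giving a 4095-byte cap */

/*
 * Copy at most DEMO_PAGE_SIZE - 1 bytes of the caller's data into the
 * page buffer, NUL-terminate it, and return the capped length. A store
 * callback should be handed this return value, not the original count.
 */
size_t demo_fill_write_buffer(char *page, const char *src, size_t count)
{
	size_t len = count;

	assert(page != NULL && src != NULL);
	if (len >= DEMO_PAGE_SIZE)
		len = DEMO_PAGE_SIZE - 1;
	memcpy(page, src, len);
	page[len] = '\0';
	return len;
}
```

Passing the uncapped count to a callback would tell it that more bytes are valid than were actually copied into the buffer, which is the mismatch the patch removes.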

03 Jul, 2013 · 3 commits

ext4: ->tmpfile() support · af51a2ac

Authored by Al Viro on Jun 29, 2013

very similar to ext3 counterpart...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

vfs: export lseek_execute() to modules · 46a1c2c7

Authored by Jie Liu on Jun 25, 2013

For the file systems (btrfs/ext4/ocfs2/tmpfs) that support
SEEK_DATA/SEEK_HOLE, we end up duplicating what lseek_execute()
does: updating the current file offset to the desired offset
if it is valid. ceph also does similar things in
ceph_llseek().

To reduce the duplication, this patch makes lseek_execute()
publicly accessible so that we can call it directly from the
underlying file systems.

Thanks Dave Chinner for this suggestion.

[AV: call it vfs_setpos(), don't bring the removed 'inode' argument back]

v1->v2:
- Add kernel-doc comments for lseek_execute()
- Call lseek_execute() in ceph->llseek()
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Cc: Ben Myers <bpm@sgi.com>
Cc: Ted Tso <tytso@mit.edu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Sage Weil <sage@inktank.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
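The shared step described above, validating a candidate offset and then updating the file position, might look roughly like this in user space. It is a simplified sketch and not the kernel helper: struct demo_file and demo_setpos() are hypothetical, and the special handling some files need for offsets treated as unsigned is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the part of a file object that holds the offset. */
struct demo_file {
	long long f_pos;
};

/*
 * Validate the requested offset against maxsize and, if it is valid,
 * make it the current position. Returns the new offset, or -EINVAL
 * when the offset is out of range (the position is left untouched).
 */
long long demo_setpos(struct demo_file *file, long long offset,
		      long long maxsize)
{
	assert(file != NULL);
	if (offset < 0 || offset > maxsize)
		return -EINVAL;
	if (offset != file->f_pos)
		file->f_pos = offset;
	return offset;
}
```

A file system's SEEK_DATA/SEEK_HOLE code would compute the candidate offset itself and then finish with one call to such a helper, instead of open-coding the same checks.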

sync: don't block the flusher thread waiting on IO · 7747bd4b

Authored by Dave Chinner on Jul 02, 2013

When sync does its WB_SYNC_ALL writeback, it issues data IO and
then immediately waits for IO completion. This is done in the
context of the flusher thread, and hence completely ties up the
flusher thread for the backing device until all the dirty inodes
have been synced. On filesystems that are dirtying inodes constantly
and quickly, this means the flusher thread can be tied up for
minutes per sync call and hence badly affect system level write IO
performance as the page cache cannot be cleaned quickly.

We already have a wait loop for IO completion for sync(2), so cut
this out of the flusher thread and delegate it to wait_sb_inodes().
Hence we can do rapid IO submission, and then wait for it all to
complete.

Effect of sync on fsmark before the patch:

FSUse%        Count         Size    Files/sec     App Overhead
.....
     0       640000         4096      35154.6          1026984
     0       720000         4096      36740.3          1023844
     0       800000         4096      36184.6           916599
     0       880000         4096       1282.7          1054367
     0       960000         4096       3951.3           918773
     0      1040000         4096      40646.2           996448
     0      1120000         4096      43610.1           895647
     0      1200000         4096      40333.1           921048

And a single sync pass took:

  real    0m52.407s
  user    0m0.000s
  sys     0m0.090s

After the patch, there is no impact on fsmark results, and each
individual sync(2) operation run concurrently with the same fsmark
workload takes roughly 7s:

  real    0m6.930s
  user    0m0.000s
  sys     0m0.039s

IOWs, sync is 7-8x faster on a busy filesystem and does not have an
adverse impact on ongoing async data write operations.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

02 Jul, 2013 · 7 commits

f2fs: fix to recover i_size from roll-forward · a1dd3c13

Authored by Jaegeuk Kim on Jun 27, 2013

If the user issues many data writes together with fsync, the last updated i_size
should be stored in the inode block consistently.

But the previous write_end just marks the inode as dirty and doesn't write its
metadata into the inode block.
After that, fsync writes the inode block with only the newly updated data index,
excluding the inode metadata updates.

So this patch introduces a write_end that also updates the inode block when
i_size is changed.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

f2fs: remove the unused argument "sbi" of func destroy_fsync_dnodes() · 5ebefc5b

Authored by Gu Zheng on Jun 27, 2013

destroy_fsync_dnodes() is a simple list-cleanup function, so delete its unused
and unrelated f2fs_sb_info argument.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

f2fs: remove reusing any prefree segments · 763bfe1b

Authored by Jaegeuk Kim on Jun 27, 2013

This patch removes check_prefree_segments initially designed to enhance the
performance by narrowing the range of LBA usage across the whole block device.

When allocating a new segment, f2fs previously tried to find suitable prefree
segments and, if it found one, reused that segment for further
data or node block allocation.

However, I found that this was a totally wrong approach, since the prefree segments
hold several data or node blocks that will be used by the roll-forward mechanism
that runs after a sudden power-off.

Let's assume the following scenario.

/* write 8MB with fsync; buf is any 4KB buffer */
for (i = 0; i < 2048; i++) {
	offset = i * 4096;
	pwrite(fd, buf, 4096, offset);
	fsync(fd);
}

In this case, naive segment allocation sequence will be like:
 data segment: x, x+1, x+2, x+3
 node segment: y, y+1, y+2, y+3.

But, if we can reuse prefree segments, the sequence can be like:
 data segment: x, x+1, y, y+1
 node segment: y, y+1, y+2, y+3.
Because, y, y+1, and y+2 became prefree segments one by one, and those are
reused by data allocation.

After running this workload, we should consider how to recover the latest
inode with its data.
If we reuse prefree segments such as y or y+1, we lose the old node blocks,
so f2fs cannot even start roll-forward recovery.

Therefore, I suggest that we remove the reuse of prefree segments.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

f2fs: code cleanup and simplify in func {find/add}_gc_inode · 6cc4af56

Authored by Gu Zheng on Jun 20, 2013

This patch simplifies list operations in find_gc_inode and add_gc_inode.
Just simple code cleanup.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
[Jaegeuk Kim: add description]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

f2fs: optimize the init_dirty_segmap function · 8736fbf0

Authored by Namjae Jeon on Jun 16, 2013

Optimize the while loop condition

Since this condition will always be true and while loop will
be terminated by the following condition in code:

if (segno >= TOTAL_SEGS(sbi))
    break;
Hence we can replace the while loop condition with while (1)
instead of always checking that segno is less than the total number of segments.

Also, we do not need to call TOTAL_SEGS() every time. We can store
this value in a local variable since it is constant.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Pankaj Kumar <pankaj.km@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

f2fs: fix an endian conversion bug detected by sparse · 060dd67b

Authored by Jaegeuk Kim on Jun 24, 2013

This patch should fix the following bug reported by kbuild test robot.

fs/f2fs/recovery.c:233:33: sparse: incorrect type in assignment
(different base types)

sparse warnings: (new ones prefixed by >>)

>> recovery.c:233: sparse: incorrect type in assignment (different base types)
   recovery.c:233:    expected unsigned int [unsigned] [assigned] ofs_in_node
   recovery.c:233:    got restricted __le16 [assigned] [usertype] ofs_in_node
>> recovery.c:238: sparse: incorrect type in assignment (different base types)
   recovery.c:238:    expected unsigned int [unsigned] ofs_in_node
   recovery.c:238:    got restricted __le16 [assigned] [usertype] ofs_in_node
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>

f2fs: fix crc endian conversion · 7e586fa0

Authored by Jaegeuk Kim on Jun 19, 2013

While calculating CRC for the checkpoint block, we use __u32, but when storing
the crc value to the disk, we use __le32.

Let's fix the inconsistency.
Reported-and-Tested-by: Oded Gabbay <ogabbay@advaoptical.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
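The distinction between a CPU-order value and its little-endian on-disk form can be sketched in portable C. In the kernel this conversion is done with helpers such as cpu_to_le32() and le32_to_cpu(); the demo_put_le32() and demo_get_le32() functions below are hypothetical user-space stand-ins and are not part of f2fs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a CPU-order 32-bit value as 4 little-endian bytes (on-disk form). */
void demo_put_le32(uint8_t out[4], uint32_t v)
{
	assert(out != NULL);
	out[0] = (uint8_t)(v & 0xff);
	out[1] = (uint8_t)((v >> 8) & 0xff);
	out[2] = (uint8_t)((v >> 16) & 0xff);
	out[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Read 4 little-endian bytes back into a CPU-order 32-bit value. */
uint32_t demo_get_le32(const uint8_t in[4])
{
	assert(in != NULL);
	return (uint32_t)in[0] |
	       ((uint32_t)in[1] << 8) |
	       ((uint32_t)in[2] << 16) |
	       ((uint32_t)in[3] << 24);
}
```

On a little-endian machine mixing the two forms happens to work, which is why such bugs tend to show up only on big-endian hardware or under a checker like sparse.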

01 Jul, 2013 · 16 commits

ext4: optimize starting extent in ext4_ext_rm_leaf() · 6ae06ff5

Authored by Ashish Sangwan on Jul 01, 2013

Both hole punch and truncate use ext4_ext_rm_leaf() for removing
blocks.  Currently we choose the last extent as the starting
point for removing blocks:

	ex = EXT_LAST_EXTENT(eh);

This is OK for truncate but for hole punch we can optimize the extent
selection as the path is already initialized.  We could use this
information to select proper starting extent.  The code change in this
patch will not affect truncate as for truncate path[depth].p_ext will
always be NULL.
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

jbd2: invalidate handle if jbd2_journal_restart() fails · 41a5b913

Authored by Theodore Ts'o on Jul 01, 2013

If jbd2_journal_restart() fails the handle will have been disconnected
from the current transaction.  In this situation, the handle must not
be used for any jbd2 function other than jbd2_journal_stop().
Enforce this by treating a handle which has a NULL transaction
pointer as an aborted handle, and issue a kernel warning if
jbd2_journal_extend(), jbd2_journal_get_write_access(),
jbd2_journal_dirty_metadata(), etc. is called with an invalid handle.

This commit also fixes a bug where jbd2_journal_stop() would trip over
a kernel jbd2 assertion check when trying to free an invalid handle.

Also move the responsibility of setting current->journal_info to
start_this_handle(), simplifying the three users of this function.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Younger Liu <younger.liu@huawei.com>
Cc: Jan Kara <jack@suse.cz>
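The rule "a handle with a NULL transaction pointer counts as aborted" can be sketched with hypothetical user-space types. The struct and function names below are illustrative only, and returning -EROFS for a rejected handle is an assumption of this sketch rather than something stated in the message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for the journal structures. */
struct demo_transaction {
	int tid;
};

struct demo_handle {
	struct demo_transaction *transaction;	/* NULL once disconnected */
	bool aborted;
};

/* A handle is unusable if it was aborted or has lost its transaction. */
bool demo_handle_is_aborted(const struct demo_handle *h)
{
	assert(h != NULL);
	return h->aborted || h->transaction == NULL;
}

/* Every journal operation other than "stop" would start with this check. */
int demo_get_write_access(struct demo_handle *h)
{
	if (demo_handle_is_aborted(h))
		return -EROFS;	/* assumed error code for this sketch */
	return 0;
}
```

Folding the NULL check into the existing "is aborted" test means callers that already handle aborted handles need no new code path.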

ext4: translate flag bits to strings in tracepoints · 21ddd568

Authored by Theodore Ts'o on Jul 01, 2013

Translate the bitfields used in various flags argument to strings to
make the tracepoint output more human-readable.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: fix up error handling for mpage_map_and_submit_extent() · cb530541

Authored by Theodore Ts'o on Jul 01, 2013

The function mpage_release_unused_pages() must only be called once;
otherwise the kernel will BUG() when the second call to
mpage_release_unused_pages() tries to unlock the pages which had been
unlocked by the first call.

Also restructure the error handling so that we only give up on writing
the dirty pages in the case of ENOSPC where retrying the allocation
won't help.  Otherwise, a transient failure, such as a kmalloc()
failure in calling ext4_map_blocks() might cause us to give up on
those pages, leading to a scary message in /var/log/messages plus data
loss.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>

jbd2: fix theoretical race in jbd2__journal_restart · 39c04153

Authored by Theodore Ts'o on Jul 01, 2013

Once we decrement transaction->t_updates, if this is the last handle
holding the transaction from closing, and once we release the
t_handle_lock spinlock, it's possible for the transaction to commit
and be released.  In practice with normal kernels, this probably won't
happen, since the commit happens in a separate kernel thread and it's
unlikely this could all happen within the space of a few CPU cycles.

On the other hand, with a real-time kernel, this could potentially
happen, so save the tid found in transaction->t_tid before we release
t_handle_lock.  It would require an insane configuration, such as one
where the jbd2 thread was set to a very high real-time priority,
perhaps because a high priority real-time thread is trying to read or
write to a file system.  But some people who use real-time kernels
have been known to do insane things, including controlling
laser-wielding industrial robots.  :-)
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org

ext4: only zero partial blocks in ext4_zero_partial_blocks() · e1be3a92

Authored by Lukas Czerner on Jul 01, 2013

Currently, if we pass a range into ext4_zero_partial_blocks() which covers
an entire block, we attempt to zero it even though we should only zero
the unaligned part of the block.

Fix this by checking whether the range covers the whole block and skipping
zeroing if so.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
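A minimal user-space sketch of the "only zero the unaligned parts" decision follows. It is not the ext4 code: struct demo_zero_plan and demo_plan_partial_zero() are hypothetical, and the block size is assumed to be a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Byte ranges that need zeroing; a length of 0 means "nothing to do". */
struct demo_zero_plan {
	uint64_t head_off, head_len;	/* unaligned part at the start */
	uint64_t tail_off, tail_len;	/* unaligned part at the end */
};

struct demo_zero_plan demo_plan_partial_zero(uint64_t lstart, uint64_t length,
					     uint64_t bs)
{
	struct demo_zero_plan plan = { 0, 0, 0, 0 };
	uint64_t byte_end, partial_start, partial_end;

	assert(length > 0 && bs > 0 && (bs & (bs - 1)) == 0);
	byte_end = lstart + length - 1;		/* last byte in the range */
	partial_start = lstart & (bs - 1);	/* offset inside first block */
	partial_end = byte_end & (bs - 1);	/* offset inside last block */

	/* Range sits in one block: zero it only if it is not the whole block. */
	if (lstart / bs == byte_end / bs) {
		if (partial_start != 0 || partial_end != bs - 1) {
			plan.head_off = lstart;
			plan.head_len = length;
		}
		return plan;
	}
	/* Range starts mid-block: zero up to the end of that block. */
	if (partial_start != 0) {
		plan.head_off = lstart;
		plan.head_len = bs - partial_start;
	}
	/* Range ends mid-block: zero from the start of that block. */
	if (partial_end != bs - 1) {
		plan.tail_off = byte_end - partial_end;
		plan.tail_len = partial_end + 1;
	}
	return plan;
}
```

For example, with 4096-byte blocks a range starting at byte 100 with length 8192 yields a 3996-byte head and a 100-byte tail, while the fully covered block in between is left to the caller.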

ext4: check error return from ext4_write_inline_data_end() · 42c832de

Authored by Theodore Ts'o on Jul 01, 2013

The function ext4_write_inline_data_end() can return an error.  So we
need to assign it to a signed integer variable to check for an error
return (since copied is an unsigned int).
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Zheng Liu <wenqing.lz@taobao.com>
Cc: stable@vger.kernel.org

ext4: delete unnecessary C statements · 353eefd3

Authored by jon ernst on Jul 01, 2013

Checking whether an unsigned variable is less than 0 always evaluates to false.
err = 0 is duplicated and unnecessary.

[ tytso: Also cleaned up error handling in ext4_block_zero_page_range() ]
Signed-off-by: "Jon Ernst" <jonernst07@gmx.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree() · 64cb9273

Authored by Al Viro on Jul 01, 2013

Both ext3 and ext4 htree_dirblock_to_tree() are just filling the
in-core rbtree for use by call_filldir().  All updates of ->f_pos are
done by the latter; bumping it here (on error) is obviously wrong - we
might very well have it nowhere near the block we'd found an error in.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org

jbd2: move superblock checksum calculation to jbd2_write_superblock() · fe52d17c

Authored by Theodore Ts'o on Jul 01, 2013

Some of the functions which modify the jbd2 superblock were not
updating the checksum before calling jbd2_write_superblock().  Move
the call to jbd2_superblock_csum_set() to jbd2_write_superblock(), so
that the checksum is calculated consistently.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: stable@vger.kernel.org

ext4: pass inode pointer instead of file pointer to punch hole · aeb2817a

Authored by Ashish Sangwan on Jul 01, 2013

No need to pass file pointer when we can directly pass inode pointer.
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: improve free space calculation for inline_data · c4932dbe

Authored by boxi liu on Jul 01, 2013

The ext4 inline_data feature uses the xattr space to store the
inline data in the inode.  When we account for the inline data as an xattr, we
add the padding.  But get_max_inline_xattr_value_size() counts
the free space without the padding.  This causes some contents to be moved to a
block even if they could be stored in the inode.
Signed-off-by: liulei <lewis.liulei@huawei.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Tao Ma <boyu.mt@taobao.com>

ext4: reduce object size when !CONFIG_PRINTK · e7c96e8e

Authored by Joe Perches on Jul 01, 2013

Reducing the object size by ~10% could be useful for embedded systems.

Add #ifdef CONFIG_PRINTK #else #endif blocks to hold formats and
arguments, passing " " to functions when !CONFIG_PRINTK and still
verifying format and arguments with no_printk.

$ size fs/ext4/built-in.o*
   text	   data	    bss	    dec	    hex	filename
 239375	    610	    888	 240873	  3ace9	fs/ext4/built-in.o.new
 264167	    738	    888	 265793	  40e41	fs/ext4/built-in.o.old

    $ grep -E "CONFIG_EXT4|CONFIG_PRINTK" .config
    # CONFIG_PRINTK is not set
    CONFIG_EXT4_FS=y
    CONFIG_EXT4_USE_FOR_EXT23=y
    CONFIG_EXT4_FS_POSIX_ACL=y
    # CONFIG_EXT4_FS_SECURITY is not set
    # CONFIG_EXT4_DEBUG is not set
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: improve extent cache shrink mechanism to avoid to burn CPU time · d3922a77

Authored by Zheng Liu on Jul 01, 2013

Now we maintain a proper in-order LRU list in ext4 to reclaim entries
from extent status tree when we are under heavy memory pressure.  For
keeping this order, a spin lock is used to protect this list.  But this
lock burns a lot of CPU time.  We can use the following steps to trigger
it.

  % cd /dev/shm
  % dd if=/dev/zero of=ext4-img bs=1M count=2k
  % mkfs.ext4 ext4-img
  % mount -t ext4 -o loop ext4-img /mnt
  % cd /mnt
  % for ((i=0;i<160;i++)); do truncate -s 64g $i; done
  % for ((i=0;i<160;i++)); do cp $i /dev/null &; done
  % perf record -a -g
  % perf report

This commit tries to fix this problem.  Now a new member called
i_touch_when is added into ext4_inode_info to record the last access
time for an inode.  Meanwhile we no longer need to keep a proper in-order
LRU list, which avoids burning some CPU time.  When we try to
reclaim some entries from extent status tree, we use list_sort() to get
a proper in-order list.  Then we traverse this list to discard some
entries.  In ext4_sb_info, we use s_es_last_sorted to record the last
time of sorting this list.  When we traverse the list, we skip the inode
that is newer than this time, and move this inode to the tail of LRU
list.  When the head of the list is newer than s_es_last_sorted, we will
sort the LRU list again.

In this commit, we break the loop if s_extent_cache_cnt == 0 because
that means that all extents in extent status tree have been reclaimed.

Meanwhile in this commit, ext4_es_{un}register_shrinker()'s prototype is
changed to save a local variable in these functions.
Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
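The "record a timestamp on access, sort only when reclaiming" idea can be sketched in user space. This is a simplification under assumptions: an array and qsort() stand in for the kernel list and list_sort(), all names are hypothetical, and the s_es_last_sorted bookkeeping that avoids unnecessary re-sorting is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-inode state: last access time and cached entry count. */
struct demo_inode {
	unsigned long touch_when;
	int cached_extents;
};

/* Order inodes from least to most recently touched. */
int demo_cmp_touch(const void *a, const void *b)
{
	const struct demo_inode *x = a;
	const struct demo_inode *y = b;

	return (x->touch_when > y->touch_when) - (x->touch_when < y->touch_when);
}

/*
 * Sort only at reclaim time, then drop cached entries starting with the
 * least recently touched inode until nr_to_scan entries are freed.
 * Returns the number of entries actually freed.
 */
int demo_shrink(struct demo_inode *inodes, size_t n, int nr_to_scan)
{
	int freed = 0;
	size_t i;

	assert(inodes != NULL || n == 0);
	qsort(inodes, n, sizeof(*inodes), demo_cmp_touch);
	for (i = 0; i < n && freed < nr_to_scan; i++) {
		int take = inodes[i].cached_extents;

		if (take > nr_to_scan - freed)
			take = nr_to_scan - freed;
		inodes[i].cached_extents -= take;
		freed += take;
	}
	return freed;
}
```

The access path then only has to store a timestamp, with no lock-protected list reordering; the cost of ordering is paid once per reclaim pass instead.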

ext4: implement error handling of ext4_mb_new_preallocation() · 2c00ef3e

Authored by Alexey Khoroshilov on Jul 01, 2013

If memory allocation in ext4_mb_new_group_pa() fails,
it returns an error code and ext4_mb_new_preallocation() propagates it,
but ext4_mb_new_blocks() ignores it.

An observed result was:

- allocation fail means ext4_mb_new_group_pa() does not update
  ext4_allocation_context;

- ext4_mb_new_blocks() sets ext4_allocation_request->len (ar->len =
  ac->ac_b_ex.fe_len;) to number of blocks preallocated (512) instead
  of number of blocks requested (1);

- that activates update cycle in ext4_splice_branch():
    for (i = 1; i < blks; i++) <-- blks is 512 instead of 1 here
      *(where->p + i) = cpu_to_le32(current_block++);

- it iterates 511 times and corrupts a chunk of memory including inode
  structure;

- page fault happens at EXT4_SB(inode->i_sb) in ext4_mark_inode_dirty();

- system hangs with 'scheduling while atomic' BUG.

The patch implements a check for ext4_mb_new_preallocation() error
code and handles its failure as if ext4_mb_regular_allocator() fails.

Found by Linux File System Verification project (linuxtesting.org).

[ Patch restructured by tytso to make the flow of control easier to follow. ]
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: fix corruption when online resizing a fs with 1K block size · 6ca792ed

Authored by Maarten ter Huurne on Jul 01, 2013

Subtracting the number of the first data block places the superblock
backups one block too early, corrupting the file system. When the block
size is larger than 1K, the first data block is 0, so the subtraction
has no effect and no corruption occurs.
Signed-off-by: Maarten ter Huurne <maarten@treewalker.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
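The off-by-one can be worked through with a small sketch. It assumes the usual ext4 layout in which the first data block is 1 for a 1K block size and 0 otherwise, and in which a group's first block (where a superblock backup would sit) is group * blocks_per_group + first_data_block. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout rule: block 0 holds the boot area when blocks are 1K. */
uint64_t demo_first_data_block(uint32_t block_size)
{
	assert(block_size >= 1024);
	return block_size == 1024 ? 1 : 0;
}

/* First block of a group, which is where a superblock backup would live. */
uint64_t demo_group_first_block(uint64_t group, uint64_t blocks_per_group,
				uint64_t first_data_block)
{
	return group * blocks_per_group + first_data_block;
}
```

With 1K blocks and 8192 blocks per group, group 1 starts at block 8193; subtracting the first data block gives 8192, one block too early. With 4K blocks the first data block is 0, so the same subtraction changes nothing.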

29 Jun, 2013 · 11 commits

lseek_execute() doesn't need an inode passed to it · 2142914e

Authored by Al Viro on Jun 23, 2013

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

block_dev: switch to fixed_size_llseek() · 5d48f3a2

Authored by Al Viro on Jun 23, 2013

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

locks: give the blocked_hash its own spinlock · 7b2296af

Authored by Jeff Layton on Jun 21, 2013

There's no reason we have to protect the blocked_hash and file_lock_list
with the same spinlock. With the tests I have, breaking it in two gives
a barely measurable performance benefit, but it seems reasonable to make
this locking as granular as possible.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

locks: add a new "lm_owner_key" lock operation · 3999e493

Authored by Jeff Layton on Jun 21, 2013

Currently, the hashing that the locking code uses to add these values
to the blocked_hash is simply calculated using fl_owner field. That's
valid in most cases except for server-side lockd, which validates the
owner of a lock based on fl_owner and fl_pid.

In the case where you have a small number of NFS clients doing a lot
of locking between different processes, you could end up with all
the blocked requests sitting in a very small number of hash buckets.

Add a new lm_owner_key operation to the lock_manager_operations that
will generate an unsigned long to use as the key in the hashtable.
That function is only implemented for server-side lockd, and simply
XORs the fl_owner and fl_pid.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
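The key derivation described above, XORing the owner with the pid, can be sketched as follows. The function names are hypothetical, and picking a bucket with a plain modulo is an assumption of this sketch; the message does not say how the key is mapped to a bucket.

```c
#include <assert.h>
#include <stdint.h>

/* Combine owner pointer and pid so one owner with many pids spreads out. */
unsigned long demo_owner_key(const void *owner, unsigned int pid)
{
	return (unsigned long)(uintptr_t)owner ^ (unsigned long)pid;
}

/* Assumed bucket selection for illustration: key modulo bucket count. */
unsigned int demo_bucket(unsigned long key, unsigned int nbuckets)
{
	assert(nbuckets > 0);
	return (unsigned int)(key % nbuckets);
}
```

If the key were the owner alone, every blocked request from a single client would land in the same bucket; mixing in the pid lets requests from different processes on that client fall into different chains.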

locks: turn the blocked_list into a hashtable · 48f74186

Authored by Jeff Layton on Jun 21, 2013

Break up the blocked_list into a hashtable, using the fl_owner as a key.
This speeds up searching the hash chains, which is especially significant
for deadlock detection.

Note that the initial implementation assumes that hashing on fl_owner is
sufficient. In most cases it should be, with the notable exception being
server-side lockd, which compares ownership using a tuple of the
nlm_host and the pid sent in the lock request. So, this may degrade to a
single hash bucket when you only have a single NFS client. That will be
addressed in a later patch.

The careful observer may note that this patch leaves the file_lock_list
alone. There's much less of a case for turning the file_lock_list into a
hashtable. The only user of that list is the code that generates
/proc/locks, and it always walks the entire list.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

locks: convert fl_link to a hlist_node · 139ca04e

Authored by Jeff Layton on Jun 21, 2013

Testing has shown that iterating over the blocked_list for deadlock
detection turns out to be a bottleneck. In order to alleviate that,
begin the process of turning it into a hashtable. We start by turning
the fl_link into a hlist_node and the global lists into hlists. A later
patch will do the conversion of the blocked_list to a hashtable.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

locks: avoid taking global lock if possible when waking up blocked waiters · 4e8c765d

Authored by Jeff Layton on Jun 21, 2013

Since we always hold the i_lock when inserting a new waiter onto the
fl_block list, we can avoid taking the global lock at all if we find
that it's empty when we go to wake up blocked waiters.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

locks: protect most of the file_lock handling with i_lock · 1c8c601a

Authored by Jeff Layton on Jun 21, 2013

Having a global lock that protects all of this code is a clear
scalability problem. Instead of doing that, move most of the code to be
protected by the i_lock instead. The exceptions are the global lists
that the ->fl_link sits on, and the ->fl_block list.

->fl_link is what connects these structures to the
global lists, so we must ensure that we hold those locks when iterating
over or updating these lists.

Furthermore, sound deadlock detection requires that we hold the
blocked_list state steady while checking for loops. We also must ensure
that the search and update to the list are atomic.

For the checking and insertion side of the blocked_list, push the
acquisition of the global lock into __posix_lock_file and ensure that
checking and update of the blocked_list is done without dropping the
lock in between.

On the removal side, when waking up blocked lock waiters, take the
global lock before walking the blocked list and dequeue the waiters from
the global list prior to removal from the fl_block list.

With this, deadlock detection should be race free while we minimize
excessive file_lock_lock thrashing.

Finally, in order to avoid a lock inversion problem when handling
/proc/locks output we must ensure that manipulations of the fl_block
list are also protected by the file_lock_lock.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

locks: encapsulate the fl_link list handling · 88974691

Authored by Jeff Layton on Jun 21, 2013

Move the fl_link list handling routines into a separate set of helpers.
Also ensure that locks and requests are always put on global lists
last (after fully initializing them) and are taken off before uninitializing
them.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

locks: make "added" in __posix_lock_file a bool · b9746ef8

Authored by Jeff Layton on Jun 21, 2013

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

locks: comment cleanups and clarifications · 1cb36012

Authored by Jeff Layton on Jun 21, 2013

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>