Commit · bbddca8e8fac07ece3938e03526b5d00fa791a4c · openeuler / Kernel

09 Jan, 2016 12 commits

nfsd: don't hold i_mutex over userspace upcalls · bbddca8e

Authored by NeilBrown on Jan 07, 2016

We need information about exports when crossing mountpoints during
lookup or NFSv4 readdir.  If we don't already have that information
cached, we may have to ask (and wait for) rpc.mountd.

In both cases we currently hold the i_mutex on the parent of the
directory we're asking rpc.mountd about.  We've seen situations where
rpc.mountd performs some operation on that directory that tries to take
the i_mutex again, resulting in deadlock.

With some care, we may be able to avoid that in rpc.mountd.  But it
seems better just to avoid holding a mutex while waiting on userspace.

It appears that lookup_one_len is pretty much the only operation that
needs the i_mutex.  So we could just drop the i_mutex elsewhere and do
something like

	mutex_lock()
	lookup_one_len()
	mutex_unlock()

In many cases though the lookup would have been cached and not required
the i_mutex, so it's more efficient to create a lookup_one_len() variant
that only takes the i_mutex when necessary.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
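
The idea of a lookup variant that takes the mutex only when the cache misses can be sketched in userspace C. This is an illustration under stated assumptions, not the kernel code: `struct dir`, `struct entry`, `cache_find()`, `lookup_slow()` and `lookup_lock_if_needed()` are hypothetical names, the small array stands in for the dentry cache, and the pthread mutex stands in for the i_mutex. The unlocked fast path here is only safe for a single thread; the kernel relies on its own dcache synchronization for that step.

```c
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define CACHE_SLOTS 8

/* Hypothetical stand-in for a cached directory entry. */
struct entry {
	const char *name;
	int ino;
};

/* Hypothetical stand-in for a parent directory. */
struct dir {
	pthread_mutex_t lock;            /* plays the role of i_mutex */
	struct entry cache[CACHE_SLOTS]; /* plays the role of the dcache */
	size_t ncached;
	unsigned long slow_lookups;      /* lookups that really needed the lock */
};

/* Fast path: search the cache without taking the lock. */
static struct entry *cache_find(struct dir *d, const char *name)
{
	for (size_t i = 0; i < d->ncached; i++)
		if (strcmp(d->cache[i].name, name) == 0)
			return &d->cache[i];
	return NULL;
}

/* Slow path: instantiate a new entry. Caller must hold d->lock. */
static struct entry *lookup_slow(struct dir *d, const char *name)
{
	struct entry *e;

	if (d->ncached == CACHE_SLOTS)
		return NULL;
	e = &d->cache[d->ncached];
	e->name = name;
	e->ino = 100 + (int)d->ncached;
	d->ncached++;
	d->slow_lookups++;
	return e;
}

/* Take the lock only when the cached lookup fails. */
struct entry *lookup_lock_if_needed(struct dir *d, const char *name)
{
	struct entry *e = cache_find(d, name);

	if (e)
		return e;

	pthread_mutex_lock(&d->lock);
	e = cache_find(d, name);   /* recheck: someone may have added it */
	if (!e)
		e = lookup_slow(d, name);
	pthread_mutex_unlock(&d->lock);
	return e;
}
```

A second lookup of the same name returns the cached entry without touching the lock, which is the efficiency argument the commit message makes.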

bbddca8e

fs:affs:Replace time_t with time64_t · db39c167

Authored by DengChao on Nov 12, 2015

The affs code uses "time_t" and "get_seconds()". This will cause
problems on 32-bit architectures in 2038 when time_t overflows.
This patch replaces them with "time64_t" and
"ktime_get_real_seconds()". This patch introduces an expensive 64-bit
division in "secs_to_datestamp()", but since this function is not
called often, the cost should be acceptable.
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: DengChao <chao.deng@linaro.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
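
To make the 64-bit division concrete, here is a userspace sketch of a seconds-to-datestamp conversion. It is not the affs implementation: `struct datestamp_sketch` and `secs_to_datestamp_sketch()` are hypothetical names, and the sketch assumes the Amiga epoch of 1 January 1978 and ticks of 1/50 second while ignoring the timezone adjustment and on-disk byte order.

```c
#include <stdint.h>

/* Seconds between 1970-01-01 and 1978-01-01: 8 years, 2 of them leap. */
#define AMIGA_EPOCH_DELTA ((int64_t)(8 * 365 + 2) * 24 * 60 * 60)

/* Hypothetical host-endian view of a days/minutes/ticks stamp. */
struct datestamp_sketch {
	uint32_t days;
	uint32_t mins;
	uint32_t ticks;
};

struct datestamp_sketch secs_to_datestamp_sketch(int64_t secs)
{
	struct datestamp_sketch ds;
	int64_t days, rem;

	secs -= AMIGA_EPOCH_DELTA;
	if (secs < 0)
		secs = 0;

	/* These are the 64-bit divisions the commit message refers to. */
	days = secs / 86400;
	rem = secs % 86400;

	ds.days = (uint32_t)days;
	ds.mins = (uint32_t)(rem / 60);
	ds.ticks = (uint32_t)((rem % 60) * 50);
	return ds;
}
```

Because `secs` is 64 bits wide, a timestamp past the 32-bit limit (for example 2^32 seconds) still converts to a sensible day count instead of wrapping.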

db39c167

fs/9p: use fscache mutex rather than spinlock · 8f5fed1e

Authored by Sasha Levin on Jan 07, 2016

We may sleep inside the lock, so use a mutex rather than a spinlock.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

8f5fed1e

proc: add a reschedule point in proc_readfd_common() · 3cc4a84e

Authored by Eric Dumazet on Dec 03, 2015

A user can pass an arbitrarily large buffer to getdents().

It is typically a 32KB buffer used by the libc scandir() implementation.

When scanning /proc/{pid}/fd, we can hold the cpu way too long,
so add a cond_resched() to be kind to other tasks.

We've seen latencies of more than 50ms on real workloads.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

3cc4a84e

logfs: constify logfs_block_ops structures · bc51b2a9

Authored by Julia Lawall on Dec 11, 2015

The logfs_block_ops structures are never modified, so declare them as
const.

Done with the help of Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

bc51b2a9

fcntl: allow to set O_DIRECT flag on pipe · 0dbf5f20

Authored by Stanislav Kinsburskiy on Dec 15, 2015

With packetized mode for pipes, it's not possible to set O_DIRECT on a pipe file
via sys_fcntl, because of unsupported sanity checks.
The ability to set this flag will be used by CRIU to migrate packetized pipes.

v2:
Fixed typos and mode variable to check.
Signed-off-by: Stanislav Kinsburskiy <skinsbursky@virtuozzo.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

0dbf5f20

fs: __generic_file_splice_read retry lookup on AOP_TRUNCATED_PAGE · 90330e68

Authored by Abhi Das on Dec 18, 2015

During testing, I discovered that __generic_file_splice_read() returns
0 (EOF) when aops->readpage fails with AOP_TRUNCATED_PAGE on the first
page of a single/multi-page splice read operation. This EOF return code
causes the userspace test to (correctly) report a zero-length read error
when it was expecting otherwise.

The current strategy of returning a partial non-zero read when ->readpage
returns AOP_TRUNCATED_PAGE works only when the failed page is not the
first of the lot being processed.

This patch attempts to retry lookup and call ->readpage again on pages
that had previously failed with AOP_TRUNCATED_PAGE. With this patch, my
tests pass and I haven't noticed any unwanted side effects.

This version removes the thrice-retry loop and instead indefinitely
retries lookups on AOP_TRUNCATED_PAGE errors from ->readpage. This
behavior is now similar to do_generic_file_read().
Signed-off-by: Abhi Das <adas@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
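
The retry strategy described above (redo the lookup and call ->readpage again for as long as it reports a truncated page) can be sketched as a small simulation. Everything here is hypothetical: `struct fake_file`, `fake_readpage()`, `read_first_page_retry()` and the `READ_TRUNCATED` code are stand-ins and are not taken from the splice code.

```c
/* Hypothetical result codes standing in for 0 and AOP_TRUNCATED_PAGE. */
enum { READ_OK = 0, READ_TRUNCATED = 1 };

/* Hypothetical file whose first page is truncated a few times. */
struct fake_file {
	int truncations_left; /* how many reads will still report truncation */
	int lookups;          /* how many page lookups were performed */
};

/* Stand-in for ->readpage: fails with READ_TRUNCATED a bounded number of times. */
static int fake_readpage(struct fake_file *f)
{
	if (f->truncations_left > 0) {
		f->truncations_left--;
		return READ_TRUNCATED;
	}
	return READ_OK;
}

/*
 * Retry the lookup and the read until the page is no longer reported as
 * truncated, instead of giving up and reporting a zero-length read.
 */
int read_first_page_retry(struct fake_file *f)
{
	for (;;) {
		int err;

		f->lookups++;           /* (re)do the page lookup */
		err = fake_readpage(f);
		if (err == READ_TRUNCATED)
			continue;       /* page went away: look it up again */
		return err;
	}
}
```

The loop has no retry limit, mirroring the commit's choice to drop the three-attempt cap; any other error code ends the loop immediately.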

90330e68

fs: xattr: Use kvfree() · 0b2a6f23

Authored by Richard Weinberger on Jan 02, 2016

... instead of open coding it.
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

0b2a6f23

nbd: use ->compat_ioctl() · 263a3df1

Authored by Al Viro on Jan 07, 2016

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

263a3df1

compat_ioctl: don't call do_ioctl under set_fs(KERNEL_DS) · a7f61e89

Authored by Jann Horn on Jan 05, 2016

This replaces all code in fs/compat_ioctl.c that translated
ioctl arguments into an in-kernel structure, then performed
do_ioctl under set_fs(KERNEL_DS), with code that allocates
data on the user stack and can call the VFS ioctl handler
under USER_DS.

This is done as a hardening measure because the caller
does not know what kind of ioctl handler will be invoked,
only that no corresponding compat_ioctl handler exists and
what the ioctl command number is. The accidental
invocation of an unlocked_ioctl handler that unexpectedly
calls copy_to_user could be a severe security issue.
Signed-off-by: Jann Horn <jann@thejh.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

a7f61e89

compat_ioctl: don't pass fd around when not needed · 66cf191f

Authored by Al Viro on Jan 07, 2016

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

66cf191f

compat_ioctl: don't look up the fd twice · b4341721

Authored by Jann Horn on Jan 05, 2016

In code in fs/compat_ioctl.c that translates ioctl arguments
into an in-kernel structure, then performs sys_ioctl, possibly
under set_fs(KERNEL_DS), this commit changes the sys_ioctl
calls to do_ioctl calls. do_ioctl is a new function that does
the same thing as sys_ioctl, but doesn't look up the fd again.

This change is made to avoid (potential) security issues
because of ioctl handlers that accept one of the ioctl
commands I2C_FUNCS, VIDEO_GET_EVENT, MTIOCPOS, MTIOCGET,
TIOCGSERIAL, TIOCSSERIAL, RTC_IRQP_READ, RTC_EPOCH_READ.
This can happen for multiple reasons:

 - The ioctl command number could be reused.
 - The ioctl handler might not check the full ioctl
   command. This is e.g. true for drm_ioctl.
 - The ioctl handler is very special, e.g. cuse_file_ioctl

The real issue is that set_fs(KERNEL_DS) is used here,
but that's fixed in a separate commit
"compat_ioctl: don't call do_ioctl under set_fs(KERNEL_DS)".

This change mitigates potential security issues by
preventing a race that permits invocation of
unlocked_ioctl handlers under KERNEL_DS through compat
code even if a corresponding compat_ioctl handler exists.

So far, no way has been identified to use this to damage
kernel memory without having CAP_SYS_ADMIN in the init ns
(with the capability, doing reads/writes at arbitrary
kernel addresses should be easy through CUSE's ioctl
handler with FUSE_IOCTL_UNRESTRICTED set).

[AV: two missed sys_ioctl() taken care of]
Signed-off-by: Jann Horn <jann@thejh.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

b4341721

07 Jan, 2016 2 commits

fs: use block_device name vsprintf helper · a1c6f057

Authored by Dmitry Monakhov on Apr 13, 2015

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

a1c6f057

fs: use gendisk->disk_name where possible · 424081f3

Authored by Dmitry Monakhov on Apr 13, 2015

gendisk with part==0 is obviously gendisk->disk_name.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

424081f3

06 Jan, 2016 1 commit

poll: plug an unused argument to do_poll · ccec5ee3

Authored by Mateusz Guzik on Jan 06, 2016

The number of fds is already known from the passed list.

No functional changes.
Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ccec5ee3

04 Jan, 2016 7 commits

HFS wants 8Kb per-superblock allocation; just use kmalloc() · 80f8dccf

Authored by Al Viro on Jan 02, 2016

... rather than play with __get_free_pages() (and figuring out the
allocation order, etc.)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

80f8dccf

jfs: microoptimize get_zeroed_page / virt_to_page · 76e8d7cb

Authored by Al Viro on Jan 02, 2016

get_zeroed_page does alloc_page and returns page_address of the result;
subsequent virt_to_page will recover the page, but since the caller
needs both page and its page_address() anyway, why bother going through
that wrapper at all?
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

76e8d7cb

hpfs: missing endianness annotation · 4e728cf8

Authored by Al Viro on Dec 29, 2015

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

4e728cf8

don't carry MAY_OPEN in op->acc_mode · 62fb4a15

Authored by Al Viro on Dec 26, 2015

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

62fb4a15

saner calling conventions for copy_mount_options() · b40ef869

Authored by Al Viro on Dec 14, 2015

let it just return NULL, pointer to kernel copy or ERR_PTR().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
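
The three-way return convention (NULL, a valid pointer, or an encoded error) can be illustrated in userspace C. This is a sketch under stated assumptions: the `ERR_PTR()`, `PTR_ERR()` and `IS_ERR()` helpers below are local re-creations of the well-known kernel idiom, and `copy_options_sketch()` with its `limit` parameter is a hypothetical function whose size cap exists only so the error path can be exercised.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ERRNO 4095

/* Userspace re-creations of the kernel's error-pointer helpers. */
static inline void *ERR_PTR(intptr_t error)
{
	return (void *)error;
}

static inline intptr_t PTR_ERR(const void *ptr)
{
	return (intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* The top MAX_ERRNO addresses are reserved for encoded errors. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/*
 * Hypothetical copy helper with the three outcomes named above:
 *   NULL      -> no options were supplied
 *   pointer   -> heap copy the caller must free()
 *   ERR_PTR() -> failure, with the errno encoded in the pointer
 */
char *copy_options_sketch(const char *data, size_t limit)
{
	size_t len;
	char *copy;

	if (!data)
		return NULL;

	len = strlen(data);
	if (len >= limit)
		return ERR_PTR(-EINVAL); /* illustrative cap, not kernel behaviour */

	copy = malloc(len + 1);
	if (!copy)
		return ERR_PTR(-ENOMEM);

	memcpy(copy, data, len + 1);
	return copy;
}
```

A caller checks `IS_ERR()` first, then treats NULL as "nothing to do", so a single return value carries all three cases without an extra output parameter.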

b40ef869

proc_pid_attr_write(): switch to memdup_user() · bb646cdb

Authored by Al Viro on Dec 24, 2015

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

bb646cdb

convert a bunch of open-coded instances of memdup_user_nul() · 16e5c1fc

Authored by Al Viro on Dec 24, 2015

A _lot_ of ->write() instances were open-coding it; some are
converted to memdup_user_nul(), a lot more remain...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
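
The pattern that such ->write() instances open-code is "allocate len + 1 bytes, copy len bytes, append a NUL". A userspace analogue might look like the sketch below; `memdup_nul_sketch()` is a hypothetical name, it uses `malloc()` and `memcpy()` where the kernel helper copies from user space, and it returns NULL on failure where the kernel helper returns an `ERR_PTR()` value.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Duplicate len bytes from src into a fresh buffer and NUL-terminate it,
 * so the result can be parsed as a C string even if src was not terminated.
 * Returns NULL on allocation failure; the caller owns the buffer.
 */
char *memdup_nul_sketch(const void *src, size_t len)
{
	char *p = malloc(len + 1);

	if (!p)
		return NULL;
	memcpy(p, src, len);
	p[len] = '\0';
	return p;
}
```

Having one helper for this removes the repeated off-by-one risk of forgetting the extra byte or the terminator in each handler.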

16e5c1fc

30 Dec, 2015 3 commits

ocfs2/dlm: clear migration_pending when migration target goes down · cc28d6d8

Authored by xuejiufei on Dec 29, 2015

We have found a BUG on res->migration_pending when migrating lock
resources.  The situation is as follows.

dlm_mark_lockres_migrating
  res->migration_pending = 1;
  __dlm_lockres_reserve_ast
  dlm_lockres_release_ast returns with res->migration_pending still set
      because other threads reserve asts
  wait dlm_migration_can_proceed returns 1
  >>>>>>> o2hb finds that the target went down and removes it
          from domain_map
  dlm_migration_can_proceed returns 1
  dlm_mark_lockres_migrating returns -ESHUTDOWN with
      res->migration_pending still set.

When reentering dlm_mark_lockres_migrating(), it will trigger the BUG_ON
with res->migration_pending.  So clear migration_pending when target is
down.
Signed-off-by: Jiufei Xue <xuejiufei@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

cc28d6d8

ocfs2: fix flock panic issue · b5a8bc33

Authored by Junxiao Bi on Dec 29, 2015

Commit 4f656367 ("Move locks API users to locks_lock_inode_wait()")
moved the flock/posix lock identification code to locks_lock_inode_wait(), but
failed to set fl_flags to FL_FLOCK, which caused the following kernel
panic on 4.4.0_rc5.

  kernel BUG at fs/locks.c:1895!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: ocfs2(O) ocfs2_dlmfs(O) ocfs2_stack_o2cb(O) ocfs2_dlm(O) ocfs2_nodemanager(O) ocfs2_stackglue(O) iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi xen_kbdfront xen_netfront xen_fbfront xen_blkfront
  CPU: 0 PID: 20268 Comm: flock_unit_test Tainted: G           O    4.4.0-rc5-next-20151217 #1
  Hardware name: Xen HVM domU, BIOS 4.3.1OVM 05/14/2014
  task: ffff88007b3672c0 ti: ffff880028b58000 task.ti: ffff880028b58000
  RIP: locks_lock_inode_wait+0x2e/0x160
  Call Trace:
    ocfs2_do_flock+0x91/0x160 [ocfs2]
    ocfs2_flock+0x76/0xd0 [ocfs2]
    SyS_flock+0x10f/0x1a0
    entry_SYSCALL_64_fastpath+0x12/0x71
  Code: e5 41 57 41 56 49 89 fe 41 55 41 54 53 48 89 f3 48 81 ec 88 00 00 00 8b 46 40 83 e0 03 83 f8 01 0f 84 ad 00 00 00 83 f8 02 74 04 <0f> 0b eb fe 4c 8d ad 60 ff ff ff 4c 8d 7b 58 e8 0e 8e 73 00 4d
  RIP  locks_lock_inode_wait+0x2e/0x160
   RSP <ffff880028b5bce8>
  ---[ end trace dfca74ec9b5b274c ]---

Fixes: 4f656367 ("Move locks API users to locks_lock_inode_wait()")
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b5a8bc33

ocfs2: fix BUG when calculate new backup super · 5c9ee4cb

Authored by Joseph Qi on Dec 29, 2015

When resizing, ocfs2 first extends the last gd.  If a backup super should
live in that gd, it calculates the new backup super and updates the
corresponding value.

But it currently doesn't consider the situation where the backup super is
already done.  In this case, it still sets the bit in the gd bitmap and
then decrements bg_free_bits_count, which leads to a corrupted gd and
triggers the BUG in ocfs2_block_group_set_bits:

    BUG_ON(le16_to_cpu(bg->bg_free_bits_count) < num_bits);

So check whether the backup super is done and then do the updates.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Jiufei Xue <xuejiufei@huawei.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
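
The accounting problem described above (setting an already-set bit and still decrementing the free count) can be shown with a tiny bitmap model. This is a sketch, not ocfs2 code: `struct group_desc_sketch`, `gd_test_bit()` and `mark_backup_super()` are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical miniature of a group descriptor: a bitmap plus a free count. */
struct group_desc_sketch {
	uint8_t bitmap[8];
	uint16_t free_bits_count;
};

static int gd_test_bit(const struct group_desc_sketch *gd, unsigned int bit)
{
	return (gd->bitmap[bit / 8] >> (bit % 8)) & 1;
}

/*
 * Mark the backup-super bit, but only account for it once.
 * Returns 1 if the bit was newly set, 0 if it had already been set.
 */
int mark_backup_super(struct group_desc_sketch *gd, unsigned int bit)
{
	if (gd_test_bit(gd, bit))
		return 0;       /* already done: leave the counter alone */

	gd->bitmap[bit / 8] |= (uint8_t)(1u << (bit % 8));
	gd->free_bits_count--;
	return 1;
}
```

Without the early return, a second call would decrement `free_bits_count` again, and with an unsigned counter already at zero it would wrap, which is the kind of corruption the BUG_ON guards against.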

5c9ee4cb

23 Dec, 2015 1 commit

new helpers: no_seek_end_llseek{,_size}() · b25472f9

Authored by Al Viro on Dec 05, 2015

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

b25472f9

19 Dec, 2015 1 commit

proc: fix -ESRCH error when writing to /proc/$pid/coredump_filter · 41a0c249

Authored by Colin Ian King on Dec 18, 2015

Writing to /proc/$pid/coredump_filter always returns -ESRCH because commit
774636e1 ("proc: convert to kstrto*()/kstrto*_from_user()") removed
the setting of ret after the get_proc_task call and incorrectly left it as
-ESRCH.  Instead, return 0 when successful.

Example breakage:

  echo 0 > /proc/self/coredump_filter
  bash: echo: write error: No such process

Fixes: 774636e1 ("proc: convert to kstrto*()/kstrto*_from_user()")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: <stable@vger.kernel.org> [4.3+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
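
The shape of this bug (a return value preset to -ESRCH that is never reset on the success path) can be reproduced in a few lines. The functions below are a hypothetical reduction, not the procfs code: `struct proc_task_sketch`, `write_filter_buggy()` and `write_filter_fixed()` are invented for illustration.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the task whose filter is being written. */
struct proc_task_sketch {
	unsigned long filter;
};

/* Buggy shape: ret keeps its pessimistic default even when the write succeeds. */
int write_filter_buggy(struct proc_task_sketch *task, unsigned long val)
{
	int ret = -ESRCH;

	if (!task)
		goto out;
	task->filter = val;     /* the write happens ... */
out:
	return ret;             /* ... but the caller still sees -ESRCH */
}

/* Fixed shape: reset ret once the task lookup has succeeded. */
int write_filter_fixed(struct proc_task_sketch *task, unsigned long val)
{
	int ret = -ESRCH;

	if (!task)
		goto out;
	ret = 0;
	task->filter = val;
out:
	return ret;
}
```

In the buggy variant the filter is actually updated while the caller is told the process does not exist, which matches the "write error: No such process" symptom in the example breakage.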

41a0c249

17 Dec, 2015 1 commit

nfsd: don't hold ls_mutex across a layout recall · be20aa00

Authored by Jeff Layton on Nov 29, 2015

We do need to serialize layout stateid morphing operations, but we
currently hold the ls_mutex across a layout recall which is pretty
ugly. It's also unnecessary -- once we've bumped the seqid and
copied it, we don't need to serialize the rest of the CB_LAYOUTRECALL
vs. anything else. Just drop the mutex once the copy is done.

This was causing a "workqueue leaked lock or atomic" warning and an
occasional deadlock.

There's more work to be done here but this fixes the immediate
regression.

Fixes: cc8a5532 "nfsd: serialize layout stateid morphing operations"
Cc: stable@vger.kernel.org
Reported-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

be20aa00

16 Dec, 2015 2 commits

Btrfs: check prepare_uptodate_page() error code earlier · bb1591b4

Authored by Chris Mason on Dec 14, 2015

prepare_pages() may end up calling prepare_uptodate_page() twice if our
write only spans a single page.  But if the first call returns an error,
our page will be unlocked and it's not safe to call it again.

This bug goes all the way back to 2011, and it's not something commonly
hit.

While we're here, add a more explicit check for the page being truncated
away.  The bare lock_page() alone is protected only by good thoughts and
i_mutex, which we're sure to regret eventually.
Reported-by: Dave Jones <dsj@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>

bb1591b4

Btrfs: check for empty bitmap list in setup_cluster_bitmaps · 1b9b922a

Authored by Chris Mason on Dec 15, 2015

Dave Jones found a warning from kasan in setup_cluster_bitmap():

==================================================================
BUG: KASAN: stack-out-of-bounds in setup_cluster_bitmap+0xc4/0x5a0 at addr ffff88039bef6828
Read of size 8 by task nfsd/1009
page:ffffea000e6fbd80 count:0 mapcount:0 mapping:          (null) index:0x0
flags: 0x8000000000000000()
page dumped because: kasan: bad access detected
CPU: 1 PID: 1009 Comm: nfsd Tainted: G        W 4.4.0-rc3-backup-debug+ #1
 ffff880065647b50 000000006bb712c2 ffff88039bef6640 ffffffffa680a43e
 0000004559c00000 ffff88039bef66c8 ffffffffa62638d1 ffffffffa61121c0
 ffff8803a5769de8 0000000000000296 ffff8803a5769df0 0000000000046280
Call Trace:
 [<ffffffffa680a43e>] dump_stack+0x4b/0x6d
 [<ffffffffa62638d1>] kasan_report_error+0x501/0x520
 [<ffffffffa61121c0>] ? debug_show_all_locks+0x1e0/0x1e0
 [<ffffffffa6263948>] kasan_report+0x58/0x60
 [<ffffffffa6814b00>] ? rb_last+0x10/0x40
 [<ffffffffa66f8af4>] ? setup_cluster_bitmap+0xc4/0x5a0
 [<ffffffffa6262ead>] __asan_load8+0x5d/0x70
 [<ffffffffa66f8af4>] setup_cluster_bitmap+0xc4/0x5a0
 [<ffffffffa66f675a>] ? setup_cluster_no_bitmap+0x6a/0x400
 [<ffffffffa66fcd16>] btrfs_find_space_cluster+0x4b6/0x640
 [<ffffffffa66fc860>] ? btrfs_alloc_from_cluster+0x4e0/0x4e0
 [<ffffffffa66fc36e>] ? btrfs_return_cluster_to_free_space+0x9e/0xb0
 [<ffffffffa702dc37>] ? _raw_spin_unlock+0x27/0x40
 [<ffffffffa666a1a1>] find_free_extent+0xba1/0x1520

Andrey noticed this was because we were doing list_first_entry on a list
that might be empty.  Rework the tests a bit so we don't do that.
Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Reported-by: Dave Jones <dsj@fb.com>
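
Why list_first_entry on a possibly empty list is dangerous, and what the guard looks like, can be shown with a minimal userspace re-creation of the kernel list idiom. The list helpers below are simplified local copies written for this sketch, and `struct free_space_sketch` and `first_bitmap_or_null()` are hypothetical names rather than btrfs code.

```c
#include <stddef.h>

/* Minimal circular doubly linked list in the style of the kernel's list.h. */
struct list_head {
	struct list_head *next, *prev;
};

#define LIST_HEAD_INIT(name) { &(name), &(name) }

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* On an empty list this "entry" is computed from the head itself: garbage. */
#define list_first_entry(ptr, type, member) \
	container_of((ptr)->next, type, member)

static inline int list_empty(const struct list_head *head)
{
	return head->next == head;
}

static inline void list_add_tail(struct list_head *item, struct list_head *head)
{
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

/* Hypothetical list element standing in for a free-space bitmap entry. */
struct free_space_sketch {
	unsigned long offset;
	struct list_head list;
};

/* Check for emptiness first, so list_first_entry never sees an empty list. */
struct free_space_sketch *first_bitmap_or_null(struct list_head *bitmaps)
{
	if (list_empty(bitmaps))
		return NULL;
	return list_first_entry(bitmaps, struct free_space_sketch, list);
}
```

When the head lives on the stack, as in the KASAN report above, the bogus "entry" computed from an empty list points into unrelated stack memory, which is consistent with the stack-out-of-bounds read.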

1b9b922a

14 Dec, 2015 1 commit

sched/wait: Fix the signal handling fix · dfd01f02

Authored by Peter Zijlstra on Dec 13, 2015

Jan Stancek reported that I wrecked things for him by fixing things for
Vladimir :/

His report was due to an UNINTERRUPTIBLE wait getting -EINTR, which
should not be possible, however my previous patch made this possible by
unconditionally checking signal_pending().

We cannot use current->state as was done previously, because that
variable can be changed as early as the instruction after the store to
it.  We must instead pass the initial state along and use that.

Fixes: 68985633 ("sched/wait: Fix signal handling in bit wait helpers")
Reported-by: Jan Stancek <jstancek@redhat.com>
Reported-by: Chris Mason <clm@fb.com>
Tested-by: Jan Stancek <jstancek@redhat.com>
Tested-by: Vladimir Murzin <vladimir.murzin@arm.com>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Paul Turner <pjt@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: tglx@linutronix.de
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: hpa@zytor.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
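
The idea of passing the initial sleep state along and deciding from it whether a pending signal may interrupt the wait can be sketched as follows. This is a userspace model, not scheduler code: the flag values are illustrative, and `struct waiter_sketch`, `signal_pending_state_sketch()` and `bit_wait_sketch()` are hypothetical names.

```c
#include <errno.h>

/* Illustrative state flags; the numeric values are assumptions. */
#define TASK_INTERRUPTIBLE   0x01
#define TASK_UNINTERRUPTIBLE 0x02
#define TASK_WAKEKILL        0x80
#define TASK_KILLABLE        (TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)

/* Hypothetical view of the waiting task's signal status. */
struct waiter_sketch {
	int signal_pending;
	int fatal_signal_pending;
};

/* Decide from the *requested* sleep mode whether a signal ends the wait. */
static int signal_pending_state_sketch(int mode, const struct waiter_sketch *w)
{
	if (!(mode & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
		return 0;       /* uninterruptible: signals never end the wait */
	if (!w->signal_pending)
		return 0;
	return (mode & TASK_INTERRUPTIBLE) || w->fatal_signal_pending;
}

/* The mode is passed in by the caller instead of being re-read later. */
int bit_wait_sketch(int mode, const struct waiter_sketch *w)
{
	/* a real implementation would sleep here */
	if (signal_pending_state_sketch(mode, w))
		return -EINTR;
	return 0;
}
```

With the mode passed explicitly, an UNINTERRUPTIBLE wait can no longer return -EINTR just because a signal happens to be pending, which is the regression the commit message describes.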

dfd01f02

13 Dec, 2015 2 commits

ocfs2: fix SGID not inherited issue · 854ee2e9

Authored by Junxiao Bi on Dec 11, 2015

Commit 8f1eb487 ("ocfs2: fix umask ignored issue") introduced an
issue: the SGID of a sub dir was not inherited from its parent dir.  This is
because SGID is set in "inode->i_mode" in ocfs2_get_init_inode(), but
is later overwritten by "mode", which doesn't have SGID set.

Fixes: 8f1eb487 ("ocfs2: fix umask ignored issue")
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Acked-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

854ee2e9

osd fs: __r4w_get_page rely on PageUptodate for uptodate · 3066a967

Authored by Hugh Dickins on Dec 11, 2015

Commit 42cb14b1 ("mm: migrate dirty page without
clear_page_dirty_for_io etc") simplified the migration of a PageDirty
pagecache page: one stat needs moving from zone to zone and that's about
all.

It's convenient and safest for it to shift the PageDirty bit from old
page to new, just before updating the zone stats: before copying data
and marking the new PageUptodate.  This is all done while both pages are
isolated and locked, just as before; and just as before, there's a
moment when the new page is visible in the radix_tree, but not yet
PageUptodate.  What's new is that it may now be briefly visible as
PageDirty before it is PageUptodate.

When I scoured the tree to see if this could cause a problem anywhere,
the only places I found were in two similar functions __r4w_get_page():
which look up a page with find_get_page() (not using page lock), then
claim it's uptodate if it's PageDirty or PageWriteback or PageUptodate.

I'm not sure whether that was right before, but now it might be wrong
(on rare occasions): only claim the page is uptodate if PageUptodate.
Or perhaps the page in question could never be migratable anyway?
Signed-off-by: Hugh Dickins <hughd@google.com>
Tested-by: Boaz Harrosh <ooo@electrozaur.com>
Cc: Benny Halevy <bhalevy@panasas.com>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3066a967

10 Dec, 2015 3 commits

btrfs: fix misleading warning when space cache failed to load · 94356889

Authored by Holger Hoffstätte on Nov 27, 2015

When an inconsistent space cache is detected during loading we log a
warning that users frequently mistake for an instruction to invalidate the
cache manually, even though this is not required. Fix the message to
indicate that the cache will be rebuilt automatically.
Signed-off-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Acked-by: Filipe Manana <fdmanana@suse.com>

94356889

Btrfs: fix transaction handle leak in balance · 8a7d656f

Authored by Filipe Manana on Dec 10, 2015

If we fail to allocate a new data chunk, we were jumping to the error path
without releasing the transaction handle we got before. Fix this by always
releasing it before doing the jump.

Fixes: 2c9fe835 ("btrfs: Fix lost-data-profile caused by balance bg")
Signed-off-by: Filipe Manana <fdmanana@suse.com>

8a7d656f

Btrfs: fix unprotected list move from unused_bgs to deleted_bgs list · 348a0013

Authored by Filipe Manana on Nov 27, 2015

As of my previous change titled "Btrfs: fix scrub preventing unused block
groups from being deleted", the following warning at
extent-tree.c:btrfs_delete_unused_bgs() can be hit when we mount a
filesystem with "-o discard":

 void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 {
 (...)
                 if (trimming) {
                         WARN_ON(!list_empty(&block_group->bg_list));
                         spin_lock(&trans->transaction->deleted_bgs_lock);
                         list_move(&block_group->bg_list,
                                   &trans->transaction->deleted_bgs);
                         spin_unlock(&trans->transaction->deleted_bgs_lock);
                         btrfs_get_block_group(block_group);
                 }
 (...)

This happens because scrub can now add back the block group to the list of
unused block groups (fs_info->unused_bgs). This is dangerous because we
are moving the block group from the unused block groups list to the list
of deleted block groups without holding the lock that protects the source
list (fs_info->unused_bgs_lock).

The following diagram illustrates how this happens:

            CPU 1                                     CPU 2

 cleaner_kthread()
   btrfs_delete_unused_bgs()

     sees bg X in list
      fs_info->unused_bgs

     deletes bg X from list
      fs_info->unused_bgs

                                            scrub_enumerate_chunks()

                                              searches device tree using
                                              its commit root

                                              finds device extent for
                                              block group X

                                              gets block group X from the tree
                                              fs_info->block_group_cache_tree
                                              (via btrfs_lookup_block_group())

                                              sets bg X to RO (again)

                                              scrub_chunk(bg X)

                                              sets bg X back to RW mode

                                              adds bg X to the list
                                              fs_info->unused_bgs again,
                                              since it's still unused and
                                              currently not in that list

     sets bg X to RO mode

     btrfs_remove_chunk(bg X)

     --> discard is enabled and bg X
         is in the fs_info->unused_bgs
         list again so the warning is
         triggered
     --> we move it from that list into
         the transaction's deleted_bgs
         list, but we can have another
         task currently manipulating
         the first list (fs_info->unused_bgs)

Fix this by using the same lock (fs_info->unused_bgs_lock) to protect both
the list of unused block groups and the list of deleted block groups. This
makes it safe and there's not much worry for more lock contention, as this
lock is seldom used and only the cleaner kthread adds elements to the list
of deleted block groups. The warning goes away too, as this was previously
an impossible case (and would have been better a BUG_ON/ASSERT) but it's
not impossible anymore.
Reproduced with fstest btrfs/073 (using MOUNT_OPTIONS="-o discard").
Signed-off-by: Filipe Manana <fdmanana@suse.com>

348a0013

09 Dec, 2015 2 commits

fix the regression from "direct-io: Fix negative return from dio read beyond eof" · 2d4594ac

Authored by Al Viro on Dec 08, 2015

Sure, it's better to bail out of past-the-eof read and return 0 than return
a bogus negative value on such.  Only we'd better make sure we are bailing out
with 0 and not -ENOMEM...

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

2d4594ac

9p: ->evict_inode() should kick out ->i_data, not ->i_mapping · 4ad78628

Authored by Al Viro on Dec 08, 2015

For block devices the pagecache is associated with the inode
on bdevfs, not with the aliasing ones on the mountable filesystems.
The latter have their own ->i_data empty and ->i_mapping pointing
to the (unique per major/minor) bdevfs inode.  That guarantees
cache coherence between all block device inodes with the same
device number.

Eviction of an alias inode has no business trying to evict the
pages belonging to bdevfs one; moreover, ->i_mapping is only
safe to access when the thing is opened.  At the time of
->evict_inode() the victim is definitely *not* opened.  We are
about to kill the address space embedded into struct inode
(inode->i_data) and that's what we need to empty of any pages.

9p instance tries to empty inode->i_mapping instead, which is
both unsafe and bogus - if we have several device nodes with
the same device number in different places, closing one of them
should not try to empty the (shared) page cache.

Fortunately, other instances in the tree are OK; they are
evicting from &inode->i_data instead, as 9p one should.

Cc: stable@vger.kernel.org # v2.6.32+, ones prior to 2.6.36 need only half of that
Reported-by: "Suzuki K. Poulose" <Suzuki.Poulose@arm.com>
Tested-by: "Suzuki K. Poulose" <Suzuki.Poulose@arm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

4ad78628

08 Dec, 2015 1 commit

SUNRPC: Fix callback channel · 756b9b37

Authored by Trond Myklebust on Dec 07, 2015

The NFSv4.1 callback channel is currently broken: the received
message keeps shrinking because the backchannel receive buffer size
never gets reset.
The easiest solution to this problem is to adjust the copied request
rather than change the receive buffer.

Fixes: 38b7631f ("nfs4: limit callback decoding to received bytes")
Cc: Benjamin Coddington <bcodding@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

756b9b37

07 Dec, 2015 1 commit

restore_nameidata(): no need to clear now->stack · e1a63bbc

Authored by Al Viro on Dec 05, 2015

microoptimization: in all callers *now is in the frame we are about to leave.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

e1a63bbc

openeuler / Kernel synced successfully about 1 year ago