提交 · 0eefb10758e696616f19a84d8c5f15b9ffc0dccd · OpenHarmony / kernel_linux

22 6月, 2017 10 次提交

ext4: extended attribute value size limit is enforced by vfs · 0eefb107

由 Tahsin Erdogan 提交于 6月 21, 2017

EXT4_XATTR_MAX_LARGE_EA_SIZE definition in ext4 is currently unused.
Besides, vfs enforces its own 64k limit which makes the 1MB limit in
ext4 redundant. Remove it.
Signed-off-by: NTahsin Erdogan <tahsin@google.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

0eefb107

ext4: fix ref counting for ea_inode · 1e7d359d

由 Tahsin Erdogan 提交于 6月 21, 2017

The ref count on ea_inode is incremented by
ext4_xattr_inode_orphan_add() which is supposed to be decremented by
ext4_xattr_inode_array_free(). The decrement is conditioned on whether
the ea_inode is currently on the orphan list. However, the orphan list
addition only happens when journaling is enabled. In non-journaled case,r
we fail to release the ref count causing an error message like below.

"VFS: Busy inodes after unmount of sdb. Self-destruct in 5 seconds.
Have a nice day..."
Signed-off-by: NTahsin Erdogan <tahsin@google.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

1e7d359d

ext4: call journal revoke when freeing ea_inode blocks · ddfa17e4

由 Tahsin Erdogan 提交于 6月 21, 2017

ea_inode contents are treated as metadata, that's why it is journaled
during initial writes. Failing to call revoke during freeing could cause
user data to be overwritten with original ea_inode contents during journal
replay.
Signed-off-by: NTahsin Erdogan <tahsin@google.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

ddfa17e4

ext4: ea_inode owner should be the same as the inode owner · 9e1ba001

由 Tahsin Erdogan 提交于 6月 21, 2017

Quota charging is based on the ownership of the inode. Currently, the
xattr inode owner is set to the caller which may be different from the
parent inode owner. This is inconsistent with how quota is charged for
xattr block and regular data block writes.
Signed-off-by: NTahsin Erdogan <tahsin@google.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

9e1ba001

ext4: attach jinode after creation of xattr inode · bd3b963b

由 Tahsin Erdogan 提交于 6月 21, 2017

In data=ordered mode jinode needs to be attached to the xattr inode when
writing data to it. Attachment normally occurs during file open for regular
files. Since we are not using file interface to write to the xattr inode,
the jinode attach needs to be done manually.

Otherwise the following crash occurs in data=ordered mode.

 BUG: unable to handle kernel NULL pointer dereference at           (null)
 IP: jbd2_journal_file_inode+0x37/0x110
 PGD 13b3c0067
 P4D 13b3c0067
 PUD 137660067
 PMD 0

 Oops: 0000 [#1] SMP
 CPU: 3 PID: 1877 Comm: python Not tainted 4.12.0-rc1+ #749
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
 task: ffff88010e368980 task.stack: ffffc90000374000
 RIP: 0010:jbd2_journal_file_inode+0x37/0x110
 RSP: 0018:ffffc90000377980 EFLAGS: 00010246
 RAX: 0000000000000000 RBX: ffff880123b06230 RCX: 0000000000280000
 RDX: 0000000000000006 RSI: 0000000000000000 RDI: ffff88012c8585d0
 RBP: ffffc900003779b0 R08: 0000000000000202 R09: 0000000000000001
 R10: 0000000000000000 R11: 0000000000000400 R12: ffff8801111f81c0
 R13: ffff88013b2b6800 R14: ffffc90000377ab0 R15: 0000000000000001
 FS:  00007f0c99b77740(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000000 CR3: 0000000136d91000 CR4: 00000000000006e0
 Call Trace:
  jbd2_journal_inode_add_write+0xe/0x10
  ext4_map_blocks+0x59e/0x620
  ext4_xattr_set_entry+0x501/0x7d0
  ext4_xattr_block_set+0x1b2/0x9b0
  ext4_xattr_set_handle+0x322/0x4f0
  ext4_xattr_set+0x144/0x1a0
  ext4_xattr_user_set+0x34/0x40
  __vfs_setxattr+0x66/0x80
  __vfs_setxattr_noperm+0x69/0x1c0
  vfs_setxattr+0xa2/0xb0
  setxattr+0x12e/0x150
  path_setxattr+0x87/0xb0
  SyS_setxattr+0xf/0x20
  entry_SYSCALL_64_fastpath+0x18/0xad
Signed-off-by: NTahsin Erdogan <tahsin@google.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

bd3b963b

ext4: do not set posix acls on xattr inodes · 1b917ed8

由 Tahsin Erdogan 提交于 6月 21, 2017

We don't need acls on xattr inodes because they are not directly
accessible from user mode.

Besides lockdep complains about recursive locking of xattr_sem as seen
below.

  =============================================
  [ INFO: possible recursive locking detected ]
  4.11.0-rc8+ #402 Not tainted
  ---------------------------------------------
  python/1894 is trying to acquire lock:
   (&ei->xattr_sem){++++..}, at: [<ffffffff804878a6>] ext4_xattr_get+0x66/0x270

  but task is already holding lock:
   (&ei->xattr_sem){++++..}, at: [<ffffffff80489500>] ext4_xattr_set_handle+0xa0/0x5d0

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(&ei->xattr_sem);
    lock(&ei->xattr_sem);

   *** DEADLOCK ***

   May be due to missing lock nesting notation

  3 locks held by python/1894:
   #0:  (sb_writers#10){.+.+.+}, at: [<ffffffff803d829f>] mnt_want_write+0x1f/0x50
   #1:  (&sb->s_type->i_mutex_key#15){+.+...}, at: [<ffffffff803dda27>] vfs_setxattr+0x57/0xb0
   #2:  (&ei->xattr_sem){++++..}, at: [<ffffffff80489500>] ext4_xattr_set_handle+0xa0/0x5d0

  stack backtrace:
  CPU: 0 PID: 1894 Comm: python Not tainted 4.11.0-rc8+ #402
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  Call Trace:
   dump_stack+0x67/0x99
   __lock_acquire+0x5f3/0x1830
   lock_acquire+0xb5/0x1d0
   down_read+0x2f/0x60
   ext4_xattr_get+0x66/0x270
   ext4_get_acl+0x43/0x1e0
   get_acl+0x72/0xf0
   posix_acl_create+0x5e/0x170
   ext4_init_acl+0x21/0xc0
   __ext4_new_inode+0xffd/0x16b0
   ext4_xattr_set_entry+0x5ea/0xb70
   ext4_xattr_block_set+0x1b5/0x970
   ext4_xattr_set_handle+0x351/0x5d0
   ext4_xattr_set+0x124/0x180
   ext4_xattr_user_set+0x34/0x40
   __vfs_setxattr+0x66/0x80
   __vfs_setxattr_noperm+0x69/0x1c0
   vfs_setxattr+0xa2/0xb0
   setxattr+0x129/0x160
   path_setxattr+0x87/0xb0
   SyS_setxattr+0xf/0x20
   entry_SYSCALL_64_fastpath+0x18/0xad
Signed-off-by: NTahsin Erdogan <tahsin@google.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

1b917ed8

ext4: lock inode before calling ext4_orphan_add() · 0de5983d

由 Tahsin Erdogan 提交于 6月 21, 2017

ext4_orphan_add() requires caller to be holding the inode lock.
Add missing lock statements.

 WARNING: CPU: 3 PID: 1806 at fs/ext4/namei.c:2731 ext4_orphan_add+0x4e/0x240
 CPU: 3 PID: 1806 Comm: python Not tainted 4.12.0-rc1+ #746
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
 task: ffff880135d466c0 task.stack: ffffc900014b0000
 RIP: 0010:ext4_orphan_add+0x4e/0x240
 RSP: 0018:ffffc900014b3d50 EFLAGS: 00010246
 RAX: 0000000000000000 RBX: ffff8801348fe1f0 RCX: ffffc900014b3c64
 RDX: 0000000000000000 RSI: ffff8801348fe1f0 RDI: ffff8801348fe1f0
 RBP: ffffc900014b3da0 R08: 0000000000000000 R09: ffffffff80e82025
 R10: 0000000000004692 R11: 000000000000468d R12: ffff880137598000
 R13: ffff880137217000 R14: ffff880134ac58d0 R15: 0000000000000000
 FS:  00007fc50f09e740(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00000000008bc2e0 CR3: 00000001375ac000 CR4: 00000000000006e0
 Call Trace:
  ext4_xattr_inode_orphan_add.constprop.19+0x9d/0xf0
  ext4_xattr_delete_inode+0x1c4/0x2f0
  ext4_evict_inode+0x15a/0x7f0
  evict+0xc0/0x1a0
  iput+0x16a/0x270
  do_unlinkat+0x172/0x290
  SyS_unlink+0x11/0x20
  entry_SYSCALL_64_fastpath+0x18/0xad
Signed-off-by: NTahsin Erdogan <tahsin@google.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

0de5983d

ext4: fix lockdep warning about recursive inode locking · 33d201e0

由 Tahsin Erdogan 提交于 6月 21, 2017

Setting a large xattr value may require writing the attribute contents
to an external inode. In this case we may need to lock the xattr inode
along with the parent inode. This doesn't pose a deadlock risk because
xattr inodes are not directly visible to the user and their access is
restricted.

Assign a lockdep subclass to xattr inode's lock.

 ============================================
 WARNING: possible recursive locking detected
 4.12.0-rc1+ #740 Not tainted
 --------------------------------------------
 python/1822 is trying to acquire lock:
  (&sb->s_type->i_mutex_key#15){+.+...}, at: [<ffffffff804912ca>] ext4_xattr_set_entry+0x65a/0x7b0

 but task is already holding lock:
  (&sb->s_type->i_mutex_key#15){+.+...}, at: [<ffffffff803d6687>] vfs_setxattr+0x57/0xb0

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&sb->s_type->i_mutex_key#15);
   lock(&sb->s_type->i_mutex_key#15);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 4 locks held by python/1822:
  #0:  (sb_writers#10){.+.+.+}, at: [<ffffffff803d0eef>] mnt_want_write+0x1f/0x50
  #1:  (&sb->s_type->i_mutex_key#15){+.+...}, at: [<ffffffff803d6687>] vfs_setxattr+0x57/0xb0
  #2:  (jbd2_handle){.+.+..}, at: [<ffffffff80493f40>] start_this_handle+0xf0/0x420
  #3:  (&ei->xattr_sem){++++..}, at: [<ffffffff804920ba>] ext4_xattr_set_handle+0x9a/0x4f0

 stack backtrace:
 CPU: 0 PID: 1822 Comm: python Not tainted 4.12.0-rc1+ #740
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
 Call Trace:
  dump_stack+0x67/0x9e
  __lock_acquire+0x5f3/0x1750
  lock_acquire+0xb5/0x1d0
  down_write+0x2c/0x60
  ext4_xattr_set_entry+0x65a/0x7b0
  ext4_xattr_block_set+0x1b2/0x9b0
  ext4_xattr_set_handle+0x322/0x4f0
  ext4_xattr_set+0x144/0x1a0
  ext4_xattr_user_set+0x34/0x40
  __vfs_setxattr+0x66/0x80
  __vfs_setxattr_noperm+0x69/0x1c0
  vfs_setxattr+0xa2/0xb0
  setxattr+0x12e/0x150
  path_setxattr+0x87/0xb0
  SyS_setxattr+0xf/0x20
  entry_SYSCALL_64_fastpath+0x18/0xad
Signed-off-by: NTahsin Erdogan <tahsin@google.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

33d201e0

ext4: xattr-in-inode support · e50e5129

由 Andreas Dilger 提交于 6月 21, 2017

Large xattr support is implemented for EXT4_FEATURE_INCOMPAT_EA_INODE.

If the size of an xattr value is larger than will fit in a single
external block, then the xattr value will be saved into the body
of an external xattr inode.

The also helps support a larger number of xattr, since only the headers
will be stored in the in-inode space or the single external block.

The inode is referenced from the xattr header via "e_value_inum",
which was formerly "e_value_block", but that field was never used.
The e_value_size still contains the xattr size so that listing
xattrs does not need to look up the inode if the data is not accessed.

struct ext4_xattr_entry {
        __u8    e_name_len;     /* length of name */
        __u8    e_name_index;   /* attribute name index */
        __le16  e_value_offs;   /* offset in disk block of value */
        __le32  e_value_inum;   /* inode in which value is stored */
        __le32  e_value_size;   /* size of attribute value */
        __le32  e_hash;         /* hash value of name and value */
        char    e_name[0];      /* attribute name */
};

The xattr inode is marked with the EXT4_EA_INODE_FL flag and also
holds a back-reference to the owning inode in its i_mtime field,
allowing the ext4/e2fsck to verify the correct inode is accessed.

[ Applied fix by Dan Carpenter to avoid freeing an ERR_PTR. ]

Lustre-Jira: https://jira.hpdd.intel.com/browse/LU-80
Lustre-bugzilla: https://bugzilla.lustre.org/show_bug.cgi?id=4424Signed-off-by: NKalpak Shah <kalpak.shah@sun.com>
Signed-off-by: NJames Simmons <uja.ornl@gmail.com>
Signed-off-by: NAndreas Dilger <andreas.dilger@intel.com>
Signed-off-by: NTahsin Erdogan <tahsin@google.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>

e50e5129

ext4: add largedir feature · e08ac99f

由 Artem Blagodarenko 提交于 6月 21, 2017

This INCOMPAT_LARGEDIR feature allows larger directories to be created
in ldiskfs, both with directory sizes over 2GB and and a maximum htree
depth of 3 instead of the current limit of 2. These features are needed
in order to exceed the current limit of approximately 10M entries in a
single directory.

This patch was originally written by Yang Sheng to support the Lustre server.

[ Bumped the credits needed to update an indexed directory -- tytso ]
Signed-off-by: NLiang Zhen <liang.zhen@intel.com>
Signed-off-by: NYang Sheng <yang.sheng@intel.com>
Signed-off-by: NArtem Blagodarenko <artem.blagodarenko@seagate.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Reviewed-by: NAndreas Dilger <andreas.dilger@intel.com>

e08ac99f

30 5月, 2017 1 次提交

ext4: fix fdatasync(2) after extent manipulation operations · 67a7d5f5

由 Jan Kara 提交于 5月 29, 2017

Currently, extent manipulation operations such as hole punch, range
zeroing, or extent shifting do not record the fact that file data has
changed and thus fdatasync(2) has a work to do. As a result if we crash
e.g. after a punch hole and fdatasync, user can still possibly see the
punched out data after journal replay. Test generic/392 fails due to
these problems.

Fix the problem by properly marking that file data has changed in these
operations.

CC: stable@vger.kernel.org
Fixes: a4bb6b64Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

67a7d5f5

27 5月, 2017 2 次提交

ext4: fix data corruption for mmap writes · a056bdaa

由 Jan Kara 提交于 5月 26, 2017

mpage_submit_page() can race with another process growing i_size and
writing data via mmap to the written-back page. As mpage_submit_page()
samples i_size too early, it may happen that ext4_bio_write_page()
zeroes out too large tail of the page and thus corrupts user data.

Fix the problem by sampling i_size only after the page has been
write-protected in page tables by clear_page_dirty_for_io() call.
Reported-by: NMichael Zimmer <michael@swarm64.com>
CC: stable@vger.kernel.org
Fixes: cb20d518Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

a056bdaa

ext4: fix data corruption with EXT4_GET_BLOCKS_ZERO · 4f8caa60

由 Jan Kara 提交于 5月 26, 2017

When ext4_map_blocks() is called with EXT4_GET_BLOCKS_ZERO to zero-out
allocated blocks and these blocks are actually converted from unwritten
extent the following race can happen:

CPU0					CPU1

page fault				page fault
...					...
ext4_map_blocks()
  ext4_ext_map_blocks()
    ext4_ext_handle_unwritten_extents()
      ext4_ext_convert_to_initialized()
	- zero out converted extent
	ext4_zeroout_es()
	  - inserts extent as initialized in status tree

					ext4_map_blocks()
					  ext4_es_lookup_extent()
					    - finds initialized extent
					write data
  ext4_issue_zeroout()
    - zeroes out new extent overwriting data

This problem can be reproduced by generic/340 for the fallocated case
for the last block in the file.

Fix the problem by avoiding zeroing out the area we are mapping with
ext4_map_blocks() in ext4_ext_convert_to_initialized(). It is pointless
to zero out this area in the first place as the caller asked us to
convert the area to initialized because he is just going to write data
there before the transaction finishes. To achieve this we delete the
special case of zeroing out full extent as that will be handled by the
cases below zeroing only the part of the extent that needs it. We also
instruct ext4_split_extent() that the middle of extent being split
contains data so that ext4_split_extent_at() cannot zero out full extent
in case of ENOSPC.

CC: stable@vger.kernel.org
Fixes: 12735f88Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

4f8caa60

25 5月, 2017 5 次提交

ext4: fix quota charging for shared xattr blocks · b8cb5a54

由 Tahsin Erdogan 提交于 5月 24, 2017

ext4_xattr_block_set() calls dquot_alloc_block() to charge for an xattr
block when new references are made. However if dquot_initialize() hasn't
been called on an inode, request for charging is effectively ignored
because ext4_inode_info->i_dquot is not initialized yet.

Add dquot_initialize() to call paths that lead to ext4_xattr_block_set().
Signed-off-by: NTahsin Erdogan <tahsin@google.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Reviewed-by: NJan Kara <jack@suse.cz>

b8cb5a54

ext4: remove redundant check for encrypted file on dio write path · c41d342b

由 Eric Biggers 提交于 5月 24, 2017

Currently we don't allow direct I/O on encrypted regular files, so in
such cases we return 0 early in ext4_direct_IO().  There was also an
additional BUG_ON() check in ext4_direct_IO_write(), but it can never be
hit because of the earlier check for the exact same condition in
ext4_direct_IO().  There was also no matching check on the read path,
which made the write path specific check seem very ad-hoc.

Just remove the unnecessary BUG_ON().
Signed-off-by: NEric Biggers <ebiggers@google.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Reviewed-by: NDavid Gstir <david@sigma-star.at>
Reviewed-by: NJan Kara <jack@suse.cz>

c41d342b

ext4: remove unused d_name argument from ext4_search_dir() et al. · d6b97550

由 Eric Biggers 提交于 5月 24, 2017

Now that we are passing a struct ext4_filename, we do not need to pass
around the original struct qstr too.
Signed-off-by: NEric Biggers <ebiggers@google.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Reviewed-by: NJan Kara <jack@suse.cz>

d6b97550

ext4: fix off-by-one error when writing back pages before dio read · e5465795

由 Eric Biggers 提交于 5月 24, 2017

The 'lend' argument of filemap_write_and_wait_range() is inclusive, so
we need to subtract 1 from pos + count.

Note that 'count' is guaranteed to be nonzero since
ext4_file_read_iter() returns early when given a 0 count.

Fixes: 16c54688 ("ext4: Allow parallel DIO reads")
Signed-off-by: NEric Biggers <ebiggers@google.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Reviewed-by: NJan Kara <jack@suse.cz>

e5465795

ext4: fix off-by-one on max nr_pages in ext4_find_unwritten_pgoff() · 624327f8

由 Eryu Guan 提交于 5月 24, 2017

ext4_find_unwritten_pgoff() is used to search for offset of hole or
data in page range [index, end] (both inclusive), and the max number
of pages to search should be at least one, if end == index.
Otherwise the only page is missed and no hole or data is found,
which is not correct.

When block size is smaller than page size, this can be demonstrated
by preallocating a file with size smaller than page size and writing
data to the last block. E.g. run this xfs_io command on a 1k block
size ext4 on x86_64 host.

  # xfs_io -fc "falloc 0 3k" -c "pwrite 2k 1k" \
  	    -c "seek -d 0" /mnt/ext4/testfile
  wrote 1024/1024 bytes at offset 2048
  1 KiB, 1 ops; 0.0000 sec (42.459 MiB/sec and 43478.2609 ops/sec)
  Whence  Result
  DATA    EOF

Data at offset 2k was missed, and lseek(2) returned ENXIO.

This is unconvered by generic/285 subtest 07 and 08 on ppc64 host,
where pagesize is 64k. Because a recent change to generic/285
reduced the preallocated file size to smaller than 64k.
Signed-off-by: NEryu Guan <eguan@redhat.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Reviewed-by: NJan Kara <jack@suse.cz>

624327f8

22 5月, 2017 6 次提交

ext4: keep existing extra fields when inode expands · 887a9730

由 Konstantin Khlebnikov 提交于 5月 21, 2017

ext4_expand_extra_isize() should clear only space between old and new
size.

Fixes: 6dd4ee7c # v2.6.23
Cc: stable@vger.kernel.org
Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

887a9730

ext4: handle the rest of ext4_mb_load_buddy() ENOMEM errors · 9651e6b2

由 Konstantin Khlebnikov 提交于 5月 21, 2017

I've got another report about breaking ext4 by ENOMEM error returned from
ext4_mb_load_buddy() caused by memory shortage in memory cgroup.
This time inside ext4_discard_preallocations().

This patch replaces ext4_error() with ext4_warning() where errors returned
from ext4_mb_load_buddy() are not fatal and handled by caller:
* ext4_mb_discard_group_preallocations() - called before generating ENOSPC,
  we'll try to discard other group or return ENOSPC into user-space.
* ext4_trim_all_free() - just stop trimming and return ENOMEM from ioctl.

Some callers cannot handle errors, thus __GFP_NOFAIL is used for them:
* ext4_discard_preallocations()
* ext4_mb_discard_lg_preallocations()

Fixes: adb7ef60 ("ext4: use __GFP_NOFAIL in ext4_free_blocks()")
Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

9651e6b2

ext4: fix off-by-in in loop termination in ext4_find_unwritten_pgoff() · 3f1d5bad

由 Jan Kara 提交于 5月 21, 2017

There is an off-by-one error in loop termination conditions in
ext4_find_unwritten_pgoff() since 'end' may index a page beyond end of
desired range if 'endoff' is page aligned. It doesn't have any visible
effects but still it is good to fix it.
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

3f1d5bad

ext4: fix SEEK_HOLE · 7d95eddf

由 Jan Kara 提交于 5月 21, 2017

Currently, SEEK_HOLE implementation in ext4 may both return that there's
a hole at some offset although that offset already has data and skip
some holes during a search for the next hole. The first problem is
demostrated by:

xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "seek -h 0" file
wrote 57344/57344 bytes at offset 0
56 KiB, 14 ops; 0.0000 sec (2.054 GiB/sec and 538461.5385 ops/sec)
Whence	Result
HOLE	0

Where we can see that SEEK_HOLE wrongly returned offset 0 as containing
a hole although we have written data there. The second problem can be
demonstrated by:

xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "pwrite 128k 8k"
       -c "seek -h 0" file

wrote 57344/57344 bytes at offset 0
56 KiB, 14 ops; 0.0000 sec (1.978 GiB/sec and 518518.5185 ops/sec)
wrote 8192/8192 bytes at offset 131072
8 KiB, 2 ops; 0.0000 sec (2 GiB/sec and 500000.0000 ops/sec)
Whence	Result
HOLE	139264

Where we can see that hole at offsets 56k..128k has been ignored by the
SEEK_HOLE call.

The underlying problem is in the ext4_find_unwritten_pgoff() which is
just buggy. In some cases it fails to update returned offset when it
finds a hole (when no pages are found or when the first found page has
higher index than expected), in some cases conditions for detecting hole
are just missing (we fail to detect a situation where indices of
returned pages are not contiguous).

Fix ext4_find_unwritten_pgoff() to properly detect non-contiguous page
indices and also handle all cases where we got less pages then expected
in one place and handle it properly there.

CC: stable@vger.kernel.org
Fixes: c8c0df24
CC: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

7d95eddf

jbd2: preserve original nofs flag during journal restart · b4709067

由 Tahsin Erdogan 提交于 5月 21, 2017

When a transaction starts, start_this_handle() saves current
PF_MEMALLOC_NOFS value so that it can be restored at journal stop time.
Journal restart is a special case that calls start_this_handle() without
stopping the transaction. start_this_handle() isn't aware that the
original value is already stored so it overwrites it with current value.

For instance, a call sequence like below leaves PF_MEMALLOC_NOFS flag set
at the end:

  jbd2_journal_start()
  jbd2__journal_restart()
  jbd2_journal_stop()

Make jbd2__journal_restart() restore the original value before calling
start_this_handle().

Fixes: 81378da6 ("jbd2: mark the transaction context with the scope GFP_NOFS context")
Signed-off-by: NTahsin Erdogan <tahsin@google.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Reviewed-by: NJan Kara <jack@suse.cz>

b4709067

ext4: clear lockdep subtype for quota files on quota off · 964edf66

由 Jan Kara 提交于 5月 21, 2017

Quota files have special ranking of i_data_sem lock. We inform lockdep
about it when turning on quotas however when turning quotas off, we
don't clear the lockdep subclass from i_data_sem lock and thus when the
inode gets later reused for a normal file or directory, lockdep gets
confused and complains about possible deadlocks. Fix the problem by
resetting lockdep subclass of i_data_sem on quota off.

Cc: stable@vger.kernel.org
Fixes: daf647d2Reported-and-tested-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: NAndreas Dilger <adilger@dilger.ca>
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

964edf66

17 5月, 2017 1 次提交

fuseblk: Fix warning in super_setup_bdi_name() · 69c8ebf8

由 Jan Kara 提交于 5月 16, 2017

Commit 5f7f7543 "fuse: Convert to separately allocated bdi" didn't
properly handle fuseblk filesystem. When fuse_bdi_init() is called for
that filesystem type, sb->s_bdi is already initialized (by
set_bdev_super()) to point to block device's bdi and consequently
super_setup_bdi_name() complains about this fact when reseting bdi to
the private one.

Fix the problem by properly dropping bdi reference in fuse_bdi_init()
before creating a private bdi in super_setup_bdi_name().

Fixes: 5f7f7543 ("fuse: Convert to separately allocated bdi")
Reported-by: NRakesh Pandit <rakesh@tuxera.com>
Tested-by: NRakesh Pandit <rakesh@tuxera.com>
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@fb.com>

69c8ebf8

14 5月, 2017 1 次提交

dax, xfs, ext4: compile out iomap-dax paths in the FS_DAX=n case · f5705aa8

由 Dan Williams 提交于 5月 13, 2017

Tetsuo reports:

  fs/built-in.o: In function `xfs_file_iomap_end':
  xfs_iomap.c:(.text+0xe0ef9): undefined reference to `put_dax'
  fs/built-in.o: In function `xfs_file_iomap_begin':
  xfs_iomap.c:(.text+0xe1a7f): undefined reference to `dax_get_by_host'
  make: *** [vmlinux] Error 1
  $ grep DAX .config
  CONFIG_DAX=m
  # CONFIG_DEV_DAX is not set
  # CONFIG_FS_DAX is not set

When FS_DAX=n we can/must throw away the dax code in filesystems.
Implement 'fs_' versions of dax_get_by_host() and put_dax() that are
nops in the FS_DAX=n case.

Cc: <linux-xfs@vger.kernel.org>
Cc: <linux-ext4@vger.kernel.org>
Cc: Jan Kara <jack@suse.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Tested-by: NTony Luck <tony.luck@intel.com>
Fixes: ef510424 ("block, dax: move 'select DAX' from BLOCK to FS_DAX")
Reported-by: NTetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

f5705aa8

13 5月, 2017 10 次提交

[CIFS] Minor cleanup of xattr query function · 67b4c889

由 Steve French 提交于 5月 12, 2017

Some minor cleanup of cifs query xattr functions (will also make
SMB3 xattr implementation cleaner as well).
Signed-off-by: NSteve French <steve.french@primarydata.com>

67b4c889

fs: cifs: transport: Use time_after for time comparison · 4328fea7

由 Karim Eshapa 提交于 5月 12, 2017

Use time_after kernel macro for time comparison
that has safety check.
Signed-off-by: NKarim Eshapa <karim.eshapa@gmail.com>
Signed-off-by: NSteve French <smfrench@gmail.com>

4328fea7

SMB2: Fix share type handling · cd123007

由 Christophe JAILLET 提交于 5月 12, 2017

In fs/cifs/smb2pdu.h, we have:
#define SMB2_SHARE_TYPE_DISK    0x01
#define SMB2_SHARE_TYPE_PIPE    0x02
#define SMB2_SHARE_TYPE_PRINT   0x03

Knowing that, with the current code, the SMB2_SHARE_TYPE_PRINT case can
never trigger and printer share would be interpreted as disk share.

So, test the ShareType value for equality instead.

Fixes: faaf946a ("CIFS: Add tree connect/disconnect capability for SMB2")
Signed-off-by: NChristophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: NAurelien Aptel <aaptel@suse.com>
Signed-off-by: NSteve French <smfrench@gmail.com>

cd123007

cifs: cifsacl: Use a temporary ops variable to reduce code length · ecdcf622

由 Joe Perches via samba-technical 提交于 5月 07, 2017

Create an ops variable to store tcon->ses->server->ops and cache
indirections and reduce code size a trivial bit.

$ size fs/cifs/cifsacl.o*
   text	   data	    bss	    dec	    hex	filename
   5338	    136	      8	   5482	   156a	fs/cifs/cifsacl.o.new
   5371	    136	      8	   5515	   158b	fs/cifs/cifsacl.o.old
Signed-off-by: NJoe Perches <joe@perches.com>
Acked-by: NShirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: NSteve French <smfrench@gmail.com>

ecdcf622

dax: fix PMD data corruption when fault races with write · 876f2946

由 Ross Zwisler 提交于 5月 12, 2017

This is based on a patch from Jan Kara that fixed the equivalent race in
the DAX PTE fault path.

Currently DAX PMD read fault can race with write(2) in the following
way:

CPU1 - write(2)                 CPU2 - read fault
                                dax_iomap_pmd_fault()
                                  ->iomap_begin() - sees hole

dax_iomap_rw()
  iomap_apply()
    ->iomap_begin - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - there's nothing to invalidate

                                  grab_mapping_entry()
				  - we add huge zero page to the radix tree
				    and map it to page tables

The result is that hole page is mapped into page tables (and thus zeros
are seen in mmap) while file has data written in that place.

Fix the problem by locking exception entry before mapping blocks for the
fault.  That way we are sure invalidate_inode_pages2_range() call for
racing write will either block on entry lock waiting for the fault to
finish (and unmap stale page tables after that) or read fault will see
already allocated blocks by write(2).

Fixes: 9f141d6e ("dax: Call ->iomap_begin without entry lock during dax fault")
Link: http://lkml.kernel.org/r/20170510172700.18991-1-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

876f2946

dax: fix data corruption when fault races with write · 13e451fd

由 Jan Kara 提交于 5月 12, 2017

Currently DAX read fault can race with write(2) in the following way:

CPU1 - write(2)			CPU2 - read fault
				dax_iomap_pte_fault()
				  ->iomap_begin() - sees hole
dax_iomap_rw()
  iomap_apply()
    ->iomap_begin - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - there's nothing to invalidate
				  grab_mapping_entry()
				  - we add zero page in the radix tree
				    and map it to page tables

The result is that hole page is mapped into page tables (and thus zeros
are seen in mmap) while file has data written in that place.

Fix the problem by locking exception entry before mapping blocks for the
fault.  That way we are sure invalidate_inode_pages2_range() call for
racing write will either block on entry lock waiting for the fault to
finish (and unmap stale page tables after that) or read fault will see
already allocated blocks by write(2).

Fixes: 9f141d6e
Link: http://lkml.kernel.org/r/20170510085419.27601-5-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz>
Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

13e451fd

ext4: return to starting transaction in ext4_dax_huge_fault() · fb26a1cb

由 Jan Kara 提交于 5月 12, 2017

DAX will return to locking exceptional entry before mapping blocks for a
page fault to fix possible races with concurrent writes.  To avoid lock
inversion between exceptional entry lock and transaction start, start
the transaction already in ext4_dax_huge_fault().

Fixes: 9f141d6e
Link: http://lkml.kernel.org/r/20170510085419.27601-4-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

fb26a1cb

mm: fix data corruption due to stale mmap reads · cd656375

由 Jan Kara 提交于 5月 12, 2017

Currently, we didn't invalidate page tables during invalidate_inode_pages2()
for DAX.  That could result in e.g. 2MiB zero page being mapped into
page tables while there were already underlying blocks allocated and
thus data seen through mmap were different from data seen by read(2).
The following sequence reproduces the problem:

 - open an mmap over a 2MiB hole

 - read from a 2MiB hole, faulting in a 2MiB zero page

 - write to the hole with write(3p). The write succeeds but we
   incorrectly leave the 2MiB zero page mapping intact.

 - via the mmap, read the data that was just written. Since the zero
   page mapping is still intact we read back zeroes instead of the new
   data.

Fix the problem by unconditionally calling invalidate_inode_pages2_range()
in dax_iomap_actor() for new block allocations and by properly
invalidating page tables in invalidate_inode_pages2_range() for DAX
mappings.

Fixes: c6dcf52c
Link: http://lkml.kernel.org/r/20170510085419.27601-3-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

cd656375

dax: prevent invalidation of mapped DAX entries · 4636e70b

由 Ross Zwisler 提交于 5月 12, 2017

Patch series "mm,dax: Fix data corruption due to mmap inconsistency",
v4.

This series fixes data corruption that can happen for DAX mounts when
page faults race with write(2) and as a result page tables get out of
sync with block mappings in the filesystem and thus data seen through
mmap is different from data seen through read(2).

The series passes testing with t_mmap_stale test program from Ross and
also other mmap related tests on DAX filesystem.

This patch (of 4):

dax_invalidate_mapping_entry() currently removes DAX exceptional entries
only if they are clean and unlocked.  This is done via:

  invalidate_mapping_pages()
    invalidate_exceptional_entry()
      dax_invalidate_mapping_entry()

However, for page cache pages removed in invalidate_mapping_pages()
there is an additional criteria which is that the page must not be
mapped.  This is noted in the comments above invalidate_mapping_pages()
and is checked in invalidate_inode_page().

For DAX entries this means that we can can end up in a situation where a
DAX exceptional entry, either a huge zero page or a regular DAX entry,
could end up mapped but without an associated radix tree entry.  This is
inconsistent with the rest of the DAX code and with what happens in the
page cache case.

We aren't able to unmap the DAX exceptional entry because according to
its comments invalidate_mapping_pages() isn't allowed to block, and
unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem.

Since we essentially never have unmapped DAX entries to evict from the
radix tree, just remove dax_invalidate_mapping_entry().

Fixes: c6dcf52c ("mm: Invalidate DAX radix tree entries only if appropriate")
Link: http://lkml.kernel.org/r/20170510085419.27601-2-jack@suse.czSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: NJan Kara <jack@suse.cz>
Reported-by: NJan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>    [4.10+]
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4636e70b

Tigran has moved · cea58224

由 Andrew Morton 提交于 5月 12, 2017

Cc: Tigran Aivazian <aivazian.tigran@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

cea58224

11 5月, 2017 3 次提交

filesystem-dax: fix broken __dax_zero_page_range() conversion · e84b83b9

由 Dan Williams 提交于 5月 10, 2017

The conversion of __dax_zero_page_range() to 'struct dax_operations'
caused it to frequently fail. The mistake was treating the @size
parameter as a dax mapping length rather than just a length of the
clear_pmem() operation. The dax mapping length is assumed to be hard
coded as PAGE_SIZE.

Without this fix any page unaligned zeroing request will trigger a
-EINVAL return from bdev_dax_pgoff().

Cc: Jan Kara <jack@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Reported-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Tested-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Fixes: cccbce67 ("filesystem-dax: convert to dax_direct_access()")
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

e84b83b9

nfsd: Fix up the "supattr_exclcreat" attributes · b26b78cb

由 Trond Myklebust 提交于 5月 09, 2017

If an NFSv4 client asks us for the supattr_exclcreat, then we must
not return attributes that are unsupported by this minor version.
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
Fixes: 75976de6 ("NFSD: Return word2 bitmask if setting security..,")
Cc: stable@vger.kernel.org
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

b26b78cb

nfsd: encoders mustn't use unitialized values in error cases · f961e3f2

由 J. Bruce Fields 提交于 5月 05, 2017

In error cases, lgp->lg_layout_type may be out of bounds; so we
shouldn't be using it until after the check of nfserr.

This was seen to crash nfsd threads when the server receives a LAYOUTGET
request with a large layout type.

GETDEVICEINFO has the same problem.
Reported-by: NAri Kauppi <Ari.Kauppi@synopsys.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

f961e3f2

10 5月, 2017 1 次提交

Don't delay freeing mids when blocked on slow socket write of request · de1892b8

由 Steve French 提交于 5月 04, 2017

When processing responses, and in particular freeing mids (DeleteMidQEntry),
which is very important since it also frees the associated buffers (cifs_buf_release),
we can block a long time if (writes to) socket is slow due to low memory or networking
issues.

We can block in send (smb request) waiting for memory, and be blocked in processing
responess (which could free memory if we let it) - since they both grab the
server->srv_mutex.

In practice, in the DeleteMidQEntry case - there is no reason we need to
grab the srv_mutex so remove these around DeleteMidQEntry, and it allows
us to free memory faster.
Signed-off-by: NSteve French <steve.french@primarydata.com>
Acked-by: NPavel Shilovsky <pshilov@microsoft.com>

de1892b8

OpenHarmony / kernel_linux 上一次同步 4 年多

OpenHarmony / kernel_linux
上一次同步 4 年多