1. 22 6月, 2017 10 次提交
    • T
      ext4: extended attribute value size limit is enforced by vfs · 0eefb107
      Tahsin Erdogan 提交于
      EXT4_XATTR_MAX_LARGE_EA_SIZE definition in ext4 is currently unused.
      Besides, vfs enforces its own 64k limit which makes the 1MB limit in
      ext4 redundant. Remove it.
      Signed-off-by: NTahsin Erdogan <tahsin@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      0eefb107
    • T
      ext4: fix ref counting for ea_inode · 1e7d359d
      Tahsin Erdogan 提交于
      The ref count on ea_inode is incremented by
      ext4_xattr_inode_orphan_add() which is supposed to be decremented by
      ext4_xattr_inode_array_free(). The decrement is conditioned on whether
      the ea_inode is currently on the orphan list. However, the orphan list
      addition only happens when journaling is enabled. In non-journaled case,r
      we fail to release the ref count causing an error message like below.
      
      "VFS: Busy inodes after unmount of sdb. Self-destruct in 5 seconds.
      Have a nice day..."
      Signed-off-by: NTahsin Erdogan <tahsin@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      1e7d359d
    • T
      ext4: call journal revoke when freeing ea_inode blocks · ddfa17e4
      Tahsin Erdogan 提交于
      ea_inode contents are treated as metadata, that's why it is journaled
      during initial writes. Failing to call revoke during freeing could cause
      user data to be overwritten with original ea_inode contents during journal
      replay.
      Signed-off-by: NTahsin Erdogan <tahsin@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      ddfa17e4
    • T
      ext4: ea_inode owner should be the same as the inode owner · 9e1ba001
      Tahsin Erdogan 提交于
      Quota charging is based on the ownership of the inode. Currently, the
      xattr inode owner is set to the caller which may be different from the
      parent inode owner. This is inconsistent with how quota is charged for
      xattr block and regular data block writes.
      Signed-off-by: NTahsin Erdogan <tahsin@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      9e1ba001
    • T
      ext4: attach jinode after creation of xattr inode · bd3b963b
      Tahsin Erdogan 提交于
      In data=ordered mode jinode needs to be attached to the xattr inode when
      writing data to it. Attachment normally occurs during file open for regular
      files. Since we are not using file interface to write to the xattr inode,
      the jinode attach needs to be done manually.
      
      Otherwise the following crash occurs in data=ordered mode.
      
       BUG: unable to handle kernel NULL pointer dereference at           (null)
       IP: jbd2_journal_file_inode+0x37/0x110
       PGD 13b3c0067
       P4D 13b3c0067
       PUD 137660067
       PMD 0
      
       Oops: 0000 [#1] SMP
       CPU: 3 PID: 1877 Comm: python Not tainted 4.12.0-rc1+ #749
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
       task: ffff88010e368980 task.stack: ffffc90000374000
       RIP: 0010:jbd2_journal_file_inode+0x37/0x110
       RSP: 0018:ffffc90000377980 EFLAGS: 00010246
       RAX: 0000000000000000 RBX: ffff880123b06230 RCX: 0000000000280000
       RDX: 0000000000000006 RSI: 0000000000000000 RDI: ffff88012c8585d0
       RBP: ffffc900003779b0 R08: 0000000000000202 R09: 0000000000000001
       R10: 0000000000000000 R11: 0000000000000400 R12: ffff8801111f81c0
       R13: ffff88013b2b6800 R14: ffffc90000377ab0 R15: 0000000000000001
       FS:  00007f0c99b77740(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000000 CR3: 0000000136d91000 CR4: 00000000000006e0
       Call Trace:
        jbd2_journal_inode_add_write+0xe/0x10
        ext4_map_blocks+0x59e/0x620
        ext4_xattr_set_entry+0x501/0x7d0
        ext4_xattr_block_set+0x1b2/0x9b0
        ext4_xattr_set_handle+0x322/0x4f0
        ext4_xattr_set+0x144/0x1a0
        ext4_xattr_user_set+0x34/0x40
        __vfs_setxattr+0x66/0x80
        __vfs_setxattr_noperm+0x69/0x1c0
        vfs_setxattr+0xa2/0xb0
        setxattr+0x12e/0x150
        path_setxattr+0x87/0xb0
        SyS_setxattr+0xf/0x20
        entry_SYSCALL_64_fastpath+0x18/0xad
      Signed-off-by: NTahsin Erdogan <tahsin@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      bd3b963b
    • T
      ext4: do not set posix acls on xattr inodes · 1b917ed8
      Tahsin Erdogan 提交于
      We don't need acls on xattr inodes because they are not directly
      accessible from user mode.
      
      Besides lockdep complains about recursive locking of xattr_sem as seen
      below.
      
        =============================================
        [ INFO: possible recursive locking detected ]
        4.11.0-rc8+ #402 Not tainted
        ---------------------------------------------
        python/1894 is trying to acquire lock:
         (&ei->xattr_sem){++++..}, at: [<ffffffff804878a6>] ext4_xattr_get+0x66/0x270
      
        but task is already holding lock:
         (&ei->xattr_sem){++++..}, at: [<ffffffff80489500>] ext4_xattr_set_handle+0xa0/0x5d0
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      
               CPU0
               ----
          lock(&ei->xattr_sem);
          lock(&ei->xattr_sem);
      
         *** DEADLOCK ***
      
         May be due to missing lock nesting notation
      
        3 locks held by python/1894:
         #0:  (sb_writers#10){.+.+.+}, at: [<ffffffff803d829f>] mnt_want_write+0x1f/0x50
         #1:  (&sb->s_type->i_mutex_key#15){+.+...}, at: [<ffffffff803dda27>] vfs_setxattr+0x57/0xb0
         #2:  (&ei->xattr_sem){++++..}, at: [<ffffffff80489500>] ext4_xattr_set_handle+0xa0/0x5d0
      
        stack backtrace:
        CPU: 0 PID: 1894 Comm: python Not tainted 4.11.0-rc8+ #402
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        Call Trace:
         dump_stack+0x67/0x99
         __lock_acquire+0x5f3/0x1830
         lock_acquire+0xb5/0x1d0
         down_read+0x2f/0x60
         ext4_xattr_get+0x66/0x270
         ext4_get_acl+0x43/0x1e0
         get_acl+0x72/0xf0
         posix_acl_create+0x5e/0x170
         ext4_init_acl+0x21/0xc0
         __ext4_new_inode+0xffd/0x16b0
         ext4_xattr_set_entry+0x5ea/0xb70
         ext4_xattr_block_set+0x1b5/0x970
         ext4_xattr_set_handle+0x351/0x5d0
         ext4_xattr_set+0x124/0x180
         ext4_xattr_user_set+0x34/0x40
         __vfs_setxattr+0x66/0x80
         __vfs_setxattr_noperm+0x69/0x1c0
         vfs_setxattr+0xa2/0xb0
         setxattr+0x129/0x160
         path_setxattr+0x87/0xb0
         SyS_setxattr+0xf/0x20
         entry_SYSCALL_64_fastpath+0x18/0xad
      Signed-off-by: NTahsin Erdogan <tahsin@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      1b917ed8
    • T
      ext4: lock inode before calling ext4_orphan_add() · 0de5983d
      Tahsin Erdogan 提交于
      ext4_orphan_add() requires caller to be holding the inode lock.
      Add missing lock statements.
      
       WARNING: CPU: 3 PID: 1806 at fs/ext4/namei.c:2731 ext4_orphan_add+0x4e/0x240
       CPU: 3 PID: 1806 Comm: python Not tainted 4.12.0-rc1+ #746
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
       task: ffff880135d466c0 task.stack: ffffc900014b0000
       RIP: 0010:ext4_orphan_add+0x4e/0x240
       RSP: 0018:ffffc900014b3d50 EFLAGS: 00010246
       RAX: 0000000000000000 RBX: ffff8801348fe1f0 RCX: ffffc900014b3c64
       RDX: 0000000000000000 RSI: ffff8801348fe1f0 RDI: ffff8801348fe1f0
       RBP: ffffc900014b3da0 R08: 0000000000000000 R09: ffffffff80e82025
       R10: 0000000000004692 R11: 000000000000468d R12: ffff880137598000
       R13: ffff880137217000 R14: ffff880134ac58d0 R15: 0000000000000000
       FS:  00007fc50f09e740(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00000000008bc2e0 CR3: 00000001375ac000 CR4: 00000000000006e0
       Call Trace:
        ext4_xattr_inode_orphan_add.constprop.19+0x9d/0xf0
        ext4_xattr_delete_inode+0x1c4/0x2f0
        ext4_evict_inode+0x15a/0x7f0
        evict+0xc0/0x1a0
        iput+0x16a/0x270
        do_unlinkat+0x172/0x290
        SyS_unlink+0x11/0x20
        entry_SYSCALL_64_fastpath+0x18/0xad
      Signed-off-by: NTahsin Erdogan <tahsin@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      0de5983d
    • T
      ext4: fix lockdep warning about recursive inode locking · 33d201e0
      Tahsin Erdogan 提交于
      Setting a large xattr value may require writing the attribute contents
      to an external inode. In this case we may need to lock the xattr inode
      along with the parent inode. This doesn't pose a deadlock risk because
      xattr inodes are not directly visible to the user and their access is
      restricted.
      
      Assign a lockdep subclass to xattr inode's lock.
      
       ============================================
       WARNING: possible recursive locking detected
       4.12.0-rc1+ #740 Not tainted
       --------------------------------------------
       python/1822 is trying to acquire lock:
        (&sb->s_type->i_mutex_key#15){+.+...}, at: [<ffffffff804912ca>] ext4_xattr_set_entry+0x65a/0x7b0
      
       but task is already holding lock:
        (&sb->s_type->i_mutex_key#15){+.+...}, at: [<ffffffff803d6687>] vfs_setxattr+0x57/0xb0
      
       other info that might help us debug this:
        Possible unsafe locking scenario:
      
              CPU0
              ----
         lock(&sb->s_type->i_mutex_key#15);
         lock(&sb->s_type->i_mutex_key#15);
      
        *** DEADLOCK ***
      
        May be due to missing lock nesting notation
      
       4 locks held by python/1822:
        #0:  (sb_writers#10){.+.+.+}, at: [<ffffffff803d0eef>] mnt_want_write+0x1f/0x50
        #1:  (&sb->s_type->i_mutex_key#15){+.+...}, at: [<ffffffff803d6687>] vfs_setxattr+0x57/0xb0
        #2:  (jbd2_handle){.+.+..}, at: [<ffffffff80493f40>] start_this_handle+0xf0/0x420
        #3:  (&ei->xattr_sem){++++..}, at: [<ffffffff804920ba>] ext4_xattr_set_handle+0x9a/0x4f0
      
       stack backtrace:
       CPU: 0 PID: 1822 Comm: python Not tainted 4.12.0-rc1+ #740
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
       Call Trace:
        dump_stack+0x67/0x9e
        __lock_acquire+0x5f3/0x1750
        lock_acquire+0xb5/0x1d0
        down_write+0x2c/0x60
        ext4_xattr_set_entry+0x65a/0x7b0
        ext4_xattr_block_set+0x1b2/0x9b0
        ext4_xattr_set_handle+0x322/0x4f0
        ext4_xattr_set+0x144/0x1a0
        ext4_xattr_user_set+0x34/0x40
        __vfs_setxattr+0x66/0x80
        __vfs_setxattr_noperm+0x69/0x1c0
        vfs_setxattr+0xa2/0xb0
        setxattr+0x12e/0x150
        path_setxattr+0x87/0xb0
        SyS_setxattr+0xf/0x20
        entry_SYSCALL_64_fastpath+0x18/0xad
      Signed-off-by: NTahsin Erdogan <tahsin@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      33d201e0
    • A
      ext4: xattr-in-inode support · e50e5129
      Andreas Dilger 提交于
      Large xattr support is implemented for EXT4_FEATURE_INCOMPAT_EA_INODE.
      
      If the size of an xattr value is larger than will fit in a single
      external block, then the xattr value will be saved into the body
      of an external xattr inode.
      
      The also helps support a larger number of xattr, since only the headers
      will be stored in the in-inode space or the single external block.
      
      The inode is referenced from the xattr header via "e_value_inum",
      which was formerly "e_value_block", but that field was never used.
      The e_value_size still contains the xattr size so that listing
      xattrs does not need to look up the inode if the data is not accessed.
      
      struct ext4_xattr_entry {
              __u8    e_name_len;     /* length of name */
              __u8    e_name_index;   /* attribute name index */
              __le16  e_value_offs;   /* offset in disk block of value */
              __le32  e_value_inum;   /* inode in which value is stored */
              __le32  e_value_size;   /* size of attribute value */
              __le32  e_hash;         /* hash value of name and value */
              char    e_name[0];      /* attribute name */
      };
      
      The xattr inode is marked with the EXT4_EA_INODE_FL flag and also
      holds a back-reference to the owning inode in its i_mtime field,
      allowing the ext4/e2fsck to verify the correct inode is accessed.
      
      [ Applied fix by Dan Carpenter to avoid freeing an ERR_PTR. ]
      
      Lustre-Jira: https://jira.hpdd.intel.com/browse/LU-80
      Lustre-bugzilla: https://bugzilla.lustre.org/show_bug.cgi?id=4424Signed-off-by: NKalpak Shah <kalpak.shah@sun.com>
      Signed-off-by: NJames Simmons <uja.ornl@gmail.com>
      Signed-off-by: NAndreas Dilger <andreas.dilger@intel.com>
      Signed-off-by: NTahsin Erdogan <tahsin@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      e50e5129
    • A
      ext4: add largedir feature · e08ac99f
      Artem Blagodarenko 提交于
      This INCOMPAT_LARGEDIR feature allows larger directories to be created
      in ldiskfs, both with directory sizes over 2GB and and a maximum htree
      depth of 3 instead of the current limit of 2. These features are needed
      in order to exceed the current limit of approximately 10M entries in a
      single directory.
      
      This patch was originally written by Yang Sheng to support the Lustre server.
      
      [ Bumped the credits needed to update an indexed directory -- tytso ]
      Signed-off-by: NLiang Zhen <liang.zhen@intel.com>
      Signed-off-by: NYang Sheng <yang.sheng@intel.com>
      Signed-off-by: NArtem Blagodarenko <artem.blagodarenko@seagate.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NAndreas Dilger <andreas.dilger@intel.com>
      e08ac99f
  2. 30 5月, 2017 1 次提交
    • J
      ext4: fix fdatasync(2) after extent manipulation operations · 67a7d5f5
      Jan Kara 提交于
      Currently, extent manipulation operations such as hole punch, range
      zeroing, or extent shifting do not record the fact that file data has
      changed and thus fdatasync(2) has a work to do. As a result if we crash
      e.g. after a punch hole and fdatasync, user can still possibly see the
      punched out data after journal replay. Test generic/392 fails due to
      these problems.
      
      Fix the problem by properly marking that file data has changed in these
      operations.
      
      CC: stable@vger.kernel.org
      Fixes: a4bb6b64Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      67a7d5f5
  3. 27 5月, 2017 2 次提交
    • J
      ext4: fix data corruption for mmap writes · a056bdaa
      Jan Kara 提交于
      mpage_submit_page() can race with another process growing i_size and
      writing data via mmap to the written-back page. As mpage_submit_page()
      samples i_size too early, it may happen that ext4_bio_write_page()
      zeroes out too large tail of the page and thus corrupts user data.
      
      Fix the problem by sampling i_size only after the page has been
      write-protected in page tables by clear_page_dirty_for_io() call.
      Reported-by: NMichael Zimmer <michael@swarm64.com>
      CC: stable@vger.kernel.org
      Fixes: cb20d518Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      a056bdaa
    • J
      ext4: fix data corruption with EXT4_GET_BLOCKS_ZERO · 4f8caa60
      Jan Kara 提交于
      When ext4_map_blocks() is called with EXT4_GET_BLOCKS_ZERO to zero-out
      allocated blocks and these blocks are actually converted from unwritten
      extent the following race can happen:
      
      CPU0					CPU1
      
      page fault				page fault
      ...					...
      ext4_map_blocks()
        ext4_ext_map_blocks()
          ext4_ext_handle_unwritten_extents()
            ext4_ext_convert_to_initialized()
      	- zero out converted extent
      	ext4_zeroout_es()
      	  - inserts extent as initialized in status tree
      
      					ext4_map_blocks()
      					  ext4_es_lookup_extent()
      					    - finds initialized extent
      					write data
        ext4_issue_zeroout()
          - zeroes out new extent overwriting data
      
      This problem can be reproduced by generic/340 for the fallocated case
      for the last block in the file.
      
      Fix the problem by avoiding zeroing out the area we are mapping with
      ext4_map_blocks() in ext4_ext_convert_to_initialized(). It is pointless
      to zero out this area in the first place as the caller asked us to
      convert the area to initialized because he is just going to write data
      there before the transaction finishes. To achieve this we delete the
      special case of zeroing out full extent as that will be handled by the
      cases below zeroing only the part of the extent that needs it. We also
      instruct ext4_split_extent() that the middle of extent being split
      contains data so that ext4_split_extent_at() cannot zero out full extent
      in case of ENOSPC.
      
      CC: stable@vger.kernel.org
      Fixes: 12735f88Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      4f8caa60
  4. 25 5月, 2017 5 次提交
  5. 22 5月, 2017 5 次提交
    • K
      ext4: keep existing extra fields when inode expands · 887a9730
      Konstantin Khlebnikov 提交于
      ext4_expand_extra_isize() should clear only space between old and new
      size.
      
      Fixes: 6dd4ee7c # v2.6.23
      Cc: stable@vger.kernel.org
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      887a9730
    • K
      ext4: handle the rest of ext4_mb_load_buddy() ENOMEM errors · 9651e6b2
      Konstantin Khlebnikov 提交于
      I've got another report about breaking ext4 by ENOMEM error returned from
      ext4_mb_load_buddy() caused by memory shortage in memory cgroup.
      This time inside ext4_discard_preallocations().
      
      This patch replaces ext4_error() with ext4_warning() where errors returned
      from ext4_mb_load_buddy() are not fatal and handled by caller:
      * ext4_mb_discard_group_preallocations() - called before generating ENOSPC,
        we'll try to discard other group or return ENOSPC into user-space.
      * ext4_trim_all_free() - just stop trimming and return ENOMEM from ioctl.
      
      Some callers cannot handle errors, thus __GFP_NOFAIL is used for them:
      * ext4_discard_preallocations()
      * ext4_mb_discard_lg_preallocations()
      
      Fixes: adb7ef60 ("ext4: use __GFP_NOFAIL in ext4_free_blocks()")
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      9651e6b2
    • J
      ext4: fix off-by-in in loop termination in ext4_find_unwritten_pgoff() · 3f1d5bad
      Jan Kara 提交于
      There is an off-by-one error in loop termination conditions in
      ext4_find_unwritten_pgoff() since 'end' may index a page beyond end of
      desired range if 'endoff' is page aligned. It doesn't have any visible
      effects but still it is good to fix it.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      3f1d5bad
    • J
      ext4: fix SEEK_HOLE · 7d95eddf
      Jan Kara 提交于
      Currently, SEEK_HOLE implementation in ext4 may both return that there's
      a hole at some offset although that offset already has data and skip
      some holes during a search for the next hole. The first problem is
      demostrated by:
      
      xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "seek -h 0" file
      wrote 57344/57344 bytes at offset 0
      56 KiB, 14 ops; 0.0000 sec (2.054 GiB/sec and 538461.5385 ops/sec)
      Whence	Result
      HOLE	0
      
      Where we can see that SEEK_HOLE wrongly returned offset 0 as containing
      a hole although we have written data there. The second problem can be
      demonstrated by:
      
      xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "pwrite 128k 8k"
             -c "seek -h 0" file
      
      wrote 57344/57344 bytes at offset 0
      56 KiB, 14 ops; 0.0000 sec (1.978 GiB/sec and 518518.5185 ops/sec)
      wrote 8192/8192 bytes at offset 131072
      8 KiB, 2 ops; 0.0000 sec (2 GiB/sec and 500000.0000 ops/sec)
      Whence	Result
      HOLE	139264
      
      Where we can see that hole at offsets 56k..128k has been ignored by the
      SEEK_HOLE call.
      
      The underlying problem is in the ext4_find_unwritten_pgoff() which is
      just buggy. In some cases it fails to update returned offset when it
      finds a hole (when no pages are found or when the first found page has
      higher index than expected), in some cases conditions for detecting hole
      are just missing (we fail to detect a situation where indices of
      returned pages are not contiguous).
      
      Fix ext4_find_unwritten_pgoff() to properly detect non-contiguous page
      indices and also handle all cases where we got less pages then expected
      in one place and handle it properly there.
      
      CC: stable@vger.kernel.org
      Fixes: c8c0df24
      CC: Zheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      7d95eddf
    • J
      ext4: clear lockdep subtype for quota files on quota off · 964edf66
      Jan Kara 提交于
      Quota files have special ranking of i_data_sem lock. We inform lockdep
      about it when turning on quotas however when turning quotas off, we
      don't clear the lockdep subclass from i_data_sem lock and thus when the
      inode gets later reused for a normal file or directory, lockdep gets
      confused and complains about possible deadlocks. Fix the problem by
      resetting lockdep subclass of i_data_sem on quota off.
      
      Cc: stable@vger.kernel.org
      Fixes: daf647d2Reported-and-tested-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NAndreas Dilger <adilger@dilger.ca>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      964edf66
  6. 14 5月, 2017 1 次提交
    • D
      dax, xfs, ext4: compile out iomap-dax paths in the FS_DAX=n case · f5705aa8
      Dan Williams 提交于
      Tetsuo reports:
      
        fs/built-in.o: In function `xfs_file_iomap_end':
        xfs_iomap.c:(.text+0xe0ef9): undefined reference to `put_dax'
        fs/built-in.o: In function `xfs_file_iomap_begin':
        xfs_iomap.c:(.text+0xe1a7f): undefined reference to `dax_get_by_host'
        make: *** [vmlinux] Error 1
        $ grep DAX .config
        CONFIG_DAX=m
        # CONFIG_DEV_DAX is not set
        # CONFIG_FS_DAX is not set
      
      When FS_DAX=n we can/must throw away the dax code in filesystems.
      Implement 'fs_' versions of dax_get_by_host() and put_dax() that are
      nops in the FS_DAX=n case.
      
      Cc: <linux-xfs@vger.kernel.org>
      Cc: <linux-ext4@vger.kernel.org>
      Cc: Jan Kara <jack@suse.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Tested-by: NTony Luck <tony.luck@intel.com>
      Fixes: ef510424 ("block, dax: move 'select DAX' from BLOCK to FS_DAX")
      Reported-by: NTetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      f5705aa8
  7. 13 5月, 2017 1 次提交
  8. 09 5月, 2017 2 次提交
    • M
      mm: introduce kv[mz]alloc helpers · a7c3e901
      Michal Hocko 提交于
      Patch series "kvmalloc", v5.
      
      There are many open coded kmalloc with vmalloc fallback instances in the
      tree.  Most of them are not careful enough or simply do not care about
      the underlying semantic of the kmalloc/page allocator which means that
      a) some vmalloc fallbacks are basically unreachable because the kmalloc
      part will keep retrying until it succeeds b) the page allocator can
      invoke a really disruptive steps like the OOM killer to move forward
      which doesn't sound appropriate when we consider that the vmalloc
      fallback is available.
      
      As it can be seen implementing kvmalloc requires quite an intimate
      knowledge if the page allocator and the memory reclaim internals which
      strongly suggests that a helper should be implemented in the memory
      subsystem proper.
      
      Most callers, I could find, have been converted to use the helper
      instead.  This is patch 6.  There are some more relying on __GFP_REPEAT
      in the networking stack which I have converted as well and Eric Dumazet
      was not opposed [2] to convert them as well.
      
      [1] http://lkml.kernel.org/r/20170130094940.13546-1-mhocko@kernel.org
      [2] http://lkml.kernel.org/r/1485273626.16328.301.camel@edumazet-glaptop3.roam.corp.google.com
      
      This patch (of 9):
      
      Using kmalloc with the vmalloc fallback for larger allocations is a
      common pattern in the kernel code.  Yet we do not have any common helper
      for that and so users have invented their own helpers.  Some of them are
      really creative when doing so.  Let's just add kv[mz]alloc and make sure
      it is implemented properly.  This implementation makes sure to not make
      a large memory pressure for > PAGE_SZE requests (__GFP_NORETRY) and also
      to not warn about allocation failures.  This also rules out the OOM
      killer as the vmalloc is a more approapriate fallback than a disruptive
      user visible action.
      
      This patch also changes some existing users and removes helpers which
      are specific for them.  In some cases this is not possible (e.g.
      ext4_kvmalloc, libcfs_kvzalloc) because those seems to be broken and
      require GFP_NO{FS,IO} context which is not vmalloc compatible in general
      (note that the page table allocation is GFP_KERNEL).  Those need to be
      fixed separately.
      
      While we are at it, document that __vmalloc{_node} about unsupported gfp
      mask because there seems to be a lot of confusion out there.
      kvmalloc_node will warn about GFP_KERNEL incompatible (which are not
      superset) flags to catch new abusers.  Existing ones would have to die
      slowly.
      
      [sfr@canb.auug.org.au: f2fs fixup]
        Link: http://lkml.kernel.org/r/20170320163735.332e64b7@canb.auug.org.au
      Link: http://lkml.kernel.org/r/20170306103032.2540-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>	[ext4 part]
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a7c3e901
    • D
      block, dax: move "select DAX" from BLOCK to FS_DAX · ef510424
      Dan Williams 提交于
      For configurations that do not enable DAX filesystems or drivers, do not
      require the DAX core to be built.
      
      Given that the 'direct_access' method has been removed from
      'block_device_operations', we can also go ahead and remove the
      block-related dax helper functions from fs/block_dev.c to
      drivers/dax/super.c. This keeps dax details out of the block layer and
      lets the DAX core be built as a module in the FS_DAX=n case.
      
      Filesystems need to include dax.h to call bdev_dax_supported().
      
      Cc: linux-xfs@vger.kernel.org
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.com>
      Reported-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      ef510424
  9. 04 5月, 2017 5 次提交
    • E
      ext4: clean up ext4_match() and callers · d9b9f8d5
      Eric Biggers 提交于
      When ext4 encryption was originally merged, we were encrypting the
      user-specified filename in ext4_match(), introducing a lot of additional
      complexity into ext4_match() and its callers.  This has since been
      changed to encrypt the filename earlier, so we can remove the gunk
      that's no longer needed.  This more or less reverts ext4_search_dir()
      and ext4_find_dest_de() to the way they were in the v4.0 kernel.
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      d9b9f8d5
    • E
      ext4: switch to using fscrypt_match_name() · 067d1023
      Eric Biggers 提交于
      Switch ext4 directory searches to use the fscrypt_match_name() helper
      function.  There should be no functional change.
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      067d1023
    • E
      fscrypt: avoid collisions when presenting long encrypted filenames · 6b06cdee
      Eric Biggers 提交于
      When accessing an encrypted directory without the key, userspace must
      operate on filenames derived from the ciphertext names, which contain
      arbitrary bytes.  Since we must support filenames as long as NAME_MAX,
      we can't always just base64-encode the ciphertext, since that may make
      it too long.  Currently, this is solved by presenting long names in an
      abbreviated form containing any needed filesystem-specific hashes (e.g.
      to identify a directory block), then the last 16 bytes of ciphertext.
      This needs to be sufficient to identify the actual name on lookup.
      
      However, there is a bug.  It seems to have been assumed that due to the
      use of a CBC (ciphertext block chaining)-based encryption mode, the last
      16 bytes (i.e. the AES block size) of ciphertext would depend on the
      full plaintext, preventing collisions.  However, we actually use CBC
      with ciphertext stealing (CTS), which handles the last two blocks
      specially, causing them to appear "flipped".  Thus, it's actually the
      second-to-last block which depends on the full plaintext.
      
      This caused long filenames that differ only near the end of their
      plaintexts to, when observed without the key, point to the wrong inode
      and be undeletable.  For example, with ext4:
      
          # echo pass | e4crypt add_key -p 16 edir/
          # seq -f "edir/abcdefghijklmnopqrstuvwxyz012345%.0f" 100000 | xargs touch
          # find edir/ -type f | xargs stat -c %i | sort | uniq | wc -l
          100000
          # sync
          # echo 3 > /proc/sys/vm/drop_caches
          # keyctl new_session
          # find edir/ -type f | xargs stat -c %i | sort | uniq | wc -l
          2004
          # rm -rf edir/
          rm: cannot remove 'edir/_A7nNFi3rhkEQlJ6P,hdzluhODKOeWx5V': Structure needs cleaning
          ...
      
      To fix this, when presenting long encrypted filenames, encode the
      second-to-last block of ciphertext rather than the last 16 bytes.
      
      Although it would be nice to solve this without depending on a specific
      encryption mode, that would mean doing a cryptographic hash like SHA-256
      which would be much less efficient.  This way is sufficient for now, and
      it's still compatible with encryption modes like HEH which are strong
      pseudorandom permutations.  Also, changing the presented names is still
      allowed at any time because they are only provided to allow applications
      to do things like delete encrypted directories.  They're not designed to
      be used to persistently identify files --- which would be hard to do
      anyway, given that they're encrypted after all.
      
      For ease of backports, this patch only makes the minimal fix to both
      ext4 and f2fs.  It leaves ubifs as-is, since ubifs doesn't compare the
      ciphertext block yet.  Follow-on patches will clean things up properly
      and make the filesystems use a shared helper function.
      
      Fixes: 5de0b4d0 ("ext4 crypto: simplify and speed up filename encryption")
      Reported-by: NGwendal Grignou <gwendal@chromium.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      6b06cdee
    • E
      ext4: remove "nokey" check from ext4_lookup() · 8c68084b
      Eric Biggers 提交于
      Now that fscrypt_has_permitted_context() correctly handles the case
      where we have the key for the parent directory but not the child, we
      don't need to try to work around this in ext4_lookup().
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      8c68084b
    • J
      ext4: mark superblock writes synchronous for nobarrier mounts · 00473374
      Jan Kara 提交于
      Commit b685d3d6 "block: treat REQ_FUA and REQ_PREFLUSH as
      synchronous" removed REQ_SYNC flag from WRITE_FUA implementation.
      generic_make_request_checks() however strips REQ_FUA flag from a bio
      when the storage doesn't report volatile write cache and thus write
      effectively becomes asynchronous which can lead to performance
      regressions. This affects superblock writes for ext4. Fix the problem
      by marking superblock writes always as synchronous.
      
      Fixes: b685d3d6
      CC: linux-ext4@vger.kernel.org
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      00473374
  10. 02 5月, 2017 1 次提交
    • E
      ext4: inherit encryption xattr before other xattrs · aa1dca3b
      Eric Biggers 提交于
      When using both encryption and SELinux (or another feature that requires
      an xattr per file) on a filesystem with 256-byte inodes, each file's
      xattrs usually spill into an external xattr block.  Currently, the
      xattrs are inherited in the order ACL, security, then encryption.
      Therefore, if spillage occurs, the encryption xattr will always end up
      in the external block.  This is not ideal because the encryption xattrs
      contain a nonce, so they will always be unique and will prevent the
      external xattr blocks from being deduplicated.
      
      To improve the situation, change the inheritance order to encryption,
      ACL, then security.  This gives the encryption xattr a better chance to
      be stored in-inode, allowing the other xattr(s) to be deduplicated.
      
      Note that it may be better for userspace to format the filesystem with
      512-byte inodes in this case.  However, it's not the default.
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      aa1dca3b
  11. 01 5月, 2017 2 次提交
    • T
      ext4: replace BUG_ON with WARN_ONCE in ext4_end_bio() · 72d622b4
      Theodore Ts'o 提交于
      Add fallback code and a WARN_ONCE() call instead of a BUG_ON() in
      the ext4_end_bio() function.
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      72d622b4
    • J
      ext4: avoid unnecessary transaction stalls during writeback · dddbd6ac
      Jan Kara 提交于
      Currently ext4_writepages() submits all pages with transaction started.
      When no page needs block allocation or extent conversion we can submit
      all dirty pages in the inode while holding a single transaction handle
      and when device is congested this can take significant amount of time.
      Thus ext4_writepages() can block transaction commits for extended
      periods of time.
      
      Take for example a simple benchmark simulating PostgreSQL database
      (pgioperf in mmtest). The benchmark runs 16 processes doing random reads
      from a huge file, one process doing random writes to the huge file, and
      one process doing sequential writes to a small files and frequently
      running fsync. With unpatched kernel transaction commits take on average
      ~18s with standard deviation of ~41s, top 5 commit times are:
      
      274.466639s, 126.467347s, 86.992429s, 34.351563s, 31.517653s.
      
      After this patch transaction commits take on average 0.1s with standard
      deviation of 0.15s, top 5 commit times are:
      
      0.563792s, 0.519980s, 0.509841s, 0.471700s, 0.469899s
      
      [ Modified so we use an explicit do_map flag instead of relying on
        io_end not being allocated, the since io_end->inode is needed for I/O
        error handling. -- tytso ]
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      dddbd6ac
  12. 30 4月, 2017 5 次提交
    • A
      ext4: preload block group descriptors · 85c8f176
      Andrew Perepechko 提交于
      With enabled meta_bg option block group descriptors
      reading IO is not sequential and requires optimization.
      Signed-off-by: NAndrew Perepechko <andrew.perepechko@seagate.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      85c8f176
    • E
      ext4: make ext4_shutdown() static · 1a20a630
      Eric Biggers 提交于
      Make the ext4_shutdown() function static, as suggested by running sparse
      ('make C=2 fs/ext4/').  This was the only such warning in fs/ext4/.
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      1a20a630
    • D
      ext4: support GETFSMAP ioctls · 0c9ec4be
      Darrick J. Wong 提交于
      Support the GETFSMAP ioctls so that we can use the xfs free space
      management tools to probe ext4 as well.  Note that this is a partial
      implementation -- we only report fixed-location metadata and free space;
      everything else is reported as "unknown".
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      0c9ec4be
    • E
      ext4: evict inline data when writing to memory map · 7b4cc978
      Eric Biggers 提交于
      Currently the case of writing via mmap to a file with inline data is not
      handled.  This is maybe a rare case since it requires a writable memory
      map of a very small file, but it is trivial to trigger with on
      inline_data filesystem, and it causes the
      'BUG_ON(ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA));' in
      ext4_writepages() to be hit:
      
          mkfs.ext4 -O inline_data /dev/vdb
          mount /dev/vdb /mnt
          xfs_io -f /mnt/file \
      	-c 'pwrite 0 1' \
      	-c 'mmap -w 0 1m' \
      	-c 'mwrite 0 1' \
      	-c 'fsync'
      
      	kernel BUG at fs/ext4/inode.c:2723!
      	invalid opcode: 0000 [#1] SMP
      	CPU: 1 PID: 2532 Comm: xfs_io Not tainted 4.11.0-rc1-xfstests-00301-g071d9acf3d1f #633
      	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
      	task: ffff88003d3a8040 task.stack: ffffc90000300000
      	RIP: 0010:ext4_writepages+0xc89/0xf8a
      	RSP: 0018:ffffc90000303ca0 EFLAGS: 00010283
      	RAX: 0000028410000000 RBX: ffff8800383fa3b0 RCX: ffffffff812afcdc
      	RDX: 00000a9d00000246 RSI: ffffffff81e660e0 RDI: 0000000000000246
      	RBP: ffffc90000303dc0 R08: 0000000000000002 R09: 869618e8f99b4fa5
      	R10: 00000000852287a2 R11: 00000000a03b49f4 R12: ffff88003808e698
      	R13: 0000000000000000 R14: 7fffffffffffffff R15: 7fffffffffffffff
      	FS:  00007fd3e53094c0(0000) GS:ffff88003e400000(0000) knlGS:0000000000000000
      	CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      	CR2: 00007fd3e4c51000 CR3: 000000003d554000 CR4: 00000000003406e0
      	Call Trace:
      	 ? _raw_spin_unlock+0x27/0x2a
      	 ? kvm_clock_read+0x1e/0x20
      	 do_writepages+0x23/0x2c
      	 ? do_writepages+0x23/0x2c
      	 __filemap_fdatawrite_range+0x80/0x87
      	 filemap_write_and_wait_range+0x67/0x8c
      	 ext4_sync_file+0x20e/0x472
      	 vfs_fsync_range+0x8e/0x9f
      	 ? syscall_trace_enter+0x25b/0x2d0
      	 vfs_fsync+0x1c/0x1e
      	 do_fsync+0x31/0x4a
      	 SyS_fsync+0x10/0x14
      	 do_syscall_64+0x69/0x131
      	 entry_SYSCALL64_slow_path+0x25/0x25
      
      We could try to be smart and keep the inline data in this case, or at
      least support delayed allocation when allocating the block, but these
      solutions would be more complicated and don't seem worthwhile given how
      rare this case seems to be.  So just fix the bug by calling
      ext4_convert_inline_data() when we're asked to make a page writable, so
      that any inline data gets evicted, with the block allocated immediately.
      Reported-by: NNick Alcock <nick.alcock@oracle.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: NAndreas Dilger <adilger@dilger.ca>
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      7b4cc978
    • E
      ext4: remove ext4_xattr_check_entry() · 6ba644b9
      Eric Biggers 提交于
      ext4_xattr_check_entry() was redundant with validation of the full xattr
      entries list in ext4_xattr_check_entries(), which all callers also did.
      ext4_xattr_check_entry() also didn't actually do correct validation;
      specifically, it never checked that the value doesn't overlap the xattr
      names, nor did it account for padding when checking whether the xattr
      value overflows the available space.  So remove it to eliminate any
      potential confusion.
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      6ba644b9