1. 21 7月, 2015 7 次提交
    • K
      nfsd: Add macro NFS_ACL_MASK for ACL · 7b8f4586
      Kinglong Mee 提交于
      Signed-off-by: NKinglong Mee <kinglongmee@gmail.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      7b8f4586
    • K
      nfsd: Remove duplicate define of IDMAP_NAMESZ/IDMAP_TYPE_xx · e446d66d
      Kinglong Mee 提交于
      Just using the macro defined in nfs_idmap.h.
      Signed-off-by: NKinglong Mee <kinglongmee@gmail.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      e446d66d
    • K
      nfsd: Drop including client's header file nfs_fs.h · faf996a6
      Kinglong Mee 提交于
      nfs_fs.h is a client's header file.
      
      # ll fs/nfsd/nfs4acl.o fs/nfsd/nfsd.ko
      -rw-r--r--. 1 root root 328248 Jul  3 19:26 fs/nfsd/nfs4acl.o
      -rw-r--r--. 1 root root 7452016 Jul  3 19:26 fs/nfsd/nfsd.ko
      
      After this patch,
      # ll fs/nfsd/nfs4acl.o fs/nfsd/nfsd.ko
      -rw-r--r--. 1 root root 150872 Jul  3 19:15 fs/nfsd/nfs4acl.o
      -rw-r--r--. 1 root root 7273792 Jul  3 19:23 fs/nfsd/nfsd.ko
      Signed-off-by: NKinglong Mee <kinglongmee@gmail.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      faf996a6
    • K
      nfsd: Set lc_size_chg before ops->proc_layoutcommit · d8398fc1
      Kinglong Mee 提交于
      After proc_layoutcommit success, i_size_read(inode) always >= new_size.
      Just set lc_size_chg before proc_layoutcommit, if proc_layoutcommit
      failed, nfsd will skip the lc_size_chg, so it's no harm.
      Signed-off-by: NKinglong Mee <kinglongmee@gmail.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      d8398fc1
    • K
      nfsd: Fix a memory leak in nfsd4_list_rec_dir() · 4691b271
      Kinglong Mee 提交于
      If lookup_one_len() failed, nfsd should free those memory allocated for fname.
      Signed-off-by: NKinglong Mee <kinglongmee@gmail.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      4691b271
    • K
      nfsd: Fix a file leak on nfsd4_layout_setlease failure · 1ca4b88e
      Kinglong Mee 提交于
      If nfsd4_layout_setlease fails, nfsd will not put ls->ls_file.
      
      Fix commit c5c707f9 "nfsd: implement pNFS layout recalls".
      Signed-off-by: NKinglong Mee <kinglongmee@gmail.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      1ca4b88e
    • K
      nfsd: Drop BUG_ON and ignore SECLABEL on absent filesystem · c2227a39
      Kinglong Mee 提交于
      On an absent filesystem (one served by another server), we need to be
      able to handle requests for certain attributest (like fs_locations, so
      the client can find out which server does have the filesystem), but
      others we can't.
      
      We forgot to take that into account when adding another attribute
      bitmask work for the SECURITY_LABEL attribute.
      
      There an export entry with the "refer" option can result in:
      
      [   88.414272] kernel BUG at fs/nfsd/nfs4xdr.c:2249!
      [   88.414828] invalid opcode: 0000 [#1] SMP
      [   88.415368] Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache nfsd xfs libcrc32c iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi iosf_mbi ppdev btrfs coretemp crct10dif_pclmul crc32_pclmul crc32c_intel xor ghash_clmulni_intel raid6_pq vmw_balloon parport_pc parport i2c_piix4 shpchp vmw_vmci acpi_cpufreq auth_rpcgss nfs_acl lockd grace sunrpc vmwgfx drm_kms_helper ttm drm mptspi mptscsih serio_raw mptbase e1000 scsi_transport_spi ata_generic pata_acpi [last unloaded: nfsd]
      [   88.417827] CPU: 0 PID: 2116 Comm: nfsd Not tainted 4.0.7-300.fc22.x86_64 #1
      [   88.418448] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014
      [   88.419093] task: ffff880079146d50 ti: ffff8800785d8000 task.ti: ffff8800785d8000
      [   88.419729] RIP: 0010:[<ffffffffa04b3c10>]  [<ffffffffa04b3c10>] nfsd4_encode_fattr+0x820/0x1f00 [nfsd]
      [   88.420376] RSP: 0000:ffff8800785db998  EFLAGS: 00010206
      [   88.421027] RAX: 0000000000000001 RBX: 000000000018091a RCX: ffff88006668b980
      [   88.421676] RDX: 00000000fffef7fc RSI: 0000000000000000 RDI: ffff880078d05000
      [   88.422315] RBP: ffff8800785dbb58 R08: ffff880078d043f8 R09: ffff880078d4a000
      [   88.422968] R10: 0000000000010000 R11: 0000000000000002 R12: 0000000000b0a23a
      [   88.423612] R13: ffff880078d05000 R14: ffff880078683100 R15: ffff88006668b980
      [   88.424295] FS:  0000000000000000(0000) GS:ffff88007c600000(0000) knlGS:0000000000000000
      [   88.424944] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   88.425597] CR2: 00007f40bc370f90 CR3: 0000000035af5000 CR4: 00000000001407f0
      [   88.426285] Stack:
      [   88.426921]  ffff8800785dbaa8 ffffffffa049e4af ffff8800785dba08 ffffffff813298f0
      [   88.427585]  ffff880078683300 ffff8800769b0de8 0000089d00000001 0000000087f805e0
      [   88.428228]  ffff880000000000 ffff880079434a00 0000000000000000 ffff88006668b980
      [   88.428877] Call Trace:
      [   88.429527]  [<ffffffffa049e4af>] ? exp_get_by_name+0x7f/0xb0 [nfsd]
      [   88.430168]  [<ffffffff813298f0>] ? inode_doinit_with_dentry+0x210/0x6a0
      [   88.430807]  [<ffffffff8123833e>] ? d_lookup+0x2e/0x60
      [   88.431449]  [<ffffffff81236133>] ? dput+0x33/0x230
      [   88.432097]  [<ffffffff8123f214>] ? mntput+0x24/0x40
      [   88.432719]  [<ffffffff812272b2>] ? path_put+0x22/0x30
      [   88.433340]  [<ffffffffa049ac87>] ? nfsd_cross_mnt+0xb7/0x1c0 [nfsd]
      [   88.433954]  [<ffffffffa04b54e0>] nfsd4_encode_dirent+0x1b0/0x3d0 [nfsd]
      [   88.434601]  [<ffffffffa04b5330>] ? nfsd4_encode_getattr+0x40/0x40 [nfsd]
      [   88.435172]  [<ffffffffa049c991>] nfsd_readdir+0x1c1/0x2a0 [nfsd]
      [   88.435710]  [<ffffffffa049a530>] ? nfsd_direct_splice_actor+0x20/0x20 [nfsd]
      [   88.436447]  [<ffffffffa04abf30>] nfsd4_encode_readdir+0x120/0x220 [nfsd]
      [   88.437011]  [<ffffffffa04b58cd>] nfsd4_encode_operation+0x7d/0x190 [nfsd]
      [   88.437566]  [<ffffffffa04aa6dd>] nfsd4_proc_compound+0x24d/0x6f0 [nfsd]
      [   88.438157]  [<ffffffffa0496103>] nfsd_dispatch+0xc3/0x220 [nfsd]
      [   88.438680]  [<ffffffffa006f0cb>] svc_process_common+0x43b/0x690 [sunrpc]
      [   88.439192]  [<ffffffffa0070493>] svc_process+0x103/0x1b0 [sunrpc]
      [   88.439694]  [<ffffffffa0495a57>] nfsd+0x117/0x190 [nfsd]
      [   88.440194]  [<ffffffffa0495940>] ? nfsd_destroy+0x90/0x90 [nfsd]
      [   88.440697]  [<ffffffff810bb728>] kthread+0xd8/0xf0
      [   88.441260]  [<ffffffff810bb650>] ? kthread_worker_fn+0x180/0x180
      [   88.441762]  [<ffffffff81789e58>] ret_from_fork+0x58/0x90
      [   88.442322]  [<ffffffff810bb650>] ? kthread_worker_fn+0x180/0x180
      [   88.442879] Code: 0f 84 93 05 00 00 83 f8 ea c7 85 a0 fe ff ff 00 00 27 30 0f 84 ba fe ff ff 85 c0 0f 85 a5 fe ff ff e9 e3 f9 ff ff 0f 1f 44 00 00 <0f> 0b 66 0f 1f 44 00 00 be 04 00 00 00 4c 89 ef 4c 89 8d 68 fe
      [   88.444052] RIP  [<ffffffffa04b3c10>] nfsd4_encode_fattr+0x820/0x1f00 [nfsd]
      [   88.444658]  RSP <ffff8800785db998>
      [   88.445232] ---[ end trace 6cb9d0487d94a29f ]---
      Signed-off-by: NKinglong Mee <kinglongmee@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      c2227a39
  2. 18 7月, 2015 6 次提交
  3. 16 7月, 2015 1 次提交
    • D
      jfs: clean up jfs_rename and fix out of order unlock · 26456955
      Dave Kleikamp 提交于
      The end of jfs_rename(), which is also used by the error paths,
      included a call to IWRITE_UNLOCK(new_ip) after labels out1, out2
      and out3. If we come in through these labels, IWRITE_LOCK() has not
      been called yet.
      
      In moving that call to the correct spot, I also moved some
      exceptional truncate code earlier as well, since the early error
      paths don't need to deal with it, and I renamed out4: to out_tx: so
      a future patch by Jan Kara doesn't need to deal with renumbering or
      confusing out-of-order labels.
      Signed-off-by: NDave Kleikamp <dave.kleikamp@oracle.com>
      26456955
  4. 14 7月, 2015 1 次提交
    • F
      Btrfs: fix file corruption after cloning inline extents · ed958762
      Filipe Manana 提交于
      Using the clone ioctl (or extent_same ioctl, which calls the same extent
      cloning function as well) we end up allowing copy an inline extent from
      the source file into a non-zero offset of the destination file. This is
      something not expected and that the btrfs code is not prepared to deal
      with - all inline extents must be at a file offset equals to 0.
      
      For example, the following excerpt of a test case for fstests triggers
      a crash/BUG_ON() on a write operation after an inline extent is cloned
      into a non-zero offset:
      
        _scratch_mkfs >>$seqres.full 2>&1
        _scratch_mount
      
        # Create our test files. File foo has the same 2K of data at offset 4K
        # as file bar has at its offset 0.
        $XFS_IO_PROG -f -s -c "pwrite -S 0xaa 0 4K" \
            -c "pwrite -S 0xbb 4k 2K" \
            -c "pwrite -S 0xcc 8K 4K" \
            $SCRATCH_MNT/foo | _filter_xfs_io
      
        # File bar consists of a single inline extent (2K size).
        $XFS_IO_PROG -f -s -c "pwrite -S 0xbb 0 2K" \
           $SCRATCH_MNT/bar | _filter_xfs_io
      
        # Now call the clone ioctl to clone the extent of file bar into file
        # foo at its offset 4K. This made file foo have an inline extent at
        # offset 4K, something which the btrfs code can not deal with in future
        # IO operations because all inline extents are supposed to start at an
        # offset of 0, resulting in all sorts of chaos.
        # So here we validate that clone ioctl returns an EOPNOTSUPP, which is
        # what it returns for other cases dealing with inlined extents.
        $CLONER_PROG -s 0 -d $((4 * 1024)) -l $((2 * 1024)) \
            $SCRATCH_MNT/bar $SCRATCH_MNT/foo
      
        # Because of the inline extent at offset 4K, the following write made
        # the kernel crash with a BUG_ON().
        $XFS_IO_PROG -c "pwrite -S 0xdd 6K 2K" $SCRATCH_MNT/foo | _filter_xfs_io
      
        status=0
        exit
      
      The stack trace of the BUG_ON() triggered by the last write is:
      
        [152154.035903] ------------[ cut here ]------------
        [152154.036424] kernel BUG at mm/page-writeback.c:2286!
        [152154.036424] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        [152154.036424] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc acpi_cpu$
        [152154.036424] CPU: 2 PID: 17873 Comm: xfs_io Tainted: G        W       4.1.0-rc6-btrfs-next-11+ #2
        [152154.036424] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
        [152154.036424] task: ffff880429f70990 ti: ffff880429efc000 task.ti: ffff880429efc000
        [152154.036424] RIP: 0010:[<ffffffff8111a9d5>]  [<ffffffff8111a9d5>] clear_page_dirty_for_io+0x1e/0x90
        [152154.036424] RSP: 0018:ffff880429effc68  EFLAGS: 00010246
        [152154.036424] RAX: 0200000000000806 RBX: ffffea0006a6d8f0 RCX: 0000000000000001
        [152154.036424] RDX: 0000000000000000 RSI: ffffffff81155d1b RDI: ffffea0006a6d8f0
        [152154.036424] RBP: ffff880429effc78 R08: ffff8801ce389fe0 R09: 0000000000000001
        [152154.036424] R10: 0000000000002000 R11: ffffffffffffffff R12: ffff8800200dce68
        [152154.036424] R13: 0000000000000000 R14: ffff8800200dcc88 R15: ffff8803d5736d80
        [152154.036424] FS:  00007fbf119f6700(0000) GS:ffff88043d280000(0000) knlGS:0000000000000000
        [152154.036424] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [152154.036424] CR2: 0000000001bdc000 CR3: 00000003aa555000 CR4: 00000000000006e0
        [152154.036424] Stack:
        [152154.036424]  ffff8803d5736d80 0000000000000001 ffff880429effcd8 ffffffffa04e97c1
        [152154.036424]  ffff880429effd68 ffff880429effd60 0000000000000001 ffff8800200dc9c8
        [152154.036424]  0000000000000001 ffff8800200dcc88 0000000000000000 0000000000001000
        [152154.036424] Call Trace:
        [152154.036424]  [<ffffffffa04e97c1>] lock_and_cleanup_extent_if_need+0x147/0x18d [btrfs]
        [152154.036424]  [<ffffffffa04ea82c>] __btrfs_buffered_write+0x245/0x4c8 [btrfs]
        [152154.036424]  [<ffffffffa04ed14b>] ? btrfs_file_write_iter+0x150/0x3e0 [btrfs]
        [152154.036424]  [<ffffffffa04ed15a>] ? btrfs_file_write_iter+0x15f/0x3e0 [btrfs]
        [152154.036424]  [<ffffffffa04ed2c7>] btrfs_file_write_iter+0x2cc/0x3e0 [btrfs]
        [152154.036424]  [<ffffffff81165a4a>] __vfs_write+0x7c/0xa5
        [152154.036424]  [<ffffffff81165f89>] vfs_write+0xa0/0xe4
        [152154.036424]  [<ffffffff81166855>] SyS_pwrite64+0x64/0x82
        [152154.036424]  [<ffffffff81465197>] system_call_fastpath+0x12/0x6f
        [152154.036424] Code: 48 89 c7 e8 0f ff ff ff 5b 41 5c 5d c3 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb e8 ae ef 00 00 49 89 c4 48 8b 03 a8 01 75 02 <0f> 0b 4d 85 e4 74 59 49 8b 3c 2$
        [152154.036424] RIP  [<ffffffff8111a9d5>] clear_page_dirty_for_io+0x1e/0x90
        [152154.036424]  RSP <ffff880429effc68>
        [152154.242621] ---[ end trace e3d3376b23a57041 ]---
      
      Fix this by returning the error EOPNOTSUPP if an attempt to copy an
      inline extent into a non-zero offset happens, just like what is done for
      other scenarios that would require copying/splitting inline extents,
      which were introduced by the following commits:
      
         00fdf13a ("Btrfs: fix a crash of clone with inline extents's split")
         3f9e3df8 ("btrfs: replace error code from btrfs_drop_extents")
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      ed958762
  5. 13 7月, 2015 5 次提交
  6. 12 7月, 2015 7 次提交
    • A
      freeing unlinked file indefinitely delayed · 75a6f82a
      Al Viro 提交于
      	Normally opening a file, unlinking it and then closing will have
      the inode freed upon close() (provided that it's not otherwise busy and
      has no remaining links, of course).  However, there's one case where that
      does *not* happen.  Namely, if you open it by fhandle with cold dcache,
      then unlink() and close().
      
      	In normal case you get d_delete() in unlink(2) notice that dentry
      is busy and unhash it; on the final dput() it will be forcibly evicted from
      dcache, triggering iput() and inode removal.  In this case, though, we end
      up with *two* dentries - disconnected (created by open-by-fhandle) and
      regular one (used by unlink()).  The latter will have its reference to inode
      dropped just fine, but the former will not - it's considered hashed (it
      is on the ->s_anon list), so it will stay around until the memory pressure
      will finally do it in.  As the result, we have the final iput() delayed
      indefinitely.  It's trivial to reproduce -
      
      void flush_dcache(void)
      {
              system("mount -o remount,rw /");
      }
      
      static char buf[20 * 1024 * 1024];
      
      main()
      {
              int fd;
              union {
                      struct file_handle f;
                      char buf[MAX_HANDLE_SZ];
              } x;
              int m;
      
              x.f.handle_bytes = sizeof(x);
              chdir("/root");
              mkdir("foo", 0700);
              fd = open("foo/bar", O_CREAT | O_RDWR, 0600);
              close(fd);
              name_to_handle_at(AT_FDCWD, "foo/bar", &x.f, &m, 0);
              flush_dcache();
              fd = open_by_handle_at(AT_FDCWD, &x.f, O_RDWR);
              unlink("foo/bar");
              write(fd, buf, sizeof(buf));
              system("df .");			/* 20Mb eaten */
              close(fd);
              system("df .");			/* should've freed those 20Mb */
              flush_dcache();
              system("df .");			/* should be the same as #2 */
      }
      
      will spit out something like
      Filesystem     1K-blocks   Used Available Use% Mounted on
      /dev/root         322023 303843      1131 100% /
      Filesystem     1K-blocks   Used Available Use% Mounted on
      /dev/root         322023 303843      1131 100% /
      Filesystem     1K-blocks   Used Available Use% Mounted on
      /dev/root         322023 283282     21692  93% /
      - inode gets freed only when dentry is finally evicted (here we trigger
      than by remount; normally it would've happened in response to memory
      pressure hell knows when).
      
      Cc: stable@vger.kernel.org # v2.6.38+; earlier ones need s/kill_it/unhash_it/
      Acked-by: NJ. Bruce Fields <bfields@fieldses.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      75a6f82a
    • A
      fix a braino in ovl_d_select_inode() · 9391dd00
      Al Viro 提交于
      when opening a directory we want the overlayfs inode, not one from
      the topmost layer.
      Reported-By: NAndrey Jr. Melnikov <temnota.am@gmail.com>
      Tested-By: NAndrey Jr. Melnikov <temnota.am@gmail.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      9391dd00
    • A
      9p: don't leave a half-initialized inode sitting around · 0a73d0a2
      Al Viro 提交于
      Cc: stable@vger.kernel.org # all branches
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      0a73d0a2
    • F
      Btrfs: fix order by which delayed references are run · cffc3374
      Filipe Manana 提交于
      When we have an extent that got N references removed and N new references
      added in the same transaction, we must run the insertion of the references
      first because otherwise the last removed reference will remove the extent
      item from the extent tree, resulting in a failure for the insertions.
      
      This is a regression introduced in the 4.2-rc1 release and this fix just
      brings back the behaviour of selecting reference additions before any
      reference removals.
      
      The following test case for fstests reproduces the issue:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            _cleanup_flakey
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
        . ./common/dmflakey
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
        _require_dm_flakey
        _require_cloner
        _require_metadata_journaling $SCRATCH_DEV
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create prealloc extent covering range [160K, 620K[
        $XFS_IO_PROG -f -c "falloc 160K 460K" $SCRATCH_MNT/foo
      
        # Now write to the last 80K of the prealloc extent plus 40K to the unallocated
        # space that immediately follows it. This creates a new extent of 40K that spans
        # the range [620K, 660K[.
        $XFS_IO_PROG -c "pwrite -S 0xaa 540K 120K" $SCRATCH_MNT/foo | _filter_xfs_io
      
        # At this point, there are now 2 back references to the prealloc extent in our
        # extent tree. Both are for our file offset 160K and one relates to a file
        # extent item with a data offset of 0 and a length of 380K, while the other
        # relates to a file extent item with a data offset of 380K and a length of 80K.
      
        # Make sure everything done so far is durably persisted (all back references are
        # in the extent tree, etc).
        sync
      
        # Now clone all extents of our file that cover the offset 160K up to its eof
        # (660K at this point) into itself at offset 2M. This leaves a hole in the file
        # covering the range [660K, 2M[. The prealloc extent will now be referenced by
        # the file twice, once for offset 160K and once for offset 2M. The 40K extent
        # that follows the prealloc extent will also be referenced twice by our file,
        # once for offset 620K and once for offset 2M + 460K.
        $CLONER_PROG -s $((160 * 1024)) -d $((2 * 1024 * 1024)) -l 0 $SCRATCH_MNT/foo \
      	$SCRATCH_MNT/foo
      
        # Now create one new extent in our file with a size of 100Kb. It will span the
        # range [3M, 3M + 100K[. It also will cause creation of a hole spanning the
        # range [2M + 460K, 3M[. Our new file size is 3M + 100K.
        $XFS_IO_PROG -c "pwrite -S 0xbb 3M 100K" $SCRATCH_MNT/foo | _filter_xfs_io
      
        # At this point, there are now (in memory) 4 back references to the prealloc
        # extent.
        #
        # Two of them are for file offset 160K, related to file extent items
        # matching the file offsets 160K and 540K respectively, with data offsets of
        # 0 and 380K respectively, and with lengths of 380K and 80K respectively.
        #
        # The other two references are for file offset 2M, related to file extent items
        # matching the file offsets 2M and 2M + 380K respectively, with data offsets of
        # 0 and 380K respectively, and with lengths of 389K and 80K respectively.
        #
        # The 40K extent has 2 back references, one for file offset 620K and the other
        # for file offset 2M + 460K.
        #
        # The 100K extent has a single back reference and it relates to file offset 3M.
      
        # Now clone our 100K extent into offset 600K. That offset covers the last 20K
        # of the prealloc extent, the whole 40K extent and 40K of the hole starting at
        # offset 660K.
        $CLONER_PROG -s $((3 * 1024 * 1024)) -d $((600 * 1024)) -l $((100 * 1024)) \
            $SCRATCH_MNT/foo $SCRATCH_MNT/foo
      
        # At this point there's only one reference to the 40K extent, at file offset
        # 2M + 460K, we have 4 references for the prealloc extent (2 for file offset
        # 160K and 2 for file offset 2M) and 2 references for the 100K extent (1 for
        # file offset 3M and a new one for file offset 600K).
      
        # Now fsync our file to make all its new data and metadata updates are durably
        # persisted and present if a power failure/crash happens after a successful
        # fsync and before the next transaction commit.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
      
        echo "File digest before power failure:"
        md5sum $SCRATCH_MNT/foo | _filter_scratch
      
        # Silently drop all writes and ummount to simulate a crash/power failure.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        # Allow writes again, mount to trigger log replay and validate file contents.
        # During log replay, the btrfs delayed references implementation used to run the
        # deletion of back references before the addition of new back references, which
        # made the addition fail as it didn't find the key in the extent tree that it
        # was looking for. The failure triggered by this test was related to the 40K
        # extent, which got 1 reference dropped and 1 reference added during the fsync
        # log replay - when running the delayed references at transaction commit time,
        # btrfs was applying the deletion before the insertion, resulting in a failure
        # of the insertion that ended up turning the fs into read-only mode.
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        echo "File digest after log replay:"
        md5sum $SCRATCH_MNT/foo | _filter_scratch
      
        _unmount_flakey
      
        status=0
        exit
      
      This issue turned the filesystem into read-only mode (current transaction
      aborted) and produced the following traces:
      
        [ 8247.578385] ------------[ cut here ]------------
        [ 8247.579947] WARNING: CPU: 0 PID: 11341 at fs/btrfs/extent-tree.c:1547 lookup_inline_extent_backref+0x17d/0x45d [btrfs]()
        (...)
        [ 8247.601697] Call Trace:
        [ 8247.602222]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
        [ 8247.604320]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
        [ 8247.605488]  [<ffffffffa0506c8d>] ? lookup_inline_extent_backref+0x17d/0x45d [btrfs]
        [ 8247.608226]  [<ffffffffa0506c8d>] lookup_inline_extent_backref+0x17d/0x45d [btrfs]
        [ 8247.617061]  [<ffffffffa0507957>] insert_inline_extent_backref+0x41/0xb2 [btrfs]
        [ 8247.621856]  [<ffffffffa0507c4f>] __btrfs_inc_extent_ref+0x8c/0x20a [btrfs]
        [ 8247.624366]  [<ffffffffa050ee60>] __btrfs_run_delayed_refs+0xb0c/0xd49 [btrfs]
        [ 8247.626176]  [<ffffffffa0510dcd>] btrfs_run_delayed_refs+0x6d/0x1d4 [btrfs]
        [ 8247.627435]  [<ffffffff81155c9b>] ? __cache_free+0x4a7/0x4b6
        [ 8247.628531]  [<ffffffffa0520482>] btrfs_commit_transaction+0x4c/0xa20 [btrfs]
        (...)
        [ 8247.648430] ---[ end trace 2461e55f92c2ac2d ]---
      
        [ 8247.727263] WARNING: CPU: 3 PID: 11341 at fs/btrfs/extent-tree.c:2771 btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]()
        [ 8247.728954] BTRFS: Transaction aborted (error -5)
        (...)
        [ 8247.760866] Call Trace:
        [ 8247.761534]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
        [ 8247.764271]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
        [ 8247.767582]  [<ffffffffa0510e04>] ? btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]
        [ 8247.769373]  [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48
        [ 8247.770836]  [<ffffffffa0510e04>] btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]
        [ 8247.772532]  [<ffffffff81155c9b>] ? __cache_free+0x4a7/0x4b6
        [ 8247.773664]  [<ffffffffa0520482>] btrfs_commit_transaction+0x4c/0xa20 [btrfs]
        [ 8247.775047]  [<ffffffff81087310>] ? trace_hardirqs_on+0xd/0xf
        [ 8247.776176]  [<ffffffff81155dd5>] ? kmem_cache_free+0x12b/0x189
        [ 8247.777427]  [<ffffffffa055a920>] btrfs_recover_log_trees+0x2da/0x33d [btrfs]
        [ 8247.778575]  [<ffffffffa055898e>] ? replay_one_extent+0x4fc/0x4fc [btrfs]
        [ 8247.779838]  [<ffffffffa051e265>] open_ctree+0x1cc0/0x201a [btrfs]
        [ 8247.781020]  [<ffffffff81120f48>] ? register_shrinker+0x56/0x81
        [ 8247.782285]  [<ffffffffa04fb12c>] btrfs_mount+0x5f0/0x734 [btrfs]
        (...)
        [ 8247.793394] ---[ end trace 2461e55f92c2ac2e ]---
        [ 8247.794276] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2771: errno=-5 IO failure
        [ 8247.797335] BTRFS: error (device dm-0) in btrfs_replay_log:2375: errno=-5 IO failure (Failed to recover log tree)
      
      Fixes: c6fc2454 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Acked-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      cffc3374
    • F
      Btrfs: fix list transaction->pending_ordered corruption · d3efe084
      Filipe Manana 提交于
      When we call btrfs_commit_transaction(), we splice the list "ordered"
      of our transaction handle into the transaction's "pending_ordered"
      list, but we don't re-initialize the "ordered" list of our transaction
      handle, this means it still points to the same elements it used to
      before the splice. Then we check if the current transaction's state is
      >= TRANS_STATE_COMMIT_START and if it is we end up calling
      btrfs_end_transaction() which simply splices again the "ordered" list
      of our handle into the transaction's "pending_ordered" list, leaving
      multiple pointers to the same ordered extents which results in list
      corruption when we are iterating, removing and freeing ordered extents
      at btrfs_wait_pending_ordered(), resulting in access to dangling
      pointers / use-after-free issues.
      Similarly, btrfs_end_transaction() can end up in some cases calling
      btrfs_commit_transaction(), and both did a list splice of the transaction
      handle's "ordered" list into the transaction's "pending_ordered" without
      re-initializing the handle's "ordered" list, resulting in exactly the
      same problem.
      
      This produces the following warning on a kernel with linked list
      debugging enabled:
      
      [109749.265416] ------------[ cut here ]------------
      [109749.266410] WARNING: CPU: 7 PID: 324 at lib/list_debug.c:59 __list_del_entry+0x5a/0x98()
      [109749.267969] list_del corruption. prev->next should be ffff8800ba087e20, but was fffffff8c1f7c35d
      (...)
      [109749.287505] Call Trace:
      [109749.288135]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
      [109749.298080]  [<ffffffff81095de5>] ? console_unlock+0x356/0x3a2
      [109749.331605]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
      [109749.334849]  [<ffffffff81260642>] ? __list_del_entry+0x5a/0x98
      [109749.337093]  [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48
      [109749.337847]  [<ffffffff81260642>] __list_del_entry+0x5a/0x98
      [109749.338678]  [<ffffffffa053e8bf>] btrfs_wait_pending_ordered+0x46/0xdb [btrfs]
      [109749.340145]  [<ffffffffa058a65f>] ? __btrfs_run_delayed_items+0x149/0x163 [btrfs]
      [109749.348313]  [<ffffffffa054077d>] btrfs_commit_transaction+0x36b/0xa10 [btrfs]
      [109749.349745]  [<ffffffff81087310>] ? trace_hardirqs_on+0xd/0xf
      [109749.350819]  [<ffffffffa055370d>] btrfs_sync_file+0x36f/0x3fc [btrfs]
      [109749.351976]  [<ffffffff8118ec98>] vfs_fsync_range+0x8f/0x9e
      [109749.360341]  [<ffffffff8118ecc3>] vfs_fsync+0x1c/0x1e
      [109749.368828]  [<ffffffff8118ee1d>] do_fsync+0x34/0x4e
      [109749.369790]  [<ffffffff8118f045>] SyS_fsync+0x10/0x14
      [109749.370925]  [<ffffffff81465197>] system_call_fastpath+0x12/0x6f
      [109749.382274] ---[ end trace 48e0d07f7c03d95a ]---
      
      On a non-debug kernel this leads to invalid memory accesses, causing a
      crash. Fix this by using list_splice_init() instead of list_splice() in
      btrfs_commit_transaction() and btrfs_end_transaction().
      
      Cc: stable@vger.kernel.org
      Fixes: 50d9aa99 ("Btrfs: make sure logged extents complete in the current transaction V3"
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      d3efe084
    • F
      Btrfs: fix memory leak in the extent_same ioctl · 497b4050
      Filipe Manana 提交于
      We were allocating memory with memdup_user() but we were never releasing
      that memory. This affected pretty much every call to the ioctl, whether
      it deduplicated extents or not.
      
      This issue was reported on IRC by Julian Taylor and on the mailing list
      by Marcel Ritter, credit goes to them for finding the issue.
      Reported-by: NJulian Taylor <jtaylor.debian@googlemail.com>
      Reported-by: NMarcel Ritter <ritter.marcel@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NMark Fasheh <mfasheh@suse.de>
      497b4050
    • F
      Btrfs: fix shrinking truncate when the no_holes feature is enabled · c1aa4575
      Filipe Manana 提交于
      If the no_holes feature is enabled, we attempt to shrink a file to a size
      that ends up in the middle of a hole and we don't have any file extent
      items in the fs/subvol tree that go beyond the new file size (or any
      ordered extents that will insert such file extent items), we end up not
      updating the inode's disk_i_size, we only update the inode's i_size.
      
      This means that after unmounting and mounting the filesystem, or after
      the inode is evicted and reloaded, its i_size ends up being incorrect
      (an inode's i_size is set to the disk_i_size field when an inode is
      loaded). This happens when btrfs_truncate_inode_items() doesn't find
      any file extent items to drop - in this case it never makes a call to
      btrfs_ordered_update_i_size() in order to update the inode's disk_i_size.
      
      Example reproducer:
      
        $ mkfs.btrfs -O no-holes -f /dev/sdd
        $ mount /dev/sdd /mnt
      
        # Create our test file with some data and durably persist it.
        $ xfs_io -f -c "pwrite -S 0xaa 0 128K" /mnt/foo
        $ sync
      
        # Append some data to the file, increasing its size, and leave a hole
        # between the old size and the start offset if the following write. So
        # our file gets a hole in the range [128Kb, 256Kb[.
        $ xfs_io -c "truncate 160K" /mnt/foo
      
        # We expect to see our file with a size of 160Kb, with the first 128Kb
        # of data all having the value 0xaa and the remaining 32Kb of data all
        # having the value 0x00.
        $ od -t x1 /mnt/foo
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0400000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        *
        0500000
      
        # Now cleanly unmount and mount again the filesystem.
        $ umount /mnt
        $ mount /dev/sdd /mnt
      
        # We expect to get the same result as before, a file with a size of
        # 160Kb, with the first 128Kb of data all having the value 0xaa and the
        # remaining 32Kb of data all having the value 0x00.
        $ od -t x1 /mnt/foo
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0400000
      
      In the example above the file size/data do not match what they were before
      the remount.
      
      Fix this by always calling btrfs_ordered_update_i_size() with a size
      matching the size the file was truncated to if btrfs_truncate_inode_items()
      is not called for a log tree and no file extent items were dropped. This
      ensures the same behaviour as when the no_holes feature is not enabled.
      
      A test case for fstests follows soon.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      c1aa4575
  7. 10 7月, 2015 5 次提交
  8. 06 7月, 2015 1 次提交
  9. 05 7月, 2015 3 次提交
  10. 04 7月, 2015 4 次提交
    • N
      sched/stat: Expose /proc/pid/schedstat if CONFIG_SCHED_INFO=y · 5968cece
      Naveen N. Rao 提交于
      Expand /proc/pid/schedstat output:
      
       - enable it on CONFIG_TASK_DELAY_ACCT=y && !CONFIG_SCHEDSTATS kernels.
      
       - dump all zeroes on kernels that are booted with the 'nodelayacct'
         option, which boot option disables delay accounting on
         CONFIG_TASK_DELAY_ACCT=y kernels.
      Signed-off-by: NNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: a.p.zijlstra@chello.nl
      Cc: ricklind@us.ibm.com
      Link: http://lkml.kernel.org/r/5ccbef17d4bc841084ea6e6421d4e4a23b7b806f.1435654789.git.naveen.n.rao@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      5968cece
    • E
      ext4: correctly migrate a file with a hole at the beginning · 8974fec7
      Eryu Guan 提交于
      Currently ext4_ind_migrate() doesn't correctly handle a file which
      contains a hole at the beginning of the file.  This caused the migration
      to be done incorrectly, and then if there is a subsequent following
      delayed allocation write to the "hole", this would reclaim the same data
      blocks again and results in fs corruption.
      
        # assmuing 4k block size ext4, with delalloc enabled
        # skip the first block and write to the second block
        xfs_io -fc "pwrite 4k 4k" -c "fsync" /mnt/ext4/testfile
      
        # converting to indirect-mapped file, which would move the data blocks
        # to the beginning of the file, but extent status cache still marks
        # that region as a hole
        chattr -e /mnt/ext4/testfile
      
        # delayed allocation writes to the "hole", reclaim the same data block
        # again, results in i_blocks corruption
        xfs_io -c "pwrite 0 4k" /mnt/ext4/testfile
        umount /mnt/ext4
        e2fsck -nf /dev/sda6
        ...
        Inode 53, i_blocks is 16, should be 8.  Fix? no
        ...
      Signed-off-by: NEryu Guan <guaneryu@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      8974fec7
    • E
      ext4: be more strict when migrating to non-extent based file · d6f123a9
      Eryu Guan 提交于
      Currently the check in ext4_ind_migrate() is not enough before doing the
      real conversion:
      
      a) delayed allocated extents could bypass the check on eh->eh_entries
         and eh->eh_depth
      
      This can be demonstrated by this script
      
        xfs_io -fc "pwrite 0 4k" -c "pwrite 8k 4k" /mnt/ext4/testfile
        chattr -e /mnt/ext4/testfile
      
      where testfile has two extents but still be converted to non-extent
      based file format.
      
      b) only extent length is checked but not the offset, which would result
         in data lose (delalloc) or fs corruption (nodelalloc), because
         non-extent based file only supports at most (12 + 2^10 + 2^20 + 2^30)
         blocks
      
      This can be demostrated by
      
        xfs_io -fc "pwrite 5T 4k" /mnt/ext4/testfile
        chattr -e /mnt/ext4/testfile
        sync
      
      If delalloc is enabled, dmesg prints
        EXT4-fs warning (device dm-4): ext4_block_to_path:105: block 1342177280 > max in inode 53
        EXT4-fs (dm-4): Delayed block allocation failed for inode 53 at logical offset 1342177280 with max blocks 1 with error 5
        EXT4-fs (dm-4): This should not happen!! Data will be lost
      
      If delalloc is disabled, e2fsck -nf shows corruption
        Inode 53, i_size is 5497558142976, should be 4096.  Fix? no
      
      Fix the two issues by
      
      a) forcing all delayed allocation blocks to be allocated before checking
         eh->eh_depth and eh->eh_entries
      b) limiting the last logical block of the extent is within direct map
      Signed-off-by: NEryu Guan <guaneryu@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      d6f123a9
    • L
      ext4: fix reservation release on invalidatepage for delalloc fs · 9705acd6
      Lukas Czerner 提交于
      On delalloc enabled file system on invalidatepage operation
      in ext4_da_page_release_reservation() we want to clear the delayed
      buffer and remove the extent covering the delayed buffer from the extent
      status tree.
      
      However currently there is a bug where on the systems with page size >
      block size we will always remove extents from the start of the page
      regardless where the actual delayed buffers are positioned in the page.
      This leads to the errors like this:
      
      EXT4-fs warning (device loop0): ext4_da_release_space:1225:
      ext4_da_release_space: ino 13, to_free 1 with only 0 reserved data
      blocks
      
      This however can cause data loss on writeback time if the file system is
      in ENOSPC condition because we're releasing reservation for someones
      else delayed buffer.
      
      Fix this by only removing extents that corresponds to the part of the
      page we want to invalidate.
      
      This problem is reproducible by the following fio receipt (however I was
      only able to reproduce it with fio-2.1 or older.
      
      [global]
      bs=8k
      iodepth=1024
      iodepth_batch=60
      randrepeat=1
      size=1m
      directory=/mnt/test
      numjobs=20
      [job1]
      ioengine=sync
      bs=1k
      direct=1
      rw=randread
      filename=file1:file2
      [job2]
      ioengine=libaio
      rw=randwrite
      direct=1
      filename=file1:file2
      [job3]
      bs=1k
      ioengine=posixaio
      rw=randwrite
      direct=1
      filename=file1:file2
      [job5]
      bs=1k
      ioengine=sync
      rw=randread
      filename=file1:file2
      [job7]
      ioengine=libaio
      rw=randwrite
      filename=file1:file2
      [job8]
      ioengine=posixaio
      rw=randwrite
      filename=file1:file2
      [job10]
      ioengine=mmap
      rw=randwrite
      bs=1k
      filename=file1:file2
      [job11]
      ioengine=mmap
      rw=randwrite
      direct=1
      filename=file1:file2
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.org
      9705acd6