1. 03 Apr, 2017 4 commits
    • ext4: Add statx support · 99652ea5
      David Howells committed
      Return enhanced file attributes from the Ext4 filesystem.  This includes
      the following:
      
       (1) The inode creation time (i_crtime) as stx_btime, setting STATX_BTIME.
      
       (2) Certain FS_xxx_FL flags are mapped to stx_attribute flags.
      
      This requires that all ext4 inodes have a getattr call, not just some of
      them, so to this end, split the ext4_getattr() function and only call part
      of it where appropriate.
      
      Example output:
      
      	[root@andromeda ~]# touch foo
      	[root@andromeda ~]# chattr +ai foo
      	[root@andromeda ~]# /tmp/test-statx foo
      	statx(foo) = 0
      	results=fff
      	  Size: 0               Blocks: 0          IO Block: 4096    regular file
      	Device: 08:12           Inode: 2101950     Links: 1
      	Access: (0644/-rw-r--r--)  Uid:     0   Gid:     0
      	Access: 2016-02-11 17:08:29.031795451+0000
      	Modify: 2016-02-11 17:08:29.031795451+0000
      	Change: 2016-02-11 17:11:11.987790114+0000
      	 Birth: 2016-02-11 17:08:29.031795451+0000
      	Attributes: 0000000000000030 (-------- -------- -------- -------- -------- -------- -------- --ai----)
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • statx: optimize copy of struct statx to userspace · 64bd7204
      Eric Biggers committed
      I found that statx() was significantly slower than stat().  As a
      microbenchmark, I compared 10,000,000 invocations of fstat() on a tmpfs
      file to the same with statx() passed a NULL path:
      
      	$ time ./stat_benchmark
      
      	real	0m1.464s
      	user	0m0.275s
      	sys	0m1.187s
      
      	$ time ./statx_benchmark
      
      	real	0m5.530s
      	user	0m0.281s
      	sys	0m5.247s
      
      statx is expected to be a little slower than stat because struct statx
      is larger than struct stat, but not by *that* much.  It turns out that
      most of the overhead was in copying struct statx to userspace, mostly in
      all the stac/clac instructions that got generated for each __put_user()
      call.  (This was on x86_64, but some other architectures, e.g. arm64,
      have something similar now too.)
      
      stat() instead initializes its struct on the stack and copies it to
      userspace with a single call to copy_to_user().  This turns out to be
      much faster, and changing statx to do this makes it almost as fast as
      stat:
      
      	$ time ./statx_benchmark
      
      	real	0m1.624s
      	user	0m0.270s
      	sys	0m1.354s
      
      For zeroing the reserved fields, start by zeroing the full struct with
      memset.  This makes it clear that every byte copied to userspace is
      initialized, even implicit padding bytes (though there are none
      currently).  In the scenarios I tested, it also performed the same as a
      designated initializer.  Manually initializing each field was still
      slightly faster, but would have been more error-prone and less
      verifiable.
      
      Also rename statx_set_result() to cp_statx() for consistency with
      cp_old_stat() et al., and make it noinline so that struct statx doesn't
      add to the stack usage during the main portion of the syscall execution.
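
The "zero the whole struct, fill it, copy it out once" pattern described above can be sketched in userspace. This is only an illustration: `struct record` and `fill_record()` are hypothetical names, and `memcpy()` stands in for the kernel's copy_to_user().

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record with implicit padding between 'kind' and 'size'. */
struct record {
	char kind;
	long long size;
};

/*
 * Build the record in a local, fully zeroed struct and copy it out once,
 * instead of storing each field into the destination separately.
 */
void fill_record(void *dst, char kind, long long size)
{
	struct record tmp;

	memset(&tmp, 0, sizeof(tmp));	/* zeroes fields and padding alike */
	tmp.kind = kind;
	tmp.size = size;
	memcpy(dst, &tmp, sizeof(tmp));	/* stands in for copy_to_user() */
}
```

Because the memset covers the whole object, the destination never receives stale stack bytes, which is the property the commit message calls "clear that every byte copied to userspace is initialized".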
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • statx: remove incorrect part of vfs_statx() comment · b15fb70b
      Eric Biggers committed
      request_mask and query_flags are function arguments, not passed in
      struct kstat.  So remove the part of the comment which claims otherwise.
      This was apparently left over from an earlier version of the statx
      patch.
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • statx: reject unknown flags when using NULL path · 8c7493aa
      Eric Biggers committed
      The statx() system call currently accepts unknown flags when called with
      a NULL path to operate on a file descriptor.  Left unchanged, this could
      make it hard to introduce new query flags in the future, since
      applications may not be able to tell whether a given flag is supported.
      
      Fix this by failing the system call with EINVAL if any flags other than
      KSTAT_QUERY_FLAGS are specified in combination with a NULL path.
      
      Arguably, we could still permit known lookup-related flags such as
      AT_SYMLINK_NOFOLLOW.  However, that would be inconsistent with how
      sys_utimensat() behaves when passed a NULL path, which seems to be the
      closest precedent.  And given that the NULL path case is (I believe)
      mainly intended to be used to implement a wrapper function like fstatx()
      that doesn't have a path argument, I think rejecting lookup-related
      flags too is probably the best choice.
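
The check described above amounts to a mask test. The following is a minimal sketch, not the kernel code: `check_statx_flags()` is a hypothetical helper, the `EX_` constants are illustrative stand-ins for the kernel definitions, and only the NULL-path case is modelled.

```c
#include <errno.h>
#include <stddef.h>

/* Illustrative values; the real definitions live in the kernel headers. */
#define EX_AT_SYMLINK_NOFOLLOW	0x0100u
#define EX_QUERY_FLAGS		0x6000u	/* stand-in for KSTAT_QUERY_FLAGS */

/*
 * With a NULL path, any flag outside the query-flag mask is rejected,
 * including otherwise valid lookup flags. Flags passed together with a
 * path are not examined in this sketch.
 */
int check_statx_flags(const char *path, unsigned int flags)
{
	if (path == NULL && (flags & ~EX_QUERY_FLAGS))
		return -EINVAL;
	return 0;
}
```

Rejecting unknown bits up front is what lets a later kernel assign meaning to them: an application can probe for a new flag and rely on EINVAL from kernels that do not know it.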
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  2. 01 Apr, 2017 3 commits
    • hugetlbfs: initialize shared policy as part of inode allocation · 4742a35d
      Mike Kravetz committed
      Any time after inode allocation, destroy_inode can be called.  The
      hugetlbfs inode contains a shared_policy structure, and
      mpol_free_shared_policy is unconditionally called as part of
      hugetlbfs_destroy_inode.  Initialize the policy as part of inode
      allocation so that any quick (error path) calls to destroy_inode will be
      handed an initialized policy.
      
      syzkaller fuzzer found this bug, that resulted in the following:
      
          BUG: KASAN: user-memory-access in atomic_inc
          include/asm-generic/atomic-instrumented.h:87 [inline] at addr
          000000131730bd7a
          BUG: KASAN: user-memory-access in __lock_acquire+0x21a/0x3a80
          kernel/locking/lockdep.c:3239 at addr 000000131730bd7a
          Write of size 4 by task syz-executor6/14086
          CPU: 3 PID: 14086 Comm: syz-executor6 Not tainted 4.11.0-rc3+ #364
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
          Call Trace:
           atomic_inc include/asm-generic/atomic-instrumented.h:87 [inline]
           __lock_acquire+0x21a/0x3a80 kernel/locking/lockdep.c:3239
           lock_acquire+0x1ee/0x590 kernel/locking/lockdep.c:3762
           __raw_write_lock include/linux/rwlock_api_smp.h:210 [inline]
           _raw_write_lock+0x33/0x50 kernel/locking/spinlock.c:295
           mpol_free_shared_policy+0x43/0xb0 mm/mempolicy.c:2536
           hugetlbfs_destroy_inode+0xca/0x120 fs/hugetlbfs/inode.c:952
           alloc_inode+0x10d/0x180 fs/inode.c:216
           new_inode_pseudo+0x69/0x190 fs/inode.c:889
           new_inode+0x1c/0x40 fs/inode.c:918
           hugetlbfs_get_inode+0x40/0x420 fs/hugetlbfs/inode.c:734
           hugetlb_file_setup+0x329/0x9f0 fs/hugetlbfs/inode.c:1282
           newseg+0x422/0xd30 ipc/shm.c:575
           ipcget_new ipc/util.c:285 [inline]
           ipcget+0x21e/0x580 ipc/util.c:639
           SYSC_shmget ipc/shm.c:673 [inline]
           SyS_shmget+0x158/0x230 ipc/shm.c:657
           entry_SYSCALL_64_fastpath+0x1f/0xc2
      
      Analysis provided by Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      
      Link: http://lkml.kernel.org/r/1490477850-7944-1-git-send-email-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • nfs: flexfiles: fix kernel OOPS if MDS returns unsupported DS type · f17f8a14
      Tigran Mkrtchyan committed
      This fix addresses the dereferencing of a mirror in an error state when the MDS
      returns an unsupported DS type (IOW, not v3), which causes the following oops:
      
      [  220.370709] BUG: unable to handle kernel NULL pointer dereference at 0000000000000065
      [  220.370842] IP: ff_layout_mirror_valid+0x2d/0x110 [nfs_layout_flexfiles]
      [  220.370920] PGD 0
      
      [  220.370972] Oops: 0000 [#1] SMP
      [  220.371013] Modules linked in: nfnetlink_queue nfnetlink_log bluetooth nfs_layout_flexfiles rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_raw ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security iptable_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_security ebtable_filter ebtables ip6table_filter ip6_tables binfmt_misc intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel btrfs kvm arc4 snd_hda_codec_hdmi iwldvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate mac80211 xor uvcvideo
      [  220.371814]  videobuf2_vmalloc videobuf2_memops snd_hda_codec_idt mei_wdt videobuf2_v4l2 snd_hda_codec_generic iTCO_wdt ppdev videobuf2_core iTCO_vendor_support dell_rbtn dell_wmi iwlwifi sparse_keymap dell_laptop dell_smbios snd_hda_intel dcdbas videodev snd_hda_codec dell_smm_hwmon snd_hda_core media cfg80211 intel_uncore snd_hwdep raid6_pq snd_seq intel_rapl_perf snd_seq_device joydev i2c_i801 rfkill lpc_ich snd_pcm parport_pc mei_me parport snd_timer dell_smo8800 mei snd shpchp soundcore tpm_tis tpm_tis_core tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc i915 nouveau mxm_wmi ttm i2c_algo_bit drm_kms_helper crc32c_intel e1000e drm sdhci_pci firewire_ohci sdhci serio_raw mmc_core firewire_core ptp crc_itu_t pps_core wmi fjes video
      [  220.372568] CPU: 7 PID: 4988 Comm: cat Not tainted 4.10.5-200.fc25.x86_64 #1
      [  220.372647] Hardware name: Dell Inc. Latitude E6520/0J4TFW, BIOS A06 07/11/2011
      [  220.372729] task: ffff94791f6ea580 task.stack: ffffb72b88c0c000
      [  220.372802] RIP: 0010:ff_layout_mirror_valid+0x2d/0x110 [nfs_layout_flexfiles]
      [  220.372883] RSP: 0018:ffffb72b88c0f970 EFLAGS: 00010246
      [  220.372945] RAX: 0000000000000000 RBX: ffff9479015ca600 RCX: ffffffffffffffed
      [  220.373025] RDX: ffffffffffffffed RSI: ffff9479753dc980 RDI: 0000000000000000
      [  220.373104] RBP: ffffb72b88c0f988 R08: 000000000001c980 R09: ffffffffc0ea6112
      [  220.373184] R10: ffffef17477d9640 R11: ffff9479753dd6c0 R12: ffff9479211c7440
      [  220.373264] R13: ffff9478f45b7790 R14: 0000000000000001 R15: ffff9479015ca600
      [  220.373345] FS:  00007f555fa3e700(0000) GS:ffff9479753c0000(0000) knlGS:0000000000000000
      [  220.373435] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  220.373506] CR2: 0000000000000065 CR3: 0000000196044000 CR4: 00000000000406e0
      [  220.373586] Call Trace:
      [  220.373627]  nfs4_ff_layout_prepare_ds+0x5e/0x200 [nfs_layout_flexfiles]
      [  220.373708]  ff_layout_pg_init_read+0x81/0x160 [nfs_layout_flexfiles]
      [  220.373806]  __nfs_pageio_add_request+0x11f/0x4a0 [nfs]
      [  220.373886]  ? nfs_create_request.part.14+0x37/0x330 [nfs]
      [  220.373967]  nfs_pageio_add_request+0xb2/0x260 [nfs]
      [  220.374042]  readpage_async_filler+0xaf/0x280 [nfs]
      [  220.374103]  read_cache_pages+0xef/0x1b0
      [  220.374166]  ? nfs_read_completion+0x210/0x210 [nfs]
      [  220.374239]  nfs_readpages+0x129/0x200 [nfs]
      [  220.374293]  __do_page_cache_readahead+0x1d0/0x2f0
      [  220.374352]  ondemand_readahead+0x17d/0x2a0
      [  220.374403]  page_cache_sync_readahead+0x2e/0x50
      [  220.374460]  generic_file_read_iter+0x6c8/0x950
      [  220.374532]  ? nfs_mapping_need_revalidate_inode+0x17/0x40 [nfs]
      [  220.374617]  nfs_file_read+0x6e/0xc0 [nfs]
      [  220.374670]  __vfs_read+0xe2/0x150
      [  220.374715]  vfs_read+0x96/0x130
      [  220.374758]  SyS_read+0x55/0xc0
      [  220.374801]  entry_SYSCALL_64_fastpath+0x1a/0xa9
      [  220.374856] RIP: 0033:0x7f555f570bd0
      [  220.374900] RSP: 002b:00007ffeb73e1b38 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
      [  220.374986] RAX: ffffffffffffffda RBX: 00007f555f839ae0 RCX: 00007f555f570bd0
      [  220.375066] RDX: 0000000000020000 RSI: 00007f555fa41000 RDI: 0000000000000003
      [  220.375145] RBP: 0000000000021010 R08: ffffffffffffffff R09: 0000000000000000
      [  220.375226] R10: 00007f555fa40010 R11: 0000000000000246 R12: 0000000000022000
      [  220.375305] R13: 0000000000021010 R14: 0000000000001000 R15: 0000000000002710
      [  220.375386] Code: 66 66 90 55 48 89 e5 41 54 53 49 89 fc 48 83 ec 08 48 85 f6 74 2e 48 8b 4e 30 48 89 f3 48 81 f9 00 f0 ff ff 77 1e 48 85 c9 74 15 <48> 83 79 78 00 b8 01 00 00 00 74 2c 48 83 c4 08 5b 41 5c 5d c3
      [  220.375653] RIP: ff_layout_mirror_valid+0x2d/0x110 [nfs_layout_flexfiles] RSP: ffffb72b88c0f970
      [  220.375748] CR2: 0000000000000065
      [  220.403538] ---[ end trace bcdca752211b7da9 ]---
      Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • NFSv4.1 fix infinite loop on IO BAD_STATEID error · 0e3d3e5d
      Olga Kornievskaia committed
      Commit 63d63cbf ("NFSv4.1: Don't recheck delegations that have already
      been checked") introduced a regression: when a client received a
      BAD_STATEID error, it would not send any TEST_STATEID and would instead
      go into an infinite loop of resending the IO that caused the
      BAD_STATEID.
      
      Fixes: 63d63cbf ("NFSv4.1: Don't recheck delegations that have already been checked")
      Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
      Cc: stable@vger.kernel.org # 4.9+
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  3. 31 Mar, 2017 1 commit
  4. 29 Mar, 2017 3 commits
    • Btrfs: fix an integer overflow check · 457ae726
      Dan Carpenter committed
      This isn't super serious because you need CAP_SYS_ADMIN to run this code.
      
      I added this integer overflow check last year but apparently I am
      rubbish at writing integer overflow checks...  There are two issues.
      First, access_ok() works on unsigned long type and not u64 so on 32 bit
      systems the access_ok() could be checking a truncated size.  The other
      issue is that we should be using a stricter limit so we don't overflow
      the kzalloc() setting sctx->clone_roots later in the function after the
      access_ok():
      
      	alloc_size = sizeof(struct clone_root) * (arg->clone_sources_count + 1);
      	sctx->clone_roots = kzalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN);
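
A stricter limit of the kind described above can be expressed by dividing rather than multiplying, so that the check itself cannot wrap. This is a userspace sketch under assumptions: `struct clone_root_example` is a hypothetical stand-in for the kernel structure, and `clone_count_fits()` is not a kernel function.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct clone_root. */
struct clone_root_example {
	void *root;
	uint64_t ino;
	uint64_t offset;
	uint64_t found_refs;
};

/*
 * Return true when (count + 1) elements can be sized without wrapping.
 * The comparison is done by division so that no intermediate product or
 * sum can overflow, and the limit is derived from size_t, the type the
 * allocation size is computed in.
 */
bool clone_count_fits(uint64_t count)
{
	return count <= SIZE_MAX / sizeof(struct clone_root_example) - 1;
}
```

Both issues from the message show up here: the count is a 64-bit value compared against a limit in the platform's native size type, and the "- 1" accounts for the extra element in the allocation.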
      
      Fixes: f5ecec3c ("btrfs: send: silence an integer overflow warning")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ added comment ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Change qgroup_meta_rsv to 64bit · ce0dcee6
      Goldwyn Rodrigues committed
      Using an int value is causing qg->reserved to become negative and
      exclusive -EDQUOT to be reached prematurely.
      
      This affects exclusive qgroups only.
      
      TEST CASE:
      
      DEVICE=/dev/vdb
      MOUNTPOINT=/mnt
      SUBVOL=$MOUNTPOINT/tmp
      
      umount $SUBVOL
      umount $MOUNTPOINT
      
      mkfs.btrfs -f $DEVICE
      mount /dev/vdb $MOUNTPOINT
      btrfs quota enable $MOUNTPOINT
      btrfs subvol create $SUBVOL
      umount $MOUNTPOINT
      mount /dev/vdb $MOUNTPOINT
      mount -o subvol=tmp $DEVICE $SUBVOL
      btrfs qgroup limit -e 3G $SUBVOL
      
      btrfs quota rescan /mnt -w
      
      for i in `seq 1 44000`; do
        dd if=/dev/zero of=/mnt/tmp/test_$i bs=10k count=1
        if [[ $? -ne 0 ]]; then
           btrfs qgroup show -pcref $SUBVOL
           exit 1
        fi
      done
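
As a general illustration of why a 32-bit signed counter is too small for byte counts of this size (this is not the btrfs code, and `as_32bit_counter()` is a hypothetical helper), the sketch below shows what a total looks like once only its low 32 bits are kept and read as a signed value.

```c
#include <stdint.h>
#include <string.h>

/*
 * Reinterpret the low 32 bits of a byte count as a signed 32-bit value,
 * which is what a 32-bit counter would end up holding.
 */
int32_t as_32bit_counter(uint64_t bytes)
{
	uint32_t low = (uint32_t)bytes;	/* well-defined modulo 2^32 */
	int32_t out;

	memcpy(&out, &low, sizeof(out));	/* avoids an implementation-defined cast */
	return out;
}
```

Any total above 2 GiB - 1 reads back as negative; a 64-bit counter holds the same totals unchanged.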
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      [ add reproducer to changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: bring back repair during read · 9d0d1c8b
      Liu Bo committed
      Commit 20a7db8a ("btrfs: add dummy callback for readpage_io_failed
      and drop checks") made a cleanup around readpage_io_failed_hook, and
      it was supposed to keep the original semantics, but it also
      unexpectedly disabled repair during read for dup, raid1 and raid10.
      
      This fixes the problem by letting data's inode call the generic
      readpage_io_failed callback by returning -EAGAIN from its
      readpage_io_failed_hook in order to notify end_bio_extent_readpage to
      do the rest.  We don't call it directly because the generic one takes
      an offset from end_bio_extent_readpage() to calculate the index in the
      checksum array and inode's readpage_io_failed_hook doesn't offer that
      offset.
      
      Cc: David Sterba <dsterba@suse.cz>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ keep the const function attribute ]
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 28 Mar, 2017 4 commits
  6. 26 Mar, 2017 2 commits
  7. 20 Mar, 2017 5 commits
    • f2fs: combine nat_bits and free_nid_bitmap cache · 7041d5d2
      Chao Yu committed
      Both the nat_bits cache and the free_nid_bitmap cache provide the same
      functionality as an intermediate cache between the free nid cache and
      disk, but with different granularity for indicating free nid ranges and
      different persistence policies. The nat_bits cache provides better
      persistence, and free_nid_bitmap provides better granularity.
      
      In this patch we combine the advantages of both caches, so the final
      policy of the intermediate cache is:
      - init: load free nid status from nat_bits into free_nid_bitmap
      - lookup: scan free_nid_bitmap before loading NAT blocks
      - update: update free_nid_bitmap in real time
      - persistence: update and persist nat_bits at checkpoint
      
      This patch also resolves the performance regression reported by lkp-robot.
      
      commit:
        4ac91242 ("f2fs: introduce free nid bitmap")
        d00030cf9cd0bb96fdccc41e33d3c91dcbb672ba ("f2fs: use __set{__clear}_bit_le")
        1382c0f3f9d3f936c8bc42ed1591cf7a593ef9f7 ("f2fs: combine nat_bits and free_nid_bitmap cache")
      
      4ac91242 d00030cf9cd0bb96fdccc41e33 1382c0f3f9d3f936c8bc42ed15
      ---------------- -------------------------- --------------------------
               %stddev     %change         %stddev     %change         %stddev
                   \          |                \          |                \
           77863 ±  0%      +2.1%      79485 ±  1%     +50.8%     117404 ±  0%  aim7.jobs-per-min
          231.63 ±  0%      -2.0%     227.01 ±  1%     -33.6%     153.80 ±  0%  aim7.time.elapsed_time
          231.63 ±  0%      -2.0%     227.01 ±  1%     -33.6%     153.80 ±  0%  aim7.time.elapsed_time.max
          896604 ±  0%      -0.8%     889221 ±  3%     -20.2%     715260 ±  1%  aim7.time.involuntary_context_switches
            2394 ±  1%      +4.6%       2503 ±  1%      +3.7%       2481 ±  2%  aim7.time.maximum_resident_set_size
            6240 ±  0%      -1.5%       6145 ±  1%     -14.1%       5360 ±  1%  aim7.time.system_time
         1111357 ±  3%      +1.9%    1132509 ±  2%      -6.2%    1041932 ±  2%  aim7.time.voluntary_context_switches
      ...
      Signed-off-by: Chao Yu <yuchao0@huawei.com>
      Tested-by: Xiaolong Ye <xiaolong.ye@intel.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: skip scanning free nid bitmap of full NAT blocks · 586d1492
      Chao Yu committed
      This patch adds accounting of free nids for each NAT block. While
      scanning the free nid bitmaps, it checks that count and skips the
      lookup in full NAT blocks.
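
The idea of keeping a per-block count so that full blocks can be skipped without touching their bitmap can be sketched as follows. The names and the tiny block size are hypothetical and chosen for illustration; they are not the f2fs data structures.

```c
#include <stddef.h>

#define NIDS_PER_BLOCK 8	/* tiny illustrative value, not the f2fs constant */

struct nat_block_example {
	unsigned char free_bitmap;	/* bit i set => nid i in this block is free */
	unsigned int free_count;	/* number of set bits, kept in sync on update */
};

/*
 * Return the index of the first free nid across all blocks, or -1.
 * Blocks whose free_count is zero are skipped without scanning their bitmap.
 * *scanned reports how many bitmaps were actually examined.
 */
long find_free_nid(const struct nat_block_example *blocks, size_t nr_blocks,
		   size_t *scanned)
{
	*scanned = 0;
	for (size_t b = 0; b < nr_blocks; b++) {
		if (blocks[b].free_count == 0)
			continue;	/* full block: nothing to find here */
		(*scanned)++;
		for (int i = 0; i < NIDS_PER_BLOCK; i++)
			if (blocks[b].free_bitmap & (1u << i))
				return (long)(b * NIDS_PER_BLOCK + i);
	}
	return -1;
}
```

The trade-off is one extra counter per block that must be updated whenever a bit changes, in exchange for a constant-time skip of blocks that cannot contain a free nid.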
      Signed-off-by: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: use __set{__clear}_bit_le · 23380b85
      Jaegeuk Kim committed
      This patch uses __set{__clear}_bit_le for higher speed.
      Reviewed-by: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: declare static functions · 9f7e4a2c
      Jaegeuk Kim committed
      This is to avoid a build warning reported by the kbuild test robot.
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: don't overwrite node block by SSR · 720037f9
      Jaegeuk Kim committed
      This patch fixes a case where SSR can overwrite a previous warm node block
      that is part of a node chain written since the last checkpoint.
      
      Fixes: 5b6c6be2 ("f2fs: use SSR for warm node as well")
      Reviewed-by: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
  8. 18 Mar, 2017 8 commits
    • btrfs: add missing memset while reading compressed inline extents · e1699d2d
      Zygo Blaxell committed
      This is a story about 4 distinct (and very old) btrfs bugs.
      
      Commit c8b97818 ("Btrfs: Add zlib compression support") added
      three data corruption bugs for inline extents (bugs #1-3).
      
      Commit 93c82d57 ("Btrfs: zero page past end of inline file items")
      fixed bug #1:  uncompressed inline extents followed by a hole and more
      extents could get non-zero data in the hole as they were read.  The fix
      was to add a memset in btrfs_get_extent to zero out the hole.
      
      Commit 166ae5a4 ("btrfs: fix inline compressed read err corruption")
      fixed bug #2:  compressed inline extents which contained non-zero bytes
      might be replaced with zero bytes in some cases.  This patch removed an
      unhelpful memset from uncompress_inline, but the case where memset is
      required was missed.
      
      There is also a memset in the decompression code, but this only covers
      decompressed data that is shorter than the ram_bytes from the extent
      ref record.  This memset doesn't cover the region between the end of the
      decompressed data and the end of the page.  It has also moved around a
      few times over the years, so there's no single patch to refer to.
      
      This patch fixes bug #3:  compressed inline extents followed by a hole
      and more extents could get non-zero data in the hole as they were read
      (i.e. bug #3 is the same as bug #1, but s/uncompressed/compressed/).
      The fix is the same:  zero out the hole in the compressed case too,
      by putting a memset back in uncompress_inline, but this time with
      correct parameters.
      
      The last and oldest bug, bug #0, is the cause of the offending inline
      extent/hole/extent pattern.  Bug #0 is a subtle and mostly-harmless quirk
      of behavior somewhere in the btrfs write code.  In a few special cases,
      an inline extent and hole are allowed to persist where they normally
      would be combined with later extents in the file.
      
      A fast reproducer for bug #0 is presented below.  A few offending extents
      are also created in the wild during large rsync transfers with the -S
      flag.  A Linux kernel build (git checkout; make allyesconfig; make -j8)
      will produce a handful of offending files as well.  Once an offending
      file is created, it can present different content to userspace each
      time it is read.
      
      Bug #0 is at least 4 and possibly 8 years old.  I verified every vX.Y
      kernel back to v3.5 has this behavior.  There are fossil records of this
      bug's effects in commits all the way back to v2.6.32.  I have no reason
      to believe bug #0 wasn't present at the beginning of btrfs compression
      support in v2.6.29, but I can't easily test kernels that old to be sure.
      
      It is not clear whether bug #0 is worth fixing.  A fix would likely
      require injecting extra reads into currently write-only paths, and most
      of the exceptional cases caused by bug #0 are already handled now.
      
      Whether we like them or not, bug #0's inline extents followed by holes
      are part of the btrfs de-facto disk format now, and we need to be able
      to read them without data corruption or an infoleak.  So enough about
      bug #0, let's get back to bug #3 (this patch).
      
      An example of on-disk structure leading to data corruption found in
      the wild:
      
              item 61 key (606890 INODE_ITEM 0) itemoff 9662 itemsize 160
                      inode generation 50 transid 50 size 47424 nbytes 49141
                      block group 0 mode 100644 links 1 uid 0 gid 0
                      rdev 0 flags 0x0(none)
              item 62 key (606890 INODE_REF 603050) itemoff 9642 itemsize 20
                      inode ref index 3 namelen 10 name: DB_File.so
              item 63 key (606890 EXTENT_DATA 0) itemoff 8280 itemsize 1362
                      inline extent data size 1341 ram 4085 compress(zlib)
              item 64 key (606890 EXTENT_DATA 4096) itemoff 8227 itemsize 53
                      extent data disk byte 5367308288 nr 20480
                      extent data offset 0 nr 45056 ram 45056
                      extent compression(zlib)
      
      Different data appears in userspace during each read of the 11 bytes
      between 4085 and 4096.  The extent in item 63 is not long enough to
      fill the first page of the file, so a memset is required to fill the
      space between item 63 (ending at 4085) and item 64 (beginning at 4096)
      with zero.
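
The required zero fill can be sketched as a small helper that clears everything in the page past the end of the decompressed data. This is an illustration only: `zero_page_tail()` is a hypothetical name, and the kernel fix operates on a mapped page inside uncompress_inline rather than on a plain buffer.

```c
#include <stddef.h>
#include <string.h>

/*
 * Zero the part of a page that decompressed inline data did not cover.
 * Returns the number of bytes zeroed.
 */
size_t zero_page_tail(unsigned char *page, size_t page_size, size_t data_len)
{
	if (data_len >= page_size)
		return 0;	/* data fills the page: nothing to clear */
	memset(page + data_len, 0, page_size - data_len);
	return page_size - data_len;
}
```

With the numbers from the example above (a 4096-byte page and 4085 bytes of inline data), the helper clears the 11 bytes between offsets 4085 and 4096.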
      
      Here is a reproducer from Liu Bo, which demonstrates another method
      of creating the same inline extent and hole pattern:
      
      Using 'page_poison=on' kernel command line (or enable
      CONFIG_PAGE_POISONING) run the following:
      
      	# touch foo
      	# chattr +c foo
      	# xfs_io -f -c "pwrite -W 0 1000" foo
      	# xfs_io -f -c "falloc 4 8188" foo
      	# od -x foo
      	# echo 3 >/proc/sys/vm/drop_caches
      	# od -x foo
      
      This produces the following on my box:
      
      Correct output:  file contains 1000 data bytes followed
      by zeros:
      
      	0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
      	*
      	0001740 cdcd cdcd cdcd cdcd 0000 0000 0000 0000
      	0001760 0000 0000 0000 0000 0000 0000 0000 0000
      	*
      	0020000
      
      Actual output:  the data after the first 1000 bytes
      will be different each run:
      
      	0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
      	*
      	0001740 cdcd cdcd cdcd cdcd 6c63 7400 635f 006d
      	0001760 5f74 6f43 7400 435f 0053 5f74 7363 7400
      	0002000 435f 0056 5f74 6164 7400 645f 0062 5f74
      	(...)
      Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: Chris Mason <clm@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix regression in lock_delalloc_pages · 49d4a334
      Liu Bo committed
      This bug is a regression after commit da2c7009 ("btrfs: teach
      __process_pages_contig about PAGE_LOCK operation") and commit 76c0021d
      ("Btrfs: use helper to simplify lock/unlock pages").
      
      If the dirty pages under writeback were partially truncated before we
      locked them, we could not find all the pages mapping to the delalloc
      range. The code did not return an error, so it kept going, found that the
      delalloc range had been truncated, and went on to unlock the dirty pages,
      at which point the ASSERT caught the error and showed:
      
      -----------------------------------------------------------------------------
      assertion failed: page_ops & PAGE_LOCK, file: fs/btrfs/extent_io.c, line: 1716
      -----------------------------------------------------------------------------
      
      This fixes the bug by returning the proper -EAGAIN.
      
      Cc: David Sterba <dsterba@suse.com>
      Reported-by: Dave Jones <davej@codemonkey.org.uk>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • pNFS/flexfiles: never nfs4_mark_deviceid_unavailable · da066f3f
      Weston Andros Adamson committed
      The flexfiles layout should never mark a device unavailable.
      
      Move nfs4_mark_deviceid_unavailable out of nfs4_pnfs_ds_connect and call
      directly from files layout where it's still needed.
      
      The flexfiles driver still handles marked devices in error paths, but will
      now print a rate limited warning.
      Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • pNFS: return status from nfs4_pnfs_ds_connect · a33e4b03
      Weston Andros Adamson committed
      The nfs4_pnfs_ds_connect path can call rpc_create which can fail or it
      can wait on another context to reach the same failure.
      
      This checks that the rpc_create succeeded and returns the error to the
      caller.
      
      When an error is returned, both the files and flexfiles layouts will return
      NULL from _prepare_ds(). The flexfiles layout will also return the layout
      with the error NFS4ERR_NXIO.
      Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • NFSv4.1 respect server's max size in CREATE_SESSION · 03385332
      Olga Kornievskaia committed
      Currently the client doesn't respect the max sizes the server returns in
      CREATE_SESSION. nfs4_session_set_rwsize() gets called while server->rsize
      and server->wsize are still 0, so they never get set to the sizes returned
      by the server.
      Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • NFS prevent double free in async nfs4_exchange_id · 63513232
      Olga Kornievskaia committed
      Since rpc_task is async, the release function should be called which
      will free the impl_id, scope, and owner.
      
      Trond pointed at 2 more problems:
      -- use of client pointer after free in the nfs4_exchangeid_release() function
      -- cl_count mismatch if rpc_run_task() isn't run
      
      Fixes: 8d89bd70 ("NFS setup async exchange_id")
      Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
      Cc: stable@vger.kernel.org # 4.9
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • nfs: make nfs4_cb_sv_ops static · 05fae7bb
      Jason Yan committed
      Fixes the following sparse warning:
      
      fs/nfs/callback.c:235:21: warning: symbol 'nfs4_cb_sv_ops' was not
      declared. Should it be static?
      Signed-off-by: Jason Yan <yanaijie@huawei.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • NFS: fix the fault nrequests decreasing for nfs_inode COPY · 38a33101
      Kinglong Mee committed
      nfs_commit_file() for the NFSv4.2 COPY operation goes through the
      commit path used for a normal WRITE, but without increasing nrequests,
      so the decrement of nrequests in nfs_commit_release_pages() is erroneous.
      After that, nrequests will be wrong.
      
      [ 5670.299881] ------------[ cut here ]------------
      [ 5670.300295] WARNING: CPU: 0 PID: 27656 at fs/nfs/inode.c:127 nfs_clear_inode+0x66/0x90 [nfs]
      [ 5670.300558] Modules linked in: nfsv4(E) nfs(E) fscache(E) tun bridge stp llc fuse ip_set nfnetlink vmw_vsock_vmci_transport vsock snd_seq_midi snd_seq_midi_event ppdev f2fs coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_ens1371 intel_rapl_perf gameport snd_ac97_codec vmw_balloon ac97_bus snd_seq snd_pcm joydev snd_rawmidi snd_timer snd_seq_device snd soundcore nfit parport_pc parport acpi_cpufreq tpm_tis tpm_tis_core tpm i2c_piix4 vmw_vmci shpchp nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c vmwgfx drm_kms_helper ttm drm e1000 crc32c_intel mptspi scsi_transport_spi serio_raw mptscsih mptbase ata_generic pata_acpi fjes [last unloaded: fscache]
      [ 5670.302925] CPU: 0 PID: 27656 Comm: umount.nfs4 Tainted: G        W   E   4.11.0-rc1+ #519
      [ 5670.303292] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
      [ 5670.304094] Call Trace:
      [ 5670.304510]  dump_stack+0x63/0x86
      [ 5670.304917]  __warn+0xcb/0xf0
      [ 5670.305276]  warn_slowpath_null+0x1d/0x20
      [ 5670.305661]  nfs_clear_inode+0x66/0x90 [nfs]
      [ 5670.306093]  nfs4_evict_inode+0x61/0x70 [nfsv4]
      [ 5670.306480]  evict+0xbb/0x1c0
      [ 5670.306888]  dispose_list+0x4d/0x70
      [ 5670.307233]  evict_inodes+0x178/0x1a0
      [ 5670.307579]  generic_shutdown_super+0x44/0xf0
      [ 5670.307985]  nfs_kill_super+0x21/0x40 [nfs]
      [ 5670.308325]  deactivate_locked_super+0x43/0x70
      [ 5670.308698]  deactivate_super+0x5a/0x60
      [ 5670.309036]  cleanup_mnt+0x3f/0x90
      [ 5670.309407]  __cleanup_mnt+0x12/0x20
      [ 5670.309837]  task_work_run+0x80/0xa0
      [ 5670.310162]  exit_to_usermode_loop+0x89/0x90
      [ 5670.310497]  syscall_return_slowpath+0xaa/0xb0
      [ 5670.310875]  entry_SYSCALL_64_fastpath+0xa7/0xa9
      [ 5670.311197] RIP: 0033:0x7f1bb3617fe7
      [ 5670.311545] RSP: 002b:00007ffecbabb828 EFLAGS: 00000206 ORIG_RAX: 00000000000000a6
      [ 5670.311906] RAX: 0000000000000000 RBX: 0000000001dca1f0 RCX: 00007f1bb3617fe7
      [ 5670.312239] RDX: 000000000000000c RSI: 0000000000000001 RDI: 0000000001dc83c0
      [ 5670.312653] RBP: 0000000001dc83c0 R08: 0000000000000001 R09: 0000000000000000
      [ 5670.312998] R10: 0000000000000755 R11: 0000000000000206 R12: 00007ffecbabc66a
      [ 5670.313335] R13: 0000000001dc83a0 R14: 0000000000000000 R15: 0000000000000000
      [ 5670.313758] ---[ end trace bf4bfe7764e4eb40 ]---
      
      Cc: linux-kernel@vger.kernel.org
      Fixes: 67911c8f ("NFS: Add nfs_commit_file()")
      Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
      Cc: stable@vger.kernel.org # 4.7+
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      38a33101
  9. 17 Mar, 2017 10 commits
    • V
      kernfs: Check KERNFS_HAS_RELEASE before calling kernfs_release_file() · 966fa72a
      Vaibhav Jain committed
      Recently started seeing a kernel oops when a module tries removing a
      memory mapped sysfs bin_attribute. On closer investigation the root
      cause seems to be kernfs_release_file() trying to call
      kernfs_op.release() callback that's NULL for such sysfs
      bin_attributes. The oops occurs when kernfs_release_file() is called from
      kernfs_drain_open_files() to cleanup any open handles with active
      memory mappings.
      
      The patch fixes this by checking for flag KERNFS_HAS_RELEASE before
      calling kernfs_release_file() in function kernfs_drain_open_files().
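The check described above can be sketched in plain userspace C. This is only an illustration under stated assumptions: every `sketch_*` and `SKETCH_*` name is hypothetical and stands in for the kernfs structures, and the code is not the actual patch:

```c
#include <stddef.h>

#define SKETCH_HAS_RELEASE 0x1u  /* hypothetical flag: ops provide release() */

struct sketch_open_file;

struct sketch_ops {
    void (*release)(struct sketch_open_file *of);  /* may be NULL */
};

struct sketch_node {
    unsigned int flags;
    const struct sketch_ops *ops;
};

struct sketch_open_file {
    struct sketch_node *node;
    int released;
};

/* Example callback used by nodes that do provide release(). */
static void sketch_count_release(struct sketch_open_file *of)
{
    of->released = 1;
}

/* Calls the callback unconditionally, so it must only be reached for
 * nodes whose ops actually provide release(). */
static void sketch_release_file(struct sketch_open_file *of)
{
    of->node->ops->release(of);
}

/* Drain loop: check the flag first, so a NULL release() is never called.
 * Returns how many files had their release callback invoked. */
static int sketch_drain(struct sketch_open_file *files, size_t n)
{
    int called = 0;

    for (size_t i = 0; i < n; i++) {
        if (files[i].node->flags & SKETCH_HAS_RELEASE) {
            sketch_release_file(&files[i]);
            called++;
        }
    }
    return called;
}
```

Without the flag test in `sketch_drain()`, the second kind of node (no callback) would jump through a NULL function pointer, which matches the instruction fetch at address 0 in the trace.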
      
      On ppc64-le arch with cxl module the oops back-trace is of the
      form below:
      [  861.381126] Unable to handle kernel paging request for instruction fetch
      [  861.381360] Faulting instruction address: 0x00000000
      [  861.381428] Oops: Kernel access of bad area, sig: 11 [#1]
      ....
      [  861.382481] NIP: 0000000000000000 LR: c000000000362c60 CTR: 0000000000000000
      ....
      Call Trace:
      [c000000f1680b750] [c000000000362c34] kernfs_drain_open_files+0x104/0x1d0 (unreliable)
      [c000000f1680b790] [c00000000035fa00] __kernfs_remove+0x260/0x2c0
      [c000000f1680b820] [c000000000360da0] kernfs_remove_by_name_ns+0x60/0xe0
      [c000000f1680b8b0] [c0000000003638f4] sysfs_remove_bin_file+0x24/0x40
      [c000000f1680b8d0] [c00000000062a164] device_remove_bin_file+0x24/0x40
      [c000000f1680b8f0] [d000000009b7b22c] cxl_sysfs_afu_remove+0x144/0x170 [cxl]
      [c000000f1680b940] [d000000009b7c7e4] cxl_remove+0x6c/0x1a0 [cxl]
      [c000000f1680b990] [c00000000052f694] pci_device_remove+0x64/0x110
      [c000000f1680b9d0] [c0000000006321d4] device_release_driver_internal+0x1f4/0x2b0
      [c000000f1680ba20] [c000000000525cb0] pci_stop_bus_device+0xa0/0xd0
      [c000000f1680ba60] [c000000000525e80] pci_stop_and_remove_bus_device+0x20/0x40
      [c000000f1680ba90] [c00000000004a6c4] pci_hp_remove_devices+0x84/0xc0
      [c000000f1680bad0] [c00000000004a688] pci_hp_remove_devices+0x48/0xc0
      [c000000f1680bb10] [c0000000009dfda4] eeh_reset_device+0xb0/0x290
      [c000000f1680bbb0] [c000000000032b4c] eeh_handle_normal_event+0x47c/0x530
      [c000000f1680bc60] [c000000000032e64] eeh_handle_event+0x174/0x350
      [c000000f1680bd10] [c000000000033228] eeh_event_handler+0x1e8/0x1f0
      [c000000f1680bdc0] [c0000000000d384c] kthread+0x14c/0x190
      [c000000f1680be30] [c00000000000b5a0] ret_from_kernel_thread+0x5c/0xbc
      
      Fixes: f83f3c51 ("kernfs: fix locking around kernfs_ops->release() callback")
      Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      966fa72a
    • D
      afs: Don't wait for page writeback with the page lock held · c5051c7b
      David Howells committed
      Drop the page lock before waiting for page writeback.
      Signed-off-by: David Howells <dhowells@redhat.com>
      c5051c7b
    • D
      afs: ->writepage() shouldn't call clear_page_dirty_for_io() · 65a15109
      David Howells committed
      The ->writepage() op shouldn't call clear_page_dirty_for_io() as that has
      already been called by the caller.
      
      Fix afs_writepage() by moving the call out of
      afs_write_back_from_locked_page() to afs_writepages_region() where it is
      needed.
      Signed-off-by: David Howells <dhowells@redhat.com>
      65a15109
    • D
      afs: Fix abort on signal while waiting for call completion · 954cd6dc
      David Howells committed
      Fix the way in which a call that's in progress and being waited for is
      aborted in the case that EINTR is detected.  We should be sending
      RX_USER_ABORT rather than RX_CALL_DEAD as the abort code.
      
      Note that since the only two ways out of the loop are if the call completes
      or if a signal happens, the kill-the-call clause after the loop has
      finished can only happen in the case of EINTR.  This means that we only
      have one abort case to deal with, not two, and the "KWC" case can never
      happen and so can be deleted.
      
      Note further that simply aborting the call isn't necessarily the best thing
      here since at this point: the request has been entirely sent and it's
      likely the server will do the operation anyway - whether we abort it or
      not.  In future, we should punt the handling of the remainder of the call
      off to a background thread.
      Reported-by: Marc Dionne <marc.c.dionne@auristor.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      954cd6dc
    • D
      afs: Fix an off-by-one error in afs_send_pages() · 445783d0
      David Howells committed
      afs_send_pages() should only put the call into the AFS_CALL_AWAIT_REPLY
      state if it has sent all the pages - but the check it makes is incorrect
      and sometimes it will finish the loop early.
      Signed-off-by: David Howells <dhowells@redhat.com>
      445783d0
    • D
      afs: Fix afs_kill_pages() · 7286a35e
      David Howells committed
      Fix afs_kill_pages() in two ways:
      
       (1) If a writeback has been partially flushed, then if we try and kill the
           pages it contains, some of them may no longer be undergoing writeback
           and end_page_writeback() will assert.
      
           Fix this by checking to see whether the page in question is actually
           undergoing writeback before ending that writeback.
      
       (2) The loop that scans for pages to kill doesn't increase the first page
           index, and so the loop may not terminate, but it will try to process
           the same pages over and over again.
      
           Fix this by increasing the first page index to one after the last page
           we processed.
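Both fixes listed above can be illustrated with a small userspace sketch. It assumes a simplified model in which pages are array slots and a batch lookup returns a few indices at a time; the `sketch_*` names and the batch size are hypothetical and do not reflect the real afs code:

```c
#include <stddef.h>

#define SKETCH_BATCH 4  /* hypothetical batch size for the lookup */

struct sketch_page {
    int writeback;  /* nonzero while the page is under writeback */
    int killed;
};

/* Hypothetical batch lookup: collects up to SKETCH_BATCH indices in
 * [first, last] and returns how many were found. */
static size_t sketch_lookup(size_t first, size_t last, size_t *out)
{
    size_t n = 0;

    for (size_t i = first; i <= last && n < SKETCH_BATCH; i++)
        out[n++] = i;
    return n;
}

/* Kills every page in [first, last]; returns the number of batches. */
static int sketch_kill_pages(struct sketch_page *pages, size_t first, size_t last)
{
    size_t batch[SKETCH_BATCH];
    int rounds = 0;

    while (first <= last) {
        size_t n = sketch_lookup(first, last, batch);

        if (n == 0)
            break;
        for (size_t i = 0; i < n; i++) {
            struct sketch_page *p = &pages[batch[i]];

            /* (1) Only end writeback on pages still under writeback. */
            if (p->writeback)
                p->writeback = 0;
            p->killed = 1;
        }
        /* (2) Advance to one past the last page processed, so the same
         * batch is not looked up again. */
        first = batch[n - 1] + 1;
        rounds++;
    }
    return rounds;
}
```

If the `first = batch[n - 1] + 1` line is dropped, every iteration looks up the same batch again, which is the repeated processing that item (2) describes.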
      Signed-off-by: David Howells <dhowells@redhat.com>
      7286a35e
    • D
      afs: Fix page leak in afs_write_begin() · 6d06b0d2
      David Howells committed
      afs_write_begin() leaks a ref and a lock on a page if afs_fill_page()
      fails.  Fix the leak by unlocking and releasing the page in the error path.
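A rough sketch of the error-path pattern described above, modelled in userspace with a plain counter and a boolean instead of real page references and locks. All `sketch_*` names are hypothetical, and the error value is arbitrary:

```c
#include <stdbool.h>

struct sketch_wpage {
    int refcount;
    bool locked;
};

/* Hypothetical fill step; returns a negative value when told to fail. */
static int sketch_fill(struct sketch_wpage *p, bool fail)
{
    (void)p;
    return fail ? -5 : 0;
}

/* On success the caller keeps the page locked and referenced;
 * on failure both are dropped before returning. */
static int sketch_write_begin(struct sketch_wpage *p, bool fail)
{
    int ret;

    p->refcount++;     /* take a reference */
    p->locked = true;  /* lock the page */

    ret = sketch_fill(p, fail);
    if (ret < 0) {
        p->locked = false;  /* unlock ... */
        p->refcount--;      /* ... and release in the error path */
        return ret;
    }
    return 0;
}
```

Omitting the two lines in the `ret < 0` branch reproduces the leak: the function returns an error while the page is still locked and still holds the extra reference.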
      Signed-off-by: David Howells <dhowells@redhat.com>
      6d06b0d2
    • D
      afs: Don't set PG_error on local EINTR or ENOMEM when filling a page · 68ae849d
      David Howells committed
      Don't set PG_error on a page if we get local EINTR or ENOMEM when filling a
      page for writing.
      Signed-off-by: David Howells <dhowells@redhat.com>
      68ae849d
    • M
      afs: Populate and use client modification time · ab94f5d0
      Marc Dionne committed
      The inode timestamps should be set from the client time
      in the status received from the server, rather than the
      server time which is meant for internal server use.
      
      Set AFS_SET_MTIME and populate the mtime for operations
      that take an input status, such as file/dir creation
      and StoreData.  If an input time is not provided the
      server will set the vnode times based on the current server
      time.
      
      In a situation where the server has some skew with the
      client, this could lead to the client seeing a timestamp
      in the future for a file that it just created or wrote.
      Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      ab94f5d0
    • D
      afs: Better abort and net error handling · 70af0e3b
      David Howells committed
      If we receive a network error, a remote abort or a protocol error whilst
      we're still transmitting data, make sure we return an appropriate error to
      the caller rather than ESHUTDOWN or ECONNABORTED.
      Signed-off-by: David Howells <dhowells@redhat.com>
      70af0e3b