1. 08 10月, 2019 19 次提交
  2. 05 10月, 2019 21 次提交
    • G
      Linux 4.19.77 · 6cad9d0c
      Greg Kroah-Hartman 提交于
      6cad9d0c
    • K
      drm/amd/display: Restore backlight brightness after system resume · 2c60da90
      Kai-Heng Feng 提交于
      commit bb264220d9316f6bd7c1fd84b8da398c93912931 upstream.
      
      Laptops with AMD APU doesn't restore display backlight brightness after
      system resume.
      
      This issue started when DC was introduced.
      
      Let's use BL_CORE_SUSPENDRESUME so the backlight core calls
      update_status callback after system resume to restore the backlight
      level.
      
      Tested on Dell Inspiron 3180 (Stoney Ridge) and Dell Latitude 5495
      (Raven Ridge).
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NKai-Heng Feng <kai.heng.feng@canonical.com>
      Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2c60da90
    • Y
      mm/compaction.c: clear total_{migrate,free}_scanned before scanning a new zone · 4d8bdf7f
      Yafang Shao 提交于
      [ Upstream commit a94b525241c0fff3598809131d7cfcfe1d572d8c ]
      
      total_{migrate,free}_scanned will be added to COMPACTMIGRATE_SCANNED and
      COMPACTFREE_SCANNED in compact_zone().  We should clear them before
      scanning a new zone.  In the proc triggered compaction, we forgot clearing
      them.
      
      [laoar.shao@gmail.com: introduce a helper compact_zone_counters_init()]
        Link: http://lkml.kernel.org/r/1563869295-25748-1-git-send-email-laoar.shao@gmail.com
      [akpm@linux-foundation.org: expand compact_zone_counters_init() into its single callsite, per mhocko]
      [vbabka@suse.cz: squash compact_zone() list_head init as well]
        Link: http://lkml.kernel.org/r/1fb6f7da-f776-9e42-22f8-bbb79b030b98@suse.cz
      [akpm@linux-foundation.org: kcompactd_do_work(): avoid unnecessary initialization of cc.zone]
      Link: http://lkml.kernel.org/r/1563789275-9639-1-git-send-email-laoar.shao@gmail.com
      Fixes: 7f354a54 ("mm, compaction: add vmstats for kcompactd work")
      Signed-off-by: NYafang Shao <laoar.shao@gmail.com>
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Yafang Shao <shaoyafang@didiglobal.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      4d8bdf7f
    • E
      fuse: fix deadlock with aio poll and fuse_iqueue::waitq.lock · 5bead06b
      Eric Biggers 提交于
      [ Upstream commit 76e43c8ccaa35c30d5df853013561145a0f750a5 ]
      
      When IOCB_CMD_POLL is used on the FUSE device, aio_poll() disables IRQs
      and takes kioctx::ctx_lock, then fuse_iqueue::waitq.lock.
      
      This may have to wait for fuse_iqueue::waitq.lock to be released by one
      of many places that take it with IRQs enabled.  Since the IRQ handler
      may take kioctx::ctx_lock, lockdep reports that a deadlock is possible.
      
      Fix it by protecting the state of struct fuse_iqueue with a separate
      spinlock, and only accessing fuse_iqueue::waitq using the versions of
      the waitqueue functions which do IRQ-safe locking internally.
      
      Reproducer:
      
      	#include <fcntl.h>
      	#include <stdio.h>
      	#include <sys/mount.h>
      	#include <sys/stat.h>
      	#include <sys/syscall.h>
      	#include <unistd.h>
      	#include <linux/aio_abi.h>
      
      	int main()
      	{
      		char opts[128];
      		int fd = open("/dev/fuse", O_RDWR);
      		aio_context_t ctx = 0;
      		struct iocb cb = { .aio_lio_opcode = IOCB_CMD_POLL, .aio_fildes = fd };
      		struct iocb *cbp = &cb;
      
      		sprintf(opts, "fd=%d,rootmode=040000,user_id=0,group_id=0", fd);
      		mkdir("mnt", 0700);
      		mount("foo",  "mnt", "fuse", 0, opts);
      		syscall(__NR_io_setup, 1, &ctx);
      		syscall(__NR_io_submit, ctx, 1, &cbp);
      	}
      
      Beginning of lockdep output:
      
      	=====================================================
      	WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
      	5.3.0-rc5 #9 Not tainted
      	-----------------------------------------------------
      	syz_fuse/135 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
      	000000003590ceda (&fiq->waitq){+.+.}, at: spin_lock include/linux/spinlock.h:338 [inline]
      	000000003590ceda (&fiq->waitq){+.+.}, at: aio_poll fs/aio.c:1751 [inline]
      	000000003590ceda (&fiq->waitq){+.+.}, at: __io_submit_one.constprop.0+0x203/0x5b0 fs/aio.c:1825
      
      	and this task is already holding:
      	0000000075037284 (&(&ctx->ctx_lock)->rlock){..-.}, at: spin_lock_irq include/linux/spinlock.h:363 [inline]
      	0000000075037284 (&(&ctx->ctx_lock)->rlock){..-.}, at: aio_poll fs/aio.c:1749 [inline]
      	0000000075037284 (&(&ctx->ctx_lock)->rlock){..-.}, at: __io_submit_one.constprop.0+0x1f4/0x5b0 fs/aio.c:1825
      	which would create a new lock dependency:
      	 (&(&ctx->ctx_lock)->rlock){..-.} -> (&fiq->waitq){+.+.}
      
      	but this new dependency connects a SOFTIRQ-irq-safe lock:
      	 (&(&ctx->ctx_lock)->rlock){..-.}
      
      	[...]
      
      Reported-by: syzbot+af05535bb79520f95431@syzkaller.appspotmail.com
      Reported-by: syzbot+d86c4426a01f60feddc7@syzkaller.appspotmail.com
      Fixes: bfe4037e ("aio: implement IOCB_CMD_POLL")
      Cc: <stable@vger.kernel.org> # v4.19+
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5bead06b
    • N
      md/raid0: avoid RAID0 data corruption due to layout confusion. · bbe3e205
      NeilBrown 提交于
      [ Upstream commit c84a1372df929033cb1a0441fb57bd3932f39ac9 ]
      
      If the drives in a RAID0 are not all the same size, the array is
      divided into zones.
      The first zone covers all drives, to the size of the smallest.
      The second zone covers all drives larger than the smallest, up to
      the size of the second smallest - etc.
      
      A change in Linux 3.14 unintentionally changed the layout for the
      second and subsequent zones.  All the correct data is still stored, but
      each chunk may be assigned to a different device than in pre-3.14 kernels.
      This can lead to data corruption.
      
      It is not possible to determine what layout to use - it depends which
      kernel the data was written by.
      So we add a module parameter to allow the old (0) or new (1) layout to be
      specified, and refused to assemble an affected array if that parameter is
      not set.
      
      Fixes: 20d0189b ("block: Introduce new bio_split()")
      cc: stable@vger.kernel.org (3.14+)
      Acked-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      bbe3e205
    • P
      CIFS: Fix oplock handling for SMB 2.1+ protocols · 4290a9e5
      Pavel Shilovsky 提交于
      commit a016e2794fc3a245a91946038dd8f34d65e53cc3 upstream.
      
      There may be situations when a server negotiates SMB 2.1
      protocol version or higher but responds to a CREATE request
      with an oplock rather than a lease.
      
      Currently the client doesn't handle such a case correctly:
      when another CREATE comes in the server sends an oplock
      break to the initial CREATE and the client doesn't send
      an ack back due to a wrong caching level being set (READ
      instead of RWH). Missing an oplock break ack makes the
      server wait until the break times out which dramatically
      increases the latency of the second CREATE.
      
      Fix this by properly detecting oplocks when using SMB 2.1
      protocol version and higher.
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NPavel Shilovsky <pshilov@microsoft.com>
      Signed-off-by: NSteve French <stfrench@microsoft.com>
      Reviewed-by: NRonnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4290a9e5
    • M
      CIFS: fix max ea value size · a3a15089
      Murphy Zhou 提交于
      commit 63d37fb4ce5ae7bf1e58f906d1bf25f036fe79b2 upstream.
      
      It should not be larger then the slab max buf size. If user
      specifies a larger size, it passes this check and goes
      straightly to SMB2_set_info_init performing an insecure memcpy.
      Signed-off-by: NMurphy Zhou <jencce.kernel@gmail.com>
      Reviewed-by: NAurelien Aptel <aaptel@suse.com>
      CC: Stable <stable@vger.kernel.org>
      Signed-off-by: NSteve French <stfrench@microsoft.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a3a15089
    • C
      i2c: riic: Clear NACK in tend isr · a0f7fd38
      Chris Brandt 提交于
      commit a71e2ac1f32097fbb2beab098687a7a95c84543e upstream.
      
      The NACKF flag should be cleared in INTRIICNAKI interrupt processing as
      description in HW manual.
      
      This issue shows up quickly when PREEMPT_RT is applied and a device is
      probed that is not plugged in (like a touchscreen controller). The result
      is endless interrupts that halt system boot.
      
      Fixes: 310c18a4 ("i2c: riic: add driver")
      Cc: stable@vger.kernel.org
      Reported-by: NChien Nguyen <chien.nguyen.eb@rvc.renesas.com>
      Signed-off-by: NChris Brandt <chris.brandt@renesas.com>
      Signed-off-by: NWolfram Sang <wsa@the-dreams.de>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a0f7fd38
    • L
      hwrng: core - don't wait on add_early_randomness() · fec38267
      Laurent Vivier 提交于
      commit 78887832e76541f77169a24ac238fccb51059b63 upstream.
      
      add_early_randomness() is called by hwrng_register() when the
      hardware is added. If this hardware and its module are present
      at boot, and if there is no data available the boot hangs until
      data are available and can't be interrupted.
      
      For instance, in the case of virtio-rng, in some cases the host can be
      not able to provide enough entropy for all the guests.
      
      We can have two easy ways to reproduce the problem but they rely on
      misconfiguration of the hypervisor or the egd daemon:
      
      - if virtio-rng device is configured to connect to the egd daemon of the
      host but when the virtio-rng driver asks for data the daemon is not
      connected,
      
      - if virtio-rng device is configured to connect to the egd daemon of the
      host but the egd daemon doesn't provide data.
      
      The guest kernel will hang at boot until the virtio-rng driver provides
      enough data.
      
      To avoid that, call rng_get_data() in non-blocking mode (wait=0)
      from add_early_randomness().
      Signed-off-by: NLaurent Vivier <lvivier@redhat.com>
      Fixes: d9e79726 ("hwrng: add randomness to system from rng...")
      Cc: <stable@vger.kernel.org>
      Reviewed-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fec38267
    • C
      quota: fix wrong condition in is_quota_modification() · 06098609
      Chao Yu 提交于
      commit 6565c182094f69e4ffdece337d395eb7ec760efc upstream.
      
      Quoted from
      commit 3da40c7b ("ext4: only call ext4_truncate when size <= isize")
      
      " At LSF we decided that if we truncate up from isize we shouldn't trim
        fallocated blocks that were fallocated with KEEP_SIZE and are past the
       new i_size.  This patch fixes ext4 to do this. "
      
      And generic/092 of fstest have covered this case for long time, however
      is_quota_modification() didn't adjust based on that rule, so that in
      below condition, we will lose to quota block change:
      - fallocate blocks beyond EOF
      - remount
      - truncate(file_path, file_size)
      
      Fix it.
      
      Link: https://lore.kernel.org/r/20190911093650.35329-1-yuchao0@huawei.com
      Fixes: 3da40c7b ("ext4: only call ext4_truncate when size <= isize")
      CC: stable@vger.kernel.org
      Signed-off-by: NChao Yu <yuchao0@huawei.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      06098609
    • T
      ext4: fix punch hole for inline_data file systems · 091c754d
      Theodore Ts'o 提交于
      commit c1e8220bd316d8ae8e524df39534b8a412a45d5e upstream.
      
      If a program attempts to punch a hole on an inline data file, we need
      to convert it to a normal file first.
      
      This was detected using ext4/032 using the adv configuration.  Simple
      reproducer:
      
      mke2fs -Fq -t ext4 -O inline_data /dev/vdc
      mount /vdc
      echo "" > /vdc/testfile
      xfs_io -c 'truncate 33554432' /vdc/testfile
      xfs_io -c 'fpunch 0 1048576' /vdc/testfile
      umount /vdc
      e2fsck -fy /dev/vdc
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      091c754d
    • R
      ext4: fix warning inside ext4_convert_unwritten_extents_endio · 775e3e73
      Rakesh Pandit 提交于
      commit e3d550c2c4f2f3dba469bc3c4b83d9332b4e99e1 upstream.
      
      Really enable warning when CONFIG_EXT4_DEBUG is set and fix missing
      first argument.  This was introduced in commit ff95ec22 ("ext4:
      add warning to ext4_convert_unwritten_extents_endio") and splitting
      extents inside endio would trigger it.
      
      Fixes: ff95ec22 ("ext4: add warning to ext4_convert_unwritten_extents_endio")
      Signed-off-by: NRakesh Pandit <rakesh@tuxera.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      775e3e73
    • T
      /dev/mem: Bail out upon SIGKILL. · caa6926d
      Tetsuo Handa 提交于
      commit 8619e5bdeee8b2c685d686281f2d2a6017c4bc15 upstream.
      
      syzbot found that a thread can stall for minutes inside read_mem() or
      write_mem() after that thread was killed by SIGKILL [1]. Reading from
      iomem areas of /dev/mem can be slow, depending on the hardware.
      While reading 2GB at one read() is legal, delaying termination of killed
      thread for minutes is bad. Thus, allow reading/writing /dev/mem and
      /dev/kmem to be preemptible and killable.
      
        [ 1335.912419][T20577] read_mem: sz=4096 count=2134565632
        [ 1335.943194][T20577] read_mem: sz=4096 count=2134561536
        [ 1335.978280][T20577] read_mem: sz=4096 count=2134557440
        [ 1336.011147][T20577] read_mem: sz=4096 count=2134553344
        [ 1336.041897][T20577] read_mem: sz=4096 count=2134549248
      
      Theoretically, reading/writing /dev/mem and /dev/kmem can become
      "interruptible". But this patch chose "killable". Future patch will make
      them "interruptible" so that we can revert to "killable" if some program
      regressed.
      
      [1] https://syzkaller.appspot.com/bug?id=a0e3436829698d5824231251fad9d8e998f94f5eSigned-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: stable <stable@vger.kernel.org>
      Reported-by: Nsyzbot <syzbot+8ab2d0f39fb79fe6ca40@syzkaller.appspotmail.com>
      Link: https://lore.kernel.org/r/1566825205-10703-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jpSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      caa6926d
    • D
      cfg80211: Purge frame registrations on iftype change · bd3a11af
      Denis Kenzior 提交于
      commit c1d3ad84eae35414b6b334790048406bd6301b12 upstream.
      
      Currently frame registrations are not purged, even when changing the
      interface type.  This can lead to potentially weird situations where
      frames possibly not allowed on a given interface type remain registered
      due to the type switching happening after registration.
      
      The kernel currently relies on userspace apps to actually purge the
      registrations themselves, this is not something that the kernel should
      rely on.
      
      Add a call to cfg80211_mlme_purge_registrations() to forcefully remove
      any registrations left over prior to switching the iftype.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NDenis Kenzior <denkenz@gmail.com>
      Link: https://lore.kernel.org/r/20190828211110.15005-1-denkenz@gmail.comSigned-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bd3a11af
    • N
      md: only call set_in_sync() when it is expected to succeed. · 5dc86e95
      NeilBrown 提交于
      commit 480523feae581ab714ba6610388a3b4619a2f695 upstream.
      
      Since commit 4ad23a97 ("MD: use per-cpu counter for
      writes_pending"), set_in_sync() is substantially more expensive: it
      can wait for a full RCU grace period which can be 10s of milliseconds.
      
      So we should only call it when the cost is justified.
      
      md_check_recovery() currently calls set_in_sync() every time it finds
      anything to do (on non-external active arrays).  For an array
      performing resync or recovery, this will be quite often.
      Each call will introduce a delay to the md thread, which can noticeable
      affect IO submission latency.
      
      In md_check_recovery() we only need to call set_in_sync() if
      'safemode' was non-zero at entry, meaning that there has been not
      recent IO.  So we save this "safemode was nonzero" state, and only
      call set_in_sync() if it was non-zero.
      
      This measurably reduces mean and maximum IO submission latency during
      resync/recovery.
      Reported-and-tested-by: NJack Wang <jinpu.wang@cloud.ionos.com>
      Fixes: 4ad23a97 ("MD: use per-cpu counter for writes_pending")
      Cc: stable@vger.kernel.org (v4.12+)
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5dc86e95
    • N
      md: don't report active array_state until after revalidate_disk() completes. · 598a2cda
      NeilBrown 提交于
      commit 9d4b45d6af442237560d0bb5502a012baa5234b7 upstream.
      
      Until revalidate_disk() has completed, the size of a new md array will
      appear to be zero.
      So we shouldn't report, through array_state, that the array is active
      until that time.
      udev rules check array_state to see if the array is ready.  As soon as
      it appear to be zero, fsck can be run.  If it find the size to be
      zero, it will fail.
      
      So add a new flag to provide an interlock between do_md_run() and
      array_state_show().  This flag is set while do_md_run() is active and
      it prevents array_state_show() from reporting that the array is
      active.
      
      Before do_md_run() is called, ->pers will be NULL so array is
      definitely not active.
      After do_md_run() is called, revalidate_disk() will have run and the
      array will be completely ready.
      
      We also move various sysfs_notify*() calls out of md_run() into
      do_md_run() after MD_NOT_READY is cleared.  This ensure the
      information is ready before the notification is sent.
      
      Prior to v4.12, array_state_show() was called with the
      mddev->reconfig_mutex held, which provided exclusion with do_md_run().
      
      Note that MD_NOT_READY cleared twice.  This is deliberate to cover
      both success and error paths with minimal noise.
      
      Fixes: b7b17c9b ("md: remove mddev_lock() from md_attr_show()")
      Cc: stable@vger.kernel.org (v4.12++)
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      598a2cda
    • X
      md/raid6: Set R5_ReadError when there is read failure on parity disk · e8323e0d
      Xiao Ni 提交于
      commit 143f6e733b73051cd22dcb80951c6c929da413ce upstream.
      
      7471fb77 ("md/raid6: Fix anomily when recovering a single device in
      RAID6.") avoids rereading P when it can be computed from other members.
      However, this misses the chance to re-write the right data to P. This
      patch sets R5_ReadError if the re-read fails.
      
      Also, when re-read is skipped, we also missed the chance to reset
      rdev->read_errors to 0. It can fail the disk when there are many read
      errors on P member disk (other disks don't have read error)
      
      V2: upper layer read request don't read parity/Q data. So there is no
      need to consider such situation.
      
      This is Reported-by: kbuild test robot <lkp@intel.com>
      
      Fixes: 7471fb77 ("md/raid6: Fix anomily when recovering a single device in RAID6.")
      Cc: <stable@vger.kernel.org> #4.4+
      Signed-off-by: NXiao Ni <xni@redhat.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e8323e0d
    • F
      Btrfs: fix race setting up and completing qgroup rescan workers · bacff03b
      Filipe Manana 提交于
      commit 13fc1d271a2e3ab8a02071e711add01fab9271f6 upstream.
      
      There is a race between setting up a qgroup rescan worker and completing
      a qgroup rescan worker that can lead to callers of the qgroup rescan wait
      ioctl to either not wait for the rescan worker to complete or to hang
      forever due to missing wake ups. The following diagram shows a sequence
      of steps that illustrates the race.
      
              CPU 1                                                         CPU 2                                  CPU 3
      
       btrfs_ioctl_quota_rescan()
        btrfs_qgroup_rescan()
         qgroup_rescan_init()
          mutex_lock(&fs_info->qgroup_rescan_lock)
          spin_lock(&fs_info->qgroup_lock)
      
          fs_info->qgroup_flags |=
            BTRFS_QGROUP_STATUS_FLAG_RESCAN
      
          init_completion(
            &fs_info->qgroup_rescan_completion)
      
          fs_info->qgroup_rescan_running = true
      
          mutex_unlock(&fs_info->qgroup_rescan_lock)
          spin_unlock(&fs_info->qgroup_lock)
      
          btrfs_init_work()
           --> starts the worker
      
                                                              btrfs_qgroup_rescan_worker()
                                                               mutex_lock(&fs_info->qgroup_rescan_lock)
      
                                                               fs_info->qgroup_flags &=
                                                                 ~BTRFS_QGROUP_STATUS_FLAG_RESCAN
      
                                                               mutex_unlock(&fs_info->qgroup_rescan_lock)
      
                                                               starts transaction, updates qgroup status
                                                               item, etc
      
                                                                                                                 btrfs_ioctl_quota_rescan()
                                                                                                                  btrfs_qgroup_rescan()
                                                                                                                   qgroup_rescan_init()
                                                                                                                    mutex_lock(&fs_info->qgroup_rescan_lock)
                                                                                                                    spin_lock(&fs_info->qgroup_lock)
      
                                                                                                                    fs_info->qgroup_flags |=
                                                                                                                      BTRFS_QGROUP_STATUS_FLAG_RESCAN
      
                                                                                                                    init_completion(
                                                                                                                      &fs_info->qgroup_rescan_completion)
      
                                                                                                                    fs_info->qgroup_rescan_running = true
      
                                                                                                                    mutex_unlock(&fs_info->qgroup_rescan_lock)
                                                                                                                    spin_unlock(&fs_info->qgroup_lock)
      
                                                                                                                    btrfs_init_work()
                                                                                                                     --> starts another worker
      
                                                               mutex_lock(&fs_info->qgroup_rescan_lock)
      
                                                               fs_info->qgroup_rescan_running = false
      
                                                               mutex_unlock(&fs_info->qgroup_rescan_lock)
      
      							 complete_all(&fs_info->qgroup_rescan_completion)
      
      Before the rescan worker started by the task at CPU 3 completes, if
      another task calls btrfs_ioctl_quota_rescan(), it will get -EINPROGRESS
      because the flag BTRFS_QGROUP_STATUS_FLAG_RESCAN is set at
      fs_info->qgroup_flags, which is expected and correct behaviour.
      
      However if other task calls btrfs_ioctl_quota_rescan_wait() before the
      rescan worker started by the task at CPU 3 completes, it will return
      immediately without waiting for the new rescan worker to complete,
      because fs_info->qgroup_rescan_running is set to false by CPU 2.
      
      This race is making test case btrfs/171 (from fstests) to fail often:
      
        btrfs/171 9s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/171.out.bad)
      #      --- tests/btrfs/171.out     2018-09-16 21:30:48.505104287 +0100
      #      +++ /home/fdmanana/git/hub/xfstests/results//btrfs/171.out.bad      2019-09-19 02:01:36.938486039 +0100
      #      @@ -1,2 +1,3 @@
      #       QA output created by 171
      #      +ERROR: quota rescan failed: Operation now in progress
      #       Silence is golden
      #      ...
      #      (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/171.out /home/fdmanana/git/hub/xfstests/results//btrfs/171.out.bad'  to see the entire diff)
      
      That is because the test calls the btrfs-progs commands "qgroup quota
      rescan -w", "qgroup assign" and "qgroup remove" in a sequence that makes
      calls to the rescan start ioctl fail with -EINPROGRESS (note the "btrfs"
      commands 'qgroup assign' and 'qgroup remove' often call the rescan start
      ioctl after calling the qgroup assign ioctl,
      btrfs_ioctl_qgroup_assign()), since previous waits didn't actually wait
      for a rescan worker to complete.
      
      Another problem the race can cause is missing wake ups for waiters,
      since the call to complete_all() happens outside a critical section and
      after clearing the flag BTRFS_QGROUP_STATUS_FLAG_RESCAN. In the sequence
      diagram above, if we have a waiter for the first rescan task (executed
      by CPU 2), then fs_info->qgroup_rescan_completion.wait is not empty, and
      if after the rescan worker clears BTRFS_QGROUP_STATUS_FLAG_RESCAN and
      before it calls complete_all() against
      fs_info->qgroup_rescan_completion, the task at CPU 3 calls
      init_completion() against fs_info->qgroup_rescan_completion which
      re-initilizes its wait queue to an empty queue, therefore causing the
      rescan worker at CPU 2 to call complete_all() against an empty queue,
      never waking up the task waiting for that rescan worker.
      
      Fix this by clearing BTRFS_QGROUP_STATUS_FLAG_RESCAN and setting
      fs_info->qgroup_rescan_running to false in the same critical section,
      delimited by the mutex fs_info->qgroup_rescan_lock, as well as doing the
      call to complete_all() in that same critical section. This gives the
      protection needed to avoid rescan wait ioctl callers not waiting for a
      running rescan worker and the lost wake ups problem, since setting that
      rescan flag and boolean as well as initializing the wait queue is done
      already in a critical section delimited by that mutex (at
      qgroup_rescan_init()).
      
      Fixes: 57254b6e ("Btrfs: add ioctl to wait for qgroup rescan completion")
      Fixes: d2c609b8 ("btrfs: properly track when rescan worker is running")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bacff03b
    • Q
      btrfs: qgroup: Fix reserved data space leak if we have multiple reserve calls · b5c42ef0
      Qu Wenruo 提交于
      commit d4e204948fe3e0dc8e1fbf3f8f3290c9c2823be3 upstream.
      
      [BUG]
      The following script can cause btrfs qgroup data space leak:
      
        mkfs.btrfs -f $dev
        mount $dev -o nospace_cache $mnt
      
        btrfs subv create $mnt/subv
        btrfs quota en $mnt
        btrfs quota rescan -w $mnt
        btrfs qgroup limit 128m $mnt/subv
      
        for (( i = 0; i < 3; i++)); do
                # Create 3 64M holes for latter fallocate to fail
                truncate -s 192m $mnt/subv/file
                xfs_io -c "pwrite 64m 4k" $mnt/subv/file > /dev/null
                xfs_io -c "pwrite 128m 4k" $mnt/subv/file > /dev/null
                sync
      
                # it's supposed to fail, and each failure will leak at least 64M
                # data space
                xfs_io -f -c "falloc 0 192m" $mnt/subv/file &> /dev/null
                rm $mnt/subv/file
                sync
        done
      
        # Shouldn't fail after we removed the file
        xfs_io -f -c "falloc 0 64m" $mnt/subv/file
      
      [CAUSE]
      Btrfs qgroup data reserve code allow multiple reservations to happen on
      a single extent_changeset:
      E.g:
      	btrfs_qgroup_reserve_data(inode, &data_reserved, 0, SZ_1M);
      	btrfs_qgroup_reserve_data(inode, &data_reserved, SZ_1M, SZ_2M);
      	btrfs_qgroup_reserve_data(inode, &data_reserved, 0, SZ_4M);
      
      Btrfs qgroup code has its internal tracking to make sure we don't
      double-reserve in above example.
      
      The only pattern utilizing this feature is in the main while loop of
      btrfs_fallocate() function.
      
      However btrfs_qgroup_reserve_data()'s error handling has a bug in that
      on error it clears all ranges in the io_tree with EXTENT_QGROUP_RESERVED
      flag but doesn't free previously reserved bytes.
      
      This bug has a two fold effect:
      - Clearing EXTENT_QGROUP_RESERVED ranges
        This is the correct behavior, but it prevents
        btrfs_qgroup_check_reserved_leak() to catch the leakage as the
        detector is purely EXTENT_QGROUP_RESERVED flag based.
      
      - Leak the previously reserved data bytes.
      
      The bug manifests when N calls to btrfs_qgroup_reserve_data are made and
      the last one fails, leaking space reserved in the previous ones.
      
      [FIX]
      Also free previously reserved data bytes when btrfs_qgroup_reserve_data
      fails.
      
      Fixes: 52472553 ("btrfs: qgroup: Introduce btrfs_qgroup_reserve_data function")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b5c42ef0
    • Q
      btrfs: qgroup: Fix the wrong target io_tree when freeing reserved data space · c521bfa8
      Qu Wenruo 提交于
      commit bab32fc069ce8829c416e8737c119f62a57970f9 upstream.
      
      [BUG]
      Under the following case with qgroup enabled, if some error happened
      after we have reserved delalloc space, then in error handling path, we
      could cause qgroup data space leakage:
      
      From btrfs_truncate_block() in inode.c:
      
      	ret = btrfs_delalloc_reserve_space(inode, &data_reserved,
      					   block_start, blocksize);
      	if (ret)
      		goto out;
      
       again:
      	page = find_or_create_page(mapping, index, mask);
      	if (!page) {
      		btrfs_delalloc_release_space(inode, data_reserved,
      					     block_start, blocksize, true);
      		btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, true);
      		ret = -ENOMEM;
      		goto out;
      	}
      
      [CAUSE]
      In the above case, btrfs_delalloc_reserve_space() will call
      btrfs_qgroup_reserve_data() and mark the io_tree range with
      EXTENT_QGROUP_RESERVED flag.
      
      In the error handling path, we have the following call stack:
      btrfs_delalloc_release_space()
      |- btrfs_free_reserved_data_space()
         |- btrsf_qgroup_free_data()
            |- __btrfs_qgroup_release_data(reserved=@reserved, free=1)
               |- qgroup_free_reserved_data(reserved=@reserved)
                  |- clear_record_extent_bits();
                  |- freed += changeset.bytes_changed;
      
      However due to a completion bug, qgroup_free_reserved_data() will clear
      EXTENT_QGROUP_RESERVED flag in BTRFS_I(inode)->io_failure_tree, other
      than the correct BTRFS_I(inode)->io_tree.
      Since io_failure_tree is never marked with that flag,
      btrfs_qgroup_free_data() will not free any data reserved space at all,
      causing a leakage.
      
      This type of error handling can only be triggered by errors outside of
      qgroup code. So EDQUOT error from qgroup can't trigger it.
      
      [FIX]
      Fix the wrong target io_tree.
      Reported-by: NJosef Bacik <josef@toxicpanda.com>
      Fixes: bc42bda2 ("btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges")
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c521bfa8
    • N
      btrfs: Relinquish CPUs in btrfs_compare_trees · 067f82a0
      Nikolay Borisov 提交于
      commit 6af112b11a4bc1b560f60a618ac9c1dcefe9836e upstream.
      
      When doing any form of incremental send the parent and the child trees
      need to be compared via btrfs_compare_trees. This  can result in long
      loop chains without ever relinquishing the CPU. This causes softlockup
      detector to trigger when comparing trees with a lot of items. Example
      report:
      
      watchdog: BUG: soft lockup - CPU#0 stuck for 24s! [snapperd:16153]
      CPU: 0 PID: 16153 Comm: snapperd Not tainted 5.2.9-1-default #1 openSUSE Tumbleweed (unreleased)
      Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
      pstate: 40000005 (nZcv daif -PAN -UAO)
      pc : __ll_sc_arch_atomic_sub_return+0x14/0x20
      lr : btrfs_release_extent_buffer_pages+0xe0/0x1e8 [btrfs]
      sp : ffff00001273b7e0
      Call trace:
       __ll_sc_arch_atomic_sub_return+0x14/0x20
       release_extent_buffer+0xdc/0x120 [btrfs]
       free_extent_buffer.part.0+0xb0/0x118 [btrfs]
       free_extent_buffer+0x24/0x30 [btrfs]
       btrfs_release_path+0x4c/0xa0 [btrfs]
       btrfs_free_path.part.0+0x20/0x40 [btrfs]
       btrfs_free_path+0x24/0x30 [btrfs]
       get_inode_info+0xa8/0xf8 [btrfs]
       finish_inode_if_needed+0xe0/0x6d8 [btrfs]
       changed_cb+0x9c/0x410 [btrfs]
       btrfs_compare_trees+0x284/0x648 [btrfs]
       send_subvol+0x33c/0x520 [btrfs]
       btrfs_ioctl_send+0x8a0/0xaf0 [btrfs]
       btrfs_ioctl+0x199c/0x2288 [btrfs]
       do_vfs_ioctl+0x4b0/0x820
       ksys_ioctl+0x84/0xb8
       __arm64_sys_ioctl+0x28/0x38
       el0_svc_common.constprop.0+0x7c/0x188
       el0_svc_handler+0x34/0x90
       el0_svc+0x8/0xc
      
      Fix this by adding a call to cond_resched at the beginning of the main
      loop in btrfs_compare_trees.
      
      Fixes: 7069830a ("Btrfs: add btrfs_compare_trees function")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      067f82a0