1. 24 Mar 2019, 16 commits
    • ext4: fix check of inode in swap_inode_boot_loader · cdf9941b
      Committed by yangerkun
      commit 67a11611e1a5211f6569044fbf8150875764d1d0 upstream.
      
      Before actually swapping the inode and the boot inode, several conditions
      need to be checked to reject invalid or unpermitted operations, for
      example whether the inode has inline data.  These condition checks must
      be protected by the inode lock so that they cannot be invalidated while
      the swap is in progress.  Some of the other conditions cannot change
      during the swap, but there is no harm in checking them under the inode
      lock as well (a minimal sketch of the resulting ordering follows this
      entry).
      Signed-off-by: yangerkun <yangerkun@huawei.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      cdf9941b
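      A minimal kernel-style sketch of the ordering described above (not the
      actual patch): it assumes the upstream helpers lock_two_nondirectories()
      and ext4_has_inline_data(), elides most checks and the swap itself, and
      would only build inside fs/ext4/.

        /* Lock both inodes first, only then validate the preconditions,
         * so the checked state cannot change while we swap. */
        static int swap_inode_boot_loader_sketch(struct inode *inode,
                                                 struct inode *inode_bl)
        {
                int err = 0;

                lock_two_nondirectories(inode, inode_bl);

                if (inode->i_nlink != 1 || !S_ISREG(inode->i_mode) ||
                    ext4_has_inline_data(inode)) {
                        err = -EINVAL;
                        goto out_unlock;
                }

                /* ... swap i_data/extent trees of the two inodes ... */

        out_unlock:
                unlock_two_nondirectories(inode, inode_bl);
                return err;
        }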
    • Btrfs: fix corruption reading shared and compressed extents after hole punching · 898488e2
      Committed by Filipe Manana
      commit 8e928218780e2f1cf2f5891c7575e8f0b284fcce upstream.
      
      In the past we had data corruption when reading compressed extents that
      are shared within the same file and they are consecutive, this got fixed
      by commit 005efedf ("Btrfs: fix read corruption of compressed and
      shared extents") and by commit 808f80b4 ("Btrfs: update fix for read
      corruption of compressed and shared extents"). However there was a case
      that was missing in those fixes, which is when the shared and compressed
      extents are referenced with a non-zero offset. The following shell script
      creates a reproducer for this issue:
      
        #!/bin/bash
      
        mkfs.btrfs -f /dev/sdc &> /dev/null
        mount -o compress /dev/sdc /mnt/sdc
      
        # Create a file with 3 consecutive compressed extents, each has an
        # uncompressed size of 128Kb and a compressed size of 4Kb.
        for ((i = 1; i <= 3; i++)); do
            head -c 4096 /dev/zero
            for ((j = 1; j <= 31; j++)); do
                head -c 4096 /dev/zero | tr '\0' "\377"
            done
        done > /mnt/sdc/foobar
        sync
      
        echo "Digest after file creation:   $(md5sum /mnt/sdc/foobar)"
      
        # Clone the first extent into offsets 128K and 256K.
        xfs_io -c "reflink /mnt/sdc/foobar 0 128K 128K" /mnt/sdc/foobar
        xfs_io -c "reflink /mnt/sdc/foobar 0 256K 128K" /mnt/sdc/foobar
        sync
      
        echo "Digest after cloning:         $(md5sum /mnt/sdc/foobar)"
      
        # Punch holes into the regions that are already full of zeroes.
        xfs_io -c "fpunch 0 4K" /mnt/sdc/foobar
        xfs_io -c "fpunch 128K 4K" /mnt/sdc/foobar
        xfs_io -c "fpunch 256K 4K" /mnt/sdc/foobar
        sync
      
        echo "Digest after hole punching:   $(md5sum /mnt/sdc/foobar)"
      
        echo "Dropping page cache..."
        sysctl -q vm.drop_caches=1
        echo "Digest after hole punching:   $(md5sum /mnt/sdc/foobar)"
      
        umount /dev/sdc
      
      When running the script we get the following output:
      
        Digest after file creation:   5a0888d80d7ab1fd31c229f83a3bbcc8  /mnt/sdc/foobar
        linked 131072/131072 bytes at offset 131072
        128 KiB, 1 ops; 0.0033 sec (36.960 MiB/sec and 295.6830 ops/sec)
        linked 131072/131072 bytes at offset 262144
        128 KiB, 1 ops; 0.0015 sec (78.567 MiB/sec and 628.5355 ops/sec)
        Digest after cloning:         5a0888d80d7ab1fd31c229f83a3bbcc8  /mnt/sdc/foobar
        Digest after hole punching:   5a0888d80d7ab1fd31c229f83a3bbcc8  /mnt/sdc/foobar
        Dropping page cache...
        Digest after hole punching:   fba694ae8664ed0c2e9ff8937e7f1484  /mnt/sdc/foobar
      
      This happens because after reading all the pages of the extent in the
      range from 128K to 256K, for example, we read the hole at offset 256K,
      and then, when reading the page at offset 260K, we don't submit the
      existing bio.  That bio is responsible for filling only the pages in the
      range 128K to 256K, yet we keep adding the pages from the range 260K to
      384K to it and submit it only after iterating over the entire range.
      Once the bio completes, the uncompressed data fills only the pages in
      the range 128K to 256K, because there is no more data read from disk,
      leaving the pages in the range 260K to 384K unfilled.  It is just a
      slightly different variant of what was solved by commit
      005efedf ("Btrfs: fix read corruption of compressed and shared
      extents").

      Fix this by forcing a bio submit, during readpages(), whenever we find a
      compressed extent map for a page that is different from the extent map
      for the previous page, or that has a different starting offset (in case
      it's the same compressed extent), instead of comparing against the
      extent map's original start offset (a sketch of the check follows this
      entry).
      
      A test case for fstests follows soon.
      Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Fixes: 808f80b4 ("Btrfs: update fix for read corruption of compressed and shared extents")
      Fixes: 005efedf ("Btrfs: fix read corruption of compressed and shared extents")
      Cc: stable@vger.kernel.org # 4.3+
      Tested-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      898488e2
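      A minimal sketch of the submit condition described above; em,
      prev_em_start and force_bio_submit are assumed to be the extent map and
      loop-local state of the readpages code, so treat this as an illustration
      of the idea rather than the exact hunk.

        /* If this page's compressed extent map starts at a different offset
         * than the previous page's, the pending bio only covers the previous
         * extent, so force it to be submitted before adding this page. */
        if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags) &&
            prev_em_start && *prev_em_start != (u64)-1 &&
            *prev_em_start != em->start)
                force_bio_submit = true;

        if (prev_em_start)
                *prev_em_start = em->start;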
    • btrfs: ensure that a DUP or RAID1 block group has exactly two stripes · 1a00f7fd
      Committed by Johannes Thumshirn
      commit 349ae63f40638a28c6fce52e8447c2d14b84cc0c upstream.
      
      We recently had a customer issue with a corrupted filesystem.  When
      trying to mount this image, btrfs panicked with a division by zero in
      calc_stripe_length().

      The corrupt chunk had a 'num_stripes' value of 1.  calc_stripe_length()
      takes this value and divides it by the number of copies the RAID profile
      is expected to have, to calculate the number of data stripes.  As a DUP
      profile is expected to have 2 copies, this division resulted in 1/2 = 0.
      The 'data_stripes' variable is later used as a divisor in the stripe
      length calculation, which results in a division by 0 and thus a kernel
      panic.

      When encountering a filesystem with a DUP block group and a
      'num_stripes' value other than 2, refuse to mount, as the image is
      corrupted and will lead to unexpected behaviour (a sketch of the check
      follows this entry).

      Code inspection showed that a RAID1 block group has the same issue.
      
      Fixes: e06cd3dd ("Btrfs: add validadtion checks for chunk loading")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1a00f7fd
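      A minimal sketch of the added validation, assuming the usual chunk-item
      fields (type, num_stripes) have already been read out; the error message
      and return value are illustrative.

        /* A DUP or RAID1 chunk must carry exactly two stripes; anything else
         * is a corrupted chunk item, so refuse the mount. */
        if ((type & BTRFS_BLOCK_GROUP_DUP && num_stripes != 2) ||
            (type & BTRFS_BLOCK_GROUP_RAID1 && num_stripes != 2)) {
                btrfs_err(fs_info,
                          "invalid chunk: DUP/RAID1 with %u stripes, expected 2",
                          num_stripes);
                return -EIO;
        }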
    • Btrfs: setup a nofs context for memory allocation at __btrfs_set_acl · 6e24f5a1
      Committed by Filipe Manana
      commit a0873490660246db587849a9e172f2b7b21fa88a upstream.
      
      We are holding a transaction handle when setting an ACL, so we cannot
      allocate the xattr value buffer using GFP_KERNEL: if the allocation
      triggered reclaim, we could deadlock.  Set up a nofs context around the
      allocation instead.
      
      Fixes: 39a27ec1 ("btrfs: use GFP_KERNEL for xattr and acl allocations")
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6e24f5a1
    • Btrfs: setup a nofs context for memory allocation at btrfs_create_tree() · 61f92096
      Committed by Filipe Manana
      commit b89f6d1fcb30a8cbdc18ce00c7d93792076af453 upstream.
      
      We are holding a transaction handle when creating a tree, so we cannot
      allocate the root using GFP_KERNEL: if the allocation triggered reclaim,
      we could deadlock.  Set up a nofs context around the allocation instead
      (a sketch of the pattern, shared with the previous fix, follows this
      entry).
      
      Fixes: 74e4d827 ("btrfs: let callers of btrfs_alloc_root pass gfp flags")
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      61f92096
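      Both nofs fixes above rely on the scoped-NOFS API from
      <linux/sched/mm.h>; a minimal sketch of the shared pattern, where the
      kzalloc() stands in for the xattr-value or root allocation:

        unsigned int nofs_flag;
        void *buf;

        /* A transaction handle is held: make GFP_KERNEL allocations behave
         * as GFP_NOFS so reclaim cannot recurse into the filesystem. */
        nofs_flag = memalloc_nofs_save();
        buf = kzalloc(size, GFP_KERNEL);
        memalloc_nofs_restore(nofs_flag);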
    • ovl: Do not lose security.capability xattr over metadata file copy-up · 205f149f
      Committed by Vivek Goyal
      commit 993a0b2aec52754f0897b1dab4c453be8217cae5 upstream.
      
      If a file has been copied up metadata only, and its data is copied up
      later, the upper file loses any security.capability xattr it has (the
      underlying filesystem clears it upon file write).

      From a user's point of view this is just a file copy-up, and it should
      not result in losing the security.capability xattr.  Hence, before the
      data copy-up, save the security.capability xattr (if any) and restore it
      on the upper file after the data copy-up is complete (a minimal sketch
      follows this entry).
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Reviewed-by: Amir Goldstein <amir73il@gmail.com>
      Fixes: 0c288874 ("ovl: A new xattr OVL_XATTR_METACOPY for file on upper")
      Cc: <stable@vger.kernel.org> # v4.19+
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      205f149f
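      A minimal sketch of the save/restore described above; the buffer
      handling and call sites are simplified assumptions, and the
      vfs_getxattr()/vfs_setxattr() signatures are the v4.19-era ones.

        char *cap = NULL;
        ssize_t cap_size;

        /* Save security.capability (if present) before the data copy-up. */
        cap_size = vfs_getxattr(upperdentry, XATTR_NAME_CAPS, NULL, 0);
        if (cap_size > 0) {
                cap = kzalloc(cap_size, GFP_KERNEL);
                if (cap)
                        cap_size = vfs_getxattr(upperdentry, XATTR_NAME_CAPS,
                                                cap, cap_size);
        }

        /* ... copy up the file data, which clears the xattr on upper ... */

        /* Restore the saved capability afterwards. */
        if (cap && cap_size > 0)
                vfs_setxattr(upperdentry, XATTR_NAME_CAPS, cap, cap_size, 0);
        kfree(cap);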
    • ovl: During copy up, first copy up data and then xattrs · 6f048ae2
      Committed by Vivek Goyal
      commit 5f32879ea35523b9842bdbdc0065e13635caada2 upstream.
      
      If a file with a capability set (and hence a security.capability xattr)
      is written, the kernel clears the security.capability xattr.  For
      overlayfs, during file copy-up the xattrs are currently copied up first
      and then the data.  This means the data copy-up clears the
      security.capability xattr the file on lower has, which can lead to
      surprises: if a lower file has CAP_SETUID, that capability should not be
      lost over a copy-up (if nothing was actually written to the file).

      This also creates problems with the chown logic, which first copies up
      the file and then tries to clear the setuid bit.  But by that time the
      security.capability xattr is already gone (due to the data copy-up), and
      the caller gets -ENODATA.  This has been reported by Giuseppe here:

      https://github.com/containers/libpod/issues/2015#issuecomment-447824842

      Fix this by copying up the data first and then the metadata.  This is a
      regression introduced by my commit as part of the metadata-only copy-up
      patches.

      TODO: there are still corner cases where a file is copied up metadata
      only and a later data copy-up clears the security.capability xattr.
      Something needs to be done about that too.
      
      Fixes: bd64e575 ("ovl: During copy up, first copy up metadata and then data")
      Cc: <stable@vger.kernel.org> # v4.19+
      Reported-by: Giuseppe Scrivano <gscrivan@redhat.com>
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6f048ae2
    • splice: don't merge into linked buffers · 2af926fd
      Committed by Jann Horn
      commit a0ce2f0aa6ad97c3d4927bf2ca54bcebdf062d55 upstream.
      
      Before this patch, it was possible for two pipes to affect each other after
      data had been transferred between them with tee():
      
      ============
      $ cat tee_test.c
      
      #define _GNU_SOURCE
      #include <err.h>
      #include <fcntl.h>
      #include <stdio.h>
      #include <unistd.h>

      int main(void) {
        int pipe_a[2];
        if (pipe(pipe_a)) err(1, "pipe");
        int pipe_b[2];
        if (pipe(pipe_b)) err(1, "pipe");
        if (write(pipe_a[1], "abcd", 4) != 4) err(1, "write");
        if (tee(pipe_a[0], pipe_b[1], 2, 0) != 2) err(1, "tee");
        if (write(pipe_b[1], "xx", 2) != 2) err(1, "write");

        char buf[5];
        if (read(pipe_a[0], buf, 4) != 4) err(1, "read");
        buf[4] = 0;
        printf("got back: '%s'\n", buf);
      }
      $ gcc -o tee_test tee_test.c
      $ ./tee_test
      got back: 'abxx'
      $
      ============
      
      As suggested by Al Viro, fix it by creating a separate type for
      non-mergeable pipe buffers, then changing the types of buffers in
      splice_pipe_to_pipe() and link_pipe().
      
      Cc: <stable@vger.kernel.org>
      Fixes: 7c77f0b3 ("splice: implement pipe to pipe splicing")
      Fixes: 70524490 ("[PATCH] splice: add support for sys_tee()")
      Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2af926fd
    • fs/devpts: always delete dcache dentry-s in dput() · 1c2123ff
      Committed by Varad Gautam
      commit 73052b0daee0b750b39af18460dfec683e4f5887 upstream.
      
      d_delete only unhashes an entry if it is reached with
      dentry->d_lockref.count != 1. Prior to commit 8ead9dd5 ("devpts:
      more pty driver interface cleanups"), d_delete was called on a dentry
      from devpts_pty_kill with two references held, which would trigger the
      unhashing, and the subsequent dputs would release it.
      
      Commit 8ead9dd5 reworked devpts_pty_kill to stop acquiring the second
      reference from d_find_alias, and the d_delete call left the dentries
      still on the hashed list without actually ever being dropped from dcache
      before explicit cleanup. This causes the number of negative dentries for
      devpts to pile up, and an `ls /dev/pts` invocation can take seconds to
      return.
      
      Provide always_delete_dentry() from simple_dentry_operations as
      .d_delete for devpts, so that such dentries are dropped from the dcache
      (a minimal sketch follows this entry).

      Without this cleanup, the number of dentries in /dev/pts/ can grow
      arbitrarily via:
      
      `python -c 'import pty; pty.spawn(["ls", "/dev/pts"])'`
      
      A systemtap probe on dcache_readdir to count d_subdirs shows this count
      to increase with each pty spawn invocation above:
      
      probe kernel.function("dcache_readdir") {
          subdirs = &@cast($file->f_path->dentry, "dentry")->d_subdirs;
          p = subdirs;
          p = @cast(p, "list_head")->next;
          i = 0
          while (p != subdirs) {
            p = @cast(p, "list_head")->next;
            i = i+1;
          }
          printf("number of dentries: %d\n", i);
      }
      
      Fixes: 8ead9dd5 ("devpts: more pty driver interface cleanups")
      Signed-off-by: Varad Gautam <vrd@amazon.de>
      Reported-by: Zheng Wang <wanz@amazon.de>
      Reported-by: Brandon Schwartz <bsschwar@amazon.de>
      Root-caused-by: Maximilian Heyne <mheyne@amazon.de>
      Root-caused-by: Nicolas Pernas Maradei <npernas@amazon.de>
      CC: David Woodhouse <dwmw@amazon.co.uk>
      CC: Maximilian Heyne <mheyne@amazon.de>
      CC: Stefan Nuernberger <snu@amazon.de>
      CC: Amit Shah <aams@amazon.de>
      CC: Linus Torvalds <torvalds@linux-foundation.org>
      CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      CC: Al Viro <viro@ZenIV.linux.org.uk>
      CC: Christian Brauner <christian.brauner@ubuntu.com>
      CC: Eric W. Biederman <ebiederm@xmission.com>
      CC: Matthew Wilcox <willy@infradead.org>
      CC: Eric Biggers <ebiggers@google.com>
      CC: <stable@vger.kernel.org> # 4.9+
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1c2123ff
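      A minimal sketch of the change described above; always_delete_dentry()
      is the libfs helper (also exposed via simple_dentry_operations), and the
      superblock-setup call site shown is an assumption.

        static const struct dentry_operations devpts_dops = {
                /* drop every devpts dentry from the dcache on final dput() */
                .d_delete = always_delete_dentry,
        };

        /* during devpts fill_super(): */
        sb->s_d_op = &devpts_dops;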
    • CIFS: Fix read after write for files with read caching · 43eaa6cc
      Committed by Pavel Shilovsky
      commit 6dfbd84684700cb58b34e8602c01c12f3d2595c8 upstream.
      
      When we have a READ lease for a file and have just issued a write
      operation to the server, we need to purge the cache and set the
      oplock/lease level to NONE to avoid reading stale data.  Currently we do
      that only if the write operation succeeded, which does not cover cases
      where a request was sent to the server but a negative error code was
      returned later for some other reason (e.g. -EIOCBQUEUED or -EINTR).
      Fix this by turning off caching regardless of the error code being
      returned.

      The patch fixes generic tests 075 and 112 from xfstests.
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
      Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      43eaa6cc
    • CIFS: Do not skip SMB2 message IDs on send failures · dc8e8ad9
      Committed by Pavel Shilovsky
      commit c781af7e0c1fed9f1d0e0ec31b86f5b21a8dca17 upstream.
      
      When we hit failures during constructing MIDs or sending PDUs
      through the network, we end up not using message IDs assigned
      to the packet. The next SMB packet will skip those message IDs
      and continue with the next one. This behavior may lead to a server
      not granting us credits until we use the skipped IDs. Fix this by
      reverting the current ID to the original value if any errors occur
      before we push the packet through the network stack.
      
      This patch fixes the generic/310 test from xfstests.
      
      Cc: <stable@vger.kernel.org> # 4.19.x
      Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      dc8e8ad9
    • CIFS: Do not reset lease state to NONE on lease break · 3ed9f22e
      Committed by Pavel Shilovsky
      commit 7b9b9edb49ad377b1e06abf14354c227e9ac4b06 upstream.
      
      Currently on lease break the client sets a caching level twice: when the
      oplock is detected and when the oplock is processed.  While the first
      attempt sets the level to the value provided by the server, the second
      one resets the level to None unconditionally.  This happens because the
      oplock/lease processing code was changed to avoid races between page
      cache flushes and oplock breaks.  The commit c11f1df5 ("cifs: Wait for
      writebacks to complete before attempting write.") fixed the races for
      oplocks but didn't apply the same changes for leases, resulting in the
      server-granted value being overwritten with None.  Fix this by properly
      processing lease breaks.
      Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
      CC: Stable <stable@vger.kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3ed9f22e
    • fix cgroup_do_mount() handling of failure exits · 7a8b0484
      Committed by Al Viro
      commit 399504e21a10be16dd1408ba0147367d9d82a10c upstream.
      
      Same story as with last May's fixes in sysfs (7b745a4e
      "unfuck sysfs_mount()"): new_sb is left uninitialized in case of early
      errors in kernfs_mount_ns(), and papering over that by treating any
      error from kernfs_mount_ns() as equivalent to !new_sb conflates the
      case where objects were never transferred to a superblock with the one
      where that has happened and the resulting new superblock has been
      dropped.  Easily fixed, the same way as in the sysfs case (a minimal
      sketch follows this entry).  Additionally, there is a superblock leak
      on kernfs_node_dentry() failure *and* a dentry leak inside
      kernfs_node_dentry() itself - the latter only on probably-impossible
      errors, but the former is not impossible to trigger (as a matter of
      fact, injecting allocation failures at that point *does* trigger it).
      
      Cc: stable@kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7a8b0484
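      A minimal sketch of the initialization issue described above, using the
      v4.19-era kernfs_mount_ns() signature; the surrounding cgroup mount code
      is elided and the cleanup details are assumptions.

        bool new_sb = false;   /* must not stay uninitialized on early errors */
        struct dentry *dentry;

        dentry = kernfs_mount_ns(fs_type, flags, root->kf_root,
                                 CGROUP_SUPER_MAGIC, &new_sb, ns);
        /* Only treat the superblock as freshly created when kernfs really
         * reported it as such, not merely because the mount failed. */
        if (!IS_ERR(dentry) && new_sb)
                ; /* ... first-mount-only setup ... */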
    • keys: Fix dependency loop between construction record and auth key · 72683282
      Committed by David Howells
      [ Upstream commit 822ad64d7e46a8e2c8b8a796738d7b657cbb146d ]
      
      In the request_key() upcall mechanism there's a dependency loop by which if
      a key type driver overrides the ->request_key hook and the userspace side
      manages to lose the authorisation key, the auth key and the internal
      construction record (struct key_construction) can keep each other pinned.
      
      Fix this by the following changes:
      
       (1) Killing off the construction record and using the auth key instead.
      
       (2) Including the operation name in the auth key payload and making the
           payload available outside of security/keys/.
      
       (3) The ->request_key hook is given the authkey instead of the cons
           record and operation name.
      
      Changes (2) and (3) allow the auth key to naturally be cleaned up if the
      keyring it is in is destroyed or cleared or the auth key is unlinked.
      
      Fixes: 7ee02a316600 ("keys: Fix dependency loop between construction record and auth key")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: James Morris <james.morris@microsoft.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      72683282
    • NFS: Don't use page_file_mapping after removing the page · 6c023d86
      Committed by Benjamin Coddington
      [ Upstream commit d2ceb7e57086750ea6198a31fd942d98099a0786 ]
      
      If nfs_page_async_flush() removes the page from the mapping, then we
      can't use page_file_mapping() on it as nfs_updatepage() is wont to do
      when receiving an error.  Instead, push the mapping to the stack before
      the page is possibly truncated.
      
      Fixes: 8fc75bed96bb ("NFS: Fix up return value on fatal errors in nfs_page_async_flush()")
      Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      6c023d86
    • 9p: use inode->i_lock to protect i_size_write() under 32-bit · e08ba890
      Committed by Hou Tao
      commit 5e3cc1ee1405a7eb3487ed24f786dec01b4cbe1f upstream.
      
      Use inode->i_lock to protect i_size_write(), else i_size_read() in
      generic_fillattr() may loop infinitely in read_seqcount_begin() when
      multiple processes invoke v9fs_vfs_getattr() or v9fs_vfs_getattr_dotl()
      simultaneously in a 32-bit SMP environment, and a soft lockup will be
      triggered as shown below (a minimal locking sketch follows this entry):
      
        watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [stat:2217]
        Modules linked in:
        CPU: 5 PID: 2217 Comm: stat Not tainted 5.0.0-rc1-00005-g7f702faf5a9e #4
        Hardware name: Generic DT based system
        PC is at generic_fillattr+0x104/0x108
        LR is at 0xec497f00
        pc : [<802b8898>]    lr : [<ec497f00>]    psr: 200c0013
        sp : ec497e20  ip : ed608030  fp : ec497e3c
        r10: 00000000  r9 : ec497f00  r8 : ed608030
        r7 : ec497ebc  r6 : ec497f00  r5 : ee5c1550  r4 : ee005780
        r3 : 0000052d  r2 : 00000000  r1 : ec497f00  r0 : ed608030
        Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
        Control: 10c5387d  Table: ac48006a  DAC: 00000051
        CPU: 5 PID: 2217 Comm: stat Not tainted 5.0.0-rc1-00005-g7f702faf5a9e #4
        Hardware name: Generic DT based system
        Backtrace:
        [<8010d974>] (dump_backtrace) from [<8010dc88>] (show_stack+0x20/0x24)
        [<8010dc68>] (show_stack) from [<80a1d194>] (dump_stack+0xb0/0xdc)
        [<80a1d0e4>] (dump_stack) from [<80109f34>] (show_regs+0x1c/0x20)
        [<80109f18>] (show_regs) from [<801d0a80>] (watchdog_timer_fn+0x280/0x2f8)
        [<801d0800>] (watchdog_timer_fn) from [<80198658>] (__hrtimer_run_queues+0x18c/0x380)
        [<801984cc>] (__hrtimer_run_queues) from [<80198e60>] (hrtimer_run_queues+0xb8/0xf0)
        [<80198da8>] (hrtimer_run_queues) from [<801973e8>] (run_local_timers+0x28/0x64)
        [<801973c0>] (run_local_timers) from [<80197460>] (update_process_times+0x3c/0x6c)
        [<80197424>] (update_process_times) from [<801ab2b8>] (tick_nohz_handler+0xe0/0x1bc)
        [<801ab1d8>] (tick_nohz_handler) from [<80843050>] (arch_timer_handler_virt+0x38/0x48)
        [<80843018>] (arch_timer_handler_virt) from [<80180a64>] (handle_percpu_devid_irq+0x8c/0x240)
        [<801809d8>] (handle_percpu_devid_irq) from [<8017ac20>] (generic_handle_irq+0x34/0x44)
        [<8017abec>] (generic_handle_irq) from [<8017b344>] (__handle_domain_irq+0x6c/0xc4)
        [<8017b2d8>] (__handle_domain_irq) from [<801022e0>] (gic_handle_irq+0x4c/0x88)
        [<80102294>] (gic_handle_irq) from [<80101a30>] (__irq_svc+0x70/0x98)
        [<802b8794>] (generic_fillattr) from [<8056b284>] (v9fs_vfs_getattr_dotl+0x74/0xa4)
        [<8056b210>] (v9fs_vfs_getattr_dotl) from [<802b8904>] (vfs_getattr_nosec+0x68/0x7c)
        [<802b889c>] (vfs_getattr_nosec) from [<802b895c>] (vfs_getattr+0x44/0x48)
        [<802b8918>] (vfs_getattr) from [<802b8a74>] (vfs_statx+0x9c/0xec)
        [<802b89d8>] (vfs_statx) from [<802b9428>] (sys_lstat64+0x48/0x78)
        [<802b93e0>] (sys_lstat64) from [<80101000>] (ret_fast_syscall+0x0/0x28)
      
      [dominique.martinet@cea.fr: updated comment to not refer to a function
      in another subsystem]
      Link: http://lkml.kernel.org/r/20190124063514.8571-2-houtao1@huawei.com
      Cc: stable@vger.kernel.org
      Fixes: 7549ae3e ("9p: Use the i_size_[read, write]() macros instead of using inode->i_size directly.")
      Reported-by: Xing Gaopeng <xingaopeng@huawei.com>
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e08ba890
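      A minimal sketch of the locking rule described above (the new size value
      is illustrative):

        /* On 32-bit SMP, i_size_write() must be serialized, otherwise a
         * concurrent i_size_read() can spin forever on the seqcount. */
        spin_lock(&inode->i_lock);
        i_size_write(inode, new_size);
        spin_unlock(&inode->i_lock);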
  2. 19 Mar 2019, 1 commit
  3. 14 Mar 2019, 11 commits
  4. 10 Mar 2019, 2 commits
    • exec: Fix mem leak in kernel_read_file · b60d90b2
      Committed by YueHaibing
      commit f612acfae86af7ecad754ae6a46019be9da05b8e upstream.
      
      syzkaller report this:
      BUG: memory leak
      unreferenced object 0xffffc9000488d000 (size 9195520):
        comm "syz-executor.0", pid 2752, jiffies 4294787496 (age 18.757s)
        hex dump (first 32 bytes):
          ff ff ff ff ff ff ff ff a8 00 00 00 01 00 00 00  ................
          02 00 00 00 00 00 00 00 80 a1 7a c1 ff ff ff ff  ..........z.....
        backtrace:
          [<000000000863775c>] __vmalloc_node mm/vmalloc.c:1795 [inline]
          [<000000000863775c>] __vmalloc_node_flags mm/vmalloc.c:1809 [inline]
          [<000000000863775c>] vmalloc+0x8c/0xb0 mm/vmalloc.c:1831
          [<000000003f668111>] kernel_read_file+0x58f/0x7d0 fs/exec.c:924
          [<000000002385813f>] kernel_read_file_from_fd+0x49/0x80 fs/exec.c:993
          [<0000000011953ff1>] __do_sys_finit_module+0x13b/0x2a0 kernel/module.c:3895
          [<000000006f58491f>] do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290
          [<00000000ee78baf4>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
          [<00000000241f889b>] 0xffffffffffffffff
      
      It should go to the 'out_free' label to free the allocated buf when
      kernel_read() fails (a minimal sketch follows this entry).
      
      Fixes: 39d637af ("vfs: forbid write access when reading a file into memory")
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Thibaut Sautereau <thibaut@sautereau.fr>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b60d90b2
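      A minimal sketch of the corrected error path in kernel_read_file()
      (local variable names are assumptions based on the report above):

        bytes = kernel_read(file, *buf + pos, i_size - pos, &pos);
        if (bytes < 0) {
                ret = bytes;
                /* free the vmalloc()ed buffer; a plain 'goto out' leaked it */
                goto out_free;
        }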
    • aio: Fix locking in aio_poll() · f5e66cdb
      Committed by Bart Van Assche
      commit d3d6a18d7d351cbcc9b33dbedf710e65f8ce1595 upstream.
      
      wake_up_locked() may but does not have to be called with interrupts
      disabled. Since the fuse filesystem calls wake_up_locked() without
      disabling interrupts aio_poll_wake() may be called with interrupts
      enabled. Since the kioctx.ctx_lock may be acquired from IRQ context,
      all code that acquires that lock from thread context must disable
      interrupts.  Hence change the spin_trylock() call in aio_poll_wake()
      into a spin_trylock_irqsave() call (a minimal sketch follows this
      entry).  This patch fixes the following lockdep complaint:
      
      =====================================================
      WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
      5.0.0-rc4-next-20190131 #23 Not tainted
      -----------------------------------------------------
      syz-executor2/13779 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
      0000000098ac1230 (&fiq->waitq){+.+.}, at: spin_lock include/linux/spinlock.h:329 [inline]
      0000000098ac1230 (&fiq->waitq){+.+.}, at: aio_poll fs/aio.c:1772 [inline]
      0000000098ac1230 (&fiq->waitq){+.+.}, at: __io_submit_one fs/aio.c:1875 [inline]
      0000000098ac1230 (&fiq->waitq){+.+.}, at: io_submit_one+0xedf/0x1cf0 fs/aio.c:1908
      
      and this task is already holding:
      000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: spin_lock_irq include/linux/spinlock.h:354 [inline]
      000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: aio_poll fs/aio.c:1771 [inline]
      000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: __io_submit_one fs/aio.c:1875 [inline]
      000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: io_submit_one+0xeb6/0x1cf0 fs/aio.c:1908
      which would create a new lock dependency:
       (&(&ctx->ctx_lock)->rlock){..-.} -> (&fiq->waitq){+.+.}
      
      but this new dependency connects a SOFTIRQ-irq-safe lock:
       (&(&ctx->ctx_lock)->rlock){..-.}
      
      ... which became SOFTIRQ-irq-safe at:
        lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
        __raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline]
        _raw_spin_lock_irq+0x60/0x80 kernel/locking/spinlock.c:160
        spin_lock_irq include/linux/spinlock.h:354 [inline]
        free_ioctx_users+0x2d/0x4a0 fs/aio.c:610
        percpu_ref_put_many include/linux/percpu-refcount.h:285 [inline]
        percpu_ref_put include/linux/percpu-refcount.h:301 [inline]
        percpu_ref_call_confirm_rcu lib/percpu-refcount.c:123 [inline]
        percpu_ref_switch_to_atomic_rcu+0x3e7/0x520 lib/percpu-refcount.c:158
        __rcu_reclaim kernel/rcu/rcu.h:240 [inline]
        rcu_do_batch kernel/rcu/tree.c:2486 [inline]
        invoke_rcu_callbacks kernel/rcu/tree.c:2799 [inline]
        rcu_core+0x928/0x1390 kernel/rcu/tree.c:2780
        __do_softirq+0x266/0x95a kernel/softirq.c:292
        run_ksoftirqd kernel/softirq.c:654 [inline]
        run_ksoftirqd+0x8e/0x110 kernel/softirq.c:646
        smpboot_thread_fn+0x6ab/0xa10 kernel/smpboot.c:164
        kthread+0x357/0x430 kernel/kthread.c:247
        ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352
      
      to a SOFTIRQ-irq-unsafe lock:
       (&fiq->waitq){+.+.}
      
      ... which became SOFTIRQ-irq-unsafe at:
      ...
        lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
        __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
        _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
        spin_lock include/linux/spinlock.h:329 [inline]
        flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415
        fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676
        fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687
        fuse_send_init fs/fuse/inode.c:989 [inline]
        fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214
        mount_nodev+0x68/0x110 fs/super.c:1392
        fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239
        legacy_get_tree+0xf2/0x200 fs/fs_context.c:590
        vfs_get_tree+0x123/0x450 fs/super.c:1481
        do_new_mount fs/namespace.c:2610 [inline]
        do_mount+0x1436/0x2c40 fs/namespace.c:2932
        ksys_mount+0xdb/0x150 fs/namespace.c:3148
        __do_sys_mount fs/namespace.c:3162 [inline]
        __se_sys_mount fs/namespace.c:3159 [inline]
        __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159
        do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      other info that might help us debug this:
      
       Possible interrupt unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(&fiq->waitq);
                                     local_irq_disable();
                                     lock(&(&ctx->ctx_lock)->rlock);
                                     lock(&fiq->waitq);
        <Interrupt>
          lock(&(&ctx->ctx_lock)->rlock);
      
       *** DEADLOCK ***
      
      1 lock held by syz-executor2/13779:
       #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: spin_lock_irq include/linux/spinlock.h:354 [inline]
       #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: aio_poll fs/aio.c:1771 [inline]
       #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: __io_submit_one fs/aio.c:1875 [inline]
       #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: io_submit_one+0xeb6/0x1cf0 fs/aio.c:1908
      
      the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
      -> (&(&ctx->ctx_lock)->rlock){..-.} {
         IN-SOFTIRQ-W at:
                          lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
                          __raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline]
                          _raw_spin_lock_irq+0x60/0x80 kernel/locking/spinlock.c:160
                          spin_lock_irq include/linux/spinlock.h:354 [inline]
                          free_ioctx_users+0x2d/0x4a0 fs/aio.c:610
                          percpu_ref_put_many include/linux/percpu-refcount.h:285 [inline]
                          percpu_ref_put include/linux/percpu-refcount.h:301 [inline]
                          percpu_ref_call_confirm_rcu lib/percpu-refcount.c:123 [inline]
                          percpu_ref_switch_to_atomic_rcu+0x3e7/0x520 lib/percpu-refcount.c:158
                          __rcu_reclaim kernel/rcu/rcu.h:240 [inline]
                          rcu_do_batch kernel/rcu/tree.c:2486 [inline]
                          invoke_rcu_callbacks kernel/rcu/tree.c:2799 [inline]
                          rcu_core+0x928/0x1390 kernel/rcu/tree.c:2780
                          __do_softirq+0x266/0x95a kernel/softirq.c:292
                          run_ksoftirqd kernel/softirq.c:654 [inline]
                          run_ksoftirqd+0x8e/0x110 kernel/softirq.c:646
                          smpboot_thread_fn+0x6ab/0xa10 kernel/smpboot.c:164
                          kthread+0x357/0x430 kernel/kthread.c:247
                          ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352
         INITIAL USE at:
                         lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
                         __raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline]
                         _raw_spin_lock_irq+0x60/0x80 kernel/locking/spinlock.c:160
                         spin_lock_irq include/linux/spinlock.h:354 [inline]
                         __do_sys_io_cancel fs/aio.c:2052 [inline]
                         __se_sys_io_cancel fs/aio.c:2035 [inline]
                         __x64_sys_io_cancel+0xd5/0x5a0 fs/aio.c:2035
                         do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
                         entry_SYSCALL_64_after_hwframe+0x49/0xbe
       }
       ... key      at: [<ffffffff8a574140>] __key.52370+0x0/0x40
       ... acquired at:
         lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
         __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
         _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
         spin_lock include/linux/spinlock.h:329 [inline]
         aio_poll fs/aio.c:1772 [inline]
         __io_submit_one fs/aio.c:1875 [inline]
         io_submit_one+0xedf/0x1cf0 fs/aio.c:1908
         __do_sys_io_submit fs/aio.c:1953 [inline]
         __se_sys_io_submit fs/aio.c:1923 [inline]
         __x64_sys_io_submit+0x1bd/0x580 fs/aio.c:1923
         do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      the dependencies between the lock to be acquired
       and SOFTIRQ-irq-unsafe lock:
      -> (&fiq->waitq){+.+.} {
         HARDIRQ-ON-W at:
                          lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
                          __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
                          _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
                          spin_lock include/linux/spinlock.h:329 [inline]
                          flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415
                          fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676
                          fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687
                          fuse_send_init fs/fuse/inode.c:989 [inline]
                          fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214
                          mount_nodev+0x68/0x110 fs/super.c:1392
                          fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239
                          legacy_get_tree+0xf2/0x200 fs/fs_context.c:590
                          vfs_get_tree+0x123/0x450 fs/super.c:1481
                          do_new_mount fs/namespace.c:2610 [inline]
                          do_mount+0x1436/0x2c40 fs/namespace.c:2932
                          ksys_mount+0xdb/0x150 fs/namespace.c:3148
                          __do_sys_mount fs/namespace.c:3162 [inline]
                          __se_sys_mount fs/namespace.c:3159 [inline]
                          __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159
                          do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
                          entry_SYSCALL_64_after_hwframe+0x49/0xbe
         SOFTIRQ-ON-W at:
                          lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
                          __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
                          _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
                          spin_lock include/linux/spinlock.h:329 [inline]
                          flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415
                          fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676
                          fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687
                          fuse_send_init fs/fuse/inode.c:989 [inline]
                          fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214
                          mount_nodev+0x68/0x110 fs/super.c:1392
                          fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239
                          legacy_get_tree+0xf2/0x200 fs/fs_context.c:590
                          vfs_get_tree+0x123/0x450 fs/super.c:1481
                          do_new_mount fs/namespace.c:2610 [inline]
                          do_mount+0x1436/0x2c40 fs/namespace.c:2932
                          ksys_mount+0xdb/0x150 fs/namespace.c:3148
                          __do_sys_mount fs/namespace.c:3162 [inline]
                          __se_sys_mount fs/namespace.c:3159 [inline]
                          __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159
                          do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
                          entry_SYSCALL_64_after_hwframe+0x49/0xbe
         INITIAL USE at:
                         lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
                         __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
                         _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
                         spin_lock include/linux/spinlock.h:329 [inline]
                         flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415
                         fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676
                         fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687
                         fuse_send_init fs/fuse/inode.c:989 [inline]
                         fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214
                         mount_nodev+0x68/0x110 fs/super.c:1392
                         fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239
                         legacy_get_tree+0xf2/0x200 fs/fs_context.c:590
                         vfs_get_tree+0x123/0x450 fs/super.c:1481
                         do_new_mount fs/namespace.c:2610 [inline]
                         do_mount+0x1436/0x2c40 fs/namespace.c:2932
                         ksys_mount+0xdb/0x150 fs/namespace.c:3148
                         __do_sys_mount fs/namespace.c:3162 [inline]
                         __se_sys_mount fs/namespace.c:3159 [inline]
                         __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159
                         do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
                         entry_SYSCALL_64_after_hwframe+0x49/0xbe
       }
       ... key      at: [<ffffffff8a60dec0>] __key.43450+0x0/0x40
       ... acquired at:
         lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
         __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
         _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
         spin_lock include/linux/spinlock.h:329 [inline]
         aio_poll fs/aio.c:1772 [inline]
         __io_submit_one fs/aio.c:1875 [inline]
         io_submit_one+0xedf/0x1cf0 fs/aio.c:1908
         __do_sys_io_submit fs/aio.c:1953 [inline]
         __se_sys_io_submit fs/aio.c:1923 [inline]
         __x64_sys_io_submit+0x1bd/0x580 fs/aio.c:1923
         do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      stack backtrace:
      CPU: 0 PID: 13779 Comm: syz-executor2 Not tainted 5.0.0-rc4-next-20190131 #23
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       print_bad_irq_dependency kernel/locking/lockdep.c:1573 [inline]
       check_usage.cold+0x60f/0x940 kernel/locking/lockdep.c:1605
       check_irq_usage kernel/locking/lockdep.c:1650 [inline]
       check_prev_add_irq kernel/locking/lockdep_states.h:8 [inline]
       check_prev_add kernel/locking/lockdep.c:1860 [inline]
       check_prevs_add kernel/locking/lockdep.c:1968 [inline]
       validate_chain kernel/locking/lockdep.c:2339 [inline]
       __lock_acquire+0x1f12/0x4790 kernel/locking/lockdep.c:3320
       lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
       __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
       _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
       spin_lock include/linux/spinlock.h:329 [inline]
       aio_poll fs/aio.c:1772 [inline]
       __io_submit_one fs/aio.c:1875 [inline]
       io_submit_one+0xedf/0x1cf0 fs/aio.c:1908
       __do_sys_io_submit fs/aio.c:1953 [inline]
       __se_sys_io_submit fs/aio.c:1923 [inline]
       __x64_sys_io_submit+0x1bd/0x580 fs/aio.c:1923
       do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: <stable@vger.kernel.org>
      Fixes: e8693bcf ("aio: allow direct aio poll comletions for keyed wakeups") # v4.19
      Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
      [ bvanassche: added a comment ]
      Reluctantly-Acked-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f5e66cdb
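      A minimal sketch of the aio_poll_wake() change described above; the
      field names follow fs/aio.c of that era and the surrounding wakeup
      logic is elided.

        unsigned long flags;

        /* The waitqueue lock may be taken with IRQs enabled (e.g. by fuse),
         * so the ctx_lock trylock must save and restore the IRQ state. */
        if (spin_trylock_irqsave(&iocb->ki_ctx->ctx_lock, flags)) {
                list_del(&iocb->ki_list);
                /* ... complete the iocb ... */
                spin_unlock_irqrestore(&iocb->ki_ctx->ctx_lock, flags);
        }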
  5. 06 Mar 2019, 3 commits
    • hugetlbfs: fix races and page leaks during migration · 527cabff
      Committed by Mike Kravetz
      commit cb6acd01e2e43fd8bad11155752b7699c3d0fb76 upstream.
      
      hugetlb pages should only be migrated if they are 'active'.  The
      routines set/clear_page_huge_active() modify the active state of hugetlb
      pages.
      
      When a new hugetlb page is allocated at fault time, set_page_huge_active
      is called before the page is locked.  Therefore, another thread could
      race and migrate the page while it is being added to page table by the
      fault code.  This race is somewhat hard to trigger, but can be seen by
      strategically adding udelay to simulate worst case scheduling behavior.
      Depending on 'how' the code races, various BUG()s could be triggered.
      
      To address this issue, simply delay the set_page_huge_active call until
      after the page is successfully added to the page table.
      
      Hugetlb pages can also be leaked at migration time if the pages are
      associated with a file in an explicitly mounted hugetlbfs filesystem.
      For example, consider a two node system with 4GB worth of huge pages
      available.  A program mmaps a 2G file in a hugetlbfs filesystem.  It
      then migrates the pages associated with the file from one node to
      another.  When the program exits, huge page counts are as follows:
      
        node0
        1024    free_hugepages
        1024    nr_hugepages
      
        node1
        0       free_hugepages
        1024    nr_hugepages
      
        Filesystem                         Size  Used Avail Use% Mounted on
        nodev                              4.0G  2.0G  2.0G  50% /var/opt/hugepool
      
      That is as expected.  2G of huge pages are taken from the free_hugepages
      counts, and 2G is the size of the file in the explicitly mounted
      filesystem.  If the file is then removed, the counts become:
      
        node0
        1024    free_hugepages
        1024    nr_hugepages
      
        node1
        1024    free_hugepages
        1024    nr_hugepages
      
        Filesystem                         Size  Used Avail Use% Mounted on
        nodev                              4.0G  2.0G  2.0G  50% /var/opt/hugepool
      
      Note that the filesystem still shows 2G of pages used, while there
      actually are no huge pages in use.  The only way to 'fix' the filesystem
      accounting is to unmount the filesystem.

      If a hugetlb page is associated with an explicitly mounted filesystem,
      this information is contained in the page_private field.  At migration
      time, this information is not preserved.  To fix, simply transfer
      page_private from the old to the new page at migration time if necessary
      (a minimal sketch follows this entry).
      
      There is a related race with removing a huge page from a file and
      migration.  When a huge page is removed from the pagecache, the
      page_mapping() field is cleared, yet page_private remains set until the
      page is actually freed by free_huge_page().  A page could be migrated
      while in this state.  However, since page_mapping() is not set the
      hugetlbfs specific routine to transfer page_private is not called and we
      leak the page count in the filesystem.
      
      To fix that, check for this condition before migrating a huge page.  If
      the condition is detected, return EBUSY for the page.
      
      Link: http://lkml.kernel.org/r/74510272-7319-7372-9ea6-ec914734c179@oracle.com
      Link: http://lkml.kernel.org/r/20190212221400.3512-1-mike.kravetz@oracle.com
      Fixes: bcc54222 ("mm: hugetlb: introduce page_huge_active")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: <stable@vger.kernel.org>
      [mike.kravetz@oracle.com: v2]
        Link: http://lkml.kernel.org/r/7534d322-d782-8ac6-1c8d-a8dc380eb3ab@oracle.com
      [mike.kravetz@oracle.com: update comment and changelog]
        Link: http://lkml.kernel.org/r/420bcfd6-158b-38e4-98da-26d0cd85bd01@oracle.com
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      527cabff
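      A minimal sketch of the page_private transfer described above; the
      hugetlb migration callback is the assumed call site and the PagePrivate
      flag handling is simplified.

        /* Carry the hugetlbfs reservation bookkeeping over to the new page
         * so the filesystem usage accounting stays correct. */
        if (page_private(oldpage) && !page_private(newpage)) {
                set_page_private(newpage, page_private(oldpage));
                set_page_private(oldpage, 0);
        }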
    • writeback: synchronize sync(2) against cgroup writeback membership switches · edca54b8
      Committed by Tejun Heo
      [ Upstream commit 7fc5854f8c6efae9e7624970ab49a1eac2faefb1 ]
      
      sync_inodes_sb() can race against cgwb (cgroup writeback) membership
      switches and fail to writeback some inodes.  For example, if an inode
      switches to another wb while sync_inodes_sb() is in progress, the new
      wb might not be visible to bdi_split_work_to_wbs() at all or the inode
      might jump from a wb which hasn't issued writebacks yet to one which
      already has.
      
      This patch adds backing_dev_info->wb_switch_rwsem to synchronize the
      cgwb switch path against sync_inodes_sb() so that sync_inodes_sb() is
      guaranteed to see all the target wbs and inodes can't jump wbs to
      escape syncing (a minimal sketch follows this entry).
      
      v2: Fixed misplaced rwsem init.  Spotted by Jiufei.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Jiufei Xue <xuejiufei@gmail.com>
      Link: http://lkml.kernel.org/r/dc694ae2-f07f-61e1-7097-7c8411cee12d@gmail.com
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      edca54b8
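      A minimal sketch of the synchronization described above; which path
      takes the read side versus the write side, and the helper names, are
      assumptions drawn from the description rather than the exact patch.

        /* cgwb membership switch path: shared side, switches may overlap. */
        down_read(&bdi->wb_switch_rwsem);
        /* ... move the inode to its new wb ... */
        up_read(&bdi->wb_switch_rwsem);

        /* sync_inodes_sb(): exclusive side, so it sees a stable set of wbs. */
        down_write(&bdi->wb_switch_rwsem);
        /* ... bdi_split_work_to_wbs() and wait for writeback ... */
        up_write(&bdi->wb_switch_rwsem);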
    • direct-io: allow direct writes to empty inodes · c5a1dc25
      Committed by Ernesto A. Fernández
      [ Upstream commit 8b9433eb4de3c26a9226c981c283f9f4896ae030 ]
      
      On a DIO_SKIP_HOLES filesystem, the ->get_block() method is currently
      not allowed to create blocks for an empty inode.  The confusion comes
      from bit-shifting a negative number, so check the size of the inode
      first (a minimal sketch follows this entry).
      
      The problem is most visible for hfsplus, because the fallback to
      buffered I/O doesn't happen and the write fails with EIO.  This is in
      part the fault of the module, because it gives a wrong return value on
      ->get_block(); that will be fixed in a separate patch.
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      c5a1dc25
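      A minimal sketch of the guard described above for get_more_blocks() in
      fs/direct-io.c; the variable names are simplified assumptions.

        loff_t i_size = i_size_read(dio->inode);

        /* Only suppress block creation for offsets inside i_size; with an
         * empty inode, shifting (0 - 1) would wrap to a huge block number. */
        if (dio->flags & DIO_SKIP_HOLES) {
                if (i_size && fs_startblk <= (i_size - 1) >> i_blkbits)
                        create = 0;
        }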
  6. 27 Feb 2019, 4 commits
  7. 20 Feb 2019, 3 commits