1. 23 10月, 2019 27 次提交
  2. 07 10月, 2019 1 次提交
    • L
      elf: don't use MAP_FIXED_NOREPLACE for elf executable mappings · b212921b
      Linus Torvalds 提交于
      In commit 4ed28639 ("fs, elf: drop MAP_FIXED usage from elf_map") we
      changed elf to use MAP_FIXED_NOREPLACE instead of MAP_FIXED for the
      executable mappings.
      
      Then, people reported that it broke some binaries that had overlapping
      segments from the same file, and commit ad55eac7 ("elf: enforce
      MAP_FIXED on overlaying elf segments") re-instated MAP_FIXED for some
      overlaying elf segment cases.  But only some - despite the summary line
      of that commit, it only did it when it also does a temporary brk vma for
      one obvious overlapping case.
      
      Now Russell King reports another overlapping case with old 32-bit x86
      binaries, which doesn't trigger that limited case.  End result: we had
      better just drop MAP_FIXED_NOREPLACE entirely, and go back to MAP_FIXED.
      
      Yes, it's a sign of old binaries generated with old tool-chains, but we
      do pride ourselves on not breaking existing setups.
      
      This still leaves MAP_FIXED_NOREPLACE in place for the load_elf_interp()
      and the old load_elf_library() use-cases, because nobody has reported
      breakage for those. Yet.
      
      Note that in all the cases seen so far, the overlapping elf sections
      seem to be just re-mapping of the same executable with different section
      attributes.  We could possibly introduce a new MAP_FIXED_NOFILECHANGE
      flag or similar, which acts like NOREPLACE, but allows just remapping
      the same executable file using different protection flags.
      
      It's not clear that would make a huge difference to anything, but if
      people really hate that "elf remaps over previous maps" behavior, maybe
      at least a more limited form of remapping would alleviate some concerns.
      
      Alternatively, we should take a look at our elf_map() logic to see if we
      end up not mapping things properly the first time.
      
      In the meantime, this is the minimal "don't do that then" patch while
      people hopefully think about it more.
      Reported-by: NRussell King <linux@armlinux.org.uk>
      Fixes: 4ed28639 ("fs, elf: drop MAP_FIXED usage from elf_map")
      Fixes: ad55eac7 ("elf: enforce  MAP_FIXED on overlaying elf segments")
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b212921b
  3. 06 10月, 2019 2 次提交
    • L
      Make filldir[64]() verify the directory entry filename is valid · 8a23eb80
      Linus Torvalds 提交于
      This has been discussed several times, and now filesystem people are
      talking about doing it individually at the filesystem layer, so head
      that off at the pass and just do it in getdents{64}().
      
      This is partially based on a patch by Jann Horn, but checks for NUL
      bytes as well, and somewhat simplified.
      
      There's also commentary about how it might be better if invalid names
      due to filesystem corruption don't cause an immediate failure, but only
      an error at the end of the readdir(), so that people can still see the
      filenames that are ok.
      
      There's also been discussion about just how much POSIX strictly speaking
      requires this since it's about filesystem corruption.  It's really more
      "protect user space from bad behavior" as pointed out by Jann.  But
      since Eric Biederman looked up the POSIX wording, here it is for context:
      
       "From readdir:
      
         The readdir() function shall return a pointer to a structure
         representing the directory entry at the current position in the
         directory stream specified by the argument dirp, and position the
         directory stream at the next entry. It shall return a null pointer
         upon reaching the end of the directory stream. The structure dirent
         defined in the <dirent.h> header describes a directory entry.
      
        From definitions:
      
         3.129 Directory Entry (or Link)
      
         An object that associates a filename with a file. Several directory
         entries can associate names with the same file.
      
        ...
      
         3.169 Filename
      
         A name consisting of 1 to {NAME_MAX} bytes used to name a file. The
         characters composing the name may be selected from the set of all
         character values excluding the slash character and the null byte. The
         filenames dot and dot-dot have special meaning. A filename is
         sometimes referred to as a 'pathname component'."
      
      Note that I didn't bother adding the checks to any legacy interfaces
      that nobody uses.
      
      Also note that if this ends up being noticeable as a performance
      regression, we can fix that to do a much more optimized model that
      checks for both NUL and '/' at the same time one word at a time.
      
      We haven't really tended to optimize 'memchr()', and it only checks for
      one pattern at a time anyway, and we really _should_ check for NUL too
      (but see the comment about "soft errors" in the code about why it
      currently only checks for '/')
      
      See the CONFIG_DCACHE_WORD_ACCESS case of hash_name() for how the name
      lookup code looks for pathname terminating characters in parallel.
      
      Link: https://lore.kernel.org/lkml/20190118161440.220134-2-jannh@google.com/
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jann Horn <jannh@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8a23eb80
    • L
      Convert filldir[64]() from __put_user() to unsafe_put_user() · 9f79b78e
      Linus Torvalds 提交于
      We really should avoid the "__{get,put}_user()" functions entirely,
      because they can easily be mis-used and the original intent of being
      used for simple direct user accesses no longer holds in a post-SMAP/PAN
      world.
      
      Manually optimizing away the user access range check makes no sense any
      more, when the range check is generally much cheaper than the "enable
      user accesses" code that the __{get,put}_user() functions still need.
      
      So instead of __put_user(), use the unsafe_put_user() interface with
      user_access_{begin,end}() that really does generate better code these
      days, and which is generally a nicer interface.  Under some loads, the
      multiple user writes that filldir() does are actually quite noticeable.
      
      This also makes the dirent name copy use unsafe_put_user() with a couple
      of macros.  We do not want to make function calls with SMAP/PAN
      disabled, and the code this generates is quite good when the
      architecture uses "asm goto" for unsafe_put_user() like x86 does.
      
      Note that this doesn't bother with the legacy cases.  Nobody should use
      them anyway, so performance doesn't really matter there.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9f79b78e
  4. 04 10月, 2019 1 次提交
    • E
      vfs: Fix EOVERFLOW testing in put_compat_statfs64 · cc3a7bfe
      Eric Sandeen 提交于
      Today, put_compat_statfs64() disallows nearly any field value over
      2^32 if f_bsize is only 32 bits, but that makes no sense.
      compat_statfs64 is there for the explicit purpose of providing 64-bit
      fields for f_files, f_ffree, etc.  And f_bsize is always only 32 bits.
      
      As a result, 32-bit userspace gets -EOVERFLOW for i.e.  large file
      counts even with -D_FILE_OFFSET_BITS=64 set.
      
      In reality, only f_bsize and f_frsize can legitimately overflow
      (fields like f_type and f_namelen should never be large), so test
      only those fields.
      
      This bug was discussed at length some time ago, and this is the proposal
      Al suggested at https://lkml.org/lkml/2018/8/6/640.  It seemed to get
      dropped amid the discussion of other related changes, but this
      part seems obviously correct on its own, so I've picked it up and
      sent it, for expediency.
      
      Fixes: 64d2ab32 ("vfs: fix put_compat_statfs64() does not handle errors")
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cc3a7bfe
  5. 01 10月, 2019 4 次提交
  6. 30 9月, 2019 1 次提交
    • L
      Revert "Revert "ext4: make __ext4_get_inode_loc plug"" · 02f03c42
      Linus Torvalds 提交于
      This reverts commit 72dbcf72.
      
      Instead of waiting forever for entropy that may just not happen, we now
      try to actively generate entropy when required, and are thus hopefully
      avoiding the problem that caused the nice ext4 IO pattern fix to be
      reverted.
      
      So revert the revert.
      
      Cc: Ahmed S. Darwish <darwish.07@gmail.com>
      Cc: Ted Ts'o <tytso@mit.edu>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Alexander E. Patrakov <patrakov@gmail.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      02f03c42
  7. 27 9月, 2019 4 次提交
    • Q
      btrfs: qgroup: Fix reserved data space leak if we have multiple reserve calls · d4e20494
      Qu Wenruo 提交于
      [BUG]
      The following script can cause btrfs qgroup data space leak:
      
        mkfs.btrfs -f $dev
        mount $dev -o nospace_cache $mnt
      
        btrfs subv create $mnt/subv
        btrfs quota en $mnt
        btrfs quota rescan -w $mnt
        btrfs qgroup limit 128m $mnt/subv
      
        for (( i = 0; i < 3; i++)); do
                # Create 3 64M holes for latter fallocate to fail
                truncate -s 192m $mnt/subv/file
                xfs_io -c "pwrite 64m 4k" $mnt/subv/file > /dev/null
                xfs_io -c "pwrite 128m 4k" $mnt/subv/file > /dev/null
                sync
      
                # it's supposed to fail, and each failure will leak at least 64M
                # data space
                xfs_io -f -c "falloc 0 192m" $mnt/subv/file &> /dev/null
                rm $mnt/subv/file
                sync
        done
      
        # Shouldn't fail after we removed the file
        xfs_io -f -c "falloc 0 64m" $mnt/subv/file
      
      [CAUSE]
      Btrfs qgroup data reserve code allow multiple reservations to happen on
      a single extent_changeset:
      E.g:
      	btrfs_qgroup_reserve_data(inode, &data_reserved, 0, SZ_1M);
      	btrfs_qgroup_reserve_data(inode, &data_reserved, SZ_1M, SZ_2M);
      	btrfs_qgroup_reserve_data(inode, &data_reserved, 0, SZ_4M);
      
      Btrfs qgroup code has its internal tracking to make sure we don't
      double-reserve in above example.
      
      The only pattern utilizing this feature is in the main while loop of
      btrfs_fallocate() function.
      
      However btrfs_qgroup_reserve_data()'s error handling has a bug in that
      on error it clears all ranges in the io_tree with EXTENT_QGROUP_RESERVED
      flag but doesn't free previously reserved bytes.
      
      This bug has a two fold effect:
      - Clearing EXTENT_QGROUP_RESERVED ranges
        This is the correct behavior, but it prevents
        btrfs_qgroup_check_reserved_leak() to catch the leakage as the
        detector is purely EXTENT_QGROUP_RESERVED flag based.
      
      - Leak the previously reserved data bytes.
      
      The bug manifests when N calls to btrfs_qgroup_reserve_data are made and
      the last one fails, leaking space reserved in the previous ones.
      
      [FIX]
      Also free previously reserved data bytes when btrfs_qgroup_reserve_data
      fails.
      
      Fixes: 52472553 ("btrfs: qgroup: Introduce btrfs_qgroup_reserve_data function")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d4e20494
    • Q
      btrfs: qgroup: Fix the wrong target io_tree when freeing reserved data space · bab32fc0
      Qu Wenruo 提交于
      [BUG]
      Under the following case with qgroup enabled, if some error happened
      after we have reserved delalloc space, then in error handling path, we
      could cause qgroup data space leakage:
      
      From btrfs_truncate_block() in inode.c:
      
      	ret = btrfs_delalloc_reserve_space(inode, &data_reserved,
      					   block_start, blocksize);
      	if (ret)
      		goto out;
      
       again:
      	page = find_or_create_page(mapping, index, mask);
      	if (!page) {
      		btrfs_delalloc_release_space(inode, data_reserved,
      					     block_start, blocksize, true);
      		btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, true);
      		ret = -ENOMEM;
      		goto out;
      	}
      
      [CAUSE]
      In the above case, btrfs_delalloc_reserve_space() will call
      btrfs_qgroup_reserve_data() and mark the io_tree range with
      EXTENT_QGROUP_RESERVED flag.
      
      In the error handling path, we have the following call stack:
      btrfs_delalloc_release_space()
      |- btrfs_free_reserved_data_space()
         |- btrsf_qgroup_free_data()
            |- __btrfs_qgroup_release_data(reserved=@reserved, free=1)
               |- qgroup_free_reserved_data(reserved=@reserved)
                  |- clear_record_extent_bits();
                  |- freed += changeset.bytes_changed;
      
      However due to a completion bug, qgroup_free_reserved_data() will clear
      EXTENT_QGROUP_RESERVED flag in BTRFS_I(inode)->io_failure_tree, other
      than the correct BTRFS_I(inode)->io_tree.
      Since io_failure_tree is never marked with that flag,
      btrfs_qgroup_free_data() will not free any data reserved space at all,
      causing a leakage.
      
      This type of error handling can only be triggered by errors outside of
      qgroup code. So EDQUOT error from qgroup can't trigger it.
      
      [FIX]
      Fix the wrong target io_tree.
      Reported-by: NJosef Bacik <josef@toxicpanda.com>
      Fixes: bc42bda2 ("btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges")
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bab32fc0
    • P
      CIFS: Fix oplock handling for SMB 2.1+ protocols · a016e279
      Pavel Shilovsky 提交于
      There may be situations when a server negotiates SMB 2.1
      protocol version or higher but responds to a CREATE request
      with an oplock rather than a lease.
      
      Currently the client doesn't handle such a case correctly:
      when another CREATE comes in the server sends an oplock
      break to the initial CREATE and the client doesn't send
      an ack back due to a wrong caching level being set (READ
      instead of RWH). Missing an oplock break ack makes the
      server wait until the break times out which dramatically
      increases the latency of the second CREATE.
      
      Fix this by properly detecting oplocks when using SMB 2.1
      protocol version and higher.
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NPavel Shilovsky <pshilov@microsoft.com>
      Signed-off-by: NSteve French <stfrench@microsoft.com>
      Reviewed-by: NRonnie Sahlberg <lsahlber@redhat.com>
      a016e279
    • S
      smb3: missing ACL related flags · ff3ee62a
      Steve French 提交于
      Various SMB3 ACL related flags (for security descriptor and
      ACEs for example) were missing and some fields are different
      in SMB3 and CIFS. Update cifsacl.h definitions based on
      current MS-DTYP specification.
      Signed-off-by: NSteve French <stfrench@microsoft.com>
      Reviewed-by: NRonnie Sahlberg <lsahlber@redhat.com>
      Reviewed-by: NAurelien Aptel <aaptel@suse.com>
      ff3ee62a