1. 21 Apr, 2018 8 commits
    • fs, elf: don't complain MAP_FIXED_NOREPLACE unless -EEXIST error · d23a61ee
      Tetsuo Handa committed
      Commit 4ed28639 ("fs, elf: drop MAP_FIXED usage from elf_map") is
      printing spurious messages under memory pressure due to map_addr == -ENOMEM.
      
       9794 (a.out): Uhuuh, elf segment at 00007f2e34738000(fffffffffffffff4) requested but the memory is mapped already
       14104 (a.out): Uhuuh, elf segment at 00007f34fd76c000(fffffffffffffff4) requested but the memory is mapped already
       16843 (a.out): Uhuuh, elf segment at 00007f930ecc7000(fffffffffffffff4) requested but the memory is mapped already
      
      Complain only if -EEXIST, and use %px for printing the address.
      
      Link: http://lkml.kernel.org/r/201804182307.FAC17665.SFMOFJVFtHOLOQ@I-love.SAKURA.ne.jp
      Fixes: 4ed28639 ("fs, elf: drop MAP_FIXED usage from elf_map")
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Andrei Vagin <avagin@openvz.org>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Joel Stanley <joel@jms.id.au>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
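The value in parentheses in the log lines of the commit above, fffffffffffffff4, is -12 (-ENOMEM) read as an unsigned 64-bit number, so it is not an "already mapped" condition. A minimal Python sketch of that decoding follows. It is a userspace model, not the kernel code: the helper names are hypothetical, and the constants assume Linux errno values and a 64-bit unsigned long.

```python
# Userspace model only; names and constants below are illustrative assumptions.
MAX_ERRNO = 4095      # size of the kernel's error-value range
ENOMEM = 12           # Linux errno values, hard-coded on purpose
EEXIST = 17
WORD = 1 << 64        # assumes a 64-bit unsigned long


def decode_errno(value):
    """Return the positive errno encoded in value, or None for a real address."""
    value %= WORD
    if value >= WORD - MAX_ERRNO:
        return WORD - value
    return None


def should_complain(map_addr):
    """Hypothetical helper: warn only when the mapping failed with -EEXIST."""
    return decode_errno(map_addr) == EEXIST


if __name__ == "__main__":
    print(decode_errno(0xFFFFFFFFFFFFFFF4))      # value copied from the log
    print(should_complain(0xFFFFFFFFFFFFFFF4))   # -ENOMEM: stay quiet
    print(should_complain(WORD - EEXIST))        # -EEXIST: complain
```

With this model the logged value decodes to ENOMEM, which is why the message was spurious under memory pressure.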
    • proc: fix /proc/loadavg regression · 9a1015b3
      Alexey Dobriyan committed
      Commit 95846ecf ("pid: replace pid bitmap implementation with IDR
      API") changed last field of /proc/loadavg (last pid allocated) to be off
      by one:
      
      	# unshare -p -f --mount-proc cat /proc/loadavg
      	0.00 0.00 0.00 1/60 2	<===
      
      It should be 1 after first fork into pid namespace.
      
      This is formally a regression but given how useless this field is I
      don't think anyone is affected.
      
      Bug was found by /proc testsuite!
      
      Link: http://lkml.kernel.org/r/20180413175408.GA27246@avx2
      Fixes: 95846ecf ("pid: replace pid bitmap implementation with IDR API")
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Gargi Sharma <gs051095@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
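For readers unfamiliar with the file, a /proc/loadavg line has five whitespace-separated fields: three load averages, a runnable/total task pair, and the last PID allocated. The Python sketch below splits the sample line from the commit above into those fields; the `LoadAvg` type and `parse_loadavg` name are illustrative assumptions, not an existing API.

```python
from typing import NamedTuple


class LoadAvg(NamedTuple):
    load1: float
    load5: float
    load15: float
    runnable: int
    total: int
    last_pid: int


def parse_loadavg(line):
    """Split one /proc/loadavg line into its five whitespace-separated fields."""
    load1, load5, load15, tasks, last_pid = line.split()
    runnable, total = tasks.split("/")
    return LoadAvg(float(load1), float(load5), float(load15),
                   int(runnable), int(total), int(last_pid))


if __name__ == "__main__":
    # Sample line copied from the commit message (the regressed output).
    sample = parse_loadavg("0.00 0.00 0.00 1/60 2")
    print(sample.last_pid)
```

The regression is visible in the last field: the sample reports 2 where the commit message says 1 is expected after the first fork into a new pid namespace.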
    • proc: revalidate kernel thread inodes to root:root · 2e0ad552
      Alexey Dobriyan committed
      task_dump_owner() has the following code:
      
      	mm = task->mm;
      	if (mm) {
      		if (get_dumpable(mm) != SUID_DUMP_USER) {
      			uid = ...
      		}
      	}
      
      Check for ->mm is buggy -- kernel thread might be borrowing mm
      and inode will go to some random uid:gid pair.
      
      Link: http://lkml.kernel.org/r/20180412220109.GA20978@avx2
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • autofs: mount point create should honour passed in mode · 1e630665
      Ian Kent committed
      The autofs file system mkdir inode operation blindly sets the created
      directory mode to S_IFDIR | 0555, ignoring the passed in mode, which can
      cause selinux dac_override denials.
      
      But the function also checks if the caller is the daemon (as no-one else
      should be able to do anything here) so there's no point in not honouring
      the passed in mode, allowing the daemon to set appropriate mode when
      required.
      
      Link: http://lkml.kernel.org/r/152361593601.8051.14014139124905996173.stgit@pluto.themaw.net
      Signed-off-by: Ian Kent <raven@themaw.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
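The difference between the old and new mkdir behaviour described in the autofs commit above can be sketched with Python's `stat` constants. This is only a model: both function names are hypothetical, and masking the request with `stat.S_IMODE` is an assumption of the sketch, not something the commit message states.

```python
import stat


def old_mkdir_mode(requested_mode):
    """Model of the old behaviour: the requested mode is ignored."""
    return stat.S_IFDIR | 0o555


def new_mkdir_mode(requested_mode):
    """Model of the new behaviour: keep the caller's permission bits.

    Masking with S_IMODE is an assumption made for this sketch.
    """
    return stat.S_IFDIR | stat.S_IMODE(requested_mode)


if __name__ == "__main__":
    for mode in (0o755, 0o700):
        print(oct(stat.S_IMODE(old_mkdir_mode(mode))),
              oct(stat.S_IMODE(new_mkdir_mode(mode))))
```

In the old model every request collapses to 0555, so a daemon asking for a writable directory never gets one.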
    • writeback: safer lock nesting · 2e898e4c
      Greg Thelen committed
      lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
      the page's memcg is undergoing move accounting, which occurs when a
      process leaves its memcg for a new one that has
      memory.move_charge_at_immigrate set.
      
      unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if
      the given inode is switching writeback domains.  Switches occur when
      enough writes are issued from a new domain.
      
      This existing pattern is thus suspicious:
          lock_page_memcg(page);
          unlocked_inode_to_wb_begin(inode, &locked);
          ...
          unlocked_inode_to_wb_end(inode, locked);
          unlock_page_memcg(page);
      
      If inode switch and process memcg migration are both in-flight then
      unlocked_inode_to_wb_end() will unconditionally enable interrupts while
      still holding the lock_page_memcg() irq spinlock.  This suggests the
      possibility of deadlock if an interrupt occurs before unlock_page_memcg().
      
          truncate
          __cancel_dirty_page
          lock_page_memcg
          unlocked_inode_to_wb_begin
          unlocked_inode_to_wb_end
          <interrupts mistakenly enabled>
                                          <interrupt>
                                          end_page_writeback
                                          test_clear_page_writeback
                                          lock_page_memcg
                                          <deadlock>
          unlock_page_memcg
      
      Due to configuration limitations this deadlock is not currently possible
      because we don't mix cgroup writeback (a cgroupv2 feature) and
      memory.move_charge_at_immigrate (a cgroupv1 feature).
      
      If the kernel is hacked to always claim inode switching and memcg
      moving_account, then this script triggers lockup in less than a minute:
      
        cd /mnt/cgroup/memory
        mkdir a b
        echo 1 > a/memory.move_charge_at_immigrate
        echo 1 > b/memory.move_charge_at_immigrate
        (
          echo $BASHPID > a/cgroup.procs
          while true; do
            dd if=/dev/zero of=/mnt/big bs=1M count=256
          done
        ) &
        while true; do
          sync
        done &
        sleep 1h &
        SLEEP=$!
        while true; do
          echo $SLEEP > a/cgroup.procs
          echo $SLEEP > b/cgroup.procs
        done
      
      The deadlock does not seem possible, so it's debatable if there's any
      reason to modify the kernel.  I suggest we should, to prevent future
      surprises.  And Wang Long said "this deadlock occurs three times in our
      environment", so there's more reason to apply this, even to stable.
      Stable 4.4 has minor conflicts applying this patch.  For a clean 4.4 patch
      see "[PATCH for-4.4] writeback: safer lock nesting"
      https://lkml.org/lkml/2018/4/11/146
      
      [gthelen@google.com: v4]
        Link: http://lkml.kernel.org/r/20180411084653.254724-1-gthelen@google.com
      [akpm@linux-foundation.org: comment tweaks, struct initialization simplification]
      Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613
      Link: http://lkml.kernel.org/r/20180410005908.167976-1-gthelen@google.com
      Fixes: 682aa8e1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Reported-by: Wang Long <wanglong19@meituan.com>
      Acked-by: Wang Long <wanglong19@meituan.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: <stable@vger.kernel.org>	[v4.2+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
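The nesting hazard in the writeback commit above comes down to the inner unlock enabling interrupts unconditionally while the outer irqsave lock is still held. The toy Python model below tracks only a per-CPU "interrupts enabled" flag. The class and function names are hypothetical, and treating save/restore as the safer inner pairing is an assumption of this sketch rather than a quote from the patch.

```python
class Cpu:
    """Toy model of one CPU's local interrupt-enable flag."""

    def __init__(self):
        self.irqs_enabled = True

    def local_irq_save(self):
        flags = self.irqs_enabled
        self.irqs_enabled = False
        return flags

    def local_irq_restore(self, flags):
        self.irqs_enabled = flags

    def local_irq_disable(self):
        self.irqs_enabled = False

    def local_irq_enable(self):
        self.irqs_enabled = True


def inner_section_with_irq(cpu):
    """Inner lock uses plain irq disable/enable; report IRQ state after it unlocks."""
    outer_flags = cpu.local_irq_save()    # models lock_page_memcg()
    cpu.local_irq_disable()               # models unlocked_inode_to_wb_begin()
    cpu.local_irq_enable()                # models unlocked_inode_to_wb_end()
    leaked = cpu.irqs_enabled             # outer lock is still held here
    cpu.local_irq_restore(outer_flags)    # models unlock_page_memcg()
    return leaked


def inner_section_with_irqsave(cpu):
    """Inner lock saves and restores the previous IRQ state instead."""
    outer_flags = cpu.local_irq_save()
    inner_flags = cpu.local_irq_save()
    cpu.local_irq_restore(inner_flags)
    leaked = cpu.irqs_enabled
    cpu.local_irq_restore(outer_flags)
    return leaked


if __name__ == "__main__":
    print(inner_section_with_irq(Cpu()))      # interrupts leak back on
    print(inner_section_with_irqsave(Cpu()))  # interrupts stay off
```

In the first variant interrupts are enabled while the outer lock is held, which is the window in which the interrupt path in the trace can try to take the same lock.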
    • mm, pagemap: fix swap offset value for PMD migration entry · 88c28f24
      Huang Ying committed
      The swap offset reported by /proc/<pid>/pagemap may be incorrect for
      PMD migration entries.  If the addr passed into pagemap_pmd_range() isn't
      aligned with the PMD start address, the reported swap offset doesn't
      reflect this.  And in the loop that reports information for each sub-page,
      the swap offset isn't incremented the way the PFN is.
      
      This may happen after opening /proc/<pid>/pagemap and seeking to a page
      whose address doesn't align with a PMD start address.  I have verified
      this with a simple test program.
      
      BTW: migration swap entries have PFN information, do we need to restrict
      whether to show them?
      
      [akpm@linux-foundation.org: fix typo, per Huang, Ying]
      Link: http://lkml.kernel.org/r/20180408033737.10897-1-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrei Vagin <avagin@openvz.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "Jerome Glisse" <jglisse@redhat.com>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
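The arithmetic behind the pagemap fix above can be sketched as follows: the swap offset stored in the PMD entry refers to the first sub-page, so the value reported for an unaligned address has to be advanced by the sub-page index and then incremented once per page. The Python below is a model under stated assumptions (4 KiB pages, 2 MiB PMD mappings, a hypothetical function name), not the kernel code.

```python
PAGE_SHIFT = 12                      # assumed 4 KiB base pages
PMD_SHIFT = 21                       # assumed 2 MiB PMD mappings
PAGE_SIZE = 1 << PAGE_SHIFT
PMD_MASK = ~((1 << PMD_SHIFT) - 1)


def subpage_swap_offsets(pmd_swap_offset, addr, end):
    """Return the swap offset to report for each page in [addr, end)."""
    # Advance by the index of addr's page within its PMD-sized region.
    offset = pmd_swap_offset + ((addr & ~PMD_MASK) >> PAGE_SHIFT)
    offsets = []
    while addr < end:
        offsets.append(offset)
        offset += 1                  # one step per sub-page
        addr += PAGE_SIZE
    return offsets


if __name__ == "__main__":
    start = (3 << PMD_SHIFT) + 5 * PAGE_SIZE   # sixth page of a PMD region
    print(subpage_swap_offsets(1000, start, start + 3 * PAGE_SIZE))
```

Starting at the sixth page of a region whose entry holds offset 1000, the model reports 1005, 1006, 1007 rather than repeating 1000.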
    • vfs: Undo an overly zealous MS_RDONLY -> SB_RDONLY conversion · a9e5b732
      David Howells committed
      In do_mount() when the MS_* flags are being converted to MNT_* flags,
      MS_RDONLY got accidentally converted to SB_RDONLY.
      
      Undo this change.
      
      Fixes: e462ec50 ("VFS: Differentiate mount flags (MS_*) from internal superblock flags")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • afs: Fix server record deletion · 66062592
      David Howells committed
      AFS server records get removed from the net->fs_servers tree when
      they're deleted, but not from the net->fs_addresses{4,6} lists, which
      can lead to an oops in afs_find_server() when a server record has been
      removed, for instance during rmmod.
      
      Fix this by deleting the record from the by-address lists before posting
      it for RCU destruction.
      
      The reason this hasn't been noticed before is that the fileserver keeps
      probing the local cache manager, thereby keeping the service record
      alive, so the oops would only happen when a fileserver eventually gets
      bored and stops pinging or if the module gets rmmod'd and a call comes
      in from the fileserver during the window between the server records
      being destroyed and the socket being closed.
      
      The oops looks something like:
      
        BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
        ...
        Workqueue: kafsd afs_process_async_call [kafs]
        RIP: 0010:afs_find_server+0x271/0x36f [kafs]
        ...
        Call Trace:
         afs_deliver_cb_init_call_back_state3+0x1f2/0x21f [kafs]
         afs_deliver_to_call+0x1ee/0x5e8 [kafs]
         afs_process_async_call+0x5b/0xd0 [kafs]
         process_one_work+0x2c2/0x504
         worker_thread+0x1d4/0x2ac
         kthread+0x11f/0x127
         ret_from_fork+0x24/0x30
      
      Fixes: d2ddc776 ("afs: Overhaul volume and server record caching and fileserver rotation")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 20 Apr, 2018 1 commit
  3. 18 Apr, 2018 1 commit
    • udf: Fix leak of UTF-16 surrogates into encoded strings · 44f06ba8
      Jan Kara committed
      OSTA UDF specification does not mention whether the CS0 charset in case
      of two bytes per character encoding should be treated in UTF-16 or
      UCS-2. The sample code in the standard does not treat UTF-16 surrogates
      in any special way but on systems such as Windows which work in UTF-16
      internally, filenames would be treated as being in UTF-16 effectively.
      In Linux it is more difficult to handle characters outside of Base
      Multilingual plane (beyond 0xffff) as NLS framework works with 2-byte
      characters only. Just make sure we don't leak UTF-16 surrogates into the
      resulting string when loading names from the filesystem for now.
      
      CC: stable@vger.kernel.org # >= v4.6
      Reported-by: Mingye Wang <arthur200126@gmail.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
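A surrogate is any 16-bit code unit in the range 0xD800 to 0xDFFF. The sketch below shows one way to keep such units out of a decoded name when each 16-bit unit is handled on its own. Substituting "?" is an assumption made for illustration (the commit message above only says surrogates must not leak through), and both function names are hypothetical.

```python
def is_surrogate(unit):
    """True for UTF-16 high or low surrogate code units."""
    return 0xD800 <= unit <= 0xDFFF


def cs0_units_to_str(units, replacement="?"):
    """Decode 16-bit units one at a time, never emitting a lone surrogate.

    The replacement character is an assumption of this sketch.
    """
    return "".join(replacement if is_surrogate(u) else chr(u) for u in units)


if __name__ == "__main__":
    # 'A', a surrogate pair for a character beyond 0xffff, then 'B'.
    print(cs0_units_to_str([0x0041, 0xD83D, 0xDE00, 0x0042]))
```

Characters inside the Base Multilingual Plane pass through unchanged; only the two halves of a surrogate pair are replaced.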
  4. 17 Apr, 2018 1 commit
    • eCryptfs: don't pass up plaintext names when using filename encryption · e86281e7
      Tyler Hicks committed
      Both ecryptfs_filldir() and ecryptfs_readlink_lower() use
      ecryptfs_decode_and_decrypt_filename() to translate lower filenames to
      upper filenames. The function correctly passes up lower filenames,
      unchanged, when filename encryption isn't in use. However, it was also
      passing up lower filenames when the filename wasn't encrypted or
      when decryption failed. Since 88ae4ab9, eCryptfs refuses to look up
      lower plaintext names when filename encryption is enabled, so this
      resulted in a situation where userspace would see lower plaintext
      filenames in calls to getdents(2) but then not be able to look up those
      filenames.
      
      An example of this can be seen when enabling filename encryption on an
      eCryptfs mount at the root directory of an Ext4 filesystem:
      
      $ ls -1i /lower
      12 ECRYPTFS_FNEK_ENCRYPTED.FWYZD8TcW.5FV-TKTEYOHsheiHX9a-w.NURCCYIMjI8pn5BDB9-h3fXwrE--
      11 lost+found
      $ ls -1i /upper
      ls: cannot access '/upper/lost+found': No such file or directory
       ? lost+found
      12 test
      
      With this change, the lower lost+found dentry is ignored:
      
      $ ls -1i /lower
      12 ECRYPTFS_FNEK_ENCRYPTED.FWYZD8TcW.5FV-TKTEYOHsheiHX9a-w.NURCCYIMjI8pn5BDB9-h3fXwrE--
      11 lost+found
      $ ls -1i /upper
      12 test
      
      Additionally, some potentially noisy error/info messages in the related
      code paths are turned into debug messages so that the logs can't be
      easily filled.
      
      Fixes: 88ae4ab9 ("ecryptfs_lookup(): try either only encrypted or plaintext name")
      Reported-by: Guenter Roeck <linux@roeck-us.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
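The directory-listing rule described in the eCryptfs commit above can be modelled as a filter over lower names. In the Python sketch below, the prefix string is taken from the `ls` output in the commit message; the function name and the `decrypt` callable (returning the plaintext name, or None on failure) are hypothetical stand-ins, not the eCryptfs implementation.

```python
FNEK_PREFIX = "ECRYPTFS_FNEK_ENCRYPTED."   # prefix seen in the listing above


def upper_names(lower_names, filename_encryption, decrypt):
    """Return the names that should be visible in the upper directory.

    decrypt is a hypothetical callable: plaintext name, or None on failure.
    """
    visible = []
    for name in lower_names:
        if not filename_encryption:
            visible.append(name)          # pass lower names up unchanged
            continue
        if not name.startswith(FNEK_PREFIX):
            continue                      # plaintext lower name: skip it
        plain = decrypt(name)
        if plain is not None:             # decryption failure: skip it too
            visible.append(plain)
    return visible


if __name__ == "__main__":
    lower = [FNEK_PREFIX + "example", "lost+found"]
    fake_decrypt = {FNEK_PREFIX + "example": "test"}.get
    print(upper_names(lower, True, fake_decrypt))
    print(upper_names(lower, False, fake_decrypt))
```

With filename encryption on, `lost+found` no longer appears in the upper listing, matching the second `ls` output in the commit message.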
  5. 16 Apr, 2018 6 commits
  6. 14 Apr, 2018 1 commit
  7. 13 Apr, 2018 13 commits
  8. 12 Apr, 2018 9 commits
    • btrfs: add SPDX header to Kconfig · 852eb3ae
      David Sterba committed
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: replace GPL boilerplate by SPDX -- sources · c1d7c514
      David Sterba committed
      Remove GPL boilerplate text (long, short, one-line) and keep the rest,
      i.e. personal, company or original source copyright statements. Add the
      SPDX header.
      
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: replace GPL boilerplate by SPDX -- headers · 9888c340
      David Sterba committed
      Remove GPL boilerplate text (long, short, one-line) and keep the rest,
      i.e. personal, company or original source copyright statements. Add the
      SPDX header.
      
      Unify the include protection macros to match the file names.
      
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix loss of prealloc extents past i_size after fsync log replay · 471d557a
      Filipe Manana committed
      Currently if we allocate extents beyond an inode's i_size (through the
      fallocate system call) and then fsync the file, we log the extents but
      after a power failure we replay them and then immediately drop them.
      This behaviour has existed since about 2009, commit c71bf099 ("Btrfs:
      Avoid orphan inodes cleanup while replaying log"), because it marks
      the inode as an orphan instead of dropping any extents beyond i_size
      before replaying logged extents, so after the log replay, and while
      the mount operation is still ongoing, we find the inode marked as an
      orphan and then perform a truncation (drop extents beyond the inode's
      i_size). Because the processing of orphan inodes is still done
      right after replaying the log and before the mount operation finishes,
      the intention of that commit does not make any sense (at least as
      of today). However reverting that behaviour is not enough, because
      we can not simply discard all extents beyond i_size and then replay
      logged extents, because we risk dropping extents beyond i_size created
      in past transactions, for example:
      
        add prealloc extent beyond i_size
        fsync - clears the flag BTRFS_INODE_NEEDS_FULL_SYNC from the inode
        transaction commit
        add another prealloc extent beyond i_size
        fsync - triggers the fast fsync path
        power failure
      
      In that scenario, we would drop the first extent and then replay the
      second one. To fix this just make sure that all prealloc extents
      beyond i_size are logged, and if we find too many (which is far from
      a common case), fallback to a full transaction commit (like we do when
      logging regular extents in the fast fsync path).
      
      Trivial reproducer:
      
       $ mkfs.btrfs -f /dev/sdb
       $ mount /dev/sdb /mnt
       $ xfs_io -f -c "pwrite -S 0xab 0 256K" /mnt/foo
       $ sync
       $ xfs_io -c "falloc -k 256K 1M" /mnt/foo
       $ xfs_io -c "fsync" /mnt/foo
       <power failure>
      
       # mount to replay log
       $ mount /dev/sdb /mnt
       # at this point the file only has one extent, at offset 0, size 256K
      
      A test case for fstests follows soon, covering multiple scenarios that
      involve adding prealloc extents with previous shrinking truncates and
      without such truncates.
      
      Fixes: c71bf099 ("Btrfs: Avoid orphan inodes cleanup while replaying log")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: clean up resources during umount after trans is aborted · af722733
      Liu Bo committed
      Currently if some fatal errors occur, such as all IO getting -EIO, resources
      would be cleaned up when
      a) transaction is being committed or
      b) BTRFS_FS_STATE_ERROR is set
      
      However, in some rare cases, resources may be left alone after the transaction
      gets aborted and umount may run into some ASSERT(), e.g.
      ASSERT(list_empty(&block_group->dirty_list));
      
      For case a), in btrfs_commit_transaction(), there are several places at the
      beginning where we just call btrfs_end_transaction() without cleaning up
      resources.  For case b), it is possible that the trans handle doesn't have
      any dirty stuff, so only the trans handle is marked as aborted while
      BTRFS_FS_STATE_ERROR is not set, and resources remain in memory.
      
      This makes btrfs also check BTRFS_FS_STATE_TRANS_ABORTED to make sure that
      all resources won't stay in memory after umount.
      
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • ovl: add support for "xino" mount and config options · 795939a9
      Amir Goldstein committed
      With mount option "xino=on", mounter declares that there are enough
      free high bits in underlying fs to hold the layer fsid.
      If overlayfs does encounter underlying inodes using the high xino
      bits reserved for layer fsid, a warning will be emitted and the original
      inode number will be used.
      
      The mount option name "xino" is taken from a mount option of similar
      meaning in aufs, but in the overlayfs case, the mapping is stateless.
      
      An example use case for "xino=on" is when upper/lower is on an xfs
      filesystem. xfs uses 64bit inode numbers, but it currently never uses the
      upper 8 bits for inode numbers exposed via stat(2), and that is not likely to
      change in the future without the user opting in to a new xfs feature. The
      actual number of unused upper bits is much larger and is determined by the xfs
      filesystem geometry (64 - agno_log - agblklog - inopblog). That means
      that for all practical purposes, there are enough unused bits in xfs
      inode numbers for more than OVL_MAX_STACK unique fsid's.
      
      Another use case for "xino=on" is when upper/lower is on tmpfs. tmpfs inode
      numbers are allocated sequentially since boot, so they will practically
      never use the high inode number bits.
      
      For compatibility with applications that expect 32bit inodes, the feature
      can be disabled with "xino=off". The option "xino=auto" automatically
      detects underlying filesystems that use 32bit inodes and enables the
      feature. The Kconfig option OVERLAY_FS_XINO_AUTO and the module parameter of
      the same name determine whether the default mode for an overlayfs mount is
      "xino=auto" or "xino=off".
      
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
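The xino idea in the commit above is to place the layer fsid in the otherwise unused high bits of a 64-bit inode number, and to fall back to the original number (with a warning) when those bits are already in use. The Python sketch below models that mapping. The function name, the `(ino, overflowed)` return shape, and the choice of 8 reserved bits in the usage lines are assumptions for illustration only.

```python
def xino_map(ino, fsid, xino_bits):
    """Return (mapped_ino, overflowed) for a 64-bit inode number.

    xino_bits is the number of high bits reserved for the layer fsid;
    overflowed is True when the underlying inode already uses those bits.
    """
    if xino_bits == 0:
        return ino, False                 # feature off: pass through
    shift = 64 - xino_bits
    if ino >> shift:
        return ino, True                  # high bits in use: keep original, warn
    return ino | (fsid << shift), False


if __name__ == "__main__":
    print(hex(xino_map(42, 1, 8)[0]))     # lower layer, fsid 1
    print(hex(xino_map(42, 0, 8)[0]))     # upper layer, fsid 0
    print(xino_map(1 << 60, 1, 8))        # high bits already used
```

Because the upper layer's fsid is 0, its inode numbers are unchanged, while the same number on another layer maps to a distinct value.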
    • ovl: consistent d_ino for non-samefs with xino · adbf4f7e
      Amir Goldstein committed
      When overlay layers are not all on the same fs, but all inode numbers
      of underlying fs do not use the high 'xino' bits, overlay st_ino values
      are constant and persistent.
      
      In that case, relax non-samefs constraint for consistent d_ino and always
      iterate non-merge dir using ovl_fill_real() actor so we can remap lower
      inode numbers to unique lower fs range.
      
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • ovl: consistent i_ino for non-samefs with xino · 12574a9f
      Amir Goldstein committed
      When overlay layers are not all on the same fs, but all inode numbers
      of underlying fs do not use the high 'xino' bits, overlay st_ino values
      are constant and persistent.
      
      In that case, set i_ino value to the same value as st_ino for nfsd
      readdirplus validator.
      
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • ovl: constant st_ino for non-samefs with xino · e487d889
      Amir Goldstein committed
      On 64bit systems, when overlay layers are not all on the same fs, but
      all inode numbers of underlying fs are not using the high bits, use the
      high bits to partition the overlay st_ino address space.  The high bits
      hold the fsid (upper fsid is 0).  This way overlay inode numbers are unique
      and all inodes use overlay st_dev.  Inode numbers are also persistent
      for a given layer configuration.
      
      Currently, our only indication for available high ino bits is from a
      filesystem that supports file handles and uses the default encode_fh()
      operation, which encodes a 32bit inode number.
      
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>