1. 29 Dec, 2018 2 commits
    • Revert "vfs: Allow userns root to call mknod on owned filesystems." · 9c5ccadb
      Christian Brauner committed
      commit 94f82008ce30e2624537d240d64ce718255e0b80 upstream.
      
      This reverts commit 55956b59.
      
      commit 55956b59 ("vfs: Allow userns root to call mknod on owned filesystems.")
      enabled mknod() in user namespaces for userns root if CAP_MKNOD is
      available. However, these device nodes are useless since any filesystem
      mounted from a non-initial user namespace will set the SB_I_NODEV flag on
      the filesystem. Now, when a device node is created in a non-initial user
      namespace a call to open() on said device node will fail due to:
      
      bool may_open_dev(const struct path *path)
      {
              return !(path->mnt->mnt_flags & MNT_NODEV) &&
                      !(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
      }
      
      The problem with this is that as of the aforementioned commit mknod()
      creates partially functional device nodes in non-initial user namespaces.
      In particular, it has the consequence that as of the aforementioned commit
      open() will be more privileged with respect to device nodes than mknod().
      Before it was the other way around. Specifically, if mknod() succeeded
      then it was transparent for any userspace application that a fatal error
      must have occurred when open() failed.
      
      All of this breaks multiple userspace workloads and a widespread assumption
      about how to handle mknod(). Basically, all container runtimes and systemd
      live by the slogan "ask for forgiveness not permission" when running user
      namespace workloads. For mknod() the assumption is that if the syscall
      succeeds the device nodes are useable irrespective of whether it succeeds
      in a non-initial user namespace or not. This logic was chosen explicitly
      to allow for the glorious day when mknod() will actually be able to create
      fully functional device nodes in user namespaces.

      A specific problem people are already running into when running 4.18 rc
      kernels is failing systemd services. For any distro that is run in a
      container systemd services started with the PrivateDevices= property set
      will fail to start since the device nodes in question cannot be
      opened (cf. the arguments in [1]).
      
      Full disclosure, Seth made the very sound argument that it is already
      possible to end up with partially functional device nodes. Any filesystem
      mounted with MS_NODEV set will allow mknod() to succeed but will not allow
      open() to succeed. The difference to the case here is that the MS_NODEV
      case is transparent to userspace since it is an explicitly set mount option
      while the SB_I_NODEV case is an implicit property enforced by the kernel
      and hence opaque to userspace.
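
The distinction drawn above can be sketched as a small userspace model. This is only an illustration, not kernel code: the `model_` names, the struct, and the flag values are assumptions made up for this example, and only the shape of the check follows the may_open_dev() snippet quoted in the commit message. It shows that both the explicit mount option and the implicit superblock flag make open() fail on a device node, while only the former shows up in the mount flags that userspace can inspect.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values for this model only; the kernel defines the real flags. */
#define MODEL_MNT_NODEV  0x02u  /* stands in for MNT_NODEV (explicit mount option) */
#define MODEL_SB_I_NODEV 0x04u  /* stands in for SB_I_NODEV (set by the kernel) */

/* Hypothetical, flattened stand-in for the mount and superblock state. */
struct model_mount {
    unsigned int mnt_flags; /* per-mount flags, reflected in mount options */
    unsigned int s_iflags;  /* internal superblock flags */
};

/* Same shape as the quoted may_open_dev(): either flag blocks device opens. */
bool model_may_open_dev(const struct model_mount *m)
{
    return !(m->mnt_flags & MODEL_MNT_NODEV) &&
           !(m->s_iflags & MODEL_SB_I_NODEV);
}

/* What userspace could conclude from looking at mount options alone. */
bool model_nodev_visible(const struct model_mount *m)
{
    return (m->mnt_flags & MODEL_MNT_NODEV) != 0;
}

/* Walks through the two cases discussed in the commit message. */
void model_self_check(void)
{
    struct model_mount explicit_nodev = { MODEL_MNT_NODEV, 0 };
    struct model_mount userns_mount   = { 0, MODEL_SB_I_NODEV };

    /* Explicit nodev mount: open is denied, and userspace can see why. */
    assert(!model_may_open_dev(&explicit_nodev));
    assert(model_nodev_visible(&explicit_nodev));

    /* Mount from a non-initial user namespace: open is denied,
     * but nothing in the mount options explains it. */
    assert(!model_may_open_dev(&userns_mount));
    assert(!model_nodev_visible(&userns_mount));
}
```

In this model, a caller that sees mknod() succeed on `userns_mount` has no mount option to consult before open() fails, which is the opacity the message describes.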
      
      [1]: https://github.com/systemd/systemd/pull/9483

      Signed-off-by: Christian Brauner <christian@brauner.io>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Seth Forshee <seth.forshee@canonical.com>
      Cc: Serge Hallyn <serge@hallyn.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • iomap: Revert "fs/iomap.c: get/put the page in iomap_page_create/release()" · 38d072a4
      Dave Chinner committed
      [ Upstream commit a837eca2412051628c0529768c9bc4f3580b040e ]
      
      This reverts commit 61c6de667263184125d5ca75e894fcad632b0dd3.
      
      The reverted commit added page reference counting to iomap page
      structures that are used to track block size < page size state. This
      was supposed to align the code with page migration page accounting
      assumptions, but what it has done instead is break XFS filesystems.
      Every fstests run I've done on sub-page block size XFS filesystems
      since picking up this commit 2 days ago has failed with bad page
      state errors such as:
      
      # ./run_check.sh "-m rmapbt=1,reflink=1 -i sparse=1 -b size=1k" "generic/038"
      ....
      SECTION       -- xfs
      FSTYP         -- xfs (debug)
      PLATFORM      -- Linux/x86_64 test1 4.20.0-rc6-dgc+
      MKFS_OPTIONS  -- -f -m rmapbt=1,reflink=1 -i sparse=1 -b size=1k /dev/sdc
      MOUNT_OPTIONS -- /dev/sdc /mnt/scratch
      
      generic/038 454s ...
       run fstests generic/038 at 2018-12-20 18:43:05
       XFS (sdc): Unmounting Filesystem
       XFS (sdc): Mounting V5 Filesystem
       XFS (sdc): Ending clean mount
       BUG: Bad page state in process kswapd0  pfn:3a7fa
       page:ffffea0000ccbeb0 count:0 mapcount:0 mapping:ffff88800d9b6360 index:0x1
       flags: 0xfffffc0000000()
       raw: 000fffffc0000000 dead000000000100 dead000000000200 ffff88800d9b6360
       raw: 0000000000000001 0000000000000000 00000000ffffffff
       page dumped because: non-NULL mapping
       CPU: 0 PID: 676 Comm: kswapd0 Not tainted 4.20.0-rc6-dgc+ #915
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014
       Call Trace:
        dump_stack+0x67/0x90
        bad_page.cold.116+0x8a/0xbd
        free_pcppages_bulk+0x4bf/0x6a0
        free_unref_page_list+0x10f/0x1f0
        shrink_page_list+0x49d/0xf50
        shrink_inactive_list+0x19d/0x3b0
        shrink_node_memcg.constprop.77+0x398/0x690
        ? shrink_slab.constprop.81+0x278/0x3f0
        shrink_node+0x7a/0x2f0
        kswapd+0x34b/0x6d0
        ? node_reclaim+0x240/0x240
        kthread+0x11f/0x140
        ? __kthread_bind_mask+0x60/0x60
        ret_from_fork+0x24/0x30
       Disabling lock debugging due to kernel taint
      ....
      
      The failures come from anywhere that frees pages and empties the
      per-cpu page magazines, so it's neither a predictable failure nor
      an easy one to debug.
      
      generic/038 is a reliable reproducer of this problem - it has a 9 in
      10 failure rate on one of my test machines. Failures on other
      machines have been at random points in fstests runs but every run
      has ended up tripping this problem. Hence generic/038 was used to
      bisect the failure because it was the most reliable failure.
      
      It is too close to the 4.20 release (not to mention holidays) to
      try to diagnose, fix and test the underlying cause of the problem,
      so reverting the commit is the only option we have right now. The
      revert has been tested against a current tot 4.20-rc7+ kernel across
      multiple machines running sub-page block size XFS filesystems and
      none of the bad page state failures have been seen.

      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Cc: Piotr Jaroszynski <pjaroszynski@nvidia.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  2. 21 Dec, 2018 38 commits