1. 24 1月, 2018 6 次提交
  2. 20 1月, 2018 4 次提交
    • A
      ovl: take mnt_want_write() for removing impure xattr · a5a927a7
      Amir Goldstein 提交于
      The optimization in ovl_cache_get_impure() that tries to remove an
      unneeded "impure" xattr needs to take mnt_want_write() on upper fs.
      
      Fixes: 4edb83bb ("ovl: constant d_ino for non-merge dirs")
      Cc: <stable@vger.kernel.org> #v4.14
      Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      a5a927a7
    • A
      ovl: take mnt_want_write() for work/index dir setup · 2ba9d57e
      Amir Goldstein 提交于
      There are several write operations on upper fs not covered by
      mnt_want_write():
      
      - test set/remove OPAQUE xattr
      - test create O_TMPFILE
      - set ORIGIN xattr in ovl_verify_origin()
      - cleanup of index entries in ovl_indexdir_cleanup()
      
      Some of these go way back, but this patch only applies over the
      v4.14 re-factoring of ovl_fill_super().
      
      Cc: <stable@vger.kernel.org> #v4.14
      Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      2ba9d57e
    • A
      ovl: fix another overlay: warning prefix · f8167817
      Amir Goldstein 提交于
      Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      f8167817
    • A
      ovl: take lower dir inode mutex outside upper sb_writers lock · 6d0a8a90
      Amir Goldstein 提交于
      The functions ovl_lower_positive() and ovl_check_empty_dir() both take
      inode mutex on the real lower dir under ovl_want_write() which takes
      the upper_mnt sb_writers lock.
      
      While this is not a clear locking order or layering violation, it creates
      an undesired lock dependency between two unrelated layers for no good
      reason.
      
      This lock dependency materializes to a false(?) positive lockdep warning
      when calling rmdir() on a nested overlayfs, where both nested and
      underlying overlayfs both use the same fs type as upper layer.
      
      rmdir() on the nested overlayfs creates the lock chain:
        sb_writers of upper_mnt (e.g. tmpfs) in ovl_do_remove()
        ovl_i_mutex_dir_key[] of lower overlay dir in ovl_lower_positive()
      
      rmdir() on the underlying overlayfs creates the lock chain in
      reverse order:
        ovl_i_mutex_dir_key[] of lower overlay dir in vfs_rmdir()
        sb_writers of nested upper_mnt (e.g. tmpfs) in ovl_do_remove()
      
      To rid of the unneeded locking dependency, move both ovl_lower_positive()
      and ovl_check_empty_dir() to before ovl_want_write() in rmdir() and
      rename() implementation.
      
      This change spreads the pieces of ovl_check_empty_and_clear() directly
      inside the rmdir()/rename() implementations so the helper is no longer
      needed and removed.
      Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      6d0a8a90
  3. 19 1月, 2018 2 次提交
    • A
      ovl: fix failure to fsync lower dir · d796e77f
      Amir Goldstein 提交于
      As a writable mount, it is not expected for overlayfs to return
      EINVAL/EROFS for fsync, even if dir/file is not changed.
      
      This commit fixes the case of fsync of directory, which is easier to
      address, because overlayfs already implements fsync file operation for
      directories.
      
      The problem reported by Raphael is that new PostgreSQL 10.0 with a
      database in overlayfs where lower layer in squashfs fails to start.
      The failure is due to fsync error, when PostgreSQL does fsync on all
      existing db directories on startup and a specific directory exists
      lower layer with no changes.
      Reported-by: NRaphael Hertzog <raphael@ouaza.com>
      Cc: <stable@vger.kernel.org> # v3.18
      Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
      Tested-by: NRaphaël Hertzog <hertzog@debian.org>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      d796e77f
    • A
      ovl: hash directory inodes for fsnotify · 31747eda
      Amir Goldstein 提交于
      fsnotify pins a watched directory inode in cache, but if directory dentry
      is released, new lookup will allocate a new dentry and a new inode.
      Directory events will be notified on the new inode, while fsnotify listener
      is watching the old pinned inode.
      
      Hash all directory inodes to reuse the pinned inode on lookup. Pure upper
      dirs are hashes by real upper inode, merge and lower dirs are hashed by
      real lower inode.
      
      The reference to lower inode was being held by the lower dentry object
      in the overlay dentry (oe->lowerstack[0]). Releasing the overlay dentry
      may drop lower inode refcount to zero. Add a refcount on behalf of the
      overlay inode to prevent that.
      
      As a by-product, hashing directory inodes also detects multiple
      redirected dirs to the same lower dir and uncovered redirected dir
      target on and returns -ESTALE on lookup.
      
      The reported issue dates back to initial version of overlayfs, but this
      patch depends on ovl_inode code that was introduced in kernel v4.13.
      
      Cc: <stable@vger.kernel.org> #v4.13
      Reported-by: NNiklas Cassel <niklas.cassel@axis.com>
      Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Tested-by: NNiklas Cassel <niklas.cassel@axis.com>
      31747eda
  4. 05 1月, 2018 1 次提交
  5. 04 1月, 2018 1 次提交
  6. 03 1月, 2018 5 次提交
  7. 02 1月, 2018 3 次提交
  8. 22 12月, 2017 6 次提交
  9. 21 12月, 2017 1 次提交
  10. 19 12月, 2017 1 次提交
  11. 18 12月, 2017 2 次提交
  12. 17 12月, 2017 1 次提交
  13. 16 12月, 2017 3 次提交
    • L
      Revert "mm: replace p??_write with pte_access_permitted in fault + gup paths" · f6f37321
      Linus Torvalds 提交于
      This reverts commits 5c9d2d5c, c7da82b8, and e7fe7b5c.
      
      We'll probably need to revisit this, but basically we should not
      complicate the get_user_pages_fast() case, and checking the actual page
      table protection key bits will require more care anyway, since the
      protection keys depend on the exact state of the VM in question.
      
      Particularly when doing a "remote" page lookup (ie in somebody elses VM,
      not your own), you need to be much more careful than this was.  Dave
      Hansen says:
      
       "So, the underlying bug here is that we now a get_user_pages_remote()
        and then go ahead and do the p*_access_permitted() checks against the
        current PKRU. This was introduced recently with the addition of the
        new p??_access_permitted() calls.
      
        We have checks in the VMA path for the "remote" gups and we avoid
        consulting PKRU for them. This got missed in the pkeys selftests
        because I did a ptrace read, but not a *write*. I also didn't
        explicitly test it against something where a COW needed to be done"
      
      It's also not entirely clear that it makes sense to check the protection
      key bits at this level at all.  But one possible eventual solution is to
      make the get_user_pages_fast() case just abort if it sees protection key
      bits set, which makes us fall back to the regular get_user_pages() case,
      which then has a vma and can do the check there if we want to.
      
      We'll see.
      
      Somewhat related to this all: what we _do_ want to do some day is to
      check the PAGE_USER bit - it should obviously always be set for user
      pages, but it would be a good check to have back.  Because we have no
      generic way to test for it, we lost it as part of moving over from the
      architecture-specific x86 GUP implementation to the generic one in
      commit e585513b ("x86/mm/gup: Switch GUP to the generic
      get_user_page_fast() implementation").
      
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f6f37321
    • S
      nfs: don't wait on commit in nfs_commit_inode() if there were no commit requests · dc4fd9ab
      Scott Mayhew 提交于
      If there were no commit requests, then nfs_commit_inode() should not
      wait on the commit or mark the inode dirty, otherwise the following
      BUG_ON can be triggered:
      
      [ 1917.130762] kernel BUG at fs/inode.c:578!
      [ 1917.130766] Oops: Exception in kernel mode, sig: 5 [#1]
      [ 1917.130768] SMP NR_CPUS=2048 NUMA pSeries
      [ 1917.130772] Modules linked in: iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi blocklayoutdriver rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc sg nx_crypto pseries_rng ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic crct10dif_common ibmvscsi scsi_transport_srp ibmveth scsi_tgt dm_mirror dm_region_hash dm_log dm_mod
      [ 1917.130805] CPU: 2 PID: 14923 Comm: umount.nfs4 Tainted: G               ------------ T 3.10.0-768.el7.ppc64 #1
      [ 1917.130810] task: c0000005ecd88040 ti: c00000004cea0000 task.ti: c00000004cea0000
      [ 1917.130813] NIP: c000000000354178 LR: c000000000354160 CTR: c00000000012db80
      [ 1917.130816] REGS: c00000004cea3720 TRAP: 0700   Tainted: G               ------------ T  (3.10.0-768.el7.ppc64)
      [ 1917.130820] MSR: 8000000100029032 <SF,EE,ME,IR,DR,RI>  CR: 22002822  XER: 20000000
      [ 1917.130828] CFAR: c00000000011f594 SOFTE: 1
      GPR00: c000000000354160 c00000004cea39a0 c0000000014c4700 c0000000018cc750
      GPR04: 000000000000c750 80c0000000000000 0600000000000000 04eeb76bea749a03
      GPR08: 0000000000000034 c0000000018cc758 0000000000000001 d000000005e619e8
      GPR12: c00000000012db80 c000000007b31200 0000000000000000 0000000000000000
      GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
      GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
      GPR24: 0000000000000000 c000000000dfc3ec 0000000000000000 c0000005eefc02c0
      GPR28: d0000000079dbd50 c0000005b94a02c0 c0000005b94a0250 c0000005b94a01c8
      [ 1917.130867] NIP [c000000000354178] .evict+0x1c8/0x350
      [ 1917.130871] LR [c000000000354160] .evict+0x1b0/0x350
      [ 1917.130873] Call Trace:
      [ 1917.130876] [c00000004cea39a0] [c000000000354160] .evict+0x1b0/0x350 (unreliable)
      [ 1917.130880] [c00000004cea3a30] [c0000000003558cc] .evict_inodes+0x13c/0x270
      [ 1917.130884] [c00000004cea3af0] [c000000000327d20] .kill_anon_super+0x70/0x1e0
      [ 1917.130896] [c00000004cea3b80] [d000000005e43e30] .nfs_kill_super+0x20/0x60 [nfs]
      [ 1917.130900] [c00000004cea3c00] [c000000000328a20] .deactivate_locked_super+0xa0/0x1b0
      [ 1917.130903] [c00000004cea3c80] [c00000000035ba54] .cleanup_mnt+0xd4/0x180
      [ 1917.130907] [c00000004cea3d10] [c000000000119034] .task_work_run+0x114/0x150
      [ 1917.130912] [c00000004cea3db0] [c00000000001ba6c] .do_notify_resume+0xcc/0x100
      [ 1917.130916] [c00000004cea3e30] [c00000000000a7b0] .ret_from_except_lite+0x5c/0x60
      [ 1917.130919] Instruction dump:
      [ 1917.130921] 7fc3f378 486734b5 60000000 387f00a0 38800003 4bdcb365 60000000 e95f00a0
      [ 1917.130927] 694a0060 7d4a0074 794ad182 694a0001 <0b0a0000> 892d02a4 2f890000 40de0134
      Signed-off-by: NScott Mayhew <smayhew@redhat.com>
      Cc: stable@vger.kernel.org # 4.5+
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      dc4fd9ab
    • S
      nfs: fix a deadlock in nfs client initialization · c156618e
      Scott Mayhew 提交于
      The following deadlock can occur between a process waiting for a client
      to initialize in while walking the client list during nfsv4 server trunking
      detection and another process waiting for the nfs_clid_init_mutex so it
      can initialize that client:
      
      Process 1                               Process 2
      ---------                               ---------
      spin_lock(&nn->nfs_client_lock);
      list_add_tail(&CLIENTA->cl_share_link,
              &nn->nfs_client_list);
      spin_unlock(&nn->nfs_client_lock);
                                              spin_lock(&nn->nfs_client_lock);
                                              list_add_tail(&CLIENTB->cl_share_link,
                                                      &nn->nfs_client_list);
                                              spin_unlock(&nn->nfs_client_lock);
                                              mutex_lock(&nfs_clid_init_mutex);
                                              nfs41_walk_client_list(clp, result, cred);
                                              nfs_wait_client_init_complete(CLIENTA);
      (waiting for nfs_clid_init_mutex)
      
      Make sure nfs_match_client() only evaluates clients that have completed
      initialization in order to prevent that deadlock.
      
      This patch also fixes v4.0 trunking behavior by not marking the client
      NFS_CS_READY until the clientid has been confirmed.
      Signed-off-by: NScott Mayhew <smayhew@redhat.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      c156618e
  14. 15 12月, 2017 4 次提交