1. 23 April 2021, 8 commits
  2. 24 March 2021, 1 commit
  3. 16 March 2021, 2 commits
  4. 30 January 2021, 1 commit
  5. 24 January 2021, 3 commits
    • C
      fs: make helpers idmap mount aware · 549c7297
      Committed by Christian Brauner
      Extend some inode methods with an additional user namespace argument. A
      filesystem that is aware of idmapped mounts will receive the user
      namespace the mount has been marked with. This can be used for
      additional permission checking and also to enable filesystems to
      translate between uids and gids if they need to. We have implemented all
      relevant helpers in earlier patches.
      
      As requested, we simply extend the existing inode method instead of
      introducing new ones. This is a little more code churn, but it's mostly
      mechanical and doesn't leave us with additional inode methods.
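
      For illustration, a minimal sketch of what the extended method
      signatures amount to, assuming the user namespace argument is passed
      first; this is not a verbatim or exhaustive list of the methods touched.
      For mounts that aren't idmapped this argument is simply the initial
      user namespace, so existing permission logic is unchanged.

      	struct inode_operations {
      		int (*create)(struct user_namespace *mnt_userns, struct inode *dir,
      			      struct dentry *dentry, umode_t mode, bool excl);
      		int (*mkdir)(struct user_namespace *mnt_userns, struct inode *dir,
      			     struct dentry *dentry, umode_t mode);
      		int (*setattr)(struct user_namespace *mnt_userns, struct dentry *dentry,
      			       struct iattr *attr);
      		/* ... further methods extended in the same mechanical way ... */
      	};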
      
      Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      549c7297
    • C
      stat: handle idmapped mounts · 0d56a451
      Committed by Christian Brauner
      The generic_fillattr() helper fills in the basic attributes associated
      with an inode. Enable it to handle idmapped mounts. If the inode is
      accessed through an idmapped mount, map it into the mount's user
      namespace before we store the uid and gid. If the initial user namespace
      is passed, nothing changes, so non-idmapped mounts see the same
      behavior as before.
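
      A minimal sketch of the relevant part of the change, assuming the
      i_uid_into_mnt()/i_gid_into_mnt() mapping helpers added earlier in the
      series (abridged, not the full function):

      	void generic_fillattr(struct user_namespace *mnt_userns,
      			      struct inode *inode, struct kstat *stat)
      	{
      		/* Map the owner into the mount's user namespace before reporting
      		 * it; with &init_user_ns this is a no-op. */
      		stat->uid = i_uid_into_mnt(mnt_userns, inode);
      		stat->gid = i_gid_into_mnt(mnt_userns, inode);
      		/* ... remaining attributes copied from the inode as before ... */
      	}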
      
      Link: https://lore.kernel.org/r/20210121131959.646623-12-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      0d56a451
    • C
      acl: handle idmapped mounts · e65ce2a5
      Committed by Christian Brauner
      The posix acl permission checking helpers determine whether a caller is
      privileged over an inode according to the acls associated with the
      inode. Add helpers that make it possible to handle acls on idmapped
      mounts.
      
      The vfs and the filesystems targeted by this first iteration make use of
      posix_acl_fix_xattr_from_user() and posix_acl_fix_xattr_to_user() to
      translate basic posix access and default permissions such as the
      ACL_USER and ACL_GROUP type according to the initial user namespace (or
      the superblock's user namespace) to and from the caller's current user
      namespace. Adapt these two helpers to handle idmapped mounts whereby we
      either map from or into the mount's user namespace depending on in which
      direction we're translating.
      Similarly, cap_convert_nscap() is used by the vfs to translate user
      namespace and non-user namespace aware filesystem capabilities from the
      superblock's user namespace to the caller's user namespace. Enable it to
      handle idmapped mounts by accounting for the mount's user namespace.
      
      In addition, the filesystems targeted in the first iteration of this
      patch series make use of the posix_acl_chmod() and
      posix_acl_update_mode() helpers. Both helpers perform permission checks
      on the target inode. Let them handle idmapped mounts. These two helpers
      are called when posix acls are set by the respective filesystems; to
      handle this case we extend the ->set() method to take an additional user
      namespace argument to pass the mount's user namespace down.
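
      As a rough sketch, the extended xattr handler ->set() method then has a
      shape along these lines (argument order illustrative, not a verbatim
      copy of the header):

      	int (*set)(const struct xattr_handler *handler,
      		   struct user_namespace *mnt_userns, struct dentry *dentry,
      		   struct inode *inode, const char *name,
      		   const void *buffer, size_t size, int flags);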
      
      Link: https://lore.kernel.org/r/20210121131959.646623-9-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      e65ce2a5
  6. 04 January 2021, 2 commits
    • D
      afs: Fix directory entry size calculation · 366911cd
      Committed by David Howells
      The number of dirent records used by an AFS directory entry should be
      calculated using the assumption that there is a 16-byte name field in the
      first block, rather than a 20-byte name field (which is actually the case).
      This miscalculation is historic and effectively standard, so we have to use
      it.
      
      The calculation we need to use is:
      
      	1 + (((strlen(name) + 1) + 15) >> 5)
      
      where we are adding one to the strlen() result to account for the NUL
      termination.
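
      A sketch of an inline helper implementing that calculation (see point
      (1) below; the helper name is illustrative):

      	static inline unsigned int afs_dir_calc_slots(size_t name_len)
      	{
      		/* +1 for the NUL; +15 rounds up to the next 32-byte dirent record */
      		return 1 + ((name_len + 1 + 15) >> 5);
      	}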
      
      Fix this by the following means:
      
       (1) Create an inline function to do the calculation for a given name
           length.
      
       (2) Use the function to calculate the number of records used for a dirent
           in afs_dir_iterate_block().
      
           Use this to move the over-end check out of the loop since it only
           needs to be done once.
      
           Further use this to only go through the loop for the 2nd+ records
           composing an entry.  The only test there now is whether the record is
           allocated - and we already checked the first block at the top of the
           outer loop.
      
       (3) Add a max name length check in afs_dir_iterate_block().
      
       (4) Make afs_edit_dir_add() and afs_edit_dir_remove() use the function
           from (1) to calculate the number of blocks rather than doing it
           incorrectly themselves.
      
      Fixes: 63a4681f ("afs: Locally edit directory data for mkdir/create/unlink/...")
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-by: Marc Dionne <marc.dionne@auristor.com>
      366911cd
    • D
      afs: Work around strnlen() oops with CONFIG_FORTIFIED_SOURCE=y · 26982a89
      Committed by David Howells
      AFS has a structured layout in its directory contents (AFS dirs are
      downloaded as files and parsed locally by the client for lookup/readdir).
      The slots in the directory are defined by union afs_xdr_dirent.  This,
      however, only directly allows a name of a length that will fit into that
      union.  To support a longer name, the next 1-8 contiguous entries are
      annexed to the first one and the name flows across these.
      
      afs_dir_iterate_block() uses strnlen(), limited to the space to the end of
      the page, to find out how long the name is.  This worked fine until
      6a39e62a.  With that commit, the compiler determines the size of the
      array and asserts that the string fits inside that array.  This is a
      problem for AFS because we *expect* it to overflow one or more arrays.
      
      A similar problem also occurs in afs_dir_scan_block() when a directory file
      is being locally edited to avoid the need to redownload it.  There strlen()
      was being used safely because each page has the last byte set to 0 when the
      file is downloaded and validated (in afs_dir_check_page()).
      
      Fix this by changing the afs_xdr_dirent union name field to an
      indeterminate-length array and dropping the overflow field.
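
      Roughly, the union then ends up shaped like this (field layout
      illustrative, not a verbatim copy of the header):

      	union afs_xdr_dirent {
      		struct {
      			/* ... fixed header fields (validity, hash chain, vnode,
      			 * uniquifier) ... */
      			u8	name[];		/* flexible array; long names run on
      						 * into the following slots */
      		} u;
      		u8	extended_name[32];	/* one whole continuation slot */
      	};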
      
      (Note that whilst looking at this, I realised that the calculation of the
      number of slots a dirent used is non-standard and not quite right, but I'll
      address that in a separate patch.)
      
      The issue can be triggered by something like:
      
              touch /afs/example.com/thisisaveryveryverylongname
      
      and it generates a report that looks like:
      
              detected buffer overflow in strnlen
              ------------[ cut here ]------------
              kernel BUG at lib/string.c:1149!
              ...
              RIP: 0010:fortify_panic+0xf/0x11
              ...
              Call Trace:
               afs_dir_iterate_block+0x12b/0x35b
               afs_dir_iterate+0x14e/0x1ce
               afs_do_lookup+0x131/0x417
               afs_lookup+0x24f/0x344
               lookup_open.isra.0+0x1bb/0x27d
               open_last_lookups+0x166/0x237
               path_openat+0xe0/0x159
               do_filp_open+0x48/0xa4
               ? kmem_cache_alloc+0xf5/0x16e
               ? __clear_close_on_exec+0x13/0x22
               ? _raw_spin_unlock+0xa/0xb
               do_sys_openat2+0x72/0xde
               do_sys_open+0x3b/0x58
               do_syscall_64+0x2d/0x3a
               entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: 6a39e62a ("lib: string.h: detect intra-object overflow in fortified string functions")
      Reported-by: Marc Dionne <marc.dionne@auristor.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-by: Marc Dionne <marc.dionne@auristor.com>
      cc: Daniel Axtens <dja@axtens.net>
      26982a89
  7. 09 December 2020, 1 commit
    • D
      afs: Fix memory leak when mounting with multiple source parameters · 4cb68296
      Committed by David Howells
      There's a memory leak in afs_parse_source() whereby multiple source=
      parameters overwrite fc->source in the fs_context struct without freeing
      the previously recorded source.
      
      Fix this by only permitting a single source parameter and rejecting with
      an error all subsequent ones.
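
      A sketch of the check described, assuming the usual fs_context
      parameter-parsing shape (invalf() logs the message against the context
      and returns -EINVAL):

      	static int afs_parse_source(struct fs_context *fc, struct fs_parameter *param)
      	{
      		/* Reject a second source= rather than leaking the first one. */
      		if (fc->source)
      			return invalf(fc, "kAFS: Multiple sources not supported");
      		/* ... parse param->string and record it as fc->source as before ... */
      		return 0;
      	}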
      
      This was caught by syzbot with the kernel memory leak detector, showing
      something like the following trace:
      
        unreferenced object 0xffff888114375440 (size 32):
          comm "repro", pid 5168, jiffies 4294923723 (age 569.948s)
          backtrace:
            slab_post_alloc_hook+0x42/0x79
            __kmalloc_track_caller+0x125/0x16a
            kmemdup_nul+0x24/0x3c
            vfs_parse_fs_string+0x5a/0xa1
            generic_parse_monolithic+0x9d/0xc5
            do_new_mount+0x10d/0x15a
            do_mount+0x5f/0x8e
            __do_sys_mount+0xff/0x127
            do_syscall_64+0x2d/0x3a
            entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: 13fcc683 ("afs: Add fs_context support")
      Reported-by: syzbot+86dc6632faaca40133ab@syzkaller.appspotmail.com
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4cb68296
  8. 23 November 2020, 1 commit
    • D
      afs: Fix speculative status fetch going out of order wrt to modifications · a9e5c87c
      Committed by David Howells
      When doing a lookup in a directory, the afs filesystem uses a bulk
      status fetch to speculatively retrieve the statuses of up to 48 other
      vnodes found in the same directory and it will then either update extant
      inodes or create new ones - effectively doing 'lookup ahead'.
      
      To avoid the possibility of deadlocking itself, however, the filesystem
      doesn't lock all of those inodes; rather just the directory inode is
      locked (by the VFS).
      
      When the operation completes, afs_inode_init_from_status() or
      afs_apply_status() is called, depending on whether the inode already
      exists, to commit the new status.
      
      A case exists, however, where the speculative status fetch operation may
      straddle a modification operation on one of those vnodes.  What can then
      happen is that the speculative bulk status RPC retrieves the old status,
      and whilst that is happening, the modification happens - which returns
      an updated status, then the modification status is committed, then we
      attempt to commit the speculative status.
      
      This results in something like the following being seen in dmesg:
      
      	kAFS: vnode modified {100058:861} 8->9 YFS.InlineBulkStatus
      
      showing that for vnode 861 on volume 100058, we saw YFS.InlineBulkStatus
      say that the vnode had data version 8 when we'd already recorded version
      9 due to a local modification.  This was causing the cache to be
      invalidated for that vnode when it shouldn't have been.  If it happens
      on a data file, this might lead to local changes being lost.
      
      Fix this by ignoring speculative status updates if the data version
      doesn't match the expected value.
      
      Note that it is possible to get a DV regression if a volume gets
      restored from a backup - but we should get a callback break in such a
      case that should trigger a recheck anyway.  It might be worth checking
      the volume creation time in the volsync info and, if a change is
      observed in that (as would happen on a restore), invalidate all caches
      associated with the volume.
      
      Fixes: 5cf9dd55 ("afs: Prospectively look up extra files when doing a single lookup")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9e5c87c
  9. 15 November 2020, 1 commit
  10. 04 November 2020, 2 commits
    • D
      afs: Fix incorrect freeing of the ACL passed to the YFS ACL store op · f4c79144
      Committed by David Howells
      The cleanup for the yfs_store_opaque_acl2_operation calls the wrong
      function to destroy the ACL content buffer.  It's an afs_acl struct, not
      a yfs_acl struct - and the free function for latter may pass invalid
      pointers to kfree().
      
      Fix this by using the afs_acl_put() function.  The yfs_acl_put()
      function is then no longer used and can be removed.
      
      	general protection fault, probably for non-canonical address 0x7ebde00000000: 0000 [#1] SMP PTI
      	...
      	RIP: 0010:compound_head+0x0/0x11
      	...
      	Call Trace:
      	 virt_to_cache+0x8/0x51
      	 kfree+0x5d/0x79
      	 yfs_free_opaque_acl+0x16/0x29
      	 afs_put_operation+0x60/0x114
      	 __vfs_setxattr+0x67/0x72
      	 __vfs_setxattr_noperm+0x66/0xe9
      	 vfs_setxattr+0x67/0xce
      	 setxattr+0x14e/0x184
      	 __do_sys_fsetxattr+0x66/0x8f
      	 do_syscall_64+0x2d/0x3a
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: e49c7b2f ("afs: Build an abstraction around an "operation" concept")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f4c79144
    • D
      afs: Fix warning due to unadvanced marshalling pointer · c80afa1d
      Committed by David Howells
      When using the afs.yfs.acl xattr to change an AuriStor ACL, a warning
      can be generated when the request is marshalled because the buffer
      pointer isn't increased after adding the last element, thereby
      triggering the check at the end if the ACL wasn't empty.  This just
      causes something like the following warning, but doesn't stop the call
      from happening successfully:
      
          kAFS: YFS.StoreOpaqueACL2: Request buffer underflow (36<108)
      
      Fix this simply by increasing the count prior to the check.
      
      Fixes: f5e45463 ("afs: Implement YFS ACL setting")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c80afa1d
  11. 29 October 2020, 7 commits
    • D
      afs: Fix dirty-region encoding on ppc32 with 64K pages · 2d9900f2
      Committed by David Howells
      The dirty region bounds stored in page->private on an afs page are 15 bits
      on a 32-bit box and can, at most, represent a range of up to 32K within a
      32K page with a resolution of 1 byte.  This is a problem for powerpc32 with
      64K pages enabled.
      
      Further, transparent huge pages may get up to 2M, which will be a problem
      for the afs filesystem on all 32-bit arches in the future.
      
      Fix this by decreasing the resolution.  For the moment, a 64K page will
      have a resolution determined from PAGE_SIZE.  In the future, the page will
      need to be passed in to the helper functions so that the page size can be
      assessed and the resolution determined dynamically.
      
      Note that this might not be the ideal way to handle this, since it may
      allow some leakage of undirtied zero bytes to the server's copy in the case
      of a 3rd-party conflict.  Fixing that would require a separately allocated
      record and is a more complicated fix.
      
      Fixes: 4343d008 ("afs: Get rid of the afs_writeback record")
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      2d9900f2
    • D
      afs: Fix afs_invalidatepage to adjust the dirty region · f86726a6
      Committed by David Howells
      Fix afs_invalidatepage() to adjust the dirty region recorded in
      page->private when truncating a page.  If the dirty region is entirely
      removed, then the private data is cleared and the page dirty state is
      cleared.
      
      Without this, if the page is truncated and then expanded again by truncate,
      zeros from the expanded, but no-longer dirty region may get written back to
      the server if the page gets laundered due to a conflicting 3rd-party write.
      
      It mustn't, however, shorten the dirty region of the page if that page is
      still mmapped and has been marked dirty by afs_page_mkwrite(), so a flag is
      stored in page->private to record this.
      
      Fixes: 4343d008 ("afs: Get rid of the afs_writeback record")
      Signed-off-by: David Howells <dhowells@redhat.com>
      f86726a6
    • D
      afs: Alter dirty range encoding in page->private · 65dd2d60
      Committed by David Howells
      Currently, page->private on an afs page is used to store the range of
      dirtied data within the page, where the range includes the lower bound, but
      excludes the upper bound (e.g. 0-1 is a range covering a single byte).
      
      This, however, requires a superfluous bit for the last-byte bound so that
      on a 4KiB page, it can say 0-4096 to indicate the whole page, the idea
      being that having both numbers the same would indicate an empty range.
      This is unnecessary as the PG_private bit is clear if it's an empty range
      (as is PG_dirty).
      
      Alter the way the dirty range is encoded in page->private such that the
      upper bound is reduced by 1 (e.g. 0-0 then specifies the same single-byte
      range mentioned above).
      
      Applying this to both bounds frees up two bits, one of which can be used in
      a future commit.
      
      This allows the afs filesystem to be compiled on ppc32 with 64K pages;
      without this, the following warnings are seen:
      
      ../fs/afs/internal.h: In function 'afs_page_dirty_to':
      ../fs/afs/internal.h:881:15: warning: right shift count >= width of type [-Wshift-count-overflow]
        881 |  return (priv >> __AFS_PAGE_PRIV_SHIFT) & __AFS_PAGE_PRIV_MASK;
            |               ^~
      ../fs/afs/internal.h: In function 'afs_page_dirty':
      ../fs/afs/internal.h:886:28: warning: left shift count >= width of type [-Wshift-count-overflow]
        886 |  return ((unsigned long)to << __AFS_PAGE_PRIV_SHIFT) | from;
            |                            ^~
      
      Fixes: 4343d008 ("afs: Get rid of the afs_writeback record")
      Signed-off-by: David Howells <dhowells@redhat.com>
      65dd2d60
    • D
      afs: Wrap page->private manipulations in inline functions · 185f0c70
      Committed by David Howells
      The afs filesystem uses page->private to store the dirty range within a
      page such that in the event of a conflicting 3rd-party write to the server,
      we write back just the bits that got changed locally.
      
      However, there are a couple of problems with this:
      
       (1) I need a bit to note if the page might be mapped so that partial
           invalidation doesn't shrink the range.
      
       (2) There aren't necessarily sufficient bits to store the entire range of
           data altered (say it's a 32-bit system with 64KiB pages or transparent
           huge pages are in use).
      
      So wrap the accesses in inline functions so that future commits can change
      how this works.
      
      Also move them out of the tracing header into the in-directory header.
      There's not really any need for them to be in the tracing header.
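
      For reference, a sketch of the wrappers this introduces; afs_page_dirty()
      and afs_page_dirty_to() match the bodies quoted in the build warnings in
      the "Alter dirty range encoding" entry above, while afs_page_dirty_from()
      is the assumed counterpart:

      	static inline unsigned long afs_page_dirty_from(unsigned long priv)
      	{
      		return priv & __AFS_PAGE_PRIV_MASK;
      	}

      	static inline unsigned long afs_page_dirty_to(unsigned long priv)
      	{
      		return (priv >> __AFS_PAGE_PRIV_SHIFT) & __AFS_PAGE_PRIV_MASK;
      	}

      	static inline unsigned long afs_page_dirty(unsigned long from, unsigned long to)
      	{
      		return ((unsigned long)to << __AFS_PAGE_PRIV_SHIFT) | from;
      	}
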
      Signed-off-by: David Howells <dhowells@redhat.com>
      185f0c70
    • D
      afs: Fix where page->private is set during write · f792e3ac
      Committed by David Howells
      In afs, page->private is set to indicate the dirty region of a page.  This
      is done in afs_write_begin(), but that can't take account of whether the
      copy into the page actually worked.
      
      Fix this by moving the change of page->private into afs_write_end().
      
      Fixes: 4343d008 ("afs: Get rid of the afs_writeback record")
      Signed-off-by: David Howells <dhowells@redhat.com>
      f792e3ac
    • D
      afs: Fix page leak on afs_write_begin() failure · 21db2cdc
      Committed by David Howells
      Fix the leak of the target page in afs_write_begin() when it fails.
      
      Fixes: 15b4650e ("afs: convert to new aops")
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Nick Piggin <npiggin@gmail.com>
      21db2cdc
    • D
      afs: Fix to take ref on page when PG_private is set · fa04a40b
      Committed by David Howells
      Fix afs to take a ref on a page when it sets PG_private on it and to drop
      the ref when removing the flag.
      
      Note that in afs_write_begin(), a lot of the time, PG_private is already
      set on a page to which we're going to add some data.  In such a case, we
      leave the bit set and mustn't increment the page count.
      
      As suggested by Matthew Wilcox, use attach/detach_page_private() where
      possible.
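
      A sketch of the pattern this amounts to (surrounding context is
      illustrative); attach_page_private() sets PG_private and takes a page
      reference, detach_page_private() clears the flag and drops it:

      	/* When first recording a dirty region in page->private: */
      	if (PagePrivate(page))
      		set_page_private(page, priv);		/* ref already held from before */
      	else
      		attach_page_private(page, (void *)priv); /* sets PG_private + takes a ref */

      	/* When removing the flag on invalidation/release: */
      	priv = (unsigned long)detach_page_private(page); /* clears PG_private + drops the ref */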
      
      Fixes: 31143d5d ("AFS: implement basic file write support")
      Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      fa04a40b
  12. 28 October 2020, 4 commits
  13. 16 October 2020, 6 commits
    • D
      afs: Don't assert on unpurgeable server records · 7530d3eb
      Committed by David Howells
      Don't give an assertion failure on unpurgeable afs_server records - which
      kills the thread - but rather emit a trace line when we are purging a
      record (which only happens during network namespace removal or rmmod) and
      print a notice of the problem.
      Signed-off-by: David Howells <dhowells@redhat.com>
      7530d3eb
    • D
      afs: Add tracing for cell refcount and active user count · dca54a7b
      Committed by David Howells
      Add a tracepoint to log the cell refcount and active user count and pass in
      a reason code through various functions that manipulate these counters.
      
      Additionally, a helper function, afs_see_cell(), is provided to log
      interesting places that deal with a cell without actually doing any
      accounting directly.
      Signed-off-by: David Howells <dhowells@redhat.com>
      dca54a7b
    • D
      afs: Fix cell removal · 1d0e850a
      Committed by David Howells
      Fix cell removal by inserting a more final state than AFS_CELL_FAILED that
      indicates that the cell has been unpublished in case the manager is already
      requeued and will go through again.  The new AFS_CELL_REMOVED state will
      just immediately leave the manager function.
      
      Going through a second time in the AFS_CELL_FAILED state will cause it to
      try to remove the cell again, potentially leading to the proc list being
      removed.
      
      Fixes: 989782dc ("afs: Overhaul cell database management")
      Reported-by: syzbot+b994ecf2b023f14832c1@syzkaller.appspotmail.com
      Reported-by: syzbot+0e0db88e1eb44a91ae8d@syzkaller.appspotmail.com
      Reported-by: syzbot+2d0585e5efcd43d113c2@syzkaller.appspotmail.com
      Reported-by: syzbot+1ecc2f9d3387f1d79d42@syzkaller.appspotmail.com
      Reported-by: syzbot+18d51774588492bf3f69@syzkaller.appspotmail.com
      Reported-by: syzbot+a5e4946b04d6ca8fa5f3@syzkaller.appspotmail.com
      Suggested-by: Hillf Danton <hdanton@sina.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Hillf Danton <hdanton@sina.com>
      1d0e850a
    • D
      afs: Fix cell purging with aliases · 286377f6
      Committed by David Howells
      When the afs module is removed, one of the things that has to be done is to
      purge the cell database.  afs_cell_purge() cancels the management timer and
      then starts the cell manager work item to do the purging.  This does a
      single run through and then assumes that all cells are now purged - but
      this is no longer the case.
      
      With the introduction of alias detection, a later cell in the database can
      now be holding an active count on an earlier cell (cell->alias_of).  The
      purge scan passes by the earlier cell first, but this can't be got rid of
      until it has discarded the alias.  Ordinarily, afs_unuse_cell() would
      handle this by setting the management timer to trigger another pass - but
      afs_set_cell_timer() doesn't do anything if the namespace is being removed
      (net->live == false).  rmmod then hangs in the wait on cells_outstanding in
      afs_cell_purge().
      
      Fix this by making afs_set_cell_timer() directly queue the cell manager if
      net->live is false.  This causes additional management passes.
      
      Queueing the cell manager increments cells_outstanding to make sure the
      wait won't complete until all cells are destroyed.
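
      A sketch of the resulting behaviour; the shape of afs_set_cell_timer()
      and the afs_queue_cell_manager() helper name are assumed here, not
      quoted from the patch:

      	static void afs_set_cell_timer(struct afs_net *net, time64_t delay)
      	{
      		if (net->live) {
      			atomic_inc(&net->cells_outstanding);
      			if (timer_reduce(&net->cells_timer, jiffies + delay * HZ))
      				afs_dec_cells_outstanding(net);
      		} else {
      			/* Namespace is going away: queue the manager directly so the
      			 * purge keeps making passes; queueing bumps cells_outstanding. */
      			afs_queue_cell_manager(net);
      		}
      	}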
      
      Fixes: 8a070a96 ("afs: Detect cell aliases 1 - Cells with root volumes")
      Signed-off-by: David Howells <dhowells@redhat.com>
      286377f6
    • D
      afs: Fix cell refcounting by splitting the usage counter · 88c853c3
      Committed by David Howells
      Management of the lifetime of afs_cell struct has some problems due to the
      usage counter being used to determine whether objects of that type are in
      use in addition to whether anyone might be interested in the structure.
      
      This is made trickier by cell objects being cached for a period of time in
      case they're quickly reused as they hold the result of a setup process that
      may be slow (DNS lookups, AFS RPC ops).
      
      Problems include the cached root volume from alias resolution pinning its
      parent cell record, rmmod occasionally hanging and occasionally producing
      assertion failures.
      
      Fix this by splitting the count of active users from the struct reference
      count.  Things then work as follows:
      
       (1) The cell cache keeps +1 on the cell's activity count and this has to
           be dropped before the cell can be removed.  afs_manage_cell() tries to
           exchange the 1 to a 0 with the cells_lock write-locked, and if
           successful, the record is removed from the net->cells.
      
       (2) One struct ref is 'owned' by the activity count.  That is put when the
           active count is reduced to 0 (final_destruction label).
      
       (3) A ref can be held on a cell whilst it is queued for management on a
           work queue without confusing the active count.  afs_queue_cell() is
           added to wrap this.
      
       (4) The queue's ref is dropped at the end of the management.  This is
           split out into a separate function, afs_manage_cell_work().
      
       (5) The root volume record is put after a cell is removed (at the
           final_destruction label) rather than in the RCU destruction routine.
      
       (6) Volumes hold struct refs, but aren't active users.
      
       (7) Both counts are displayed in /proc/net/afs/cells.
      
      There are some management function changes:
      
       (*) afs_put_cell() now just decrements the refcount and triggers the RCU
           destruction if it becomes 0.  It no longer sets a timer to have the
           manager do this.
      
       (*) afs_use_cell() and afs_unuse_cell() are added to increase and decrease
           the active count.  afs_unuse_cell() sets the management timer.
      
       (*) afs_queue_cell() is added to queue a cell with appropriate refs
           (see the sketch below).
      
      There are also some other fixes:
      
       (*) Don't let /proc/net/afs/cells access a cell's vllist if it's NULL.
      
       (*) Make sure that candidate cells in lookups are properly destroyed
           rather than being simply kfree'd.  This ensures the bits it points to
           are destroyed also.
      
       (*) afs_dec_cells_outstanding() is now called in cell destruction rather
           than at "final_destruction".  This ensures that cell->net is still
           valid to the end of the destructor.
      
       (*) As a consequence of the previous two changes, move the increment of
           net->cells_outstanding that was at the point of insertion into the
           tree to the allocation routine to correctly balance things.
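
      A sketch of the queueing wrapper mentioned in (3) and in the list above
      (body illustrative; the real function may differ):

      	static void afs_queue_cell(struct afs_cell *cell)
      	{
      		afs_get_cell(cell);			/* ref owned by the queued work item */
      		if (!queue_work(afs_wq, &cell->manager))
      			afs_put_cell(cell);		/* already queued: drop the extra ref */
      	}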
      
      Fixes: 989782dc ("afs: Overhaul cell database management")
      Signed-off-by: David Howells <dhowells@redhat.com>
      88c853c3
    • D
      afs: Fix rapid cell addition/removal by not using RCU on cells tree · 92e3cc91
      Committed by David Howells
      There are a number of problems being seen when rapidly mounting and
      unmounting an afs dynamic root with an explicit cell and volume specified
      (which should probably be rejected, but that's a separate issue):
      
      What the tests are doing is to look up/create a cell record for the name
      given and then tear it down again without actually using it to try to talk
      to a server.  This is repeated endlessly, very fast, and the new cell
      collides with the old one if it's not quick enough to reuse it.
      
      It appears (as suggested by Hillf Danton) that the search through the RB
      tree under a read_seqbegin_or_lock() under RCU conditions isn't safe and
      that it's not blocking the write_seqlock(), despite taking two passes at
      it.  He suggested that the code should take a ref on the cell it's
      attempting to look at - but this shouldn't be necessary until we've
      compared the cell names.  It's possible that I'm missing a barrier
      somewhere.
      
      However, using an RCU search for this is overkill, really - we only need to
      access the cell name in a few places, and they're places where we may
      end up sleeping anyway.
      
      Fix this by switching to an R/W semaphore instead.
      
      Additionally, draw the down_read() call inside the function (renamed to
      afs_find_cell()) since all the callers were taking the RCU read lock (or
      should've been[*]).
      
      [*] afs_probe_cell_name() should have been, but that doesn't appear to be
      involved in the bug reports.
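
      A sketch of the reworked lookup with the lock taken inside the function,
      as described above (the locked-lookup helper name is illustrative):

      	struct afs_cell *afs_find_cell(struct afs_net *net,
      				       const char *name, unsigned int namesz)
      	{
      		struct afs_cell *cell;

      		down_read(&net->cells_lock);	/* now an rw_semaphore rather than a seqlock */
      		cell = afs_find_cell_locked(net, name, namesz);
      		up_read(&net->cells_lock);
      		return cell;
      	}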
      
      The symptoms of this look like:
      
      	general protection fault, probably for non-canonical address 0xf27d208691691fdb: 0000 [#1] PREEMPT SMP KASAN
      	KASAN: maybe wild-memory-access in range [0x93e924348b48fed8-0x93e924348b48fedf]
      	...
      	RIP: 0010:strncasecmp lib/string.c:52 [inline]
      	RIP: 0010:strncasecmp+0x5f/0x240 lib/string.c:43
      	 afs_lookup_cell_rcu+0x313/0x720 fs/afs/cell.c:88
      	 afs_lookup_cell+0x2ee/0x1440 fs/afs/cell.c:249
      	 afs_parse_source fs/afs/super.c:290 [inline]
      	...
      
      Fixes: 989782dc ("afs: Overhaul cell database management")
      Reported-by: syzbot+459a5dce0b4cb70fd076@syzkaller.appspotmail.com
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Hillf Danton <hdanton@sina.com>
      cc: syzkaller-bugs@googlegroups.com
      92e3cc91
  14. 09 October 2020, 1 commit
    • D
      afs: Fix deadlock between writeback and truncate · ec0fa0b6
      Committed by David Howells
      The afs filesystem has a lock[*] that it uses to serialise I/O operations
      going to the server (vnode->io_lock), as the server will only perform one
      modification operation at a time on any given file or directory.  This
      prevents the filesystem from filling up all the call slots to a server
      with calls that aren't going to be executed in parallel anyway, thereby
      allowing operations on other files to obtain slots.
      
        [*] Note that this is probably redundant for directories at least since
            i_rwsem is used to serialise directory modifications and
            lookup/reading vs modification.  The server does allow parallel
            non-modification ops, however.
      
      When a file truncation op completes, we truncate the in-memory copy of the
      file to match - but we do it whilst still holding the io_lock, the idea
      being to prevent races with other operations.
      
      However, if writeback starts in a worker thread simultaneously with
      truncation (whilst notify_change() is called with i_rwsem locked, writeback
      pays it no heed), it may manage to set PG_writeback bits on the pages that
      will get truncated before afs_setattr_success() manages to call
      truncate_pagecache().  Truncate will then wait for those pages - whilst
      still inside io_lock:
      
          # cat /proc/8837/stack
          [<0>] wait_on_page_bit_common+0x184/0x1e7
          [<0>] truncate_inode_pages_range+0x37f/0x3eb
          [<0>] truncate_pagecache+0x3c/0x53
          [<0>] afs_setattr_success+0x4d/0x6e
          [<0>] afs_wait_for_operation+0xd8/0x169
          [<0>] afs_do_sync_operation+0x16/0x1f
          [<0>] afs_setattr+0x1fb/0x25d
          [<0>] notify_change+0x2cf/0x3c4
          [<0>] do_truncate+0x7f/0xb2
          [<0>] do_sys_ftruncate+0xd1/0x104
          [<0>] do_syscall_64+0x2d/0x3a
          [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The writeback operation, however, stalls indefinitely because it needs to
      get the io_lock to proceed:
      
          # cat /proc/5940/stack
          [<0>] afs_get_io_locks+0x58/0x1ae
          [<0>] afs_begin_vnode_operation+0xc7/0xd1
          [<0>] afs_store_data+0x1b2/0x2a3
          [<0>] afs_write_back_from_locked_page+0x418/0x57c
          [<0>] afs_writepages_region+0x196/0x224
          [<0>] afs_writepages+0x74/0x156
          [<0>] do_writepages+0x2d/0x56
          [<0>] __writeback_single_inode+0x84/0x207
          [<0>] writeback_sb_inodes+0x238/0x3cf
          [<0>] __writeback_inodes_wb+0x68/0x9f
          [<0>] wb_writeback+0x145/0x26c
          [<0>] wb_do_writeback+0x16a/0x194
          [<0>] wb_workfn+0x74/0x177
          [<0>] process_one_work+0x174/0x264
          [<0>] worker_thread+0x117/0x1b9
          [<0>] kthread+0xec/0xf1
          [<0>] ret_from_fork+0x1f/0x30
      
      and thus deadlock has occurred.
      
      Note that whilst afs_setattr() calls filemap_write_and_wait(), the fact
      that the caller is holding i_rwsem doesn't preclude more pages being
      dirtied through an mmap'd region.
      
      Fix this by:
      
       (1) Use the vnode validate_lock to mediate access between afs_setattr()
           and afs_writepages():
      
           (a) Exclusively lock validate_lock in afs_setattr() around the whole
           	 RPC operation.
      
           (b) If WB_SYNC_ALL isn't set on entry to afs_writepages(), try to
           	 shared-lock validate_lock and return immediately if we can't
           	 get it (see the sketch after this list).
      
           (c) If WB_SYNC_ALL is set, wait for the lock.
      
           The validate_lock is also used to validate a file and to zap its cache
           if the file was altered by a third party, so it's probably a good fit
           for this.
      
       (2) Move the truncation outside of the io_lock in setattr, using the same
           hook as is used for local directory editing.
      
           This requires the old i_size to be retained in the operation record, as
           we still commit the revised status to the inode members inside the
           io_lock, but we need to know whether we reduced the file size.
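
      A sketch of the afs_writepages() side of (1) above (the surrounding
      writeback logic is elided and illustrative):

      	static int afs_writepages(struct address_space *mapping,
      				  struct writeback_control *wbc)
      	{
      		struct afs_vnode *vnode = AFS_FS_I(mapping->host);
      		int ret = 0;

      		/* Background writeback mustn't wait for setattr, which holds the
      		 * lock exclusively across its whole RPC; data-integrity sync must. */
      		if (wbc->sync_mode == WB_SYNC_ALL)
      			down_read(&vnode->validate_lock);
      		else if (!down_read_trylock(&vnode->validate_lock))
      			return 0;

      		/* ... write back the dirty regions as before ... */

      		up_read(&vnode->validate_lock);
      		return ret;
      	}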
      
      Fixes: d2ddc776 ("afs: Overhaul volume and server record caching and fileserver rotation")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ec0fa0b6