1. 18 March 2022 (2 commits)
  2. 07 January 2022 (2 commits)
  3. 11 November 2021 (1 commit)
  4. 02 November 2021 (1 commit)
  5. 13 September 2021 (4 commits)
    • afs: Fix updating of i_blocks on file/dir extension · 9d37e1ca
      Committed by David Howells
      When an afs file or directory is modified locally such that the total file
      size is extended, i_blocks needs to be recalculated too.
      
      Fix this by making afs_write_end() and afs_edit_dir_add() call
      afs_set_i_size() rather than setting inode->i_size directly as that also
      recalculates inode->i_blocks.
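
      As a sketch, the helper then looks something like this (field names as in
      the kernel of that era; i_blocks counts 512-byte units):

      	static void afs_set_i_size(struct afs_vnode *vnode, u64 size)
      	{
      		i_size_write(&vnode->vfs_inode, size);
      		/* i_blocks is in 512-byte units: round the size up to 1K,
      		 * then double it, giving the du numbers described below. */
      		vnode->vfs_inode.i_blocks = ((size + 1023) >> 10) << 1;
      	}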
      
      This can be tested by creating and writing into directories and files and
      then examining them with du.  Without this change, directories show 4
      blocks (they start out at 2048 bytes) and files show 0 blocks; with this
      change, they should show a number of blocks proportional to the file size
      rounded up to 1024.
      
      Fixes: 31143d5d ("AFS: implement basic file write support")
      Fixes: 63a4681f ("afs: Locally edit directory data for mkdir/create/unlink/...")
      Reported-by: Markus Suvanto <markus.suvanto@gmail.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
      Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
      cc: linux-afs@lists.infradead.org
      Link: https://lore.kernel.org/r/163113612442.352844.11162345591911691150.stgit@warthog.procyon.org.uk/
    • afs: Fix corruption in reads at fpos 2G-4G from an OpenAFS server · b537a3c2
      Committed by David Howells
      AFS-3 has two data fetch RPC variants, FS.FetchData and FS.FetchData64, and
      Linux's afs client switches between them when talking to a non-YFS server
      if the read size, the file position or the sum of the two has any of the
      upper 32 bits of its 64-bit value set.
      
      This is a problem, however, since the file position and length fields of
      FS.FetchData are *signed* 32-bit values.
      
      Fix this by capturing the capability bits obtained from the fileserver when
      it's sent an FS.GetCapabilities RPC, rather than just discarding them, and
      then picking out the VICED_CAPABILITY_64BITFILES flag.  This can then be
      used to decide whether to use FS.FetchData or FS.FetchData64 - and also
      FS.StoreData or FS.StoreData64 - rather than using upper_32_bits() to
      switch on the parameter values.
      
      This capabilities flag could also be used to limit the maximum size of the
      file, but all servers must be checked for that.
      
      Note that the issue does not exist with FS.StoreData - that uses *unsigned*
      32-bit values.  It's also not a problem with Auristor servers, as their
      YFS.FetchData64 op uses unsigned 64-bit values.
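
      In sketch form, the fetch call site becomes a capability test rather than
      a parameter test (the AFS_SERVER_FL_HAS_FS64 flag name follows the
      captured VICED_CAPABILITY_64BITFILES bit; the call sites here are
      illustrative):

      	if (test_bit(AFS_SERVER_FL_HAS_FS64, &op->server->flags))
      		afs_fs_fetch_data64(op);	/* 64-bit position/length */
      	else
      		afs_fs_fetch_data(op);		/* signed 32-bit position/length */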
      
      This can be tested by cloning a git repo through an OpenAFS client to an
      OpenAFS server and then doing "git status" on it from a Linux afs
      client[1].  Provided the clone has a pack file that's in the 2G-4G range,
      the git status will show errors like:
      
      	error: packfile .git/objects/pack/pack-5e813c51d12b6847bbc0fcd97c2bca66da50079c.pack does not match index
      	error: packfile .git/objects/pack/pack-5e813c51d12b6847bbc0fcd97c2bca66da50079c.pack does not match index
      
      This can be observed in the server's FileLog with something like the
      following appearing:
      
      Sun Aug 29 19:31:39 2021 SRXAFS_FetchData, Fid = 2303380852.491776.3263114, Host 192.168.11.201:7001, Id 1001
      Sun Aug 29 19:31:39 2021 CheckRights: len=0, for host=192.168.11.201:7001
      Sun Aug 29 19:31:39 2021 FetchData_RXStyle: Pos 18446744071815340032, Len 3154
      Sun Aug 29 19:31:39 2021 FetchData_RXStyle: file size 2400758866
      ...
      Sun Aug 29 19:31:40 2021 SRXAFS_FetchData returns 5
      
      Note the file position of 18446744071815340032.  This is the requested file
      position sign-extended.
      
      Fixes: b9b1f8d5 ("AFS: write support fixes")
      Reported-by: Markus Suvanto <markus.suvanto@gmail.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
      Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
      cc: linux-afs@lists.infradead.org
      cc: openafs-devel@openafs.org
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=214217#c9 [1]
      Link: https://lore.kernel.org/r/951332.1631308745@warthog.procyon.org.uk/
    • afs: Try to avoid taking RCU read lock when checking vnode validity · 4fe6a946
      Committed by David Howells
      Try to avoid taking the RCU read lock when checking the validity of a
      vnode's callback state.  The only thing it's needed for is to pin the
      parent volume's server list whilst we search it to find the record of the
      server we're currently using to see if it has been reinitialised (ie. it
      sent us a CB.InitCallBackState* RPC).
      
      Do this by the following means:
      
       (1) Keep an additional per-cell counter (fs_s_break) that's incremented
           each time any of the fileservers in the cell reinitialises.
      
           Since the new counter can be accessed without RCU from the vnode, we
           can check that first - and only if it differs, get the RCU read lock
           and check the volume's server list.
      
       (2) Replace afs_get_s_break_rcu() with afs_check_server_good() which now
           indicates whether the callback promise is still expected to be present
           on the server.  This does the checks as described in (1).
      
       (3) Restructure afs_check_validity() to take account of the change in (2).
      
           We can also get rid of the valid variable and just use the need_clear
           variable with the addition of the afs_cb_break_no_promise reason.
      
       (4) afs_check_validity() probably shouldn't be altering vnode->cb_v_break
           and vnode->cb_s_break when it doesn't have cb_lock exclusively locked.
      
           Move the change to vnode->cb_v_break to __afs_break_callback().
      
           Delegate the change to vnode->cb_s_break to afs_select_fileserver()
           and set vnode->cb_fs_s_break there also.
      
       (5) afs_validate() no longer needs to get the RCU read lock around its
           call to afs_check_validity() - and can skip the call entirely if we
           don't have a promise.
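
      A condensed sketch of the fast path described in (1) and (2), using field
      names from the text (afs_check_server_reinit() is a hypothetical stand-in
      for the slow-path search of the server list):

      	static bool afs_check_server_good(struct afs_vnode *vnode)
      	{
      		bool good;

      		/* Fast path: no fileserver in the cell has reinitialised
      		 * since we last looked, so the callback promise should
      		 * still be sitting on the server. */
      		if (vnode->cb_fs_s_break ==
      		    atomic_read(&vnode->volume->cell->fs_s_break))
      			return true;

      		/* Slow path: pin the volume's server list and check the
      		 * record of the server we're currently using. */
      		rcu_read_lock();
      		good = afs_check_server_reinit(vnode);
      		rcu_read_unlock();
      		return good;
      	}
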
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
      cc: linux-afs@lists.infradead.org
      Link: https://lore.kernel.org/r/163111669583.283156.1397603105683094563.stgit@warthog.procyon.org.uk/
    • afs: Fix mmap coherency vs 3rd-party changes · 6e0e99d5
      Committed by David Howells
      Fix the coherency management of mmap'd data such that 3rd-party changes
      become visible as soon as possible after the callback notification is
      delivered by the fileserver.  This is done by the following means:
      
       (1) When we break a callback on a vnode specified by the CB.CallBack call
           from the server, we queue a work item (vnode->cb_work) to go and
           clobber all the PTEs mapping to that inode.
      
           This causes the CPU to trip through the ->map_pages() and
           ->page_mkwrite() handlers if userspace attempts to access the page(s)
           again.
      
           (Ideally, this would be done in the service handler for CB.CallBack,
           but the server is waiting for our reply before it considers the
           break done, and we have a list of vnodes, all of which need breaking
           - and the process of getting the mmap_lock and stripping the PTEs on
           all CPUs could be quite slow.)
      
       (2) Call afs_validate() from the ->map_pages() handler to check to see if
           the file has changed and to get a new callback promise from the
           server.
      
      Also handle the fileserver telling us that it's dropping all callbacks
      (possibly after it's been restarted) by sending us a CB.InitCallBackState*
      call.  This is done by the following means:
      
       (3) Maintain a per-cell list of afs files that are currently mmap'd
           (cell->fs_open_mmaps).
      
       (4) Add a work item to each server that is invoked if there are any open
           mmaps when CB.InitCallBackState happens.  This work item goes through
           the aforementioned list and invokes the vnode->cb_work work item for
           each one that is currently using this server.
      
           This causes the PTEs to be cleared, causing ->map_pages() or
           ->page_mkwrite() to be called again, thereby calling afs_validate()
           again.
      
      I've chosen to simply strip the PTEs at the point of notification reception
      rather than invalidate all the pages as well because (a) it's faster, (b)
      we may get a notification for other reasons than the data being altered (in
      which case we don't want to clobber the pagecache) and (c) we need to ask
      the server to find out - and I don't want to wait for the reply before
      holding up userspace.
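
      A minimal sketch of the PTE-stripping step (the wrapper name here is
      hypothetical, but unmap_mapping_pages() is the core of it):

      	static void afs_drop_mmap_ptes(struct afs_vnode *vnode)
      	{
      		/* Zap every PTE mapping the inode but leave the pagecache
      		 * intact; the next userspace access faults back in through
      		 * ->map_pages()/->page_mkwrite(), which revalidate. */
      		unmap_mapping_pages(vnode->vfs_inode.i_mapping, 0, 0, false);
      	}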
      
      This was tested using the attached test program:
      
      	#include <stdbool.h>
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <fcntl.h>
      	#include <sys/mman.h>
      	int main(int argc, char *argv[])
      	{
      		size_t size = getpagesize();
      		unsigned char *p;
      		bool mod = (argc == 3);
      		int fd;
      		if (argc != 2 && argc != 3) {
      			fprintf(stderr, "Format: %s <file> [mod]\n", argv[0]);
      			exit(2);
      		}
      		fd = open(argv[1], mod ? O_RDWR : O_RDONLY);
      		if (fd < 0) {
      			perror(argv[1]);
      			exit(1);
      		}
      
      		p = mmap(NULL, size, mod ? PROT_READ|PROT_WRITE : PROT_READ,
      			 MAP_SHARED, fd, 0);
      		if (p == MAP_FAILED) {
      			perror("mmap");
      			exit(1);
      		}
      		for (;;) {
      			if (mod) {
      				p[0]++;
      				msync(p, size, MS_ASYNC);
      				fsync(fd);
      			}
      			printf("%02x", p[0]);
      			fflush(stdout);
      			sleep(1);
      		}
      	}
      
      It runs in two modes: in one mode, it mmaps a file, then sits in a loop
      reading the first byte, printing it and sleeping for a second; in the
      second mode it mmaps a file, then sits in a loop incrementing the first
      byte and flushing, then printing and sleeping.
      
      Two instances of this program can be run on different machines, one doing
      the reading and one doing the writing.  The reader should see the changes
      made by the writer, but without this patch, they aren't because validity
      checking is being done lazily - only on entry to the filesystem.
      
      Testing the InitCallBackState change is more complicated.  The server has
      to be taken offline, the saved callback state file removed and then the
      server restarted whilst the reading-mode program continues to run.  The
      client machine then has to poke the server to trigger the InitCallBackState
      call.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
      cc: linux-afs@lists.infradead.org
      Link: https://lore.kernel.org/r/163111668833.283156.382633263709075739.stgit@warthog.procyon.org.uk/
  6. 02 May 2021 (1 commit)
    • afs: Fix speculative status fetches · 22650f14
      Committed by David Howells
      The generic/464 xfstest causes kAFS to emit occasional warnings of the
      form:
      
              kAFS: vnode modified {100055:8a} 30->31 YFS.StoreData64 (c=6015)
      
      This indicates that the data version received back from the server did not
      match the expected value (the DV should be incremented monotonically for
      each individual modification op committed to a vnode).
      
      What is happening is that a lookup call is doing a bulk status fetch
      speculatively on a bunch of vnodes in a directory besides getting the
      status of the vnode it's actually interested in.  This is racing with a
      StoreData operation (though it could also occur with, say, a MakeDir op).
      
      On the client, a modification operation locks the vnode, but the bulk
      status fetch only locks the parent directory, so no ordering is imposed
      there (thereby avoiding an avenue to deadlock).
      
      On the server, the StoreData op handler doesn't lock the vnode until it's
      received all the request data, and downgrades the lock after committing the
      data until it has finished sending change notifications to other clients -
      which allows the status fetch to occur before it has finished.
      
      This means that:
      
       - a status fetch can access the target vnode either side of the exclusive
         section of the modification
      
       - the status fetch could start before the modification, yet finish after,
         and vice-versa.
      
       - the status fetch and the modification RPCs can complete in either order.
      
       - the status fetch can return either the before or the after DV from the
         modification.
      
       - the status fetch might regress the locally cached DV.
      
      Some of these are handled by the previous fix[1], but that's not sufficient
      because it checks the DV it received against the DV it cached at the start
      of the op, but the DV might've been updated in the meantime by a locally
      generated modification op.
      
      Fix this by the following means:
      
       (1) Keep track of when we're performing a modification operation on a
           vnode.  This is done by marking vnode parameters with a 'modification'
           note that causes the AFS_VNODE_MODIFYING flag to be set on the vnode
           for the duration.
      
       (2) Alter the speculation race detection to ignore speculative status
           fetches if either the vnode is marked as being modified or the data
           version number is not what we expected.
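
      In sketch form, the detection in (2) amounts to something like this
      (names follow the commit text; the surrounding code is elided):

      	if (vp->speculative &&
      	    (test_bit(AFS_VNODE_MODIFYING, &vnode->flags) ||
      	     vp->dv_before != vnode->status.data_version))
      		return;	/* Discard the speculative status */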
      
      Note that whilst the "vnode modified" warning does get recovered from, as
      it causes the client to refetch the status at the next opportunity, it
      also invalidates the pagecache, so changes might get lost.
      
      Fixes: a9e5c87c ("afs: Fix speculative status fetch going out of order wrt to modifications")
      Reported-by: Marc Dionne <marc.dionne@auristor.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-and-reviewed-by: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      Link: https://lore.kernel.org/r/160605082531.252452.14708077925602709042.stgit@warthog.procyon.org.uk/ [1]
      Link: https://lore.kernel.org/linux-fsdevel/161961335926.39335.2552653972195467566.stgit@warthog.procyon.org.uk/ # v1
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 23 April 2021 (9 commits)
  8. 16 March 2021 (1 commit)
  9. 24 January 2021 (1 commit)
    • fs: make helpers idmap mount aware · 549c7297
      Committed by Christian Brauner
      Extend some inode methods with an additional user namespace argument. A
      filesystem that is aware of idmapped mounts will receive the user
      namespace the mount has been marked with. This can be used for
      additional permission checking and also to enable filesystems to
      translate between uids and gids if they need to. We have implemented all
      relevant helpers in earlier patches.
      
      As requested we simply extend the existing inode method instead of
      introducing new ones. This is a little more code churn but it's mostly
      mechanical and doesn't leave us with additional inode methods.
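
      As an illustration of the mechanical churn, the ->create() inode method's
      signature after this change gains the mount's user namespace as its first
      argument:

      	int (*create)(struct user_namespace *mnt_userns, struct inode *dir,
      		      struct dentry *dentry, umode_t mode, bool excl);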
      
      Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
  10. 23 November 2020 (1 commit)
    • afs: Fix speculative status fetch going out of order wrt to modifications · a9e5c87c
      Committed by David Howells
      When doing a lookup in a directory, the afs filesystem uses a bulk
      status fetch to speculatively retrieve the statuses of up to 48 other
      vnodes found in the same directory and it will then either update extant
      inodes or create new ones - effectively doing 'lookup ahead'.
      
      To avoid the possibility of deadlocking itself, however, the filesystem
      doesn't lock all of those inodes; rather just the directory inode is
      locked (by the VFS).
      
      When the operation completes, afs_inode_init_from_status() or
      afs_apply_status() is called, depending on whether the inode already
      exists, to commit the new status.
      
      A case exists, however, where the speculative status fetch operation may
      straddle a modification operation on one of those vnodes.  What can then
      happen is that the speculative bulk status RPC retrieves the old status,
      and whilst that is happening, the modification happens - which returns
      an updated status, then the modification status is committed, then we
      attempt to commit the speculative status.
      
      This results in something like the following being seen in dmesg:
      
      	kAFS: vnode modified {100058:861} 8->9 YFS.InlineBulkStatus
      
      showing that for vnode 861 on volume 100058, we saw YFS.InlineBulkStatus
      say that the vnode had data version 8 when we'd already recorded version
      9 due to a local modification.  This was causing the cache to be
      invalidated for that vnode when it shouldn't have been.  If it happens
      on a data file, this might lead to local changes being lost.
      
      Fix this by ignoring speculative status updates if the data version
      doesn't match the expected value.
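
      As a sketch, the commit-time check amounts to something like the
      following (names follow the commit text); the later "Fix speculative
      status fetches" commit listed above extends it with a
      modification-in-progress test:

      	if (vp->speculative && vp->dv_before != vnode->status.data_version)
      		return;	/* Out-of-order speculative status; ignore it */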
      
      Note that it is possible to get a DV regression if a volume gets
      restored from a backup - but we should get a callback break in such a
      case that should trigger a recheck anyway.  It might be worth checking
      the volume creation time in the volsync info and, if a change is
      observed in that (as would happen on a restore), invalidate all caches
      associated with the volume.
      
      Fixes: 5cf9dd55 ("afs: Prospectively look up extra files when doing a single lookup")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 29 October 2020 (4 commits)
    • afs: Fix dirty-region encoding on ppc32 with 64K pages · 2d9900f2
      Committed by David Howells
      The dirty region bounds stored in page->private on an afs page are 15 bits
      on a 32-bit box and can, at most, represent a range of up to 32K within a
      32K page with a resolution of 1 byte.  This is a problem for powerpc32 with
      64K pages enabled.
      
      Further, transparent huge pages may get up to 2M, which will be a problem
      for the afs filesystem on all 32-bit arches in the future.
      
      Fix this by decreasing the resolution.  For the moment, a 64K page will
      have a resolution determined from PAGE_SIZE.  In the future, the page will
      need to be passed in to the helper functions so that the page size can be
      assessed and the resolution determined dynamically.
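
      A sketch of the reduced-resolution decoding on a 32-bit box, close to the
      real helpers but simplified: with 15 bits per bound, a 64K page stores
      its bounds in 2-byte granules:

      	/* Each bound gets 15 bits of a 32-bit page->private */
      	#define __AFS_PAGE_PRIV_SHIFT	16
      	#define __AFS_PAGE_PRIV_MASK	0x7fffUL

      	static inline unsigned int afs_page_dirty_resolution(void)
      	{
      		/* 0 for pages up to 32K; 1 (2-byte granules) for 64K pages */
      		int shift = PAGE_SHIFT - (__AFS_PAGE_PRIV_SHIFT - 1);
      		return (shift > 0) ? shift : 0;
      	}

      	static inline size_t afs_page_dirty_from(unsigned long priv)
      	{
      		unsigned long x = priv & __AFS_PAGE_PRIV_MASK;

      		/* The lower bound is inclusive; round it down */
      		return x << afs_page_dirty_resolution();
      	}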
      
      Note that this might not be the ideal way to handle this, since it may
      allow some leakage of undirtied zero bytes to the server's copy in the case
      of a 3rd-party conflict.  Fixing that would require a separately allocated
      record and is a more complicated fix.
      
      Fixes: 4343d008 ("afs: Get rid of the afs_writeback record")
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    • afs: Fix afs_invalidatepage to adjust the dirty region · f86726a6
      Committed by David Howells
      Fix afs_invalidatepage() to adjust the dirty region recorded in
      page->private when truncating a page.  If the dirty region is entirely
      removed, then the private data is cleared and the page dirty state is
      cleared.
      
      Without this, if the page is truncated and then expanded again by truncate,
      zeros from the expanded, but no-longer dirty region may get written back to
      the server if the page gets laundered due to a conflicting 3rd-party write.
      
      It mustn't, however, shorten the dirty region of the page if that page is
      still mmapped and has been marked dirty by afs_page_mkwrite(), so a flag is
      stored in page->private to record this.
      
      Fixes: 4343d008 ("afs: Get rid of the afs_writeback record")
      Signed-off-by: David Howells <dhowells@redhat.com>
    • afs: Alter dirty range encoding in page->private · 65dd2d60
      Committed by David Howells
      Currently, page->private on an afs page is used to store the range of
      dirtied data within the page, where the range includes the lower bound, but
      excludes the upper bound (e.g. 0-1 is a range covering a single byte).
      
      This, however, requires a superfluous bit for the last-byte bound so that
      on a 4KiB page, it can say 0-4096 to indicate the whole page, the idea
      being that having both numbers the same would indicate an empty range.
      This is unnecessary as the PG_private bit is clear if it's an empty range
      (as is PG_dirty).
      
      Alter the way the dirty range is encoded in page->private such that the
      upper bound is reduced by 1 (e.g. 0-0 then specifies the same single-byte
      range mentioned above).
      
      Applying this to both bounds frees up two bits, one of which can be used in
      a future commit.
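
      In sketch form, the revised encode/decode helpers look something like
      this (simplified; __AFS_PAGE_PRIV_SHIFT and __AFS_PAGE_PRIV_MASK are the
      per-arch field parameters):

      	static inline unsigned long afs_page_dirty(size_t from, size_t to)
      	{
      		/* Store the exclusive upper bound as an inclusive one */
      		return ((unsigned long)(to - 1) << __AFS_PAGE_PRIV_SHIFT) | from;
      	}

      	static inline size_t afs_page_dirty_to(unsigned long priv)
      	{
      		/* Make the stored inclusive upper bound exclusive again */
      		return ((priv >> __AFS_PAGE_PRIV_SHIFT) & __AFS_PAGE_PRIV_MASK) + 1;
      	}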
      
      This allows the afs filesystem to be compiled on ppc32 with 64K pages;
      without this, the following warnings are seen:
      
      ../fs/afs/internal.h: In function 'afs_page_dirty_to':
      ../fs/afs/internal.h:881:15: warning: right shift count >= width of type [-Wshift-count-overflow]
        881 |  return (priv >> __AFS_PAGE_PRIV_SHIFT) & __AFS_PAGE_PRIV_MASK;
            |               ^~
      ../fs/afs/internal.h: In function 'afs_page_dirty':
      ../fs/afs/internal.h:886:28: warning: left shift count >= width of type [-Wshift-count-overflow]
        886 |  return ((unsigned long)to << __AFS_PAGE_PRIV_SHIFT) | from;
            |                            ^~
      
      Fixes: 4343d008 ("afs: Get rid of the afs_writeback record")
      Signed-off-by: David Howells <dhowells@redhat.com>
    • afs: Wrap page->private manipulations in inline functions · 185f0c70
      Committed by David Howells
      The afs filesystem uses page->private to store the dirty range within a
      page such that in the event of a conflicting 3rd-party write to the server,
      we write back just the bits that got changed locally.
      
      However, there are a couple of problems with this:
      
       (1) I need a bit to note if the page might be mapped so that partial
           invalidation doesn't shrink the range.
      
       (2) There aren't necessarily sufficient bits to store the entire range of
           data altered (say it's a 32-bit system with 64KiB pages or transparent
           huge pages are in use).
      
      So wrap the accesses in inline functions so that future commits can change
      how this works.
      
      Also move them out of the tracing header into the in-directory header.
      There's not really any need for them to be in the tracing header.
      Signed-off-by: David Howells <dhowells@redhat.com>
  12. 28 October 2020 (1 commit)
  13. 16 October 2020 (4 commits)
    • afs: Add tracing for cell refcount and active user count · dca54a7b
      Committed by David Howells
      Add a tracepoint to log the cell refcount and active user count and pass in
      a reason code through various functions that manipulate these counters.
      
      Additionally, a helper function, afs_see_cell(), is provided to log
      interesting places that deal with a cell without actually doing any
      accounting directly.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • afs: Fix cell removal · 1d0e850a
      Committed by David Howells
      Fix cell removal by inserting a more final state than AFS_CELL_FAILED that
      indicates that the cell has been unpublished in case the manager is already
      requeued and will go through again.  The new AFS_CELL_REMOVED state will
      just immediately leave the manager function.
      
      Going through a second time in the AFS_CELL_FAILED state will cause it to
      try to remove the cell again, potentially leading to the proc list being
      removed.
      
      Fixes: 989782dc ("afs: Overhaul cell database management")
      Reported-by: syzbot+b994ecf2b023f14832c1@syzkaller.appspotmail.com
      Reported-by: syzbot+0e0db88e1eb44a91ae8d@syzkaller.appspotmail.com
      Reported-by: syzbot+2d0585e5efcd43d113c2@syzkaller.appspotmail.com
      Reported-by: syzbot+1ecc2f9d3387f1d79d42@syzkaller.appspotmail.com
      Reported-by: syzbot+18d51774588492bf3f69@syzkaller.appspotmail.com
      Reported-by: syzbot+a5e4946b04d6ca8fa5f3@syzkaller.appspotmail.com
      Suggested-by: Hillf Danton <hdanton@sina.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Hillf Danton <hdanton@sina.com>
    • afs: Fix cell refcounting by splitting the usage counter · 88c853c3
      Committed by David Howells
      Management of the lifetime of afs_cell struct has some problems due to the
      usage counter being used to determine whether objects of that type are in
      use in addition to whether anyone might be interested in the structure.
      
      This is made trickier by cell objects being cached for a period of time in
      case they're quickly reused as they hold the result of a setup process that
      may be slow (DNS lookups, AFS RPC ops).
      
      Problems include the cached root volume from alias resolution pinning its
      parent cell record, rmmod occasionally hanging and occasionally producing
      assertion failures.
      
      Fix this by splitting the count of active users from the struct reference
      count.  Things then work as follows:
      
       (1) The cell cache keeps +1 on the cell's activity count and this has to
           be dropped before the cell can be removed.  afs_manage_cell() tries to
           exchange the 1 to a 0 with the cells_lock write-locked, and if
           successful, the record is removed from the net->cells.
      
       (2) One struct ref is 'owned' by the activity count.  That is put when the
           active count is reduced to 0 (final_destruction label).
      
       (3) A ref can be held on a cell whilst it is queued for management on a
           work queue without confusing the active count.  afs_queue_cell() is
           added to wrap this.
      
       (4) The queue's ref is dropped at the end of the management.  This is
           split out into a separate function, afs_manage_cell_work().
      
       (5) The root volume record is put after a cell is removed (at the
           final_destruction label) rather than in the RCU destruction routine.
      
       (6) Volumes hold struct refs, but aren't active users.
      
       (7) Both counts are displayed in /proc/net/afs/cells.
      
      There are some management function changes:
      
       (*) afs_put_cell() now just decrements the refcount and triggers the RCU
           destruction if it becomes 0.  It no longer sets a timer to have the
           manager do this.
      
       (*) afs_use_cell() and afs_unuse_cell() are added to increase and decrease
           the active count.  afs_unuse_cell() sets the management timer.
      
       (*) afs_queue_cell() is added to queue a cell with appropriate refs.
      
      There are also some other fixes:
      
       (*) Don't let /proc/net/afs/cells access a cell's vllist if it's NULL.
      
       (*) Make sure that candidate cells in lookups are properly destroyed
           rather than being simply kfree'd.  This ensures the bits they point
           to are destroyed also.
      
       (*) afs_dec_cells_outstanding() is now called in cell destruction rather
           than at "final_destruction".  This ensures that cell->net is still
           valid to the end of the destructor.
      
       (*) As a consequence of the previous two changes, move the increment of
           net->cells_outstanding that was at the point of insertion into the
           tree to the allocation routine to correctly balance things.
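
      A rough sketch of the new active-count helpers (heavily simplified; the
      real functions also deal with tracing, locking and a computed expiry
      delay):

      	struct afs_cell *afs_use_cell(struct afs_cell *cell)
      	{
      		atomic_inc(&cell->active);
      		return cell;
      	}

      	void afs_unuse_cell(struct afs_net *net, struct afs_cell *cell)
      	{
      		if (!cell)
      			return;
      		/* Dropping to just the cache's own count hands the cell
      		 * to the manager, which may garbage-collect it later. */
      		if (atomic_dec_return(&cell->active) == 1)
      			afs_set_cell_timer(net, afs_cell_gc_delay);
      	}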
      
      Fixes: 989782dc ("afs: Overhaul cell database management")
      Signed-off-by: David Howells <dhowells@redhat.com>
    • afs: Fix rapid cell addition/removal by not using RCU on cells tree · 92e3cc91
      Committed by David Howells
      There are a number of problems being seen when rapidly mounting and
      unmounting an afs dynamic root with an explicit cell and volume specified
      (which should probably be rejected, but that's a separate issue):
      
      What the tests are doing is to look up/create a cell record for the name
      given and then tear it down again without actually using it to try to talk
      to a server.  This is repeated endlessly, very fast, and the new cell
      collides with the old one if it's not quick enough to reuse it.
      
      It appears (as suggested by Hillf Danton) that the search through the RB
      tree under a read_seqbegin_or_lock() under RCU conditions isn't safe and
      that it's not blocking the write_seqlock(), despite taking two passes at
      it.  He suggested that the code should take a ref on the cell it's
      attempting to look at - but this shouldn't be necessary until we've
      compared the cell names.  It's possible that I'm missing a barrier
      somewhere.
      
      However, using an RCU search for this is overkill, really - we only need to
      access the cell name in a few places, and they're places where we may end
      up sleeping anyway.
      
      Fix this by switching to an R/W semaphore instead.
      
      Additionally, move the down_read() call inside the function (renamed to
      afs_find_cell()) since all the callers were taking the RCU read lock (or
      should've been[*]).
      
      [*] afs_probe_cell_name() should have been, but that doesn't appear to be
      involved in the bug reports.
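
      In sketch form (minus tracepoints), with afs_find_cell_locked() being the
      actual tree walk:

      	struct afs_cell *afs_find_cell(struct afs_net *net,
      				       const char *name, unsigned int namesz)
      	{
      		struct afs_cell *cell;

      		down_read(&net->cells_lock);	/* now an rw_semaphore */
      		cell = afs_find_cell_locked(net, name, namesz);
      		up_read(&net->cells_lock);
      		return cell;
      	}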
      
      The symptoms of this look like:
      
      	general protection fault, probably for non-canonical address 0xf27d208691691fdb: 0000 [#1] PREEMPT SMP KASAN
      	KASAN: maybe wild-memory-access in range [0x93e924348b48fed8-0x93e924348b48fedf]
      	...
      	RIP: 0010:strncasecmp lib/string.c:52 [inline]
      	RIP: 0010:strncasecmp+0x5f/0x240 lib/string.c:43
      	 afs_lookup_cell_rcu+0x313/0x720 fs/afs/cell.c:88
      	 afs_lookup_cell+0x2ee/0x1440 fs/afs/cell.c:249
      	 afs_parse_source fs/afs/super.c:290 [inline]
      	...
      
      Fixes: 989782dc ("afs: Overhaul cell database management")
      Reported-by: syzbot+459a5dce0b4cb70fd076@syzkaller.appspotmail.com
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Hillf Danton <hdanton@sina.com>
      cc: syzkaller-bugs@googlegroups.com
  14. 09 October 2020 (1 commit)
    • afs: Fix deadlock between writeback and truncate · ec0fa0b6
      Committed by David Howells
      The afs filesystem has a lock[*] that it uses to serialise I/O operations
      going to the server (vnode->io_lock), as the server will only perform one
      modification operation at a time on any given file or directory.  This
      prevents the filesystem from filling up all the call slots to a server
      with calls that aren't going to be executed in parallel anyway, thereby
      allowing operations on other files to obtain slots.
      
        [*] Note that this is probably redundant for directories at least, since
            i_rwsem is used to serialise directory modifications and
            lookup/reading vs modification.  The server does allow parallel
            non-modification ops, however.
      
      When a file truncation op completes, we truncate the in-memory copy of the
      file to match - but we do it whilst still holding the io_lock, the idea
      being to prevent races with other operations.
      
      However, if writeback starts in a worker thread simultaneously with
      truncation (whilst notify_change() is called with i_rwsem locked, writeback
      pays it no heed), it may manage to set PG_writeback bits on the pages that
      will get truncated before afs_setattr_success() manages to call
      truncate_pagecache().  Truncate will then wait for those pages - whilst
      still inside io_lock:
      
          # cat /proc/8837/stack
          [<0>] wait_on_page_bit_common+0x184/0x1e7
          [<0>] truncate_inode_pages_range+0x37f/0x3eb
          [<0>] truncate_pagecache+0x3c/0x53
          [<0>] afs_setattr_success+0x4d/0x6e
          [<0>] afs_wait_for_operation+0xd8/0x169
          [<0>] afs_do_sync_operation+0x16/0x1f
          [<0>] afs_setattr+0x1fb/0x25d
          [<0>] notify_change+0x2cf/0x3c4
          [<0>] do_truncate+0x7f/0xb2
          [<0>] do_sys_ftruncate+0xd1/0x104
          [<0>] do_syscall_64+0x2d/0x3a
          [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The writeback operation, however, stalls indefinitely because it needs to
      get the io_lock to proceed:
      
          # cat /proc/5940/stack
          [<0>] afs_get_io_locks+0x58/0x1ae
          [<0>] afs_begin_vnode_operation+0xc7/0xd1
          [<0>] afs_store_data+0x1b2/0x2a3
          [<0>] afs_write_back_from_locked_page+0x418/0x57c
          [<0>] afs_writepages_region+0x196/0x224
          [<0>] afs_writepages+0x74/0x156
          [<0>] do_writepages+0x2d/0x56
          [<0>] __writeback_single_inode+0x84/0x207
          [<0>] writeback_sb_inodes+0x238/0x3cf
          [<0>] __writeback_inodes_wb+0x68/0x9f
          [<0>] wb_writeback+0x145/0x26c
          [<0>] wb_do_writeback+0x16a/0x194
          [<0>] wb_workfn+0x74/0x177
          [<0>] process_one_work+0x174/0x264
          [<0>] worker_thread+0x117/0x1b9
          [<0>] kthread+0xec/0xf1
          [<0>] ret_from_fork+0x1f/0x30
      
      and thus deadlock has occurred.
      
      Note that whilst afs_setattr() calls filemap_write_and_wait(), the fact
      that the caller is holding i_rwsem doesn't preclude more pages being
      dirtied through an mmap'd region.
      
      Fix this by:
      
       (1) Use the vnode validate_lock to mediate access between afs_setattr()
           and afs_writepages():
      
           (a) Exclusively lock validate_lock in afs_setattr() around the whole
           	 RPC operation.
      
       (b) If WB_SYNC_ALL isn't set on entry to afs_writepages(), try to
       	 shared-lock validate_lock and return immediately if we can't
       	 get it.
      
           (c) If WB_SYNC_ALL is set, wait for the lock.
      
           The validate_lock is also used to validate a file and to zap its cache
           if the file was altered by a third party, so it's probably a good fit
           for this.
      
       (2) Move the truncation outside of the io_lock in setattr, using the same
           hook as is used for local directory editing.
      
           This requires the old i_size to be retained in the operation record as
           we commit the revised status to the inode members inside the io_lock
           still, but we still need to know if we reduced the file size.
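
      Sketched out, the afs_writepages() entry then looks something like this
      (the region walk is elided behind a stand-in call):

      	static int afs_writepages(struct address_space *mapping,
      				  struct writeback_control *wbc)
      	{
      		struct afs_vnode *vnode = AFS_FS_I(mapping->host);
      		int ret;

      		/* Data-integrity writeback waits for the lock; opportunistic
      		 * writeback backs off rather than deadlocking against a
      		 * truncation in progress. */
      		if (wbc->sync_mode == WB_SYNC_ALL)
      			down_read(&vnode->validate_lock);
      		else if (!down_read_trylock(&vnode->validate_lock))
      			return 0;

      		ret = afs_writepages_region(mapping, wbc); /* stand-in */
      		up_read(&vnode->validate_lock);
      		return ret;
      	}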
      
      Fixes: d2ddc776 ("afs: Overhaul volume and server record caching and fileserver rotation")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 21 August 2020 (3 commits)
  16. 28 June 2020 (1 commit)
  17. 21 June 2020 (1 commit)
    • afs: Fix hang on rmmod due to outstanding timer · 5481fc6e
      Committed by David Howells
      The fileserver probe timer, net->fs_probe_timer, isn't cancelled when
      the kafs module is being removed and so the count it holds on
      net->servers_outstanding doesn't get dropped.
      
      This causes rmmod to wait forever.  The hung process shows a stack like:
      
      	afs_purge_servers+0x1b5/0x23c [kafs]
      	afs_net_exit+0x44/0x6e [kafs]
      	ops_exit_list+0x72/0x93
      	unregister_pernet_operations+0x14c/0x1ba
      	unregister_pernet_subsys+0x1d/0x2a
      	afs_exit+0x29/0x6f [kafs]
      	__do_sys_delete_module.isra.0+0x1a2/0x24b
      	do_syscall_64+0x51/0x95
      	entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fix this by:
      
       (1) Attempting to cancel the probe timer and, if successful, drop the
           count that the timer was holding.
      
       (2) Make the timer function just drop the count and not schedule the
           prober if the afs portion of net namespace is being destroyed.
      
      Also, whilst we're at it, make the following changes:
      
       (3) Initialise net->servers_outstanding to 1 and decrement it before
           waiting on it so that it doesn't generate wake up events by being
           decremented to 0 until we're cleaning up.
      
       (4) Switch the atomic_dec() on ->servers_outstanding for ->fs_timer in
           afs_purge_servers() to use the helper function for that.
      
      Fixes: f6cbb368 ("afs: Actively poll fileservers to maintain NAT or firewall openings")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. 17 June 2020 (1 commit)
    • afs: Fix silly rename · b6489a49
      Committed by David Howells
      Fix AFS's silly rename by the following means:
      
       (1) Set the destination directory in afs_do_silly_rename() so as to avoid
           misbehaviour and indicate that the directory data version will
           increment by 1 so as to avoid warnings about unexpected changes in the
           DV.  Also indicate that the ctime should be updated to avoid xfstest
           grumbling.
      
       (2) Note when the server indicates that a directory changed more than we
           expected (AFS_OPERATION_DIR_CONFLICT), indicating a conflict with a
           third party change, checking on successful completion of unlink and
           rename.
      
           The problem is that the FS.RemoveFile RPC op doesn't report the status
           of the unlinked file, though YFS.RemoveFile2 does.  This can be
           mitigated by the assumption that if the directory DV cranked by
           exactly 1, we can be sure we removed one link from the file; further,
           ordinarily in AFS, files cannot be hardlinked across directories, so
           if we reduce nlink to 0, the file is deleted.
      
           However, if the directory DV jumps by more than 1, we cannot know if a
           third party intervened by adding or removing a link on the file we
           just removed a link from.
      
           The same also goes for any vnode that is at the destination of the
           FS.Rename RPC op.
      
       (3) Make afs_vnode_commit_status() apply the nlink drop inside the cb_lock
           section along with the other attribute updates if ->op_unlinked is set
           on the descriptor for the appropriate vnode.
      
       (4) Issue a follow up status fetch to the unlinked file in the event of a
           third party conflict that makes it impossible for us to know if we
           actually deleted the file or not.
      
       (5) Provide a flag, AFS_VNODE_SILLY_DELETED, to make afs_getattr() lie to
           the user about the nlink of a silly deleted file so that it appears as
           0, not 1.
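
      A sketch of (5) as it lands in afs_getattr(), per the description above:

      	/* Hide the remaining link of a silly-deleted file */
      	if (test_bit(AFS_VNODE_SILLY_DELETED, &vnode->flags) &&
      	    stat->nlink > 0)
      		stat->nlink -= 1;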
      
      Found with the generic/035 and generic/084 xfstests.
      
      Fixes: e49c7b2f ("afs: Build an abstraction around an "operation" concept")
      Reported-by: Marc Dionne <marc.dionne@auristor.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
  19. 16 June 2020 (1 commit)
    • afs: Fix use of afs_check_for_remote_deletion() · 728279a5
      Committed by David Howells
      afs_check_for_remote_deletion() checks to see if error ENOENT is returned
      by the server in response to an operation and, if so, marks the primary
      vnode as having been deleted as the FID is no longer valid.
      
      However, it's being called from the operation success functions, where no
      abort has happened - and if an inline abort is recorded, it's handled by
      afs_vnode_commit_status().
      
      Fix this by actually calling the operation aborted method if provided and
      having that point to afs_check_for_remote_deletion().
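
      The aborted handler then looks roughly like this (abort-code and helper
      names as found in the afs sources of that period):

      	static void afs_check_for_remote_deletion(struct afs_operation *op)
      	{
      		struct afs_vnode *vnode = op->file[0].vnode;

      		switch (op->ac.abort_code) {
      		case VNOVNODE:
      			set_bit(AFS_VNODE_DELETED, &vnode->flags);
      			afs_break_callback(vnode, afs_cb_break_for_deleted);
      		}
      	}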
      
      Fixes: e49c7b2f ("afs: Build an abstraction around an "operation" concept")
      Signed-off-by: David Howells <dhowells@redhat.com>