1. 10 Apr, 2018 13 commits
    • D
      afs: Do better accretion of small writes on newly created content · 5a813276
      David Howells committed
      Processes like ld that do lots of small writes that aren't necessarily
      contiguous result in a lot of small StoreData operations to the server, the
      idea being that if someone else changes the data on the server, we only
      write our changes over that and not the space between.  Further, we don't
      want to write back empty space if we can avoid it to make it easier for the
      server to do sparse files.
      
      However, making lots of tiny RPC ops is a lot less efficient for the server
      than one big one because each op requires allocation of resources and the
      taking of locks, so we want to compromise a bit.
      
      Reduce the load by the following:
      
       (1) If a file is just created locally or has just been truncated with
           O_TRUNC locally, allow subsequent writes to the file to be merged with
           intervening space if that space doesn't cross an entire intervening
           page.
      
       (2) Don't flush the file on ->flush() but rather on ->release() if the
           file was open for writing.
      
      Just linking vmlinux.o, without this patch, looking in /proc/fs/afs/stats:
      
      	file-wr : n=441 nb=513581204
      
      and after the patch:
      
      	file-wr : n=62 nb=513668555
      
      there were 379 fewer StoreData RPC operations at the expense of an extra
      87K being written.
      Signed-off-by: David Howells <dhowells@redhat.com>
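The merging rule in (1) above can be modelled in userspace as follows. This is a minimal Python sketch, not the kernel code: the 4096-byte page size, the function names, and the reading of "doesn't cross an entire intervening page" as "the gap contains no whole page" are all assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page size; the real value is architecture-dependent


def gap_spans_full_page(gap_start, gap_end, page_size=PAGE_SIZE):
    """Return True if the byte range [gap_start, gap_end) contains a whole page."""
    # Round gap_start up to the next page boundary.
    first_boundary = -(-gap_start // page_size) * page_size
    return first_boundary + page_size <= gap_end


def merge_writes(writes, page_size=PAGE_SIZE):
    """Coalesce (offset, length) writes whose gaps contain no whole page.

    Hypothetical helper: returns the list of (offset, length) regions that
    would each be sent as one store operation under the assumed rule.
    """
    merged = []
    for start, length in sorted(writes):
        end = start + length
        if merged and not gap_spans_full_page(merged[-1][1], start, page_size):
            # Small gap (or overlap): extend the previous region over it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e - s) for s, e in merged]
```

Under these assumptions, two 100-byte writes at offsets 0 and 200 collapse into one 300-byte region, while a write three pages further on stays separate. The trade-off matches the figures quoted above: fewer operations in exchange for writing some bytes that were never dirtied.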
    • D
      afs: Add stats for data transfer operations · 76a5cb6f
      David Howells committed
      Add statistics to /proc/fs/afs/stats for data transfer RPC operations.  New
      lines are added that look like:
      
      	file-rd : n=55794 nb=10252282150
      	file-wr : n=9789 nb=3247763645
      
      where n= indicates the number of ops completed and nb= indicates the number
      of bytes successfully transferred.  file-rd is the counts for read/fetch
      operations and file-wr the counts for write/store operations.
      
      Note that directory and symlink downloading are included in the file-rd
      stats at the moment.
      Signed-off-by: David Howells <dhowells@redhat.com>
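The two new stats lines are easy to pick apart mechanically. The sketch below is a hypothetical userspace parser written against the line format quoted in the commit message only; the function names are invented here, and other lines in /proc/fs/afs/stats are simply not matched.

```python
import re

# Matches lines such as "file-rd : n=55794 nb=10252282150".
STAT_LINE = re.compile(r"^\s*(\S+)\s*:\s*n=(\d+)\s+nb=(\d+)\s*$")


def parse_transfer_stat(line):
    """Return (name, ops_completed, bytes_transferred), or None if no match."""
    m = STAT_LINE.match(line)
    if m is None:
        return None
    return m.group(1), int(m.group(2)), int(m.group(3))


def mean_bytes_per_op(n_ops, n_bytes):
    """Average payload per RPC op; 0.0 when no ops have completed."""
    return n_bytes / n_ops if n_ops else 0.0
```

For example, feeding it the file-wr line above yields the name, 9789 ops and 3247763645 bytes, from which an average transfer size per store operation can be derived.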
    • D
      afs: Trace protocol errors · 5f702c8e
      David Howells committed
      Trace protocol errors detected in afs.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • D
      afs: Locally edit directory data for mkdir/create/unlink/... · 63a4681f
      David Howells committed
      Locally edit the contents of an AFS directory upon a successful inode
      operation that modifies that directory (such as mkdir, create and unlink)
      so that we can avoid the current practice of re-downloading the directory
      after each change.
      
      This is viable provided that the directory version number we get back from
      the modifying RPC op is exactly incremented by 1 from what we had
      previously.  The data in the directory contents is in a defined format that
      we have to parse locally to perform lookups and readdir, so modifying isn't
      a problem.
      
      If the edit fails, we just clear the VALID flag on the directory and it
      will be reloaded next time it is needed.
      Signed-off-by: David Howells <dhowells@redhat.com>
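The decision described in this commit (edit locally only when the version moved by exactly one, otherwise fall back to a reload) can be sketched as below. This is an illustrative Python model, not the kernel implementation: the class, its fields and the dict standing in for the directory blob are all hypothetical.

```python
class CachedDirectory:
    """Toy model of a locally cached directory (name -> fid mapping)."""

    def __init__(self, data_version, entries):
        self.data_version = data_version
        self.entries = dict(entries)
        self.valid = True  # stands in for the VALID flag mentioned above

    def note_modification(self, new_version, edit):
        """Apply `edit` to the cached entries if the change was provably ours.

        `new_version` is the directory version returned by the modifying op.
        Returns True if the cached copy is still usable afterwards.
        """
        if self.valid and new_version == self.data_version + 1:
            try:
                edit(self.entries)
            except Exception:
                # The edit failed: drop validity so the next user reloads.
                self.valid = False
        else:
            # Someone else also changed the directory; the copy is stale.
            self.valid = False
        self.data_version = new_version
        return self.valid
```

A version jump of more than one means another client's change is interleaved with ours, so the sketch gives up on the cached copy rather than guessing at the result.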
    • D
      afs: Fix directory handling · f3ddee8d
      David Howells committed
      AFS directories are structured blobs that are downloaded just like files
      and then parsed by the lookup and readdir code and, as such, are currently
      handled in the pagecache like any other file, with the entire directory
      content being thrown away each time the directory changes.
      
      However, since the blob is a known structure and since the data version
      counter on a directory increases by exactly one for each change committed
      to that directory, we can actually edit the directory locally rather than
      fetching it from the server after each locally-induced change.
      
      What we can't do, though, is mix data from the server and data from the
      client since the server is technically at liberty to rearrange or compress
      a directory if it sees fit, provided it updates the data version number
      when it does so and breaks the callback (ie. sends a notification).
      
      Further, lookup with lookup-ahead, readdir and, when it arrives, local
      editing are likely to want to scan the whole of a directory.
      
      So directory handling needs to be improved to maintain the coherency of the
      directory blob prior to permitting local directory editing.
      
      To this end:
      
       (1) If any directory page gets discarded, invalidate and reread the entire
           directory.
      
       (2) If readpage notes, when it fetches a single page, that the version
           number has changed, the entire directory is flagged for
           invalidation.
      
       (3) Read as much of the directory in one go as we can.
      
      Note that this removes local caching of directories in fscache for the
      moment as we can't pass the pages to fscache_read_or_alloc_pages() since
      page->lru is in use by the LRU.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • D
      afs: Split the dynroot stuff out and give it its own ops tables · 66c7e1d3
      David Howells committed
      Split the AFS dynamic root stuff out of the main directory handling file
      and into its own file as they share little in common.
      
      The dynamic root code also gets its own dentry and inode ops tables.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • D
      afs: Keep track of invalid-before version for dentry coherency · a4ff7401
      David Howells committed
      Each afs dentry is tagged with the version that the parent directory was at
      last time it was validated and, currently, if this differs, the directory
      is scanned and the dentry is refreshed.
      
      However, this leads to an excessive amount of revalidation on directories
      that get modified on the client without conflict with another client.  We
      know there's no conflict because the parent directory's data version number
      got incremented by exactly 1 on any create, mkdir, unlink, etc., therefore
      we can trust the current state of the unaffected dentries when we perform a
      local directory modification.
      
      Optimise by keeping track of the last version of the parent directory that
      was changed outside of the client in the parent directory's vnode and using
      that to validate the dentries rather than the current version.
      Signed-off-by: David Howells <dhowells@redhat.com>
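One way to picture the invalid-before check is the small model below. It is a sketch under assumptions, not the kernel logic: the class and field names are hypothetical, and the rule that a dentry is trusted when its recorded version is at least the directory's invalid-before version is this sketch's reading of the description above.

```python
class DirVersions:
    """Tracks a directory's current version and its invalid-before version."""

    def __init__(self, data_version):
        self.data_version = data_version
        # Dentries validated before this version can no longer be trusted.
        self.invalid_before = data_version

    def local_change(self, new_version):
        """Record the version returned by a locally issued modification."""
        if new_version != self.data_version + 1:
            # A third party changed the directory as well.
            self.invalid_before = new_version
        self.data_version = new_version


def dentry_still_valid(dentry_version, directory):
    """Compare against invalid_before rather than the current version."""
    return dentry_version >= directory.invalid_before
```

With this model, a run of purely local creates and unlinks bumps the current version without touching invalid_before, so unaffected dentries are not rescanned.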
    • D
      afs: Rearrange status mapping · dd9fbcb8
      David Howells committed
      Rearrange the AFSFetchStatus to inode attribute mapping code in a number of
      ways:
      
       (1) Use an XDR structure rather than a series of incremented pointer
           accesses when decoding an AFSFetchStatus object.  This allows
           out-of-order decode.
      
       (2) Don't store the if_version value but rather just check it and abort if
           it's not something we can handle.
      
       (3) Store the owner and group in the status record as raw values rather
           than converting them to kuid/kgid.  Do that when they're mapped into
           i_uid/i_gid.
      
       (4) Validate the type and abort code up front and abort if they're wrong.
      
       (5) Split the inode attribute setting out into its own function from the
           XDR decode of an AFSFetchStatus object.  This allows it to be called
           from elsewhere too.
      
       (6) Differentiate changes to data from changes to metadata.
      
       (7) Use the split-out attribute mapping function from afs_iget().
      Signed-off-by: David Howells <dhowells@redhat.com>
    • D
      afs: Make it possible to get the data version in readpage · 0c3a5ac2
      David Howells committed
      Store the data version number indicated by an FS.FetchData op into the read
      request structure so that it's accessible by the page reader.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • D
      afs: Introduce a statistics proc file · d55b4da4
      David Howells committed
      Introduce a proc file that displays a bunch of statistics for the AFS
      filesystem in the current network namespace.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • D
      afs: Implement @sys substitution handling · 6f8880d8
      David Howells committed
      Implement the AFS feature by which @sys at the end of a pathname component
      may be replaced by one of a list of values, typically naming the
      operating system.  Up to 16 alternatives may be specified and these are
      tried in turn until one works.  Each network namespace has[*] a separate
      independent list.
      
      Upon creation of a new network namespace, the list of values is
      initialised[*] to a single OpenAFS-compatible string representing arch type
      plus "_linux26".  For example, on x86_64, the sysname is "amd64_linux26".
      
      [*] Or will, once network namespace support is finalised in kAFS.
      
      The list may be set by:
      
      	# for i in foo bar linux-x86_64; do echo $i; done >/proc/fs/afs/sysname
      
      for which separate writes to the same fd are amalgamated and applied on
      close.  The LF character may be used as a separator to specify multiple
      items in the same write() call.
      
      The list may be cleared by:
      
      	# echo >/proc/fs/afs/sysname
      
      and read by:
      
      	# cat /proc/fs/afs/sysname
      	foo
      	bar
      	linux-x86_64
      Signed-off-by: David Howells <dhowells@redhat.com>
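The behaviour described above (writes amalgamated until close, LF as separator, candidates tried in turn) can be modelled roughly as follows. This is a hypothetical Python sketch: the function names are invented, the `exists` callback stands in for a real directory lookup, and the 16-entry limit is not enforced here.

```python
def parse_sysname_writes(chunks):
    """Join the data from several write() calls and split it on LF.

    Empty items are dropped, so a lone newline produces an empty list,
    mirroring the `echo >/proc/fs/afs/sysname` clearing example.
    """
    data = "".join(chunks)
    return [item for item in data.split("\n") if item]


def sysname_candidates(component, sysnames):
    """Expand a trailing @sys into one candidate name per list entry."""
    suffix = "@sys"
    if not component.endswith(suffix):
        return [component]
    stem = component[: -len(suffix)]
    return [stem + name for name in sysnames]


def resolve(component, sysnames, exists):
    """Return the first candidate for which exists() is true, else None."""
    for candidate in sysname_candidates(component, sysnames):
        if exists(candidate):
            return candidate
    return None
```

For instance, with the list from the example above, a component named `@sys` would be tried as `foo`, then `bar`, then `linux-x86_64`, stopping at the first name that exists.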
    • D
      afs: Prospectively look up extra files when doing a single lookup · 5cf9dd55
      David Howells committed
      When afs_lookup() is called, prospectively look up the next 50 uncached
      fids also from that same directory and cache the results, rather than just
      looking up the one file requested.
      
      This allows us to use the FS.InlineBulkStatus RPC op to increase efficiency
      by fetching up to 50 file statuses at a time.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • D
      afs: Fix checker warnings · fe342cf7
      David Howells committed
      Fix warnings raised by checker, including:
      
       (*) Warnings raised by unequal comparison for the purposes of sorting,
           where the endianness doesn't matter:
      
      fs/afs/addr_list.c:246:21: warning: restricted __be16 degrades to integer
      fs/afs/addr_list.c:246:30: warning: restricted __be16 degrades to integer
      fs/afs/addr_list.c:248:21: warning: restricted __be32 degrades to integer
      fs/afs/addr_list.c:248:49: warning: restricted __be32 degrades to integer
      fs/afs/addr_list.c:283:21: warning: restricted __be16 degrades to integer
      fs/afs/addr_list.c:283:30: warning: restricted __be16 degrades to integer
      
       (*) afs_set_cb_interest() is not actually used and can be removed.
      
       (*) afs_cell_gc_delay() should be provided with a sysctl.
      
       (*) afs_cell_destroy() needs to use rcu_access_pointer() to read
           cell->vl_addrs.
      
       (*) afs_init_fs_cursor() should be static.
      
       (*) struct afs_vnode::permit_cache needs to be marked __rcu.
      
       (*) afs_server_rcu() needs to use rcu_access_pointer().
      
       (*) afs_destroy_server() should use rcu_access_pointer() on
           server->addresses as the server object is no longer accessible.
      
       (*) afs_find_server() casts __be16/__be32 values to int in order to
           directly compare them for the purpose of finding a match in a list,
           but it should also annotate the cast with __force to avoid checker
           warnings.
      
       (*) afs_check_permit() accesses vnode->permit_cache outside of the RCU
           readlock, though it doesn't then access the value; the extraneous
           access is deleted.
      
      False positives:
      
       (*) Conditional locking around the code in xdr_decode_AFSFetchStatus.  This
           can be dealt with in a separate patch.
      
      fs/afs/fsclient.c:148:9: warning: context imbalance in 'xdr_decode_AFSFetchStatus' - different lock contexts for basic block
      
       (*) Incorrect handling of seq-retry lock context balance:
      
      fs/afs/inode.c:455:38: warning: context imbalance in 'afs_getattr' - different lock contexts for basic block
      fs/afs/server.c:52:17: warning: context imbalance in 'afs_find_server' - different lock contexts for basic block
      fs/afs/server.c:128:17: warning: context imbalance in 'afs_find_server_by_uuid' - different lock contexts for basic block
      
      Errors:
      
       (*) afs_lookup_cell_rcu() needs to break out of the seq-retry loop, not go
           round again if it successfully found the workstation cell.
      
       (*) Fix UUID decode in afs_deliver_cb_probe_uuid().
      
       (*) afs_cache_permit() has a missing rcu_read_unlock() before one of the
           jumps to the someone_else_changed_it label.  Move the unlock to after
           the label.
      
       (*) afs_vl_get_addrs_u() is using ntohl() rather than htonl() when
           encoding to XDR.
      
       (*) afs_deliver_yfsvl_get_endpoints() is using htonl() rather than ntohl()
           when decoding from XDR.
      Signed-off-by: David Howells <dhowells@redhat.com>
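The last two errors are about direction: host-to-network conversion belongs on the encode path and network-to-host on the decode path. The Python sketch below illustrates that convention for a 32-bit XDR word using the standard `struct` module; the function names are hypothetical and this is not the kernel code. On common platforms the two C conversions perform the same byte operation, so mix-ups of this kind tend to surface as type-annotation warnings from the checker rather than as wrong bytes on the wire.

```python
import struct


def xdr_encode_u32(value):
    """Host integer -> 4 bytes in network (big-endian) order (the htonl side)."""
    return struct.pack(">I", value)


def xdr_decode_u32(buf, offset=0):
    """4 big-endian bytes at `offset` -> host integer (the ntohl side)."""
    return struct.unpack_from(">I", buf, offset)[0]
```

Keeping one helper per direction makes it harder to reach for the wrong conversion, which is the class of mistake the checker annotations are meant to catch.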
  2. 04 Apr, 2018 1 commit
    • D
      fscache: Attach the index key and aux data to the cookie · 402cb8dd
      David Howells committed
      Attach copies of the index key and auxiliary data to the fscache cookie so
      that:
      
       (1) The callbacks to the netfs for this stuff can be eliminated.  This
           can simplify things in the cache as the information is still
           available, even after the cache has relinquished the cookie.
      
       (2) Simplifies the locking requirements of accessing the information as we
           don't have to worry about the netfs object going away on us.
      
       (3) The cache can do lazy updating of the coherency information on disk.
           As long as the cache is flushed before reboot/poweroff, there's no
           need to update the coherency info on disk every time it changes.
      
       (4) Cookies can be hashed or put in a tree as the index key is easily
           available.  This allows:
      
           (a) Checks for duplicate cookies can be made at the top fscache layer
           	 rather than down in the bowels of the cache backend.
      
           (b) Caching can be added to a netfs object that has a cookie if the
           	 cache is brought online after the netfs object is allocated.
      
      A certain amount of space is made in the cookie for inline copies of the
      data, but if it won't fit there, extra memory will be allocated for it.
      
      The downside of this is that live cache operation requires more memory.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Anna Schumaker <anna.schumaker@netapp.com>
      Tested-by: Steve Dickson <steved@redhat.com>
  3. 28 Mar, 2018 1 commit
    • D
      rxrpc, afs: Use debug_ids rather than pointers in traces · a25e21f0
      David Howells committed
      In rxrpc and afs, use the debug_ids that are monotonically allocated to
      various objects as they're allocated rather than pointers as kernel
      pointers are now hashed making them less useful.  Further, the debug ids
      aren't reused anywhere nearly as quickly.
      
      In addition, allow kernel services that use rxrpc, such as afs, to take
      numbers from the rxrpc counter, assign them to their own call struct and
      pass them in to rxrpc for both client and service calls so that the trace
      lines for each will have the same ID tag.
      Signed-off-by: David Howells <dhowells@redhat.com>
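The idea of tagging trace lines with a shared, monotonically allocated ID instead of a pointer can be illustrated with the sketch below. It is a hypothetical Python model: the allocator class, the `trace` helper and the output format are invented for illustration and do not reflect the actual tracepoint layout.

```python
import itertools
import threading


class DebugIdAllocator:
    """Hands out monotonically increasing IDs that are never reused."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            return next(self._counter)


def trace(layer, event, debug_id):
    """Format a trace line; the same debug_id links lines across layers."""
    return f"{layer}: {event} c={debug_id:08x}"
```

If the filesystem layer takes an ID from the transport's allocator and passes it down with the call, lines emitted by both layers carry the same tag and can be correlated with a simple text search.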
  4. 06 Feb, 2018 1 commit
    • D
      afs: Support the AFS dynamic root · 4d673da1
      David Howells committed
      Support the AFS dynamic root which is a pseudo-volume that doesn't connect
      to any server resource, but rather is just a root directory that
      dynamically creates mountpoint directories where the name of such a
      directory is the name of the cell.
      
      Such a mount can be created thus:
      
      	mount -t afs none /afs -o dyn
      
      Dynamic root superblocks aren't shared except by bind mounts and
      propagation.  Cell root volumes can then be mounted by referring to them by
      name, e.g.:
      
      	ls /afs/grand.central.org/
      	ls /afs/.grand.central.org/
      
      The kernel will upcall to consult the DNS if the address wasn't supplied
      directly.
      Signed-off-by: David Howells <dhowells@redhat.com>
  5. 01 Dec, 2017 1 commit
    • D
      afs: Properly reset afs_vnode (inode) fields · f8de483e
      David Howells committed
      When an AFS inode is allocated by afs_alloc_inode(), the allocated
      afs_vnode struct isn't necessarily reset from the last time it was used as
      an inode because the slab constructor is only invoked once when the memory
      is obtained from the page allocator.
      
      This means that information can leak from one inode to the next because
      we're not calling kmem_cache_zalloc().  Some of the information isn't
      reset, in particular the permit cache pointer.
      
      Bring the clearances up to date.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-by: Marc Dionne <marc.dionne@auristor.com>
  6. 17 Nov, 2017 1 commit
    • D
      afs: Fix file locking · 0fafdc9f
      David Howells committed
      Fix the AFS file locking whereby the use of the big kernel lock (which
      could be slept with) was replaced by a spinlock (which couldn't).  The
      problem is that the AFS code was doing stuff inside the critical section
      that might call schedule(), so this is a broken transformation.
      
      Fix this by the following means:
      
       (1) Use a state machine with a proper state that can only be changed under
           the spinlock rather than using a collection of bit flags.
      
       (2) Cache the key used for the lock and the lock type in the afs_vnode
           struct so that the manager work function doesn't have to refer to a
           file_lock struct that's been dequeued.  This makes signal handling
           safer.
      
       (3) Move the unlock from afs_do_unlk() to afs_fl_release_private() which
           means that unlock is achieved in other circumstances too.
      
       (4) Unlock the file on the server before taking the next conflicting lock.
      
      Also change:
      
       (1) Check the permits on a file before actually trying the lock.
      
       (2) fsync the file before effecting an explicit unlock operation.  We
           don't fsync if the lock is erased otherwise as we might not be in a
           context where we can actually do that.
      
      Further fixes:
      
       (1) Fixed-fileserver address rotation is made to work.  It's only used by
           the locking functions, so couldn't be tested before.
      
      Fixes: 72f98e72 ("locks: turn lock_flocks into a spinlock")
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: jlayton@redhat.com
  7. 13 Nov, 2017 22 commits
    • D
      afs: Protect call->state changes against signals · 98bf40cd
      David Howells committed
      Protect call->state changes against the call being prematurely terminated
      due to a signal.
      
      What can happen is that a signal causes afs_wait_for_call_to_complete() to
      abort an afs_call because it's not yet complete whilst afs_deliver_to_call()
      is delivering data to that call.
      
      If the data delivery causes the state to change, this may overwrite the state
      of the afs_call, making it not-yet-complete again - but no further
      notifications will be forthcoming from AF_RXRPC as the rxrpc call has been
      aborted and completed, so kAFS will just hang in various places waiting for
      that call or on page bits that need clearing by that call.
      
      A tracepoint to monitor call state changes is also provided.
      Signed-off-by: David Howells <dhowells@redhat.com>
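The hazard described above is a late state update overwriting a completion that has already happened. A common guard is to make every transition conditional on the state the caller expects, under a lock. The Python sketch below shows that pattern; the state names, the class and its methods are hypothetical and are not taken from the kernel source.

```python
import threading
from enum import Enum, auto


class CallState(Enum):
    AWAIT_REPLY = auto()
    REPLYING = auto()
    COMPLETE = auto()


class Call:
    def __init__(self):
        self.state = CallState.AWAIT_REPLY
        self.error = 0
        self._lock = threading.Lock()

    def set_state(self, expected, new):
        """Move to `new` only if the call is still in `expected`."""
        with self._lock:
            if self.state is expected:
                self.state = new
                return True
            return False

    def set_complete(self, error=0):
        """Complete the call once; later attempts are ignored."""
        with self._lock:
            if self.state is CallState.COMPLETE:
                return False
            self.state = CallState.COMPLETE
            self.error = error
            return True
```

In this model, if a signal completes the call first, the data-delivery path's `set_state()` fails its expectation check and the call stays complete instead of reverting to a waiting state.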
    • D
      afs: Implement shared-writeable mmap · 1cf7a151
      David Howells committed
      Implement shared-writeable mmap for AFS.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • D
      afs: Get rid of the afs_writeback record · 4343d008
      David Howells committed
      Get rid of the afs_writeback record that kAFS is using to match keys with
      writes made by that key.
      
      Instead, keep a list of keys that have a file open for writing and/or
      sync'ing and iterate through those.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • D
      afs: Introduce a file-private data record · 215804a9
      David Howells committed
      Introduce a file-private data record for kAFS and put the key into it
      rather than storing the key in file->private_data.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • D
      afs: Fix directory read/modify race · dab17c1a
      David Howells committed
      Because parsing of the directory wasn't being done under any sort of lock,
      the pages holding the directory content can get invalidated whilst the
      parsing is ongoing.
      
      Further, the directory page check function gets called outside of the page
      lock, so if the page gets cleared or updated, this may return reports of
      bad magic numbers in the directory page.
      
      Also, the directory may change size whilst checking and parsing are
      ongoing, so more care needs to be taken here.
      
      Fix this by:
      
       (1) Perform the page check from the page filling function before we set
           PageUptodate and drop the page lock.
      
       (2) Check for the file having shrunk and the page having been abandoned
           before checking the page contents.
      
       (3) Lock the page whilst parsing it for the directory iterator.
      
      Whilst we're at it, add a tracepoint to report check failure.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • D
      afs: Trace the initiation and completion of client calls · 025db80c
      David Howells committed
      Add tracepoints to trace the initiation and completion of client calls
      within the kafs filesystem.
      
      The afs_make_vl_call tracepoint watches calls to the volume location
      database server.
      
      The afs_make_fs_call tracepoint watches calls to the file server.
      
      The afs_call_done tracepoint watches for call completion.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • D
      afs: Make use of the YFS service upgrade to fully support IPv6 · bf99a53c
      David Howells committed
      YFS VL servers offer an upgraded Volume Location service that can return
      IPv6 addresses for fileservers and volume servers, in addition to IPv4
      addresses, using the YFSVL.GetEndpoints operation, which we should use if
      it's available.
      
      To this end:
      
       (1) Make rxrpc_kernel_recv_data() return the call's current service ID so
           that the caller can detect service upgrade and see what the service
           was upgraded to.
      
       (2) When we see a VL server address we haven't seen before, send a
           VL.GetCapabilities operation to it with the service upgrade bit set.
      
           If we get an upgrade to the YFS VL service, change the service ID in
           the address list for that address to use the upgraded service and set
           a flag to note that this appears to be a YFS-compatible server.
      
       (3) If, when a server's addresses are being looked up, we note that we
           previously detected a YFS-compatible server, then send the
           YFSVL.GetEndpoints operation rather than VL.GetAddrsU.
      
       (4) Build a fileserver address list from the reply of YFSVL.GetEndpoints,
           including both IPv4 and IPv6 addresses.  Volume server addresses are
           discarded.
      
       (5) The address list is sorted by address and port now, instead of just
           address.  This allows multiple servers on the same host sitting on
           different ports.
      Signed-off-by: David Howells <dhowells@redhat.com>
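Point (5) above, sorting by address and then port, can be sketched with a composite sort key. This is an illustrative Python model using the standard `ipaddress` module; the function name, the (address, port) tuple representation and the choice to order IPv4 before IPv6 are assumptions of the sketch.

```python
import ipaddress


def sort_endpoints(endpoints):
    """Sort (address_string, port) pairs by address, then by port.

    The address family is part of the key so that IPv4 and IPv6
    addresses are never compared against each other directly.
    """
    def key(endpoint):
        addr = ipaddress.ip_address(endpoint[0])
        return (addr.version, int(addr), endpoint[1])

    return sorted(endpoints, key=key)
```

Including the port in the key is what lets two servers on the same host, listening on different ports, occupy distinct and stable positions in the list.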
    • D
      afs: Overhaul volume and server record caching and fileserver rotation · d2ddc776
      David Howells committed
      The current code assumes that volumes and servers are per-cell and are
      never shared, but this is not enforced, and, indeed, public cells do exist
      that are aliases of each other.  Further, an organisation can, say, set up
      a public cell and a private cell with overlapping, but not identical, sets
      of servers.  The difference is purely in the database attached to the VL
      servers.
      
      The current code will malfunction if it sees a server in two cells as it
      assumes global address -> server record mappings and that each server is in
      just one cell.
      
      Further, each server may have multiple addresses - and may have addresses
      of different families (IPv4 and IPv6, say).
      
      To this end, the following structural changes are made:
      
       (1) Server record management is overhauled:
      
           (a) Server records are made independent of cell.  The namespace keeps
           	 track of them, volume records have lists of them and each vnode
           	 has a server on which its callback interest currently resides.
      
           (b) The cell record no longer keeps a list of servers known to be in
           	 that cell.
      
           (c) The server records are now kept in a flat list because there's no
           	 single address to sort on.
      
           (d) Server records are now keyed by their UUID within the namespace.
      
           (e) The addresses for a server are obtained with the VL.GetAddrsU
           	 rather than with VL.GetEntryByName, using the server's UUID as a
           	 parameter.
      
           (f) Cached server records are garbage collected after a period of
           	 non-use and are counted out of existence before purging is allowed
           	 to complete.  This protects the work functions against rmmod.
      
           (g) The servers list is now in /proc/fs/afs/servers.
      
       (2) Volume record management is overhauled:
      
           (a) An RCU-replaceable server list is introduced.  This tracks both
           	 servers and their corresponding callback interests.
      
           (b) The superblock is now keyed on cell record and numeric volume ID.
      
           (c) The volume record is now tied to the superblock which mounts it,
           	 and is activated when mounted and deactivated when unmounted.
           	 This makes it easier to handle the cache cookie without causing a
           	 double-use in fscache.
      
           (d) The volume record is loaded from the VLDB using VL.GetEntryByNameU
           	 to get the server UUID list.
      
           (e) The volume name is updated if it is seen to have changed when the
           	 volume is updated (the update is keyed on the volume ID).
      
       (3) The vlocation record is got rid of and VLDB records are no longer
           cached.  Sufficient information is stored in the volume record, though
           an update to a volume record is now no longer shared between related
           volumes (volumes come in bundles of three: R/W, R/O and backup).
      
      and the following procedural changes are made:
      
       (1) The fileserver cursor introduced previously is now fleshed out and
           used to iterate over fileservers and their addresses.
      
       (2) Volume status is checked during iteration, and the server list is
           replaced if a change is detected.
      
       (3) Server status is checked during iteration, and the address list is
           replaced if a change is detected.
      
       (4) The abort code is saved into the address list cursor and -ECONNABORTED
           returned in afs_make_call() if a remote abort happened rather than
           translating the abort into an error message.  This allows actions to
           be taken depending on the abort code more easily.
      
           (a) If a VMOVED abort is seen then this is handled by rechecking the
           	 volume and restarting the iteration.
      
           (b) If a VBUSY, VRESTARTING or VSALVAGING abort is seen then this is
               handled by sleeping for a short period and retrying and/or trying
               other servers that might serve that volume.  A message is also
               displayed once until the condition has cleared.
      
           (c) If a VOFFLINE abort is seen, then this is handled as VBUSY for the
           	 moment.
      
           (d) If a VNOVOL abort is seen, the volume is rechecked in the VLDB to
           	 see if it has been deleted; if not, the fileserver is probably
           	 indicating that the volume couldn't be attached and needs
           	 salvaging.
      
           (e) If statfs() sees one of these aborts, it does not sleep, but
           	 rather returns an error, so as not to block the umount program.
      
       (5) The fileserver iteration functions in vnode.c are now merged into
           their callers and more heavily macroised around the cursor.  vnode.c
           is removed.
      
       (6) Operations on a particular vnode are serialised on that vnode because
           the server will lock that vnode whilst it operates on it, so a second
           op sent will just have to wait.
      
       (7) Fileservers are probed with FS.GetCapabilities before being used.
           This is where service upgrade will be done.
      
       (8) A callback interest on a fileserver is set up before an FS operation
           is performed and passed through to afs_make_call() so that it can be
           set on the vnode if the operation returns a callback.  The callback
           interest is passed through to afs_iget() also so that it can be set
           there too.
      
      In general, record updating is done on an as-needed basis when we try to
      access servers, volumes or vnodes rather than offloading it to work items
      and special threads.
      
      Notes:
      
       (1) Pre AFS-3.4 servers are no longer supported, though this can be added
           back if necessary (AFS-3.4 was released in 1998).
      
       (2) VBUSY is retried forever for the moment at intervals of 1s.
      
       (3) /proc/fs/afs/<cell>/servers no longer exists.
      Signed-off-by: David Howells <dhowells@redhat.com>
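The abort-code handling in procedural change (4) amounts to a small dispatch on the abort code. The Python sketch below models it with symbolic names only; the function, the action strings and the reading of "one of these aborts" in (e) as the busy-type aborts are assumptions of this sketch, not the kernel implementation.

```python
# Aborts that the description above handles by waiting and retrying;
# VOFFLINE is grouped here because it is "handled as VBUSY for the moment".
BUSY_ABORTS = {"VBUSY", "VRESTARTING", "VSALVAGING", "VOFFLINE"}


def action_for_abort(abort_name, in_statfs=False):
    """Map a remote abort to the action the rotation loop would take."""
    if abort_name == "VMOVED":
        return "recheck-volume-and-restart"
    if abort_name in BUSY_ABORTS:
        # statfs() must not sleep, so that umount is not blocked.
        return "return-error" if in_statfs else "sleep-and-retry"
    if abort_name == "VNOVOL":
        return "recheck-vldb"
    return "return-error"
```

Keeping the raw abort code available to the caller, rather than translating it into an errno straight away, is what makes this kind of per-code decision possible.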
    • D
      afs: Add an address list concept · 8b2a464c
      David Howells committed
      Add an RCU replaceable address list structure to hold a list of server
      addresses.  The list also holds the
      
      To this end:
      
       (1) A cell's VL server address list can be loaded directly via insmod or
           echo to /proc/fs/afs/cells or dynamically from a DNS query for AFSDB
           or SRV records.
      
       (2) Anyone wanting to use a cell's VL server address must wait until the
           cell record comes online and has tried to obtain some addresses.
      
       (3) An FS server's address list, for the moment, has a single entry that
           is the key to the server list.  This will change in the future when a
           server is instead keyed on its UUID and the VL.GetAddrsU operation is
           used.
      
       (4) An 'address cursor' concept is introduced to handle iteration through
           the address list.  This is passed to afs_make_call() because, in the
           future, information (such as the abort code) that doesn't outlast the
           call will be returned in it.
      
      In the future, we might want to annotate the list with information about
      how each address fares.  We might then want to propagate such annotations
      over address list replacement.
      
      Whilst we're at it, we allow IPv6 addresses to be specified in
      colon-delimited lists by enclosing them in square brackets.
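
The bracket rule for IPv6 addresses in colon-delimited lists can be illustrated with a small userspace sketch.  This is not the kernel's parser: split_addr_list() and ADDR_TEXT_MAX are hypothetical names, and the sketch only splits the text into address strings without validating them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ADDR_TEXT_MAX 64	/* hypothetical limit for one textual address */

/*
 * Split a colon-delimited address list into individual address strings.
 * An address enclosed in [ ] may itself contain colons (IPv6).
 * Returns the number of addresses stored, or -1 if the text is malformed
 * or does not fit in the output array.
 */
static int split_addr_list(const char *text, char out[][ADDR_TEXT_MAX], int max)
{
	int n = 0;

	assert(text && out && max > 0);

	while (*text) {
		const char *start, *end;
		size_t len;

		if (*text == '[') {
			/* Bracketed form: take everything up to the ']'. */
			start = text + 1;
			end = strchr(start, ']');
			if (!end)
				return -1;
			text = end + 1;
		} else {
			/* Bare form: runs up to the next ':' or end of string. */
			start = text;
			end = start + strcspn(start, ":");
			text = end;
		}

		len = (size_t)(end - start);
		if (len == 0 || len >= ADDR_TEXT_MAX || n >= max)
			return -1;
		memcpy(out[n], start, len);
		out[n][len] = '\0';
		n++;

		/* After an address we expect either a separator or the end. */
		if (*text == ':')
			text++;
		else if (*text != '\0')
			return -1;
	}
	return n;
}
```

Under these assumptions, "192.0.2.1:[2001:db8::1]:192.0.2.2" splits into three addresses, whereas an unterminated bracket is rejected.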
      Signed-off-by: David Howells <dhowells@redhat.com>
      8b2a464c
    • D
      afs: Overhaul cell database management · 989782dc
      David Howells committed
      Overhaul the way that the in-kernel AFS client keeps track of cells in the
      following manner:
      
       (1) Cells are now held in an rbtree to make walking them quicker and RCU
           managed (though this is probably overkill).
      
       (2) Cells now have a manager work item that:
      
           (A) Looks after fetching and refreshing the VL server list.
      
           (B) Manages cell record lifetime, including initialising and
           	 destruction.
      
           (C) Manages cell record caching whereby records are kept around for a
           	 certain time after last use and then destroyed.
      
           (D) Manages the FS-Cache index cookie for a cell.  It is not permitted
           	 for a cookie to be in use twice, so we have to be careful to not
           	 allow a new cell record to exist at the same time as an old record
           	 of the same name.
      
       (3) Each AFS network namespace is given a manager work item that manages
           the cells within it, maintaining a single timer to prod cells into
           updating their DNS records.
      
           This uses the timer_reduce() facility to make the timer expire at the
           soonest timed event that needs happening.
      
       (4) When a module is being unloaded, cells and cell managers are now
           counted out using dec_after_work() to make sure the module text is
           pinned until after the data structures have been cleaned up.
      
       (5) Each cell's VL server list is now protected by a seqlock rather than a
           semaphore.
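
The reduce-only timer behaviour described in item (3) can be sketched in userspace as follows.  struct sketch_timer and sketch_timer_reduce() are hypothetical stand-ins, not the kernel timer API; the point is only that a shared timer is armed if idle and otherwise only ever pulled earlier, so it fires at the soonest pending event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a single per-namespace timer. */
struct sketch_timer {
	bool     armed;		/* is the timer currently pending? */
	uint64_t expires;	/* expiry time in arbitrary ticks */
};

/*
 * Arm the timer if it is idle, or pull its expiry earlier if 'when' is
 * sooner than the current expiry.  The expiry is never pushed later.
 * Returns true if the timer state changed.
 */
static bool sketch_timer_reduce(struct sketch_timer *t, uint64_t when)
{
	assert(t);

	if (t->armed && t->expires <= when)
		return false;	/* an earlier event is already scheduled */

	t->armed = true;
	t->expires = when;
	return true;
}
```

Each cell would call this with its own next-update time, and the manager would recompute the next expiry after the timer fires.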
      Signed-off-by: David Howells <dhowells@redhat.com>
      989782dc
    • D
      afs: Overhaul permit caching · be080a6f
      David Howells committed
      Overhaul permit caching in AFS by making it per-vnode and sharing permit
      lists where possible.
      
      When most of the fileserver operations are called, they return a status
      structure indicating the (revised) details of the vnode or vnodes involved
      in the operation.  This includes the access mask derived from the ACL
      (named CallerAccess in the protocol definition file).  This is cacheable
      and if the ACL changes, the server will tell us that it is breaking the
      callback promise, at which point we can discard the currently cached
      permits.
      
      With this patch, the afs_permits structure has, at the end, an array of
      { key, CallerAccess } elements, sorted by key pointer.  This is then cached
      in a hash table so that it can be shared between vnodes with the same
      access permits.
      
      Permit lists can only be shared if they contain the exact same set of
      key->CallerAccess mappings.
      
      Note that the hash table is global rather than being per-net_ns.  If the keys
      in a permit list cross net_ns boundaries, there is no problem sharing the
      cached permits, since the permits are just integer masks.
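
As a rough illustration of the layout described above, the sketch below keeps { key, CallerAccess } pairs sorted by key pointer, looks one up by binary search and compares two lists for exact equality (the condition for sharing).  All names are hypothetical, the array is fixed-size for brevity, and the hash table and refcounting are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_permit {
	const void *key;	/* stands in for the kernel key pointer */
	uint32_t    access;	/* CallerAccess mask */
};

struct sketch_permits {
	size_t               nr;
	struct sketch_permit permits[8];	/* fixed size for the sketch */
};

/* Binary search of the list, which must be sorted by key pointer. */
static bool sketch_lookup(const struct sketch_permits *p, const void *key,
			  uint32_t *access)
{
	size_t lo = 0, hi;

	assert(p && access);
	hi = p->nr;
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		uintptr_t k = (uintptr_t)p->permits[mid].key;

		if (k == (uintptr_t)key) {
			*access = p->permits[mid].access;
			return true;
		}
		if (k < (uintptr_t)key)
			lo = mid + 1;
		else
			hi = mid;
	}
	return false;
}

/* Two lists may be shared only if they hold exactly the same mappings. */
static bool sketch_same(const struct sketch_permits *a,
			const struct sketch_permits *b)
{
	size_t i;

	if (a->nr != b->nr)
		return false;
	for (i = 0; i < a->nr; i++)
		if (a->permits[i].key != b->permits[i].key ||
		    a->permits[i].access != b->permits[i].access)
			return false;
	return true;
}
```

Because both lists are sorted the same way, the equality check is a single linear pass, which is what makes a hash-then-compare sharing scheme cheap.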
      
      Since permit lists pin keys, the permit cache also makes it easier for a
      future patch to find all occurrences of a key and remove them by means of
      setting the afs_permits::invalidated flag and then clearing the appropriate
      key pointer.  In such an event, memory barriers will need adding.
      
      Lastly, the permit caching is skipped if the server has sent either a
      vnode-specific or an entire-server callback since the start of the
      operation.
      Signed-off-by: David Howells <dhowells@redhat.com>
      be080a6f
    • D
      afs: Overhaul the callback handling · c435ee34
      David Howells committed
      Overhaul the AFS callback handling by the following means:
      
       (1) Don't give up callback promises on vnodes that we are no longer using,
           rather let them just expire on the server or let the server break
           them.  This is actually more efficient for the server as the callback
           lookup is expensive if there are lots of extant callbacks.
      
       (2) Only give up the callback promises we have from a server when the
           server record is destroyed.  Then we can just give up *all* the
           callback promises on it in one go.
      
       (3) Servers can end up being shared between cells if cells are aliased, so
           don't add all the vnodes being backed by a particular server into a
           big FID-indexed tree on that server as there may be duplicates.
      
           Instead have each volume instance (~= superblock) register an interest
           in a server as it starts to make use of it and use this to allow the
           processor for callbacks from the server to find the superblock and
           thence the inode corresponding to the FID being broken by means of
           ilookup_nowait().
      
       (4) Rather than iterating over the entire callback list when a mass-break
           comes in from the server, maintain a counter of mass-breaks in
           afs_server (cb_seq) and make afs_validate() check it against the copy
           in afs_vnode.
      
           It would be nice not to have to take a read_lock whilst doing this,
           but that's tricky without using RCU.
      
       (5) Save a ref on the fileserver we're using for a call in the afs_call
           struct so that we can access its cb_s_break during call decoding.
      
       (6) Write-lock around callback and status storage in a vnode and read-lock
           around getattr so that we don't see the status mid-update.
      
      This has the following consequences:
      
       (1) Data invalidation isn't seen until someone calls afs_validate() on a
           vnode.  Unfortunately, we need to use a key to query the server, but
           getting one from a background thread is tricky without caching loads
           of keys all over the place.
      
       (2) Mass invalidation isn't seen until someone calls afs_validate().
      
       (3) Callback breaking is going to hit the inode_hash_lock quite a bit.
           Could this be replaced with rcu_read_lock() since inodes are destroyed
           under RCU conditions?
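
The mass-break counter in item (4) and its lazy consequence can be sketched as below.  The structures and field names are hypothetical simplifications (the real afs_server and afs_vnode carry far more state), and the locking discussed above is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, single-threaded stand-ins for the server and vnode records. */
struct sketch_server {
	unsigned int cb_seq;		/* bumped on every mass-break */
};

struct sketch_vnode {
	const struct sketch_server *server;
	unsigned int seen_seq;		/* copy of cb_seq when last validated */
	bool         cb_promised;	/* do we hold a callback promise? */
};

/* The server broke all its callbacks: just bump the counter, O(1). */
static void sketch_mass_break(struct sketch_server *s)
{
	assert(s);
	s->cb_seq++;
}

/*
 * Returns true if the cached state could be used as-is, false if the
 * vnode had to be revalidated (where a real client would refetch status).
 */
static bool sketch_validate(struct sketch_vnode *v)
{
	assert(v && v->server);

	if (v->cb_promised && v->seen_seq == v->server->cb_seq)
		return true;

	v->seen_seq = v->server->cb_seq;
	v->cb_promised = true;
	return false;
}
```

Note that nothing happens to a vnode at mass-break time; the invalidation is only noticed on the next validation, which matches consequence (2).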
      Signed-off-by: David Howells <dhowells@redhat.com>
      c435ee34
    • D
      afs: Rename struct afs_call server member to cm_server · d0676a16
      David Howells committed
      Rename the server member of struct afs_call to cm_server as we're only
      going to be using it for incoming calls for the Cache Manager service.
      This makes it easier to differentiate from the pointer to the target server
      for the client, which will point to a different structure to allow for
      callback handling.
      Signed-off-by: David Howells <dhowells@redhat.com>
      d0676a16
    • D
      afs: Potentially return call->reply[0] from afs_make_call() · 33cd7f2b
      David Howells committed
      If call->ret_reply0 is set, return call->reply[0] on success.  Change the
      return type of afs_make_call() to long so that this can be passed back
      without bit loss and then cast to a pointer if required.
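
The return convention can be sketched in userspace as follows.  The struct and helpers are hypothetical, and the sketch assumes, as on Linux, that long is at least as wide as a pointer and that the top 4095 values of the address space are reserved for negative error codes.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define SKETCH_MAX_ERRNO 4095	/* assumed size of the error-code range */

_Static_assert(sizeof(long) >= sizeof(void *),
	       "long must be able to carry a pointer without bit loss");

/* Hypothetical cut-down call record. */
struct sketch_call {
	bool  ret_reply0;	/* hand reply[0] back to the caller? */
	int   error;		/* 0 on success or a negative errno */
	void *reply[4];
};

/* Returns a negative errno, 0, or reply[0] carried in a long. */
static long sketch_make_call(struct sketch_call *call)
{
	assert(call);

	if (call->error < 0)
		return call->error;
	if (call->ret_reply0)
		return (long)call->reply[0];
	return 0;
}

/* True if 'ret' lies in the range reserved for error codes. */
static bool sketch_is_err(long ret)
{
	return (unsigned long)ret >= (unsigned long)-SKETCH_MAX_ERRNO;
}
```

A caller that set ret_reply0 would test the result with sketch_is_err() and, if it is not an error, cast it back to the pointer type it expects.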
      Signed-off-by: David Howells <dhowells@redhat.com>
      33cd7f2b
    • D
      afs: Condense afs_call's reply{,2,3,4} into an array · 97e3043a
      David Howells committed
      Condense struct afs_call's reply anchor members - reply{,2,3,4} - into an
      array.
      Signed-off-by: David Howells <dhowells@redhat.com>
      97e3043a
    • D
      afs: Consolidate abort_to_error translators · f780c8ea
      David Howells committed
      The AFS abort code space is shared across all services, so there's no need
      for separate abort_to_error translators for each service.
      
      Consolidate them into a single function and remove the function pointers
      for them.
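
A single shared translator might look roughly like the sketch below.  It covers only a handful of codes; the numeric values are the volume-error codes as I recall them from the AFS headers and should be treated as assumptions, and the -EIO fallback is this sketch's own choice rather than necessarily what the kernel returns.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* A few abort codes; values assumed, see the note above. */
enum sketch_abort {
	SKETCH_VNOVNODE   = 102,
	SKETCH_VDISKFULL  = 108,
	SKETCH_VOVERQUOTA = 109,
	SKETCH_VBUSY      = 110,
};

/*
 * One translator for every service: since the abort code space is shared,
 * the mapping does not depend on which service produced the abort.
 */
static int sketch_abort_to_error(uint32_t abort_code)
{
	switch (abort_code) {
	case SKETCH_VNOVNODE:	return -ENOENT;
	case SKETCH_VDISKFULL:	return -ENOSPC;
	case SKETCH_VOVERQUOTA:	return -EDQUOT;
	case SKETCH_VBUSY:	return -EBUSY;
	default:		return -EIO;	/* sketch's fallback */
	}
}
```

With one function serving all callers, a per-service call type no longer needs to carry a translator function pointer.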
      Signed-off-by: David Howells <dhowells@redhat.com>
      f780c8ea
    • D
      afs: Keep and pass sockaddr_rxrpc addresses rather than in_addr · 4d9df986
      David Howells committed
      Keep and pass sockaddr_rxrpc addresses around rather than keeping and
      passing in_addr addresses to allow for the use of IPv6 and non-standard
      port numbers in future.
      
      This also allows the port and service_id fields to be removed from the
      afs_call struct.
      Signed-off-by: David Howells <dhowells@redhat.com>
      4d9df986
    • D
      afs: Update the cache index structure · ad6a942a
      David Howells committed
      Update the cache index structure in the following ways:
      
       (1) Don't use the volume name followed by the volume type as levels in the
           cache index.  Volumes can be renamed.  Use the volume ID instead.
      
       (2) Don't store the VLDB data for a volume in the tree.  If the volume
           database should be cached locally, then it should be done in a separate
           tree.
      
       (3) Expand the volume ID stored in the cache to 64 bits.
      
       (4) Expand the file/vnode ID stored in the cache to 96 bits.
      
       (5) Increment the cache structure version number to 1.
      Signed-off-by: David Howells <dhowells@redhat.com>
      ad6a942a
    • D
      afs: Push the net ns pointer to more places · 9ed900b1
      David Howells committed
      Push the network namespace pointer to more places in AFS, including the
      afs_server structure (which doesn't hold a ref on the netns).
      
      In particular, afs_put_cell() now requires a net ns parameter so that
      it can safely alter the netns after decrementing the cell usage count - the
      cell will be deallocated by a background thread after being cached for a
      period, which means that it's not safe to access it after reducing its
      usage count.
      Signed-off-by: David Howells <dhowells@redhat.com>
      9ed900b1
    • D
      afs: Note the cell in the superblock info also · 49566f6f
      David Howells committed
      Keep a reference to the cell in the superblock info structure in addition
      to the volume and net pointers.  This will make it easier to clean up in a
      future patch in which afs_put_volume() will need the cell pointer.
      
      Whilst we're at it, make the cell and volume getting functions return a
      pointer to the object got to make the call sites look neater.
      Signed-off-by: David Howells <dhowells@redhat.com>
      49566f6f
    • D
      afs: Fix server reaping · 59fa1c4a
      David Howells committed
      Fix server reaping and make sure it's all done before we start trying to
      purge cells, given that servers currently pin cells.
      Signed-off-by: David Howells <dhowells@redhat.com>
      59fa1c4a
    • D
      afs: Lay the groundwork for supporting network namespaces · f044c884
      David Howells committed
      Lay the groundwork for supporting network namespaces (netns) in the AFS
      filesystem by moving various global features to a network-namespace struct
      (afs_net) and providing an instance of this as a temporary global variable
      that everything uses via accessor functions for the moment.
      
      The following changes have been made:
      
       (1) Store the netns in the superblock info.  This will be obtained from
           the mounter's nsproxy on a manual mount and inherited from the parent
           superblock on an automount.
      
       (2) The cell list is made per-netns.  It can be viewed through
           /proc/net/afs/cells and also be modified by writing commands to that
           file.
      
       (3) The local workstation cell is set per-ns in /proc/net/afs/rootcell.
           This is unset by default.
      
       (4) The 'rootcell' module parameter, which sets a cell and VL server list,
           modifies the init net namespace, thereby allowing an AFS root fs to be
           theoretically used.
      
       (5) The volume location lists and the file lock manager are made
           per-netns.
      
       (6) The AF_RXRPC socket and associated I/O bits are made per-ns.
      
      The various workqueues remain global for the moment.
      
      Changes still to be made:
      
       (1) /proc/fs/afs/ should be moved to /proc/net/afs/ and a symlink emplaced
           from the old name.
      
       (2) A per-netns subsys needs to be registered for AFS into which it can
           store its per-netns data.
      
       (3) Rather than the AF_RXRPC socket being opened on module init, it needs
           to be opened on the creation of a superblock in that netns.
      
       (4) The socket needs to be closed when the last superblock using it is
           destroyed and all outstanding client calls on it have been completed.
           This prevents a reference loop on the namespace.
      
       (5) It is possible that several namespaces will want to use AFS, in which
           case each one will need its own UDP port.  These can either be set
           through /proc/net/afs/cm_port or the kernel can pick one at random.
           The init_ns gets 7001 by default.
      
      Other issues that need resolving:
      
       (1) The DNS keyring needs net-namespacing.
      
       (2) Where do upcalls go (eg. DNS request-key upcall)?
      
       (3) Need something like open_socket_in_file_ns() syscall so that AFS
           command line tools attempting to operate on an AFS file/volume have
           their RPC calls go to the right place.
      Signed-off-by: David Howells <dhowells@redhat.com>
      f044c884