1. 10 4月, 2018 6 次提交
    • D
      afs: Fix directory handling · f3ddee8d
      David Howells 提交于
      AFS directories are structured blobs that are downloaded just like files
      and then parsed by the lookup and readdir code and, as such, are currently
      handled in the pagecache like any other file, with the entire directory
      content being thrown away each time the directory changes.
      
      However, since the blob is a known structure and since the data version
      counter on a directory increases by exactly one for each change committed
      to that directory, we can actually edit the directory locally rather than
      fetching it from the server after each locally-induced change.
      
      What we can't do, though, is mix data from the server and data from the
      client since the server is technically at liberty to rearrange or compress
      a directory if it sees fit, provided it updates the data version number
      when it does so and breaks the callback (ie. sends a notification).
      
      Further, lookup with lookup-ahead, readdir and, when it arrives, local
      editing are likely want to scan the whole of a directory.
      
      So directory handling needs to be improved to maintain the coherency of the
      directory blob prior to permitting local directory editing.
      
      To this end:
      
       (1) If any directory page gets discarded, invalidate and reread the entire
           directory.
      
       (2) If readpage notes that if when it fetches a single page that the
           version number has changed, the entire directory is flagged for
           invalidation.
      
       (3) Read as much of the directory in one go as we can.
      
      Note that this removes local caching of directories in fscache for the
      moment as we can't pass the pages to fscache_read_or_alloc_pages() since
      page->lru is in use by the LRU.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      f3ddee8d
    • D
      afs: Keep track of invalid-before version for dentry coherency · a4ff7401
      David Howells 提交于
      Each afs dentry is tagged with the version that the parent directory was at
      last time it was validated and, currently, if this differs, the directory
      is scanned and the dentry is refreshed.
      
      However, this leads to an excessive amount of revalidation on directories
      that get modified on the client without conflict with another client.  We
      know there's no conflict because the parent directory's data version number
      got incremented by exactly 1 on any create, mkdir, unlink, etc., therefore
      we can trust the current state of the unaffected dentries when we perform a
      local directory modification.
      
      Optimise by keeping track of the last version of the parent directory that
      was changed outside of the client in the parent directory's vnode and using
      that to validate the dentries rather than the current version.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      a4ff7401
    • D
      afs: Rearrange status mapping · dd9fbcb8
      David Howells 提交于
      Rearrange the AFSFetchStatus to inode attribute mapping code in a number of
      ways:
      
       (1) Use an XDR structure rather than a series of incremented pointer
           accesses when decoding an AFSFetchStatus object.  This allows
           out-of-order decode.
      
       (2) Don't store the if_version value but rather just check it and abort if
           it's not something we can handle.
      
       (3) Store the owner and group in the status record as raw values rather
           than converting them to kuid/kgid.  Do that when they're mapped into
           i_uid/i_gid.
      
       (4) Validate the type and abort code up front and abort if they're wrong.
      
       (5) Split the inode attribute setting out into its own function from the
           XDR decode of an AFSFetchStatus object.  This allows it to be called
           from elsewhere too.
      
       (6) Differentiate changes to data from changes to metadata.
      
       (7) Use the split-out attribute mapping function from afs_iget().
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      dd9fbcb8
    • D
      afs: Make it possible to get the data version in readpage · 0c3a5ac2
      David Howells 提交于
      Store the data version number indicated by an FS.FetchData op into the read
      request structure so that it's accessible by the page reader.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      0c3a5ac2
    • D
      afs: Dump bad status record · 888b3384
      David Howells 提交于
      Dump an AFS FileStatus record that is detected as invalid.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      888b3384
    • D
      afs: Prospectively look up extra files when doing a single lookup · 5cf9dd55
      David Howells 提交于
      When afs_lookup() is called, prospectively look up the next 50 uncached
      fids also from that same directory and cache the results, rather than just
      looking up the one file requested.
      
      This allows us to use the FS.InlineBulkStatus RPC op to increase efficiency
      by fetching up to 50 file statuses at a time.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      5cf9dd55
  2. 29 1月, 2018 1 次提交
    • J
      afs: convert to new i_version API · a01179e6
      Jeff Layton 提交于
      For AFS, it's generally treated as an opaque value, so we use the
      *_raw variants of the API here.
      
      Note that AFS has quite a different definition for this counter. AFS
      only increments it on changes to the data to the data in regular files
      and contents of the directories. Inode metadata changes do not result
      in a version increment.
      
      We'll need to reconcile that somehow if we ever want to present this to
      userspace via statx.
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      a01179e6
  3. 13 11月, 2017 10 次提交
    • D
      afs: Get rid of the afs_writeback record · 4343d008
      David Howells 提交于
      Get rid of the afs_writeback record that kAFS is using to match keys with
      writes made by that key.
      
      Instead, keep a list of keys that have a file open for writing and/or
      sync'ing and iterate through those.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      4343d008
    • D
      afs: Trace the initiation and completion of client calls · 025db80c
      David Howells 提交于
      Add tracepoints to trace the initiation and completion of client calls
      within the kafs filesystem.
      
      The afs_make_vl_call tracepoint watches calls to the volume location
      database server.
      
      The afs_make_fs_call tracepoint watches calls to the file server.
      
      The afs_call_done tracepoint watches for call completion.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      025db80c
    • D
      afs: Overhaul volume and server record caching and fileserver rotation · d2ddc776
      David Howells 提交于
      The current code assumes that volumes and servers are per-cell and are
      never shared, but this is not enforced, and, indeed, public cells do exist
      that are aliases of each other.  Further, an organisation can, say, set up
      a public cell and a private cell with overlapping, but not identical, sets
      of servers.  The difference is purely in the database attached to the VL
      servers.
      
      The current code will malfunction if it sees a server in two cells as it
      assumes global address -> server record mappings and that each server is in
      just one cell.
      
      Further, each server may have multiple addresses - and may have addresses
      of different families (IPv4 and IPv6, say).
      
      To this end, the following structural changes are made:
      
       (1) Server record management is overhauled:
      
           (a) Server records are made independent of cell.  The namespace keeps
           	 track of them, volume records have lists of them and each vnode
           	 has a server on which its callback interest currently resides.
      
           (b) The cell record no longer keeps a list of servers known to be in
           	 that cell.
      
           (c) The server records are now kept in a flat list because there's no
           	 single address to sort on.
      
           (d) Server records are now keyed by their UUID within the namespace.
      
           (e) The addresses for a server are obtained with the VL.GetAddrsU
           	 rather than with VL.GetEntryByName, using the server's UUID as a
           	 parameter.
      
           (f) Cached server records are garbage collected after a period of
           	 non-use and are counted out of existence before purging is allowed
           	 to complete.  This protects the work functions against rmmod.
      
           (g) The servers list is now in /proc/fs/afs/servers.
      
       (2) Volume record management is overhauled:
      
           (a) An RCU-replaceable server list is introduced.  This tracks both
           	 servers and their coresponding callback interests.
      
           (b) The superblock is now keyed on cell record and numeric volume ID.
      
           (c) The volume record is now tied to the superblock which mounts it,
           	 and is activated when mounted and deactivated when unmounted.
           	 This makes it easier to handle the cache cookie without causing a
           	 double-use in fscache.
      
           (d) The volume record is loaded from the VLDB using VL.GetEntryByNameU
           	 to get the server UUID list.
      
           (e) The volume name is updated if it is seen to have changed when the
           	 volume is updated (the update is keyed on the volume ID).
      
       (3) The vlocation record is got rid of and VLDB records are no longer
           cached.  Sufficient information is stored in the volume record, though
           an update to a volume record is now no longer shared between related
           volumes (volumes come in bundles of three: R/W, R/O and backup).
      
      and the following procedural changes are made:
      
       (1) The fileserver cursor introduced previously is now fleshed out and
           used to iterate over fileservers and their addresses.
      
       (2) Volume status is checked during iteration, and the server list is
           replaced if a change is detected.
      
       (3) Server status is checked during iteration, and the address list is
           replaced if a change is detected.
      
       (4) The abort code is saved into the address list cursor and -ECONNABORTED
           returned in afs_make_call() if a remote abort happened rather than
           translating the abort into an error message.  This allows actions to
           be taken depending on the abort code more easily.
      
           (a) If a VMOVED abort is seen then this is handled by rechecking the
           	 volume and restarting the iteration.
      
           (b) If a VBUSY, VRESTARTING or VSALVAGING abort is seen then this is
               handled by sleeping for a short period and retrying and/or trying
               other servers that might serve that volume.  A message is also
               displayed once until the condition has cleared.
      
           (c) If a VOFFLINE abort is seen, then this is handled as VBUSY for the
           	 moment.
      
           (d) If a VNOVOL abort is seen, the volume is rechecked in the VLDB to
           	 see if it has been deleted; if not, the fileserver is probably
           	 indicating that the volume couldn't be attached and needs
           	 salvaging.
      
           (e) If statfs() sees one of these aborts, it does not sleep, but
           	 rather returns an error, so as not to block the umount program.
      
       (5) The fileserver iteration functions in vnode.c are now merged into
           their callers and more heavily macroised around the cursor.  vnode.c
           is removed.
      
       (6) Operations on a particular vnode are serialised on that vnode because
           the server will lock that vnode whilst it operates on it, so a second
           op sent will just have to wait.
      
       (7) Fileservers are probed with FS.GetCapabilities before being used.
           This is where service upgrade will be done.
      
       (8) A callback interest on a fileserver is set up before an FS operation
           is performed and passed through to afs_make_call() so that it can be
           set on the vnode if the operation returns a callback.  The callback
           interest is passed through to afs_iget() also so that it can be set
           there too.
      
      In general, record updating is done on an as-needed basis when we try to
      access servers, volumes or vnodes rather than offloading it to work items
      and special threads.
      
      Notes:
      
       (1) Pre AFS-3.4 servers are no longer supported, though this can be added
           back if necessary (AFS-3.4 was released in 1998).
      
       (2) VBUSY is retried forever for the moment at intervals of 1s.
      
       (3) /proc/fs/afs/<cell>/servers no longer exists.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      d2ddc776
    • D
      afs: Add an address list concept · 8b2a464c
      David Howells 提交于
      Add an RCU replaceable address list structure to hold a list of server
      addresses.  The list also holds the
      
      To this end:
      
       (1) A cell's VL server address list can be loaded directly via insmod or
           echo to /proc/fs/afs/cells or dynamically from a DNS query for AFSDB
           or SRV records.
      
       (2) Anyone wanting to use a cell's VL server address must wait until the
           cell record comes online and has tried to obtain some addresses.
      
       (3) An FS server's address list, for the moment, has a single entry that
           is the key to the server list.  This will change in the future when a
           server is instead keyed on its UUID and the VL.GetAddrsU operation is
           used.
      
       (4) An 'address cursor' concept is introduced to handle iteration through
           the address list.  This is passed to the afs_make_call() as, in the
           future, stuff (such as abort code) that doesn't outlast the call will
           be returned in it.
      
      In the future, we might want to annotate the list with information about
      how each address fares.  We might then want to propagate such annotations
      over address list replacement.
      
      Whilst we're at it, we allow IPv6 addresses to be specified in
      colon-delimited lists by enclosing them in square brackets.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      8b2a464c
    • D
      afs: Overhaul permit caching · be080a6f
      David Howells 提交于
      Overhaul permit caching in AFS by making it per-vnode and sharing permit
      lists where possible.
      
      When most of the fileserver operations are called, they return a status
      structure indicating the (revised) details of the vnode or vnodes involved
      in the operation.  This includes the access mark derived from the ACL
      (named CallerAccess in the protocol definition file).  This is cacheable
      and if the ACL changes, the server will tell us that it is breaking the
      callback promise, at which point we can discard the currently cached
      permits.
      
      With this patch, the afs_permits structure has, at the end, an array of
      { key, CallerAccess } elements, sorted by key pointer.  This is then cached
      in a hash table so that it can be shared between vnodes with the same
      access permits.
      
      Permit lists can only be shared if they contain the exact same set of
      key->CallerAccess mappings.
      
      Note that that table is global rather than being per-net_ns.  If the keys
      in a permit list cross net_ns boundaries, there is no problem sharing the
      cached permits, since the permits are just integer masks.
      
      Since permit lists pin keys, the permit cache also makes it easier for a
      future patch to find all occurrences of a key and remove them by means of
      setting the afs_permits::invalidated flag and then clearing the appropriate
      key pointer.  In such an event, memory barriers will need adding.
      
      Lastly, the permit caching is skipped if the server has sent either a
      vnode-specific or an entire-server callback since the start of the
      operation.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      be080a6f
    • D
      afs: Overhaul the callback handling · c435ee34
      David Howells 提交于
      Overhaul the AFS callback handling by the following means:
      
       (1) Don't give up callback promises on vnodes that we are no longer using,
           rather let them just expire on the server or let the server break
           them.  This is actually more efficient for the server as the callback
           lookup is expensive if there are lots of extant callbacks.
      
       (2) Only give up the callback promises we have from a server when the
           server record is destroyed.  Then we can just give up *all* the
           callback promises on it in one go.
      
       (3) Servers can end up being shared between cells if cells are aliased, so
           don't add all the vnodes being backed by a particular server into a
           big FID-indexed tree on that server as there may be duplicates.
      
           Instead have each volume instance (~= superblock) register an interest
           in a server as it starts to make use of it and use this to allow the
           processor for callbacks from the server to find the superblock and
           thence the inode corresponding to the FID being broken by means of
           ilookup_nowait().
      
       (4) Rather than iterating over the entire callback list when a mass-break
           comes in from the server, maintain a counter of mass-breaks in
           afs_server (cb_seq) and make afs_validate() check it against the copy
           in afs_vnode.
      
           It would be nice not to have to take a read_lock whilst doing this,
           but that's tricky without using RCU.
      
       (5) Save a ref on the fileserver we're using for a call in the afs_call
           struct so that we can access its cb_s_break during call decoding.
      
       (6) Write-lock around callback and status storage in a vnode and read-lock
           around getattr so that we don't see the status mid-update.
      
      This has the following consequences:
      
       (1) Data invalidation isn't seen until someone calls afs_validate() on a
           vnode.  Unfortunately, we need to use a key to query the server, but
           getting one from a background thread is tricky without caching loads
           of keys all over the place.
      
       (2) Mass invalidation isn't seen until someone calls afs_validate().
      
       (3) Callback breaking is going to hit the inode_hash_lock quite a bit.
           Could this be replaced with rcu_read_lock() since inodes are destroyed
           under RCU conditions.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      c435ee34
    • D
      afs: Condense afs_call's reply{,2,3,4} into an array · 97e3043a
      David Howells 提交于
      Condense struct afs_call's reply anchor members - reply{,2,3,4} - into an
      array.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      97e3043a
    • D
      afs: Consolidate abort_to_error translators · f780c8ea
      David Howells 提交于
      The AFS abort code space is shared across all services, so there's no need
      for separate abort_to_error translators for each service.
      
      Consolidate them into a single function and remove the function pointers
      for them.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      f780c8ea
    • D
      afs: Keep and pass sockaddr_rxrpc addresses rather than in_addr · 4d9df986
      David Howells 提交于
      Keep and pass sockaddr_rxrpc addresses around rather than keeping and
      passing in_addr addresses to allow for the use of IPv6 and non-standard
      port numbers in future.
      
      This also allows the port and service_id fields to be removed from the
      afs_call struct.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      4d9df986
    • D
      afs: Lay the groundwork for supporting network namespaces · f044c884
      David Howells 提交于
      Lay the groundwork for supporting network namespaces (netns) to the AFS
      filesystem by moving various global features to a network-namespace struct
      (afs_net) and providing an instance of this as a temporary global variable
      that everything uses via accessor functions for the moment.
      
      The following changes have been made:
      
       (1) Store the netns in the superblock info.  This will be obtained from
           the mounter's nsproxy on a manual mount and inherited from the parent
           superblock on an automount.
      
       (2) The cell list is made per-netns.  It can be viewed through
           /proc/net/afs/cells and also be modified by writing commands to that
           file.
      
       (3) The local workstation cell is set per-ns in /proc/net/afs/rootcell.
           This is unset by default.
      
       (4) The 'rootcell' module parameter, which sets a cell and VL server list
           modifies the init net namespace, thereby allowing an AFS root fs to be
           theoretically used.
      
       (5) The volume location lists and the file lock manager are made
           per-netns.
      
       (6) The AF_RXRPC socket and associated I/O bits are made per-ns.
      
      The various workqueues remain global for the moment.
      
      Changes still to be made:
      
       (1) /proc/fs/afs/ should be moved to /proc/net/afs/ and a symlink emplaced
           from the old name.
      
       (2) A per-netns subsys needs to be registered for AFS into which it can
           store its per-netns data.
      
       (3) Rather than the AF_RXRPC socket being opened on module init, it needs
           to be opened on the creation of a superblock in that netns.
      
       (4) The socket needs to be closed when the last superblock using it is
           destroyed and all outstanding client calls on it have been completed.
           This prevents a reference loop on the namespace.
      
       (5) It is possible that several namespaces will want to use AFS, in which
           case each one will need its own UDP port.  These can either be set
           through /proc/net/afs/cm_port or the kernel can pick one at random.
           The init_ns gets 7001 by default.
      
      Other issues that need resolving:
      
       (1) The DNS keyring needs net-namespacing.
      
       (2) Where do upcalls go (eg. DNS request-key upcall)?
      
       (3) Need something like open_socket_in_file_ns() syscall so that AFS
           command line tools attempting to operate on an AFS file/volume have
           their RPC calls go to the right place.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      f044c884
  4. 17 3月, 2017 7 次提交
    • M
      afs: Populate and use client modification time · ab94f5d0
      Marc Dionne 提交于
      The inode timestamps should be set from the client time
      in the status received from the server, rather than the
      server time which is meant for internal server use.
      
      Set AFS_SET_MTIME and populate the mtime for operations
      that take an input status, such as file/dir creation
      and StoreData.  If an input time is not provided the
      server will set the vnode times based on the current server
      time.
      
      In a situation where the server has some skew with the
      client, this could lead to the client seeing a timestamp
      in the future for a file that it just created or wrote.
      Signed-off-by: NMarc Dionne <marc.dionne@auristor.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      ab94f5d0
    • D
      afs: Fix the maths in afs_fs_store_data() · 146a1192
      David Howells 提交于
      afs_fs_store_data() works out of the size of the write it's going to make,
      but it uses 32-bit unsigned subtraction in one place that gets
      automatically cast to loff_t.
      
      However, if to < offset, then the number goes negative, but as the result
      isn't signed, this doesn't get sign-extended to 64-bits when placed in a
      loff_t.
      
      Fix by casting the operands to loff_t.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      146a1192
    • D
      afs: Make struct afs_read::remain 64-bit · 6a0e3999
      David Howells 提交于
      Make struct afs_read::remain 64-bit so that it can handle huge transfers if
      we ever request them or the server decides to give us a bit extra data (the
      other fields there are already 64-bit).
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Tested-by: NMarc Dionne <marc.dionne@auristor.com>
      6a0e3999
    • D
      afs: Fix AFS read bug · 29f06985
      David Howells 提交于
      Fix a bug in AFS read whereby the request page afs_read::index isn't
      incremented after calling ->page_done() if ->remain reaches 0, indicating
      that the data read is complete.
      
      Without this a NULL pointer exception happens when ->page_done() is called
      twice for the last page because the page clearing loop will call it also
      and afs_readpages_page_done() clears the current entry in the page list.
      
      BUG: unable to handle kernel NULL pointer dereference at           (null)
      IP: afs_readpages_page_done+0x21/0xa4 [kafs]
      PGD 0
      Oops: 0002 [#1] SMP
      Modules linked in: kafs(E)
      CPU: 2 PID: 3002 Comm: md5sum Tainted: G            E   4.10.0-fscache #485
      Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
      task: ffff8804017d86c0 task.stack: ffff8803fc1d8000
      RIP: 0010:afs_readpages_page_done+0x21/0xa4 [kafs]
      RSP: 0018:ffff8803fc1db978 EFLAGS: 00010282
      RAX: ffff880405d39af8 RBX: 0000000000000000 RCX: ffff880407d83ed4
      RDX: 0000000000000000 RSI: ffff880405d39a00 RDI: ffff880405c6f400
      RBP: ffff8803fc1db988 R08: 0000000000000000 R09: 0000000000000001
      R10: ffff8803fc1db820 R11: ffff88040cf56000 R12: ffff8804088f1780
      R13: ffff8804017d86c0 R14: ffff8804088f1780 R15: 0000000000003840
      FS:  00007f8154469700(0000) GS:ffff88041fb00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000000 CR3: 00000004016ec000 CR4: 00000000001406e0
      Call Trace:
       afs_deliver_fs_fetch_data+0x5b9/0x60e [kafs]
       ? afs_make_call+0x316/0x4e8 [kafs]
       ? afs_make_call+0x359/0x4e8 [kafs]
       afs_deliver_to_call+0x173/0x2e8 [kafs]
       ? afs_make_call+0x316/0x4e8 [kafs]
       afs_make_call+0x37a/0x4e8 [kafs]
       ? wake_up_q+0x4f/0x4f
       ? __init_waitqueue_head+0x36/0x49
       afs_fs_fetch_data+0x21c/0x227 [kafs]
       ? afs_fs_fetch_data+0x21c/0x227 [kafs]
       afs_vnode_fetch_data+0xf3/0x1d2 [kafs]
       afs_readpages+0x314/0x3fd [kafs]
       __do_page_cache_readahead+0x208/0x2c5
       ondemand_readahead+0x3a2/0x3b7
       ? ondemand_readahead+0x3a2/0x3b7
       page_cache_async_readahead+0x5e/0x67
       generic_file_read_iter+0x23b/0x70c
       ? __inode_security_revalidate+0x2f/0x62
       __vfs_read+0xc4/0xe8
       vfs_read+0xd1/0x15a
       SyS_read+0x4c/0x89
       do_syscall_64+0x80/0x191
       entry_SYSCALL64_slow_path+0x25/0x25
      Reported-by: NMarc Dionne <marc.dionne@auristor.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Tested-by: NMarc Dionne <marc.dionne@auristor.com>
      29f06985
    • T
      afs: Prevent callback expiry timer overflow · 56e71431
      Tina Ruchandani 提交于
      get_seconds() returns real wall-clock seconds. On 32-bit systems
      this value will overflow in year 2038 and beyond. This patch changes
      afs_vnode record to use ktime_get_real_seconds() instead, for the
      fields cb_expires and cb_expires_at.
      Signed-off-by: NTina Ruchandani <ruchandani.tina@gmail.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      56e71431
    • D
      afs: Handle a short write to an AFS page · e8e581a8
      David Howells 提交于
      Handle the situation where afs_write_begin() is told to expect that a
      full-page write will be made, but this doesn't happen (EFAULT, CTRL-C,
      etc.), and so afs_write_end() sees a partial write took place.  Currently,
      no attempt is to deal with the discrepency.
      
      Fix this by loading the gap from the server.
      Reported-by: NAl Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      e8e581a8
    • D
      afs: Handle better the server returning excess or short data · 6db3ac3c
      David Howells 提交于
      When an AFS server is given an FS.FetchData{,64} request to read data from
      a file, it is permitted by the protocol to return more or less than was
      requested.  kafs currently relies on the latter behaviour in readpage{,s}
      to handle a partial page at the end of the file (we just ask for a whole
      page and clear space beyond the short read).
      
      However, we don't handle all cases.  Add:
      
       (1) Handle excess data by discarding it rather than aborting.  Note that
           we use a common static buffer to discard into so that the decryption
           algorithm advances the PCBC state.
      
       (2) Handle a short read that affects more than just the last page.
      
      Note that if a read comes up unexpectedly short of long, it's possible that
      the server's copy of the file changed - in which case the data version
      number will have been incremented and the callback will have been broken -
      in which case all the pages currently attached to the inode will be zapped
      anyway at some point.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      6db3ac3c
  5. 09 1月, 2017 1 次提交
    • D
      afs: Kill afs_wait_mode · 56ff9c83
      David Howells 提交于
      The afs_wait_mode struct isn't really necessary.  Client calls only use one
      of a choice of two (synchronous or the asynchronous) and incoming calls
      don't use the wait at all.  Replace with a boolean parameter.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      56ff9c83
  6. 07 1月, 2017 1 次提交
  7. 13 10月, 2016 1 次提交
  8. 02 9月, 2016 1 次提交
    • D
      rxrpc: Don't expose skbs to in-kernel users [ver #2] · d001648e
      David Howells 提交于
      Don't expose skbs to in-kernel users, such as the AFS filesystem, but
      instead provide a notification hook the indicates that a call needs
      attention and another that indicates that there's a new call to be
      collected.
      
      This makes the following possibilities more achievable:
      
       (1) Call refcounting can be made simpler if skbs don't hold refs to calls.
      
       (2) skbs referring to non-data events will be able to be freed much sooner
           rather than being queued for AFS to pick up as rxrpc_kernel_recv_data
           will be able to consult the call state.
      
       (3) We can shortcut the receive phase when a call is remotely aborted
           because we don't have to go through all the packets to get to the one
           cancelling the operation.
      
       (4) It makes it easier to do encryption/decryption directly between AFS's
           buffers and sk_buffs.
      
       (5) Encryption/decryption can more easily be done in the AFS's thread
           contexts - usually that of the userspace process that issued a syscall
           - rather than in one of rxrpc's background threads on a workqueue.
      
       (6) AFS will be able to wait synchronously on a call inside AF_RXRPC.
      
      To make this work, the following interface function has been added:
      
           int rxrpc_kernel_recv_data(
      		struct socket *sock, struct rxrpc_call *call,
      		void *buffer, size_t bufsize, size_t *_offset,
      		bool want_more, u32 *_abort_code);
      
      This is the recvmsg equivalent.  It allows the caller to find out about the
      state of a specific call and to transfer received data into a buffer
      piecemeal.
      
      afs_extract_data() and rxrpc_kernel_recv_data() now do all the extraction
      logic between them.  They don't wait synchronously yet because the socket
      lock needs to be dealt with.
      
      Five interface functions have been removed:
      
      	rxrpc_kernel_is_data_last()
          	rxrpc_kernel_get_abort_code()
          	rxrpc_kernel_get_error_number()
          	rxrpc_kernel_free_skb()
          	rxrpc_kernel_data_consumed()
      
      As a temporary hack, sk_buffs going to an in-kernel call are queued on the
      rxrpc_call struct (->knlrecv_queue) rather than being handed over to the
      in-kernel user.  To process the queue internally, a temporary function,
      temp_deliver_data() has been added.  This will be replaced with common code
      between the rxrpc_recvmsg() path and the kernel_rxrpc_recv_data() path in a
      future patch.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d001648e
  9. 06 8月, 2016 1 次提交
    • D
      rxrpc: Fix races between skb free, ACK generation and replying · 372ee163
      David Howells 提交于
      Inside the kafs filesystem it is possible to occasionally have a call
      processed and terminated before we've had a chance to check whether we need
      to clean up the rx queue for that call because afs_send_simple_reply() ends
      the call when it is done, but this is done in a workqueue item that might
      happen to run to completion before afs_deliver_to_call() completes.
      
      Further, it is possible for rxrpc_kernel_send_data() to be called to send a
      reply before the last request-phase data skb is released.  The rxrpc skb
      destructor is where the ACK processing is done and the call state is
      advanced upon release of the last skb.  ACK generation is also deferred to
      a work item because it's possible that the skb destructor is not called in
      a context where kernel_sendmsg() can be invoked.
      
      To this end, the following changes are made:
      
       (1) kernel_rxrpc_data_consumed() is added.  This should be called whenever
           an skb is emptied so as to crank the ACK and call states.  This does
           not release the skb, however.  kernel_rxrpc_free_skb() must now be
           called to achieve that.  These together replace
           rxrpc_kernel_data_delivered().
      
       (2) kernel_rxrpc_data_consumed() is wrapped by afs_data_consumed().
      
           This makes afs_deliver_to_call() easier to work as the skb can simply
           be discarded unconditionally here without trying to work out what the
           return value of the ->deliver() function means.
      
           The ->deliver() functions can, via afs_data_complete(),
           afs_transfer_reply() and afs_extract_data() mark that an skb has been
           consumed (thereby cranking the state) without the need to
           conditionally free the skb to make sure the state is correct on an
           incoming call for when the call processor tries to send the reply.
      
       (3) rxrpc_recvmsg() now has to call kernel_rxrpc_data_consumed() when it
           has finished with a packet and MSG_PEEK isn't set.
      
       (4) rxrpc_packet_destructor() no longer calls rxrpc_hard_ACK_data().
      
           Because of this, we no longer need to clear the destructor and put the
           call before we free the skb in cases where we don't want the ACK/call
           state to be cranked.
      
       (5) The ->deliver() call-type callbacks are made to return -EAGAIN rather
           than 0 if they expect more data (afs_extract_data() returns -EAGAIN to
           the delivery function already), and the caller is now responsible for
           producing an abort if that was the last packet.
      
       (6) There are many bits of unmarshalling code where:
      
       		ret = afs_extract_data(call, skb, last, ...);
      		switch (ret) {
      		case 0:		break;
      		case -EAGAIN:	return 0;
      		default:	return ret;
      		}
      
           is to be found.  As -EAGAIN can now be passed back to the caller, we
           now just return if ret < 0:
      
       		ret = afs_extract_data(call, skb, last, ...);
      		if (ret < 0)
      			return ret;
      
       (7) Checks for trailing data and empty final data packets has been
           consolidated as afs_data_complete().  So:
      
      		if (skb->len > 0)
      			return -EBADMSG;
      		if (!last)
      			return 0;
      
           becomes:
      
      		ret = afs_data_complete(call, skb, last);
      		if (ret < 0)
      			return ret;
      
       (8) afs_transfer_reply() now checks the amount of data it has against the
           amount of data desired and the amount of data in the skb and returns
           an error to induce an abort if we don't get exactly what we want.
      
      Without these changes, the following oops can occasionally be observed,
      particularly if some printks are inserted into the delivery path:
      
      general protection fault: 0000 [#1] SMP
      Modules linked in: kafs(E) af_rxrpc(E) [last unloaded: af_rxrpc]
      CPU: 0 PID: 1305 Comm: kworker/u8:3 Tainted: G            E   4.7.0-fsdevel+ #1303
      Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
      Workqueue: kafsd afs_async_workfn [kafs]
      task: ffff88040be041c0 ti: ffff88040c070000 task.ti: ffff88040c070000
      RIP: 0010:[<ffffffff8108fd3c>]  [<ffffffff8108fd3c>] __lock_acquire+0xcf/0x15a1
      RSP: 0018:ffff88040c073bc0  EFLAGS: 00010002
      RAX: 6b6b6b6b6b6b6b6b RBX: 0000000000000000 RCX: ffff88040d29a710
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88040d29a710
      RBP: ffff88040c073c70 R08: 0000000000000001 R09: 0000000000000001
      R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
      R13: 0000000000000000 R14: ffff88040be041c0 R15: ffffffff814c928f
      FS:  0000000000000000(0000) GS:ffff88041fa00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fa4595f4750 CR3: 0000000001c14000 CR4: 00000000001406f0
      Stack:
       0000000000000006 000000000be04930 0000000000000000 ffff880400000000
       ffff880400000000 ffffffff8108f847 ffff88040be041c0 ffffffff81050446
       ffff8803fc08a920 ffff8803fc08a958 ffff88040be041c0 ffff88040c073c38
      Call Trace:
       [<ffffffff8108f847>] ? mark_held_locks+0x5e/0x74
       [<ffffffff81050446>] ? __local_bh_enable_ip+0x9b/0xa1
       [<ffffffff8108f9ca>] ? trace_hardirqs_on_caller+0x16d/0x189
       [<ffffffff810915f4>] lock_acquire+0x122/0x1b6
       [<ffffffff810915f4>] ? lock_acquire+0x122/0x1b6
       [<ffffffff814c928f>] ? skb_dequeue+0x18/0x61
       [<ffffffff81609dbf>] _raw_spin_lock_irqsave+0x35/0x49
       [<ffffffff814c928f>] ? skb_dequeue+0x18/0x61
       [<ffffffff814c928f>] skb_dequeue+0x18/0x61
       [<ffffffffa009aa92>] afs_deliver_to_call+0x344/0x39d [kafs]
       [<ffffffffa009ab37>] afs_process_async_call+0x4c/0xd5 [kafs]
       [<ffffffffa0099e9c>] afs_async_workfn+0xe/0x10 [kafs]
       [<ffffffff81063a3a>] process_one_work+0x29d/0x57c
       [<ffffffff81064ac2>] worker_thread+0x24a/0x385
       [<ffffffff81064878>] ? rescuer_thread+0x2d0/0x2d0
       [<ffffffff810696f5>] kthread+0xf3/0xfb
       [<ffffffff8160a6ff>] ret_from_fork+0x1f/0x40
       [<ffffffff81069602>] ? kthread_create_on_node+0x1cf/0x1cf
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      372ee163
  10. 13 2月, 2013 1 次提交
    • E
      afs: Support interacting with multiple user namespaces · a0a5386a
      Eric W. Biederman 提交于
      Modify struct afs_file_status to store owner as a kuid_t and group as
      a kgid_t.
      
      In xdr_decode_AFSFetchStatus as owner is now a kuid_t and group is now
      a kgid_t don't use the EXTRACT macro.  Instead perform the work of
      the extract macro explicitly.  Read the value with ntohl and
      convert it to the appropriate type with make_kuid or make_kgid.
      Test if the value is different from what is stored in status and
      update changed.   Update the value in status.
      
      In xdr_encode_AFS_StoreStatus call from_kuid or from_kgid as
      we are computing the on the wire encoding.
      
      Initialize uids with GLOBAL_ROOT_UID instead of 0.
      Initialize gids with GLOBAL_ROOT_GID instead of 0.
      
      Cc: David Howells <dhowells@redhat.com>
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      a0a5386a
  11. 20 3月, 2012 1 次提交
  12. 02 11月, 2011 1 次提交
  13. 16 6月, 2011 1 次提交
  14. 30 3月, 2010 1 次提交
    • T
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6
      Tejun Heo 提交于
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities include those
      headers directly instead of assuming availability.  As this conversion
      needs to touch large number of source files, the following script is
      used as the basis of conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the followings.
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        blocks and try to put the new include such that its order conforms
        to its surrounding.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
        file.
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         wildly available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build test were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable easily on most builds of
      the specific arch.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Guess-its-ok-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      5a0e3ad6
  15. 17 7月, 2007 1 次提交
  16. 11 5月, 2007 2 次提交
  17. 10 5月, 2007 2 次提交
    • D
      AFS: implement basic file write support · 31143d5d
      David Howells 提交于
      Implement support for writing to regular AFS files, including:
      
       (1) write
      
       (2) truncate
      
       (3) fsync, fdatasync
      
       (4) chmod, chown, chgrp, utime.
      
      AFS writeback attempts to batch writes into as chunks as large as it can manage
      up to the point that it writes back 65535 pages in one chunk or it meets a
      locked page.
      
      Furthermore, if a page has been written to using a particular key, then should
      another write to that page use some other key, the first write will be flushed
      before the second is allowed to take place.  If the first write fails due to a
      security error, then the page will be scrapped and reread before the second
      write takes place.
      
      If a page is dirty and the callback on it is broken by the server, then the
      dirty data is not discarded (same behaviour as NFS).
      
      Shared-writable mappings are not supported by this patch.
      
      [akpm@linux-foundation.org: fix a bunch of warnings]
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      31143d5d
    • D
      AFS: AFS fixups · 416351f2
      David Howells 提交于
      Make some miscellaneous changes to the AFS filesystem:
      
       (1) Assert RCU barriers on module exit to make sure RCU has finished with
           callbacks in this module.
      
       (2) Correctly handle the AFS server returning a zero-length read.
      
       (3) Split out data zapping calls into one function (afs_zap_data).
      
       (4) Rename some afs_file_*() functions to afs_*() where they apply to
           non-regular files too.
      
       (5) Be consistent about the presentation of volume ID:vnode ID in debugging
           output.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      416351f2
  18. 03 5月, 2007 1 次提交
    • D
      [AFS/AF_RXRPC]: Miscellaneous fixes. · 80c72fe4
      David Howells 提交于
      Make miscellaneous fixes to AFS and AF_RXRPC:
      
       (*) Make AF_RXRPC select KEYS rather than RXKAD or AFS_FS in Kconfig.
      
       (*) Don't use FS_BINARY_MOUNTDATA.
      
       (*) Remove a done 'TODO' item in a comemnt on afs_get_sb().
      
       (*) Don't pass a void * as the page pointer argument of kmap_atomic() as this
           breaks on m68k.  Patch from Geert Uytterhoeven <geert@linux-m68k.org>.
      
       (*) Use match_*() functions rather than doing my own parsing.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      80c72fe4