1. 25 7月, 2018 1 次提交
    • K
      cachefiles: Fix refcounting bug in backing-file read monitoring · 934140ab
      Kiran Kumar Modukuri 提交于
      cachefiles_read_waiter() has the right to access a 'monitor' object by
      virtue of being called under the waitqueue lock for one of the pages in its
      purview.  However, it has no ref on that monitor object or on the
      associated operation.
      
      What it is allowed to do is to move the monitor object to the operation's
      to_do list, but once it drops the work_lock, it's actually no longer
      permitted to access that object.  However, it is trying to enqueue the
      retrieval operation for processing - but it can only do this via a pointer
      in the monitor object, something it shouldn't be doing.
      
      If it doesn't enqueue the operation, the operation may not get processed.
      If the order is flipped so that the enqueue is first, then it's possible
      for the work processor to look at the to_do list before the monitor is
      enqueued upon it.
      
      Fix this by getting a ref on the operation so that we can trust that it
      will still be there once we've added the monitor to the to_do list and
      dropped the work_lock.  The op can then be enqueued after the lock is
      dropped.
      
      The bug can manifest in one of a couple of ways.  The first manifestation
      looks like:
      
       FS-Cache:
       FS-Cache: Assertion failed
       FS-Cache: 6 == 5 is false
       ------------[ cut here ]------------
       kernel BUG at fs/fscache/operation.c:494!
       RIP: 0010:fscache_put_operation+0x1e3/0x1f0
       ...
       fscache_op_work_func+0x26/0x50
       process_one_work+0x131/0x290
       worker_thread+0x45/0x360
       kthread+0xf8/0x130
       ? create_worker+0x190/0x190
       ? kthread_cancel_work_sync+0x10/0x10
       ret_from_fork+0x1f/0x30
      
      This is due to the operation being in the DEAD state (6) rather than
      INITIALISED, COMPLETE or CANCELLED (5) because it's already passed through
      fscache_put_operation().
      
      The bug can also manifest like the following:
      
       kernel BUG at fs/fscache/operation.c:69!
       ...
          [exception RIP: fscache_enqueue_operation+246]
       ...
       #7 [ffff883fff083c10] fscache_enqueue_operation at ffffffffa0b793c6
       #8 [ffff883fff083c28] cachefiles_read_waiter at ffffffffa0b15a48
       #9 [ffff883fff083c48] __wake_up_common at ffffffff810af028
      
      I'm not entirely certain as to which is line 69 in Lei's kernel, so I'm not
      entirely clear which assertion failed.
      
      Fixes: 9ae326a6 ("CacheFiles: A cache that backs onto a mounted filesystem")
      Reported-by: NLei Xue <carmark.dlut@gmail.com>
      Reported-by: NVegard Nossum <vegard.nossum@gmail.com>
      Reported-by: NAnthony DeRobertis <aderobertis@metrics.net>
      Reported-by: NNeilBrown <neilb@suse.com>
      Reported-by: NDaniel Axtens <dja@axtens.net>
      Reported-by: NKiran Kumar Modukuri <kiran.modukuri@gmail.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Reviewed-by: NDaniel Axtens <dja@axtens.net>
      934140ab
  2. 04 4月, 2018 1 次提交
  3. 16 11月, 2017 2 次提交
    • M
      mm: remove __GFP_COLD · 453f85d4
      Mel Gorman 提交于
      As the page free path makes no distinction between cache hot and cold
      pages, there is no real useful ordering of pages in the free list that
      allocation requests can take advantage of.  Juding from the users of
      __GFP_COLD, it is likely that a number of them are the result of copying
      other sites instead of actually measuring the impact.  Remove the
      __GFP_COLD parameter which simplifies a number of paths in the page
      allocator.
      
      This is potentially controversial but bear in mind that the size of the
      per-cpu pagelists versus modern cache sizes means that the whole per-cpu
      list can often fit in the L3 cache.  Hence, there is only a potential
      benefit for microbenchmarks that alloc/free pages in a tight loop.  It's
      even worse when THP is taken into account which has little or no chance
      of getting a cache-hot page as the per-cpu list is bypassed and the
      zeroing of multiple pages will thrash the cache anyway.
      
      The truncate microbenchmarks are not shown as this patch affects the
      allocation path and not the free path.  A page fault microbenchmark was
      tested but it showed no sigificant difference which is not surprising
      given that the __GFP_COLD branches are a miniscule percentage of the
      fault path.
      
      Link: http://lkml.kernel.org/r/20171018075952.10627-9-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      453f85d4
    • M
      mm, pagevec: remove cold parameter for pagevecs · 86679820
      Mel Gorman 提交于
      Every pagevec_init user claims the pages being released are hot even in
      cases where it is unlikely the pages are hot.  As no one cares about the
      hotness of pages being released to the allocator, just ditch the
      parameter.
      
      No performance impact is expected as the overhead is marginal.  The
      parameter is removed simply because it is a bit stupid to have a useless
      parameter copied everywhere.
      
      Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      86679820
  4. 20 6月, 2017 2 次提交
    • I
      sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming · 2055da97
      Ingo Molnar 提交于
      So I've noticed a number of instances where it was not obvious from the
      code whether ->task_list was for a wait-queue head or a wait-queue entry.
      
      Furthermore, there's a number of wait-queue users where the lists are
      not for 'tasks' but other entities (poll tables, etc.), in which case
      the 'task_list' name is actively confusing.
      
      To clear this all up, name the wait-queue head and entry list structure
      fields unambiguously:
      
      	struct wait_queue_head::task_list	=> ::head
      	struct wait_queue_entry::task_list	=> ::entry
      
      For example, this code:
      
      	rqw->wait.task_list.next != &wait->task_list
      
      ... is was pretty unclear (to me) what it's doing, while now it's written this way:
      
      	rqw->wait.head.next != &wait->entry
      
      ... which makes it pretty clear that we are iterating a list until we see the head.
      
      Other examples are:
      
      	list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
      	list_for_each_entry(wq, &fence->wait.task_list, task_list) {
      
      ... where it's unclear (to me) what we are iterating, and during review it's
      hard to tell whether it's trying to walk a wait-queue entry (which would be
      a bug), while now it's written as:
      
      	list_for_each_entry_safe(pos, next, &x->head, entry) {
      	list_for_each_entry(wq, &fence->wait.head, entry) {
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      2055da97
    • I
      sched/wait: Rename wait_queue_t => wait_queue_entry_t · ac6424b9
      Ingo Molnar 提交于
      Rename:
      
      	wait_queue_t		=>	wait_queue_entry_t
      
      'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
      but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
      which had to carry the name.
      
      Start sorting this out by renaming it to 'wait_queue_entry_t'.
      
      This also allows the real structure name 'struct __wait_queue' to
      lose its double underscore and become 'struct wait_queue_entry',
      which is the more canonical nomenclature for such data types.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      ac6424b9
  5. 05 4月, 2016 1 次提交
    • K
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov 提交于
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  6. 17 11月, 2015 1 次提交
  7. 11 11月, 2015 2 次提交
    • D
      FS-Cache: Handle a write to the page immediately beyond the EOF marker · 102f4d90
      David Howells 提交于
      Handle a write being requested to the page immediately beyond the EOF
      marker on a cache object.  Currently this gets an assertion failure in
      CacheFiles because the EOF marker is used there to encode information about
      a partial page at the EOF - which could lead to an unknown blank spot in
      the file if we extend the file over it.
      
      The problem is actually in fscache where we check the index of the page
      being written against store_limit.  store_limit is set to the number of
      pages that we're allowed to store by fscache_set_store_limit() - which
      means it's one more than the index of the last page we're allowed to store.
      The problem is that we permit writing to a page with an index _equal_ to
      the store limit - when we should reject that case.
      
      Whilst we're at it, change the triggered assertion in CacheFiles to just
      return -ENOBUFS instead.
      
      The assertion failure looks something like this:
      
      CacheFiles: Assertion failed
      1000 < 7b1 is false
      ------------[ cut here ]------------
      kernel BUG at fs/cachefiles/rdwr.c:962!
      ...
      RIP: 0010:[<ffffffffa02c9e83>]  [<ffffffffa02c9e83>] cachefiles_write_page+0x273/0x2d0 [cachefiles]
      
      Cc: stable@vger.kernel.org # v2.6.31+; earlier - that + backport of a17754fb (at least)
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      102f4d90
    • N
      cachefiles: perform test on s_blocksize when opening cache file. · 95201a40
      NeilBrown 提交于
      cachefiles requires that s_blocksize in the cache is not greater than
      PAGE_SIZE, and performs the check every time a block is accessed.
      
      Move the test to the place where the file is "opened", where other
      file-validity tests are performed.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      95201a40
  8. 16 4月, 2015 1 次提交
  9. 23 2月, 2015 1 次提交
  10. 09 10月, 2014 1 次提交
  11. 18 9月, 2014 1 次提交
  12. 04 4月, 2014 1 次提交
    • J
      fs: cachefiles: use add_to_page_cache_lru() · 55881bc7
      Johannes Weiner 提交于
      This code used to have its own lru cache pagevec up until a0b8cab3 ("mm:
      remove lru parameter from __pagevec_lru_add and remove parts of pagevec
      API").  Now it's just add_to_page_cache() followed by lru_cache_add(),
      might as well use add_to_page_cache_lru() directly.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      55881bc7
  13. 04 7月, 2013 1 次提交
  14. 10 4月, 2013 1 次提交
  15. 21 12月, 2012 6 次提交
    • D
      CacheFiles: Add missing retrieval completions · b4cf1e08
      David Howells 提交于
      CacheFiles is missing some calls to fscache_retrieval_complete() in the error
      handling/collision paths of its reader functions.
      
      This can be seen by the following assertion tripping in fscache_put_operation()
      whereby the operation being destroyed is still in the in-progress state and has
      not been cancelled or completed:
      
      FS-Cache: Assertion failed
      3 == 5 is false
      ------------[ cut here ]------------
      kernel BUG at fs/fscache/operation.c:408!
      invalid opcode: 0000 [#1] SMP
      CPU 2
      Modules linked in: xfs ioatdma dca loop joydev evdev
      psmouse dcdbas pcspkr serio_raw i5000_edac edac_core i5k_amb shpchp
      pci_hotplug sg sr_mod]
      
      Pid: 8062, comm: httpd Not tainted 3.1.0-rc8 #1 Dell Inc. PowerEdge 1950/0DT097
      RIP: 0010:[<ffffffff81197b24>]  [<ffffffff81197b24>] fscache_put_operation+0x304/0x330
      RSP: 0018:ffff880062f739d8  EFLAGS: 00010296
      RAX: 0000000000000025 RBX: ffff8800c5122e84 RCX: ffffffff81ddf040
      RDX: 00000000ffffffff RSI: 0000000000000082 RDI: ffffffff81ddef30
      RBP: ffff880062f739f8 R08: 0000000000000005 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000003 R12: ffff8800c5122e40
      R13: ffff880037a2cd20 R14: ffff880087c7a058 R15: ffff880087c7a000
      FS:  00007f63dcf636e0(0000) GS:ffff88022fc80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f0c0a91f000 CR3: 0000000062ec2000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process httpd (pid: 8062, threadinfo ffff880062f72000, task ffff880087e58000)
      Stack:
       ffff880062f73bf8 0000000000000000 ffff880062f73bf8 ffff880037a2cd20
       ffff880062f73a68 ffffffff8119aa7e ffff88006540e000 ffff880062f73ad4
       ffff88008e9a4308 ffff880037a2cd20 ffff880062f73a48 ffff8800c5122e40
      Call Trace:
       [<ffffffff8119aa7e>] __fscache_read_or_alloc_pages+0x1fe/0x530
       [<ffffffff81250780>] __nfs_readpages_from_fscache+0x70/0x1c0
       [<ffffffff8123142a>] nfs_readpages+0xca/0x1e0
       [<ffffffff815f3c06>] ? rpc_do_put_task+0x36/0x50
       [<ffffffff8122755b>] ? alloc_nfs_open_context+0x4b/0x110
       [<ffffffff815ecd1a>] ? rpc_call_sync+0x5a/0x70
       [<ffffffff810e7e9a>] __do_page_cache_readahead+0x1ca/0x270
       [<ffffffff810e7f61>] ra_submit+0x21/0x30
       [<ffffffff810e818d>] ondemand_readahead+0x11d/0x250
       [<ffffffff810e83b6>] page_cache_sync_readahead+0x36/0x60
       [<ffffffff810dffa4>] generic_file_aio_read+0x454/0x770
       [<ffffffff81224ce1>] nfs_file_read+0xe1/0x130
       [<ffffffff81121bd9>] do_sync_read+0xd9/0x120
       [<ffffffff8114088f>] ? mntput+0x1f/0x40
       [<ffffffff811238cb>] ? fput+0x1cb/0x260
       [<ffffffff81122938>] vfs_read+0xc8/0x180
       [<ffffffff81122af5>] sys_read+0x55/0x90
      Reported-by: NMark Moseley <moseleymark@gmail.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      b4cf1e08
    • D
      CacheFiles: Implement invalidation · 9dc8d9bf
      David Howells 提交于
      Implement invalidation for CacheFiles.  This is in two parts:
      
       (1) Provide an invalidation method (which just truncates the backing file).
      
       (2) Abort attempts to copy anything read from the backing file whilst
           invalidation is in progress.
      
      Question: CacheFiles uses truncation in a couple of places.  It has been using
      notify_change() rather than sys_truncate() or something similar.  This means
      it bypasses a bunch of checks and suchlike that it possibly should be making
      (security, file locking, lease breaking, vfsmount write).  Should it be using
      vfs_truncate() as added by a preceding patch or should it use notify_write()
      and assume that anyone poking around in the cache files on disk gets
      everything they deserve?
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      9dc8d9bf
    • D
      FS-Cache: Fix operation state management and accounting · 9f10523f
      David Howells 提交于
      Fix the state management of internal fscache operations and the accounting of
      what operations are in what states.
      
      This is done by:
      
       (1) Give struct fscache_operation a enum variable that directly represents the
           state it's currently in, rather than spreading this knowledge over a bunch
           of flags, who's processing the operation at the moment and whether it is
           queued or not.
      
           This makes it easier to write assertions to check the state at various
           points and to prevent invalid state transitions.
      
       (2) Add an 'operation complete' state and supply a function to indicate the
           completion of an operation (fscache_op_complete()) and make things call
           it.  The final call to fscache_put_operation() can then check that an op
           in the appropriate state (complete or cancelled).
      
       (3) Adjust the use of object->n_ops, ->n_in_progress, ->n_exclusive to better
           govern the state of an object:
      
      	(a) The ->n_ops is now the number of extant operations on the object
      	    and is now decremented by fscache_put_operation() only.
      
      	(b) The ->n_in_progress is simply the number of objects that have been
      	    taken off of the object's pending queue for the purposes of being
      	    run.  This is decremented by fscache_op_complete() only.
      
      	(c) The ->n_exclusive is the number of exclusive ops that have been
      	    submitted and queued or are in progress.  It is decremented by
      	    fscache_op_complete() and by fscache_cancel_op().
      
           fscache_put_operation() and fscache_operation_gc() now no longer try to
           clean up ->n_exclusive and ->n_in_progress.  That was leading to double
           decrements against fscache_cancel_op().
      
           fscache_cancel_op() now no longer decrements ->n_ops.  That was leading to
           double decrements against fscache_put_operation().
      
           fscache_submit_exclusive_op() now decides whether it has to queue an op
           based on ->n_in_progress being > 0 rather than ->n_ops > 0 as the latter
           will persist in being true even after all preceding operations have been
           cancelled or completed.  Furthermore, if an object is active and there are
           runnable ops against it, there must be at least one op running.
      
       (4) Add a remaining-pages counter (n_pages) to struct fscache_retrieval and
           provide a function to record completion of the pages as they complete.
      
           When n_pages reaches 0, the operation is deemed to be complete and
           fscache_op_complete() is called.
      
           Add calls to fscache_retrieval_complete() anywhere we've finished with a
           page we've been given to read or allocate for.  This includes places where
           we just return pages to the netfs for reading from the server and where
           accessing the cache fails and we discard the proposed netfs page.
      
      The bugs in the unfixed state management manifest themselves as oopses like the
      following where the operation completion gets out of sync with return of the
      cookie by the netfs.  This is possible because the cache unlocks and returns
      all the netfs pages before recording its completion - which means that there's
      nothing to stop the netfs discarding them and returning the cookie.
      
      
      FS-Cache: Cookie 'NFS.fh' still has outstanding reads
      ------------[ cut here ]------------
      kernel BUG at fs/fscache/cookie.c:519!
      invalid opcode: 0000 [#1] SMP
      CPU 1
      Modules linked in: cachefiles nfs fscache auth_rpcgss nfs_acl lockd sunrpc
      
      Pid: 400, comm: kswapd0 Not tainted 3.1.0-rc7-fsdevel+ #1090                  /DG965RY
      RIP: 0010:[<ffffffffa007050a>]  [<ffffffffa007050a>] __fscache_relinquish_cookie+0x170/0x343 [fscache]
      RSP: 0018:ffff8800368cfb00  EFLAGS: 00010282
      RAX: 000000000000003c RBX: ffff880023cc8790 RCX: 0000000000000000
      RDX: 0000000000002f2e RSI: 0000000000000001 RDI: ffffffff813ab86c
      RBP: ffff8800368cfb50 R08: 0000000000000002 R09: 0000000000000000
      R10: ffff88003a1b7890 R11: ffff88001df6e488 R12: ffff880023d8ed98
      R13: ffff880023cc8798 R14: 0000000000000004 R15: ffff88003b8bf370
      FS:  0000000000000000(0000) GS:ffff88003bd00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 00000000008ba008 CR3: 0000000023d93000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process kswapd0 (pid: 400, threadinfo ffff8800368ce000, task ffff88003b8bf040)
      Stack:
       ffff88003b8bf040 ffff88001df6e528 ffff88001df6e528 ffffffffa00b46b0
       ffff88003b8bf040 ffff88001df6e488 ffff88001df6e620 ffffffffa00b46b0
       ffff88001ebd04c8 0000000000000004 ffff8800368cfb70 ffffffffa00b2c91
      Call Trace:
       [<ffffffffa00b2c91>] nfs_fscache_release_inode_cookie+0x3b/0x47 [nfs]
       [<ffffffffa008f25f>] nfs_clear_inode+0x3c/0x41 [nfs]
       [<ffffffffa0090df1>] nfs4_evict_inode+0x2f/0x33 [nfs]
       [<ffffffff810d8d47>] evict+0xa1/0x15c
       [<ffffffff810d8e2e>] dispose_list+0x2c/0x38
       [<ffffffff810d9ebd>] prune_icache_sb+0x28c/0x29b
       [<ffffffff810c56b7>] prune_super+0xd5/0x140
       [<ffffffff8109b615>] shrink_slab+0x102/0x1ab
       [<ffffffff8109d690>] balance_pgdat+0x2f2/0x595
       [<ffffffff8103e009>] ? process_timeout+0xb/0xb
       [<ffffffff8109dba3>] kswapd+0x270/0x289
       [<ffffffff8104c5ea>] ? __init_waitqueue_head+0x46/0x46
       [<ffffffff8109d933>] ? balance_pgdat+0x595/0x595
       [<ffffffff8104bf7a>] kthread+0x7f/0x87
       [<ffffffff813ad6b4>] kernel_thread_helper+0x4/0x10
       [<ffffffff81026b98>] ? finish_task_switch+0x45/0xc0
       [<ffffffff813abcdd>] ? retint_restore_args+0xe/0xe
       [<ffffffff8104befb>] ? __init_kthread_worker+0x53/0x53
       [<ffffffff813ad6b0>] ? gs_change+0xb/0xb
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      9f10523f
    • D
      CacheFiles: Make some debugging statements conditional · 37491a13
      David Howells 提交于
      Downgrade some debugging statements to not unconditionally print stuff, but
      rather be conditional on the appropriate module parameter setting.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      37491a13
    • D
      CacheFiles: Downgrade the requirements passed to the allocator · 5f4f9f4a
      David Howells 提交于
      Downgrade the requirements passed to the allocator in the gfp flags parameter.
      FS-Cache/CacheFiles can handle OOM conditions simply by aborting the attempt to
      store an object or a page in the cache.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      5f4f9f4a
    • D
      CacheFiles: Fix the marking of cached pages · c4d6d8db
      David Howells 提交于
      Under some circumstances CacheFiles defers the marking of pages with PG_fscache
      so that it can take advantage of pagevecs to reduce the number of calls to
      fscache_mark_pages_cached() and the netfs's hook to keep track of this.
      
      There are, however, two problems with this:
      
       (1) It can lead to the PG_fscache mark being applied _after_ the page is set
           PG_uptodate and unlocked (by the call to fscache_end_io()).
      
       (2) CacheFiles's ref on the page is dropped immediately following
           fscache_end_io() - and so may not still be held when the mark is applied.
           This can lead to the page being passed back to the allocator before the
           mark is applied.
      
      Fix this by, where appropriate, marking the page before calling
      fscache_end_io() and releasing the page.  This means that we can't take
      advantage of pagevecs and have to make a separate call for each page to the
      marking routines.
      
      The symptoms of this are Bad Page state errors cropping up under memory
      pressure, for example:
      
      BUG: Bad page state in process tar  pfn:002da
      page:ffffea0000009fb0 count:0 mapcount:0 mapping:          (null) index:0x1447
      page flags: 0x1000(private_2)
      Pid: 4574, comm: tar Tainted: G        W   3.1.0-rc4-fsdevel+ #1064
      Call Trace:
       [<ffffffff8109583c>] ? dump_page+0xb9/0xbe
       [<ffffffff81095916>] bad_page+0xd5/0xea
       [<ffffffff81095d82>] get_page_from_freelist+0x35b/0x46a
       [<ffffffff810961f3>] __alloc_pages_nodemask+0x362/0x662
       [<ffffffff810989da>] __do_page_cache_readahead+0x13a/0x267
       [<ffffffff81098942>] ? __do_page_cache_readahead+0xa2/0x267
       [<ffffffff81098d7b>] ra_submit+0x1c/0x20
       [<ffffffff8109900a>] ondemand_readahead+0x28b/0x29a
       [<ffffffff81098ee2>] ? ondemand_readahead+0x163/0x29a
       [<ffffffff810990ce>] page_cache_sync_readahead+0x38/0x3a
       [<ffffffff81091d8a>] generic_file_aio_read+0x2ab/0x67e
       [<ffffffffa008cfbe>] nfs_file_read+0xa4/0xc9 [nfs]
       [<ffffffff810c22c4>] do_sync_read+0xba/0xfa
       [<ffffffff81177a47>] ? security_file_permission+0x7b/0x84
       [<ffffffff810c25dd>] ? rw_verify_area+0xab/0xc8
       [<ffffffff810c29a4>] vfs_read+0xaa/0x13a
       [<ffffffff810c2a79>] sys_read+0x45/0x6c
       [<ffffffff813ac37b>] system_call_fastpath+0x16/0x1b
      
      As can be seen, PG_private_2 (== PG_fscache) is set in the page flags.
      
      Instrumenting fscache_mark_pages_cached() to verify whether page->mapping was
      set appropriately showed that sometimes it wasn't.  This led to the discovery
      that sometimes the page has apparently been reclaimed by the time the marker
      got to see it.
      Reported-by: NM. Stevens <m@tippett.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Reviewed-by: NJeff Layton <jlayton@redhat.com>
      c4d6d8db
  16. 31 7月, 2012 1 次提交
  17. 23 7月, 2012 1 次提交
  18. 23 7月, 2010 1 次提交
    • T
      fscache: convert operation to use workqueue instead of slow-work · 8af7c124
      Tejun Heo 提交于
      Make fscache operation to use only workqueue instead of combination of
      workqueue and slow-work.  FSCACHE_OP_SLOW is dropped and
      FSCACHE_OP_FAST is renamed to FSCACHE_OP_ASYNC and uses newly added
      fscache_op_wq workqueue to execute op->processor().
      fscache_operation_init_slow() is dropped and fscache_operation_init()
      now takes @processor argument directly.
      
      * Unbound workqueue is used.
      
      * fscache_retrieval_work() is no longer necessary as OP_ASYNC now does
        the equivalent thing.
      
      * sysctl fscache.operation_max_active added to control concurrency.
        The default value is nr_cpus clamped between 2 and
        WQ_UNBOUND_MAX_ACTIVE.
      
      * debugfs support is dropped for now.  Tracing API based debug
        facility is planned to be added.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NDavid Howells <dhowells@redhat.com>
      8af7c124
  19. 30 3月, 2010 1 次提交
    • T
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6
      Tejun Heo 提交于
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities include those
      headers directly instead of assuming availability.  As this conversion
      needs to touch large number of source files, the following script is
      used as the basis of conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the followings.
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        blocks and try to put the new include such that its order conforms
        to its surrounding.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
        file.
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         wildly available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build test were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable easily on most builds of
      the specific arch.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Guess-its-ok-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      5a0e3ad6
  20. 17 12月, 2009 1 次提交
  21. 01 12月, 2009 1 次提交
  22. 20 11月, 2009 3 次提交
    • D
      CacheFiles: Handle truncate unlocking the page we're reading · 5e929b33
      David Howells 提交于
      Handle truncate unlocking the page we're attempting to read from the backing
      device before the read has completed.
      
      This was causing reports like the following to occur:
      
      	Pid: 4765, comm: kslowd Not tainted 2.6.30.1 #1
      	Call Trace:
      	 [<ffffffffa0331d7a>] ? cachefiles_read_waiter+0xd9/0x147 [cachefiles]
      	 [<ffffffff804b74bd>] ? __wait_on_bit+0x60/0x6f
      	 [<ffffffff8022bbbb>] ? __wake_up_common+0x3f/0x71
      	 [<ffffffff8022cc32>] ? __wake_up+0x30/0x44
      	 [<ffffffff8024a41f>] ? __wake_up_bit+0x28/0x2d
      	 [<ffffffffa003a793>] ? ext3_truncate+0x4d7/0x8ed [ext3]
      	 [<ffffffff80281f90>] ? pagevec_lookup+0x17/0x1f
      	 [<ffffffff8028c2ff>] ? unmap_mapping_range+0x59/0x1ff
      	 [<ffffffff8022cc32>] ? __wake_up+0x30/0x44
      	 [<ffffffff8028e286>] ? vmtruncate+0xc2/0xe2
      	 [<ffffffff802b82cf>] ? inode_setattr+0x22/0x10a
      	 [<ffffffffa003baa5>] ? ext3_setattr+0x17b/0x1e6 [ext3]
      	 [<ffffffff802b853d>] ? notify_change+0x186/0x2c9
      	 [<ffffffffa032d9de>] ? cachefiles_attr_changed+0x133/0x1cd [cachefiles]
      	 [<ffffffffa032df7f>] ? cachefiles_lookup_object+0xcf/0x12a [cachefiles]
      	 [<ffffffffa0318165>] ? fscache_lookup_object+0x110/0x122 [fscache]
      	 [<ffffffffa03188c3>] ? fscache_object_slow_work_execute+0x590/0x6bc
      	[fscache]
      	 [<ffffffff80278f82>] ? slow_work_thread+0x285/0x43a
      	 [<ffffffff8024a446>] ? autoremove_wake_function+0x0/0x2e
      	 [<ffffffff80278cfd>] ? slow_work_thread+0x0/0x43a
      	 [<ffffffff8024a317>] ? kthread+0x54/0x81
      	 [<ffffffff8020c93a>] ? child_rip+0xa/0x20
      	 [<ffffffff8024a2c3>] ? kthread+0x0/0x81
      	 [<ffffffff8020c930>] ? child_rip+0x0/0x20
      	CacheFiles: I/O Error: Readpage failed on backing file 200000000000810
      	FS-Cache: Cache cachefiles stopped due to I/O error
      Reported-by: NChristian Kujau <lists@nerdbynature.de>
      Reported-by: NTakashi Iwai <tiwai@suse.de>
      Reported-by: NDuc Le Minh <duclm.vn@gmail.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      5e929b33
    • D
      CacheFiles: Don't write a full page if there's only a partial page to cache · a17754fb
      David Howells 提交于
      cachefiles_write_page() writes a full page to the backing file for the last
      page of the netfs file, even if the netfs file's last page is only a partial
      page.
      
      This causes the EOF on the backing file to be extended beyond the EOF of the
      netfs, and thus the backing file will be truncated by cachefiles_attr_changed()
      called from cachefiles_lookup_object().
      
      So we need to limit the write we make to the backing file on that last page
      such that it doesn't push the EOF too far.
      
      Also, if a backing file that has a partial page at the end is expanded, we
      discard the partial page and refetch it on the basis that we then have a hole
      in the file with invalid data, and should the power go out...  A better way to
      deal with this could be to record a note that the partial page contains invalid
      data until the correct data is written into it.
      
      This isn't a problem for netfs's that discard the whole backing file if the
      file size changes (such as NFS).
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      a17754fb
    • D
      FS-Cache: Allow the current state of all objects to be dumped · 4fbf4291
      David Howells 提交于
      Allow the current state of all fscache objects to be dumped by doing:
      
      	cat /proc/fs/fscache/objects
      
      By default, all objects and all fields will be shown.  This can be restricted
      by adding a suitable key to one of the caller's keyrings (such as the session
      keyring):
      
      	keyctl add user fscache:objlist "<restrictions>" @s
      
      The <restrictions> are:
      
      	K	Show hexdump of object key (don't show if not given)
      	A	Show hexdump of object aux data (don't show if not given)
      
      And paired restrictions:
      
      	C	Show objects that have a cookie
      	c	Show objects that don't have a cookie
      	B	Show objects that are busy
      	b	Show objects that aren't busy
      	W	Show objects that have pending writes
      	w	Show objects that don't have pending writes
      	R	Show objects that have outstanding reads
      	r	Show objects that don't have outstanding reads
      	S	Show objects that have slow work queued
      	s	Show objects that don't have slow work queued
      
      If neither side of a restriction pair is given, then both are implied.  For
      example:
      
      	keyctl add user fscache:objlist KB @s
      
      shows objects that are busy, and lists their object keys, but does not dump
      their auxiliary data.  It also implies "CcWwRrSs", but as 'B' is given, 'b' is
      not implied.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      4fbf4291
  23. 03 4月, 2009 1 次提交
    • D
      CacheFiles: A cache that backs onto a mounted filesystem · 9ae326a6
      David Howells 提交于
      Add an FS-Cache cache-backend that permits a mounted filesystem to be used as a
      backing store for the cache.
      
      CacheFiles uses a userspace daemon to do some of the cache management - such as
      reaping stale nodes and culling.  This is called cachefilesd and lives in
      /sbin.  The source for the daemon can be downloaded from:
      
      	http://people.redhat.com/~dhowells/cachefs/cachefilesd.c
      
      And an example configuration from:
      
      	http://people.redhat.com/~dhowells/cachefs/cachefilesd.conf
      
      The filesystem and data integrity of the cache are only as good as those of the
      filesystem providing the backing services.  Note that CacheFiles does not
      attempt to journal anything since the journalling interfaces of the various
      filesystems are very specific in nature.
      
      CacheFiles creates a misc character device - "/dev/cachefiles" - that is used
      to communication with the daemon.  Only one thing may have this open at once,
      and whilst it is open, a cache is at least partially in existence.  The daemon
      opens this and sends commands down it to control the cache.
      
      CacheFiles is currently limited to a single cache.
      
      CacheFiles attempts to maintain at least a certain percentage of free space on
      the filesystem, shrinking the cache by culling the objects it contains to make
      space if necessary - see the "Cache Culling" section.  This means it can be
      placed on the same medium as a live set of data, and will expand to make use of
      spare space and automatically contract when the set of data requires more
      space.
      
      ============
      REQUIREMENTS
      ============
      
      The use of CacheFiles and its daemon requires the following features to be
      available in the system and in the cache filesystem:
      
      	- dnotify.
      
      	- extended attributes (xattrs).
      
      	- openat() and friends.
      
      	- bmap() support on files in the filesystem (FIBMAP ioctl).
      
      	- The use of bmap() to detect a partial page at the end of the file.
      
      It is strongly recommended that the "dir_index" option is enabled on Ext3
      filesystems being used as a cache.
      
      =============
      CONFIGURATION
      =============
      
      The cache is configured by a script in /etc/cachefilesd.conf.  These commands
      set up cache ready for use.  The following script commands are available:
      
       (*) brun <N>%
       (*) bcull <N>%
       (*) bstop <N>%
       (*) frun <N>%
       (*) fcull <N>%
       (*) fstop <N>%
      
      	Configure the culling limits.  Optional.  See the section on culling
      	The defaults are 7% (run), 5% (cull) and 1% (stop) respectively.
      
      	The commands beginning with a 'b' are file space (block) limits, those
      	beginning with an 'f' are file count limits.
      
       (*) dir <path>
      
      	Specify the directory containing the root of the cache.  Mandatory.
      
       (*) tag <name>
      
      	Specify a tag to FS-Cache to use in distinguishing multiple caches.
      	Optional.  The default is "CacheFiles".
      
       (*) debug <mask>
      
      	Specify a numeric bitmask to control debugging in the kernel module.
      	Optional.  The default is zero (all off).  The following values can be
      	OR'd into the mask to collect various information:
      
      		1	Turn on trace of function entry (_enter() macros)
      		2	Turn on trace of function exit (_leave() macros)
      		4	Turn on trace of internal debug points (_debug())
      
      	This mask can also be set through sysfs, eg:
      
      		echo 5 >/sys/modules/cachefiles/parameters/debug
      
      ==================
      STARTING THE CACHE
      ==================
      
      The cache is started by running the daemon.  The daemon opens the cache device,
      configures the cache and tells it to begin caching.  At that point the cache
      binds to fscache and the cache becomes live.
      
      The daemon is run as follows:
      
      	/sbin/cachefilesd [-d]* [-s] [-n] [-f <configfile>]
      
      The flags are:
      
       (*) -d
      
      	Increase the debugging level.  This can be specified multiple times and
      	is cumulative with itself.
      
       (*) -s
      
      	Send messages to stderr instead of syslog.
      
       (*) -n
      
      	Don't daemonise and go into background.
      
       (*) -f <configfile>
      
      	Use an alternative configuration file rather than the default one.
      
      ===============
      THINGS TO AVOID
      ===============
      
      Do not mount other things within the cache as this will cause problems.  The
      kernel module contains its own very cut-down path walking facility that ignores
      mountpoints, but the daemon can't avoid them.
      
      Do not create, rename or unlink files and directories in the cache whilst the
      cache is active, as this may cause the state to become uncertain.
      
      Renaming files in the cache might make objects appear to be other objects (the
      filename is part of the lookup key).
      
      Do not change or remove the extended attributes attached to cache files by the
      cache as this will cause the cache state management to get confused.
      
      Do not create files or directories in the cache, lest the cache get confused or
      serve incorrect data.
      
      Do not chmod files in the cache.  The module creates things with minimal
      permissions to prevent random users being able to access them directly.
      
      =============
      CACHE CULLING
      =============
      
      The cache may need culling occasionally to make space.  This involves
      discarding objects from the cache that have been used less recently than
      anything else.  Culling is based on the access time of data objects.  Empty
      directories are culled if not in use.
      
      Cache culling is done on the basis of the percentage of blocks and the
      percentage of files available in the underlying filesystem.  There are six
      "limits":
      
       (*) brun
       (*) frun
      
           If the amount of free space and the number of available files in the cache
           rises above both these limits, then culling is turned off.
      
       (*) bcull
       (*) fcull
      
           If the amount of available space or the number of available files in the
           cache falls below either of these limits, then culling is started.
      
       (*) bstop
       (*) fstop
      
           If the amount of available space or the number of available files in the
           cache falls below either of these limits, then no further allocation of
           disk space or files is permitted until culling has raised things above
           these limits again.
      
      These must be configured thusly:
      
      	0 <= bstop < bcull < brun < 100
      	0 <= fstop < fcull < frun < 100
      
      Note that these are percentages of available space and available files, and do
      _not_ appear as 100 minus the percentage displayed by the "df" program.
      
      The userspace daemon scans the cache to build up a table of cullable objects.
      These are then culled in least recently used order.  A new scan of the cache is
      started as soon as space is made in the table.  Objects will be skipped if
      their atimes have changed or if the kernel module says it is still using them.
      
      ===============
      CACHE STRUCTURE
      ===============
      
      The CacheFiles module will create two directories in the directory it was
      given:
      
       (*) cache/
      
       (*) graveyard/
      
      The active cache objects all reside in the first directory.  The CacheFiles
      kernel module moves any retired or culled objects that it can't simply unlink
      to the graveyard from which the daemon will actually delete them.
      
      The daemon uses dnotify to monitor the graveyard directory, and will delete
      anything that appears therein.
      
      The module represents index objects as directories with the filename "I..." or
      "J...".  Note that the "cache/" directory is itself a special index.
      
      Data objects are represented as files if they have no children, or directories
      if they do.  Their filenames all begin "D..." or "E...".  If represented as a
      directory, data objects will have a file in the directory called "data" that
      actually holds the data.
      
      Special objects are similar to data objects, except their filenames begin
      "S..." or "T...".
      
      If an object has children, then it will be represented as a directory.
      Immediately in the representative directory are a collection of directories
      named for hash values of the child object keys with an '@' prepended.  Into
      this directory, if possible, will be placed the representations of the child
      objects:
      
      	INDEX     INDEX      INDEX                             DATA FILES
      	========= ========== ================================= ================
      	cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400
      	cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...DB1ry
      	cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...N22ry
      	cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...FP1ry
      
      If the key is so long that it exceeds NAME_MAX with the decorations added on to
      it, then it will be cut into pieces, the first few of which will be used to
      make a nest of directories, and the last one of which will be the objects
      inside the last directory.  The names of the intermediate directories will have
      '+' prepended:
      
      	J1223/@23/+xy...z/+kl...m/Epqr
      
      Note that keys are raw data, and not only may they exceed NAME_MAX in size,
      they may also contain things like '/' and NUL characters, and so they may not
      be suitable for turning directly into a filename.
      
      To handle this, CacheFiles will use a suitably printable filename directly and
      "base-64" encode ones that aren't directly suitable.  The two versions of
      object filenames indicate the encoding:
      
      	OBJECT TYPE	PRINTABLE	ENCODED
      	===============	===============	===============
      	Index		"I..."		"J..."
      	Data		"D..."		"E..."
      	Special		"S..."		"T..."
      
      Intermediate directories are always "@" or "+" as appropriate.
      
      Each object in the cache has an extended attribute label that holds the object
      type ID (required to distinguish special objects) and the auxiliary data from
      the netfs.  The latter is used to detect stale objects in the cache and update
      or retire them.
      
      Note that CacheFiles will erase from the cache any file it doesn't recognise or
      any file of an incorrect type (such as a FIFO file or a device file).
      
      ==========================
      SECURITY MODEL AND SELINUX
      ==========================
      
      CacheFiles is implemented to deal properly with the LSM security features of
      the Linux kernel and the SELinux facility.
      
      One of the problems that CacheFiles faces is that it is generally acting on
      behalf of a process, and running in that process's context, and that includes a
      security context that is not appropriate for accessing the cache - either
      because the files in the cache are inaccessible to that process, or because if
      the process creates a file in the cache, that file may be inaccessible to other
      processes.
      
      The way CacheFiles works is to temporarily change the security context (fsuid,
      fsgid and actor security label) that the process acts as - without changing the
      security context of the process when it the target of an operation performed by
      some other process (so signalling and suchlike still work correctly).
      
      When the CacheFiles module is asked to bind to its cache, it:
      
       (1) Finds the security label attached to the root cache directory and uses
           that as the security label with which it will create files.  By default,
           this is:
      
      	cachefiles_var_t
      
       (2) Finds the security label of the process which issued the bind request
           (presumed to be the cachefilesd daemon), which by default will be:
      
      	cachefilesd_t
      
           and asks LSM to supply a security ID as which it should act given the
           daemon's label.  By default, this will be:
      
      	cachefiles_kernel_t
      
           SELinux transitions the daemon's security ID to the module's security ID
           based on a rule of this form in the policy.
      
      	type_transition <daemon's-ID> kernel_t : process <module's-ID>;
      
           For instance:
      
      	type_transition cachefilesd_t kernel_t : process cachefiles_kernel_t;
      
      The module's security ID gives it permission to create, move and remove files
      and directories in the cache, to find and access directories and files in the
      cache, to set and access extended attributes on cache objects, and to read and
      write files in the cache.
      
      The daemon's security ID gives it only a very restricted set of permissions: it
      may scan directories, stat files and erase files and directories.  It may
      not read or write files in the cache, and so it is precluded from accessing the
      data cached therein; nor is it permitted to create new files in the cache.
      
      There are policy source files available in:
      
      	http://people.redhat.com/~dhowells/fscache/cachefilesd-0.8.tar.bz2
      
      and later versions.  In that tarball, see the files:
      
      	cachefilesd.te
      	cachefilesd.fc
      	cachefilesd.if
      
      They are built and installed directly by the RPM.
      
      If a non-RPM based system is being used, then copy the above files to their own
      directory and run:
      
      	make -f /usr/share/selinux/devel/Makefile
      	semodule -i cachefilesd.pp
      
      You will need checkpolicy and selinux-policy-devel installed prior to the
      build.
      
      By default, the cache is located in /var/fscache, but if it is desirable that
      it should be elsewhere, than either the above policy files must be altered, or
      an auxiliary policy must be installed to label the alternate location of the
      cache.
      
      For instructions on how to add an auxiliary policy to enable the cache to be
      located elsewhere when SELinux is in enforcing mode, please see:
      
      	/usr/share/doc/cachefilesd-*/move-cache.txt
      
      When the cachefilesd rpm is installed; alternatively, the document can be found
      in the sources.
      
      ==================
      A NOTE ON SECURITY
      ==================
      
      CacheFiles makes use of the split security in the task_struct.  It allocates
      its own task_security structure, and redirects current->act_as to point to it
      when it acts on behalf of another process, in that process's context.
      
      The reason it does this is that it calls vfs_mkdir() and suchlike rather than
      bypassing security and calling inode ops directly.  Therefore the VFS and LSM
      may deny the CacheFiles access to the cache data because under some
      circumstances the caching code is running in the security context of whatever
      process issued the original syscall on the netfs.
      
      Furthermore, should CacheFiles create a file or directory, the security
      parameters with that object is created (UID, GID, security label) would be
      derived from that process that issued the system call, thus potentially
      preventing other processes from accessing the cache - including CacheFiles's
      cache management daemon (cachefilesd).
      
      What is required is to temporarily override the security of the process that
      issued the system call.  We can't, however, just do an in-place change of the
      security data as that affects the process as an object, not just as a subject.
      This means it may lose signals or ptrace events for example, and affects what
      the process looks like in /proc.
      
      So CacheFiles makes use of a logical split in the security between the
      objective security (task->sec) and the subjective security (task->act_as).  The
      objective security holds the intrinsic security properties of a process and is
      never overridden.  This is what appears in /proc, and is what is used when a
      process is the target of an operation by some other process (SIGKILL for
      example).
      
      The subjective security holds the active security properties of a process, and
      may be overridden.  This is not seen externally, and is used whan a process
      acts upon another object, for example SIGKILLing another process or opening a
      file.
      
      LSM hooks exist that allow SELinux (or Smack or whatever) to reject a request
      for CacheFiles to run in a context of a specific security label, or to create
      files and directories with another security label.
      
      This documentation is added by the patch to:
      
      	Documentation/filesystems/caching/cachefiles.txt
      Signed-Off-By: NDavid Howells <dhowells@redhat.com>
      Acked-by: NSteve Dickson <steved@redhat.com>
      Acked-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      Acked-by: NAl Viro <viro@zeniv.linux.org.uk>
      Tested-by: NDaire Byrne <Daire.Byrne@framestore.com>
      9ae326a6