1. 03 1月, 2013 13 次提交
    • B
      ACPI: Remove unused struct acpi_pci_root.id member · e0ebda2e
      Bjorn Helgaas 提交于
      This member is never initialized and never referenced, so remove it.
      Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      e0ebda2e
    • R
      ACPI: Drop ACPI device .bind() and .unbind() callbacks · 3eec5f7a
      Rafael J. Wysocki 提交于
      Drop the .bind() and .unbind() that have no more users from
      struct acpi_device_ops and remove all of the code referring to
      them from drivers/acpi/scan.c.
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: NYinghai Lu <yinghai@kernel.org>
      Acked-by: NToshi Kani <toshi.kani@hp.com>
      3eec5f7a
    • R
      ACPI / PCI: Rework the setup and cleanup of device wakeup · d2e5f0c1
      Rafael J. Wysocki 提交于
      Currently, the ACPI wakeup capability of PCI devices is set up
      in two different places, partially in acpi_pci_bind() where
      runtime wakeup is initialized and partially in
      platform_pci_wakeup_init(), where system wakeup is initialized.
      The cleanup is only done in acpi_pci_unbind() and it only covers
      runtime wakeup.
      
      Use the new .setup() and .cleanup() callbacks in struct acpi_bus_type
      to consolidate that code and do the setup and the cleanup each in one
      place.
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: NYinghai Lu <yinghai@kernel.org>
      Acked-by: NToshi Kani <toshi.kani@hp.com>
      d2e5f0c1
    • R
      ACPI: Add .setup() and .cleanup() callbacks to struct acpi_bus_type · 11909ca1
      Rafael J. Wysocki 提交于
      Add two new callbacks,.setup() and .cleanup(), struct acpi_bus_type
      and modify acpi_platform_notify() to call .setup() after executing
      acpi_bind_one() successfully and acpi_platform_notify_remove() to
      call .cleanup() before running acpi_unbind_one().  This will allow
      the users of struct acpi_bus_type, PCI in particular, to specify
      operations to be executed right after the given device has been
      associated with a companion struct acpi_device and right before
      it's going to be detached from that companion, respectively.
      
      The main motivation is to be able to get rid of acpi_pci_bind()
      and acpi_pci_unbind(), which are horrible horrible stuff.  [In short,
      there are three problems with them: The way they populate the .bind()
      and .unbind() callbacks of ACPI devices is rather less than
      straightforward, they require special hotplug-specific paths to be
      present in the ACPI namespace scanning code and by the time
      acpi_pci_unbind() is called the PCI device object in question may
      not exist any more.]
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: NYinghai Lu <yinghai@kernel.org>
      Acked-by: NToshi Kani <toshi.kani@hp.com>
      11909ca1
    • R
      ACPI: Make acpi_bus_scan() and acpi_bus_add() take only one argument · 0cd6ac52
      Rafael J. Wysocki 提交于
      The callers of acpi_bus_add() usually assume that if it has
      succeeded, then a struct acpi_device object has been attached to
      the handle passed as the first argument.  Unfortunately, however,
      this assumption is wrong, because acpi_bus_scan(), and acpi_bus_add()
      too as a result, may return a pointer to a different struct
      acpi_device object on success (it may be an object corresponding to
      one of the descendant ACPI nodes in the namespace scope below that
      handle).
      
      For this reason, the callers of acpi_bus_add() who care about
      whether or not a struct acpi_device object has been created for
      its first argument need to check that using acpi_bus_get_device()
      anyway, so the second argument of acpi_bus_add() is not really
      useful for them.  The same observation applies to acpi_bus_scan()
      executed directly from acpi_scan_init().
      
      Therefore modify the relevant callers of acpi_bus_add() to check the
      existence of the struct acpi_device in question with the help of
      acpi_bus_get_device() and drop the no longer necessary second
      argument of acpi_bus_add().  Accordingly, modify acpi_scan_init() to
      use acpi_bus_get_device() to get acpi_root and drop the no longer
      needed second argument of acpi_bus_scan().
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: NYinghai Lu <yinghai@kernel.org>
      Acked-by: NToshi Kani <toshi.kani@hp.com>
      0cd6ac52
    • R
      ACPI: Replace ACPI device add_type field with a match_driver flag · 209d3b17
      Rafael J. Wysocki 提交于
      After the removal of the second argument of acpi_bus_scan() there is
      no difference between the ACPI_BUS_ADD_MATCH and ACPI_BUS_ADD_START
      add types, so the add_type field in struct acpi_device may be
      replaced with a single flag.  Do that calling the flag match_driver.
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: NYinghai Lu <yinghai@kernel.org>
      Acked-by: NToshi Kani <toshi.kani@hp.com>
      209d3b17
    • R
      ACPI: Remove the arguments of acpi_bus_add() that are not used · 636458de
      Rafael J. Wysocki 提交于
      Notice that acpi_bus_add() uses only 2 of its 4 arguments and
      redefine its header to match the body.  Update all of its callers as
      necessary and observe that this leads to quite a number of removed
      lines of code (Linus will like that).
      
      Add a kerneldoc comment documenting acpi_bus_add() and wonder how
      its callers make wrong assumptions about the second argument (make
      note to self to take care of that later).
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: NYinghai Lu <yinghai@kernel.org>
      Acked-by: NToshi Kani <toshi.kani@hp.com>
      636458de
    • R
      ACPI: Remove acpi_start_single_object() and acpi_bus_start() · 02f57c67
      Rafael J. Wysocki 提交于
      The ACPI PCI root bridge driver was the only ACPI driver implementing
      the .start() callback, which isn't used by any ACPI drivers any more
      now.
      
      For this reason, acpi_start_single_object() has no purpose any more,
      so remove it and all references to it.  Also remove
      acpi_bus_start_device(), whose only purpose was to call
      acpi_start_single_object().
      
      Moreover, since after the removal of acpi_bus_start_device() the
      only purpose of acpi_bus_start() remains to call
      acpi_update_all_gpes(), move that into acpi_bus_add() and drop
      acpi_bus_start() too, remove its header from acpi_bus.h and
      update all of its former users accordingly.
      
      This change was previously proposed in a different from by
      Yinghai Lu.
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: NYinghai Lu <yinghai@kernel.org>
      Acked-by: NToshi Kani <toshi.kani@hp.com>
      02f57c67
    • R
      ACPI: Replace struct acpi_bus_ops with enum type · a2d06a1a
      Rafael J. Wysocki 提交于
      Notice that one member of struct acpi_bus_ops, acpi_op_add, is not
      used anywhere any more and the relationship between its remaining
      members, acpi_op_match and acpi_op_start, is such that it doesn't
      make sense to set the latter without setting the former at the same
      time.  Therefore, replace struct acpi_bus_ops with new a enum type,
      enum acpi_bus_add_type, with three values, ACPI_BUS_ADD_BASIC,
      ACPI_BUS_ADD_MATCH, ACPI_BUS_ADD_START, corresponding to
      both acpi_op_match and acpi_op_start unset, acpi_op_match set and
      acpi_op_start unset, and both acpi_op_match and acpi_op_start set,
      respectively.
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: NYinghai Lu <yinghai@kernel.org>
      Acked-by: NToshi Kani <toshi.kani@hp.com>
      a2d06a1a
    • R
      ACPI: Separate adding ACPI device objects from probing ACPI drivers · 805d410f
      Rafael J. Wysocki 提交于
      Split the ACPI namespace scanning for devices into two passes, such
      that struct acpi_device objects are registerd in the first pass
      without probing ACPI drivers and the drivers are probed against them
      directly in the second pass.
      
      There are two main reasons for doing that.
      
      First, the ACPI PCI root bridge driver's .add() routine,
      acpi_pci_root_add(), causes struct pci_dev objects to be created for
      all PCI devices under the given root bridge.  Usually, there are
      corresponding ACPI device nodes in the ACPI namespace for some of
      those devices and therefore there should be "companion" struct
      acpi_device objects to attach those struct pci_dev objects to.  These
      struct acpi_device objects should exist when the corresponding
      struct pci_dev objects are created, but that is only guaranteed
      during boot and not during hotplug.  This leads to a number of
      functional differences between the boot and the hotplug cases which
      are not strictly necessary and make the code more complicated.
      
      For example, this forces the ACPI PCI root bridge driver to defer the
      registration of the just created struct pci_dev objects and to use a
      special .start() callback routine, acpi_pci_root_start(), to make
      sure that all of the "companion" struct acpi_device objects will be
      present at PCI devices registration time during hotplug.
      
      If those differences can be eliminated, we will be able to
      consolidate the boot and hotplug code paths for the enumeration and
      registration of PCI devices and to reduce the complexity of that
      code quite a bit.
      
      The second reason is that, in general, it should be possible to
      resolve conflicts of resources assigned by the BIOS to different
      devices represented by ACPI namespace nodes before any drivers bind
      to them and before they are attached to "companion" objects
      representing physical devices (such as struct pci_dev).  However, for
      this purpose we first need to enumerate all ACPI device nodes in the
      given namespace scope.
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: NYinghai Lu <yinghai@kernel.org>
      Acked-by: NToshi Kani <toshi.kani@hp.com>
      805d410f
    • D
      UAPI: Remove empty Kbuild files · 3d33fcc1
      David Howells 提交于
      Empty files can get deleted by the patch program, so remove empty Kbuild
      files and their links from the parent Kbuilds.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3d33fcc1
    • M
      mm: mempolicy: Convert shared_policy mutex to spinlock · 42288fe3
      Mel Gorman 提交于
      Sasha was fuzzing with trinity and reported the following problem:
      
        BUG: sleeping function called from invalid context at kernel/mutex.c:269
        in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main
        2 locks held by trinity-main/6361:
         #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff810aa314>] __do_page_fault+0x1e4/0x4f0
         #1:  (&(&mm->page_table_lock)->rlock){+.+...}, at: [<ffffffff8122f017>] handle_pte_fault+0x3f7/0x6a0
        Pid: 6361, comm: trinity-main Tainted: G        W
        3.7.0-rc2-next-20121024-sasha-00001-gd95ef01-dirty #74
        Call Trace:
          __might_sleep+0x1c3/0x1e0
          mutex_lock_nested+0x29/0x50
          mpol_shared_policy_lookup+0x2e/0x90
          shmem_get_policy+0x2e/0x30
          get_vma_policy+0x5a/0xa0
          mpol_misplaced+0x41/0x1d0
          handle_pte_fault+0x465/0x6a0
      
      This was triggered by a different version of automatic NUMA balancing
      but in theory the current version is vunerable to the same problem.
      
      do_numa_page
        -> numa_migrate_prep
          -> mpol_misplaced
            -> get_vma_policy
              -> shmem_get_policy
      
      It's very unlikely this will happen as shared pages are not marked
      pte_numa -- see the page_mapcount() check in change_pte_range() -- but
      it is possible.
      
      To address this, this patch restores sp->lock as originally implemented
      by Kosaki Motohiro.  In the path where get_vma_policy() is called, it
      should not be calling sp_alloc() so it is not necessary to treat the PTL
      specially.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Tested-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      42288fe3
    • H
      mempolicy: remove arg from mpol_parse_str, mpol_to_str · a7a88b23
      Hugh Dickins 提交于
      Remove the unused argument (formerly no_context) from mpol_parse_str()
      and from mpol_to_str().
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a7a88b23
  2. 27 12月, 2012 4 次提交
  3. 26 12月, 2012 3 次提交
    • E
      pidns: Stop pid allocation when init dies · c876ad76
      Eric W. Biederman 提交于
      Oleg pointed out that in a pid namespace the sequence.
      - pid 1 becomes a zombie
      - setns(thepidns), fork,...
      - reaping pid 1.
      - The injected processes exiting.
      
      Can lead to processes attempting access their child reaper and
      instead following a stale pointer.
      
      That waitpid for init can return before all of the processes in
      the pid namespace have exited is also unfortunate.
      
      Avoid these problems by disabling the allocation of new pids in a pid
      namespace when init dies, instead of when the last process in a pid
      namespace is reaped.
      Pointed-out-by: NOleg Nesterov <oleg@redhat.com>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      c876ad76
    • J
      ext4: fix deadlock in journal_unmap_buffer() · 53e87268
      Jan Kara 提交于
      We cannot wait for transaction commit in journal_unmap_buffer()
      because we hold page lock which ranks below transaction start.  We
      solve the issue by bailing out of journal_unmap_buffer() and
      jbd2_journal_invalidatepage() with -EBUSY.  Caller is then responsible
      for waiting for transaction commit to finish and try invalidation
      again. Since the issue can happen only for page stradding i_size, it
      is simple enough to manually call jbd2_journal_invalidatepage() for
      such page from ext4_setattr(), check the return value and wait if
      necessary.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      53e87268
    • J
      ext4: split off ext4_journalled_invalidatepage() · 4520fb3c
      Jan Kara 提交于
      In data=journal mode we don't need delalloc or DIO handling in invalidatepage
      and similarly in other modes we don't need the journal handling. So split
      invalidatepage implementations.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      4520fb3c
  4. 22 12月, 2012 6 次提交
  5. 21 12月, 2012 14 次提交
    • G
      linux/kernel.h: fix DIV_ROUND_CLOSEST with unsigned divisors · c4e18497
      Guenter Roeck 提交于
      Commit 263a523d ("linux/kernel.h: Fix warning seen with W=1 due to
      change in DIV_ROUND_CLOSEST") fixes a warning seen with W=1 due to
      change in DIV_ROUND_CLOSEST.
      
      Unfortunately, the C compiler converts divide operations with unsigned
      divisors to unsigned, even if the dividend is signed and negative (for
      example, -10 / 5U = 858993457).  The C standard says "If one operand has
      unsigned int type, the other operand is converted to unsigned int", so
      the compiler is not to blame.  As a result, DIV_ROUND_CLOSEST(0, 2U) and
      similar operations now return bad values, since the automatic conversion
      of expressions such as "0 - 2U/2" to unsigned was not taken into
      account.
      
      Fix by checking for the divisor variable type when deciding which
      operation to perform.  This fixes DIV_ROUND_CLOSEST(0, 2U), but still
      returns bad values for negative dividends divided by unsigned divisors.
      Mark the latter case as unsupported.
      
      One observed effect of this problem is that the s2c_hwmon driver reports
      a value of 4198403 instead of 0 if the ADC reads 0.
      
      Other impact is unpredictable.  Problem is seen if the divisor is an
      unsigned variable or constant and the dividend is less than (divisor/2).
      Signed-off-by: NGuenter Roeck <linux@roeck-us.net>
      Reported-by: NJuergen Beisert <jbe@pengutronix.de>
      Tested-by: NJuergen Beisert <jbe@pengutronix.de>
      Cc: Jean Delvare <khali@linux-fr.org>
      Cc: <stable@vger.kernel.org>	[3.7.x]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c4e18497
    • K
      exec: do not leave bprm->interp on stack · b66c5984
      Kees Cook 提交于
      If a series of scripts are executed, each triggering module loading via
      unprintable bytes in the script header, kernel stack contents can leak
      into the command line.
      
      Normally execution of binfmt_script and binfmt_misc happens recursively.
      However, when modules are enabled, and unprintable bytes exist in the
      bprm->buf, execution will restart after attempting to load matching
      binfmt modules.  Unfortunately, the logic in binfmt_script and
      binfmt_misc does not expect to get restarted.  They leave bprm->interp
      pointing to their local stack.  This means on restart bprm->interp is
      left pointing into unused stack memory which can then be copied into the
      userspace argv areas.
      
      After additional study, it seems that both recursion and restart remains
      the desirable way to handle exec with scripts, misc, and modules.  As
      such, we need to protect the changes to interp.
      
      This changes the logic to require allocation for any changes to the
      bprm->interp.  To avoid adding a new kmalloc to every exec, the default
      value is left as-is.  Only when passing through binfmt_script or
      binfmt_misc does an allocation take place.
      
      For a proof of concept, see DoTest.sh from:
      
         http://www.halfdog.net/Security/2012/LinuxKernelBinfmtScriptStackDataDisclosure/Signed-off-by: NKees Cook <keescook@chromium.org>
      Cc: halfdog <me@halfdog.net>
      Cc: P J P <ppandit@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b66c5984
    • J
      vfs: turn is_dir argument to kern_path_create into a lookup_flags arg · 1ac12b4b
      Jeff Layton 提交于
      Where we can pass in LOOKUP_DIRECTORY or LOOKUP_REVAL. Any other flags
      passed in here are currently ignored.
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      1ac12b4b
    • J
      vfs: add a retry_estale helper function to handle retries on ESTALE · b9d6ba94
      Jeff Layton 提交于
      This function is expected to be called from path-based syscalls to help
      them decide whether to try the lookup and call again in the event that
      they got an -ESTALE return back on an earier try.
      
      Currently, we only retry the call once on an ESTALE error, but in the
      event that we decide that that's not enough in the future, we should be
      able to change the logic in this helper without too much effort.
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b9d6ba94
    • A
      vfs: Remove useless function prototypes · 47166739
      Alessio Igor Bogani 提交于
      Commit 8e22cc88 removes the (un)lock_super
      function definitions but forgets to remove their prototypes.
      Signed-off-by: NAlessio Igor Bogani <abogani@kernel.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      47166739
    • M
      mm: drop vmtruncate · 7898575f
      Marco Stornelli 提交于
      Removed vmtruncate
      Signed-off-by: NMarco Stornelli <marco.stornelli@gmail.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      7898575f
    • M
      vfs: drop vmtruncate · d30357f2
      Marco Stornelli 提交于
      Removed vmtruncate
      Signed-off-by: NMarco Stornelli <marco.stornelli@gmail.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      d30357f2
    • D
      FS-Cache: Mark cancellation of in-progress operation · 1f372dff
      David Howells 提交于
      Mark as cancelled an operation that is in progress rather than pending at the
      time it is cancelled, and call fscache_complete_op() to cancel an operation so
      that blocked ops can be started.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      1f372dff
    • D
      FS-Cache: Convert the object event ID #defines into an enum · 36a02de5
      David Howells 提交于
      Convert the fscache_object event IDs from #defines into an enum.  Also add an
      extra label to the enum to carry the event count and redefine the event mask
      in terms of that.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      36a02de5
    • D
      VFS: Make more complete truncate operation available to CacheFiles · a02de960
      David Howells 提交于
      Make a more complete truncate operation available to CacheFiles (including
      security checks and suchlike) so that it can use this to clear invalidated
      cache files.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NAl Viro <viro@zeniv.linux.org.uk>
      a02de960
    • D
      FS-Cache: Provide proper invalidation · ef778e7a
      David Howells 提交于
      Provide a proper invalidation method rather than relying on the netfs retiring
      the cookie it has and getting a new one.  The problem with this is that isn't
      easy for the netfs to make sure that it has completed/cancelled all its
      outstanding storage and retrieval operations on the cookie it is retiring.
      
      Instead, have the cache provide an invalidation method that will cancel or wait
      for all currently outstanding operations before invalidating the cache, and
      will cause new operations to queue up behind that.  Whilst invalidation is in
      progress, some requests will be rejected until the cache can stack a barrier on
      the operation queue to cause new operations to be deferred behind it.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      ef778e7a
    • D
      FS-Cache: Fix operation state management and accounting · 9f10523f
      David Howells 提交于
      Fix the state management of internal fscache operations and the accounting of
      what operations are in what states.
      
      This is done by:
      
       (1) Give struct fscache_operation a enum variable that directly represents the
           state it's currently in, rather than spreading this knowledge over a bunch
           of flags, who's processing the operation at the moment and whether it is
           queued or not.
      
           This makes it easier to write assertions to check the state at various
           points and to prevent invalid state transitions.
      
       (2) Add an 'operation complete' state and supply a function to indicate the
           completion of an operation (fscache_op_complete()) and make things call
           it.  The final call to fscache_put_operation() can then check that an op
           in the appropriate state (complete or cancelled).
      
       (3) Adjust the use of object->n_ops, ->n_in_progress, ->n_exclusive to better
           govern the state of an object:
      
      	(a) The ->n_ops is now the number of extant operations on the object
      	    and is now decremented by fscache_put_operation() only.
      
      	(b) The ->n_in_progress is simply the number of objects that have been
      	    taken off of the object's pending queue for the purposes of being
      	    run.  This is decremented by fscache_op_complete() only.
      
      	(c) The ->n_exclusive is the number of exclusive ops that have been
      	    submitted and queued or are in progress.  It is decremented by
      	    fscache_op_complete() and by fscache_cancel_op().
      
           fscache_put_operation() and fscache_operation_gc() now no longer try to
           clean up ->n_exclusive and ->n_in_progress.  That was leading to double
           decrements against fscache_cancel_op().
      
           fscache_cancel_op() now no longer decrements ->n_ops.  That was leading to
           double decrements against fscache_put_operation().
      
           fscache_submit_exclusive_op() now decides whether it has to queue an op
           based on ->n_in_progress being > 0 rather than ->n_ops > 0 as the latter
           will persist in being true even after all preceding operations have been
           cancelled or completed.  Furthermore, if an object is active and there are
           runnable ops against it, there must be at least one op running.
      
       (4) Add a remaining-pages counter (n_pages) to struct fscache_retrieval and
           provide a function to record completion of the pages as they complete.
      
           When n_pages reaches 0, the operation is deemed to be complete and
           fscache_op_complete() is called.
      
           Add calls to fscache_retrieval_complete() anywhere we've finished with a
           page we've been given to read or allocate for.  This includes places where
           we just return pages to the netfs for reading from the server and where
           accessing the cache fails and we discard the proposed netfs page.
      
      The bugs in the unfixed state management manifest themselves as oopses like the
      following where the operation completion gets out of sync with return of the
      cookie by the netfs.  This is possible because the cache unlocks and returns
      all the netfs pages before recording its completion - which means that there's
      nothing to stop the netfs discarding them and returning the cookie.
      
      
      FS-Cache: Cookie 'NFS.fh' still has outstanding reads
      ------------[ cut here ]------------
      kernel BUG at fs/fscache/cookie.c:519!
      invalid opcode: 0000 [#1] SMP
      CPU 1
      Modules linked in: cachefiles nfs fscache auth_rpcgss nfs_acl lockd sunrpc
      
      Pid: 400, comm: kswapd0 Not tainted 3.1.0-rc7-fsdevel+ #1090                  /DG965RY
      RIP: 0010:[<ffffffffa007050a>]  [<ffffffffa007050a>] __fscache_relinquish_cookie+0x170/0x343 [fscache]
      RSP: 0018:ffff8800368cfb00  EFLAGS: 00010282
      RAX: 000000000000003c RBX: ffff880023cc8790 RCX: 0000000000000000
      RDX: 0000000000002f2e RSI: 0000000000000001 RDI: ffffffff813ab86c
      RBP: ffff8800368cfb50 R08: 0000000000000002 R09: 0000000000000000
      R10: ffff88003a1b7890 R11: ffff88001df6e488 R12: ffff880023d8ed98
      R13: ffff880023cc8798 R14: 0000000000000004 R15: ffff88003b8bf370
      FS:  0000000000000000(0000) GS:ffff88003bd00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 00000000008ba008 CR3: 0000000023d93000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process kswapd0 (pid: 400, threadinfo ffff8800368ce000, task ffff88003b8bf040)
      Stack:
       ffff88003b8bf040 ffff88001df6e528 ffff88001df6e528 ffffffffa00b46b0
       ffff88003b8bf040 ffff88001df6e488 ffff88001df6e620 ffffffffa00b46b0
       ffff88001ebd04c8 0000000000000004 ffff8800368cfb70 ffffffffa00b2c91
      Call Trace:
       [<ffffffffa00b2c91>] nfs_fscache_release_inode_cookie+0x3b/0x47 [nfs]
       [<ffffffffa008f25f>] nfs_clear_inode+0x3c/0x41 [nfs]
       [<ffffffffa0090df1>] nfs4_evict_inode+0x2f/0x33 [nfs]
       [<ffffffff810d8d47>] evict+0xa1/0x15c
       [<ffffffff810d8e2e>] dispose_list+0x2c/0x38
       [<ffffffff810d9ebd>] prune_icache_sb+0x28c/0x29b
       [<ffffffff810c56b7>] prune_super+0xd5/0x140
       [<ffffffff8109b615>] shrink_slab+0x102/0x1ab
       [<ffffffff8109d690>] balance_pgdat+0x2f2/0x595
       [<ffffffff8103e009>] ? process_timeout+0xb/0xb
       [<ffffffff8109dba3>] kswapd+0x270/0x289
       [<ffffffff8104c5ea>] ? __init_waitqueue_head+0x46/0x46
       [<ffffffff8109d933>] ? balance_pgdat+0x595/0x595
       [<ffffffff8104bf7a>] kthread+0x7f/0x87
       [<ffffffff813ad6b4>] kernel_thread_helper+0x4/0x10
       [<ffffffff81026b98>] ? finish_task_switch+0x45/0xc0
       [<ffffffff813abcdd>] ? retint_restore_args+0xe/0xe
       [<ffffffff8104befb>] ? __init_kthread_worker+0x53/0x53
       [<ffffffff813ad6b0>] ? gs_change+0xb/0xb
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      9f10523f
    • D
      FS-Cache: Make cookie relinquishment wait for outstanding reads · ef46ed88
      David Howells 提交于
      Make fscache_relinquish_cookie() log a warning and wait if there are any
      outstanding reads left on the cookie it was given.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      ef46ed88
    • D
      CacheFiles: Fix the marking of cached pages · c4d6d8db
      David Howells 提交于
      Under some circumstances CacheFiles defers the marking of pages with PG_fscache
      so that it can take advantage of pagevecs to reduce the number of calls to
      fscache_mark_pages_cached() and the netfs's hook to keep track of this.
      
      There are, however, two problems with this:
      
       (1) It can lead to the PG_fscache mark being applied _after_ the page is set
           PG_uptodate and unlocked (by the call to fscache_end_io()).
      
       (2) CacheFiles's ref on the page is dropped immediately following
           fscache_end_io() - and so may not still be held when the mark is applied.
           This can lead to the page being passed back to the allocator before the
           mark is applied.
      
      Fix this by, where appropriate, marking the page before calling
      fscache_end_io() and releasing the page.  This means that we can't take
      advantage of pagevecs and have to make a separate call for each page to the
      marking routines.
      
      The symptoms of this are Bad Page state errors cropping up under memory
      pressure, for example:
      
      BUG: Bad page state in process tar  pfn:002da
      page:ffffea0000009fb0 count:0 mapcount:0 mapping:          (null) index:0x1447
      page flags: 0x1000(private_2)
      Pid: 4574, comm: tar Tainted: G        W   3.1.0-rc4-fsdevel+ #1064
      Call Trace:
       [<ffffffff8109583c>] ? dump_page+0xb9/0xbe
       [<ffffffff81095916>] bad_page+0xd5/0xea
       [<ffffffff81095d82>] get_page_from_freelist+0x35b/0x46a
       [<ffffffff810961f3>] __alloc_pages_nodemask+0x362/0x662
       [<ffffffff810989da>] __do_page_cache_readahead+0x13a/0x267
       [<ffffffff81098942>] ? __do_page_cache_readahead+0xa2/0x267
       [<ffffffff81098d7b>] ra_submit+0x1c/0x20
       [<ffffffff8109900a>] ondemand_readahead+0x28b/0x29a
       [<ffffffff81098ee2>] ? ondemand_readahead+0x163/0x29a
       [<ffffffff810990ce>] page_cache_sync_readahead+0x38/0x3a
       [<ffffffff81091d8a>] generic_file_aio_read+0x2ab/0x67e
       [<ffffffffa008cfbe>] nfs_file_read+0xa4/0xc9 [nfs]
       [<ffffffff810c22c4>] do_sync_read+0xba/0xfa
       [<ffffffff81177a47>] ? security_file_permission+0x7b/0x84
       [<ffffffff810c25dd>] ? rw_verify_area+0xab/0xc8
       [<ffffffff810c29a4>] vfs_read+0xaa/0x13a
       [<ffffffff810c2a79>] sys_read+0x45/0x6c
       [<ffffffff813ac37b>] system_call_fastpath+0x16/0x1b
      
      As can be seen, PG_private_2 (== PG_fscache) is set in the page flags.
      
      Instrumenting fscache_mark_pages_cached() to verify whether page->mapping was
      set appropriately showed that sometimes it wasn't.  This led to the discovery
      that sometimes the page has apparently been reclaimed by the time the marker
      got to see it.
      Reported-by: NM. Stevens <m@tippett.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Reviewed-by: NJeff Layton <jlayton@redhat.com>
      c4d6d8db