1. 08 Jul 2011, 1 commit
    • drivers/virt: introduce Freescale hypervisor management driver · 6db71994
      Committed by Timur Tabi
      Add the drivers/virt directory, which houses drivers that support
      virtualization environments, and add the Freescale hypervisor management
      driver.
      
      The Freescale hypervisor management driver provides several services to
      drivers and applications related to the Freescale hypervisor:
      
      1. An ioctl interface for querying and managing partitions
      
      2. A file interface for reading incoming doorbells (see the sketch below)
      
      3. An interrupt handler for shutting down the partition upon receiving the
         shutdown doorbell from a manager partition
      
      4. A kernel interface for receiving callbacks when a managed partition
         shuts down.
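      
      For illustration, a minimal userspace sketch of the doorbell file
      interface follows; it assumes the driver's misc device appears as
      /dev/fsl-hv and that each read() returns one 32-bit doorbell handle
      (both are assumptions, not guarantees of this text):
      
        #include <fcntl.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <unistd.h>
      
        int main(void)
        {
                uint32_t doorbell;
                int fd = open("/dev/fsl-hv", O_RDONLY); /* assumed node name */
      
                if (fd < 0) {
                        perror("open");
                        return 1;
                }
                /* Each successful read yields the handle of one doorbell. */
                while (read(fd, &doorbell, sizeof(doorbell)) == sizeof(doorbell))
                        printf("doorbell %u received\n", doorbell);
                close(fd);
                return 0;
        }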
      Signed-off-by: Timur Tabi <timur@freescale.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
  2. 22 Jun 2011, 3 commits
  3. 20 Jun 2011, 1 commit
  4. 18 Jun 2011, 1 commit
    • USB: Fix up URB error codes to reflect implementation. · a9e75863
      Committed by Sarah Sharp
      Documentation/usb/error-codes.txt mentions that urb->status can be set to
      -EXDEV, if the isochronous transfer was not fully completed.  However, in
      practice, EHCI, UHCI, and OHCI all only set -EXDEV in the individual frame
      status, never in the URB status.  Those host controllers actually always
      pass in a zero status to usb_hcd_giveback_urb, and rely on the core to set
      the appropriate status value.
      
      The xHCI driver ran into issues with the uvcvideo driver when it tried
      to set -EXDEV in urb->status: the uvcvideo driver then refused to
      submit URBs, and the userspace camera application's video froze.
      
      Clean up the documentation to reflect the actual implementation.
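      
      As a hedged illustration of the documented behavior, a driver's
      isochronous completion handler checks each frame's status rather than
      urb->status; the handler name and per-frame policy below are
      illustrative only:
      
        #include <linux/usb.h>
      
        /* Per-frame errors live in iso_frame_desc[], not in urb->status. */
        static void example_iso_complete(struct urb *urb)
        {
                int i;
      
                for (i = 0; i < urb->number_of_packets; i++) {
                        /* -EXDEV here means this frame was not completed. */
                        if (urb->iso_frame_desc[i].status == -EXDEV)
                                continue;       /* skip this frame's data */
                        /* otherwise consume iso_frame_desc[i].actual_length bytes */
                }
        }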
      Signed-off-by: Sarah Sharp <sarah.a.sharp@linux.intel.com>
      Acked-by: Alan Stern <stern@rowland.harvard.edu>
  5. 16 Jun 2011, 7 commits
  6. 15 Jun 2011, 1 commit
    • rcu: Use softirq to address performance regression · 09223371
      Committed by Shaohua Li
      Commit a26ac245 ("rcu: move TREE_RCU from softirq to kthread")
      introduced a performance regression.  In an AIM7 test, this commit
      degraded performance by about 40%.
      
      The commit runs RCU callbacks in a kthread instead of in softirq
      context.  We observed a high rate of context switches caused by this:
      our test system has 64 CPUs and HZ is 1000, so we saw more than 64k
      context switches per second from RCU's per-CPU kthreads.  A trace
      showed that most of
      the time the RCU per-CPU kthread doesn't actually handle any callbacks,
      but instead just does a very small amount of work handling grace periods.
      This means that RCU's per-CPU kthreads are making the scheduler do quite
      a bit of work in order to allow a very small amount of RCU-related
      processing to be done.
      
      Alex Shi's analysis determined that this slowdown is due to lock
      contention within the scheduler.  Unfortunately, as Peter Zijlstra points
      out, the scheduler's real-time semantics require global action, which
      means that this contention is inherent in real-time scheduling.  (Yes,
      perhaps someone will come up with a workaround -- otherwise, -rt is not
      going to do well on large SMP systems -- but this patch will work around
      this issue in the meantime.  And "the meantime" might well be forever.)
      
      This patch therefore re-introduces softirq processing to RCU, but only
      for core RCU work.  RCU callbacks are still executed in kthread context,
      so that only a small amount of RCU work runs in softirq context in the
      common case.  This should minimize ksoftirqd execution, allowing us to
      skip boosting of ksoftirqd for CONFIG_RCU_BOOST=y kernels.
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Tested-by: "Alex,Shi" <alex.shi@intel.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  7. 09 Jun 2011, 1 commit
  8. 08 Jun 2011, 1 commit
    • usb-storage: redo incorrect reads · 21c13a4f
      Committed by Alan Stern
      Some USB mass-storage devices have bugs that cause them not to handle
      the first READ(10) command they receive correctly.  The Corsair
      Padlock v2 returns completely bogus data for its first read (possibly
      it returns the data in encrypted form even though the device is
      supposed to be unlocked).  The Feiya SD/SDHC card reader fails to
      complete the first READ(10) command after it is plugged in or after a
      new card is inserted, returning a status code that indicates it thinks
      the command was invalid, which prevents the kernel from retrying the
      read.
      
      Since the first read of a new device or a new medium is for the
      partition sector, the kernel is unable to retrieve the device's
      partition table.  Users have to manually issue an "hdparm -z" or
      "blockdev --rereadpt" command before they can access the device.
      
      This patch (as1470) works around the problem.  It adds a new quirk
      flag, US_FL_INVALID_READ10, indicating that the first READ(10) should
      always be retried immediately, as should any failing READ(10) commands
      (provided the preceding READ(10) command succeeded, to avoid getting
      stuck in a loop).  The patch also adds appropriate unusual_devs
      entries containing the new flag.
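      
      For illustration, an unusual_devs.h entry carrying the new flag might
      look like the sketch below; the vendor/product IDs and strings are
      placeholders, not the entries added by this patch:
      
        /* Hypothetical entry; the real IDs and names are in unusual_devs.h. */
        UNUSUAL_DEV(0x1234, 0x5678, 0x0000, 0x9999,
                        "Example Vendor",
                        "Example SD/SDHC Reader",
                        USB_SC_DEVICE, USB_PR_DEVICE, NULL,
                        US_FL_INVALID_READ10),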
      Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
      Tested-by: Sven Geggus <sven-usbst@geggus.net>
      Tested-by: Paul Hartman <paul.hartman+linux@gmail.com>
      CC: Matthew Dharm <mdharm-usb@one-eyed-alien.net>
      CC: <stable@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  9. 01 Jun 2011, 1 commit
    • intel-iommu: Enable super page (2MiB, 1GiB, etc.) support · 6dd9a7c7
      Committed by Youquan Song
      There are no externally-visible changes with this. In the loop in the
      internal __domain_mapping() function, we simply detect if we are mapping:
        - size >= 2MiB, and
        - virtual address aligned to 2MiB, and
        - physical address aligned to 2MiB, and
        - on hardware that supports superpages.
      
      (and likewise for larger superpages).
      
      We automatically use a superpage for such mappings. We never have to
      worry about *breaking* superpages, since we trust that we will always
      *unmap* the same range that was mapped. So all we need to do is ensure
      that dma_pte_clear_range() will also cope with superpages.
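      
      A simplified sketch of that eligibility test, restricted to the 2MiB
      level; the helper name is illustrative, and the real code also walks
      the larger superpage levels:
      
        #include <linux/kernel.h>       /* IS_ALIGNED() */
      
        static bool can_use_2m_superpage(unsigned long iov_pfn,
                                         unsigned long phys_pfn,
                                         unsigned long nr_pages,
                                         bool hw_has_2m_pages)
        {
                /* A 2MiB superpage covers 512 contiguous 4KiB pages. */
                const unsigned long sp_pages = 1UL << 9;
      
                return hw_has_2m_pages &&
                       nr_pages >= sp_pages &&
                       IS_ALIGNED(iov_pfn, sp_pages) &&
                       IS_ALIGNED(phys_pfn, sp_pages);
        }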
      
      Adjust pfn_to_dma_pte() to take a superpage 'level' as an argument, so
      it can return a PTE at the appropriate level rather than always
      extending the page tables all the way down to level 1. Again, this is
      simplified by the fact that we should never encounter existing small
      pages when we're creating a mapping; any old mapping that used the same
      virtual range will have been entirely removed and its obsolete page
      tables freed.
      
      Provide an 'intel_iommu=sp_off' argument on the command line as a
      chicken bit. Not that it should ever be required.
      
      ==
      
      The original commit seen in the iommu-2.6.git was Youquan's
      implementation (and completion) of my own half-baked code which I'd
      typed into an email. Followed by half a dozen subsequent 'fixes'.
      
      I've taken the unusual step of rewriting history and collapsing the
      original commits in order to keep the main history simpler, and make
      life easier for the people who are going to have to backport this to
      older kernels. And also so I can give it a more coherent commit comment
      which (hopefully) gives a better explanation of what's going on.
      
      The original sequence of commits leading to identical code was:
      
      Youquan Song (3):
            intel-iommu: super page support
            intel-iommu: Fix superpage alignment calculation error
            intel-iommu: Fix superpage level calculation error in dma_pfn_level_pte()
      
      David Woodhouse (4):
            intel-iommu: Precalculate superpage support for dmar_domain
            intel-iommu: Fix hardware_largepage_caps()
            intel-iommu: Fix inappropriate use of superpages in __domain_mapping()
            intel-iommu: Fix phys_pfn in __domain_mapping for sglist pages
      Signed-off-by: Youquan Song <youquan.song@intel.com>
      Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
  10. 30 May 2011, 2 commits
  11. 29 May 2011, 5 commits
    • x86 idle: deprecate mwait_idle() and "idle=mwait" cmdline param · 5d4c47e0
      Committed by Len Brown
      mwait_idle() is a C1-only idle loop intended to be more efficient
      than HLT on SMP hardware that supports it.
      
      But mwait_idle() has been replaced by the more general
      mwait_idle_with_hints(), which handles both C1 and deeper C-states.
      ACPI uses only mwait_idle_with_hints(), and never uses mwait_idle().
      
      Deprecate mwait_idle() and the "idle=mwait" cmdline param
      to simplify the x86 idle code.
      
      After this change, kernels configured with both CONFIG_ACPI=n and
      CONFIG_INTEL_IDLE=n, when run on hardware that supports MWAIT,
      will simply use HLT.  If MWAIT is desired on those systems, cpuidle
      and the cpuidle drivers above can be used.
      
      cc: x86@kernel.org
      cc: stable@kernel.org # .39.x
      Signed-off-by: Len Brown <len.brown@intel.com>
    • x86 idle: deprecate "no-hlt" cmdline param · cdaab4a0
      Committed by Len Brown
      We'd rather that modern machines not check if HLT works on
      every entry into idle, for the benefit of machines that had
      marginal electricals 15 years ago.  If those machines are still
      running the upstream kernel, they can use "idle=poll".  The only
      difference will be that they'll now invoke HLT in machine_halt().
      
      cc: x86@kernel.org # .39.x
      Signed-off-by: Len Brown <len.brown@intel.com>
    • x86 idle APM: deprecate CONFIG_APM_CPU_IDLE · 99c63221
      Committed by Len Brown
      We don't want to export the pm_idle function pointer to modules.
      Currently CONFIG_APM_CPU_IDLE w/ CONFIG_APM_MODULE forces us to.
      
      CONFIG_APM_CPU_IDLE is of dubious value; it runs only on 32-bit
      uniprocessor laptops that are over 10 years old.  It calls into
      the BIOS during idle, and is known to cause a number of machines
      to fail.
      
      Removing CONFIG_APM_CPU_IDLE will allow us to stop exporting
      pm_idle.  Any systems that were calling into the APM BIOS
      at run-time will simply use HLT instead.
      
      cc: x86@kernel.org
      cc: Jiri Kosina <jkosina@suse.cz>
      cc: stable@kernel.org # .39.x
      Signed-off-by: Len Brown <len.brown@intel.com>
    • x86 idle floppy: deprecate disable_hlt() · 3b70b2e5
      Committed by Len Brown
      Plan to remove floppy_disable_hlt() in 2012: it is an ancient
      workaround whose own comments say it should be removed.
      
      This allows us to remove clutter and a run-time branch
      from the idle code.
      
      WARN_ONCE() on invocation until it is removed.
      
      cc: x86@kernel.org
      cc: stable@kernel.org # .39.x
      Signed-off-by: Len Brown <len.brown@intel.com>
    • ACPI: Split out custom_method functionality into its own driver · 526b4af4
      Committed by Thomas Renninger
      With /sys/kernel/debug/acpi/custom_method root can write
      to arbitrary memory and increase his privileges, even if
      these are restricted.
      
      -> Make this its own debug .config option and warn about the
      security issue in the config description.
      
      -> Still keep acpi/debugfs.c which now only creates an empty
         /sys/kernel/debug/acpi directory. There might be other
         users of it later.
      Signed-off-by: Thomas Renninger <trenn@suse.de>
      Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
      Acked-by: rui.zhang@intel.com
      Signed-off-by: Len Brown <len.brown@intel.com>
  12. 28 May 2011, 2 commits
  13. 27 May 2011, 13 commits
    • fs: pass exact type of data dirties to ->dirty_inode · aa385729
      Committed by Christoph Hellwig
      Tell the filesystem whether we just updated the timestamps
      (I_DIRTY_SYNC) or anything else, so that the filesystem can track
      internally whether it needs to push out a transaction for fdatasync.
      
      This is just the prototype change with no user for it yet.  I plan
      to push large XFS changes for the next merge window, and getting
      this trivial infrastructure in this window would help a lot to avoid
      tree interdependencies.
      
      Also remove the incorrect comments saying that ->dirty_inode can't
      block.  That was changed a long time ago, and many implementations
      rely on being able to block.
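      
      A sketch of a ->dirty_inode implementation using the new flags
      argument; the function and its fast path are illustrative rather
      than any real filesystem's code:
      
        #include <linux/fs.h>
      
        static void example_dirty_inode(struct inode *inode, int flags)
        {
                if (flags == I_DIRTY_SYNC) {
                        /* Only timestamps changed: fdatasync() needs no
                         * transaction for this, so take a cheaper path. */
                        return;
                }
                /* Something else was dirtied: log a full transaction here. */
        }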
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • regulator: Remove supply_regulator_dev from machine configuration · 492c826b
      Committed by Mark Brown
      supply_regulator_dev (which uses a struct pointer) has been deprecated
      in favour of supply_regulator (which uses a regulator name) for quite a
      few releases now, with a warning generated if it is used.  There are no
      current in-tree users, so just remove the code.
      Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
      Signed-off-by: Liam Girdwood <lrg@slimlogic.co.uk>
    • coredump: add support for exe_file in core name · 57cc083a
      Committed by Jiri Slaby
      Now that exe_file no longer depends on proc FS, we can use it to name
      the core file.  So we add a %E pattern for core file name creation
      which extracts the path from mm_struct->exe_file, converts slashes to
      exclamation marks, and pastes the result into the core file name
      itself.
      
      This is useful for environments where binary names are longer than 16
      characters (the current->comm limitation), where there are binaries
      with the same name but in different paths, and for cases where the
      binary itself changes its current->comm after exec.
      
      So by doing (s/$/#/ -- # is treated as git comment):
      
        $ sysctl kernel.core_pattern='core.%p.%e.%E'
        $ ln /bin/cat cat45678901234567890
        $ ./cat45678901234567890
        ^Z
        $ rm cat45678901234567890
        $ fg
        ^\Quit (core dumped)
        $ ls core*
      
      we now get:
      
        core.2434.cat456789012345.!root!cat45678901234567890 (deleted)
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Reviewed-by: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cgroup: remove the ns_cgroup · a77aea92
      Committed by Daniel Lezcano
      The ns_cgroup is an annoying cgroup at the namespace / cgroup frontier and
      leads to some problems:
      
        * cgroup creation is out-of-control
        * cgroup name can conflict when pids are looping
        * it is not possible to have a single process handling a lot of
          namespaces without running into exponential creation time
        * we may want to create a namespace without creating a cgroup
      
      The ns_cgroup was replaced by a compatibility flag 'clone_children',
      where a newly created cgroup will copy the parent cgroup's values.
      Userspace has to manually create a cgroup and add a task to the
      'tasks' file.
      
      This patch removes the ns_cgroup as suggested in the following thread:
      
      https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html
      
      The 'cgroup_clone' function is removed because it is no longer used.
      
      This is a userspace-visible change.  Commit 45531757 ("cgroup: notify
      ns_cgroup deprecated") (merged into 2.6.27) caused the kernel to emit a
      printk, warning users that the feature is planned for removal.  Since
      that time we have heard from XXX users who were affected by this.
      Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
      Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Jamal Hadi Salim <hadi@cyberus.ca>
      Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
      Acked-by: Paul Menage <menage@google.com>
      Acked-by: Matt Helsley <matthltc@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cgroups: make procs file writable · 74a1166d
      Committed by Ben Blum
      Make procs file writable to move all threads by tgid at once.
      
      Add functionality that enables users to move all threads in a threadgroup
      at once to a cgroup by writing the tgid to the 'cgroup.procs' file.  This
      current implementation makes use of a per-threadgroup rwsem that's taken
      for reading in the fork() path to prevent newly forking threads within the
      threadgroup from "escaping" while the move is in progress.
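      
      A userspace sketch of the new interface; the cgroup path
      /cgroup/mygroup is a placeholder for wherever the hierarchy is
      mounted:
      
        #include <stdio.h>
        #include <sys/types.h>
      
        /* Move every thread of a threadgroup at once via cgroup.procs. */
        static int move_threadgroup(pid_t tgid)
        {
                FILE *f = fopen("/cgroup/mygroup/cgroup.procs", "w");
      
                if (!f)
                        return -1;
                fprintf(f, "%d\n", (int)tgid);
                return fclose(f);
        }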
      Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Reviewed-by: Paul Menage <menage@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cgroups: add per-thread subsystem callbacks · f780bdb7
      Committed by Ben Blum
      Add cgroup subsystem callbacks for per-thread attachment in atomic contexts.
      
      Add can_attach_task(), pre_attach(), and attach_task() as new callbacks
      for cgroups's subsystem interface.  Unlike can_attach and attach, these
      are for per-thread operations, to be called potentially many times when
      attaching an entire threadgroup.
      
      Also, the old "bool threadgroup" interface is removed, replaced by
      these callbacks.  All subsystems are modified for the new interface;
      of note is cpuset, which requires the from/to nodemasks for attach to
      be globally scoped (though per-cpuset would work too) so that they
      persist from its pre_attach to attach_task and attach.
      
      This is a pre-patch for cgroup-procs-writable.patch.
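      
      A sketch of a subsystem wiring up the new callbacks; every example_*
      handler below is hypothetical (definitions omitted):
      
        #include <linux/cgroup.h>
      
        struct cgroup_subsys example_subsys = {
                .name            = "example",
                .can_attach      = example_can_attach,      /* whole threadgroup */
                .can_attach_task = example_can_attach_task, /* once per thread */
                .pre_attach      = example_pre_attach,
                .attach_task     = example_attach_task,     /* once per thread */
                .attach          = example_attach,
        };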
      Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Reviewed-by: Paul Menage <menage@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Documentation: configfs examples crash fix · dcb3a08e
      Committed by Jiri Slaby
      When configfs_register_subsystem() fails, we unregister too many
      subsystems in configfs_example_init.  Decrement i by one so that we do
      not unregister a subsystem that was never registered (see the sketch
      below).
      
      [akpm@linux-foundation.org: coding-style fixes]
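      
      The corrected unwind pattern, in sketch form; example_subsys is a
      hypothetical array of subsystem pointers standing in for the ones in
      the example code:
      
        static int example_init(void)
        {
                int ret = 0, i;
      
                for (i = 0; i < ARRAY_SIZE(example_subsys); i++) {
                        ret = configfs_register_subsystem(example_subsys[i]);
                        if (ret)
                                break;
                }
                if (ret) {
                        /* Entry i failed and was never registered, so
                         * step back past it before unregistering. */
                        while (--i >= 0)
                                configfs_unregister_subsystem(example_subsys[i]);
                }
                return ret;
        }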
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Cc: Joel Becker <joel.becker@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • getdelays: show average CPU/IO/SWAP/RECLAIM delays · 02d54f09
      Committed by Wu Fengguang
      I find it very handy to show the average delays in milliseconds.
      
      Example output (with 100 concurrent dd processes reading sparse files):
      
        CPU             count     real total  virtual total    delay total  delay average
                          986     3223509952     3207643301    38863410579         39.415ms
        IO              count    delay total  delay average
                            0              0              0ms
        SWAP            count    delay total  delay average
                            0              0              0ms
        RECLAIM         count    delay total  delay average
                         1059     5131834899              4ms
        dd: read=0, write=0, cancelled_write=0
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Mel Gorman <mel@linux.vnet.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Reviewed-by: Satoru Moriya <satoru.moriya@hds.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Documentation/accounting/getdelays.c: handle sendto() failures · 4ed960b1
      Committed by Andrew Morton
      Fixes
      
        Documentation/accounting/getdelays.c: In function `get_family_id':
        Documentation/accounting/getdelays.c:172:14: warning: variable `rc' set but not used [-Wunused-but-set-variable]
      Reported-by: "Justin P. Mattock" <justinmattock@gmail.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Documentation/accounting/getdelays.c: fix unused var warning · fbdd91a6
      Committed by Justin P. Mattock
      Fixes
      
        Documentation/accounting/getdelays.c: In function `main':
        Documentation/accounting/getdelays.c:436:7: warning: variable `i' set but not used [-Wunused-but-set-variable]
      Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Documentation/atomic_ops.txt: avoid volatile in sample code · 72eef0f3
      Committed by Nikanth Karthikesan
      Since declaring a counter as volatile is discouraged, it is best not
      to use it in sample code either.
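      
      The preferred form, as a short sketch: declare the counter as a plain
      atomic_t and let the atomic_*() accessors supply the required
      semantics:
      
        #include <linux/atomic.h>
      
        static atomic_t counter = ATOMIC_INIT(0);  /* no volatile needed */
      
        static void bump(void)
        {
                atomic_inc(&counter);
        }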
      Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • rcu: Decrease memory-barrier usage based on semi-formal proof · 23b5c8fa
      Committed by Paul E. McKenney
      (Note: this was reverted, and is now being re-applied in pieces, with
      this being the fifth and final piece.  See below for the reason that
      it is now felt to be safe to re-apply this.)
      
      Commit d09b62df fixed grace-period synchronization, but left some
      smp_mb() invocations in rcu_process_callbacks() that were no longer
      needed; sheer paranoia had prevented them from being removed.  This
      commit removes them and provides a proof of correctness in their
      absence.  It also adds
      a memory barrier to rcu_report_qs_rsp() immediately before the update to
      rsp->completed in order to handle the theoretical possibility that the
      compiler or CPU might move massive quantities of code into a lock-based
      critical section.  This also proves that the sheer paranoia was not
      entirely unjustified, at least from a theoretical point of view.
      
      In addition, the old dyntick-idle synchronization depended on the fact
      that grace periods were many milliseconds in duration, so that it could
      be assumed that no dyntick-idle CPU could reorder a memory reference
      across an entire grace period.  Unfortunately for this design, the
      addition of expedited grace periods breaks this assumption, which has
      the unfortunate side-effect of requiring atomic operations in the
      functions that track dyntick-idle state for RCU.  (There is some hope
      that the algorithms used in user-level RCU might be applied here, but
      some work is required to handle the NMIs that user-space applications
      can happily ignore.  For the short term, better safe than sorry.)
      
      This proof assumes that neither compiler nor CPU will allow a lock
      acquisition and release to be reordered, as doing so can result in
      deadlock.  The proof is as follows:
      
      1.	A given CPU declares a quiescent state under the protection of
      	its leaf rcu_node's lock.
      
      2.	If there is more than one level of rcu_node hierarchy, the
      	last CPU to declare a quiescent state will also acquire the
      	->lock of the next rcu_node up in the hierarchy,  but only
      	after releasing the lower level's lock.  The acquisition of this
      	lock clearly cannot occur prior to the acquisition of the leaf
      	node's lock.
      
      3.	Step 2 repeats until we reach the root rcu_node structure.
      	Please note again that only one lock is held at a time through
      	this process.  The acquisition of the root rcu_node's ->lock
      	must occur after the release of that of the leaf rcu_node.
      
      4.	At this point, we set the ->completed field in the rcu_state
      	structure in rcu_report_qs_rsp().  However, if the rcu_node
      	hierarchy contains only one rcu_node, then in theory the code
      	preceding the quiescent state could leak into the critical
      	section.  We therefore precede the update of ->completed with a
      	memory barrier.  All CPUs will therefore agree that any updates
      	preceding any report of a quiescent state will have happened
      	before the update of ->completed.
      
      5.	Regardless of whether a new grace period is needed, rcu_start_gp()
      	will propagate the new value of ->completed to all of the leaf
      	rcu_node structures, under the protection of each rcu_node's ->lock.
      	If a new grace period is needed immediately, this propagation
      	will occur in the same critical section that ->completed was
      	set in, but courtesy of the memory barrier in #4 above, is still
      	seen to follow any pre-quiescent-state activity.
      
      6.	When a given CPU invokes __rcu_process_gp_end(), it becomes
      	aware of the end of the old grace period and therefore makes
      	any RCU callbacks that were waiting on that grace period eligible
      	for invocation.
      
      	If this CPU is the same one that detected the end of the grace
      	period, and if there is but a single rcu_node in the hierarchy,
      	we will still be in the single critical section.  In this case,
      	the memory barrier in step #4 guarantees that all callbacks will
      	be seen to execute after each CPU's quiescent state.
      
      	On the other hand, if this is a different CPU, it will acquire
      	the leaf rcu_node's ->lock, and will again be serialized after
      	each CPU's quiescent state for the old grace period.
      
      On the strength of this proof, this commit therefore removes the memory
      barriers from rcu_process_callbacks() and adds one to rcu_report_qs_rsp().
      The effect is to reduce the number of memory barriers by one and to
      reduce the frequency of execution from about once per scheduling tick
      per CPU to once per grace period.
      
      This was reverted due to hangs found during testing by Yinghai Lu and
      Ingo Molnar.  Frederic Weisbecker supplied Yinghai with tracing that
      located the underlying problem, and Frederic also provided the fix.
      
      The underlying problem was that the HARDIRQ_ENTER() macro from
      lib/locking-selftest.c invoked irq_enter(), which in turn invokes
      rcu_irq_enter(), but HARDIRQ_EXIT() invoked __irq_exit(), which
      does not invoke rcu_irq_exit().  This situation resulted in calls
      to rcu_irq_enter() that were not balanced by the required calls to
      rcu_irq_exit().  Therefore, after these locking selftests completed,
      RCU's dyntick-idle nesting count was a large number (for example,
      72), which caused RCU to conclude that the affected CPU was not in
      dyntick-idle mode when in fact it was.
      
      RCU would therefore incorrectly wait for this dyntick-idle CPU, resulting
      in hangs.
      
      In contrast, with Frederic's patch, which replaces the irq_enter()
      in HARDIRQ_ENTER() with an __irq_enter(), these tests don't ever call
      either rcu_irq_enter() or rcu_irq_exit(), which works because the CPU
      running the test is already marked as not being in dyntick-idle mode.
      This means that the rcu_irq_enter() and rcu_irq_exit() calls remain
      balanced, and RCU then has no problem working out which CPUs are in
      dyntick-idle mode and which are not.
      
      The reason that the imbalance was not noticed before the barrier patch
      was applied is that the old implementation of rcu_enter_nohz() ignored
      the nesting depth.  This could still result in delays, but much shorter
      ones.  Whenever there was a delay, RCU would IPI the CPU with the
      unbalanced nesting level, which would eventually result in rcu_enter_nohz()
      being called, which in turn would force RCU to see that the CPU was in
      dyntick-idle mode.
      
      The reason that very few people noticed the problem is that the mismatched
      irq_enter() vs. __irq_exit() occurred only when the kernel was built with
      CONFIG_DEBUG_LOCKING_API_SELFTESTS.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • mm/fs: cleancache documentation · 4fe4746a
      Committed by Dan Magenheimer
      This patchset introduces cleancache, an optional new feature exposed
      by the VFS layer that potentially dramatically increases page cache
      effectiveness for many workloads in many environments at a negligible
      cost.  It does this by providing an interface to transcendent memory,
      which is memory/storage that is not otherwise visible to and/or directly
      addressable by the kernel.
      
      Instead of being discarded, clean pages are "put" to cleancache by
      hooks in the reclaim code.  Filesystems that "opt-in" may "get" pages
      from cleancache that were previously put, but pages in cleancache are
      "ephemeral", meaning they may disappear at any time. And the size
      of cleancache is entirely dynamic and unknowable to the kernel.
      Filesystems currently supported by this patchset include ext3, ext4,
      btrfs, and ocfs2.  Other filesystems (especially those built entirely
      on VFS) should be easy to add, but should first be thoroughly tested to
      ensure coherency.
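      
      A hedged sketch of the read-side "get" described above; the
      cleancache_get_page() hook comes from this series, but the
      surrounding readpage function and example_get_block are illustrative
      only:
      
        #include <linux/cleancache.h>
        #include <linux/mpage.h>
        #include <linux/pagemap.h>
      
        static int example_readpage(struct file *file, struct page *page)
        {
                /* Try transcendent memory first; 0 means the page was filled. */
                if (cleancache_get_page(page) == 0) {
                        SetPageUptodate(page);
                        unlock_page(page);
                        return 0;
                }
                /* Miss: fall back to a real block read. */
                return mpage_readpage(page, example_get_block);
        }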
      
      Details and a FAQ are provided in Documentation/vm/cleancache.txt
      
      This first patch of eight in this cleancache series only adds two
      new documentation files.
      
      [v8: minor documentation changes by author]
      [v3: akpm@linux-foundation.org: document sysfs API]
      [v3: hch@infradead.org: move detailed description to Documentation/vm]
      Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
      Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik Van Riel <riel@redhat.com>
      Cc: Jan Beulich <JBeulich@novell.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Andreas Dilger <adilger@sun.com>
      Cc: Ted Ts'o <tytso@mit.edu>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
  14. 26 May 2011, 1 commit