1. 14 Sep, 2011 1 commit
    • iommu/core: Add fault reporting mechanism · 4f3f8d9d
      Ohad Ben-Cohen committed
      Add an iommu fault report mechanism to the IOMMU API, so implementations
      can report mmu faults (translation errors, hardware errors, etc.).
      
      Fault reports can be used in several ways:
      - mere logging
      - reset the device that accessed the faulting address (may be necessary
        in case the device is a remote processor for example)
      - implement dynamic PTE/TLB loading
      
      A dedicated iommu_set_fault_handler() API has been added to allow
      users who want to receive such reports to provide their own handler.
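
As an illustration of the registration pattern described above, here is a minimal userspace sketch. It is not the kernel API: every `demo_` name, the handler signature and the -1 return value for "no handler installed" are assumptions made for this example.

```c
#include <stddef.h>

struct demo_domain;

/* Hypothetical handler type: returns 0 if the fault was handled. */
typedef int (*demo_fault_handler_t)(struct demo_domain *d,
                                    unsigned long iova, int flags);

struct demo_domain {
    demo_fault_handler_t handler; /* NULL until a user registers one */
    unsigned long last_iova;      /* written by the logging handler below */
};

/* Counterpart of the "set fault handler" call: remember the user's callback. */
static void demo_set_fault_handler(struct demo_domain *d,
                                   demo_fault_handler_t h)
{
    d->handler = h;
}

/* Called by the (modelled) IOMMU implementation when a fault occurs. */
static int demo_report_fault(struct demo_domain *d, unsigned long iova,
                             int flags)
{
    if (!d->handler)
        return -1; /* nobody is interested in fault reports */
    return d->handler(d, iova, flags);
}

/* Example user handler covering the "mere logging" use case. */
static int demo_log_fault(struct demo_domain *d, unsigned long iova, int flags)
{
    (void)flags;
    d->last_iova = iova;
    return 0;
}
```

A user that needs to reset a device or load a PTE on demand would register a different callback in place of `demo_log_fault`.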
      Signed-off-by: Ohad Ben-Cohen <ohad@wizery.com>
      Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
  2. 18 Aug, 2011 2 commits
  3. 16 Aug, 2011 1 commit
    • block: fix flush machinery for stacking drivers with differing flush flags · 4853abaa
      Jeff Moyer committed
      Commit ae1b1539 ("block: reimplement FLUSH/FUA to support merge")
      introduced a performance regression when running any sort of fsyncing
      workload using dm-multipath and certain storage (in our case, an HP
      EVA).  The test I ran was fs_mark, and it dropped from ~800 files/sec
      on ext4 to ~100 files/sec.  It turns out that dm-multipath always
      advertised flush+fua support, and passed commands on down the stack,
      where those flags used to get stripped off.  The above commit changed
      that behavior:
      
      static inline struct request *__elv_next_request(struct request_queue *q)
      {
              struct request *rq;
      
              while (1) {
      -               while (!list_empty(&q->queue_head)) {
      +               if (!list_empty(&q->queue_head)) {
                              rq = list_entry_rq(q->queue_head.next);
      -                       if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
      -                           (rq->cmd_flags & REQ_FLUSH_SEQ))
      -                               return rq;
      -                       rq = blk_do_flush(q, rq);
      -                       if (rq)
      -                               return rq;
      +                       return rq;
                      }
      
      Note that previously, a command would come in here, have
      REQ_FLUSH|REQ_FUA set, and then get handed off to blk_do_flush:
      
      struct request *blk_do_flush(struct request_queue *q, struct request *rq)
      {
              unsigned int fflags = q->flush_flags; /* may change, cache it */
              bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
              bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
              bool do_postflush = has_flush && !has_fua &&
                      (rq->cmd_flags & REQ_FUA);
              unsigned skip = 0;
      ...
              if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
                      rq->cmd_flags &= ~REQ_FLUSH;
                      if (!has_fua)
                              rq->cmd_flags &= ~REQ_FUA;
                      return rq;
              }
      
      So, the flush machinery was bypassed in such cases (q->flush_flags == 0
      && rq->cmd_flags & (REQ_FLUSH|REQ_FUA)).
      
      Now, however, we don't get into the flush machinery at all.  Instead,
      __elv_next_request just hands a request with flush and fua bits set to
      the scsi_request_fn, even if the underlying request_queue does not
      support flush or fua.
      
      The agreed-upon approach is to fix the flush machinery to allow
      stacking.  While this isn't used in practice (since there is only one
      request-based dm target, and that target will now reflect the flush
      flags of the underlying device), it does future-proof the solution, and
      makes it function as designed.
      
      In order to make this work, I had to add a field to the struct request,
      inside the flush structure (to store the original req->end_io).  Shaohua
      had suggested overloading the union with rb_node and completion_data,
      but the completion data is used by device mapper and can also be used by
      other drivers.  So, I didn't see a way around the additional field.
      
      I tested this patch on an HP EVA with both ext4 and xfs, and it recovers
      the lost performance.  Comments and other testers, as always, are
      appreciated.
      
      Cheers,
      Jeff
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  4. 14 Aug, 2011 2 commits
  5. 12 Aug, 2011 1 commit
    • move RLIMIT_NPROC check from set_user() to do_execve_common() · 72fa5997
      Vasiliy Kulikov committed
      The patch http://lkml.org/lkml/2003/7/13/226 introduced an RLIMIT_NPROC
      check in set_user() to catch attempts to exceed NPROC via setuid() and
      similar functions.

      Before the check, an unprivileged user could greatly exceed the allowed
      number of processes if the program relied on rlimit only.  But the check
      created a new security threat: many poorly written programs simply don't
      check the setuid() return code and assume it cannot fail if executed with
      root privileges.  So, the check is removed in this patch because privilege
      escalations caused by such buggy programs are too frequent.

      The NPROC limit can still be enforced in the common code flow of daemons
      spawning user processes.  Most daemons do fork()+setuid()+execve().
      The check introduced in execve() (1) enforces the same limit as in
      setuid() and (2) doesn't create similar security issues.

      Neil Brown suggested tracking which specific process has exceeded the
      limit by setting the PF_NPROC_EXCEEDED process flag.  With the change, only
      this process would fail on execve(), and other processes' execve()
      behaviour is not changed.
      
      Solar Designer suggested re-checking whether the NPROC limit is still
      exceeded at the moment of execve().  If the process was sleeping for
      days between set*uid() and execve(), and the NPROC counter has dropped
      below the limit, a deferred execve() failure because the NPROC limit was
      exceeded days ago would be unexpected.  If the limit is not exceeded
      anymore, we clear the flag on successful calls to execve() and fork().
      
      The flag is also cleared on successful calls to set_user() as the limit
      was exceeded for the previous user, not the current one.
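
The flag handling described above can be modelled in a few lines of userspace C. This is a simplified sketch, not the patch itself: the `demo_` names, the struct layout and the -1 error value are assumptions for illustration, and the process count is passed in by the caller instead of being read from a per-user counter.

```c
/* Hypothetical flag bit standing in for PF_NPROC_EXCEEDED. */
#define DEMO_PF_NPROC_EXCEEDED 0x1u

struct demo_task {
    unsigned int flags; /* per-process flags */
    int nproc_limit;    /* the RLIMIT_NPROC value that applies */
};

/*
 * Model of set_user(): it no longer fails when the limit is reached.
 * It only records whether the new user was at or over the limit, and
 * clears a stale flag left over from the previous user.
 */
static int demo_set_user(struct demo_task *t, int new_user_nproc)
{
    if (new_user_nproc >= t->nproc_limit)
        t->flags |= DEMO_PF_NPROC_EXCEEDED;
    else
        t->flags &= ~DEMO_PF_NPROC_EXCEEDED;
    return 0;
}

/*
 * Model of the execve() check: fail only if this process was flagged
 * AND the limit is still exceeded right now; otherwise clear the flag.
 */
static int demo_execve(struct demo_task *t, int nproc_now)
{
    if ((t->flags & DEMO_PF_NPROC_EXCEEDED) && nproc_now > t->nproc_limit)
        return -1; /* stands in for an error such as -EAGAIN */
    t->flags &= ~DEMO_PF_NPROC_EXCEEDED;
    return 0;
}
```

The re-check in `demo_execve()` is what keeps a long-sleeping process from failing on a limit that was exceeded only in the past.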
      
      A similar check was introduced in -ow patches (without the process flag).
      
      v3 - clear PF_NPROC_EXCEEDED on successful calls to set_user().
      Reviewed-by: James Morris <jmorris@namei.org>
      Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
      Acked-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 11 Aug, 2011 2 commits
  7. 10 Aug, 2011 1 commit
  8. 09 Aug, 2011 3 commits
  9. 08 Aug, 2011 2 commits
  10. 07 Aug, 2011 5 commits
    • vfs: optimize inode cache access patterns · 3ddcd056
      Linus Torvalds committed
      The inode structure layout is largely random, and some of the vfs paths
      really do care.  The path lookup in particular is already quite D$
      intensive, and profiles show that accessing the 'inode->i_op->xyz'
      fields is quite costly.
      
      We already optimized the dcache to not unnecessarily load the d_op
      structure for members that are often NULL using the DCACHE_OP_xyz bits
      in dentry->d_flags, and this does something very similar for the inode
      ops that are used during pathname lookup.
      
      It also re-orders the fields so that the fields accessed by 'stat' are
      together at the beginning of the inode structure, and roughly in the
      order accessed.
      
      The effect of this seems to be in the 1-2% range for an empty kernel
      "make -j" run (which is fairly kernel-intensive, mostly in filename
      lookup), so it's visible.  The numbers are fairly noisy, though, and
      likely depend a lot on exact microarchitecture.  So there's more tuning
      to be done.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vfs: renumber DCACHE_xyz flags, remove some stale ones · 830c0f0e
      Linus Torvalds committed
      Gcc tends to generate better code with small integers, including the
      DCACHE_xyz flag tests - so move the common ones to be first in the list.
      Also just remove the unused DCACHE_INOTIFY_PARENT_WATCHED and
      DCACHE_AUTOFS_PENDING values; their users no longer exist in the source
      tree.

      And add an "unlikely()" to the DCACHE_OP_COMPARE test, since we want the
      common case to be a nice straight-line fall-through.
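
A small sketch of what such a hinted flag test looks like in plain C with GCC or Clang. The flag value, the `demo_` names and the example comparator are assumptions for illustration; only `__builtin_expect` is a real compiler builtin.

```c
#include <string.h>

/* Branch hint available in GCC and Clang: "this condition is rarely true". */
#define demo_unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical flag value, chosen only for this sketch. */
#define DEMO_DCACHE_OP_COMPARE 0x0002u

typedef int (*demo_compare_fn)(const char *a, const char *b);

/* Example custom comparison: returns 0 when the first characters match. */
static int demo_compare_first_char(const char *a, const char *b)
{
    return a[0] != b[0];
}

/*
 * The rare case (a custom compare operation is installed) is marked
 * unlikely, so the compiler can lay out the common strcmp() path as
 * straight-line fall-through code.
 */
static int demo_names_match(unsigned int d_flags, demo_compare_fn cmp,
                            const char *a, const char *b)
{
    if (demo_unlikely(d_flags & DEMO_DCACHE_OP_COMPARE))
        return cmp(a, b) == 0;
    return strcmp(a, b) == 0;
}
```

The hint changes code layout only; the result of the function is the same with or without it.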
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • net: Compute protocol sequence numbers and fragment IDs using MD5. · 6e5714ea
      David S. Miller committed
      Computers have become a lot faster since we compromised on the
      partial MD4 hash which we currently use for performance reasons.

      MD5 is a much safer choice, and is in line with both RFC1948 and
      other ISS generators (OpenBSD, Solaris, etc.)

      Furthermore, having only 24 bits of the sequence number be truly
      unpredictable is a very serious limitation.  So the periodic
      regeneration and 8-bit counter have been removed.  We compute and
      use a full 32-bit sequence number.

      For ipv6, DCCP was found to use a 32-bit truncated initial sequence
      number (it needs 43 bits) and that is fixed here as well.
      Reported-by: Dan Kaminsky <dan@doxpara.com>
      Tested-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • crypto: Move md5_transform to lib/md5.c · bc0b96b5
      David S. Miller committed
      We are going to use this for TCP/IP sequence number and fragment ID
      generation.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • lib/sha1: use the git implementation of SHA-1 · 1eb19a12
      Mandeep Singh Baines committed
      For ChromiumOS, we use SHA-1 to verify the integrity of the root
      filesystem.  The speed of the kernel sha-1 implementation has a major
      impact on our boot performance.
      
      To improve boot performance, we investigated using the heavily optimized
      sha-1 implementation used in git.  With the git sha-1 implementation, we
      see an 11.7% improvement in boot time.
      
      10 reboots, remove slowest/fastest.
      
      Before:
      
        Mean: 6.58 seconds Stdev: 0.14
      
      After (with git sha-1, this patch):
      
        Mean: 5.89 seconds Stdev: 0.07
      
      The other cool thing about the git SHA-1 implementation is that it only
      needs 64 bytes of stack for the workspace while the original kernel
      implementation needed 320 bytes.
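
As a check on the numbers above: the quoted 11.7% appears to be the difference measured against the new (faster) mean. Measured against the old mean, the same data gives a reduction of about 10.5%. The helpers below are a hypothetical sketch that computes both ratios from the two means reported in the message.

```c
/* Fraction of the original time that was saved: (before - after) / before. */
static double demo_time_saved_fraction(double before, double after)
{
    return (before - after) / before;
}

/* Improvement relative to the new time: (before - after) / after. */
static double demo_speedup_fraction(double before, double after)
{
    return (before - after) / after;
}
```

With before = 6.58 s and after = 5.89 s, `demo_speedup_fraction()` gives roughly 0.117 and `demo_time_saved_fraction()` gives roughly 0.105.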
      Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
      Cc: Ramsay Jones <ramsay@ramsay1.demon.co.uk>
      Cc: Nicolas Pitre <nico@cam.org>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: linux-crypto@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 06 Aug, 2011 1 commit
  12. 05 Aug, 2011 1 commit
  13. 04 Aug, 2011 14 commits
    • Revert "dt: add of_alias_scan and of_alias_get_id" · fe55c184
      Grant Likely committed
      This reverts commit 750f463a.

      of_alias_* still needs work to be generalized for 'promtree' dt
      platforms, and to not implicitly create entries for available ids.
      Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
    • tmpfs radix_tree: locate_item to speed up swapoff · e504f3fd
      Hugh Dickins committed
      We have already acknowledged that swapoff of a tmpfs file is slower than
      it was before conversion to the generic radix_tree: a little slower
      there will be acceptable, if the hotter paths are faster.
      
      But it was a shock to find swapoff of a 500MB file 20 times slower on my
      laptop, taking 10 minutes; and at that rate it significantly slows down
      my testing.
      
      Now, most of that turned out to be overhead from PROVE_LOCKING and
      PROVE_RCU: without those it was only 4 times slower than before; and
      more realistic tests on other machines don't fare as badly.
      
      I've tried a number of things to improve it, including tagging the swap
      entries, then doing lookup by tag: I'd expected that to halve the time,
      but in practice it's erratic, and often counter-productive.
      
      The only change I've so far found to make a consistent improvement, is
      to short-circuit the way we go back and forth, gang lookup packing
      entries into the array supplied, then shmem scanning that array for the
      target entry.  Scanning in place doubles the speed, so it's now only
      twice as slow as before (or three times slower when the PROVEs are on).
      
      So, add radix_tree_locate_item() as an expedient, once-off,
      single-caller hack to do the lookup directly in place.  #ifdef it on
      CONFIG_SHMEM and CONFIG_SWAP, as much to document its limited
      applicability as save space in other configurations.  And, sadly,
      #include sched.h for cond_resched().
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: use kmemdup for short symlinks · 69f07ec9
      Hugh Dickins committed
      But we've not yet removed the old swp_entry_t i_direct[16] from
      shmem_inode_info.  That's because it was still being shared with the
      inline symlink.  Remove it now (saving 64 or 128 bytes from shmem inode
      size), and use kmemdup() for short symlinks, say, those up to 128 bytes.
      
      I wonder why mpol_free_shared_policy() is done in shmem_destroy_inode()
      rather than shmem_evict_inode(), where we usually do such freeing? I
      guess it doesn't matter, and I'm not into NUMA mpol testing right now.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: convert mem_cgroup shmem to radix-swap · aa3b1895
      Hugh Dickins committed
      Remove mem_cgroup_shmem_charge_fallback(): it was only required when we
      had to move swappage to filecache with GFP_NOWAIT.
      
      Remove the GFP_NOWAIT special case from mem_cgroup_cache_charge(), by
      moving its call out from shmem_add_to_page_cache() to two of its three
      callers.  But leave it doing mem_cgroup_uncharge_cache_page() on error:
      although asymmetrical, it's easier for all 3 callers to handle.
      
      These two changes would also be appropriate if anyone were to start
      using shmem_read_mapping_page_gfp() with GFP_NOWAIT.
      
      Remove mem_cgroup_get_shmem_target(): mc_handle_file_pte() can test
      radix_tree_exceptional_entry() to get what it needs for itself.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: miscellaneous trivial cleanups · 41ffe5d5
      Hugh Dickins committed
      While it's at its least, make a number of boring nitpicky cleanups to
      shmem.c, mostly for consistency of variable naming.  Things like "swap"
      instead of "entry", "pgoff_t index" instead of "unsigned long idx".
      
      And since everything else here is prefixed "shmem_", better change
      init_tmpfs() to shmem_init().
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: demolish old swap vector support · 285b2c4f
      Hugh Dickins committed
      The maximum size of a shmem/tmpfs file has been limited by the maximum
      size of its triple-indirect swap vector.  With 4kB page size, maximum
      filesize was just over 2TB on a 32-bit kernel, but sadly one eighth of
      that on a 64-bit kernel.  (With 8kB page size, maximum filesize was just
      over 4TB on a 64-bit kernel, but 16TB on a 32-bit kernel,
      MAX_LFS_FILESIZE being then more restrictive than swap vector layout.)
      
      It's a shame that tmpfs should be more restrictive than ramfs, and this
      limitation has now been noticed.  Add another level to the swap vector?
      No, it became obscure and hard to maintain, once I complicated it to
      make use of highmem pages nine years ago: better choose another way.
      
      Surely, if 2.4 had had the radix tree pagecache introduced in 2.5, then
      tmpfs would never have invented its own peculiar radix tree: we would
      have fitted swap entries into the common radix tree instead, in much the
      same way as we fit swap entries into page tables.
      
      And why should each file have a separate radix tree for its pages and
      for its swap entries? The swap entries are required precisely where and
      when the pages are not.  We want to put them together in a single radix
      tree: which can then avoid much of the locking which was needed to
      prevent them from being exchanged underneath us.
      
      This also avoids the waste of memory devoted to swap vectors, first in
      the shmem_inode itself, then at least two more pages once a file grew
      beyond 16 data pages (pages accounted by df and du, but not by memcg).
      Allocated upfront, to avoid allocation when under swapping pressure, but
      pure waste when CONFIG_SWAP is not set - I have never spattered around
      the ifdefs to prevent that, preferring this move to sharing the common
      radix tree instead.
      
      There are three downsides to sharing the radix tree.  One, that it binds
      tmpfs more tightly to the rest of mm, either requiring knowledge of swap
      entries in radix tree there, or duplication of its code here in shmem.c.
      I believe that the simplifications and memory savings (and probable higher
      performance, not yet measured) justify that.
      
      Two, that on HIGHMEM systems with SWAP enabled, it's the lowmem radix
      nodes that cannot be freed under memory pressure - whereas before it was
      the less precious highmem swap vector pages that could not be freed.
      I'm hoping that 64-bit has now been accessible for long enough, that the
      highmem argument has grown much less persuasive.
      
      Three, that swapoff is slower than it used to be on tmpfs files, since
      it's using a simple generic mechanism not tailored to it: I find this
      noticeable, and shall want to improve, but maybe nobody else will
      notice.
      
      So...  now remove most of the old swap vector code from shmem.c.  But,
      for the moment, keep the simple i_direct vector of 16 pages, with simple
      accessors shmem_put_swap() and shmem_get_swap(), as a toy implementation
      to help mark where swap needs to be handled in subsequent patches.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: let swap use exceptional entries · a2c16d6c
      Hugh Dickins committed
      If swap entries are to be stored along with struct page pointers in a
      radix tree, they need to be distinguished as exceptional entries.
      
      Most of the handling of swap entries in radix tree will be contained in
      shmem.c, but a few functions in filemap.c's common code need to check
      for their appearance: find_get_page(), find_lock_page(),
      find_get_pages() and find_get_pages_contig().
      
      So as not to slow their fast paths, tuck those checks inside the
      existing checks for unlikely radix_tree_deref_slot(); except for
      find_lock_page(), where it is an added test.  And make it a BUG in
      find_get_pages_tag(), which is not applied to tmpfs files.
      
      A part of the reason for eliminating shmem_readpage() earlier, was to
      minimize the places where common code would need to allow for swap
      entries.
      
      The swp_entry_t known to swapfile.c must be massaged into a slightly
      different form when stored in the radix tree, just as it gets massaged
      into a pte_t when stored in page tables.
      
      In an i386 kernel this limits its information (type and page offset) to
      30 bits: given 32 "types" of swapfile and 4kB pagesize, that's a maximum
      swapfile size of 128GB.  Which is less than the 512GB we previously
      allowed with X86_PAE (where the swap entry can occupy the entire upper
      32 bits of a pte_t), but not a new limitation on 32-bit without PAE; and
      there's not a new limitation on 64-bit (where swap filesize is already
      limited to 16TB by a 32-bit page offset).  Thirty areas of 128GB is
      probably still enough swap for a 64GB 32-bit machine.
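
The size limits quoted above follow from simple bit arithmetic: 32 swapfile "types" need 5 bits, so 30 bits of information leave 25 bits of page offset, and 2^25 pages of 4kB each come to 128GB. The helpers below are a hypothetical sketch (the names are not from the patch) that reproduces the three figures.

```c
#include <stdint.h>

/* Offset bits left once the swapfile "type" bits are taken out. */
static unsigned int demo_offset_bits(unsigned int info_bits,
                                     unsigned int type_bits)
{
    return info_bits - type_bits;
}

/* Largest swapfile in bytes for a given offset width and page size. */
static uint64_t demo_max_swapfile_bytes(unsigned int offset_bits,
                                        unsigned int page_shift)
{
    return (uint64_t)1 << (offset_bits + page_shift);
}
```

With 4kB pages (page_shift = 12): 30 - 5 = 25 offset bits give 2^37 bytes (128GB); the 32 bits available with X86_PAE give 27 offset bits and 2^39 bytes (512GB); a 32-bit page offset on 64-bit gives 2^44 bytes (16TB).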
      
      Provide swp_to_radix_entry() and radix_to_swp_entry() conversions, and
      enforce filesize limit in read_swap_header(), just as for ptes.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • radix_tree: exceptional entries and indices · 6328650b
      Hugh Dickins committed
      A patchset to extend tmpfs to MAX_LFS_FILESIZE by abandoning its
      peculiar swap vector, instead keeping a file's swap entries in the same
      radix tree as its struct page pointers: thus saving memory, and
      simplifying its code and locking.
      
      This patch:
      
      The radix_tree is used by several subsystems for different purposes.  A
      major use is to store the struct page pointers of a file's pagecache for
      memory management.  But what if mm wanted to store something other than
      page pointers there too?
      
      The low bit of a radix_tree entry is already used to denote an indirect
      pointer, for internal use, and the unlikely radix_tree_deref_retry()
      case.
      
      Define the next bit as denoting an exceptional entry, and supply inline
      functions radix_tree_exception() to return non-0 in either unlikely
      case, and radix_tree_exceptional_entry() to return non-0 in the second
      case.
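
The two tests described above amount to checks on the low bits of a stored pointer. The following userspace sketch assumes bit 0 marks an indirect pointer and bit 1 marks an exceptional entry, as the text describes; the `demo_`/`DEMO_` names and the encoding helper are hypothetical and only model the idea.

```c
#include <stdint.h>

#define DEMO_INDIRECT_PTR      1UL /* low bit: indirect pointer */
#define DEMO_EXCEPTIONAL_ENTRY 2UL /* next bit: exceptional entry */

/* Non-zero in either unlikely case (indirect pointer or exceptional entry). */
static int demo_radix_tree_exception(const void *entry)
{
    return ((uintptr_t)entry &
            (DEMO_INDIRECT_PTR | DEMO_EXCEPTIONAL_ENTRY)) != 0;
}

/* Non-zero only for an exceptional entry. */
static int demo_radix_tree_exceptional_entry(const void *entry)
{
    return ((uintptr_t)entry & DEMO_EXCEPTIONAL_ENTRY) != 0;
}

/* Hypothetical encoder: pack a small value into an exceptional entry. */
static void *demo_make_exceptional(unsigned long value)
{
    return (void *)(uintptr_t)((value << 2) | DEMO_EXCEPTIONAL_ENTRY);
}
```

A pointer to any object aligned to at least four bytes has both low bits clear, which is why well-aligned pointers and exceptional entries can share one tree.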
      
      If a subsystem already uses radix_tree with that bit set, no problem: it
      does not affect internal workings at all, but is defined for the
      convenience of those storing well-aligned pointers in the radix_tree.
      
      The radix_tree_gang_lookups have an implicit assumption that the caller
      can deduce the offset of each entry returned, e.g. by the page->index of
      a struct page.  But that may not be feasible for some kinds of item to
      be stored there.

      radix_tree_gang_lookup_slot() now allows for an optional indices
      argument, an output array in which to return those offsets.  The same
      could be added to other radix_tree_gang_lookups, but for now keep it to
      the only one for which we need it.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • drivers/video/backlight/aat2870_bl.c: fix setting max_current · 5d6f921b
      Axel Lin committed
       - The current implementation tests the wrong value when setting
         aat2870_bl->max_current.

       - In the current implementation, we cannot differentiate between 2 cases:

         a) pdata->max_current is not set, or

         b) pdata->max_current is set to AAT2870_CURRENT_0_45 (which is also 0).

         Fix it by setting AAT2870_CURRENT_0_45 to 1 and adjusting the equation in
         aat2870_brightness() accordingly.
      Signed-off-by: Axel Lin <axel.lin@gmail.com>
      Cc: Richard Purdie <rpurdie@rpsys.net>
      Cc: Samuel Ortiz <sameo@linux.intel.com>
      Tested-by: Jin Park <jinyoungp@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: page_alloc: increase __GFP_BITS_SHIFT to include __GFP_OTHER_NODE · 3dab1bce
      Johannes Weiner committed
      __GFP_OTHER_NODE is used for NUMA allocations on behalf of other nodes.
      It's supposed to be passed through from the page allocator to
      zone_statistics(), but it never gets there as gfp_allowed_mask is not
      wide enough and masks out the flag early in the allocation path.
      
      The result is an accounting glitch where successful NUMA allocations
      by-agent are not properly attributed as local.
      
      Increase __GFP_BITS_SHIFT so that it includes __GFP_OTHER_NODE.
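
The masking problem can be reproduced with a few lines of bit arithmetic. The flag position and shift values below are illustrative assumptions, not values quoted from the patch: a flag sitting at bit 23 is cleared by a mask built from a shift of 23, and survives once the shift is raised to 24.

```c
/* Illustrative flag value (bit 23); an assumption for this sketch. */
#define DEMO_GFP_OTHER_NODE 0x800000u

/* Mask covering all flag bits below 'shift', as a BITS_SHIFT-style mask does. */
static unsigned int demo_bits_mask(unsigned int shift)
{
    return (1u << shift) - 1;
}

/* Model of an early "gfp &= allowed mask" step in the allocation path. */
static unsigned int demo_apply_allowed_mask(unsigned int gfp,
                                            unsigned int shift)
{
    return gfp & demo_bits_mask(shift);
}
```

With a shift of 23 the mask is 0x7fffff and the flag is dropped before any later code can see it; with 24 the mask is 0xffffff and the flag passes through.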
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ida: simplified functions for id allocation · 88eca020
      Rusty Russell committed
      The current hyper-optimized functions are overkill if you simply want to
      allocate an id for a device.  Create versions which use an internal
      lock.
      
      In followup patches, numerous drivers are converted to use this
      interface.
      
      Thanks to Tejun for feedback.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Jonathan Cameron <jic23@cam.ac.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fault-injection: add ability to export fault_attr in arbitrary directory · dd48c085
      Akinobu Mita committed
      init_fault_attr_dentries() is used to export fault_attr via debugfs.
      But it can only export it in debugfs root directory.
      
      Per Forlin is working on mmc_fail_request which adds support to inject
      data errors after a completed host transfer in MMC subsystem.
      
      The fault_attr for mmc_fail_request should be defined per mmc host and
      exported in a per-host debugfs directory such as
      /sys/kernel/debug/mmc0/mmc_fail_request.

      init_fault_attr_dentries() doesn't help for mmc_fail_request.  So this
      introduces fault_create_debugfs_attr(), which can create the
      fault_attr directory in an arbitrary parent directory, and replaces
      init_fault_attr_dentries() with it.
      
      [akpm@linux-foundation.org: extraneous semicolon, per Randy]
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Tested-by: Per Forlin <per.forlin@linaro.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Randy Dunlap <rdunlap@xenotime.net>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cpuidle: stop depending on pm_idle · a0bfa137
      Len Brown committed
      cpuidle users should call cpuidle_call_idle() directly
      rather than via the (pm_idle)() function pointer.

      An architecture may choose to continue using (pm_idle)(),
      but cpuidle need not depend on it:

        my_arch_cpu_idle()
                ...
                if (cpuidle_call_idle())
                        pm_idle();
      
      cc: Kevin Hilman <khilman@deeprootsystems.com>
      cc: Paul Mundt <lethal@linux-sh.org>
      cc: x86@kernel.org
      Acked-by: H. Peter Anvin <hpa@linux.intel.com>
      Signed-off-by: Len Brown <len.brown@intel.com>
    • cpuidle: replace xen access to x86 pm_idle and default_idle · d91ee586
      Len Brown committed
      When a Xen Dom0 kernel boots on a hypervisor, it gets access
      to the raw-hardware ACPI tables.  While it parses the idle tables
      for the hypervisor's benefit, it uses HLT for its own idle.
      
      Rather than have xen scribble on pm_idle and access default_idle,
      have it simply disable_cpuidle() so acpi_idle will not load and
      architecture default HLT will be used.
      
      cc: xen-devel@lists.xensource.com
      Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: H. Peter Anvin <hpa@linux.intel.com>
      Signed-off-by: Len Brown <len.brown@intel.com>
  14. 03 Aug, 2011 4 commits
    • HWPoison: add memory_failure_queue() · ea8f5fb8
      Huang Ying committed
      memory_failure() is the entry point for HWPoison memory error
      recovery.  It must be called in process context.  But hardware
      memory errors are commonly reported via MCE or NMI, so some delayed
      execution mechanism must be used.  In the MCE handler, a work queue + ring
      buffer mechanism is used.

      In addition to MCE, APEI (ACPI Platform Error Interface) GHES
      (Generic Hardware Error Source) can now be used to report memory errors
      too.  To add support for APEI GHES memory recovery, a mechanism similar
      to that of MCE is implemented.  memory_failure_queue() is the new
      entry point that can be called in IRQ context.  The next step is to
      make the MCE handler use this interface too.
      Signed-off-by: Huang Ying <ying.huang@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Len Brown <len.brown@intel.com>
    • lib, Make gen_pool memory allocator lockless · 7f184275
      Huang Ying committed
      This version of the gen_pool memory allocator supports lockless
      operation.
      
      This makes it safe to use in NMI handlers and other special
      unblockable contexts that could otherwise deadlock on locks.  This is
      implemented by using atomic operations and retries on any conflicts.
      The disadvantage is that there may be livelocks in extreme cases.  For
      better scalability, one gen_pool allocator can be used for each CPU.
      
      The lockless operation only works if there is enough memory available.
      If new memory is added to the pool, a lock still has to be taken.  So
      any user relying on locklessness has to ensure that sufficient memory
      is preallocated.

      The basic atomic operation of this allocator is cmpxchg on long.  On
      architectures that don't have an NMI-safe cmpxchg implementation, the
      allocator can NOT be used in an NMI handler.  So code that uses the
      allocator in an NMI handler should depend on
      CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG.
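
The "atomic operations and retries on any conflicts" approach can be sketched in userspace with C11 atomics. This is not the gen_pool code: the `demo_` names are hypothetical, and the pool is reduced to a single-word bitmap in which each set bit marks one allocated chunk.

```c
#include <stdatomic.h>

/*
 * Claim the lowest free bit of a one-word bitmap without taking a lock.
 * Returns the bit index, or -1 if every bit is already in use.
 */
static int demo_bitmap_alloc(_Atomic unsigned long *map)
{
    unsigned long old = atomic_load(map);

    for (;;) {
        unsigned int bit = 0;

        if (old == ~0UL)
            return -1; /* pool exhausted */
        while (old & (1UL << bit))
            bit++;
        /*
         * Publish the claim only if nobody changed the word meanwhile.
         * On failure 'old' is refreshed with the current value and the
         * loop retries, which is where a livelock could in theory occur.
         */
        if (atomic_compare_exchange_weak(map, &old, old | (1UL << bit)))
            return (int)bit;
    }
}

/* Release a previously claimed bit. */
static void demo_bitmap_free(_Atomic unsigned long *map, int bit)
{
    atomic_fetch_and(map, ~(1UL << bit));
}
```

Because no lock is held at any point, a caller interrupted in the middle of `demo_bitmap_alloc()` cannot deadlock another caller; the worst case is an extra retry.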
      Signed-off-by: Huang Ying <ying.huang@intel.com>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Len Brown <len.brown@intel.com>
    • lib, Add lock-less NULL terminated single list · f49f23ab
      Huang Ying committed
      Cmpxchg is used to implement adding new entry to the list, deleting
      all entries from the list, deleting first entry of the list and some
      other operations.
      
      Because this is a singly linked list, the tail cannot be accessed in O(1).

      If there are multiple producers and multiple consumers, llist_add can
      be used in producers and llist_del_all can be used in consumers.  They
      can work simultaneously without a lock.  But llist_del_first cannot be
      used here, because llist_del_first depends on list->first->next not
      changing as long as list->first is unchanged during its operation, but
      a llist_del_first, llist_add, llist_add (or llist_del_all, llist_add,
      llist_add) sequence in another consumer may violate that.

      If there are multiple producers and one consumer, llist_add can be
      used in producers and llist_del_all or llist_del_first can be used in
      the consumer.

      This can be summarized as follows:
      
                 |   add    | del_first |  del_all
       add       |    -     |     -     |     -
       del_first |          |     L     |     L
       del_all   |          |           |     -
      
      Where "-" means that no lock is needed, while "L" means that a lock is
      needed.

      The list entries deleted via llist_del_all can be traversed with a
      traversing function such as llist_for_each.  But the list entries
      cannot be traversed safely before they are deleted from the list.  The
      order of deleted entries is from the newest to the oldest added one.  If
      you want to traverse from the oldest to the newest, you must reverse the
      order yourself before traversing.

      The basic atomic operation of this list is cmpxchg on long.  On
      architectures that don't have an NMI-safe cmpxchg implementation, the
      list can NOT be used in an NMI handler.  So code that uses the list in
      an NMI handler should depend on CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG.
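
The add, delete-all and reverse operations described above can be modelled in userspace with C11 atomics. This is a sketch of the technique, not the kernel's llist code: all `demo_` names and the `value` payload are assumptions for illustration.

```c
#include <stdatomic.h>
#include <stddef.h>

struct demo_node {
    struct demo_node *next;
    int value; /* payload, for illustration only */
};

struct demo_head {
    _Atomic(struct demo_node *) first;
};

static void demo_llist_init(struct demo_head *h)
{
    atomic_init(&h->first, (struct demo_node *)NULL);
}

/* Producer side: push a node at the front with a compare-and-swap loop. */
static void demo_llist_add(struct demo_node *n, struct demo_head *h)
{
    struct demo_node *old = atomic_load(&h->first);

    do {
        n->next = old; /* 'old' is refreshed whenever the swap fails */
    } while (!atomic_compare_exchange_weak(&h->first, &old, n));
}

/* Consumer side: detach the whole list in one atomic exchange. */
static struct demo_node *demo_llist_del_all(struct demo_head *h)
{
    return atomic_exchange(&h->first, (struct demo_node *)NULL);
}

/* Detached entries come newest first; reverse them to get oldest first. */
static struct demo_node *demo_llist_reverse(struct demo_node *n)
{
    struct demo_node *prev = NULL;

    while (n) {
        struct demo_node *next = n->next;

        n->next = prev;
        prev = n;
        n = next;
    }
    return prev;
}
```

Once `demo_llist_del_all()` returns, the detached chain is private to the caller, so it can be reversed and traversed without further synchronization.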
      Signed-off-by: Huang Ying <ying.huang@intel.com>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Len Brown <len.brown@intel.com>
    • dt: add of_alias_scan and of_alias_get_id · 750f463a
      Shawn Guo committed
      The patch adds function of_alias_scan to populate a global lookup
      table with the properties of 'aliases' node and function
      of_alias_get_id for drivers to find alias id from the lookup table.
      Signed-off-by: Shawn Guo <shawn.guo@linaro.org>
      [grant.likely: add locking and rework parse loop]
      Signed-off-by: Grant Likely <grant.likely@secretlab.ca>