  1. 13 Nov, 2013 3 commits
    • vsprintf: check real user/group id for %pK · 312b4e22
      Ryan Mallon committed
      Some setuid binaries will allow reading of files which have read
      permission by the real user id.  This is problematic with files which
      use %pK because the file access permission is checked at open() time,
      but the kptr_restrict setting is checked at read() time.  If a setuid
      binary opens a %pK file as an unprivileged user, and then elevates
      permissions before reading the file, then kernel pointer values may be
      leaked.
      
      This happens for example with the setuid pppd application on Ubuntu 12.04:
      
        $ head -1 /proc/kallsyms
        00000000 T startup_32
      
        $ pppd file /proc/kallsyms
        pppd: In file /proc/kallsyms: unrecognized option 'c1000000'
      
      This will only leak the pointer value from the first line, but other
      setuid binaries may leak more information.
      
      Fix this by checking that, in addition to the current process having
      CAP_SYSLOG, the effective user and group ids are equal to the real ids.
      If a setuid binary reads the contents of a file which uses %pK then the
      pointer values will be printed as NULL if the real user is unprivileged.
      
      Update the sysctl documentation to reflect the changes, and also correct
      the documentation to state that kptr_restrict=0 is the default.
      
      This is only a temporary solution to the issue.  The correct solution is
      to do the permission check at open() time on files, and to replace %pK
      with a function which checks the open() time permission.  %pK uses in
      printk should be removed since no sane permission check can be done, and
      instead protected by using dmesg_restrict.
      Signed-off-by: Ryan Mallon <rmallon@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Joe Perches <joe@perches.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
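The check described in the vsprintf commit above can be sketched in user-space C. This is only an illustration, not the kernel code: `struct cred_view`, `may_show_kptr()` and the `has_cap_syslog` flag are hypothetical names standing in for the kernel's credential and capability lookups.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical snapshot of the credentials the check looks at. */
struct cred_view {
    uid_t uid, euid;        /* real and effective user id */
    gid_t gid, egid;        /* real and effective group id */
    bool has_cap_syslog;    /* stands in for a CAP_SYSLOG capability test */
};

/*
 * Returns true if a %pK-style pointer may be shown: the process needs
 * CAP_SYSLOG, and its effective ids must equal its real ids, so a setuid
 * binary started by an unprivileged user is refused.
 */
static bool may_show_kptr(const struct cred_view *c)
{
    assert(c != NULL);
    return c->has_cap_syslog && c->uid == c->euid && c->gid == c->egid;
}
```

With this shape, a process whose real uid is 1000 but whose effective uid is 0 is refused even when the capability flag is set, which is the setuid case the commit message is concerned with.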
    • percpu: add test module for various percpu operations · 623fd807
      Greg Thelen committed
      Tests various percpu operations.
      
      Enable with CONFIG_PERCPU_TEST=m.
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: do not walk all of system memory during show_mem · c78e9363
      Mel Gorman committed
      It has been reported on very large machines that show_mem is taking almost
      5 minutes to display information.  This is a serious problem if there is
      an OOM storm.  The bulk of the cost is in show_mem doing a very expensive
      PFN walk to give us the following information:
      
        Total RAM:       Also available as totalram_pages
        Highmem pages:   Also available as totalhigh_pages
        Reserved pages:  Can be inferred from the zone structure
        Shared pages:    PFN walk required
        Unshared pages:  PFN walk required
        Quick pages:     Per-cpu walk required
      
      Only the shared/unshared page counts require a full PFN walk, but that
      information is useless.  It is also inaccurate as page pins of unshared
      pages would be accounted for as shared.  Even if the information was
      accurate, I'm struggling to think how the shared/unshared information
      could be useful for debugging OOM conditions.  Maybe it was useful before
      rmap existed when reclaiming shared pages was costly but it is less
      relevant today.
      
      The PFN walk could be optimised a bit but why bother as the information is
      useless.  This patch deletes the PFN walker and infers the total RAM,
      highmem and reserved pages count from struct zone.  It omits the
      shared/unshared page usage on the grounds that it is useless.  It also
      corrects the reporting of HighMem as HighMem/MovableOnly as ZONE_MOVABLE
      has similar problems to HighMem with respect to lowmem/highmem exhaustion.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
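As a rough illustration of inferring the show_mem totals from per-zone counters instead of walking every PFN, the sketch below sums a hypothetical `struct zone_view`. The field names, and the assumption that reserved pages equal present pages minus allocator-managed pages, are illustrative and are not taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the per-zone page counters. */
struct zone_view {
    unsigned long present_pages;  /* pages physically present in the zone */
    unsigned long managed_pages;  /* pages handed to the page allocator */
    bool is_highmem;              /* highmem or movable-only zone */
};

struct mem_totals {
    unsigned long total, reserved, highmem;
};

/* Sum the counters zone by zone; the cost is O(zones), not O(page frames). */
static struct mem_totals sum_zones(const struct zone_view *zones, size_t n)
{
    struct mem_totals t = {0, 0, 0};

    assert(zones != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        t.total += zones[i].present_pages;
        /* Assumption: reserved pages are present but not managed. */
        t.reserved += zones[i].present_pages - zones[i].managed_pages;
        if (zones[i].is_highmem)
            t.highmem += zones[i].present_pages;
    }
    return t;
}
```

The point of the design is that the loop bound is the number of zones, which stays small even on machines where a PFN walk takes minutes.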
  2. 07 Nov, 2013 1 commit
    • Revert "sysfs: drop kobj_ns_type handling" · a1212d27
      Linus Torvalds committed
      This reverts commit cb26a311.
      
      It mysteriously causes NetworkManager to not find the wireless device
      for me.  As far as I can tell, Tejun *meant* for this commit to not make
      any semantic changes, but there clearly are some.  So revert it, taking
      into account some of the calling convention changes that happened in
      this area in subsequent commits.
      
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 01 Nov, 2013 1 commit
    • lib/scatterlist.c: don't flush_kernel_dcache_page on slab page · 3d77b50c
      Ming Lei committed
      Commit b1adaf65 ("[SCSI] block: add sg buffer copy helper
      functions") introduces two sg buffer copy helpers, and calls
      flush_kernel_dcache_page() on pages in SG list after these pages are
      written to.
      
      Unfortunately, the commit may introduce a potential bug:
      
       - Before sending some SCSI commands, a kmalloc() buffer may be passed to
         the block layer, so flush_kernel_dcache_page() can end up seeing a
         slab page.
      
       - According to cachetlb.txt, flush_kernel_dcache_page() is only to be
         called on "a user page", which surely can't be a slab page.
      
       - An arch's implementation of flush_kernel_dcache_page() may use page
         mapping information for optimization, so page_mapping() will see the
         slab page and VM_BUG_ON() is triggered.
      
      Aaro Koskinen reported the bug on ARM/kirkwood when DEBUG_VM is enabled,
      and this patch fixes the bug by adding a test of '!PageSlab(miter->page)'
      before calling flush_kernel_dcache_page().
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Reported-by: Aaro Koskinen <aaro.koskinen@iki.fi>
      Tested-by: Simon Baatz <gmbnomis@gmail.com>
      Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Aaro Koskinen <aaro.koskinen@iki.fi>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: <stable@vger.kernel.org>	[3.2+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 29 Oct, 2013 1 commit
  5. 17 Oct, 2013 2 commits
  6. 15 Oct, 2013 1 commit
    • GFS2: Use lockref for glocks · e66cf161
      Steven Whitehouse committed
      Currently glocks have an atomic reference count and also a spinlock
      which covers various internal fields, such as the state. The intent of
      this patch is to replace the spinlock and the atomic reference count
      with a lockref structure. This contains a spinlock which we can continue
      to use as before, and a reference counter which is used in conjunction
      with the spinlock to replace the previous atomic counter.
      
      As a result of this there are some new rules for reference counting on
      glocks. We need to distinguish between reference count changes under
      gl_spin (which are now just increment or decrement of the new counter,
      provided the count cannot hit zero) and those which are outside of
      gl_spin, but which now take gl_spin internally.
      
      The conversion is relatively straightforward. There is probably some
      further clean up which can be done, but the priority at this stage is to
      make the change in as simple a manner as possible.
      
      A consequence of this change is that the reference count is being
      decoupled from the lru list processing. This should allow future
      adoption of the lru_list code with glocks in due course.
      
      The reason for using the "dead" state and not just relying on 0 being
      the "invalid state" is so that in due course 0 ref counts can be
      allowable. The intent is to eventually be able to remove the ref count
      changes which are currently hidden away in state_change().
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  7. 12 Oct, 2013 1 commit
  8. 11 Oct, 2013 1 commit
  9. 04 Oct, 2013 1 commit
    • kobject: grab an extra reference on kobject->sd to allow duplicate deletes · 26ea12de
      Tejun Heo committed
      sysfs currently has a rather weird behavior regarding removals.  A
      directory removal would delete all files directly under it but
      wouldn't recurse into subdirectories, which, while a bit inconsistent,
      seems to make sense at first glance as each directory is
      supposedly associated with a kobject and each kobject can take care of
      the directory deletion; however, this doesn't really hold as we have
      groups which can be directories without a kobject associated with them
      and require explicit deletions.
      
      We're in the process of separating out sysfs from kobject / driver
      core and want a consistent behavior.  A removal should delete either
      only the specified node or everything under it.  I think it is helpful
      to support recursive atomic removal and later patches will implement
      it.
      
      Such a change means that a sysfs_dirent associated with a kobject may be
      deleted before the kobject itself is removed if one of its ancestors
      gets removed before it.  As sysfs_remove_dir() puts the base ref, we
      may end up with a dangling pointer on descendants.  This can be solved
      by holding an extra reference on the sd from kobject.
      
      Acquire an extra reference on the associated sysfs_dirent on directory
      creation and put it after removal.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  10. 28 Sep, 2013 3 commits
  11. 27 Sep, 2013 4 commits
    • kobject: introduce kobj_completion · eee03164
      Jeff Mahoney committed
      A common way to handle kobject lifetimes when embedded in objects with
      different lifetime rules is to pair the kobject with a struct completion.
      
      This introduces a kobj_completion structure that can be used in place
      of the pairing, along with several convenience functions for
      initialization, release, and put-and-wait.
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • sysfs: drop kobj_ns_type handling · cb26a311
      Tejun Heo committed
      The way namespace tags are implemented in sysfs is more complicated
      than necessary.  As each tag is a pointer value and required to be
      non-NULL under a namespace enabled parent, there's no need to record
      separately what type each tag is or where namespace is enabled.
      
      If multiple namespace types are needed, which currently aren't, we can
      simply compare the tag to a set of allowed tags in the superblock
      assuming that the tags, being pointers, won't have the same value
      across multiple types.  Also, whether to filter by namespace tag or
      not can be trivially determined by whether the node has any tagged
      children or not.
      
      This patch rips out kobj_ns_type handling from sysfs.  sysfs no longer
      cares whether a specific type of namespace is enabled or not.  If a
      sysfs_dirent has a non-NULL tag, the parent is marked as needing
      namespace filtering and the value is tested against the allowed set of
      tags for the superblock (currently only one but increasing this number
      isn't difficult) and the sysfs_dirent is ignored if it doesn't match.
      
      This removes most kobject namespace knowledge from sysfs proper which
      will enable proper separation and layering of sysfs.  The namespace
      sanity checks in fs/sysfs/dir.c are replaced by the new sanity check
      in kobject_namespace().  As this is the only place ktype->namespace()
      is called for sysfs, this doesn't weaken the sanity check
      significantly.  I omitted converting the sanity check in
      sysfs_do_create_link_sd().  While the check can be shifted to upper
      layer, mistakes there are well contained and should be easily visible
      anyway.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kay Sievers <kay@vrfy.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • sysfs: remove ktype->namespace() invocations in directory code · e34ff490
      Tejun Heo committed
      For some unrecognizable reason, namespace information is communicated
      to sysfs through ktype->namespace() callback when there's *nothing*
      which needs the use of a callback.  The whole sequence of operations
      is completely synchronous and sysfs operations simply end up calling
      back into the layer which just invoked it in order to find out the
      namespace information, which is completely backwards, obfuscates
      what's going on and unnecessarily tangles two separate layers.
      
      This patch doesn't remove ktype->namespace() but shifts its handling
      to kobject layer.  We probably want to get rid of the callback in the
      long term.
      
      This patch adds an explicit param to sysfs_{create|rename|move}_dir()
      and renames them to sysfs_{create|rename|move}_dir_ns(), respectively.
      ktype->namespace() invocations are moved to the calling sites of the
      above functions.  A new helper kobject_namespace() is introduced which
      directly tests kobj_ns_type_operations->type which should give the
      same result as testing sysfs_fs_type(parent_sd) and returns @kobj's
      namespace tag as necessary.  kobject_namespace() is extern as it will
      be used from another file in the following patches.
      
      This patch should be an equivalent conversion without any functional
      difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kay Sievers <kay@vrfy.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • sysfs: Allow mounting without CONFIG_NET · 667b4102
      Eric W. Biederman committed
      In kobj_ns_current_may_mount the default should be to allow the
      mount.  The test is only for a single kobj_ns_type at a time, and unless
      there is a reason to prevent it, mounting sysfs should be allowed.
      Subsystems that are not registered are not involved, so they can't
      have a reason to prevent mounting sysfs.
      
      This is a bug-fix to:
          commit 7dc5dbc8
          Author: Eric W. Biederman <ebiederm@xmission.com>
          Date:   Mon Mar 25 20:07:01 2013 -0700
      
              sysfs: Restrict mounting sysfs
      
              Don't allow mounting sysfs unless the caller has CAP_SYS_ADMIN rights
              over the net namespace.  The principle here is if you create or have
              capabilities over it you can mount it, otherwise you get to live with
              what other people have mounted.
      
              Instead of testing this with a straightforward ns_capable call,
              perform this check the long and torturous way with kobject helpers,
              this keeps direct knowledge of namespaces out of sysfs, and preserves
              the existing sysfs abstractions.
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      
      That came in via the userns tree during the 3.12 merge window.
      Reported-by: James Hogan <james.hogan@imgtec.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
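The "allow by default" behaviour described in the sysfs mounting fix above can be sketched as follows. This is a user-space illustration under stated assumptions: the `ns_ops` table, `NS_TYPES`, `current_may_mount()` and the example callback `deny_mount()` are hypothetical names, and the kernel's locking is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef bool (*may_mount_fn)(void);

enum { NS_TYPES = 2 };                  /* hypothetical number of namespace types */

/* One optional callback per namespace type; NULL means "not registered". */
static may_mount_fn ns_ops[NS_TYPES];

/* Example callback for a registered subsystem that refuses the mount. */
static bool deny_mount(void)
{
    return false;
}

/*
 * Default to allowing the mount; only a registered subsystem can veto it.
 * An unregistered type is not involved and therefore cannot object.
 */
static bool current_may_mount(int type)
{
    bool may_mount = true;

    assert(type >= 0 && type < NS_TYPES);
    if (ns_ops[type] != NULL)
        may_mount = ns_ops[type]();
    return may_mount;
}
```

Starting from `true` rather than `false` is the whole fix in miniature: with no subsystem registered, as in a build without networking support, the answer stays "allowed".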
  12. 25 Sep, 2013 1 commit
  13. 21 Sep, 2013 2 commits
  14. 13 Sep, 2013 1 commit
  15. 12 Sep, 2013 11 commits
  16. 10 Sep, 2013 1 commit
    • idr: Percpu ida · 798ab48e
      Kent Overstreet committed
      Percpu frontend for allocating ids. With percpu allocation (that works),
      it's impossible to guarantee it will always be possible to allocate all
      nr_tags - typically, some will be stuck on a remote percpu freelist
      where the current job can't get to them.
      
      We do guarantee that it will always be possible to allocate at least
      (nr_tags / 2) tags - this is done by keeping track of which and how many
      cpus have tags on their percpu freelists. On allocation failure if
      enough cpus have tags that there could potentially be (nr_tags / 2) tags
      stuck on remote percpu freelists, we then pick a remote cpu at random to
      steal from.
      
      Note that there's no cpu hotplug notifier - we don't care, because
      steal_tags() will eventually get the down cpu's tags. We _could_ satisfy
      more allocations if we had a notifier - but we'll still meet our
      guarantees and it's absolutely not a correctness issue, so I don't think
      it's worth the extra code.
      
      From akpm:
      
          "It looks OK to me (that's as close as I get to an ack :))"
      
      v6 changes:
        - Add #include <linux/cpumask.h> to include/linux/percpu_ida.h to
          make alpha/arc builds happy (Fengguang)
        - Move second (cpu >= nr_cpu_ids) check inside of first check scope
          in steal_tags() (akpm + nab)
      
      v5 changes:
        - Change percpu_ida->cpus_have_tags to cpumask_t (kmo + akpm)
        - Add comment for percpu_ida_cpu->lock + ->nr_free (kmo + akpm)
        - Convert steal_tags() to use cpumask_weight() + cpumask_next() +
          cpumask_first() + cpumask_clear_cpu() (kmo + akpm)
        - Add comment for alloc_global_tags() (kmo + akpm)
        - Convert percpu_ida_alloc() to use cpumask_set_cpu() (kmo + akpm)
        - Convert percpu_ida_free() to use cpumask_set_cpu() (kmo + akpm)
        - Drop percpu_ida->cpus_have_tags allocation in percpu_ida_init()
          (kmo + akpm)
        - Drop percpu_ida->cpus_have_tags kfree in percpu_ida_destroy()
          (kmo + akpm)
        - Add comment for percpu_ida_alloc @ gfp (kmo + akpm)
        - Move to percpu_ida.c + percpu_ida.h (kmo + akpm + nab)
      
      v4 changes:
      
        - Fix tags.c reference in percpu_ida_init (akpm)
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
      Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
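The fallback order behind the percpu ida commit above (local freelist, then shared pool, then stealing from a remote cpu) can be sketched in a single-threaded toy model. Everything here is an assumption made for illustration: `struct ida_sketch` tracks only counts of free tags rather than tag values, it steals from the first remote cpu that has tags instead of picking one at random, and it does not model the (nr_tags / 2) threshold or any locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { SKETCH_CPUS = 4 };               /* hypothetical cpu count */

/* Toy model: only the number of free tags in each place is tracked. */
struct ida_sketch {
    int global_free;                    /* tags in the shared pool */
    int cpu_free[SKETCH_CPUS];          /* tags cached on each cpu's freelist */
};

/* Try to allocate one tag on behalf of the given cpu. */
static bool ida_sketch_alloc(struct ida_sketch *ida, int cpu)
{
    assert(ida != NULL && cpu >= 0 && cpu < SKETCH_CPUS);

    if (ida->cpu_free[cpu] > 0) {       /* 1. local freelist */
        ida->cpu_free[cpu]--;
        return true;
    }
    if (ida->global_free > 0) {         /* 2. shared pool */
        ida->global_free--;
        return true;
    }
    for (int i = 0; i < SKETCH_CPUS; i++) {   /* 3. steal from a remote cpu */
        if (i != cpu && ida->cpu_free[i] > 0) {
            ida->cpu_free[i]--;
            return true;
        }
    }
    return false;                       /* every tag is in use */
}
```

The model shows why tags can be stranded: a cpu with an empty local list and an empty shared pool succeeds only because it is allowed to reach into another cpu's freelist.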
  17. 08 Sep, 2013 2 commits
    • lockref: add ability to mark lockrefs "dead" · e7d33bb5
      Linus Torvalds committed
      The only actual current lockref user (dcache) uses zero reference counts
      even for perfectly live dentries, because it's a cache: there may not be
      any users, but that doesn't mean that we want to throw away the dentry.
      
      At the same time, the dentry cache does have a notion of a truly "dead"
      dentry that we must not even increment the reference count of, because
      we have pruned it and it is not valid.
      
      Currently that distinction is not visible in the lockref itself, and the
      dentry cache validation uses "lockref_get_or_lock()" to either get a new
      reference to a dentry that already had existing references (and thus
      cannot be dead), or get the dentry lock so that we can then verify the
      dentry and increment the reference count under the lock if that
      verification was successful.
      
      That's all somewhat complicated.
      
      This adds the concept of being "dead" to the lockref itself, by simply
      using a count that is negative.  This allows a usage scenario where we
      can increment the refcount of a dentry without having to validate it,
      and pushing the special "we killed it" case into the lockref code.
      
      The dentry code itself doesn't actually use this yet, and it's probably
      too late in the merge window to do that code (the dentry_kill() code
      with its "should I decrement the count" logic really is pretty complex
      code), but let's introduce the concept at the lockref level now.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
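The idea of encoding "dead" as a negative count, as the lockref commit above describes, can be sketched with C11 atomics. This is an assumption-laden illustration rather than the kernel's lockref: `struct refcount_sketch` and its helpers are hypothetical names, and the sketch models only the counter, not the spinlock that a real lockref pairs it with.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical counter: >= 0 means live (zero is allowed), < 0 means dead. */
struct refcount_sketch {
    atomic_int count;
};

static void refcount_sketch_init(struct refcount_sketch *r)
{
    assert(r != NULL);
    atomic_init(&r->count, 0);
}

/* Mark the object dead; later attempts to take a reference will fail. */
static void refcount_sketch_mark_dead(struct refcount_sketch *r)
{
    assert(r != NULL);
    atomic_store(&r->count, -1);
}

/* Take a reference unless the object is dead. Returns true on success. */
static bool refcount_sketch_get_not_dead(struct refcount_sketch *r)
{
    int old;

    assert(r != NULL);
    old = atomic_load(&r->count);
    while (old >= 0) {
        /* On failure, old is refreshed with the current value and we retry. */
        if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
            return true;
    }
    return false;
}
```

Because a zero count is still "live" here, a cache can keep unused objects around, and callers no longer need a separate validation step to tell an idle object from a killed one.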
    • lockref: fix docbook argument names · 44a0cf92
      Linus Torvalds committed
      The code got rewritten, but the comments got copied as-is from older
      versions, and as a result the argument name in the comment didn't
      actually match the code any more.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. 07 Sep, 2013 1 commit
  19. 05 Sep, 2013 1 commit
    • Kconfig.debug: Add FRAME_POINTER anti-dependency for ARC · cc80ae38
      Vineet Gupta committed
      Frame pointer on ARC doesn't serve the conventional purpose of stack
      unwinding due to the typical way the ABI designates its usage.
      Thus its explicit usage on ARC is discouraged (gcc is free to use it
      for some tricky stack frames, even with -fomit-frame-pointer).
      
      Hence no point enabling it for ARC.
      
      References: http://www.spinics.net/lists/kernel/msg1593937.html
      Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: "Paul E. McKenney" <paul.mckenney@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: linux-kernel@vger.kernel.org
  20. 04 Sep, 2013 1 commit
    • add formats for dentry/file pathnames · 4b6ccca7
      Al Viro committed
      New formats: %p[dD][234]?.  The next pointer is interpreted as struct dentry *
      or struct file * resp. ('d' => dentry, 'D' => file) and the last component(s)
      of pathname are printed (%pd => just the last one, %pd2 => the last two, etc.)
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
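To illustrate what "the last component(s) of pathname" means for the formats in the commit above, here is a small user-space sketch that works on a plain path string. It is an analogy only: the kernel formats take a dentry or file pointer rather than a string, and `last_components()` is a hypothetical helper that assumes the path has no trailing slash.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return a pointer to the last n components of path (n >= 1).
 * If the path has fewer than n components, the whole path is returned.
 */
static const char *last_components(const char *path, int n)
{
    const char *p;

    assert(path != NULL && n >= 1);
    p = path + strlen(path);
    while (p > path) {
        /* Stop once the n-th separator from the end sits just before p. */
        if (p[-1] == '/' && --n == 0)
            break;
        p--;
    }
    return p;
}
```

For a path such as `/usr/lib/foo.so`, asking for one component selects `foo.so` and asking for two selects `lib/foo.so`, which mirrors the "just the last one" and "the last two" wording in the commit message.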