1. 14 5月, 2014 2 次提交
    • T
      cgroup: implement cftype->write() · b4168640
      Tejun Heo 提交于
      During the recent conversion to kernfs, cftype's seq_file operations
      are updated so that they are directly mapped to kernfs operations and
      thus can fully access the associated kernfs and cgroup contexts;
      however, write path hasn't seen similar updates and none of the
      existing write operations has access to, for example, the associated
      kernfs_open_file.
      
      Let's introduce a new operation cftype->write() which maps directly to
      the kernfs write operation and has access to all the arguments and
      contexts.  This will replace ->write_string() and ->trigger() and ease
      manipulation of kernfs active protection from cgroup file operations.
      
      Two accessors - of_cft() and of_css() - are introduced to enable
      accessing the associated cgroup context from cftype->write() which
      only takes kernfs_open_file for the context information.  The
      accessors for seq_file operations - seq_cft() and seq_css() - are
      rewritten to wrap the of_ accessors.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      b4168640
    • T
      cgroup: rename css_tryget*() to css_tryget_online*() · ec903c0c
      Tejun Heo 提交于
      Unlike the more usual refcnting, what css_tryget() provides is the
      distinction between online and offline csses instead of protection
      against upping a refcnt which already reached zero.  cgroup is
      planning to provide actual tryget which fails if the refcnt already
      reached zero.  Let's rename the existing trygets so that they clearly
      indicate that they're onliness.
      
      I thought about keeping the existing names as-are and introducing new
      names for the planned actual tryget; however, given that each
      controller participates in the synchronization of the online state, it
      seems worthwhile to make it explicit that these functions are about
      on/offline state.
      
      Rename css_tryget() to css_tryget_online() and css_tryget_from_dir()
      to css_tryget_online_from_dir().  This is pure rename.
      
      v2: cgroup_freezer grew new usages of css_tryget().  Update
          accordingly.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      ec903c0c
  2. 13 5月, 2014 1 次提交
    • T
      cgroup: introduce task_css_is_root() · 5024ae29
      Tejun Heo 提交于
      Determining the css of a task usually requires RCU read lock as that's
      the only thing which keeps the returned css accessible till its
      reference is acquired; however, testing whether a task belongs to the
      root can be performed without dereferencing the returned css by
      comparing the returned pointer against the root one in init_css_set[]
      which never changes.
      
      Implement task_css_is_root() which can be invoked in any context.
      This will be used by the scheduled cgroup_freezer change.
      
      v2: cgroup no longer supports modular controllers.  No need to export
          init_css_set.  Pointed out by Li.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      5024ae29
  3. 10 5月, 2014 2 次提交
  4. 07 5月, 2014 1 次提交
  5. 05 5月, 2014 3 次提交
    • T
      cgroup, memcg: implement css->id and convert css_from_id() to use it · 15a4c835
      Tejun Heo 提交于
      Until now, cgroup->id has been used to identify all the associated
      csses and css_from_id() takes cgroup ID and returns the matching css
      by looking up the cgroup and then dereferencing the css associated
      with it; however, now that the lifetimes of cgroup and css are
      separate, this is incorrect and breaks on the unified hierarchy when a
      controller is disabled and enabled back again before the previous
      instance is released.
      
      This patch adds css->id which is a subsystem-unique ID and converts
      css_from_id() to look up by the new css->id instead.  memcg is the
      only user of css_from_id() and also converted to use css->id instead.
      
      For traditional hierarchies, this shouldn't make any functional
      difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jianyu Zhan <nasa4836@gmail.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      15a4c835
    • T
      cgroup, memcg: allocate cgroup ID from 1 · 7d699ddb
      Tejun Heo 提交于
      Currently, cgroup->id is allocated from 0, which is always assigned to
      the root cgroup; unfortunately, memcg wants to use ID 0 to indicate
      invalid IDs and ends up incrementing all IDs by one.
      
      It's reasonable to reserve 0 for special purposes.  This patch updates
      cgroup core so that ID 0 is not used and the root cgroups get ID 1.
      The ID incrementing is removed form memcg.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      7d699ddb
    • T
      cgroup: make flags and subsys_masks unsigned int · 69dfa00c
      Tejun Heo 提交于
      There's no reason to use atomic bitops for cgroup_subsys_state->flags,
      cgroup_root->flags and various subsys_masks.  This patch updates those
      to use bitwise and/or operations instead and converts them form
      unsigned long to unsigned int.
      
      This makes the fields occupy (marginally) smaller space and makes it
      clear that they don't require atomicity.
      
      This patch doesn't cause any behavior difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      69dfa00c
  6. 26 4月, 2014 3 次提交
    • T
      cgroup: implement cgroup.populated for the default hierarchy · 842b597e
      Tejun Heo 提交于
      cgroup users often need a way to determine when a cgroup's
      subhierarchy becomes empty so that it can be cleaned up.  cgroup
      currently provides release_agent for it; unfortunately, this mechanism
      is riddled with issues.
      
      * It delivers events by forking and execing a userland binary
        specified as the release_agent.  This is a long deprecated method of
        notification delivery.  It's extremely heavy, slow and cumbersome to
        integrate with larger infrastructure.
      
      * There is single monitoring point at the root.  There's no way to
        delegate management of a subtree.
      
      * The event isn't recursive.  It triggers when a cgroup doesn't have
        any tasks or child cgroups.  Events for internal nodes trigger only
        after all children are removed.  This again makes it impossible to
        delegate management of a subtree.
      
      * Events are filtered from the kernel side.  "notify_on_release" file
        is used to subscribe to or suppress release event.  This is
        unnecessarily complicated and probably done this way because event
        delivery itself was expensive.
      
      This patch implements interface file "cgroup.populated" which can be
      used to monitor whether the cgroup's subhierarchy has tasks in it or
      not.  Its value is 0 if there is no task in the cgroup and its
      descendants; otherwise, 1, and kernfs_notify() notificaiton is
      triggers when the value changes, which can be monitored through poll
      and [di]notify.
      
      This is a lot ligther and simpler and trivially allows delegating
      management of subhierarchy - subhierarchy monitoring can block further
      propgation simply by putting itself or another process in the root of
      the subhierarchy and monitor events that it's interested in from there
      without interfering with monitoring higher in the tree.
      
      v2: Patch description updated as per Serge.
      
      v3: "cgroup.subtree_populated" renamed to "cgroup.populated".  The
          subtree_ prefix was a bit confusing because
          "cgroup.subtree_control" uses it to denote the tree rooted at the
          cgroup sans the cgroup itself while the populated state includes
          the cgroup itself.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NSerge Hallyn <serge.hallyn@ubuntu.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Lennart Poettering <lennart@poettering.net>
      842b597e
    • M
      kobject: Make support for uevent_helper optional. · 86d56134
      Michael Marineau 提交于
      Support for uevent_helper, aka hotplug, is not required on many systems
      these days but it can still be enabled via sysfs or sysctl.
      Reported-by: NDarren Shepherd <darren.s.shepherd@gmail.com>
      Signed-off-by: NMichael Marineau <mike@marineau.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      86d56134
    • T
      kernfs: implement kernfs_root->supers list · 7d568a83
      Tejun Heo 提交于
      Currently, there's no way to find out which super_blocks are
      associated with a given kernfs_root.  Let's implement it - the planned
      inotify extension to kernfs_notify() needs it.
      
      Make kernfs_super_info point back to the super_block and chain it at
      kernfs_root->supers.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7d568a83
  7. 23 4月, 2014 6 次提交
    • T
      cgroup: implement dynamic subtree controller enable/disable on the default hierarchy · f8f22e53
      Tejun Heo 提交于
      cgroup is switching away from multiple hierarchies and will use one
      unified default hierarchy where controllers can be dynamically enabled
      and disabled per subtree.  The default hierarchy will serve as the
      unified hierarchy to which all controllers are attached and a css on
      the default hierarchy would need to also serve the tasks of descendant
      cgroups which don't have the controller enabled - ie. the tree may be
      collapsed from leaf towards root when viewed from specific
      controllers.  This has been implemented through effective css in the
      previous patches.
      
      This patch finally implements dynamic subtree controller
      enable/disable on the default hierarchy via a new knob -
      "cgroup.subtree_control" which controls which controllers are enabled
      on the child cgroups.  Let's assume a hierarchy like the following.
      
        root - A - B - C
                     \ D
      
      root's "cgroup.subtree_control" determines which controllers are
      enabled on A.  A's on B.  B's on C and D.  This coincides with the
      fact that controllers on the immediate sub-level are used to
      distribute the resources of the parent.  In fact, it's natural to
      assume that resource control knobs of a child belong to its parent.
      Enabling a controller in "cgroup.subtree_control" declares that
      distribution of the respective resources of the cgroup will be
      controlled.  Note that this means that controller enable states are
      shared among siblings.
      
      The default hierarchy has an extra restriction - only cgroups which
      don't contain any task may have controllers enabled in
      "cgroup.subtree_control".  Combined with the other properties of the
      default hierarchy, this guarantees that, from the view point of
      controllers, tasks are only on the leaf cgroups.  In other words, only
      leaf csses may contain tasks.  This rules out situations where child
      cgroups compete against internal tasks of the parent, which is a
      competition between two different types of entities without any clear
      way to determine resource distribution between the two.  Different
      controllers handle it differently and all the implemented behaviors
      are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
      structural constraints imposed from cgroup core removes the burden
      from controller implementations and enables showing one consistent
      behavior across all controllers.
      
      When a controller is enabled or disabled, css associations for the
      controller in the subtrees of each child should be updated.  After
      enabling, the whole subtree of a child should point to the new css of
      the child.  After disabling, the whole subtree of a child should point
      to the cgroup's css.  This is implemented by first updating cgroup
      states such that cgroup_e_css() result points to the appropriate css
      and then invoking cgroup_update_dfl_csses() which migrates all tasks
      in the affected subtrees to the self cgroup on the default hierarchy.
      
      * When read, "cgroup.subtree_control" lists all the currently enabled
        controllers on the children of the cgroup.
      
      * White-space separated list of controller names prefixed with either
        '+' or '-' can be written to "cgroup.subtree_control".  The ones
        prefixed with '+' are enabled on the controller and '-' disabled.
      
      * A controller can be enabled iff the parent's
        "cgroup.subtree_control" enables it and disabled iff no child's
        "cgroup.subtree_control" has it enabled.
      
      * If a cgroup has tasks, no controller can be enabled via
        "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
        some controllers enabled, tasks can't be migrated into the cgroup.
      
      * All controllers which aren't bound on other hierarchies are
        automatically associated with the root cgroup of the default
        hierarchy.  All the controllers which are bound to the default
        hierarchy are listed in the read-only file "cgroup.controllers" in
        the root directory.
      
      * "cgroup.controllers" in all non-root cgroups is read-only file whose
        content is equal to that of "cgroup.subtree_control" of the parent.
        This indicates which controllers can be used in the cgroup's
        "cgroup.subtree_control".
      
      This is still experimental and there are some holes, one of which is
      that ->can_attach() failure during cgroup_update_dfl_csses() may leave
      the cgroups in an undefined state.  The issues will be addressed by
      future patches.
      
      v2: Non-root cgroups now also have "cgroup.controllers".
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      f8f22e53
    • T
      cgroup: add css_set->dfl_cgrp · 6803c006
      Tejun Heo 提交于
      To implement the unified hierarchy behavior, we'll need to be able to
      determine the associated cgroup on the default hierarchy from css_set.
      Let's add css_set->dfl_cgrp so that it can be accessed conveniently
      and efficiently.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      6803c006
    • T
      cgroup: teach css_task_iter about effective csses · 3ebb2b6e
      Tejun Heo 提交于
      Currently, css_task_iter iterates tasks associated with a css by
      visiting each css_set associated with the owning cgroup and walking
      tasks of each of them.  This works fine for !unified hierarchies as
      each cgroup has its own css for each associated subsystem on the
      hierarchy; however, on the planned unified hierarchy, a cgroup may not
      have csses associated and its tasks would be considered associated
      with the matching css of the nearest ancestor which has the subsystem
      enabled.
      
      This means that on the default unified hierarchy, just walking all
      tasks associated with a cgroup isn't enough to walk all tasks which
      are associated with the specified css.  If any of its children doesn't
      have the matching css enabled, task iteration should also include all
      tasks from the subtree.  We already added cgroup->e_csets[] to list
      all css_sets effectively associated with a given css and walk css_sets
      on that list instead to achieve such iteration.
      
      This patch updates css_task_iter iteration such that it walks css_sets
      on cgroup->e_csets[] instead of cgroup->cset_links if iteration is
      requested on an non-dummy css.  Thanks to the previous iteration
      update, this change can be achieved with the addition of
      css_task_iter->ss and minimal updates to css_advance_task_iter() and
      css_task_iter_start().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      3ebb2b6e
    • T
      cgroup: reorganize css_task_iter · 0f0a2b4f
      Tejun Heo 提交于
      This patch reorganizes css_task_iter so that adding effective css
      support is easier.
      
      * s/->cset_link/->cset_pos/ and s/->task/->task_pos/ for consistency
      
      * ->origin_css is used to determine whether the iteration reached the
        last css_set.  Replace it with explicit ->cset_head so that
        css_advance_task_iter() doesn't have to know the termination
        condition directly.
      
      * css_task_iter_next() currently assumes that it's walking list of
        cgrp_cset_link and reaches into the current cset through the current
        link to determine the termination conditions for task walking.  As
        this won't always be true for effective css walking, add
        ->tasks_head and ->mg_tasks_head and use them to control task
        walking so that css_task_iter_next() doesn't have to know how
        css_sets are being walked.
      
      This patch doesn't make any behavior changes.  The iteration logic
      stays unchanged after the patch.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      0f0a2b4f
    • T
      cgroup: implement cgroup->e_csets[] · 2d8f243a
      Tejun Heo 提交于
      On the default unified hierarchy, a cgroup may be associated with
      csses of its ancestors, which means that a css of a given cgroup may
      be associated with css_sets of descendant cgroups.  This means that we
      can't walk all tasks associated with a css by iterating the css_sets
      associated with the cgroup as there are css_sets which are pointing to
      the css but linked on the descendants.
      
      This patch adds per-subsystem list heads cgroup->e_csets[].  Any
      css_set which is pointing to a css is linked to
      css->cgroup->e_csets[$SUBSYS_ID] through
      css_set->e_cset_node[$SUBSYS_ID].  The lists are protected by
      css_set_rwsem and will allow us to walk all css_sets associated with a
      given css so that we can find out all associated tasks.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      2d8f243a
    • T
      cgroup: update cgroup->subsys_mask to ->child_subsys_mask and restore cgroup_root->subsys_mask · f392e51c
      Tejun Heo 提交于
      94419627 ("cgroup: move ->subsys_mask from cgroupfs_root to
      cgroup") moved ->subsys_mask from cgroup_root to cgroup to prepare for
      the unified hierarhcy; however, it turns out that carrying the
      subsys_mask of the children in the parent, instead of itself, is a lot
      more natural.  This patch restores cgroup_root->subsys_mask and morphs
      cgroup->subsys_mask into cgroup->child_subsys_mask.
      
      * Uses of root->cgrp.subsys_mask are restored to root->subsys_mask.
      
      * Remove automatic setting and clearing of cgrp->subsys_mask and
        instead just inherit ->child_subsys_mask from the parent during
        cgroup creation.  Note that this doesn't affect any current
        behaviors.
      
      * Undo __kill_css() separation.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      f392e51c
  8. 19 4月, 2014 4 次提交
    • M
      mm: use paravirt friendly ops for NUMA hinting ptes · 29c77870
      Mel Gorman 提交于
      David Vrabel identified a regression when using automatic NUMA balancing
      under Xen whereby page table entries were getting corrupted due to the
      use of native PTE operations.  Quoting him
      
      	Xen PV guest page tables require that their entries use machine
      	addresses if the preset bit (_PAGE_PRESENT) is set, and (for
      	successful migration) non-present PTEs must use pseudo-physical
      	addresses.  This is because on migration MFNs in present PTEs are
      	translated to PFNs (canonicalised) so they may be translated back
      	to the new MFN in the destination domain (uncanonicalised).
      
      	pte_mknonnuma(), pmd_mknonnuma(), pte_mknuma() and pmd_mknuma()
      	set and clear the _PAGE_PRESENT bit using pte_set_flags(),
      	pte_clear_flags(), etc.
      
      	In a Xen PV guest, these functions must translate MFNs to PFNs
      	when clearing _PAGE_PRESENT and translate PFNs to MFNs when setting
      	_PAGE_PRESENT.
      
      His suggested fix converted p[te|md]_[set|clear]_flags to using
      paravirt-friendly ops but this is overkill.  He suggested an alternative
      of using p[te|md]_modify in the NUMA page table operations but this is
      does more work than necessary and would require looking up a VMA for
      protections.
      
      This patch modifies the NUMA page table operations to use paravirt
      friendly operations to set/clear the flags of interest.  Unfortunately
      this will take a performance hit when updating the PTEs on
      CONFIG_PARAVIRT but I do not see a way around it that does not break
      Xen.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NDavid Vrabel <david.vrabel@citrix.com>
      Tested-by: NDavid Vrabel <david.vrabel@citrix.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Steven Noonan <steven@uplinklabs.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      29c77870
    • P
      wait: explain the shadowing and type inconsistencies · 8b32201d
      Peter Zijlstra 提交于
      Stick in a comment before someone else tries to fix the sparse warning
      this generates.
      Suggested-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-o2ro6f3vkxklni0bc8f7m68s@git.kernel.orgSigned-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8b32201d
    • V
      Shiraz has moved · 9cc23682
      Viresh Kumar 提交于
      shiraz.hashim@st.com email-id doesn't exist anymore as he has left the
      company.  Replace ST's id with shiraz.linux.kernel@gmail.com.
      
      It also updates .mailmap file to fix address for 'git shortlog'.
      Signed-off-by: NViresh Kumar <viresh.kumar@linaro.org>
      Cc: Shiraz Hashim <shiraz.linux.kernel@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9cc23682
    • V
      net: sctp: cache auth_enable per endpoint · b14878cc
      Vlad Yasevich 提交于
      Currently, it is possible to create an SCTP socket, then switch
      auth_enable via sysctl setting to 1 and crash the system on connect:
      
      Oops[#1]:
      CPU: 0 PID: 0 Comm: swapper Not tainted 3.14.1-mipsgit-20140415 #1
      task: ffffffff8056ce80 ti: ffffffff8055c000 task.ti: ffffffff8055c000
      [...]
      Call Trace:
      [<ffffffff8043c4e8>] sctp_auth_asoc_set_default_hmac+0x68/0x80
      [<ffffffff8042b300>] sctp_process_init+0x5e0/0x8a4
      [<ffffffff8042188c>] sctp_sf_do_5_1B_init+0x234/0x34c
      [<ffffffff804228c8>] sctp_do_sm+0xb4/0x1e8
      [<ffffffff80425a08>] sctp_endpoint_bh_rcv+0x1c4/0x214
      [<ffffffff8043af68>] sctp_rcv+0x588/0x630
      [<ffffffff8043e8e8>] sctp6_rcv+0x10/0x24
      [<ffffffff803acb50>] ip6_input+0x2c0/0x440
      [<ffffffff8030fc00>] __netif_receive_skb_core+0x4a8/0x564
      [<ffffffff80310650>] process_backlog+0xb4/0x18c
      [<ffffffff80313cbc>] net_rx_action+0x12c/0x210
      [<ffffffff80034254>] __do_softirq+0x17c/0x2ac
      [<ffffffff800345e0>] irq_exit+0x54/0xb0
      [<ffffffff800075a4>] ret_from_irq+0x0/0x4
      [<ffffffff800090ec>] rm7k_wait_irqoff+0x24/0x48
      [<ffffffff8005e388>] cpu_startup_entry+0xc0/0x148
      [<ffffffff805a88b0>] start_kernel+0x37c/0x398
      Code: dd0900b8  000330f8  0126302d <dcc60000> 50c0fff1  0047182a  a48306a0
      03e00008  00000000
      ---[ end trace b530b0551467f2fd ]---
      Kernel panic - not syncing: Fatal exception in interrupt
      
      What happens while auth_enable=0 in that case is, that
      ep->auth_hmacs is initialized to NULL in sctp_auth_init_hmacs()
      when endpoint is being created.
      
      After that point, if an admin switches over to auth_enable=1,
      the machine can crash due to NULL pointer dereference during
      reception of an INIT chunk. When we enter sctp_process_init()
      via sctp_sf_do_5_1B_init() in order to respond to an INIT chunk,
      the INIT verification succeeds and while we walk and process
      all INIT params via sctp_process_param() we find that
      net->sctp.auth_enable is set, therefore do not fall through,
      but invoke sctp_auth_asoc_set_default_hmac() instead, and thus,
      dereference what we have set to NULL during endpoint
      initialization phase.
      
      The fix is to make auth_enable immutable by caching its value
      during endpoint initialization, so that its original value is
      being carried along until destruction. The bug seems to originate
      from the very first days.
      
      Fix in joint work with Daniel Borkmann.
      Reported-by: NJoshua Kinard <kumba@gentoo.org>
      Signed-off-by: NVlad Yasevich <vyasevic@redhat.com>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Tested-by: NJoshua Kinard <kumba@gentoo.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b14878cc
  9. 18 4月, 2014 4 次提交
    • A
      of: add empty of_find_node_by_path() for !OF · 20cd477c
      Alexander Shiyan 提交于
      Add an empty version of of_find_node_by_path().
      This fixes following build error for asoc tree:
      sound/soc/fsl/fsl_ssi.c: In function 'fsl_ssi_probe':
      sound/soc/fsl/fsl_ssi.c:1471:2: error: implicit declaration of function 'of_find_node_by_path' [-Werror=implicit-function-declaration]
        sprop = of_get_property(of_find_node_by_path("/"), "compatible", NULL);
      Reported-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAlexander Shiyan <shc_work@mail.ru>
      Signed-off-by: NRob Herring <robh@kernel.org>
      20cd477c
    • D
      drm: Split out drm_probe_helper.c from drm_crtc_helper.c · 8d754544
      Daniel Vetter 提交于
      This is leftover stuff from my previous doc round which I kinda wanted
      to do but didn't yet due to rebase hell.
      
      The modeset helpers and the probing helpers a independent and e.g.
      i915 uses the probing stuff but has its own modeset infrastructure. It
      hence makes to split this up. While at it add a DOC: comment for the
      probing libraray.
      
      It would be rather neat to pull some of the DocBook documenting these
      two helpers into in-line DOC: comments. But unfortunately kerneldoc
      doesn't support markdown or something similar to make nice-looking
      documentation, so the current state is better.
      Signed-off-by: NDaniel Vetter <daniel.vetter@ffwll.ch>
      Signed-off-by: NDave Airlie <airlied@redhat.com>
      8d754544
    • C
      ipmi: boolify some things · 7aefac26
      Corey Minyard 提交于
      Convert some ints to bools.
      Signed-off-by: NCorey Minyard <cminyard@mvista.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7aefac26
    • C
      ipmi: Turn off all activity on an idle ipmi interface · 89986496
      Corey Minyard 提交于
      The IPMI driver would wake up periodically looking for events and
      watchdog pretimeouts.  If there is nothing waiting for these events,
      it's really kind of pointless to be checking for them.  So modify the
      driver so the message handler can pass down if it needs the lower layer
      to be waiting for these.  Modify the system interface lower layer to
      turn off all timer and thread activity if the upper layer doesn't need
      anything and it is not currently handling messages.  And modify the
      message handler to not restart the timer if its timer is not needed.
      
      The timers and kthread will still be enabled if:
       - the SI interface is handling a message.
       - a user has enabled watching for events.
       - the IPMI watchdog timer is in use (since it uses pretimeouts).
       - the message handler is waiting on a remote response.
       - a user has registered to receive commands.
      
      This mostly affects interfaces without interrupts.  Interfaces with
      interrupts already don't use CPU in the system interface when the
      interface is idle.
      Signed-off-by: NCorey Minyard <cminyard@mvista.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      89986496
  10. 17 4月, 2014 7 次提交
  11. 16 4月, 2014 5 次提交
    • T
      drm/tegra: Remove gratuitous pad field · cbfbbabb
      Thierry Reding 提交于
      The version of the drm_tegra_submit structure that was merged all the
      way back in 3.10 contains a pad field that was originally intended to
      properly pad the following __u64 field. Unfortunately it seems like a
      different field was dropped during review that caused this padding to
      become unnecessary, but the pad field wasn't removed at that time.
      
      One possible side-effect of this is that since the __u64 following the
      pad is now no longer properly aligned, the compiler may (or may not)
      introduce padding itself, which results in no predictable ABI.
      
      Rectify this by removing the pad field so that all fields are again
      naturally aligned. Technically this is breaking existing userspace ABI,
      but given that there aren't any (released) userspace drivers that make
      use of this yet, the fallout should be minimal.
      
      Fixes: d43f81cb ("drm/tegra: Add gr2d device")
      Cc: <stable@vger.kernel.org> # 3.10
      Signed-off-by: NThierry Reding <treding@nvidia.com>
      cbfbbabb
    • I
      x86: Remove the PCI reboot method from the default chain · 5be44a6f
      Ingo Molnar 提交于
      Steve reported a reboot hang and bisected it back to this commit:
      
        a4f1987e x86, reboot: Add EFI and CF9 reboot methods into the default list
      
      He heroically tested all reboot methods and found the following:
      
        reboot=t       # triple fault                  ok
        reboot=k       # keyboard ctrl                 FAIL
        reboot=b       # BIOS                          ok
        reboot=a       # ACPI                          FAIL
        reboot=e       # EFI                           FAIL   [system has no EFI]
        reboot=p       # PCI 0xcf9                     FAIL
      
      And I think it's pretty obvious that we should only try PCI 0xcf9 as a
      last resort - if at all.
      
      The other observation is that (on this box) we should never try
      the PCI reboot method, but close with either the 'triple fault'
      or the 'BIOS' (terminal!) reboot methods.
      
      Thirdly, CF9_COND is a total misnomer - it should be something like
      CF9_SAFE or CF9_CAREFUL, and 'CF9' should be 'CF9_FORCE' ...
      
      So this patch fixes the worst problems:
      
       - it orders the actual reboot logic to follow the reboot ordering
         pattern - it was in a pretty random order before for no good
         reason.
      
       - it fixes the CF9 misnomers and uses BOOT_CF9_FORCE and
         BOOT_CF9_SAFE flags to make the code more obvious.
      
       - it tries the BIOS reboot method before the PCI reboot method.
         (Since 'BIOS' is a terminal reboot method resulting in a hang
          if it does not work, this is essentially equivalent to removing
          the PCI reboot method from the default reboot chain.)
      
       - just for the miraculous possibility of terminal (resulting
         in hang) reboot methods of triple fault or BIOS returning
         without having done their job, there's an ordering between
         them as well.
      Reported-and-bisected-and-tested-by: NSteven Rostedt <rostedt@goodmis.org>
      Cc: Li Aubrey <aubrey.li@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Link: http://lkml.kernel.org/r/20140404064120.GB11877@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      5be44a6f
    • C
      percpu: Replace __get_cpu_var with this_cpu_ptr · fdb9c293
      Christoph Lameter 提交于
      __this_cpu_ptr is being phased out.  Use raw_cpu_ptr instead which was
      introduced in 3.15-rc1.  One case of using __get_cpu_var in the
      get_cpu_var macro for address calculation was remaining in
      include/linux/percpu.h.
      
      tj: Updated patch description.
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      fdb9c293
    • E
      ipv4: add a sock pointer to dst->output() path. · aad88724
      Eric Dumazet 提交于
      In the dst->output() path for ipv4, the code assumes the skb it has to
      transmit is attached to an inet socket, specifically via
      ip_mc_output() : The sk_mc_loop() test triggers a WARN_ON() when the
      provider of the packet is an AF_PACKET socket.
      
      The dst->output() method gets an additional 'struct sock *sk'
      parameter. This needs a cascade of changes so that this parameter can
      be propagated from vxlan to final consumer.
      
      Fixes: 8f646c92 ("vxlan: keep original skb ownership")
      Reported-by: Nlucien xin <lucien.xin@gmail.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aad88724
    • E
      ipv4: add a sock pointer to ip_queue_xmit() · b0270e91
      Eric Dumazet 提交于
      ip_queue_xmit() assumes the skb it has to transmit is attached to an
      inet socket. Commit 31c70d59 ("l2tp: keep original skb ownership")
      changed l2tp to not change skb ownership and thus broke this assumption.
      
      One fix is to add a new 'struct sock *sk' parameter to ip_queue_xmit(),
      so that we do not assume skb->sk points to the socket used by l2tp
      tunnel.
      
      Fixes: 31c70d59 ("l2tp: keep original skb ownership")
      Reported-by: NZhan Jianyu <nasa4836@gmail.com>
      Tested-by: NZhan Jianyu <nasa4836@gmail.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b0270e91
  12. 15 4月, 2014 2 次提交
    • D
      Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer" · 362d5204
      Daniel Borkmann 提交于
      This reverts commit ef2820a7 ("net: sctp: Fix a_rwnd/rwnd management
      to reflect real state of the receiver's buffer") as it introduced a
      serious performance regression on SCTP over IPv4 and IPv6, though a not
      as dramatic on the latter. Measurements are on 10Gbit/s with ixgbe NICs.
      
      Current state:
      
      [root@Lab200slot2 ~]# iperf3 --sctp -4 -c 192.168.241.3 -V -l 1452 -t 60
      iperf version 3.0.1 (10 January 2014)
      Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64
      Time: Fri, 11 Apr 2014 17:56:21 GMT
      Connecting to host 192.168.241.3, port 5201
            Cookie: Lab200slot2.1397238981.812898.548918
      [  4] local 192.168.241.2 port 38616 connected to 192.168.241.3 port 5201
      Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test
      [ ID] Interval           Transfer     Bandwidth
      [  4]   0.00-1.09   sec  20.8 MBytes   161 Mbits/sec
      [  4]   1.09-2.13   sec  10.8 MBytes  86.8 Mbits/sec
      [  4]   2.13-3.15   sec  3.57 MBytes  29.5 Mbits/sec
      [  4]   3.15-4.16   sec  4.33 MBytes  35.7 Mbits/sec
      [  4]   4.16-6.21   sec  10.4 MBytes  42.7 Mbits/sec
      [  4]   6.21-6.21   sec  0.00 Bytes    0.00 bits/sec
      [  4]   6.21-7.35   sec  34.6 MBytes   253 Mbits/sec
      [  4]   7.35-11.45  sec  22.0 MBytes  45.0 Mbits/sec
      [  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
      [  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
      [  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
      [  4]  11.45-12.51  sec  16.0 MBytes   126 Mbits/sec
      [  4]  12.51-13.59  sec  20.3 MBytes   158 Mbits/sec
      [  4]  13.59-14.65  sec  13.4 MBytes   107 Mbits/sec
      [  4]  14.65-16.79  sec  33.3 MBytes   130 Mbits/sec
      [  4]  16.79-16.79  sec  0.00 Bytes    0.00 bits/sec
      [  4]  16.79-17.82  sec  5.94 MBytes  48.7 Mbits/sec
      (etc)
      
      [root@Lab200slot2 ~]#  iperf3 --sctp -6 -c 2001:db8:0:f101::1 -V -l 1400 -t 60
      iperf version 3.0.1 (10 January 2014)
      Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64
      Time: Fri, 11 Apr 2014 19:08:41 GMT
      Connecting to host 2001:db8:0:f101::1, port 5201
            Cookie: Lab200slot2.1397243321.714295.2b3f7c
      [  4] local 2001:db8:0:f101::2 port 55804 connected to 2001:db8:0:f101::1 port 5201
      Starting Test: protocol: SCTP, 1 streams, 1400 byte blocks, omitting 0 seconds, 60 second test
      [ ID] Interval           Transfer     Bandwidth
      [  4]   0.00-1.00   sec   169 MBytes  1.42 Gbits/sec
      [  4]   1.00-2.00   sec   201 MBytes  1.69 Gbits/sec
      [  4]   2.00-3.00   sec   188 MBytes  1.58 Gbits/sec
      [  4]   3.00-4.00   sec   174 MBytes  1.46 Gbits/sec
      [  4]   4.00-5.00   sec   165 MBytes  1.39 Gbits/sec
      [  4]   5.00-6.00   sec   199 MBytes  1.67 Gbits/sec
      [  4]   6.00-7.00   sec   163 MBytes  1.36 Gbits/sec
      [  4]   7.00-8.00   sec   174 MBytes  1.46 Gbits/sec
      [  4]   8.00-9.00   sec   193 MBytes  1.62 Gbits/sec
      [  4]   9.00-10.00  sec   196 MBytes  1.65 Gbits/sec
      [  4]  10.00-11.00  sec   157 MBytes  1.31 Gbits/sec
      [  4]  11.00-12.00  sec   175 MBytes  1.47 Gbits/sec
      [  4]  12.00-13.00  sec   192 MBytes  1.61 Gbits/sec
      [  4]  13.00-14.00  sec   199 MBytes  1.67 Gbits/sec
      (etc)
      
      After patch:
      
      [root@Lab200slot2 ~]#  iperf3 --sctp -4 -c 192.168.240.3 -V -l 1452 -t 60
      iperf version 3.0.1 (10 January 2014)
      Linux Lab200slot2 3.14.0+ #1 SMP Mon Apr 14 12:06:40 EDT 2014 x86_64
      Time: Mon, 14 Apr 2014 16:40:48 GMT
      Connecting to host 192.168.240.3, port 5201
            Cookie: Lab200slot2.1397493648.413274.65e131
      [  4] local 192.168.240.2 port 50548 connected to 192.168.240.3 port 5201
      Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test
      [ ID] Interval           Transfer     Bandwidth
      [  4]   0.00-1.00   sec   240 MBytes  2.02 Gbits/sec
      [  4]   1.00-2.00   sec   239 MBytes  2.01 Gbits/sec
      [  4]   2.00-3.00   sec   240 MBytes  2.01 Gbits/sec
      [  4]   3.00-4.00   sec   239 MBytes  2.00 Gbits/sec
      [  4]   4.00-5.00   sec   245 MBytes  2.05 Gbits/sec
      [  4]   5.00-6.00   sec   240 MBytes  2.01 Gbits/sec
      [  4]   6.00-7.00   sec   240 MBytes  2.02 Gbits/sec
      [  4]   7.00-8.00   sec   239 MBytes  2.01 Gbits/sec
      
      With the reverted patch applied, the SCTP/IPv4 performance is back
      to normal on latest upstream for IPv4 and IPv6 and has same throughput
      as 3.4.2 test kernel, steady and interval reports are smooth again.
      
      Fixes: ef2820a7 ("net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer")
      Reported-by: NPeter Butler <pbutler@sonusnet.com>
      Reported-by: NDongsheng Song <dongsheng.song@gmail.com>
      Reported-by: NFengguang Wu <fengguang.wu@intel.com>
      Tested-by: NPeter Butler <pbutler@sonusnet.com>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Cc: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
      Cc: Alexander Sverdlin <alexander.sverdlin@nsn.com>
      Cc: Vlad Yasevich <vyasevich@gmail.com>
      Acked-by: NVlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      362d5204
    • D
      net: filter: seccomp: fix wrong decoding of BPF_S_ANC_SECCOMP_LD_W · 8c482cdc
      Daniel Borkmann 提交于
      While reviewing seccomp code, we found that BPF_S_ANC_SECCOMP_LD_W has
      been wrongly decoded by commit a8fc9277 ("sk-filter: Add ability to
      get socket filter program (v2)") into the opcode BPF_LD|BPF_B|BPF_ABS
      although it should have been decoded as BPF_LD|BPF_W|BPF_ABS.
      
      In practice, this should not have much side-effect though, as such
      conversion is/was being done through prctl(2) PR_SET_SECCOMP. Reverse
      operation PR_GET_SECCOMP will only return the current seccomp mode, but
      not the filter itself. Since the transition to the new BPF infrastructure,
      it's also not used anymore, so we can simply remove this as it's
      unreachable.
      
      Fixes: a8fc9277 ("sk-filter: Add ability to get socket filter program (v2)")
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8c482cdc