1. 13 Jan 2014, 8 commits
    • sched/deadline: Add SCHED_DEADLINE inheritance logic · 2d3d891d
      Authored by Dario Faggioli
      Some method of dealing with rt-mutexes and making sched_dl interact
      with the current PI code is needed. This raises non-trivial issues
      that, in our view, should be solved by restructuring the pi-code
      (i.e., moving toward a proxy-execution-ish implementation).
      
      That work is still under development; in the meantime, as a temporary
      solution, what this commit does is:
      
       - ensure a pi-lock owner with waiters is never throttled. Instead,
         when it runs out of runtime, it immediately gets replenished and its
         deadline is postponed;
      
       - the scheduling parameters (relative deadline and default runtime)
         used for those replenishments, during the whole period it holds the
         pi-lock, are those of the waiting task with the earliest deadline.
      
      Acting this way, we provide a form of boosting to the lock owner,
      while still using the existing pi-architecture (slightly modified by
      the previous commit).
      
      We stress that this is a necessary but far from clean solution to the
      problem; in the end it is only a way to restart discussion within the
      community. So, as always, comments, ideas, rants, etc. are welcome! :-)
      Signed-off-by: Dario Faggioli <raistlin@linux.it>
      Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
      [ Added !RT_MUTEXES build fix. ]
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1383831828-15501-11-git-send-email-juri.lelli@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2d3d891d
    • rtmutex: Turn the plist into an rb-tree · fb00aca4
      Authored by Peter Zijlstra
      Turn the pi-chains from plist to rb-tree, in the rt_mutex code,
      and provide a proper comparison function for -deadline and
      -priority tasks.
      
      This is done mainly because:
       - the classical prio field of the plist is just an int, which might
         not be enough to represent a deadline;
       - manipulating such a list would become O(nr_deadline_tasks), which
         might be too much as the number of -deadline tasks increases.
      
      Therefore, an rb-tree is used, and tasks are queued in it according
      to the following logic:
       - among two -priority (i.e., SCHED_BATCH/OTHER/RR/FIFO) tasks, the
         one with the higher (lower, actually!) prio wins;
       - among a -priority and a -deadline task, the latter always wins;
       - among two -deadline tasks, the one with the earliest deadline
         wins.
      
      Queueing and dequeueing functions are changed accordingly, for both
      the list of a task's pi-waiters and the list of tasks blocked on
      a pi-lock.
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Dario Faggioli <raistlin@linux.it>
      Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
      Signed-off-again-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1383831828-15501-10-git-send-email-juri.lelli@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      fb00aca4
    • sched/deadline: Add latency tracing for SCHED_DEADLINE tasks · af6ace76
      Authored by Dario Faggioli
      It is very likely that systems that want or need to use the new
      SCHED_DEADLINE policy also want to have the scheduling latency of
      the -deadline tasks under control.
      
      For this reason a new wakeup latency tracer, called "wakeup_dl", is
      introduced.
      
      As a consequence of applying this patch there will be three wakeup
      latency tracers:
      
       * "wakeup", that deals with all tasks in the system;
       * "wakeup_rt", that deals with -rt and -deadline tasks only;
       * "wakeup_dl", that deals with -deadline tasks only.
      Signed-off-by: Dario Faggioli <raistlin@linux.it>
      Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1383831828-15501-9-git-send-email-juri.lelli@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      af6ace76
    • sched/deadline: Add period support for SCHED_DEADLINE tasks · 755378a4
      Authored by Harald Gustafsson
      Make it possible to specify a period (different from, or equal to, the
      deadline) for -deadline tasks. Relative deadlines (D_i) are used on
      task arrivals to generate new scheduling (absolute) deadlines as "d =
      t + D_i", and periods (P_i) to postpone the scheduling deadlines as "d
      = d + P_i" when the budget is zero.
      
      This is in general useful to model (and schedule) tasks that have slow
      activation rates (long periods), but have to be scheduled soon once
      activated (short deadlines).
      Signed-off-by: Harald Gustafsson <harald.gustafsson@ericsson.com>
      Signed-off-by: Dario Faggioli <raistlin@linux.it>
      Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1383831828-15501-7-git-send-email-juri.lelli@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      755378a4
    • sched/deadline: Add SCHED_DEADLINE avg_update accounting · 239be4a9
      Authored by Dario Faggioli
      Make the core scheduler and load balancer aware of the load
      produced by -deadline tasks, by updating the moving average
      like for sched_rt.
      Signed-off-by: Dario Faggioli <raistlin@linux.it>
      Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1383831828-15501-6-git-send-email-juri.lelli@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      239be4a9
    • sched/deadline: Add SCHED_DEADLINE SMP-related data structures & logic · 1baca4ce
      Authored by Juri Lelli
      Introduces the data structures needed to implement dynamic migration
      of -deadline tasks, along with the logic for checking whether
      runqueues are overloaded with -deadline tasks and for choosing where
      a task should migrate when necessary.
      
      Also adds dynamic migration to SCHED_DEADLINE, so that tasks can
      be moved among CPUs when necessary. It is also possible to bind a
      task to a (set of) CPU(s), thus restricting its ability to migrate,
      or forbidding migrations altogether.
      
      The very same approach used in sched_rt is utilised:
       - -deadline tasks are kept into CPU-specific runqueues,
       - -deadline tasks are migrated among runqueues to achieve the
         following:
          * on an M-CPU system the M earliest deadline ready tasks
            are always running;
          * affinity/cpusets settings of all the -deadline tasks are
            always respected.
      
      Therefore, this very special form of "load balancing" is done with
      an active method, i.e., the scheduler pushes or pulls tasks between
      runqueues when they are woken up and/or (de)scheduled.
      IOW, every time a preemption occurs, the descheduled task might be sent
      to some other CPU (depending on its deadline) to continue executing
      (push). On the other hand, every time a CPU becomes idle, it might pull
      the second earliest deadline ready task from some other CPU.
      
      To enforce this, a pull operation is always attempted before taking any
      scheduling decision (pre_schedule()), as well as a push one after each
      scheduling decision (post_schedule()). In addition, when a task arrives
      or wakes up, the best CPU on which to resume it is selected, taking
      into account its affinity mask, the system topology, and its deadline.
      E.g., from the scheduling point of view, the best CPU on which to wake
      up (and also to which to push) a task is the one running the task with
      the latest deadline among the M executing ones.
      
      In order to facilitate these decisions, per-runqueue "caching" of the
      deadlines of the currently running and of the first ready task is used.
      Queued but not running tasks are also parked in another rb-tree to
      speed-up pushes.
      Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
      Signed-off-by: Dario Faggioli <raistlin@linux.it>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1383831828-15501-5-git-send-email-juri.lelli@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1baca4ce
    • sched/deadline: Add SCHED_DEADLINE structures & implementation · aab03e05
      Authored by Dario Faggioli
      Introduces the data structures, constants and symbols needed for
      SCHED_DEADLINE implementation.
      
      The core data structures of SCHED_DEADLINE are defined, along with
      their initializers. Hooks for checking whether a task belongs to the
      new policy are also added where they are needed.
      
      Adds a scheduling class, in sched/dl.c, and a new policy called
      SCHED_DEADLINE. It is an implementation of the Earliest Deadline
      First (EDF) scheduling algorithm, augmented with a mechanism (called
      Constant Bandwidth Server, CBS) that makes it possible to isolate
      the behaviour of tasks from one another.
      
      The typical -deadline task is made up of a computation phase
      (instance) which is activated in a periodic or sporadic fashion. The
      expected (maximum) duration of that computation is called the task's
      runtime; the time interval by which each instance needs to be
      completed is called the task's relative deadline. The task's absolute
      deadline is dynamically calculated as the time instant a task (more
      precisely, an instance) activates plus the relative deadline.
      
      The EDF algorithm selects the task with the smallest absolute
      deadline to be executed first, while the CBS ensures that each
      task runs for at most its runtime in every time interval of
      (relative) deadline length, avoiding any interference between
      different tasks (bandwidth isolation).
      Thanks to this feature, even tasks that do not strictly comply with
      the computational model sketched above can effectively use the new
      policy.
      
      To summarize, this patch:
       - introduces the data structures, constants and symbols needed;
       - implements the core logic of the scheduling algorithm in the new
         scheduling class file;
       - provides all the glue code between the new scheduling class and
         the core scheduler and refines the interactions between sched/dl
         and the other existing scheduling classes.
      Signed-off-by: Dario Faggioli <raistlin@linux.it>
      Signed-off-by: Michael Trimarchi <michael@amarulasolutions.com>
      Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
      Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1383831828-15501-4-git-send-email-juri.lelli@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      aab03e05
    • sched: Add new scheduler syscalls to support an extended scheduling parameters ABI · d50dde5a
      Authored by Dario Faggioli
      Add the syscalls needed for supporting scheduling algorithms
      with extended scheduling parameters (e.g., SCHED_DEADLINE).
      
      In general, this makes it possible to specify a periodic/sporadic
      task that executes for a given amount of runtime at each instance,
      and is scheduled according to the urgency of its own timing
      constraints, i.e.:
      
       - a (maximum/typical) instance execution time,
       - a minimum interval between consecutive instances,
       - a time constraint by which each instance must be completed.
      
      Thus, both the data structure that holds the scheduling parameters of
      the tasks and the system calls dealing with it must be extended.
      Unfortunately, modifying the existing struct sched_param would break
      the ABI and result in potentially serious compatibility issues with
      legacy binaries.
      
      For these reasons, this patch:
      
       - defines the new struct sched_attr, containing all the fields
         that are necessary for specifying a task in the computational
         model described above;
      
       - defines and implements the new scheduling related syscalls that
         manipulate it, i.e., sched_setattr() and sched_getattr().
      
      Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a
      proof of concept and for developing and testing purposes. Making them
      available on other architectures is straightforward.
      
      Since no "user" for these new parameters is introduced in this patch,
      the implementation of the new system calls is identical to that of
      their already existing counterparts. Future patches that implement
      scheduling policies able to exploit the new data structure must also
      take care of modifying the sched_*attr() calls in accordance with
      their own purposes.
      Signed-off-by: Dario Faggioli <raistlin@linux.it>
      [ Rewrote to use sched_attr. ]
      Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
      [ Removed sched_setscheduler2() for now. ]
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1383831828-15501-3-git-send-email-juri.lelli@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d50dde5a
  2. 12 Jan 2014, 1 commit
    • sched: Calculate effective load even if local weight is 0 · 9722c2da
      Authored by Rik van Riel
      Thomas Hellstrom bisected a regression where erratic 3D performance is
      experienced on virtual machines as measured by glxgears. The bisection
      identified commit 58d081b5 ("sched/numa: Avoid overloading CPUs on a
      preferred NUMA node") as the problem, which had modified the behaviour
      of effective_load().
      
      Effective load calculates the difference to the system-wide load if a
      scheduling entity was moved to another CPU. The task group is not heavier
      as a result of the move but overall system load can increase/decrease as a
      result of the change. Commit 58d081b5 ("sched/numa: Avoid overloading CPUs
      on a preferred NUMA node") changed effective_load to make it suitable for
      calculating if a particular NUMA node was compute overloaded. To reduce
      the cost of the function, it assumed that a current sched entity weight
      of 0 was uninteresting but that is not the case.
      
      wake_affine() uses a weight of 0 for sync wakeups on the grounds that it
      is assuming the waking task will sleep and not contribute to load in the
      near future. In this case, we still want to calculate the effective load
      of the sched entity hierarchy. As effective_load is no longer used by
      task_numa_compare since commit fb13c7ee (sched/numa: Use a system-wide
      search to find swap/migration candidates), this patch simply restores the
      historical behaviour.
      Reported-and-tested-by: Thomas Hellstrom <thellstrom@vmware.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      [ Wrote changelog. ]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20140106113912.GC6178@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9722c2da
  3. 22 Dec 2013, 1 commit
    • PM / sleep: Fix memory leak in pm_vt_switch_unregister(). · c6068504
      Authored by Masami Ichikawa
      kmemleak reported a memory leak as below.
      
      unreferenced object 0xffff880118f14700 (size 32):
        comm "swapper/0", pid 1, jiffies 4294877401 (age 123.283s)
        hex dump (first 32 bytes):
          00 01 10 00 00 00 ad de 00 02 20 00 00 00 ad de  .......... .....
          00 d4 d2 18 01 88 ff ff 01 00 00 00 00 04 00 00  ................
        backtrace:
          [<ffffffff814edb1e>] kmemleak_alloc+0x4e/0xb0
          [<ffffffff811889dc>] kmem_cache_alloc_trace+0x1ec/0x260
          [<ffffffff810aba66>] pm_vt_switch_required+0x76/0xb0
          [<ffffffff812f39f5>] register_framebuffer+0x195/0x320
          [<ffffffff8130af18>] efifb_probe+0x718/0x780
          [<ffffffff81391495>] platform_drv_probe+0x45/0xb0
          [<ffffffff8138f407>] driver_probe_device+0x87/0x3a0
          [<ffffffff8138f7f3>] __driver_attach+0x93/0xa0
          [<ffffffff8138d413>] bus_for_each_dev+0x63/0xa0
          [<ffffffff8138ee5e>] driver_attach+0x1e/0x20
          [<ffffffff8138ea40>] bus_add_driver+0x180/0x250
          [<ffffffff8138fe74>] driver_register+0x64/0xf0
          [<ffffffff813913ba>] __platform_driver_register+0x4a/0x50
          [<ffffffff8191e028>] efifb_driver_init+0x12/0x14
          [<ffffffff8100214a>] do_one_initcall+0xfa/0x1b0
          [<ffffffff818e40e0>] kernel_init_freeable+0x17b/0x201
      
      In pm_vt_switch_required(), the "entry" variable is allocated via
      kmalloc(). So pm_vt_switch_unregister() needs to call kfree() when the
      object is deleted from the list.
      Signed-off-by: Masami Ichikawa <masami256@gmail.com>
      Reviewed-by: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      c6068504
  4. 21 Dec 2013, 1 commit
  5. 20 Dec 2013, 1 commit
    • libata, freezer: avoid block device removal while system is frozen · 85fbd722
      Authored by Tejun Heo
      Freezable kthreads and workqueues are fundamentally problematic in
      that they effectively introduce a big kernel lock widely used in the
      kernel and have already been the culprit of several deadlock
      scenarios.  This is the latest occurrence.
      
      During resume, libata rescans all the ports and revalidates all
      pre-existing devices.  If it determines that a device has gone
      missing, the device is removed from the system which involves
      invalidating block device and flushing bdi while holding driver core
      layer locks.  Unfortunately, this can race with the rest of device
      resume.  Because freezable kthreads and workqueues are thawed after
      device resume is complete and block device removal depends on
      freezable workqueues and kthreads (e.g. bdi_wq, jbd2) to make
      progress, this can lead to deadlock - block device removal can't
      proceed because kthreads are frozen and kthreads can't be thawed
      because device resume is blocked behind block device removal.
      
      839a8e86 ("writeback: replace custom worker pool implementation
      with unbound workqueue") made this particular deadlock scenario more
      visible but the underlying problem has always been there - the
      original forker task and jbd2 are freezable too.  In fact, this is
      highly likely just one of many possible deadlock scenarios given that
      freezer behaves as a big kernel lock and we don't have any debug
      mechanism around it.
      
      I believe the right thing to do is getting rid of freezable kthreads
      and workqueues.  This is something fundamentally broken.  For now,
      implement a funny workaround in libata - just avoid doing block device
      hot[un]plug while the system is frozen.  Kernel engineering at its
      finest.  :(
      
      v2: Add EXPORT_SYMBOL_GPL(pm_freezing) for cases where libata is built
          as a module.
      
      v3: Comment updated and polling interval changed to 10ms as suggested
          by Rafael.
      
      v4: Add #ifdef CONFIG_FREEZER around the hack as pm_freezing is not
          defined when FREEZER is not configured thus breaking build.
          Reported by kbuild test robot.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Tomaž Šolc <tomaz.solc@tablix.org>
      Reviewed-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=62801
      Link: http://lkml.kernel.org/r/20131213174932.GA27070@htj.dyndns.org
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: stable@vger.kernel.org
      Cc: kbuild test robot <fengguang.wu@intel.com>
      85fbd722
  6. 19 Dec 2013, 3 commits
  7. 17 Dec 2013, 8 commits
  8. 16 Dec 2013, 1 commit
    • ftrace: Initialize the ftrace profiler for each possible cpu · c4602c1c
      Authored by Miao Xie
      Ftrace currently initializes only the online CPUs. This implementation has
      two problems:
      - If we online a CPU after we enable the function profile, and then run the
        test, we will lose the trace information on that CPU.
        Steps to reproduce:
        # echo 0 > /sys/devices/system/cpu/cpu1/online
        # cd <debugfs>/tracing/
        # echo <some function name> >> set_ftrace_filter
        # echo 1 > function_profile_enabled
        # echo 1 > /sys/devices/system/cpu/cpu1/online
        # run test
       - If we offline a CPU before we enable the function profiler, the
         stale trace information for that CPU is not cleared when the
         profiler is enabled, which confuses users.
        Steps to reproduce:
        # cd <debugfs>/tracing/
        # echo <some function name> >> set_ftrace_filter
        # echo 1 > function_profile_enabled
        # run test
        # cat trace_stat/function*
        # echo 0 > /sys/devices/system/cpu/cpu1/online
        # echo 0 > function_profile_enabled
        # echo 1 > function_profile_enabled
        # cat trace_stat/function*
        # run test
        # cat trace_stat/function*
      
      So it is better to initialize the ftrace profiler for each possible
      CPU every time the function profiler is enabled, instead of only the
      online CPUs.
      
      Link: http://lkml.kernel.org/r/1387178401-10619-1-git-send-email-miaox@cn.fujitsu.com
      
      Cc: stable@vger.kernel.org # 2.6.31+
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      c4602c1c
  9. 13 Dec 2013, 5 commits
    • KEYS: fix uninitialized persistent_keyring_register_sem · 6bd364d8
      Authored by Xiao Guangrong
      We ran into this bug:
      [ 2736.063245] Unable to handle kernel paging request for data at address 0x00000000
      [ 2736.063293] Faulting instruction address: 0xc00000000037efb0
      [ 2736.063300] Oops: Kernel access of bad area, sig: 11 [#1]
      [ 2736.063303] SMP NR_CPUS=2048 NUMA pSeries
      [ 2736.063310] Modules linked in: sg nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6table_security ip6table_raw ip6t_REJECT iptable_nat nf_nat_ipv4 iptable_mangle iptable_security iptable_raw ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ebtable_filter ebtables ip6table_filter iptable_filter ip_tables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 nf_nat nf_conntrack ip6_tables ibmveth pseries_rng nx_crypto nfsd auth_rpcgss nfs_acl lockd sunrpc binfmt_misc xfs libcrc32c dm_service_time sd_mod crc_t10dif crct10dif_common ibmvfc scsi_transport_fc scsi_tgt dm_mirror dm_region_hash dm_log dm_multipath dm_mod
      [ 2736.063383] CPU: 1 PID: 7128 Comm: ssh Not tainted 3.10.0-48.el7.ppc64 #1
      [ 2736.063389] task: c000000131930120 ti: c0000001319a0000 task.ti: c0000001319a0000
      [ 2736.063394] NIP: c00000000037efb0 LR: c0000000006c40f8 CTR: 0000000000000000
      [ 2736.063399] REGS: c0000001319a3870 TRAP: 0300   Not tainted  (3.10.0-48.el7.ppc64)
      [ 2736.063403] MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI>  CR: 28824242  XER: 20000000
      [ 2736.063415] SOFTE: 0
      [ 2736.063418] CFAR: c00000000000908c
      [ 2736.063421] DAR: 0000000000000000, DSISR: 40000000
      [ 2736.063425]
      GPR00: c0000000006c40f8 c0000001319a3af0 c000000001074788 c0000001319a3bf0
      GPR04: 0000000000000000 0000000000000000 0000000000000020 000000000000000a
      GPR08: fffffffe00000002 00000000ffff0000 0000000080000001 c000000000924888
      GPR12: 0000000028824248 c000000007e00400 00001fffffa0f998 0000000000000000
      GPR16: 0000000000000022 00001fffffa0f998 0000010022e92470 0000000000000000
      GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
      GPR24: 0000000000000000 c000000000f4a828 00003ffffe527108 0000000000000000
      GPR28: c000000000f4a730 c000000000f4a828 0000000000000000 c0000001319a3bf0
      [ 2736.063498] NIP [c00000000037efb0] .__list_add+0x30/0x110
      [ 2736.063504] LR [c0000000006c40f8] .rwsem_down_write_failed+0x78/0x264
      [ 2736.063508] PACATMSCRATCH [800000000280f032]
      [ 2736.063511] Call Trace:
      [ 2736.063516] [c0000001319a3af0] [c0000001319a3b80] 0xc0000001319a3b80 (unreliable)
      [ 2736.063523] [c0000001319a3b80] [c0000000006c40f8] .rwsem_down_write_failed+0x78/0x264
      [ 2736.063530] [c0000001319a3c50] [c0000000006c1bb0] .down_write+0x70/0x78
      [ 2736.063536] [c0000001319a3cd0] [c0000000002e5ffc] .keyctl_get_persistent+0x20c/0x320
      [ 2736.063542] [c0000001319a3dc0] [c0000000002e2388] .SyS_keyctl+0x238/0x260
      [ 2736.063548] [c0000001319a3e30] [c000000000009e7c] syscall_exit+0x0/0x7c
      [ 2736.063553] Instruction dump:
      [ 2736.063556] 7c0802a6 fba1ffe8 fbc1fff0 fbe1fff8 7cbd2b78 7c9e2378 7c7f1b78 f8010010
      [ 2736.063566] f821ff71 e8a50008 7fa52040 40de00c0 <e8be0000> 7fbd2840 40de0094 7fbff040
      [ 2736.063579] ---[ end trace 2708241785538296 ]---
      
      It's caused by uninitialized persistent_keyring_register_sem.
      
      The bug was introduced by commit f36f8c75; there are two typos in that
      commit: CONFIG_KEYS_KERBEROS_CACHE should be CONFIG_PERSISTENT_KEYRINGS
      and krb_cache_register_sem should be persistent_keyring_register_sem.
      Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      6bd364d8
    • KEYS: Remove files generated when SYSTEM_TRUSTED_KEYRING=y · f46a3cbb
      Authored by Kirill Tkhai
      Always remove generated SYSTEM_TRUSTED_KEYRING files while doing make mrproper.
      Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
      Signed-off-by: David Howells <dhowells@redhat.com>
      f46a3cbb
    • X.509: Fix certificate gathering · d7ec435f
      Authored by David Howells
      Fix the gathering of certificates from both the source tree and the build tree
      to correctly calculate the pathnames of all the certificates.
      
      The problem was that the default generated cert, signing_key.x509,
      would have a path attached only if it already existed; if it didn't
      exist yet, no path was attached.
      
      This means that the contents of kernel/.x509.list would change between the
      first compilation in a directory and the second.  After the second it would
      remain stable because the signing_key.x509 file exists.
      
      The consequence was that the kernel would get relinked unconditionally on the
      second recompilation.  The second recompilation would also show something like
      this:
      
         X.509 certificate list changed
           CERTS   kernel/x509_certificate_list
           - Including cert /home/torvalds/v2.6/linux/signing_key.x509
           AS      kernel/system_certificates.o
           LD      kernel/built-in.o
      
      which is why the relink would happen.
      
      
      Unfortunately, it isn't a simple matter of just sticking a path on the front
      of the filename of the certificate in the build directory as make can't then
      work out how to build it.
      
      So the path has to be prepended to the name for sorting and duplicate
      elimination and then removed for the make rule if it is in the build tree.
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: David Howells <dhowells@redhat.com>
      d7ec435f
    • futex: move user address verification up to common code · 5cdec2d8
      Authored by Linus Torvalds
      When debugging the read-only hugepage case, I was confused by the fact
      that get_futex_key() did an access_ok() only for the non-shared futex
      case, since the user address checking really isn't in any way specific
      to the private key handling.
      
      Now, it turns out that the shared key handling does effectively do the
      equivalent checks inside get_user_pages_fast() (it doesn't actually
      check the address range on x86, but does check the page protections for
      being a user page).  So it wasn't actually a bug, but the fact that we
      treat the address differently for private and shared futexes threw me
      for a loop.
      
      Just move the check up, so that it gets done for both cases.  Also, use
      the 'rw' parameter for the type, even if it doesn't actually matter any
      more (it's a historical artifact of the old racy i386 "page faults from
      kernel space don't check write protections").
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5cdec2d8
    • futex: fix handling of read-only-mapped hugepages · f12d5bfc
      Authored by Linus Torvalds
      The hugepage code had the exact same bug that regular pages had in
      commit 7485d0d3 ("futexes: Remove rw parameter from
      get_futex_key()").
      
      The regular page case was fixed by commit 9ea71503 ("futex: Fix
      regression with read only mappings"), but the transparent hugepage
      case (added in a5b338f2: "thp: update futex compound knowledge")
      remained broken.
      
      Found by Dave Jones and his trinity tool.
      Reported-and-tested-by: Dave Jones <davej@fedoraproject.org>
      Cc: stable@kernel.org # v2.6.38+
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f12d5bfc
  10. 11 Dec 2013, 4 commits
    • sched/fair: Rework sched_fair time accounting · 9dbdb155
      Authored by Peter Zijlstra
      Christian suffers from a bad BIOS that wrecks his i5's TSC sync. This
      results in him occasionally seeing time going backwards - which
      crashes the scheduler ...
      
      Most of our time accounting can actually handle that except the most
      common one; the tick time update of sched_fair.
      
      There is a further problem with that code; previously we assumed that
      because we get a tick every TICK_NSEC our time delta could never
      exceed 32bits and math was simpler.
      
      However, ever since Frederic managed to get NO_HZ_FULL merged, this is
      no longer the case: a task can now run for a long time indeed without
      getting a tick. It only takes about ~4.2 seconds to overflow our u32
      in nanoseconds.
      
      This means we not only need to better deal with time going backwards;
      but also means we need to be able to deal with large deltas.
      
      This patch reworks the entire code and uses mul_u64_u32_shr() as
      proposed by Andy a long while ago.
      
      We express our virtual time scale factor in a u32 multiplier and shift
      right and the 32bit mul_u64_u32_shr() implementation reduces to a
      single 32x32->64 multiply if the time delta is still short (common
      case).
      
      For 64bit a 64x64->128 multiply can be used if ARCH_SUPPORTS_INT128.
      Reported-and-Tested-by: Christian Engelmayer <cengelma@gmx.at>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: fweisbec@gmail.com
      Cc: Paul Turner <pjt@google.com>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20131118172706.GI3866@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9dbdb155
    • P
      sched: Initialize power_orig for overlapping groups · 8e8339a3
      Peter Zijlstra authored
      Yinghai reported that he saw a divide-by-zero in sg_capacity on his
      EX parts.  Make sure to always initialize power_orig now that we
      actually use it.
      
      Ideally build_sched_domains() -> init_sched_groups_power() would also
      initialize this; but for some as yet unexplained reason some setups
      seem to miss updates there.
      Reported-by: Yinghai Lu <yinghai@kernel.org>
      Tested-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-l8ng2m9uml6fhibln8wqpom7@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8e8339a3
    • H
      KEYS: correct alignment of system_certificate_list content in assembly file · 62226983
      Hendrik Brueckner authored
      Apart from data-type specific alignment constraints, there are also
      architecture-specific alignment requirements.
      For example, on s390 symbols must be on even addresses, implying a 2-byte
      alignment.  If the system_certificate_list_end symbol is on an odd address
      and that address is loaded, the least-significant bit is ignored.  As a
      result, load_system_certificate_list() fails to load the certificates
      because of a wrong certificate length calculation.
      
      To be safe, align system_certificate_list on an 8-byte boundary.  Also improve
      the length calculation of the system_certificate_list content.  Introduce a
      system_certificate_list_size (8-byte aligned because of unsigned long) variable
      that stores the length.  Let the linker calculate this size by introducing
      a start and end label for the certificate content.
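      In outline, the approach looks like the following assembler fragment
      (a sketch in the style of kernel/system_certificates.S; the real
      file's VMLINUX_SYMBOL() wrappers and exact directives may differ).
      The .align 8 satisfies any architecture's requirement, and the linker
      computes the content size from the start/end labels:

```asm
	.align 8
	.globl system_certificate_list
system_certificate_list:
__cert_list_start:
	.incbin "kernel/x509_certificate_list"
__cert_list_end:

	.align 8
	.globl system_certificate_list_size
system_certificate_list_size:
#ifdef CONFIG_64BIT
	.quad __cert_list_end - __cert_list_start
#else
	.long __cert_list_end - __cert_list_start
#endif
```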
      Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      62226983
    • R
      Ignore generated file kernel/x509_certificate_list · 7cfe5b33
      Rusty Russell authored
      $ git status
      # On branch pending-rebases
      # Untracked files:
      #   (use "git add <file>..." to include in what will be committed)
      #
      #	kernel/x509_certificate_list
      nothing added to commit but untracked files present (use "git add" to track)
      $
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: David Howells <dhowells@redhat.com>
      7cfe5b33
  11. 08 Dec 2013, 1 commit
  12. 07 Dec 2013, 1 commit
    • T
      cgroup: fix cgroup_create() error handling path · 266ccd50
      Tejun Heo authored
      ae7f164a ("cgroup: move cgroup->subsys[] assignment to
      online_css()") moved the cgroup->subsys[] assignments later in
      cgroup_create() but didn't update the error handling path accordingly,
      leading to the following oops and to leaking later css's after an
      online_css() failure.  The oops is from the cgroup destruction path
      being invoked on the partially constructed cgroup, which is not ready
      to handle empty slots in the cgrp->subsys[] array.
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        IP: [<ffffffff810eeaa8>] cgroup_destroy_locked+0x118/0x2f0
        PGD a780a067 PUD aadbe067 PMD 0
        Oops: 0000 [#1] SMP
        Modules linked in:
        CPU: 6 PID: 7360 Comm: mkdir Not tainted 3.13.0-rc2+ #69
        Hardware name:
        task: ffff8800b9dbec00 ti: ffff8800a781a000 task.ti: ffff8800a781a000
        RIP: 0010:[<ffffffff810eeaa8>]  [<ffffffff810eeaa8>] cgroup_destroy_locked+0x118/0x2f0
        RSP: 0018:ffff8800a781bd98  EFLAGS: 00010282
        RAX: ffff880586903878 RBX: ffff880586903800 RCX: ffff880586903820
        RDX: ffff880586903860 RSI: ffff8800a781bdb0 RDI: ffff880586903820
        RBP: ffff8800a781bde8 R08: ffff88060e0b8048 R09: ffffffff811d7bc1
        R10: 000000000000008c R11: 0000000000000001 R12: ffff8800a72286c0
        R13: 0000000000000000 R14: ffffffff81cf7a40 R15: 0000000000000001
        FS:  00007f60ecda57a0(0000) GS:ffff8806272c0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000008 CR3: 00000000a7a03000 CR4: 00000000000007e0
        Stack:
         ffff880586903860 ffff880586903910 ffff8800a72286c0 ffff880586903820
         ffffffff81cf7a40 ffff880586903800 ffff88060e0b8018 ffffffff81cf7a40
         ffff8800b9dbec00 ffff8800b9dbf098 ffff8800a781bec8 ffffffff810ef5bf
        Call Trace:
         [<ffffffff810ef5bf>] cgroup_mkdir+0x55f/0x5f0
         [<ffffffff811c90ae>] vfs_mkdir+0xee/0x140
         [<ffffffff811cb07e>] SyS_mkdirat+0x6e/0xf0
         [<ffffffff811c6a19>] SyS_mkdir+0x19/0x20
         [<ffffffff8169e569>] system_call_fastpath+0x16/0x1b
      
      This patch moves the reference bumping inside the online_css() loop,
      clears css_ar[] as css's are brought online successfully, and updates
      the err_destroy path so that either a css is fully online and destroyed
      by cgroup_destroy_locked() or the error path frees it.  This duplicates
      the css free logic in the error path, but it will be cleaned up soon.
      
      v2: Li pointed out that cgroup_destroy_locked() would do NULL-deref if
          invoked with a cgroup which doesn't have all css's populated.
          Update cgroup_destroy_locked() so that it skips NULL css's.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Reported-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: stable@vger.kernel.org # v3.12+
      266ccd50
  13. 06 Dec 2013, 1 commit
  14. 05 Dec 2013, 1 commit
  15. 29 Nov 2013, 2 commits
  16. 28 Nov 2013, 1 commit
    • T
      cgroup: fix cgroup_subsys_state leak for seq_files · e605b365
      Tejun Heo authored
      If a cgroup file implements either read_map() or read_seq_string(),
      such a file is served using seq_file by overriding file->f_op with
      cgroup_seqfile_operations, which also overrides the release method,
      replacing cgroup_file_release() with single_release().
      
      Because cgroup_file_open() didn't use to acquire any resources, this
      used to be fine, but since f7d58818 ("cgroup: pin
      cgroup_subsys_state when opening a cgroupfs file"), cgroup_file_open()
      pins the css (cgroup_subsys_state) which is put by
      cgroup_file_release().  The patch forgot to update the release path
      for seq_files and each open/release cycle leaks a css reference.
      
      Fix it by updating cgroup_file_release() to also handle seq_files,
      and by using it for the seq_file release path too.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org # v3.12
      e605b365