1. 14 2月, 2015 1 次提交
  2. 16 1月, 2015 1 次提交
    • P
      rcu: Optionally run grace-period kthreads at real-time priority · a94844b2
      Paul E. McKenney 提交于
      Recent testing has shown that under heavy load, running RCU's grace-period
      kthreads at real-time priority can improve performance (according to 0day
      test robot) and reduce the incidence of RCU CPU stall warnings.  However,
      most systems do just fine with the default non-realtime priorities for
      these kthreads, and it does not make sense to expose the entire user
      base to any risk stemming from this change, given that this change is
      of use only to a few users running extremely heavy workloads.
      
      Therefore, this commit allows users to specify realtime priorities
      for the grace-period kthreads, but leaves them running SCHED_OTHER
      by default.  The realtime priority may be specified at build time
      via the RCU_KTHREAD_PRIO Kconfig parameter, or at boot time via the
      rcutree.kthread_prio parameter.  Either way, 0 says to continue the
      default SCHED_OTHER behavior and values from 1-99 specify that priority
      of SCHED_FIFO behavior.  Note that a value of 0 is not permitted when
      the RCU_BOOST Kconfig parameter is specified.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      a94844b2
  3. 07 1月, 2015 2 次提交
    • P
      rcu: Make SRCU optional by using CONFIG_SRCU · 83fe27ea
      Pranith Kumar 提交于
      SRCU is not necessary to be compiled by default in all cases. For tinification
      efforts not compiling SRCU unless necessary is desirable.
      
      The current patch tries to make compiling SRCU optional by introducing a new
      Kconfig option CONFIG_SRCU which is selected when any of the components making
      use of SRCU are selected.
      
      If we do not select CONFIG_SRCU, srcu.o will not be compiled at all.
      
         text    data     bss     dec     hex filename
         2007       0       0    2007     7d7 kernel/rcu/srcu.o
      
      Size of arch/powerpc/boot/zImage changes from
      
         text    data     bss     dec     hex filename
       831552   64180   23944  919676   e087c arch/powerpc/boot/zImage : before
       829504   64180   23952  917636   e0084 arch/powerpc/boot/zImage : after
      
      so the savings are about ~2000 bytes.
      Signed-off-by: NPranith Kumar <bobby.prani@gmail.com>
      CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      CC: Josh Triplett <josh@joshtriplett.org>
      CC: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      [ paulmck: resolve conflict due to removal of arch/ia64/kvm/Kconfig. ]
      83fe27ea
    • L
      rcu: Remove "select IRQ_WORK" from config TREE_RCU · 5a43b88e
      Lai Jiangshan 提交于
      The 48a7639c ("rcu: Make callers awaken grace-period kthread")
      removed the irq_work_queue(), so the TREE_RCU doesn't need
      irq work any more.  This commit therefore updates RCU's Kconfig and
      Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      5a43b88e
  4. 11 12月, 2014 6 次提交
    • A
      init: allow CONFIG_INIT_FALLBACK=n to disable defaults if init= fails · 6ef4536e
      Andy Lutomirski 提交于
      If a user puts init=/whatever on the command line and /whatever can't be
      run, then the kernel will try a few default options before giving up.  If
      init=/whatever came from a bootloader prompt, then this is unexpected but
      probably harmless.  On the other hand, if it comes from a script (e.g.  a
      tool like virtme or perhaps a future kselftest script), then the fallbacks
      are likely to exist, but they'll do the wrong thing.  For example, they
      might unexpectedly invoke systemd.
      
      This adds a config option CONFIG_INIT_FALLBACK.  If unset, then a failure
      to run the specified init= process be fatal.
      
      The tentative plan is to remove CONFIG_INIT_FALLBACK for 3.20.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NAndy Lutomirski <luto@amacapital.net>
      Cc: Rob Landley <rob@landley.net>
      Cc: Chuck Ebbert <cebbert.lkml@gmail.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shuah Khan <shuah.kh@samsung.com>
      Cc: Frank Rowand <frowand.list@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Acked-by: NRusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6ef4536e
    • J
      mm: move page->mem_cgroup bad page handling into generic code · 9edad6ea
      Johannes Weiner 提交于
      Now that the external page_cgroup data structure and its lookup is
      gone, let the generic bad_page() check for page->mem_cgroup sanity.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9edad6ea
    • A
      mm/numa balancing: rearrange Kconfig entry · 6f7c97e8
      Aneesh Kumar K.V 提交于
      Add the default enable config option after the NUMA_BALANCING option so
      that it appears related in the nconfig interface.
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6f7c97e8
    • J
      kernel: res_counter: remove the unused API · 5b1efc02
      Johannes Weiner 提交于
      All memory accounting and limiting has been switched over to the
      lockless page counters.  Bye, res_counter!
      
      [akpm@linux-foundation.org: update Documentation/cgroups/memory.txt]
      [mhocko@suse.cz: ditch the last remainings of res_counter]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Paul Bolle <pebolle@tiscali.nl>
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5b1efc02
    • J
      mm: hugetlb_cgroup: convert to lockless page counters · 71f87bee
      Johannes Weiner 提交于
      Abandon the spinlock-protected byte counters in favor of the unlocked
      page counters in the hugetlb controller as well.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      71f87bee
    • J
      mm: memcontrol: lockless page counters · 3e32cb2e
      Johannes Weiner 提交于
      Memory is internally accounted in bytes, using spinlock-protected 64-bit
      counters, even though the smallest accounting delta is a page.  The
      counter interface is also convoluted and does too many things.
      
      Introduce a new lockless word-sized page counter API, then change all
      memory accounting over to it.  The translation from and to bytes then only
      happens when interfacing with userspace.
      
      The removed locking overhead is noticable when scaling beyond the per-cpu
      charge caches - on a 4-socket machine with 144-threads, the following test
      shows the performance differences of 288 memcgs concurrently running a
      page fault benchmark:
      
      vanilla:
      
         18631648.500498      task-clock (msec)         #  140.643 CPUs utilized            ( +-  0.33% )
               1,380,638      context-switches          #    0.074 K/sec                    ( +-  0.75% )
                  24,390      cpu-migrations            #    0.001 K/sec                    ( +-  8.44% )
           1,843,305,768      page-faults               #    0.099 M/sec                    ( +-  0.00% )
      50,134,994,088,218      cycles                    #    2.691 GHz                      ( +-  0.33% )
         <not supported>      stalled-cycles-frontend
         <not supported>      stalled-cycles-backend
       8,049,712,224,651      instructions              #    0.16  insns per cycle          ( +-  0.04% )
       1,586,970,584,979      branches                  #   85.176 M/sec                    ( +-  0.05% )
           1,724,989,949      branch-misses             #    0.11% of all branches          ( +-  0.48% )
      
           132.474343877 seconds time elapsed                                          ( +-  0.21% )
      
      lockless:
      
         12195979.037525      task-clock (msec)         #  133.480 CPUs utilized            ( +-  0.18% )
                 832,850      context-switches          #    0.068 K/sec                    ( +-  0.54% )
                  15,624      cpu-migrations            #    0.001 K/sec                    ( +- 10.17% )
           1,843,304,774      page-faults               #    0.151 M/sec                    ( +-  0.00% )
      32,811,216,801,141      cycles                    #    2.690 GHz                      ( +-  0.18% )
         <not supported>      stalled-cycles-frontend
         <not supported>      stalled-cycles-backend
       9,999,265,091,727      instructions              #    0.30  insns per cycle          ( +-  0.10% )
       2,076,759,325,203      branches                  #  170.282 M/sec                    ( +-  0.12% )
           1,656,917,214      branch-misses             #    0.08% of all branches          ( +-  0.55% )
      
            91.369330729 seconds time elapsed                                          ( +-  0.45% )
      
      On top of improved scalability, this also gets rid of the icky long long
      types in the very heart of memcg, which is great for 32 bit and also makes
      the code a lot more readable.
      
      Notable differences between the old and new API:
      
      - res_counter_charge() and res_counter_charge_nofail() become
        page_counter_try_charge() and page_counter_charge() resp. to match
        the more common kernel naming scheme of try_do()/do()
      
      - res_counter_uncharge_until() is only ever used to cancel a local
        counter and never to uncharge bigger segments of a hierarchy, so
        it's replaced by the simpler page_counter_cancel()
      
      - res_counter_set_limit() is replaced by page_counter_limit(), which
        expects its callers to serialize against themselves
      
      - res_counter_memparse_write_strategy() is replaced by
        page_counter_limit(), which rounds down to the nearest page size -
        rather than up.  This is more reasonable for explicitely requested
        hard upper limits.
      
      - to keep charging light-weight, page_counter_try_charge() charges
        speculatively, only to roll back if the result exceeds the limit.
        Because of this, a failing bigger charge can temporarily lock out
        smaller charges that would otherwise succeed.  The error is bounded
        to the difference between the smallest and the biggest possible
        charge size, so for memcg, this means that a failing THP charge can
        send base page charges into reclaim upto 2MB (4MB) before the limit
        would have been reached.  This should be acceptable.
      
      [akpm@linux-foundation.org: add includes for WARN_ON_ONCE and memparse]
      [akpm@linux-foundation.org: add includes for WARN_ON_ONCE, memparse, strncmp, and PAGE_SIZE]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3e32cb2e
  5. 30 10月, 2014 2 次提交
  6. 29 10月, 2014 1 次提交
  7. 28 10月, 2014 1 次提交
    • A
      bpf: split eBPF out of NET · f89b7755
      Alexei Starovoitov 提交于
      introduce two configs:
      - hidden CONFIG_BPF to select eBPF interpreter that classic socket filters
        depend on
      - visible CONFIG_BPF_SYSCALL (default off) that tracing and sockets can use
      
      that solves several problems:
      - tracing and others that wish to use eBPF don't need to depend on NET.
        They can use BPF_SYSCALL to allow loading from userspace or select BPF
        to use it directly from kernel in NET-less configs.
      - in 3.18 programs cannot be attached to events yet, so don't force it on
      - when the rest of eBPF infra is there in 3.19+, it's still useful to
        switch it off to minimize kernel size
      
      bloat-o-meter on x64 shows:
      add/remove: 0/60 grow/shrink: 0/2 up/down: 0/-15601 (-15601)
      
      tested with many different config combinations. Hopefully didn't miss anything.
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Acked-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f89b7755
  8. 14 10月, 2014 1 次提交
  9. 10 10月, 2014 1 次提交
    • M
      mm: remove misleading ARCH_USES_NUMA_PROT_NONE · 6a33979d
      Mel Gorman 提交于
      ARCH_USES_NUMA_PROT_NONE was defined for architectures that implemented
      _PAGE_NUMA using _PROT_NONE.  This saved using an additional PTE bit and
      relied on the fact that PROT_NONE vmas were skipped by the NUMA hinting
      fault scanner.  This was found to be conceptually confusing with a lot of
      implicit assumptions and it was asked that an alternative be found.
      
      Commit c46a7c81 "x86: define _PAGE_NUMA by reusing software bits on the
      PMD and PTE levels" redefined _PAGE_NUMA on x86 to be one of the swap PTE
      bits and shrunk the maximum possible swap size but it did not go far
      enough.  There are no architectures that reuse _PROT_NONE as _PROT_NUMA
      but the relics still exist.
      
      This patch removes ARCH_USES_NUMA_PROT_NONE and removes some unnecessary
      duplication in powerpc vs the generic implementation by defining the types
      the core NUMA helpers expected to exist from x86 with their ppc64
      equivalent.  This necessitated that a PTE bit mask be created that
      identified the bits that distinguish present from NUMA pte entries but it
      is expected this will only differ between arches based on _PAGE_PROTNONE.
      The naming for the generic helpers was taken from x86 originally but ppc64
      has types that are equivalent for the purposes of the helper so they are
      mapped instead of duplicating code.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6a33979d
  10. 04 10月, 2014 2 次提交
  11. 17 9月, 2014 1 次提交
  12. 08 9月, 2014 1 次提交
    • P
      rcu: Add call_rcu_tasks() · 8315f422
      Paul E. McKenney 提交于
      This commit adds a new RCU-tasks flavor of RCU, which provides
      call_rcu_tasks().  This RCU flavor's quiescent states are voluntary
      context switch (not preemption!) and userspace execution (not the idle
      loop -- use some sort of schedule_on_each_cpu() if you need to handle the
      idle tasks.  Note that unlike other RCU flavors, these quiescent states
      occur in tasks, not necessarily CPUs.  Includes fixes from Steven Rostedt.
      
      This RCU flavor is assumed to have very infrequent latency-tolerant
      updaters.  This assumption permits significant simplifications, including
      a single global callback list protected by a single global lock, along
      with a single task-private linked list containing all tasks that have not
      yet passed through a quiescent state.  If experience shows this assumption
      to be incorrect, the required additional complexity will be added.
      Suggested-by: NSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      8315f422
  13. 27 8月, 2014 1 次提交
    • B
      kbuild: handle module compression while running 'make modules_install'. · beb50df3
      Bertrand Jacquin 提交于
      Since module-init-tools (gzip) and kmod (gzip and xz) support compressed
      modules, it could be useful to include a support for compressing modules
      right after having them installed. Doing this in kbuild instead of per
      distro can permit to make this kind of usage more generic.
      
      This patch add a Kconfig entry to "Enable loadable module support" menu
      and let you choose to compress using gzip (default) or xz.
      
      Both gzip and xz does not used any extra -[1-9] option since Andi Kleen
      and Rusty Russell prove no gain is made using them. gzip is called with -n
      argument to avoid storing original filename inside compressed file, that
      way we can save some more bytes.
      
      On a v3.16 kernel, 'make allmodconfig' generated 4680 modules for a
      total of 378MB (no strip, no sign, no compress), the following table
      shows observed disk space gain based on the allmodconfig .config :
      
             |           time                |
             +-------------+-----------------+
             | manual .ko  |       make      | size | percent
             | compression | modules_install |      | gain
             +-------------+-----------------+------+--------
        -    |             |     18.61s      | 378M |
        GZIP |   3m16s     |     3m37s       | 102M | 73.41%
        XZ   |   5m22s     |     5m39s       |  77M | 79.83%
      
      The gain for restricted environnement seems to be interesting while
      uncompress can be time consuming but happens only while loading a module,
      that is generally done only once.
      
      This is fully compatible with signed modules while the signed module is
      compressed. module-init-tools or kmod handles decompression
      and provide to other layer the uncompressed but signed payload.
      Reviewed-by: NWilly Tarreau <w@1wt.eu>
      Signed-off-by: NBertrand Jacquin <beber@meleeweb.net>
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      beb50df3
  14. 26 8月, 2014 1 次提交
  15. 18 8月, 2014 1 次提交
    • J
      mm: Support compiling out madvise and fadvise · d3ac21ca
      Josh Triplett 提交于
      Many embedded systems will not need these syscalls, and omitting them
      saves space.  Add a new EXPERT config option CONFIG_ADVISE_SYSCALLS
      (default y) to support compiling them out.
      
      bloat-o-meter:
      add/remove: 0/3 grow/shrink: 0/0 up/down: 0/-2250 (-2250)
      function                                     old     new   delta
      sys_fadvise64                                 57       -     -57
      sys_fadvise64_64                             691       -    -691
      sys_madvise                                 1502       -   -1502
      Signed-off-by: NJosh Triplett <josh@joshtriplett.org>
      d3ac21ca
  16. 15 8月, 2014 1 次提交
  17. 09 8月, 2014 1 次提交
    • V
      kernel: build bin2c based on config option CONFIG_BUILD_BIN2C · de5b56ba
      Vivek Goyal 提交于
      currently bin2c builds only if CONFIG_IKCONFIG=y. But bin2c will now be
      used by kexec too.  So make it compilation dependent on CONFIG_BUILD_BIN2C
      and this config option can be selected by CONFIG_KEXEC and CONFIG_IKCONFIG.
      Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Greg Kroah-Hartman <greg@kroah.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: WANG Chao <chaowang@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      de5b56ba
  18. 07 8月, 2014 1 次提交
    • L
      printk: allow increasing the ring buffer depending on the number of CPUs · 23b2899f
      Luis R. Rodriguez 提交于
      The default size of the ring buffer is too small for machines with a
      large amount of CPUs under heavy load.  What ends up happening when
      debugging is the ring buffer overlaps and chews up old messages making
      debugging impossible unless the size is passed as a kernel parameter.
      An idle system upon boot up will on average spew out only about one or
      two extra lines but where this really matters is on heavy load and that
      will vary widely depending on the system and environment.
      
      There are mechanisms to help increase the kernel ring buffer for tracing
      through debugfs, and those interfaces even allow growing the kernel ring
      buffer per CPU.  We also have a static value which can be passed upon
      boot.  Relying on debugfs however is not ideal for production, and
      relying on the value passed upon bootup is can only used *after* an
      issue has creeped up.  Instead of being reactive this adds a proactive
      measure which lets you scale the amount of contributions you'd expect to
      the kernel ring buffer under load by each CPU in the worst case
      scenario.
      
      We use num_possible_cpus() to avoid complexities which could be
      introduced by dynamically changing the ring buffer size at run time,
      num_possible_cpus() lets us use the upper limit on possible number of
      CPUs therefore avoiding having to deal with hotplugging CPUs on and off.
      This introduces the kernel configuration option LOG_CPU_MAX_BUF_SHIFT
      which is used to specify the maximum amount of contributions to the
      kernel ring buffer in the worst case before the kernel ring buffer flips
      over, the size is specified as a power of 2.  The total amount of
      contributions made by each CPU must be greater than half of the default
      kernel ring buffer size (1 << LOG_BUF_SHIFT bytes) in order to trigger
      an increase upon bootup.  The kernel ring buffer is increased to the
      next power of two that would fit the required minimum kernel ring buffer
      size plus the additional CPU contribution.  For example if LOG_BUF_SHIFT
      is 18 (256 KB) you'd require at least 128 KB contributions by other CPUs
      in order to trigger an increase of the kernel ring buffer.  With a
      LOG_CPU_BUF_SHIFT of 12 (4 KB) you'd require at least anything over > 64
      possible CPUs to trigger an increase.  If you had 128 possible CPUs the
      amount of minimum required kernel ring buffer bumps to:
      
         ((1 << 18) + ((128 - 1) * (1 << 12))) / 1024 = 764 KB
      
      Since we require the ring buffer to be a power of two the new required
      size would be 1024 KB.
      
      This CPU contributions are ignored when the "log_buf_len" kernel
      parameter is used as it forces the exact size of the ring buffer to an
      expected power of two value.
      
      [pmladek@suse.cz: fix build]
      Signed-off-by: NLuis R. Rodriguez <mcgrof@suse.com>
      Signed-off-by: NPetr Mladek <pmladek@suse.cz>
      Tested-by: NDavidlohr Bueso <davidlohr@hp.com>
      Tested-by: NPetr Mladek <pmladek@suse.cz>
      Reviewed-by: NDavidlohr Bueso <davidlohr@hp.com>
      Cc: Andrew Lunn <andrew@lunn.ch>
      Cc: Stephen Warren <swarren@wwwdotorg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Joe Perches <joe@perches.com>
      Cc: Arun KS <arunks.linux@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      23b2899f
  19. 10 7月, 2014 1 次提交
  20. 08 7月, 2014 1 次提交
    • P
      rcu: Don't offload callbacks unless specifically requested · b58cc46c
      Paul E. McKenney 提交于
      Enabling NO_HZ_FULL currently has the side effect of enabling callback
      offloading on all CPUs.  This results in lots of additional rcuo kthreads,
      and can also increase context switching and wakeups, even in cases where
      callback offloading is neither needed nor particularly desirable.  This
      commit therefore enables callback offloading on a given CPU only if
      specifically requested at build time or boot time, or if that CPU has
      been specifically designated (again, either at build time or boot time)
      as a nohz_full CPU.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      b58cc46c
  21. 05 6月, 2014 4 次提交
  22. 19 4月, 2014 1 次提交
  23. 08 4月, 2014 1 次提交
  24. 04 4月, 2014 2 次提交
  25. 20 3月, 2014 2 次提交
  26. 03 3月, 2014 1 次提交
  27. 12 2月, 2014 1 次提交
    • T
      cgroup: convert to kernfs · 2bd59d48
      Tejun Heo 提交于
      cgroup filesystem code was derived from the original sysfs
      implementation which was heavily intertwined with vfs objects and
      locking with the goal of re-using the existing vfs infrastructure.
      That experiment turned out rather disastrous and sysfs switched, a
      long time ago, to distributed filesystem model where a separate
      representation is maintained which is queried by vfs.  Unfortunately,
      cgroup stuck with the failed experiment all these years and
      accumulated even more problems over time.
      
      Locking and object lifetime management being entangled with vfs is
      probably the most egregious.  vfs is never designed to be misused like
      this and cgroup ends up jumping through various convoluted dancing to
      make things work.  Even then, operations across multiple cgroups can't
      be done safely as it'll deadlock with rename locking.
      
      Recently, kernfs is separated out from sysfs so that it can be used by
      users other than sysfs.  This patch converts cgroup to use kernfs,
      which will bring the following benefits.
      
      * Separation from vfs internals.  Locking and object lifetime
        management is contained in cgroup proper making things a lot
        simpler.  This removes significant amount of locking convolutions,
        hairy object lifetime rules and the restriction on multi-cgroup
        operations.
      
      * Can drop a lot of code to implement filesystem interface as most are
        provided by kernfs.
      
      * Proper "severing" semantics, which allows controllers to not worry
        about lingering file accesses after offline.
      
      While the preceding patches did as much as possible to make the
      transition less painful, large part of the conversion has to be one
      discrete step making this patch rather large.  The rest of the commit
      message lists notable changes in different areas.
      
      Overall
      -------
      
      * vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
        cgroupfs_root->sb w/ ->kf_root.
      
      * All dentry accessors are removed.  Helpers to map from kernfs
        constructs are added.
      
      * All vfs plumbing around dentry, inode and bdi removed.
      
      * cgroup_mount() now directly looks for matching root and then
        proceeds to create a new one if not found.
      
      Synchronization and object lifetime
      -----------------------------------
      
      * vfs inode locking removed.  Among other things, this removes the
        need for the convolution in cgroup_cfts_commit().  Future patches
        will further simplify it.
      
      * vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
        cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
        root->refcnt and when it reaches zero proceeds to destroy it thus
        merging cgroup_put_root() and the former cgroup_kill_sb().
        Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
        when refcnt reaches zero.
      
      * Unlike before, kernfs objects don't hold onto cgroup objects.  When
        cgroup destroys a kernfs node, all existing operations are drained
        and the association is broken immediately.  The same for
        cgroupfs_roots and mounts.
      
      * All operations which come through kernfs guarantee that the
        associated cgroup is and stays valid for the duration of operation;
        however, there are two paths which need to find out the associated
        cgroup from dentry without going through kernfs -
        css_tryget_from_dir() and cgroupstats_build().  For these two,
        kernfs_node->priv is RCU managed so that they can dereference it
        under RCU read lock.
      
      File and directory handling
      ---------------------------
      
      * File and directory operations converted to kernfs_ops and
        kernfs_syscall_ops.
      
      * xattrs is implicitly supported by kernfs.  No need to worry about it
        from cgroup.  This means that "xattr" mount option is no longer
        necessary.  A future patch will add a deprecated warning message
        when sane_behavior.
      
      * When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
        private copy of one of the kernfs_ops to set its atomic_write_len.
        cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
        to handle it.
      
      * cftype->lockdep_key added so that kernfs lockdep annotation can be
        per cftype.
      
      * Inidividual file entries and open states are now managed by kernfs.
        No need to worry about them from cgroup.  cfent, cgroup_open_file
        and their friends are removed.
      
      * kernfs_nodes are created deactivated and kernfs_activate()
        invocations added to places where creation of new nodes are
        committed.
      
      * cgroup_rmdir() uses kernfs_[un]break_active_protection() for
        self-removal.
      
      v2: - Li pointed out in an earlier patch that specifying "name="
            during mount without subsystem specification should succeed if
            there's an existing hierarchy with a matching name although it
            should fail with -EINVAL if a new hierarchy should be created.
            Prior to the conversion, this used by handled by deferring
            failure from NULL return from cgroup_root_from_opts(), which was
            necessary because root was being created before checking for
            existing ones.  Note that cgroup_root_from_opts() returned an
            ERR_PTR() value for error conditions which require immediate
            mount failure.
      
            As we now have separate search and creation steps, deferring
            failure from cgroup_root_from_opts() is no longer necessary.
            cgroup_root_from_opts() is updated to always return ERR_PTR()
            value on failure.
      
          - The logic to match existing roots is updated so that a mount
            attempt with a matching name but different subsys_mask are
            rejected.  This was handled by a separate matching loop under
            the comment "Check for name clashes with existing mounts" but
            got lost during conversion.  Merge the check into the main
            search loop.
      
          - Add __rcu __force casting in RCU_INIT_POINTER() in
            cgroup_destroy_locked() to avoid the sparse address space
            warning reported by kbuild test bot.  Maybe we want an explicit
            interface to use kn->priv as RCU protected pointer?
      
      v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
      
      v4: Rebased on top of 0ab02ca8 ("cgroup: protect modifications to
          cgroup_idr with cgroup_mutex").
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: kbuild test robot fengguang.wu@intel.com>
      2bd59d48