1. 17 12月, 2014 1 次提交
    • M
      init: fix read-write root mount · 10975933
      Miklos Szeredi 提交于
      If mount flags don't have MS_RDONLY, iso9660 returns EACCES without actually
      checking if it's an iso image.
      
      This tricks mount_block_root() into retrying with MS_RDONLY.  This results
      in a read-only root despite the "rw" boot parameter if the actual
      filesystem was checked after iso9660.
      
      I believe the behavior of iso9660 is okay, while that of mount_block_root()
      is not.  It should rather try all types without MS_RDONLY and only then
      retry with MS_RDONLY.
      
      This change also makes the code more robust against the case when EACCES is
      returned despite MS_RDONLY, which would've resulted in a lockup.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      10975933
  2. 15 12月, 2014 1 次提交
  3. 14 12月, 2014 1 次提交
    • J
      mm/page_ext: resurrect struct page extending code for debugging · eefa864b
      Joonsoo Kim 提交于
      When we debug something, we'd like to insert some information to every
      page.  For this purpose, we sometimes modify struct page itself.  But,
      this has drawbacks.  First, it requires re-compile.  This makes us
      hesitate to use the powerful debug feature so development process is
      slowed down.  And, second, sometimes it is impossible to rebuild the
      kernel due to third party module dependency.  At third, system behaviour
      would be largely different after re-compile, because it changes size of
      struct page greatly and this structure is accessed by every part of
      kernel.  Keeping this as it is would be better to reproduce errornous
      situation.
      
      This feature is intended to overcome above mentioned problems.  This
      feature allocates memory for extended data per page in certain place
      rather than the struct page itself.  This memory can be accessed by the
      accessor functions provided by this code.  During the boot process, it
      checks whether allocation of huge chunk of memory is needed or not.  If
      not, it avoids allocating memory at all.  With this advantage, we can
      include this feature into the kernel in default and can avoid rebuild and
      solve related problems.
      
      Until now, memcg uses this technique.  But, now, memcg decides to embed
      their variable to struct page itself and it's code to extend struct page
      has been removed.  I'd like to use this code to develop debug feature, so
      this patch resurrect it.
      
      To help these things to work well, this patch introduces two callbacks for
      clients.  One is the need callback which is mandatory if user wants to
      avoid useless memory allocation at boot-time.  The other is optional, init
      callback, which is used to do proper initialization after memory is
      allocated.  Detailed explanation about purpose of these functions is in
      code comment.  Please refer it.
      
      Others are completely same with previous extension code in memcg.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eefa864b
  4. 11 12月, 2014 8 次提交
    • A
      take the targets of /proc/*/ns/* symlinks to separate fs · e149ed2b
      Al Viro 提交于
      New pseudo-filesystem: nsfs.  Targets of /proc/*/ns/* live there now.
      It's not mountable (not even registered, so it's not in /proc/filesystems,
      etc.).  Files on it *are* bindable - we explicitly permit that in do_loopback().
      
      This stuff lives in fs/nsfs.c now; proc_ns_fget() moved there as well.
      get_proc_ns() is a macro now (it's simply returning ->i_private; would
      have been an inline, if not for header ordering headache).
      proc_ns_inode() is an ex-parrot.  The interface used in procfs is
      ns_get_path(path, task, ops) and ns_get_name(buf, size, task, ops).
      
      Dentries and inodes are never hashed; a non-counting reference to dentry
      is stashed in ns_common (removed by ->d_prune()) and reused by ns_get_path()
      if present.  See ns_get_path()/ns_prune_dentry/nsfs_evict() for details
      of that mechanism.
      
      As the result, proc_ns_follow_link() has stopped poking in nd->path.mnt;
      it does nd_jump_link() on a consistent <vfsmount,dentry> pair it gets
      from ns_get_path().
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      e149ed2b
    • A
      init: allow CONFIG_INIT_FALLBACK=n to disable defaults if init= fails · 6ef4536e
      Andy Lutomirski 提交于
      If a user puts init=/whatever on the command line and /whatever can't be
      run, then the kernel will try a few default options before giving up.  If
      init=/whatever came from a bootloader prompt, then this is unexpected but
      probably harmless.  On the other hand, if it comes from a script (e.g.  a
      tool like virtme or perhaps a future kselftest script), then the fallbacks
      are likely to exist, but they'll do the wrong thing.  For example, they
      might unexpectedly invoke systemd.
      
      This adds a config option CONFIG_INIT_FALLBACK.  If unset, then a failure
      to run the specified init= process be fatal.
      
      The tentative plan is to remove CONFIG_INIT_FALLBACK for 3.20.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NAndy Lutomirski <luto@amacapital.net>
      Cc: Rob Landley <rob@landley.net>
      Cc: Chuck Ebbert <cebbert.lkml@gmail.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shuah Khan <shuah.kh@samsung.com>
      Cc: Frank Rowand <frowand.list@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Acked-by: NRusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6ef4536e
    • J
      mm: move page->mem_cgroup bad page handling into generic code · 9edad6ea
      Johannes Weiner 提交于
      Now that the external page_cgroup data structure and its lookup is
      gone, let the generic bad_page() check for page->mem_cgroup sanity.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9edad6ea
    • J
      mm: embed the memcg pointer directly into struct page · 1306a85a
      Johannes Weiner 提交于
      Memory cgroups used to have 5 per-page pointers.  To allow users to
      disable that amount of overhead during runtime, those pointers were
      allocated in a separate array, with a translation layer between them and
      struct page.
      
      There is now only one page pointer remaining: the memcg pointer, that
      indicates which cgroup the page is associated with when charged.  The
      complexity of runtime allocation and the runtime translation overhead is
      no longer justified to save that *potential* 0.19% of memory.  With
      CONFIG_SLUB, page->mem_cgroup actually sits in the doubleword padding
      after the page->private member and doesn't even increase struct page,
      and then this patch actually saves space.  Remaining users that care can
      still compile their kernels without CONFIG_MEMCG.
      
           text    data     bss     dec     hex     filename
        8828345 1725264  983040 11536649 b00909  vmlinux.old
        8827425 1725264  966656 11519345 afc571  vmlinux.new
      
      [mhocko@suse.cz: update Documentation/cgroups/memory.txt]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NKonstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1306a85a
    • A
      mm/numa balancing: rearrange Kconfig entry · 6f7c97e8
      Aneesh Kumar K.V 提交于
      Add the default enable config option after the NUMA_BALANCING option so
      that it appears related in the nconfig interface.
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6f7c97e8
    • J
      kernel: res_counter: remove the unused API · 5b1efc02
      Johannes Weiner 提交于
      All memory accounting and limiting has been switched over to the
      lockless page counters.  Bye, res_counter!
      
      [akpm@linux-foundation.org: update Documentation/cgroups/memory.txt]
      [mhocko@suse.cz: ditch the last remainings of res_counter]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Paul Bolle <pebolle@tiscali.nl>
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5b1efc02
    • J
      mm: hugetlb_cgroup: convert to lockless page counters · 71f87bee
      Johannes Weiner 提交于
      Abandon the spinlock-protected byte counters in favor of the unlocked
      page counters in the hugetlb controller as well.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      71f87bee
    • J
      mm: memcontrol: lockless page counters · 3e32cb2e
      Johannes Weiner 提交于
      Memory is internally accounted in bytes, using spinlock-protected 64-bit
      counters, even though the smallest accounting delta is a page.  The
      counter interface is also convoluted and does too many things.
      
      Introduce a new lockless word-sized page counter API, then change all
      memory accounting over to it.  The translation from and to bytes then only
      happens when interfacing with userspace.
      
      The removed locking overhead is noticable when scaling beyond the per-cpu
      charge caches - on a 4-socket machine with 144-threads, the following test
      shows the performance differences of 288 memcgs concurrently running a
      page fault benchmark:
      
      vanilla:
      
         18631648.500498      task-clock (msec)         #  140.643 CPUs utilized            ( +-  0.33% )
               1,380,638      context-switches          #    0.074 K/sec                    ( +-  0.75% )
                  24,390      cpu-migrations            #    0.001 K/sec                    ( +-  8.44% )
           1,843,305,768      page-faults               #    0.099 M/sec                    ( +-  0.00% )
      50,134,994,088,218      cycles                    #    2.691 GHz                      ( +-  0.33% )
         <not supported>      stalled-cycles-frontend
         <not supported>      stalled-cycles-backend
       8,049,712,224,651      instructions              #    0.16  insns per cycle          ( +-  0.04% )
       1,586,970,584,979      branches                  #   85.176 M/sec                    ( +-  0.05% )
           1,724,989,949      branch-misses             #    0.11% of all branches          ( +-  0.48% )
      
           132.474343877 seconds time elapsed                                          ( +-  0.21% )
      
      lockless:
      
         12195979.037525      task-clock (msec)         #  133.480 CPUs utilized            ( +-  0.18% )
                 832,850      context-switches          #    0.068 K/sec                    ( +-  0.54% )
                  15,624      cpu-migrations            #    0.001 K/sec                    ( +- 10.17% )
           1,843,304,774      page-faults               #    0.151 M/sec                    ( +-  0.00% )
      32,811,216,801,141      cycles                    #    2.690 GHz                      ( +-  0.18% )
         <not supported>      stalled-cycles-frontend
         <not supported>      stalled-cycles-backend
       9,999,265,091,727      instructions              #    0.30  insns per cycle          ( +-  0.10% )
       2,076,759,325,203      branches                  #  170.282 M/sec                    ( +-  0.12% )
           1,656,917,214      branch-misses             #    0.08% of all branches          ( +-  0.55% )
      
            91.369330729 seconds time elapsed                                          ( +-  0.45% )
      
      On top of improved scalability, this also gets rid of the icky long long
      types in the very heart of memcg, which is great for 32 bit and also makes
      the code a lot more readable.
      
      Notable differences between the old and new API:
      
      - res_counter_charge() and res_counter_charge_nofail() become
        page_counter_try_charge() and page_counter_charge() resp. to match
        the more common kernel naming scheme of try_do()/do()
      
      - res_counter_uncharge_until() is only ever used to cancel a local
        counter and never to uncharge bigger segments of a hierarchy, so
        it's replaced by the simpler page_counter_cancel()
      
      - res_counter_set_limit() is replaced by page_counter_limit(), which
        expects its callers to serialize against themselves
      
      - res_counter_memparse_write_strategy() is replaced by
        page_counter_limit(), which rounds down to the nearest page size -
        rather than up.  This is more reasonable for explicitely requested
        hard upper limits.
      
      - to keep charging light-weight, page_counter_try_charge() charges
        speculatively, only to roll back if the result exceeds the limit.
        Because of this, a failing bigger charge can temporarily lock out
        smaller charges that would otherwise succeed.  The error is bounded
        to the difference between the smallest and the biggest possible
        charge size, so for memcg, this means that a failing THP charge can
        send base page charges into reclaim upto 2MB (4MB) before the limit
        would have been reached.  This should be acceptable.
      
      [akpm@linux-foundation.org: add includes for WARN_ON_ONCE and memparse]
      [akpm@linux-foundation.org: add includes for WARN_ON_ONCE, memparse, strncmp, and PAGE_SIZE]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3e32cb2e
  5. 05 12月, 2014 2 次提交
  6. 18 11月, 2014 1 次提交
    • D
      integrity: provide a hook to load keys when rootfs is ready · c9cd2ce2
      Dmitry Kasatkin 提交于
      Keys can only be loaded once the rootfs is mounted. Initcalls
      are not suitable for that. This patch defines a special hook
      to load the x509 public keys onto the IMA keyring, before
      attempting to access any file. The keys are required for
      verifying the file's signature. The hook is called after the
      root filesystem is mounted and before the kernel calls 'init'.
      
      Changes in v3:
      * added more explanation to the patch description (Mimi)
      
      Changes in v2:
      * Hook renamed as 'integrity_load_keys()' to handle both IMA and EVM
        keys by integrity subsystem.
      * Hook patch moved after defining loading functions
      Signed-off-by: NDmitry Kasatkin <d.kasatkin@samsung.com>
      Signed-off-by: NMimi Zohar <zohar@linux.vnet.ibm.com>
      c9cd2ce2
  7. 11 11月, 2014 1 次提交
    • D
      param: fix crash on bad kernel arguments · 3438cf54
      Daniel Thompson 提交于
      Currently if the user passes an invalid value on the kernel command line
      then the kernel will crash during argument parsing. On most systems this
      is very hard to debug because the console hasn't been initialized yet.
      
      This is a regression due to commit 51e158c1 ("param: hand arguments
      after -- straight to init") which, in response to the systemd debug
      controversy, made it possible to explicitly pass arguments to init. To
      achieve this parse_args() was extended from simply returning an error
      code to returning a pointer. Regretably the new init args logic does not
      perform a proper validity check on the pointer resulting in a crash.
      
      This patch fixes the validity check. Should the check fail then no arguments
      will be passed to init. This is reasonable and matches how the kernel treats
      its own arguments (i.e. no error recovery).
      Signed-off-by: NDaniel Thompson <daniel.thompson@linaro.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      3438cf54
  8. 30 10月, 2014 2 次提交
  9. 29 10月, 2014 1 次提交
  10. 28 10月, 2014 1 次提交
    • A
      bpf: split eBPF out of NET · f89b7755
      Alexei Starovoitov 提交于
      introduce two configs:
      - hidden CONFIG_BPF to select eBPF interpreter that classic socket filters
        depend on
      - visible CONFIG_BPF_SYSCALL (default off) that tracing and sockets can use
      
      that solves several problems:
      - tracing and others that wish to use eBPF don't need to depend on NET.
        They can use BPF_SYSCALL to allow loading from userspace or select BPF
        to use it directly from kernel in NET-less configs.
      - in 3.18 programs cannot be attached to events yet, so don't force it on
      - when the rest of eBPF infra is there in 3.19+, it's still useful to
        switch it off to minimize kernel size
      
      bloat-o-meter on x64 shows:
      add/remove: 0/60 grow/shrink: 0/2 up/down: 0/-15601 (-15601)
      
      tested with many different config combinations. Hopefully didn't miss anything.
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Acked-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f89b7755
  11. 14 10月, 2014 3 次提交
  12. 10 10月, 2014 1 次提交
    • M
      mm: remove misleading ARCH_USES_NUMA_PROT_NONE · 6a33979d
      Mel Gorman 提交于
      ARCH_USES_NUMA_PROT_NONE was defined for architectures that implemented
      _PAGE_NUMA using _PROT_NONE.  This saved using an additional PTE bit and
      relied on the fact that PROT_NONE vmas were skipped by the NUMA hinting
      fault scanner.  This was found to be conceptually confusing with a lot of
      implicit assumptions and it was asked that an alternative be found.
      
      Commit c46a7c81 "x86: define _PAGE_NUMA by reusing software bits on the
      PMD and PTE levels" redefined _PAGE_NUMA on x86 to be one of the swap PTE
      bits and shrunk the maximum possible swap size but it did not go far
      enough.  There are no architectures that reuse _PROT_NONE as _PROT_NUMA
      but the relics still exist.
      
      This patch removes ARCH_USES_NUMA_PROT_NONE and removes some unnecessary
      duplication in powerpc vs the generic implementation by defining the types
      the core NUMA helpers expected to exist from x86 with their ppc64
      equivalent.  This necessitated that a PTE bit mask be created that
      identified the bits that distinguish present from NUMA pte entries but it
      is expected this will only differ between arches based on _PAGE_PROTNONE.
      The naming for the generic helpers was taken from x86 originally but ppc64
      has types that are equivalent for the purposes of the helper so they are
      mapped instead of duplicating code.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6a33979d
  13. 04 10月, 2014 2 次提交
  14. 19 9月, 2014 1 次提交
    • A
      init/main.c: Give init_task a canary · d4311ff1
      Aaron Tomlin 提交于
      Tasks get their end of stack set to STACK_END_MAGIC with the
      aim to catch stack overruns. Currently this feature does not
      apply to init_task. This patch removes this restriction.
      
      Note that a similar patch was posted by Prarit Bhargava
      some time ago but was never merged:
      
        http://marc.info/?l=linux-kernel&m=127144305403241&w=2Signed-off-by: NAaron Tomlin <atomlin@redhat.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NMichael Ellerman <mpe@ellerman.id.au>
      Cc: aneesh.kumar@linux.vnet.ibm.com
      Cc: dzickus@redhat.com
      Cc: bmr@redhat.com
      Cc: jcastillo@redhat.com
      Cc: jgh@redhat.com
      Cc: minchan@kernel.org
      Cc: tglx@linutronix.de
      Cc: hannes@cmpxchg.org
      Cc: Alex Thorlton <athorlton@sgi.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Daeseok Youn <daeseok.youn@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Michael Opdenacker <michael.opdenacker@free-electrons.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Seiji Aguchi <seiji.aguchi@hds.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lkml.kernel.org/r/1410527779-8133-2-git-send-email-atomlin@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d4311ff1
  15. 17 9月, 2014 2 次提交
    • P
      Revert "init: make rootdelay=N consistent with rootwait behaviour" · 8ba4caf1
      Paul Gortmaker 提交于
      This reverts commit 4dfe694f.
      
      In that, we did:
      
        Here we move the rootdelay code to be right beside the rootwait code, so
        that their behaviour is consistent.
      
      ...which is fine, but in hindsight, perhaps moving the rootwait to be
      beside the rootdelay would have been better.  We also indicated:
      
        It should be noted that in doing so, the actions based on the
        saved_root_name[0] and initrd_load() were previously put on hold by
        rootdelay=N and now currently will not be delayed.  However, I think
        consistent behaviour is more important than matching historical behaviour
        of delaying the above two operations.
      
      But Pavel reported an instance where an ARM target with root on MMC
      was failing to mount root, and Russell diagnosed it to the fact that
      the call to set ROOT_DEV within the saved_root_name[0] processing
      block mentioned above was no longer being delayed.
      
      Rather than moving both wait clauses to the original position of
      rootdelay and risking unearthing other possible corner case breakage
      at this point in time, we simply revert now and we can revisit
      trying the alternate/earlier location in another development cycle.
      
      Cc: Pavel Machek <pavel@denx.de>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8ba4caf1
    • P
      rcu: Fix attempt to avoid unsolicited offloading of callbacks · f4579fc5
      Paul E. McKenney 提交于
      Commit b58cc46c (rcu: Don't offload callbacks unless specifically
      requested) failed to adjust the callback lists of the CPUs that are
      known to be no-CBs CPUs only because they are also nohz_full= CPUs.
      This failure can result in callbacks that are posted during early boot
      getting stranded on nxtlist for CPUs whose no-CBs property becomes
      apparent late, and there can also be spurious warnings about offline
      CPUs posting callbacks.
      
      This commit fixes these problems by adding an early-boot rcu_init_nohz()
      that properly initializes the no-CBs CPUs.
      
      Note that kernels built with CONFIG_RCU_NOCB_CPU_ALL=y or with
      CONFIG_RCU_NOCB_CPU=n do not exhibit this bug.  Neither do kernels
      booted without the nohz_full= boot parameter.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NPranith Kumar <bobby.prani@gmail.com>
      Tested-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      f4579fc5
  16. 14 9月, 2014 1 次提交
    • F
      nohz: Move nohz full init call to tick init · a80e49e2
      Frederic Weisbecker 提交于
      This way we unbloat a bit main.c and more importantly we initialize
      nohz full after init_IRQ(). This dependency will be needed in further
      patches because nohz full needs irq work to raise its own IRQ.
      Information about the support for this ability on ARM64 is obtained on
      init_IRQ() which initialize the pointer to __smp_call_function.
      
      Since tick_init() is called right after init_IRQ(), this is a good place
      to call tick_nohz_init() and prepare for that dependency.
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      a80e49e2
  17. 08 9月, 2014 1 次提交
    • P
      rcu: Add call_rcu_tasks() · 8315f422
      Paul E. McKenney 提交于
      This commit adds a new RCU-tasks flavor of RCU, which provides
      call_rcu_tasks().  This RCU flavor's quiescent states are voluntary
      context switch (not preemption!) and userspace execution (not the idle
      loop -- use some sort of schedule_on_each_cpu() if you need to handle the
      idle tasks.  Note that unlike other RCU flavors, these quiescent states
      occur in tasks, not necessarily CPUs.  Includes fixes from Steven Rostedt.
      
      This RCU flavor is assumed to have very infrequent latency-tolerant
      updaters.  This assumption permits significant simplifications, including
      a single global callback list protected by a single global lock, along
      with a single task-private linked list containing all tasks that have not
      yet passed through a quiescent state.  If experience shows this assumption
      to be incorrect, the required additional complexity will be added.
      Suggested-by: NSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      8315f422
  18. 28 8月, 2014 1 次提交
  19. 27 8月, 2014 1 次提交
    • B
      kbuild: handle module compression while running 'make modules_install'. · beb50df3
      Bertrand Jacquin 提交于
      Since module-init-tools (gzip) and kmod (gzip and xz) support compressed
      modules, it could be useful to include a support for compressing modules
      right after having them installed. Doing this in kbuild instead of per
      distro can permit to make this kind of usage more generic.
      
      This patch add a Kconfig entry to "Enable loadable module support" menu
      and let you choose to compress using gzip (default) or xz.
      
      Both gzip and xz does not used any extra -[1-9] option since Andi Kleen
      and Rusty Russell prove no gain is made using them. gzip is called with -n
      argument to avoid storing original filename inside compressed file, that
      way we can save some more bytes.
      
      On a v3.16 kernel, 'make allmodconfig' generated 4680 modules for a
      total of 378MB (no strip, no sign, no compress), the following table
      shows observed disk space gain based on the allmodconfig .config :
      
             |           time                |
             +-------------+-----------------+
             | manual .ko  |       make      | size | percent
             | compression | modules_install |      | gain
             +-------------+-----------------+------+--------
        -    |             |     18.61s      | 378M |
        GZIP |   3m16s     |     3m37s       | 102M | 73.41%
        XZ   |   5m22s     |     5m39s       |  77M | 79.83%
      
      The gain for restricted environnement seems to be interesting while
      uncompress can be time consuming but happens only while loading a module,
      that is generally done only once.
      
      This is fully compatible with signed modules while the signed module is
      compressed. module-init-tools or kmod handles decompression
      and provide to other layer the uncompressed but signed payload.
      Reviewed-by: NWilly Tarreau <w@1wt.eu>
      Signed-off-by: NBertrand Jacquin <beber@meleeweb.net>
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      beb50df3
  20. 26 8月, 2014 1 次提交
  21. 18 8月, 2014 1 次提交
    • J
      mm: Support compiling out madvise and fadvise · d3ac21ca
      Josh Triplett 提交于
      Many embedded systems will not need these syscalls, and omitting them
      saves space.  Add a new EXPERT config option CONFIG_ADVISE_SYSCALLS
      (default y) to support compiling them out.
      
      bloat-o-meter:
      add/remove: 0/3 grow/shrink: 0/0 up/down: 0/-2250 (-2250)
      function                                     old     new   delta
      sys_fadvise64                                 57       -     -57
      sys_fadvise64_64                             691       -    -691
      sys_madvise                                 1502       -   -1502
      Signed-off-by: NJosh Triplett <josh@joshtriplett.org>
      d3ac21ca
  22. 15 8月, 2014 1 次提交
  23. 09 8月, 2014 5 次提交
    • V
      kernel: build bin2c based on config option CONFIG_BUILD_BIN2C · de5b56ba
      Vivek Goyal 提交于
      currently bin2c builds only if CONFIG_IKCONFIG=y. But bin2c will now be
      used by kexec too.  So make it compilation dependent on CONFIG_BUILD_BIN2C
      and this config option can be selected by CONFIG_KEXEC and CONFIG_IKCONFIG.
      Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Greg Kroah-Hartman <greg@kroah.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: WANG Chao <chaowang@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      de5b56ba
    • F
      init/main.c: code clean-up · dd4d9fec
      Fabian Frederick 提交于
      Fixing some checkpatch warnings(remove global initialization, move
      __initdata, coalesce formats ...)
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dd4d9fec
    • D
      initramfs: add write error checks · 9687fd91
      David Engraf 提交于
      On a system with low memory extracting the initramfs may fail.  If this
      happens the user gets "Failed to execute /init" instead of an initramfs
      error.
      
      Check return value of sys_write and call error() when the write was
      incomplete or failed.
      Signed-off-by: NDavid Engraf <david.engraf@sysgo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9687fd91
    • Y
      initramfs: support initramfs that is bigger than 2GiB · d97b07c5
      Yinghai Lu 提交于
      Now with 64bit bzImage and kexec tools, we support ramdisk that size is
      bigger than 2g, as we could put it above 4G.
      
      Found compressed initramfs image could not be decompressed properly.  It
      turns out that image length is int during decompress detection, and it
      will become < 0 when length is more than 2G.  Furthermore, during
      decompressing len as int is used for inbuf count, that has problem too.
      
      Change len to long, that should be ok as on 32 bit platform long is
      32bits.
      
      Tested with following compressed initramfs image as root with kexec.
      	gzip, bzip2, xz, lzma, lzop, lz4.
      run time for populate_rootfs():
         size        name       Nehalem-EX  Westmere-EX  Ivybridge-EX
       9034400256 root_img     :   26s           24s          30s
       3561095057 root_img.lz4 :   28s           27s          27s
       3459554629 root_img.lzo :   29s           29s          28s
       3219399480 root_img.gz  :   64s           62s          49s
       2251594592 root_img.xz  :  262s          260s         183s
       2226366598 root_img.lzma:  386s          376s         277s
       2901482513 root_img.bz2 :  635s          599s
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Rashika Kheria <rashika.kheria@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kyungsik Lee <kyungsik.lee@lge.com>
      Cc: P J P <ppandit@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: "Daniel M. Weeks" <dan@danweeks.net>
      Cc: Alexandre Courbot <acourbot@nvidia.com>
      Cc: Jan Beulich <JBeulich@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d97b07c5
    • Y
      initramfs: support initrd that is bigger than 2GiB · 38747439
      Yinghai Lu 提交于
      When initrd (compressed or not) is used, kernel report data corrupted with
      /dev/ram0.
      
      The root cause:
      During initramfs checking, if it is initrd, it will be transferred to
      /initrd.image with sys_write.
      sys_write only support 2G-4K write, so if the initrd ram is more than
      that, /initrd.image will not complete at all.
      
      Add local xwrite to loop calling sys_write to workaround the problem.
      
      Also need to use xwrite in write_buffer() to handle:
      image is uncompressed cpio and there is one big file (>2G) in it.
         unpack_to_rootfs ===> write_buffer ===> actions[]/do_copy
      
      At the same time, we don't need to worry about sys_read/sys_write in
      do_mounts_rd.c::crd_load.  As decompressor will have fill/flush and local
      buffer that is smaller than 2G.
      
      Test with uncompressed initrd, and compressed ones with gz, bz2, lzma,xz,
      lzop.
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      Acked-by: NH. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: "Daniel M. Weeks" <dan@danweeks.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      38747439