1. 15 Aug 2014 (1 commit)
  2. 09 Aug 2014 (6 commits)
    • kernel: build bin2c based on config option CONFIG_BUILD_BIN2C · de5b56ba
      Committed by Vivek Goyal
      Currently bin2c is built only if CONFIG_IKCONFIG=y, but bin2c will now be
      used by kexec too.  So make its compilation depend on CONFIG_BUILD_BIN2C,
      a config option that can be selected by CONFIG_KEXEC and CONFIG_IKCONFIG.
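      For context, bin2c is a small host tool that turns a binary blob on stdin
      into a C array the kernel can link in.  A minimal sketch of what such
      generated output looks like (symbol names and bytes here are illustrative,
      not what the build actually emits):

	#include <stddef.h>

	/* illustrative bin2c-style output: the blob becomes an array plus a size */
	const char blob_data[] =
		"\x1f\x8b\x08\x00"	/* first bytes of an embedded gzipped blob */
		"\x00\x00\x00\x00"
		;
	const size_t blob_data_size = 8;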
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Greg Kroah-Hartman <greg@kroah.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: WANG Chao <chaowang@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      de5b56ba
    • init/main.c: code clean-up · dd4d9fec
      Committed by Fabian Frederick
      Fix some checkpatch warnings (remove global initialization, move
      __initdata, coalesce formats, ...).
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dd4d9fec
    • initramfs: add write error checks · 9687fd91
      Committed by David Engraf
      On a system with low memory, extracting the initramfs may fail.  If this
      happens, the user gets "Failed to execute /init" instead of an initramfs
      error.
      
      Check the return value of sys_write and call error() when the write is
      incomplete or fails.
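      A minimal sketch of the kind of check this adds (fd, buf and count are
      placeholders for the extraction state, and the real call site differs):

	/* treat a short or failed write of extracted data as fatal */
	ssize_t written = sys_write(fd, buf, count);
	if (written != count)
		error("write error");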
      Signed-off-by: David Engraf <david.engraf@sysgo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9687fd91
    • initramfs: support initramfs that is bigger than 2GiB · d97b07c5
      Committed by Yinghai Lu
      With the 64-bit bzImage and kexec tools we now support ramdisks bigger
      than 2G, since they can be placed above 4G.
      
      A compressed initramfs image that large could not be decompressed
      properly.  It turns out the image length is an int during decompression
      detection, so it becomes < 0 when the length exceeds 2G.  Furthermore,
      during decompression an int len is used for the inbuf count, which has
      the same problem.
      
      Change len to long; that should be fine, since on 32-bit platforms long
      is still 32 bits, so behaviour there is unchanged.
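      A minimal illustration of the signedness problem (types, names and the
      size are made up for the example, and the printout is only for clarity):

	#include <stdio.h>

	int main(void)
	{
		unsigned long image_size = 3UL << 30;	/* e.g. a 3 GiB initramfs */
		int  bad_len  = image_size;	/* typically wraps negative in a 32-bit int */
		long good_len = image_size;	/* 64-bit on a 64-bit kernel, stays intact */

		printf("as int: %d, as long: %ld\n", bad_len, good_len);
		return 0;
	}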
      
      Tested with the following compressed initramfs images as root with kexec:
      	gzip, bzip2, xz, lzma, lzop, lz4.
      run time for populate_rootfs():
         size        name       Nehalem-EX  Westmere-EX  Ivybridge-EX
       9034400256 root_img     :   26s           24s          30s
       3561095057 root_img.lz4 :   28s           27s          27s
       3459554629 root_img.lzo :   29s           29s          28s
       3219399480 root_img.gz  :   64s           62s          49s
       2251594592 root_img.xz  :  262s          260s         183s
       2226366598 root_img.lzma:  386s          376s         277s
       2901482513 root_img.bz2 :  635s          599s
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Rashika Kheria <rashika.kheria@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kyungsik Lee <kyungsik.lee@lge.com>
      Cc: P J P <ppandit@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: "Daniel M. Weeks" <dan@danweeks.net>
      Cc: Alexandre Courbot <acourbot@nvidia.com>
      Cc: Jan Beulich <JBeulich@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d97b07c5
    • initramfs: support initrd that is bigger than 2GiB · 38747439
      Committed by Yinghai Lu
      When a large initrd (compressed or not) is used, the kernel reports data
      corruption on /dev/ram0.
      
      The root cause:
      During initramfs checking, if the image is an initrd, it is transferred
      to /initrd.image with sys_write.
      sys_write only supports writes of up to 2G-4K, so if the initrd is larger
      than that, /initrd.image is never written completely.
      
      Add a local xwrite() that loops over sys_write to work around the problem.
      
      xwrite() is also needed in write_buffer() to handle the case where the
      image is an uncompressed cpio containing one big file (>2G):
         unpack_to_rootfs ===> write_buffer ===> actions[]/do_copy
      
      At the same time, we do not need to worry about sys_read/sys_write in
      do_mounts_rd.c::crd_load, as the decompressor uses fill/flush callbacks
      and a local buffer smaller than 2G.
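      A minimal sketch of such a chunked-write helper (kernel-style C, with the
      syscall plumbing and annotations simplified):

	/* loop over sys_write so one logical write can exceed the 2G-4K limit */
	static ssize_t xwrite(int fd, const char *p, size_t count)
	{
		ssize_t out = 0;

		while (count) {
			ssize_t rv = sys_write(fd, p, count);

			if (rv < 0) {
				if (rv == -EINTR || rv == -EAGAIN)
					continue;
				return out ? out : rv;	/* partial progress or the error */
			} else if (rv == 0) {
				break;			/* nothing more can be written */
			}

			p += rv;
			out += rv;
			count -= rv;
		}

		return out;
	}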
      
      Tested with an uncompressed initrd and with initrds compressed with gz,
      bz2, lzma, xz and lzop.
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Acked-by: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: "Daniel M. Weeks" <dan@danweeks.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      38747439
    • init: make rootdelay=N consistent with rootwait behaviour · 4dfe694f
      Committed by Paul Gortmaker
      Currently rootdelay=N and rootwait behave differently (aside from the
      obvious unbounded wait duration) because they are at different places in
      the init sequence.
      
      The difference manifests itself for md devices because the call to
      md_run_setup() lives between rootdelay and rootwait, so if you use
      rootdelay=20 to allow a slow RAID0 array to assemble, you get
      this:
      
      [    4.526011] sd 6:0:0:0: [sdc] Attached SCSI removable disk
      [   22.972079] md: Waiting for all devices to be available before autodetect
      
      i.e.  you've achieved nothing other than delaying the probing 20s, when
      what you wanted was a 20s delay _after_ the probing for md devices was
      initiated.
      
      Here we move the rootdelay code to be right beside the rootwait code, so
      that their behaviour is consistent.
      
      It should be noted that in doing so, the actions based on
      saved_root_name[0] and initrd_load(), which were previously put on hold
      by rootdelay=N, will now not be delayed.  However, consistent behaviour
      is more important than matching the historical behaviour of delaying
      those two operations.
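      A rough sketch of the resulting placement in prepare_namespace() (heavily
      simplified; wait_for_root_device() is a placeholder for the real rootwait
      polling loop, not an actual kernel helper):

	wait_for_device_probe();
	md_run_setup();			/* md autodetect is kicked off first         */

	if (root_delay)			/* rootdelay=N: bounded wait, now placed...  */
		ssleep(root_delay);

	if (root_wait)			/* ...right beside rootwait's unbounded wait */
		wait_for_root_device();	/* placeholder for the polling loop          */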
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4dfe694f
  3. 07 Aug 2014 (1 commit)
    • printk: allow increasing the ring buffer depending on the number of CPUs · 23b2899f
      Committed by Luis R. Rodriguez
      The default size of the ring buffer is too small for machines with a
      large number of CPUs under heavy load.  What ends up happening when
      debugging is that the ring buffer wraps and chews up old messages, making
      debugging impossible unless the size is passed as a kernel parameter.
      An idle system will on average spew out only about one or two extra lines
      at boot, but where this really matters is under heavy load, and that
      will vary widely depending on the system and environment.
      
      There are mechanisms to help increase the kernel ring buffer for tracing
      through debugfs, and those interfaces even allow growing the kernel ring
      buffer per CPU.  We also have a static value which can be passed at
      boot.  Relying on debugfs, however, is not ideal for production, and the
      value passed at boot can only be used *after* an issue has crept up.
      Instead of being reactive, this adds a proactive measure which lets you
      scale the contribution you'd expect each CPU to make to the kernel ring
      buffer under load in the worst-case scenario.
      
      We use num_possible_cpus() to avoid the complexities that dynamically
      changing the ring buffer size at run time would introduce;
      num_possible_cpus() gives us the upper limit on the possible number of
      CPUs, so we avoid having to deal with hotplugging CPUs on and off.
      This introduces the kernel configuration option LOG_CPU_MAX_BUF_SHIFT,
      which specifies the maximum per-CPU contribution to the kernel ring
      buffer in the worst case before the ring buffer wraps; the size is
      specified as a power of 2.  The total contribution made by the extra
      CPUs must be greater than half of the default kernel ring buffer size
      (1 << LOG_BUF_SHIFT bytes) in order to trigger an increase at boot.
      The kernel ring buffer is then increased to the next power of two that
      fits the required minimum kernel ring buffer size plus the additional
      CPU contribution.  For example, if LOG_BUF_SHIFT is 18 (256 KB), you'd
      need at least 128 KB of contributions by other CPUs in order to trigger
      an increase of the kernel ring buffer.  With a LOG_CPU_MAX_BUF_SHIFT of
      12 (4 KB) that means more than 64 possible CPUs are needed to trigger
      an increase.  If you had 128 possible CPUs, the minimum required kernel
      ring buffer size becomes:
      
         ((1 << 18) + ((128 - 1) * (1 << 12))) / 1024 = 764 KB
      
      Since we require the ring buffer to be a power of two, the new required
      size would be 1024 KB.
      
      These CPU contributions are ignored when the "log_buf_len" kernel
      parameter is used, as it forces the exact size of the ring buffer to an
      expected power-of-two value.
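      A minimal sketch of the sizing logic described above (constant spellings
      are simplified, and update_log_buf_len() stands in for whatever helper
      rounds the request up to a power of two and records it):

	/* add a worst-case per-CPU contribution on top of the static default,
	 * but only when it amounts to more than half of that default */
	static void __init log_buf_add_cpu(void)
	{
		unsigned int cpu_extra;

		if (num_possible_cpus() <= 1)
			return;

		cpu_extra = (num_possible_cpus() - 1) * (1 << CONFIG_LOG_CPU_MAX_BUF_SHIFT);

		if (cpu_extra <= (1 << CONFIG_LOG_BUF_SHIFT) / 2)
			return;

		/* e.g. 128 CPUs, LOG_BUF_SHIFT=18, LOG_CPU_MAX_BUF_SHIFT=12:
		 * (1 << 18) + 127 * (1 << 12) = 764 KB, rounded up to 1024 KB */
		update_log_buf_len(cpu_extra + (1 << CONFIG_LOG_BUF_SHIFT));
	}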
      
      [pmladek@suse.cz: fix build]
      Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
      Signed-off-by: Petr Mladek <pmladek@suse.cz>
      Tested-by: Davidlohr Bueso <davidlohr@hp.com>
      Tested-by: Petr Mladek <pmladek@suse.cz>
      Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Andrew Lunn <andrew@lunn.ch>
      Cc: Stephen Warren <swarren@wwwdotorg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Joe Perches <joe@perches.com>
      Cc: Arun KS <arunks.linux@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      23b2899f
  4. 10 Jul 2014 (1 commit)
  5. 08 Jul 2014 (1 commit)
    • rcu: Don't offload callbacks unless specifically requested · b58cc46c
      Committed by Paul E. McKenney
      Enabling NO_HZ_FULL currently has the side effect of enabling callback
      offloading on all CPUs.  This results in lots of additional rcuo kthreads,
      and can also increase context switching and wakeups, even in cases where
      callback offloading is neither needed nor particularly desirable.  This
      commit therefore enables callback offloading on a given CPU only if
      specifically requested at build time or boot time, or if that CPU has
      been specifically designated (again, either at build time or boot time)
      as a nohz_full CPU.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      b58cc46c
  6. 17 Jun 2014 (1 commit)
  7. 05 Jun 2014 (10 commits)
  8. 06 May 2014 (1 commit)
  9. 05 May 2014 (1 commit)
  10. 01 May 2014 (1 commit)
    • x86-64, espfix: Don't leak bits 31:16 of %esp returning to 16-bit stack · 3891a04a
      Committed by H. Peter Anvin
      The IRET instruction, when returning to a 16-bit segment, only
      restores the bottom 16 bits of the user space stack pointer.  This
      causes some 16-bit software to break, but it also leaks kernel state
      to user space.  We have a software workaround for that ("espfix") for
      the 32-bit kernel, but it relies on a nonzero stack segment base which
      is not available in 64-bit mode.
      
      In checkin:
      
          b3b42ac2 x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
      
      we "solved" this by forbidding 16-bit segments on 64-bit kernels, with
      the logic that 16-bit support is crippled on 64-bit kernels anyway (no
      V86 support), but it turns out that people are doing stuff like
      running old Win16 binaries under Wine and expect it to work.
      
      This patch works around it by creating percpu "ministacks", each of which
      is mapped 2^16 times 64K apart.  When we detect that the return SS is
      on the LDT, we copy the IRET frame to the ministack and use the
      relevant alias to return to userspace.  The ministacks are mapped
      readonly, so if IRET faults we promote #GP to #DF which is an IST
      vector and thus has its own stack; we then do the fixup in the #DF
      handler.
      
      (Making #GP an IST exception would make the msr_safe functions unsafe
      in NMI/MC context, and quite possibly have other effects.)
      
      Special thanks to:
      
      - Andy Lutomirski, for the suggestion of using very small stack slots
        and copy (as opposed to map) the IRET frame there, and for the
        suggestion to mark them readonly and let the fault promote to #DF.
      - Konrad Wilk for paravirt fixup and testing.
      - Borislav Petkov for testing help and useful comments.
      Reported-by: Brian Gerst <brgerst@gmail.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      Link: http://lkml.kernel.org/r/1398816946-3351-1-git-send-email-hpa@linux.intel.com
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andrew Lutomirski <amluto@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Dirk Hohndel <dirk@hohndel.org>
      Cc: Arjan van de Ven <arjan.van.de.ven@intel.com>
      Cc: comex <comexk@gmail.com>
      Cc: Alexander van Heukelum <heukelum@fastmail.fm>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: <stable@vger.kernel.org> # consider after upstream merge
      3891a04a
  11. 28 Apr 2014 (1 commit)
    • param: hand arguments after -- straight to init · 51e158c1
      Committed by Rusty Russell
      The kernel passes any args it doesn't need through to init, except it
      assumes anything containing '.' belongs to the kernel (for a module).
      This change means all users can clearly distinguish which arguments
      are for init.
      
      For example, the kernel uses debug ("dee-bug") to mean log everything to
      the console, where systemd uses the debug from the Scandinavian "day-boog"
      meaning "fail to boot".  If a future versions uses argv[] instead of
      reading /proc/cmdline, this confusion will be avoided.
      
      eg: test 'FOO="this is --foo"' -- 'systemd.debug="true true true"'
      
      Gives:
      argv[0] = '/debug-init'
      argv[1] = 'test'
      argv[2] = 'systemd.debug=true true true'
      envp[0] = 'HOME=/'
      envp[1] = 'TERM=linux'
      envp[2] = 'FOO=this is --foo'
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      51e158c1
  12. 19 Apr 2014 (1 commit)
  13. 08 Apr 2014 (2 commits)
  14. 04 Apr 2014 (3 commits)
  15. 20 Mar 2014 (2 commits)
  16. 13 Mar 2014 (1 commit)
  17. 03 Mar 2014 (1 commit)
  18. 12 Feb 2014 (1 commit)
    • cgroup: convert to kernfs · 2bd59d48
      Committed by Tejun Heo
      cgroup filesystem code was derived from the original sysfs
      implementation which was heavily intertwined with vfs objects and
      locking with the goal of re-using the existing vfs infrastructure.
      That experiment turned out rather disastrously, and sysfs switched, a
      long time ago, to a distributed filesystem model where a separate
      representation is maintained which is queried by vfs.  Unfortunately,
      cgroup stuck with the failed experiment all these years and
      accumulated even more problems over time.
      
      Locking and object lifetime management being entangled with vfs is
      probably the most egregious.  vfs was never designed to be misused like
      this, and cgroup ends up jumping through various convoluted dances to
      make things work.  Even then, operations across multiple cgroups can't
      be done safely as it'll deadlock with rename locking.
      
      Recently, kernfs is separated out from sysfs so that it can be used by
      users other than sysfs.  This patch converts cgroup to use kernfs,
      which will bring the following benefits.
      
      * Separation from vfs internals.  Locking and object lifetime
        management is contained in cgroup proper making things a lot
        simpler.  This removes significant amount of locking convolutions,
        hairy object lifetime rules and the restriction on multi-cgroup
        operations.
      
      * Can drop a lot of code to implement filesystem interface as most are
        provided by kernfs.
      
      * Proper "severing" semantics, which allows controllers to not worry
        about lingering file accesses after offline.
      
      While the preceding patches did as much as possible to make the
      transition less painful, a large part of the conversion has to be one
      discrete step, making this patch rather large.  The rest of the commit
      message lists notable changes in different areas.
      
      Overall
      -------
      
      * vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
        cgroupfs_root->sb w/ ->kf_root.
      
      * All dentry accessors are removed.  Helpers to map from kernfs
        constructs are added.
      
      * All vfs plumbing around dentry, inode and bdi removed.
      
      * cgroup_mount() now directly looks for matching root and then
        proceeds to create a new one if not found.
      
      Synchronization and object lifetime
      -----------------------------------
      
      * vfs inode locking removed.  Among other things, this removes the
        need for the convolution in cgroup_cfts_commit().  Future patches
        will further simplify it.
      
      * vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
        cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
        root->refcnt and when it reaches zero proceeds to destroy it thus
        merging cgroup_put_root() and the former cgroup_kill_sb().
        Similarly, cgroup_put() now directly schedules cgroup_free_rcu()
        when refcnt reaches zero.
      
      * Unlike before, kernfs objects don't hold onto cgroup objects.  When
        cgroup destroys a kernfs node, all existing operations are drained
        and the association is broken immediately.  The same for
        cgroupfs_roots and mounts.
      
      * All operations which come through kernfs guarantee that the
        associated cgroup is and stays valid for the duration of operation;
        however, there are two paths which need to find out the associated
        cgroup from dentry without going through kernfs -
        css_tryget_from_dir() and cgroupstats_build().  For these two,
        kernfs_node->priv is RCU managed so that they can dereference it
        under RCU read lock.
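      A hedged sketch of that lookup (variable handling and the sparse-appeasing
      casts around kn->priv are omitted; the real code is in
      css_tryget_from_dir() and cgroupstats_build()):

	/* kn is the kernfs_node backing the cgroup directory dentry */
	struct cgroup *cgrp;

	rcu_read_lock();
	cgrp = rcu_dereference(kn->priv);	/* RCU-managed back-pointer to the cgroup */
	if (cgrp) {
		/* take whatever reference is needed while still under rcu_read_lock() */
	}
	rcu_read_unlock();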
      
      File and directory handling
      ---------------------------
      
      * File and directory operations converted to kernfs_ops and
        kernfs_syscall_ops.
      
      * xattrs is implicitly supported by kernfs.  No need to worry about it
        from cgroup.  This means that "xattr" mount option is no longer
        necessary.  A future patch will add a deprecation warning message
        when sane_behavior is in use.
      
      * When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
        private copy of one of the kernfs_ops to set its atomic_write_len.
        cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
        to handle it.
      
      * cftype->lockdep_key added so that kernfs lockdep annotation can be
        per cftype.
      
      * Individual file entries and open states are now managed by kernfs.
        No need to worry about them from cgroup.  cfent, cgroup_open_file
        and their friends are removed.
      
      * kernfs_nodes are created deactivated and kernfs_activate()
        invocations added to the places where the creation of new nodes is
        committed.
      
      * cgroup_rmdir() uses kernfs_[un]break_active_protection() for
        self-removal.
      
      v2: - Li pointed out in an earlier patch that specifying "name="
            during mount without subsystem specification should succeed if
            there's an existing hierarchy with a matching name although it
            should fail with -EINVAL if a new hierarchy should be created.
            Prior to the conversion, this used to be handled by deferring
            failure from a NULL return from cgroup_root_from_opts(), which was
            necessary because root was being created before checking for
            existing ones.  Note that cgroup_root_from_opts() returned an
            ERR_PTR() value for error conditions which require immediate
            mount failure.
      
            As we now have separate search and creation steps, deferring
            failure from cgroup_root_from_opts() is no longer necessary.
            cgroup_root_from_opts() is updated to always return ERR_PTR()
            value on failure.
      
          - The logic to match existing roots is updated so that a mount
            attempt with a matching name but a different subsys_mask is
            rejected.  This was handled by a separate matching loop under
            the comment "Check for name clashes with existing mounts" but
            got lost during conversion.  Merge the check into the main
            search loop.
      
          - Add __rcu __force casting in RCU_INIT_POINTER() in
            cgroup_destroy_locked() to avoid the sparse address space
            warning reported by kbuild test bot.  Maybe we want an explicit
            interface to use kn->priv as RCU protected pointer?
      
      v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
      
      v4: Rebased on top of 0ab02ca8 ("cgroup: protect modifications to
          cgroup_idr with cgroup_mutex").
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: kbuild test robot <fengguang.wu@intel.com>
      2bd59d48
  19. 06 Feb 2014 (1 commit)
    • execve: use 'struct filename *' for executable name passing · c4ad8f98
      Committed by Linus Torvalds
      This changes 'do_execve()' to get the executable name as a 'struct
      filename', and to free it when it is done.  This is what the normal
      users want, and it simplifies and streamlines their error handling.
      
      The controlled lifetime of the executable name also fixes a
      use-after-free problem with the trace_sched_process_exec tracepoint: the
      lifetime of the passed-in string for kernel users was not at all
      obvious, and the user-mode helper code used UMH_WAIT_EXEC to serialize
      the pathname allocation lifetime with the execve() having finished,
      which in turn meant that the trace point that happened after
      mm_release() of the old process VM ended up using already free'd memory.
      
      To solve the kernel string lifetime issue, this simply introduces
      "getname_kernel()" that works like the normal user-space getname()
      function, except with the source coming from kernel memory.
      
      As Oleg points out, this also means that we could drop the tcomm[] array
      from 'struct linux_binprm', since the pathname lifetime now covers
      setup_new_exec().  That would be a separate cleanup.
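      A minimal sketch of how an in-kernel caller now hands the name over
      (error handling abbreviated; 'path', 'argv' and 'envp' are placeholders
      rather than a specific call site):

	struct filename *name;
	int ret;

	name = getname_kernel(path);		/* wrap a kernel-space string */
	if (IS_ERR(name))
		return PTR_ERR(name);

	ret = do_execve(name, argv, envp);	/* do_execve() frees the name when done */
	return ret;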
      Reported-by: Igor Zhbanov <i.zhbanov@samsung.com>
      Tested-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c4ad8f98
  20. 01 Feb 2014 (1 commit)
  21. 28 Jan 2014 (1 commit)
  22. 25 Jan 2014 (1 commit)