1. 13 8月, 2014 1 次提交
  2. 09 8月, 2014 28 次提交
    • V
      kexec: verify the signature of signed PE bzImage · 8e7d8381
      Vivek Goyal 提交于
      This is the final piece of the puzzle of verifying kernel image signature
      during kexec_file_load() syscall.
      
      This patch calls into PE file routines to verify signature of bzImage.  If
      signature are valid, kexec_file_load() succeeds otherwise it fails.
      
      Two new config options have been introduced.  First one is
      CONFIG_KEXEC_VERIFY_SIG.  This option enforces that kernel has to be
      validly signed otherwise kernel load will fail.  If this option is not
      set, no signature verification will be done.  Only exception will be when
      secureboot is enabled.  In that case signature verification should be
      automatically enforced when secureboot is enabled.  But that will happen
      when secureboot patches are merged.
      
      Second config option is CONFIG_KEXEC_BZIMAGE_VERIFY_SIG.  This option
      enables signature verification support on bzImage.  If this option is not
      set and previous one is set, kernel image loading will fail because kernel
      does not have support to verify signature of bzImage.
      
      I tested these patches with both "pesign" and "sbsign" signed bzImages.
      
      I used signing_key.priv key and signing_key.x509 cert for signing as
      generated during kernel build process (if module signing is enabled).
      
      Used following method to sign bzImage.
      
      pesign
      ======
      - Convert DER format cert to PEM format cert
      openssl x509 -in signing_key.x509 -inform DER -out signing_key.x509.PEM -outform
      PEM
      
      - Generate a .p12 file from existing cert and private key file
      openssl pkcs12 -export -out kernel-key.p12 -inkey signing_key.priv -in
      signing_key.x509.PEM
      
      - Import .p12 file into pesign db
      pk12util -i /tmp/kernel-key.p12 -d /etc/pki/pesign
      
      - Sign bzImage
      pesign -i /boot/vmlinuz-3.16.0-rc3+ -o /boot/vmlinuz-3.16.0-rc3+.signed.pesign
      -c "Glacier signing key - Magrathea" -s
      
      sbsign
      ======
      sbsign --key signing_key.priv --cert signing_key.x509.PEM --output
      /boot/vmlinuz-3.16.0-rc3+.signed.sbsign /boot/vmlinuz-3.16.0-rc3+
      
      Patch details:
      
      Well all the hard work is done in previous patches.  Now bzImage loader
      has just call into that code and verify whether bzImage signature are
      valid or not.
      
      Also create two config options.  First one is CONFIG_KEXEC_VERIFY_SIG.
      This option enforces that kernel has to be validly signed otherwise kernel
      load will fail.  If this option is not set, no signature verification will
      be done.  Only exception will be when secureboot is enabled.  In that case
      signature verification should be automatically enforced when secureboot is
      enabled.  But that will happen when secureboot patches are merged.
      
      Second config option is CONFIG_KEXEC_BZIMAGE_VERIFY_SIG.  This option
      enables signature verification support on bzImage.  If this option is not
      set and previous one is set, kernel image loading will fail because kernel
      does not have support to verify signature of bzImage.
      Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Greg Kroah-Hartman <greg@kroah.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: WANG Chao <chaowang@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Matt Fleming <matt@console-pimps.org>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8e7d8381
    • V
      kexec: support for kexec on panic using new system call · dd5f7260
      Vivek Goyal 提交于
      This patch adds support for loading a kexec on panic (kdump) kernel usning
      new system call.
      
      It prepares ELF headers for memory areas to be dumped and for saved cpu
      registers.  Also prepares the memory map for second kernel and limits its
      boot to reserved areas only.
      Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Greg Kroah-Hartman <greg@kroah.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: WANG Chao <chaowang@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dd5f7260
    • V
      kexec-bzImage64: support for loading bzImage using 64bit entry · 27f48d3e
      Vivek Goyal 提交于
      This is loader specific code which can load bzImage and set it up for
      64bit entry.  This does not take care of 32bit entry or real mode entry.
      
      32bit mode entry can be implemented if somebody needs it.
      Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Greg Kroah-Hartman <greg@kroah.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: WANG Chao <chaowang@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      27f48d3e
    • V
      kexec: load and relocate purgatory at kernel load time · 12db5562
      Vivek Goyal 提交于
      Load purgatory code in RAM and relocate it based on the location.
      Relocation code has been inspired by module relocation code and purgatory
      relocation code in kexec-tools.
      
      Also compute the checksums of loaded kexec segments and store them in
      purgatory.
      
      Arch independent code provides this functionality so that arch dependent
      bootloaders can make use of it.
      
      Helper functions are provided to get/set symbol values in purgatory which
      are used by bootloaders later to set things like stack and entry point of
      second kernel etc.
      Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Greg Kroah-Hartman <greg@kroah.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: WANG Chao <chaowang@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      12db5562
    • V
      kexec: implementation of new syscall kexec_file_load · cb105258
      Vivek Goyal 提交于
      Previous patch provided the interface definition and this patch prvides
      implementation of new syscall.
      
      Previously segment list was prepared in user space.  Now user space just
      passes kernel fd, initrd fd and command line and kernel will create a
      segment list internally.
      
      This patch contains generic part of the code.  Actual segment preparation
      and loading is done by arch and image specific loader.  Which comes in
      next patch.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Greg Kroah-Hartman <greg@kroah.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: WANG Chao <chaowang@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cb105258
    • V
      kexec: new syscall kexec_file_load() declaration · f0895685
      Vivek Goyal 提交于
      This is the new syscall kexec_file_load() declaration/interface.  I have
      reserved the syscall number only for x86_64 so far.  Other architectures
      (including i386) can reserve syscall number when they enable the support
      for this new syscall.
      Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Greg Kroah-Hartman <greg@kroah.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: WANG Chao <chaowang@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f0895685
    • V
      resource: provide new functions to walk through resources · 8c86e70a
      Vivek Goyal 提交于
      I have added two more functions to walk through resources.
      
      Currently walk_system_ram_range() deals with pfn and /proc/iomem can
      contain partial pages.  By dealing in pfn, callback function loses the
      info that last page of a memory range is a partial page and not the full
      page.  So I implemented walk_system_ram_res() which returns u64 values to
      callback functions and now it properly return start and end address.
      
      walk_system_ram_range() uses find_next_system_ram() to find the next ram
      resource.  This in turn only travels through siblings of top level child
      and does not travers through all the nodes of the resoruce tree.  I also
      need another function where I can walk through all the resources, for
      example figure out where "GART" aperture is.  Figure out where ACPI memory
      is.
      
      So I wrote another function walk_iomem_res() which walks through all
      /proc/iomem resources and returns matches as asked by caller.  Caller can
      specify "name" of resource, start and end and flags.
      
      Got rid of find_next_system_ram_res() and instead implemented more generic
      find_next_iomem_res() which can be used to traverse top level children
      only based on an argument.
      Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Greg Kroah-Hartman <greg@kroah.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: WANG Chao <chaowang@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8c86e70a
    • V
      kexec: use common function for kimage_normal_alloc() and kimage_crash_alloc() · 255aedd9
      Vivek Goyal 提交于
      kimage_normal_alloc() and kimage_crash_alloc() are doing lot of similar
      things and differ only little.  So instead of having two separate
      functions create a common function kimage_alloc_init() and pass it the
      "flags" argument which tells whether it is normal kexec or kexec_on_panic.
       And this function should be able to deal with both the cases.
      
      This consolidation also helps later where we can use a common function
      kimage_file_alloc_init() to handle normal and crash cases for new file
      based kexec syscall.
      Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Greg Kroah-Hartman <greg@kroah.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: WANG Chao <chaowang@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      255aedd9
    • V
      kexec: move segment verification code in a separate function · dabe7862
      Vivek Goyal 提交于
      Previously do_kimage_alloc() will allocate a kimage structure, copy
      segment list from user space and then do the segment list sanity
      verification.
      
      Break down this function in 3 parts.  do_kimage_alloc_init() to do actual
      allocation and basic initialization of kimage structure.
      copy_user_segment_list() to copy segment list from user space and
      sanity_check_segment_list() to verify the sanity of segment list as passed
      by user space.
      
      In later patches, I need to only allocate kimage and not copy segment list
      from user space.  So breaking down in smaller functions enables re-use of
      code at other places.
      Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Greg Kroah-Hartman <greg@kroah.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: WANG Chao <chaowang@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dabe7862
    • V
      kexec: rename unusebale_pages to unusable_pages · 7d3e2bca
      Vivek Goyal 提交于
      Let's use the more common "unusable".
      
      This patch was originally written and posted by Boris. I am including it
      in this patch series.
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Greg Kroah-Hartman <greg@kroah.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: WANG Chao <chaowang@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7d3e2bca
    • V
      bin2c: move bin2c in scripts/basic · 8370edea
      Vivek Goyal 提交于
      This patch series does not do kernel signature verification yet.  I plan
      to post another patch series for that.  Now distributions are already
      signing PE/COFF bzImage with PKCS7 signature I plan to parse and verify
      those signatures.
      
      Primary goal of this patchset is to prepare groundwork so that kernel
      image can be signed and signatures be verified during kexec load.  This
      should help with two things.
      
      - It should allow kexec/kdump on secureboot enabled machines.
      
      - In general it can help even without secureboot. By being able to verify
        kernel image signature in kexec, it should help with avoiding module
        signing restrictions. Matthew Garret showed how to boot into a custom
        kernel, modify first kernel's memory and then jump back to old kernel and
        bypass any policy one wants to.
      
      This patch (of 15):
      
      Kexec wants to use bin2c and it wants to use it really early in the build
      process. See arch/x86/purgatory/ code in later patches.
      
      So move bin2c in scripts/basic so that it can be built very early and
      be usable by arch/x86/purgatory/
      Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Greg Kroah-Hartman <greg@kroah.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: WANG Chao <chaowang@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8370edea
    • D
      shm: add memfd_create() syscall · 9183df25
      David Herrmann 提交于
      memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
      that you can pass to mmap().  It can support sealing and avoids any
      connection to user-visible mount-points.  Thus, it's not subject to quotas
      on mounted file-systems, but can be used like malloc()'ed memory, but with
      a file-descriptor to it.
      
      memfd_create() returns the raw shmem file, so calls like ftruncate() can
      be used to modify the underlying inode.  Also calls like fstat() will
      return proper information and mark the file as regular file.  If you want
      sealing, you can specify MFD_ALLOW_SEALING.  Otherwise, sealing is not
      supported (like on all other regular files).
      
      Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
      subject to a filesystem size limit.  It is still properly accounted to
      memcg limits, though, and to the same overcommit or no-overcommit
      accounting as all user memory.
      Signed-off-by: NDavid Herrmann <dh.herrmann@gmail.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ryan Lortie <desrt@desrt.ca>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Daniel Mack <zonque@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9183df25
    • D
      mm: allow drivers to prevent new writable mappings · 4bb5f5d9
      David Herrmann 提交于
      This patch (of 6):
      
      The i_mmap_writable field counts existing writable mappings of an
      address_space.  To allow drivers to prevent new writable mappings, make
      this counter signed and prevent new writable mappings if it is negative.
      This is modelled after i_writecount and DENYWRITE.
      
      This will be required by the shmem-sealing infrastructure to prevent any
      new writable mappings after the WRITE seal has been set.  In case there
      exists a writable mapping, this operation will fail with EBUSY.
      
      Note that we rely on the fact that iff you already own a writable mapping,
      you can increase the counter without using the helpers.  This is the same
      that we do for i_writecount.
      Signed-off-by: NDavid Herrmann <dh.herrmann@gmail.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ryan Lortie <desrt@desrt.ca>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Daniel Mack <zonque@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4bb5f5d9
    • I
    • J
      shm: make exit_shm work proportional to task activity · ab602f79
      Jack Miller 提交于
      This is small set of patches our team has had kicking around for a few
      versions internally that fixes tasks getting hung on shm_exit when there
      are many threads hammering it at once.
      
      Anton wrote a simple test to cause the issue:
      
        http://ozlabs.org/~anton/junkcode/bust_shm_exit.c
      
      Before applying this patchset, this test code will cause either hanging
      tracebacks or pthread out of memory errors.
      
      After this patchset, it will still produce output like:
      
        root@somehost:~# ./bust_shm_exit 1024 160
        ...
        INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by 116, t=2111 jiffies, g=241, c=240, q=7113)
        INFO: Stall ended before state dump start
        ...
      
      But the task will continue to run along happily, so we consider this an
      improvement over hanging, even if it's a bit noisy.
      
      This patch (of 3):
      
      exit_shm obtains the ipc_ns shm rwsem for write and holds it while it
      walks every shared memory segment in the namespace.  Thus the amount of
      work is related to the number of shm segments in the namespace not the
      number of segments that might need to be cleaned.
      
      In addition, this occurs after the task has been notified the thread has
      exited, so the number of tasks waiting for the ns shm rwsem can grow
      without bound until memory is exausted.
      
      Add a list to the task struct of all shmids allocated by this task.  Init
      the list head in copy_process.  Use the ns->rwsem for locking.  Add
      segments after id is added, remove before removing from id.
      
      On unshare of NEW_IPCNS orphan any ids as if the task had exited, similar
      to handling of semaphore undo.
      
      I chose a define for the init sequence since its a simple list init,
      otherwise it would require a function call to avoid include loops between
      the semaphore code and the task struct.  Converting the list_del to
      list_del_init for the unshare cases would remove the exit followed by
      init, but I left it blow up if not inited.
      Signed-off-by: NMilton Miller <miltonm@bga.com>
      Signed-off-by: NJack Miller <millerjo@us.ibm.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Anton Blanchard <anton@samba.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ab602f79
    • J
      panic: add TAINT_SOFTLOCKUP · 69361eef
      Josh Hunt 提交于
      This taint flag will be set if the system has ever entered a softlockup
      state.  Similar to TAINT_WARN it is useful to know whether or not the
      system has been in a softlockup state when debugging.
      
      [akpm@linux-foundation.org: apply the taint before calling panic()]
      Signed-off-by: NJosh Hunt <johunt@akamai.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      69361eef
    • F
      kernel/gcov/fs.c: remove unnecessary null test before debugfs_remove · 834b18b2
      Fabian Frederick 提交于
      This fixes checkpatch warning:
      
        WARNING: debugfs_remove(NULL) is safe this check is probably not required
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Cc: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      834b18b2
    • V
      kernel/fork.c: make mm_init_owner static · 33144e84
      Vladimir Davydov 提交于
      It's only used in fork.c:mm_init().
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      33144e84
    • V
      fork: copy mm's vm usage counters under mmap_sem · 4f7d4614
      Vladimir Davydov 提交于
      If a forking process has a thread calling (un)mmap (silly but still),
      the child process may have some of its mm's vm usage counters (total_vm
      and friends) screwed up, because currently they are copied from oldmm
      w/o holding any locks (memcpy in dup_mm).
      
      This patch moves the counters initialization to dup_mmap() to be called
      under oldmm->mmap_sem, which eliminates any possibility of race.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4f7d4614
    • V
      fork: reset mm->pinned_vm · ce65cefa
      Vladimir Davydov 提交于
      mm->pinned_vm counts pages of mm's address space that were permanently
      pinned in memory by increasing their reference counter. The counter was
      introduced by commit bc3e53f6 ("mm: distinguish between mlocked and
      pinned pages"), while before it locked_vm had been used for such pages.
      
      Obviously, we should reset the counter on fork if !CLONE_VM, just like
      we do with locked_vm, but currently we don't. Let's fix it.
      
      This patch will fix the contents of /proc/pid/status:VmPin.
      
      ib_umem_get[infiniband] and perf_mmap still check pinned_vm against
      RLIMIT_MEMLOCK.  It's left from the times when pinned pages were accounted
      under locked_vm, but today it looks wrong.  It isn't clear how we should
      deal with it.
      
      We still have some drivers accounting pinned pages under mm->locked_vm -
      this is what commit bc3e53f6 was fighting against.  It's
      infiniband/usnic and vfio.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Sean Hefty <sean.hefty@intel.com>
      Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ce65cefa
    • V
      fork/exec: cleanup mm initialization · 41f727fd
      Vladimir Davydov 提交于
      mm initialization on fork/exec is spread all over the place, which makes
      the code look inconsistent.
      
      We have mm_init(), which is supposed to init/nullify mm's internals, but
      it doesn't init all the fields it should:
      
       - on fork ->mmap,mm_rb,vmacache_seqnum,map_count,mm_cpumask,locked_vm
         are zeroed in dup_mmap();
      
       - on fork ->pmd_huge_pte is zeroed in dup_mm(), immediately before
         calling mm_init();
      
       - ->cpu_vm_mask_var ptr is initialized by mm_init_cpumask(), which is
         called before mm_init() on both fork and exec;
      
       - ->context is initialized by init_new_context(), which is called after
         mm_init() on both fork and exec;
      
      Let's consolidate all the initializations in mm_init() to make the code
      look cleaner.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      41f727fd
    • F
      proc: constify seq_operations · ccf94f1b
      Fabian Frederick 提交于
      proc_uid_seq_operations, proc_gid_seq_operations and
      proc_projid_seq_operations are only called in proc_id_map_open with
      seq_open as const struct seq_operations so we can constify the 3
      structures and update proc_id_map_open prototype.
      
         text    data     bss     dec     hex filename
         6817     404    1984    9205    23f5 kernel/user_namespace.o-before
         6913     308    1984    9205    23f5 kernel/user_namespace.o-after
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Cc: Joe Perches <joe@perches.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ccf94f1b
    • I
      kernel/exit.c: fix coding style warnings and errors · a0be55de
      Ionut Alexa 提交于
      Fixed coding style warnings and errors.
      Signed-off-by: NIonut Alexa <ionut.m.alexa@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a0be55de
    • F
      kernel/test_kprobes.c: use current logging functions · 4878b14b
      Fabian Frederick 提交于
      - Add pr_fmt
      - Coalesce formats
      - Use current pr_foo() functions instead of printk
      - Remove unnecessary "failed" display (already in log level).
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4878b14b
    • N
      kernel/kallsyms.c: fix %pB when there's no symbol at the address · b86280aa
      Namhyung Kim 提交于
      __sprint_symbol() should restore original address when kallsyms_lookup()
      failed to find a symbol.  It's reported when dumpstack shows an address in
      a dynamically allocated trampoline for ftrace.
      
        [ 1314.612287]  [<ffffffff81700312>] dump_stack+0x45/0x56
        [ 1314.612290]  [<ffffffff8125f5b0>] ? meminfo_proc_open+0x30/0x30
        [ 1314.612293]  [<ffffffffa080a494>] kpatch_ftrace_handler+0x14/0xf0 [kpatch]
        [ 1314.612306]  [<ffffffffa00160c4>] 0xffffffffa00160c3
      
      You can see a difference in the hex address - c4 and c3.  Fix it.
      Signed-off-by: NNamhyung Kim <namhyung@kernel.org>
      Reported-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b86280aa
    • V
      page-cgroup: get rid of NR_PCG_FLAGS · 9a3f4d85
      Vladimir Davydov 提交于
      It's not used anywhere today, so let's remove it.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9a3f4d85
    • J
      mm: memcontrol: use page lists for uncharge batching · 747db954
      Johannes Weiner 提交于
      Pages are now uncharged at release time, and all sources of batched
      uncharges operate on lists of pages.  Directly use those lists, and
      get rid of the per-task batching state.
      
      This also batches statistics accounting, in addition to the res
      counter charges, to reduce IRQ-disabling and re-enabling.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      747db954
    • J
      mm: memcontrol: rewrite charge API · 00501b53
      Johannes Weiner 提交于
      These patches rework memcg charge lifetime to integrate more naturally
      with the lifetime of user pages.  This drastically simplifies the code and
      reduces charging and uncharging overhead.  The most expensive part of
      charging and uncharging is the page_cgroup bit spinlock, which is removed
      entirely after this series.
      
      Here are the top-10 profile entries of a stress test that reads a 128G
      sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
       executing in the root memcg).  Before:
      
          15.36%              cat  [kernel.kallsyms]   [k] copy_user_generic_string
          13.31%              cat  [kernel.kallsyms]   [k] memset
          11.48%              cat  [kernel.kallsyms]   [k] do_mpage_readpage
           4.23%              cat  [kernel.kallsyms]   [k] get_page_from_freelist
           2.38%              cat  [kernel.kallsyms]   [k] put_page
           2.32%              cat  [kernel.kallsyms]   [k] __mem_cgroup_commit_charge
           2.18%          kswapd0  [kernel.kallsyms]   [k] __mem_cgroup_uncharge_common
           1.92%          kswapd0  [kernel.kallsyms]   [k] shrink_page_list
           1.86%              cat  [kernel.kallsyms]   [k] __radix_tree_lookup
           1.62%              cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn
      
      After:
      
          15.67%           cat  [kernel.kallsyms]   [k] copy_user_generic_string
          13.48%           cat  [kernel.kallsyms]   [k] memset
          11.42%           cat  [kernel.kallsyms]   [k] do_mpage_readpage
           3.98%           cat  [kernel.kallsyms]   [k] get_page_from_freelist
           2.46%           cat  [kernel.kallsyms]   [k] put_page
           2.13%       kswapd0  [kernel.kallsyms]   [k] shrink_page_list
           1.88%           cat  [kernel.kallsyms]   [k] __radix_tree_lookup
           1.67%           cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn
           1.39%       kswapd0  [kernel.kallsyms]   [k] free_pcppages_bulk
           1.30%           cat  [kernel.kallsyms]   [k] kfree
      
      As you can see, the memcg footprint has shrunk quite a bit.
      
         text    data     bss     dec     hex filename
        37970    9892     400   48262    bc86 mm/memcontrol.o.old
        35239    9892     400   45531    b1db mm/memcontrol.o
      
      This patch (of 4):
      
      The memcg charge API charges pages before they are rmapped - i.e.  have an
      actual "type" - and so every callsite needs its own set of charge and
      uncharge functions to know what type is being operated on.  Worse,
      uncharge has to happen from a context that is still type-specific, rather
      than at the end of the page's lifetime with exclusive access, and so
      requires a lot of synchronization.
      
      Rewrite the charge API to provide a generic set of try_charge(),
      commit_charge() and cancel_charge() transaction operations, much like
      what's currently done for swap-in:
      
        mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
        pages from the memcg if necessary.
      
        mem_cgroup_commit_charge() commits the page to the charge once it
        has a valid page->mapping and PageAnon() reliably tells the type.
      
        mem_cgroup_cancel_charge() aborts the transaction.
      
      This reduces the charge API and enables subsequent patches to
      drastically simplify uncharging.
      
      As pages need to be committed after rmap is established but before they
      are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
      additions again.  Revive lru_cache_add_active_or_unevictable().
      
      [hughd@google.com: fix shmem_unuse]
      [hughd@google.com: Add comments on the private use of -EAGAIN]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      00501b53
  3. 07 8月, 2014 11 次提交
    • N
      kernel/printk/printk.c: fix bool assignements · d25d9fec
      Neil Zhang 提交于
      Fix coccinelle warnings.
      Signed-off-by: NNeil Zhang <zhangwm@marvell.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d25d9fec
    • J
      printk: enable interrupts before calling console_trylock_for_printk() · 5874af20
      Jan Kara 提交于
      We need interrupts disabled when calling console_trylock_for_printk()
      only so that cpu id we pass to can_use_console() remains valid (for
      other things console_sem provides all the exclusion we need and
      deadlocks on console_sem due to interrupts are impossible because we use
      down_trylock()).  However if we are rescheduled, we are guaranteed to
      run on an online cpu so we can easily just get the cpu id in
      can_use_console().
      
      We can lose a bit of performance when we enable interrupts in
      vprintk_emit() and then disable them again in console_unlock() but OTOH
      it can somewhat reduce interrupt latency caused by console_unlock().
      
      We differ from (reverted) commit 939f04be in that we avoid calling
      console_unlock() from vprintk_emit() with lockdep enabled as that has
      unveiled quite some bugs leading to system freezes during boot (e.g.
        https://lkml.org/lkml/2014/5/30/242,
        https://lkml.org/lkml/2014/6/28/521).
      Signed-off-by: NJan Kara <jack@suse.cz>
      Tested-by: NAndreas Bombe <aeb@debian.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5874af20
    • A
      printk: miscellaneous cleanups · 249771b8
      Alex Elder 提交于
      Some small cleanups to kernel/printk/printk.c.  None of them should
      cause any change in behavior.
      
        - When CONFIG_PRINTK is defined, parenthesize the value of LOG_LINE_MAX.
        - When CONFIG_PRINTK is *not* defined, there is an extra LOG_LINE_MAX
          definition; delete it.
        - Pull an assignment out of a conditional expression in console_setup().
        - Use isdigit() in console_setup() rather than open coding it.
        - In update_console_cmdline(), drop a NUL-termination assignment;
          the strlcpy() call that precedes it guarantees it's not needed.
        - Simplify some logic in printk_timed_ratelimit().
      Signed-off-by: NAlex Elder <elder@linaro.org>
      Reviewed-by: NPetr Mladek <pmladek@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Jan Kara <jack@suse.cz>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      249771b8
    • A
      printk: use a clever macro · e99aa461
      Alex Elder 提交于
      Use the IS_ENABLED() macro rather than #ifdef blocks to set certain
      global values.
      Signed-off-by: NAlex Elder <elder@linaro.org>
      Acked-by: NBorislav Petkov <bp@suse.de>
      Reviewed-by: NPetr Mladek <pmladek@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e99aa461
    • A
      printk: fix some comments · 0b90fec3
      Alex Elder 提交于
      Fix a few comments that don't accurately describe their corresponding
      code.  It also fixes some minor typographical errors.
      Signed-off-by: NAlex Elder <elder@linaro.org>
      Reviewed-by: NPetr Mladek <pmladek@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Jan Kara <jack@suse.cz>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0b90fec3
    • A
      printk: rename DEFAULT_MESSAGE_LOGLEVEL · 42a9dc0b
      Alex Elder 提交于
      Commit a8fe19eb ("kernel/printk: use symbolic defines for console
      loglevels") makes consistent use of symbolic values for printk() log
      levels.
      
      The naming scheme used is different from the one used for
      DEFAULT_MESSAGE_LOGLEVEL though.  Change that symbol name to be
      MESSAGE_LOGLEVEL_DEFAULT for consistency.  And because the value of that
      symbol comes from a similarly-named config option, rename
      CONFIG_DEFAULT_MESSAGE_LOGLEVEL as well.
      Signed-off-by: NAlex Elder <elder@linaro.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Jan Kara <jack@suse.cz>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      42a9dc0b
    • A
      printk: tweak do_syslog() to match comments · e97e1267
      Alex Elder 提交于
      In do_syslog() there's a path used by kmsg_poll() and kmsg_read() that
      only needs to know whether there's any data available to read (and not
      its size).  These callers only check for non-zero return.  As a
      shortcut, do_syslog() returns the difference between what has been
      logged and what has been "seen."
      
      The comments say that the "count of records" should be returned but it's
      not.  Instead it returns (log_next_idx - syslog_idx), which is a
      difference between buffer offsets--and the result could be negative.
      
      The behavior is the same (it'll be zero or not in the same cases), but
      the count of records is more meaningful and it matches what the comments
      say.  So change the code to return that.
      Signed-off-by: NAlex Elder <elder@linaro.org>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Joe Perches <joe@perches.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e97e1267
    • L
      printk: allow increasing the ring buffer depending on the number of CPUs · 23b2899f
      Luis R. Rodriguez 提交于
      The default size of the ring buffer is too small for machines with a
      large amount of CPUs under heavy load.  What ends up happening when
      debugging is the ring buffer overlaps and chews up old messages making
      debugging impossible unless the size is passed as a kernel parameter.
      An idle system upon boot up will on average spew out only about one or
      two extra lines but where this really matters is on heavy load and that
      will vary widely depending on the system and environment.
      
      There are mechanisms to help increase the kernel ring buffer for tracing
      through debugfs, and those interfaces even allow growing the kernel ring
      buffer per CPU.  We also have a static value which can be passed upon
      boot.  Relying on debugfs however is not ideal for production, and
      relying on the value passed upon bootup is can only used *after* an
      issue has creeped up.  Instead of being reactive this adds a proactive
      measure which lets you scale the amount of contributions you'd expect to
      the kernel ring buffer under load by each CPU in the worst case
      scenario.
      
      We use num_possible_cpus() to avoid complexities which could be
      introduced by dynamically changing the ring buffer size at run time,
      num_possible_cpus() lets us use the upper limit on possible number of
      CPUs therefore avoiding having to deal with hotplugging CPUs on and off.
      This introduces the kernel configuration option LOG_CPU_MAX_BUF_SHIFT
      which is used to specify the maximum amount of contributions to the
      kernel ring buffer in the worst case before the kernel ring buffer flips
      over, the size is specified as a power of 2.  The total amount of
      contributions made by each CPU must be greater than half of the default
      kernel ring buffer size (1 << LOG_BUF_SHIFT bytes) in order to trigger
      an increase upon bootup.  The kernel ring buffer is increased to the
      next power of two that would fit the required minimum kernel ring buffer
      size plus the additional CPU contribution.  For example if LOG_BUF_SHIFT
      is 18 (256 KB) you'd require at least 128 KB contributions by other CPUs
      in order to trigger an increase of the kernel ring buffer.  With a
      LOG_CPU_BUF_SHIFT of 12 (4 KB) you'd require at least anything over > 64
      possible CPUs to trigger an increase.  If you had 128 possible CPUs the
      amount of minimum required kernel ring buffer bumps to:
      
         ((1 << 18) + ((128 - 1) * (1 << 12))) / 1024 = 764 KB
      
      Since we require the ring buffer to be a power of two the new required
      size would be 1024 KB.
      
      This CPU contributions are ignored when the "log_buf_len" kernel
      parameter is used as it forces the exact size of the ring buffer to an
      expected power of two value.
      
      [pmladek@suse.cz: fix build]
      Signed-off-by: NLuis R. Rodriguez <mcgrof@suse.com>
      Signed-off-by: NPetr Mladek <pmladek@suse.cz>
      Tested-by: NDavidlohr Bueso <davidlohr@hp.com>
      Tested-by: NPetr Mladek <pmladek@suse.cz>
      Reviewed-by: NDavidlohr Bueso <davidlohr@hp.com>
      Cc: Andrew Lunn <andrew@lunn.ch>
      Cc: Stephen Warren <swarren@wwwdotorg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Joe Perches <joe@perches.com>
      Cc: Arun KS <arunks.linux@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      23b2899f
    • L
      printk: make dynamic units clear for the kernel ring buffer · f5405172
      Luis R. Rodriguez 提交于
      Signed-off-by: NLuis R. Rodriguez <mcgrof@suse.com>
      Suggested-by: NDavidlohr Bueso <davidlohr@hp.com>
      Cc: Andrew Lunn <andrew@lunn.ch>
      Cc: Stephen Warren <swarren@wwwdotorg.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Joe Perches <joe@perches.com>
      Cc: Arun KS <arunks.linux@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f5405172
    • L
      printk: move power of 2 practice of ring buffer size to a helper · c0a318a3
      Luis R. Rodriguez 提交于
      In practice the power of 2 practice of the size of the kernel ring
      buffer remains purely historical but not a requirement, specially now
      that we have LOG_ALIGN and use it for both static and dynamic
      allocations.  It could have helped with implicit alignment back in the
      days given the even the dynamically sized ring buffer was guaranteed to
      be aligned so long as CONFIG_LOG_BUF_SHIFT was set to produce a
      __LOG_BUF_LEN which is architecture aligned, since log_buf_len=n would
      be allowed only if it was > __LOG_BUF_LEN and we always ended up
      rounding the log_buf_len=n to the next power of 2 with
      roundup_pow_of_two(), any multiple of 2 then should be also architecture
      aligned.  These assumptions of course relied heavily on
      CONFIG_LOG_BUF_SHIFT producing an aligned value but users can always
      change this.
      
      We now have precise alignment requirements set for the log buffer size
      for both static and dynamic allocations, but lets upkeep the old
      practice of using powers of 2 for its size to help with easy expected
      scalable values and the allocators for dynamic allocations.  We'll reuse
      this later so move this into a helper.
      Signed-off-by: NLuis R. Rodriguez <mcgrof@suse.com>
      Cc: Andrew Lunn <andrew@lunn.ch>
      Cc: Stephen Warren <swarren@wwwdotorg.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Joe Perches <joe@perches.com>
      Cc: Arun KS <arunks.linux@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c0a318a3
    • L
      printk: make dynamic kernel ring buffer alignment explicit · 70300177
      Luis R. Rodriguez 提交于
      We have to consider alignment for the ring buffer both for the default
      static size, and then also for when an dynamic allocation is made when
      the log_buf_len=n kernel parameter is passed to set the size
      specifically to a size larger than the default size set by the
      architecture through CONFIG_LOG_BUF_SHIFT.
      
      The default static kernel ring buffer can be aligned properly if
      architectures set CONFIG_LOG_BUF_SHIFT properly, we provide ranges for
      the size though so even if CONFIG_LOG_BUF_SHIFT has a sensible aligned
      value it can be reduced to a non aligned value.  Commit 6ebb017d
      ("printk: Fix alignment of buf causing crash on ARM EABI") by Andrew
      Lunn ensures the static buffer is always aligned and the decision of
      alignment is done by the compiler by using __alignof__(struct log).
      
      When log_buf_len=n is used we allocate the ring buffer dynamically.
      Dynamic allocation varies, for the early allocation called before
      setup_arch() memblock_virt_alloc() requests a page aligment and for the
      default kernel allocation memblock_virt_alloc_nopanic() requests no
      special alignment, which in turn ends up aligning the allocation to
      SMP_CACHE_BYTES, which is L1 cache aligned.
      
      Since we already have the required alignment for the kernel ring buffer
      though we can do better and request explicit alignment for LOG_ALIGN.
      This does that to be safe and make dynamic allocation alignment
      explicit.
      Signed-off-by: NLuis R. Rodriguez <mcgrof@suse.com>
      Tested-by: NPetr Mladek <pmladek@suse.cz>
      Acked-by: NPetr Mladek <pmladek@suse.cz>
      Cc: Andrew Lunn <andrew@lunn.ch>
      Cc: Stephen Warren <swarren@wwwdotorg.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Joe Perches <joe@perches.com>
      Cc: Arun KS <arunks.linux@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      70300177