1. 14 Jun, 2020 1 commit
    • M
      treewide: replace '---help---' in Kconfig files with 'help' · a7f7f624
      Masahiro Yamada committed
      Since commit 84af7a61 ("checkpatch: kconfig: prefer 'help' over
      '---help---'"), the number of '---help---' has been gradually
      decreasing, but there are still more than 2400 instances.
      
      This commit finishes the conversion. While I touched the lines,
      I also fixed the indentation.
      
      There are a variety of indentation styles found.
      
        a) 4 spaces + '---help---'
        b) 7 spaces + '---help---'
        c) 8 spaces + '---help---'
        d) 1 space + 1 tab + '---help---'
        e) 1 tab + '---help---'    (correct indentation)
        f) 1 tab + 1 space + '---help---'
        g) 1 tab + 2 spaces + '---help---'
      
      In order to convert all of them to 1 tab + 'help', I ran the
      following command:
      
        $ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/'
      Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
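
The effect of the sed expression above on the seven indentation styles can be checked with a small sketch. This is a Python approximation, not the command the commit ran: the name `normalize_help_line` is hypothetical, and `[ \t]*` stands in for sed's `[[:space:]]*`, which should behave the same on a single line that has no trailing newline.

```python
import re

# Leading blanks followed by the old keyword, anchored at the start of
# the line just like the sed pattern.
_OLD_HELP = re.compile(r"^[ \t]*---help---")


def normalize_help_line(line):
    """Rewrite a leading '---help---' (any indentation) to one tab + 'help'."""
    return _OLD_HELP.sub("\thelp", line, count=1)


# The seven styles listed in the commit message, a) through g).
variants = [
    "    ---help---",       # a) 4 spaces
    "       ---help---",    # b) 7 spaces
    "        ---help---",   # c) 8 spaces
    " \t---help---",        # d) 1 space + 1 tab
    "\t---help---",         # e) 1 tab
    "\t ---help---",        # f) 1 tab + 1 space
    "\t  ---help---",       # g) 1 tab + 2 spaces
]

normalized = [normalize_help_line(v) for v in variants]
```

Lines that do not start with the old keyword are returned unchanged, because the pattern is anchored at the start of the line.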
  2. 10 Jun, 2020 1 commit
    • M
      mm: don't include asm/pgtable.h if linux/mm.h is already included · e31cf2f4
      Mike Rapoport committed
      Patch series "mm: consolidate definitions of page table accessors", v2.
      
      The low level page table accessors (pXY_index(), pXY_offset()) are
      duplicated across all architectures and sometimes more than once.  For
      instance, we have 31 definitions of pgd_offset() for 25 supported
      architectures.
      
      Most of these definitions are actually identical and typically it boils
      down to, e.g.
      
      static inline unsigned long pmd_index(unsigned long address)
      {
              return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
      }
      
      static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
      {
              return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
      }
      
      These definitions can be shared among 90% of the arches provided
      XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.
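
To make the index arithmetic concrete, here is a minimal Python sketch of the pmd_index() computation. The constants are assumptions chosen for illustration (x86-64 with 4 KiB pages and 4-level paging); each architecture defines its own PMD_SHIFT and PTRS_PER_PMD.

```python
# Assumed values for x86-64 with 4 KiB pages; not taken from the patch.
PMD_SHIFT = 21        # each PMD entry covers 2 MiB of address space
PTRS_PER_PMD = 512    # entries in one PMD table


def pmd_index(address):
    """Index of the PMD entry covering `address` within its PMD table."""
    return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1)


# 0x40200000 >> 21 is 513; masking with 511 wraps that to entry 1.
example = pmd_index(0x40200000)
```

The mask only works because PTRS_PER_PMD is a power of two, which is what lets a single generic definition serve any architecture that supplies the two constants.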
      
      For architectures that really need a custom version there is always the
      possibility to override the generic version with the usual ifdef magic.
      
      These patches introduce include/linux/pgtable.h that replaces
      include/asm-generic/pgtable.h and add the definitions of the page table
      accessors to the new header.
      
      This patch (of 12):
      
      The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
      functions involving page table manipulations, e.g.  pte_alloc() and
      pmd_alloc().  So, there is no point to explicitly include <asm/pgtable.h>
      in the files that include <linux/mm.h>.
      
      The include statements in such cases are removed with a simple loop:
      
      	for f in $(git grep -l "include <linux/mm.h>") ; do
      		sed -i -e '/include <asm\/pgtable.h>/ d' $f
      	done
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
      Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
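
The git grep and sed loop in this commit amounts to a per-file filter. The sketch below restates it in Python on file contents held in a string; the function name is hypothetical and the substring matching is deliberately as loose as the original patterns.

```python
def drop_redundant_pgtable_include(source):
    """Remove <asm/pgtable.h> includes from a file that includes <linux/mm.h>.

    Mirrors the loop in the commit message: files without the
    linux/mm.h include are left untouched.
    """
    lines = source.splitlines(keepends=True)
    if not any("include <linux/mm.h>" in line for line in lines):
        return source
    return "".join(
        line for line in lines if "include <asm/pgtable.h>" not in line
    )


before = (
    "#include <linux/mm.h>\n"
    "#include <asm/pgtable.h>\n"
    "#include <linux/slab.h>\n"
)
after = drop_redundant_pgtable_include(before)
```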
  3. 09 Jun, 2020 1 commit
    • V
      kernel/sysctl: support setting sysctl parameters from kernel command line · 3db978d4
      Vlastimil Babka committed
      Patch series "support setting sysctl parameters from kernel command line", v3.
      
      This series adds support for something that seems like many people
      always wanted but nobody added it yet, so here's the ability to set
      sysctl parameters via kernel command line options in the form of
      sysctl.vm.something=1
      
      The important part is Patch 1.  The second, not so important part is an
      attempt to clean up legacy one-off parameters that do the same thing as
      a sysctl.  I don't want to remove them completely for compatibility
      reasons, but with generic sysctl support the idea is to remove the
      one-off param handlers and treat the parameters as aliases for the
      sysctl variants.
      
      I have identified several parameters that mention sysctl counterparts in
      Documentation/admin-guide/kernel-parameters.txt but there might be more.
      The conversion also has varying levels of success:
      
       - numa_zonelist_order is converted in Patch 2 together with adding the
         necessary infrastructure. It's easy as it doesn't really do anything
         but warn on deprecated value these days.
      
       - hung_task_panic is converted in Patch 3, but there's a downside that
         now it only accepts 0 and 1, while previously it was any integer
         value
      
       - nmi_watchdog maps to two sysctls nmi_watchdog and hardlockup_panic,
         so there's no straightforward conversion possible
      
       - traceoff_on_warning is a flag without value and it would be required
         to handle that somehow in the conversion infrastructure, which seems
         pointless for a single flag
      
      This patch (of 5):
      
      A recently proposed patch to add vm_swappiness command line parameter in
      addition to existing sysctl [1] made me wonder why we don't have a
      general support for passing sysctl parameters via command line.
      
      Googling found only somebody else wondering the same [2], but I haven't
      found any prior discussion with reasons why not to do this.
      
      Setting the vm_swappiness issue aside (the underlying issue might be
      solved in a different way), quick search of kernel-parameters.txt shows
      there are already some that exist as both sysctl and kernel parameter -
      hung_task_panic, nmi_watchdog, numa_zonelist_order, traceoff_on_warning.
      
      A general mechanism would remove the need to add more of those one-offs
      and might be handy in situations where configuration by e.g.
      /etc/sysctl.d/ is impractical.
      
      Hence, this patch adds a new parse_args() pass that looks for parameters
      prefixed by 'sysctl.' and tries to interpret them as writes to the
      corresponding sys/ files using a temporary in-kernel procfs mount.
      This mechanism was suggested by Eric W.  Biederman [3], as it handles
      all dynamically registered sysctl tables, even though we don't handle
      modular sysctls.  Errors due to e.g.  invalid parameter name or value
      are reported in the kernel log.
      
      The processing is hooked right before the init process is loaded, as
      some handlers might be more complicated than simple setters and might
      need some subsystems to be initialized.  If, at that point, the init
      process can be started and eventually execute a process writing to
      /proc/sys/, then it should also be fine to do that from the kernel.
      
      Sysctls registered later on module load time are not set by this
      mechanism - it's expected that in such scenarios, setting sysctl values
      from userspace is practical enough.
      
      [1] https://lore.kernel.org/r/BL0PR02MB560167492CA4094C91589930E9FC0@BL0PR02MB5601.namprd02.prod.outlook.com/
      [2] https://unix.stackexchange.com/questions/558802/how-to-set-sysctl-using-kernel-command-line-parameter
      [3] https://lore.kernel.org/r/87bloj2skm.fsf@x220.int.ebiederm.org/
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
      Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: "Guilherme G . Piccoli" <gpiccoli@canonical.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Link: http://lkml.kernel.org/r/20200427180433.7029-1-vbabka@suse.cz
      Link: http://lkml.kernel.org/r/20200427180433.7029-2-vbabka@suse.cz
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
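
As a rough illustration of the 'sysctl.' pass described in this commit, the sketch below turns one command line parameter into a (file, value) pair. It is not the kernel code: the function name is hypothetical, and mapping dots to path separators under sys/ is an assumption inferred from the sysctl.vm.something=1 example.

```python
SYSCTL_PREFIX = "sysctl."


def sysctl_arg_to_write(param):
    """Map 'sysctl.a.b=value' to ('sys/a/b', 'value'), or None if not a match.

    The returned path is relative to the root of a procfs mount.
    """
    if not param.startswith(SYSCTL_PREFIX) or "=" not in param:
        return None
    name, value = param[len(SYSCTL_PREFIX):].split("=", 1)
    if not name:
        return None
    # Assumed mapping: every dot in the name becomes a path separator.
    return "sys/" + name.replace(".", "/"), value


result = sysctl_arg_to_write("sysctl.vm.something=1")
```

Parameters without the prefix return None, which corresponds to the pass ignoring everything that is not addressed to it.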
  4. 05 Jun, 2020 2 commits
    • N
      Kconfig: add config option for asm goto w/ outputs · 587f1701
      Nick Desaulniers committed
      This allows C code to make use of compilers with support for output
      variables along the fallthrough path via preprocessor define:
      
        CONFIG_CC_HAS_ASM_GOTO_OUTPUT
      
      [ This is not used anywhere yet, and currently released compilers don't
        support this yet, but it's coming, and I have some local experimental
        patches to take advantage of it when it does   - Linus ]
      Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • C
      init: allow distribution configuration of default init · ada4ab7a
      Chris Down committed
      Some init systems (eg.  systemd) have init at their own paths, for
      example, /usr/lib/systemd/systemd.  A compatibility symlink to one of the
      hardcoded init paths is provided by another package, usually named
      something like systemd-sysvcompat or similar.
      
      Currently distro maintainers who are hands-off on the bootloader are more
      or less required to include those compatibility links as part of their
      base distribution, because it's hard to migrate away from them since
      there's a risk some users will not get the message to set init= on the
      kernel command line appropriately.
      
      Moreover, for distributions where the init system is something the
      distribution itself is opinionated about (eg.  Arch, which has systemd in
      the required `base` package), we could usually reasonably configure this
      ahead of time when building the distribution kernel.  However, we
      currently simply don't have any way to configure the kernel to do this.
      Here's an example discussion where removing sysvcompat was discussed by
      distro maintainers[0].
      
      This patch adds a new Kconfig tunable, CONFIG_DEFAULT_INIT, which if set
      is tried before the hardcoded fallback list.  So the order of precedence
      is now thus:
      
      1. init= on command line (on failure: panic)
      2. CONFIG_DEFAULT_INIT (on failure: try #3)
      3. Hardcoded fallback list (on failure: panic)
      
      This new config parameter will allow distribution maintainers to move away
      from these compatibility links safely, without having to worry that their
      users might not have the right init=.
      
      There are also two other benefits of this over having the distribution
      maintain a symlink:
      
      1. One of the value propositions over simply having distributions
         maintain a /sbin/init symlink via a package is that it also frees
         distributions which have a preferred default, but not mandatory, init
         system from having their package manager fight with their users for
         control of /{s,}bin/init.  Instead, the distribution simply makes
         their preference known in CONFIG_DEFAULT_INIT, and if the user
         installs another init system and uninstalls the default one they can
         still make use of /{s,}bin/init and friends for their own uses. This
         makes more cases Just Work(tm) without the user having to perform
         extra configuration via init=.
      
      2. Since before this we don't know which path the distribution actually
         _intends_ to serve init from, we don't pr_err if it is simply
         missing, and usually will just silently put the user in a /bin/sh
         shell. Now that the distribution can make a declaration of intent, we
         can be more vocal when this init system fails to launch for any
         reason, even if it's simply because no file exists at that location,
         speeding up the palaver of init/mount dependency/etc debugging a bit.
      
      [0]: https://lists.archlinux.org/pipermail/arch-dev-public/2019-January/029435.html
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Link: http://lkml.kernel.org/r/20200522160234.GA1487022@chrisdown.name
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
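
The order of precedence listed in this commit can be sketched as a small decision function. This is an illustration only: `pick_init`, `KernelPanic` and the `can_run` callback are hypothetical names, and the fallback list is an assumed example rather than something the message spells out.

```python
class KernelPanic(Exception):
    """Stands in for the kernel panicking when no init can be started."""


# Assumed, illustrative fallback list; the message only says "hardcoded".
FALLBACK_INITS = ("/sbin/init", "/etc/init", "/bin/init", "/bin/sh")


def pick_init(cmdline_init, default_init, can_run):
    """Return the init path chosen by the three-step order of precedence."""
    # 1. init= on the command line: failure is fatal, nothing else is tried.
    if cmdline_init:
        if can_run(cmdline_init):
            return cmdline_init
        raise KernelPanic("requested init %s failed" % cmdline_init)
    # 2. CONFIG_DEFAULT_INIT: on failure, fall through to step 3.
    if default_init and can_run(default_init):
        return default_init
    # 3. Hardcoded fallback list: panic only if every entry fails.
    for path in FALLBACK_INITS:
        if can_run(path):
            return path
    raise KernelPanic("no working init found")
```

Only step 2 is allowed to fail quietly, which is what lets a distribution state a preference without removing the existing fallbacks.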
  5. 04 Jun, 2020 2 commits
    • J
      mm: memcontrol: make swap tracking an integral part of memory control · 2d1c4980
      Johannes Weiner committed
      Without swap page tracking, users that are otherwise memory controlled can
      easily escape their containment and allocate significant amounts of memory
      that they're not being charged for.  That's because swap does readahead,
      but without the cgroup records of who owned the page at swapout, readahead
      pages don't get charged until somebody actually faults them into their
      page table and we can identify an owner task.  This can be maliciously
      exploited with MADV_WILLNEED, which triggers arbitrary readahead
      allocations without charging the pages.
      
      Make swap page tracking an integral part of memcg and remove the
      Kconfig options.  In the first place, it was only made configurable to
      allow users to save some memory.  But the overhead of tracking cgroup
      ownership per swap page is minimal - 2 bytes per page, or 512k per 1G of
      swap, or 0.04%.  Saving that at the expense of broken containment
      semantics is not something we should present as a coequal option.
      
      The swapaccount=0 boot option will continue to exist, and it will
      eliminate the page_counter overhead and hide the swap control files, but
      it won't disable swap slot ownership tracking.
      
      This patch makes sure we always have the cgroup records at swapin time;
      the next patch will fix the actual bug by charging readahead swap pages at
      swapin time rather than at fault time.
      
      v2: fix double swap charge bug in cgroup1/cgroup2 code gating
      
      [hannes@cmpxchg.org: fix crash with cgroup_disable=memory]
        Link: http://lkml.kernel.org/r/20200521215855.GB815153@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
      Link: http://lkml.kernel.org/r/20200508183105.225460-16-hannes@cmpxchg.org
      Debugged-by: Hugh Dickins <hughd@google.com>
      Debugged-by: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
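
The overhead figures quoted in this commit can be reproduced with a short calculation. The sketch assumes 4 KiB pages, which the message does not state; the 2-byte record size is taken from the message, and the function name is hypothetical.

```python
PAGE_SIZE = 4096      # assumed page size in bytes
RECORD_BYTES = 2      # per-swap-page ownership record, from the message


def swap_tracking_bytes(swap_bytes):
    """Memory needed to track cgroup ownership for `swap_bytes` of swap."""
    return (swap_bytes // PAGE_SIZE) * RECORD_BYTES


one_gib = 1 << 30
overhead = swap_tracking_bytes(one_gib)   # 262144 pages * 2 bytes each
ratio = overhead / one_gib                # fraction of the swap size
```

Under these assumptions 1 GiB of swap costs 512 KiB of tracking memory, a ratio just under 0.05%, which is the same order as the 0.04% quoted above.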
    • D
      padata: initialize earlier · f1b192b1
      Daniel Jordan committed
      padata will soon initialize the system's struct pages in parallel, so it
      needs to be ready by page_alloc_init_late().
      
      The error return from padata_driver_init() triggers an initcall warning,
      so add a warning to padata_init() to avoid silent failure.
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Josh Triplett <josh@joshtriplett.org>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Robert Elliott <elliott@hpe.com>
      Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Link: http://lkml.kernel.org/r/20200527173608.2885243-3-daniel.m.jordan@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 19 May, 2020 1 commit
    • D
      pipe: Add general notification queue support · c73be61c
      David Howells committed
      Make it possible to have a general notification queue built on top of a
      standard pipe.  Notifications are 'spliced' into the pipe and then read
      out.  splice(), vmsplice() and sendfile() are forbidden on pipes used for
      notifications as post_one_notification() cannot take pipe->mutex.  This
      means that notifications could be posted in between individual pipe
      buffers, making iov_iter_revert() difficult to effect.
      
      The way the notification queue is used is:
      
       (1) An application opens a pipe with a special flag and indicates the
           number of messages it wishes to be able to queue at once (this can
           only be set once):
      
      	pipe2(fds, O_NOTIFICATION_PIPE);
      	ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
      
       (2) The application then uses poll() and read() as normal to extract data
           from the pipe.  read() will return multiple notifications if the
           buffer is big enough, but it will not split a notification across
           buffers - rather it will return a short read or EMSGSIZE.
      
           Notification messages include a length in the header so that the
           caller can split them up.
      
      Each message has a header that describes it:
      
      	struct watch_notification {
      		__u32	type:24;
      		__u32	subtype:8;
      		__u32	info;
      	};
      
      The type indicates the source (eg. mount tree changes, superblock events,
      keyring changes, block layer events) and the subtype indicates the event
      type (eg. mount, unmount; EIO, EDQUOT; link, unlink).  The info field
      indicates a number of things, including the entry length, an ID assigned to
      a watchpoint contributing to this buffer and type-specific flags.
      
      Supplementary data, such as the key ID that generated an event, can be
      attached in additional slots.  The maximum message size is 127 bytes.
      Messages may not be padded or aligned, so there is no guarantee, for
      example, that the notification type will be on a 4-byte boundary.
      Signed-off-by: David Howells <dhowells@redhat.com>
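
Since read() may return several notifications at once, the caller has to split the buffer using the length stored in each header. The sketch below shows one way to do that in Python. It rests on two assumptions that the message does not spell out: a little-endian layout in which `type` occupies the low 24 bits of the first word, and the entry length sitting in the low 7 bits of `info` (consistent with the 127-byte maximum). The function name is hypothetical.

```python
import struct

HEADER = struct.Struct("<II")   # two little-endian __u32 words
INFO_LENGTH_MASK = 0x7F         # assumed: low 7 bits of info hold the length


def split_notifications(buf):
    """Split a buffer returned by read() into individual notifications."""
    messages = []
    pos = 0
    while pos + HEADER.size <= len(buf):
        word0, info = HEADER.unpack_from(buf, pos)
        length = info & INFO_LENGTH_MASK
        # Stop on a malformed or truncated entry instead of guessing.
        if length < HEADER.size or pos + length > len(buf):
            break
        messages.append({
            "type": word0 & 0xFFFFFF,   # low 24 bits
            "subtype": word0 >> 24,     # high 8 bits
            "info": info,
            "payload": bytes(buf[pos + HEADER.size:pos + length]),
        })
        pos += length
    return messages
```

Because messages may not be padded or aligned, the loop advances by the recorded length rather than by a fixed stride.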
  7. 17 May, 2020 2 commits
    • M
      bpfilter: check if $(CC) can link static libc in Kconfig · b1183b6d
      Masahiro Yamada committed
      On Fedora, linking static glibc requires the glibc-static RPM package,
      which is not part of the glibc-devel package.
      
      CONFIG_CC_CAN_LINK does not check the capability of static linking,
      so you can enable CONFIG_BPFILTER_UMH, then fail to build:
      
          HOSTLD  net/bpfilter/bpfilter_umh
        /usr/bin/ld: cannot find -lc
        collect2: error: ld returned 1 exit status
      
      Add CONFIG_CC_CAN_LINK_STATIC, and make CONFIG_BPFILTER_UMH depend
      on it.
      Reported-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
      Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
    • M
      bpfilter: match bit size of bpfilter_umh to that of the kernel · 9371f86e
      Masahiro Yamada committed
      bpfilter_umh is built for the default machine bit of the compiler,
      which may not match the bit size of the kernel.
      
      This happens in the scenario below:
      
      You can use biarch GCC that defaults to 64-bit for building the 32-bit
      kernel. In this case, Kbuild passes -m32 to teach the compiler to
      produce 32-bit kernel space objects. However, it is missing when
      building bpfilter_umh. It is built as a 64-bit ELF, and then embedded
      into the 32-bit kernel.
      
      The 32-bit kernel and 64-bit umh is a bad combination.
      
      In theory, we can have 32-bit umh running on 64-bit kernel, but we do
      not have a good reason to support such a usecase.
      
      The best is to match the bit size between them.
      
      Pass -m32 or -m64 to the umh build command if it is found in
      $(KBUILD_CFLAGS). Evaluate CC_CAN_LINK against the kernel bit-size.
      Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
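
The flag selection described in this commit is a simple filter over the kernel's compiler flags. The sketch below models it in Python rather than in Makefile syntax; the function name is hypothetical and the sample flag strings are made up for illustration.

```python
BIT_SIZE_FLAGS = ("-m32", "-m64")


def umh_bit_flags(kbuild_cflags):
    """Pick out -m32 or -m64 from a KBUILD_CFLAGS-style string, if present."""
    return [flag for flag in kbuild_cflags.split() if flag in BIT_SIZE_FLAGS]


# A 32-bit kernel built with a biarch compiler that defaults to 64-bit.
flags = umh_bit_flags("-Wall -O2 -m32 -fno-strict-aliasing")
```

When neither flag is present the result is empty, so the umh build falls back to the compiler default, as it did before the change.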
  8. 15 May, 2020 3 commits
    • S
      scs: Add support for Clang's Shadow Call Stack (SCS) · d08b9f0c
      Sami Tolvanen committed
      This change adds generic support for Clang's Shadow Call Stack,
      which uses a shadow stack to protect return addresses from being
      overwritten by an attacker. Details are available here:
      
        https://clang.llvm.org/docs/ShadowCallStack.html
      
      Note that security guarantees in the kernel differ from the ones
      documented for user space. The kernel must store addresses of
      shadow stacks in memory, which means an attacker capable of reading
      and writing arbitrary memory may be able to locate them and hijack
      control flow by modifying the stacks.
      Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      [will: Numerous cosmetic changes]
      Signed-off-by: Will Deacon <will@kernel.org>
    • D
      bpf: Restrict bpf_probe_read{, str}() only to archs where they work · 0ebeea8c
      Daniel Borkmann committed
      Given the legacy bpf_probe_read{,str}() BPF helpers are broken on archs
      with overlapping address ranges, we should really take the next step to
      disable them from BPF use there.
      
      To generally fix the situation, we've recently added new helper variants
      bpf_probe_read_{user,kernel}() and bpf_probe_read_{user,kernel}_str().
      For details on them, see 6ae08ae3 ("bpf: Add probe_read_{user, kernel}
      and probe_read_{user,kernel}_str helpers").
      
      Given bpf_probe_read{,str}() have been around for ~5 years by now, there
      are plenty of users at least on x86 still relying on them today, so we
      cannot remove them entirely w/o breaking the BPF tracing ecosystem.
      
      However, their use should be restricted to archs with non-overlapping
      address ranges where they are working in their current form. Therefore,
      move this behind a CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE and
      have x86, arm64, arm select it (other archs supporting it can follow-up
      on it as well).
      
      The remaining archs can work around this easily by relying on the
      feature probe from bpftool, which emits defines that BPF C code can
      use to implement a drop-in replacement for old/new kernels
      via: bpftool feature probe macro
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/bpf/20200515101118.6508-2-daniel@iogearbox.net
    • B
      x86: Fix early boot crash on gcc-10, third try · a9a3ed1e
      Borislav Petkov committed
      ... or the odyssey of trying to disable the stack protector for the
      function which generates the stack canary value.
      
      The whole story started with Sergei reporting a boot crash with a kernel
      built with gcc-10:
      
        Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: start_secondary
        CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc5-00235-gfffb08b3 #139
        Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./H77M-D3H, BIOS F12 11/14/2013
        Call Trace:
          dump_stack
          panic
          ? start_secondary
          __stack_chk_fail
          start_secondary
          secondary_startup_64
        ---[ end Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: start_secondary
      
      This happens because gcc-10 tail-call optimizes the last function call
      in start_secondary() - cpu_startup_entry() - and thus emits a stack
      canary check which fails because the canary value changes after the
      boot_init_stack_canary() call.
      
      To fix that, the initial attempt was to mark the one function which
      generates the stack canary with:
      
        __attribute__((optimize("-fno-stack-protector"))) ... start_secondary(void *unused)
      
      however, using the optimize attribute doesn't work cumulatively
      as the attribute does not add to but rather replaces previously
      supplied optimization options - roughly all -fxxx options.
      
      The key one among them is -fno-omit-frame-pointer; losing it means the
      frame pointer, which the kernel needs, is no longer present.
      
      The next attempt to prevent compilers from tail-call optimizing
      the last function call cpu_startup_entry(), shy of carving out
      start_secondary() into a separate compilation unit and building it with
      -fno-stack-protector, was to add an empty asm("").
      
      This current solution was short and sweet, and reportedly, is supported
      by both compilers but we didn't get very far this time: future (LTO?)
      optimization passes could potentially eliminate this, which leads us
      to the third attempt: having an actual memory barrier there which the
      compiler cannot ignore or move around etc.
      
      That should hold for a long time, but hey we said that about the other
      two solutions too so...
      Reported-by: Sergei Trofimovich <slyfox@gentoo.org>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Tested-by: Kalle Valo <kvalo@codeaurora.org>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200314164451.346497-1-slyfox@gentoo.org
  9. 12 May, 2020 5 commits
  10. 10 May, 2020 2 commits
    • L
      gcc-10: mark more functions __init to avoid section mismatch warnings · e99332e7
      Linus Torvalds committed
      It seems that for whatever reason, gcc-10 ends up not inlining a couple
      of functions that used to be inlined before.  Even if they only have one
      single callsite - it looks like gcc may have decided that the code was
      unlikely, and not worth inlining.
      
      The code generation difference is harmless, but caused a few new section
      mismatch errors, since the (now no longer inlined) function wasn't in
      the __init section, but called other init functions:
      
         Section mismatch in reference from the function kexec_free_initrd() to the function .init.text:free_initrd_mem()
         Section mismatch in reference from the function tpm2_calc_event_log_size() to the function .init.text:early_memremap()
         Section mismatch in reference from the function tpm2_calc_event_log_size() to the function .init.text:early_memunmap()
      
      So add the appropriate __init annotation to make modpost not complain.
      In both cases there were trivially just a single callsite from another
      __init function.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • L
      Stop the ad-hoc games with -Wno-maybe-initialized · 78a5255f
      Linus Torvalds committed
      We have some rather random rules about when we accept the
      "maybe-initialized" warnings, and when we don't.
      
      For example, we consider it unreliable for gcc versions < 4.9, but also
      if -O3 is enabled, or if optimizing for size.  And then various kernel
      config options disabled it, because they know that they trigger that
      warning by confusing gcc sufficiently (ie PROFILE_ALL_BRANCHES).
      
      And now gcc-10 seems to be introducing a lot of those warnings too, so
      it falls under the same heading as 4.9 did.
      
      At the same time, we have a very straightforward way to _enable_ that
      warning when wanted: use "W=2" to enable more warnings.
      
      So stop playing these ad-hoc games, and just disable that warning by
      default, with the known and straight-forward "if you want to work on the
      extra compiler warnings, use W=123".
      
      Would it be great to have code that is always so obvious that it never
      confuses the compiler whether a variable is used initialized or not?
      Yes, it would.  In a perfect world, the compilers would be smarter, and
      our source code would be simpler.
      
      That's currently not the world we live in, though.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 06 May, 2020 1 commit
  12. 28 Apr, 2020 2 commits
    • P
      rcu-tasks: Split ->trc_reader_need_end · 276c4104
      Paul E. McKenney committed
      This commit splits ->trc_reader_need_end by using the rcu_special union.
      This change permits readers to check to see if a memory barrier is
      required without any added overhead in the common case where no such
      barrier is required.  This commit also adds the read-side checking.
      Later commits will add the machinery to properly set the new
      ->trc_reader_special.b.need_mb field.
      
      This commit also makes rcu_read_unlock_trace_special() tolerate nested
      read-side critical sections within interrupt and NMI handlers.
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      276c4104
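The idea of splitting one flag into per-byte fields that can still be tested with a single load can be sketched in userspace as follows. This is a toy model under stated assumptions: the type names, the two-field layout, and the counters are all hypothetical and do not reproduce the kernel's `union rcu_special` or its unlock path.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for a "special" union: byte flags overlaid by one word. */
union reader_special {
	struct {
		uint8_t need_qs;  /* updater wants a quiescent-state report */
		uint8_t need_mb;  /* reader must execute a memory barrier */
	} b;
	uint16_t s;               /* all flags viewed as a single word */
};

/* Hypothetical per-task state for the toy model. */
struct toy_task {
	int nesting;                  /* read-side nesting depth */
	union reader_special special;
	int slow_path_calls;          /* how often the slow path ran */
	int barriers;                 /* how often a barrier was "executed" */
};

/* Slow path: runs only if some flag is set at the outermost unlock. */
static void toy_unlock_special(struct toy_task *t)
{
	t->slow_path_calls++;
	if (t->special.b.need_mb)
		t->barriers++;    /* a real reader would issue a full barrier */
	t->special.s = 0;         /* toy simplification: clear every flag */
}

void toy_read_lock(struct toy_task *t)
{
	t->nesting++;
}

void toy_read_unlock(struct toy_task *t)
{
	int nesting;

	assert(t->nesting > 0);
	nesting = --t->nesting;

	/* Common case: one load of .s says that no flag is set. */
	if (t->special.s == 0 || nesting)
		return;
	toy_unlock_special(t);
}
```

The point of the layout is that adding the `need_mb` byte costs the fast path nothing: it still tests the combined word once, and only the slow path looks at individual fields.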
    • P
      rcu-tasks: Add an RCU Tasks Trace to simplify protection of tracing hooks · d5f177d3
      Paul E. McKenney committed
      Because RCU does not watch exception early-entry/late-exit, idle-loop,
      or CPU-hotplug execution, protection of tracing and BPF operations is
      needlessly complicated.  This commit therefore adds a variant of
      Tasks RCU that:
      
      o	Has explicit read-side markers to allow finite grace periods in
      	the face of in-kernel loops for PREEMPT=n builds.  These markers
      	are rcu_read_lock_trace() and rcu_read_unlock_trace().
      
      o	Protects code in the idle loop, exception entry/exit, and
      	CPU-hotplug code paths.  In this respect, RCU-tasks trace is
      	similar to SRCU, but with lighter-weight readers.
      
      o	Avoids expensive read-side instructions, having overhead similar
      	to that of Preemptible RCU.
      
      There are of course downsides:
      
      o	The grace-period code can send IPIs to CPUs, even when those
      	CPUs are in the idle loop or in nohz_full userspace.  This is
      	mitigated by later commits.
      
      o	It is necessary to scan the full tasklist, much as for Tasks RCU.
      
      o	There is a single callback queue guarded by a single lock,
      	again, much as for Tasks RCU.  However, those early use cases
      	that request multiple grace periods in quick succession are
      	expected to do so from a single task, which makes the single
      	lock almost irrelevant.  If needed, multiple callback queues
      	can be provided using any number of schemes.
      
      Perhaps most important, this variant of RCU does not affect the vanilla
      flavors, rcu_preempt and rcu_sched.  The fact that RCU Tasks Trace
      readers can operate from idle, offline, and exception entry/exit in no
      way enables rcu_preempt and rcu_sched readers to do so.
      
      The memory ordering was outlined here:
      https://lore.kernel.org/lkml/20200319034030.GX3199@paulmck-ThinkPad-P72/
      
      This effort benefited greatly from off-list discussions of BPF
      requirements with Alexei Starovoitov and Andrii Nakryiko.  At least
      some of the on-list discussions are captured in the Link: tags below.
      In addition, KCSAN was quite helpful in finding some early bugs.
      
      Link: https://lore.kernel.org/lkml/20200219150744.428764577@infradead.org/
      Link: https://lore.kernel.org/lkml/87mu8p797b.fsf@nanos.tec.linutronix.de/
      Link: https://lore.kernel.org/lkml/20200225221305.605144982@linutronix.de/
      Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
      Cc: Andrii Nakryiko <andriin@fb.com>
      [ paulmck: Apply feedback from Steve Rostedt and Joel Fernandes. ]
      [ paulmck: Decrement trc_n_readers_need_end upon IPI failure. ]
      [ paulmck: Fix locking issue reported by rcutorture. ]
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      d5f177d3
  13. 27 Apr, 2020 1 commit
    • R
      x86/setup: Add an initrdmem= option to specify initrd physical address · 694cfd87
      Ronald G. Minnich committed
      Add the initrdmem option:
      
        initrdmem=ss[KMG],nn[KMG]
      
      which is used to specify the physical address of the initrd, almost
      always an address in FLASH. Also add code for x86 to use the existing
      phys_initrd_start and phys_initrd_size variables in the kernel.
      
      This is useful in cases where a kernel and an initrd are placed in FLASH,
      but there is no firmware file system structure in the FLASH.
      
      One such situation occurs when unused FLASH space on UEFI systems has
      been reclaimed by, e.g., taking it from the Management Engine. For
      example, on many systems, the ME is given half the FLASH part; not only
      is 2.75M of an 8M part unused, but 10.75M of a 16M part is unused. This
      space can be used to contain an initrd, but Linux needs to be told where
      it is.
      
      This space is "raw": due to, e.g., UEFI limitations, it cannot be added
      to UEFI firmware volumes without rebuilding UEFI from source or writing
      a UEFI device driver. It can be referenced only as a physical address
      and size.
      
      At the same time, if a kernel can be "netbooted" or loaded from GRUB or
      syslinux, the option of not using the physical address specification
      should be available.
      
      Then, it is easy to boot the kernel and provide an initrd; or boot the
      kernel and let it use the initrd in FLASH. In practice, this has
      proven to be very helpful when integrating Linux into FLASH on x86.
      
      Hence, the most flexible and convenient path is to enable the initrdmem
      command line option in a way that it is the last choice tried.
      
      For example, on the DigitalLoggers Atomic Pi, an image can be burnt into
      FLASH with a built-in command line which includes:
      
        initrdmem=0xff968000,0x200000
      
      which specifies a location and size.
      
       [ bp: Massage commit message, make it passive. ]
      
      [akpm@linux-foundation.org: coding style fixes]
      Signed-off-by: Ronald G. Minnich <rminnich@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com>
      Link: http://lkml.kernel.org/r/CAP6exYLK11rhreX=6QPyDQmW7wPHsKNEFtXE47pjx41xS6O7-A@mail.gmail.com
      Link: https://lkml.kernel.org/r/20200426011021.1cskg0AGd%akpm@linux-foundation.org
      694cfd87
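A userspace sketch of how an argument of the form `ss[KMG],nn[KMG]` could be split into a start address and a size. The function names are hypothetical and `parse_size()` is only a stand-in for a kernel-style size parser; this is not the code the commit adds.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Stand-in size parser: a number in any base strtoull() accepts with
 * base 0, followed by an optional K, M or G suffix. '*end' is set to
 * the first unparsed character, or to 's' if no digits were found.
 */
static unsigned long long parse_size(const char *s, const char **end)
{
	char *p;
	unsigned long long v = strtoull(s, &p, 0);

	if (p == s) {           /* no digits consumed */
		*end = s;
		return 0;
	}
	switch (*p) {
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		p++;
		break;
	default:
		break;
	}
	*end = p;
	return v;
}

/* Parse "start,size"; return 0 on success and -1 on malformed input. */
int parse_initrdmem(const char *arg, unsigned long long *start,
		    unsigned long long *size)
{
	const char *end;
	const char *sz;

	assert(arg && start && size);

	*start = parse_size(arg, &end);
	if (end == arg || *end != ',')
		return -1;

	sz = end + 1;
	*size = parse_size(sz, &end);
	if (end == sz || *end != '\0')
		return -1;
	return 0;
}
```

With the command line from the commit message, `parse_initrdmem("0xff968000,0x200000", &start, &size)` yields a start of 0xff968000 and a size of 0x200000 (2 MiB).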
  14. 16 Apr, 2020 1 commit
  15. 14 Apr, 2020 1 commit
    • M
      kcsan: Add support for scoped accesses · 757a4cef
      Marco Elver committed
      This adds support for scoped accesses, where the memory range is checked
      for the duration of the scope. The feature is implemented by inserting
      the relevant access information into a list of scoped accesses for
      the current execution context, which are then checked (until removed)
      on every call (through instrumentation) into the KCSAN runtime.
      
      An alternative, more complex, implementation could set up a watchpoint for
      the scoped access, and keep the watchpoint set up. This, however, would
      require first exposing a handle to the watchpoint, as well as dealing
      with cases such as accesses by the same thread while the watchpoint is
      still set up (and several more cases). It is also doubtful if this would
      provide any benefit, since the majority of delay where the watchpoint
      is set up is likely due to the injected delays by KCSAN.  Therefore,
      the implementation in this patch is simpler and avoids hurting KCSAN's
      main use-case (normal data race detection); it also implicitly increases
      scoped-access race-detection-ability due to increased probability of
      setting up watchpoints by repeatedly calling __kcsan_check_access()
      throughout the scope of the access.
      
      The implementation required adding an additional conditional branch to
      the fast-path. However, the microbenchmark showed a *speedup* of ~5%
      on the fast-path. This appears to be due to subtly improved codegen by
      GCC from moving get_ctx() and associated load of preempt_count earlier.
      Suggested-by: Boqun Feng <boqun.feng@gmail.com>
      Suggested-by: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Marco Elver <elver@google.com>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      757a4cef
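The list-based approach described above can be modelled in a few lines. This is a toy sketch, not KCSAN code: every type and function name is hypothetical, and the actual race check is replaced by a counter so that the re-checking behaviour is observable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for one scoped access. */
struct scoped_access {
	const void *ptr;
	size_t size;
	unsigned long checks;        /* times this range was re-checked */
	struct scoped_access *next;
};

/* Hypothetical per-execution-context state. */
struct toy_ctx {
	struct scoped_access *scoped; /* list of active scoped accesses */
};

/* Enter the scope: add the access to the context's list. */
void scoped_begin(struct toy_ctx *ctx, struct scoped_access *sa,
		  const void *ptr, size_t size)
{
	assert(ctx && sa);
	sa->ptr = ptr;
	sa->size = size;
	sa->checks = 0;
	sa->next = ctx->scoped;
	ctx->scoped = sa;
}

/* Leave the scope: unlink the access so it is no longer checked. */
void scoped_end(struct toy_ctx *ctx, struct scoped_access *sa)
{
	struct scoped_access **pp;

	for (pp = &ctx->scoped; *pp; pp = &(*pp)->next) {
		if (*pp == sa) {
			*pp = sa->next;
			sa->next = NULL;
			return;
		}
	}
}

/*
 * Stand-in for the runtime entry point that instrumentation calls on
 * every access. Besides the access itself, each active scoped access
 * is checked again; here "checking" only bumps a counter.
 */
void toy_check_access(struct toy_ctx *ctx, const void *ptr, size_t size)
{
	struct scoped_access *sa;

	for (sa = ctx->scoped; sa; sa = sa->next)
		sa->checks++;

	(void)ptr;   /* the plain access would be checked here */
	(void)size;
}
```

Because the scoped range is revisited on every instrumented access inside the scope, a longer scope gets more chances to be checked, which matches the commit's remark about increased detection probability.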
  16. 11 Apr, 2020 1 commit
  17. 08 Apr, 2020 4 commits
  18. 01 Apr, 2020 1 commit
  19. 31 Mar, 2020 1 commit
  20. 30 Mar, 2020 1 commit
  21. 27 Mar, 2020 1 commit
  22. 25 Mar, 2020 1 commit
  23. 24 Mar, 2020 1 commit
    • C
      block: remove __bdevname · ea3edd4d
      Christoph Hellwig committed
      There is no good reason for __bdevname to exist.  Just open code
      printing the string in the callers.  For three of them the format
      string can be trivially merged into existing printk statements,
      and in init/do_mounts.c we can at least do the scnprintf once at
      the start of the function, regardless of CONFIG_BLOCK, to
      make the output for tiny configs a little more helpful.
      
      Acked-by: Theodore Ts'o <tytso@mit.edu> # for ext4
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ea3edd4d
  24. 21 Mar, 2020 1 commit
  25. 12 Mar, 2020 1 commit
    • M
      int128: fix __uint128_t compiler test in Kconfig · 3a7c7331
      Masahiro Yamada committed
      The support for __uint128_t is dependent on the target bit size.
      
      A GCC that defaults to 32-bit can still build the 64-bit kernel
      when the -m64 flag is passed.
      
      However, $(cc-option,-D__SIZEOF_INT128__=0) is evaluated against the
      default machine bit size, which may not match the kernel being built.
      
      Theoretically, this could be evaluated separately for 64BIT/32BIT.
      
        config CC_HAS_INT128
                bool
                default !$(cc-option,$(m64-flag) -D__SIZEOF_INT128__=0) if 64BIT
                default !$(cc-option,$(m32-flag) -D__SIZEOF_INT128__=0)
      
      I simplified it more because the 32-bit compiler is unlikely to support
      __uint128_t.
      
      Fixes: c12d3362 ("int128: move __uint128_t compiler test to Kconfig")
      Reported-by: George Spelvin <lkml@sdf.org>
      Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
      Tested-by: George Spelvin <lkml@sdf.org>
      3a7c7331
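On the C side, the compiler advertises 128-bit integer support for the current target by predefining `__SIZEOF_INT128__`, which is why the result depends on flags such as -m64. A sketch of code that keys off that macro follows; the function name is hypothetical and the fallback is only one possible way to avoid `__uint128_t`.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compute (a * mul) >> shift, truncated to 64 bits, without losing the
 * high bits of the intermediate product. Only shift values up to 32
 * are supported so that both variants behave identically.
 */
uint64_t mul_u64_u32_shr_sketch(uint64_t a, uint32_t mul, unsigned int shift)
{
	assert(shift <= 32);

#ifdef __SIZEOF_INT128__
	/* The target supports 128-bit integers: use them directly. */
	return (uint64_t)(((__uint128_t)a * mul) >> shift);
#else
	/* Fallback: split 'a' into 32-bit halves and combine the products. */
	uint32_t al = (uint32_t)a;
	uint32_t ah = (uint32_t)(a >> 32);
	uint64_t ret = ((uint64_t)al * mul) >> shift;

	if (ah)
		ret += ((uint64_t)ah * mul) << (32 - shift);
	return ret;
#endif
}
```

Compiling the same file with a 32-bit and a 64-bit target selects different branches, which mirrors why the Kconfig test has to be evaluated for the bit size of the kernel being built rather than for the compiler's default.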
  26. 06 Mar, 2020 1 commit