1. 25 9月, 2020 1 次提交
  2. 08 9月, 2020 1 次提交
  3. 29 8月, 2020 1 次提交
    • A
      bpf: Introduce sleepable BPF programs · 1e6c62a8
      Alexei Starovoitov 提交于
      Introduce sleepable BPF programs that can request such property for themselves
      via BPF_F_SLEEPABLE flag at program load time. In such case they will be able
      to use helpers like bpf_copy_from_user() that might sleep. At present only
      fentry/fexit/fmod_ret and lsm programs can request to be sleepable and only
      when they are attached to kernel functions that are known to allow sleeping.
      
      The non-sleepable programs are relying on implicit rcu_read_lock() and
      migrate_disable() to protect life time of programs, maps that they use and
      per-cpu kernel structures used to pass info between bpf programs and the
      kernel. The sleepable programs cannot be enclosed into rcu_read_lock().
      migrate_disable() maps to preempt_disable() in non-RT kernels, so the progs
      should not be enclosed in migrate_disable() as well. Therefore
      rcu_read_lock_trace is used to protect the life time of sleepable progs.
      
      There are many networking and tracing program types. In many cases the
      'struct bpf_prog *' pointer itself is rcu protected within some other kernel
      data structure and the kernel code is using rcu_dereference() to load that
      program pointer and call BPF_PROG_RUN() on it. All these cases are not touched.
      Instead sleepable bpf programs are allowed with bpf trampoline only. The
      program pointers are hard-coded into generated assembly of bpf trampoline and
      synchronize_rcu_tasks_trace() is used to protect the life time of the program.
      The same trampoline can hold both sleepable and non-sleepable progs.
      
      When rcu_read_lock_trace is held it means that some sleepable bpf program is
      running from bpf trampoline. Those programs can use bpf arrays and preallocated
      hash/lru maps. These map types are waiting on programs to complete via
      synchronize_rcu_tasks_trace();
      
      Updates to trampoline now has to do synchronize_rcu_tasks_trace() and
      synchronize_rcu_tasks() to wait for sleepable progs to finish and for
      trampoline assembly to finish.
      
      This is the first step of introducing sleepable progs. Eventually dynamically
      allocated hash maps can be allowed and networking program types can become
      sleepable too.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Acked-by: NKP Singh <kpsingh@google.com>
      Link: https://lore.kernel.org/bpf/20200827220114.69225-3-alexei.starovoitov@gmail.com
      1e6c62a8
  4. 20 8月, 2020 1 次提交
    • A
      bpf: Add kernel module with user mode driver that populates bpffs. · d71fa5c9
      Alexei Starovoitov 提交于
      Add kernel module with user mode driver that populates bpffs with
      BPF iterators.
      
      $ mount bpffs /my/bpffs/ -t bpf
      $ ls -la /my/bpffs/
      total 4
      drwxrwxrwt  2 root root    0 Jul  2 00:27 .
      drwxr-xr-x 19 root root 4096 Jul  2 00:09 ..
      -rw-------  1 root root    0 Jul  2 00:27 maps.debug
      -rw-------  1 root root    0 Jul  2 00:27 progs.debug
      
      The user mode driver will load BPF Type Formats, create BPF maps, populate BPF
      maps, load two BPF programs, attach them to BPF iterators, and finally send two
      bpf_link IDs back to the kernel.
      The kernel will pin two bpf_links into newly mounted bpffs instance under
      names "progs.debug" and "maps.debug". These two files become human readable.
      
      $ cat /my/bpffs/progs.debug
        id name            attached
        11 dump_bpf_map    bpf_iter_bpf_map
        12 dump_bpf_prog   bpf_iter_bpf_prog
        27 test_pkt_access
        32 test_main       test_pkt_access test_pkt_access
        33 test_subprog1   test_pkt_access_subprog1 test_pkt_access
        34 test_subprog2   test_pkt_access_subprog2 test_pkt_access
        35 test_subprog3   test_pkt_access_subprog3 test_pkt_access
        36 new_get_skb_len get_skb_len test_pkt_access
        37 new_get_skb_ifindex get_skb_ifindex test_pkt_access
        38 new_get_constant get_constant test_pkt_access
      
      The BPF program dump_bpf_prog() in iterators.bpf.c is printing this data about
      all BPF programs currently loaded in the system. This information is unstable
      and will change from kernel to kernel as ".debug" suffix conveys.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20200819042759.51280-4-alexei.starovoitov@gmail.com
      d71fa5c9
  5. 08 8月, 2020 1 次提交
  6. 31 7月, 2020 1 次提交
  7. 29 7月, 2020 1 次提交
  8. 22 7月, 2020 1 次提交
  9. 01 7月, 2020 1 次提交
  10. 27 6月, 2020 1 次提交
  11. 14 6月, 2020 1 次提交
    • M
      treewide: replace '---help---' in Kconfig files with 'help' · a7f7f624
      Masahiro Yamada 提交于
      Since commit 84af7a61 ("checkpatch: kconfig: prefer 'help' over
      '---help---'"), the number of '---help---' has been gradually
      decreasing, but there are still more than 2400 instances.
      
      This commit finishes the conversion. While I touched the lines,
      I also fixed the indentation.
      
      There are a variety of indentation styles found.
      
        a) 4 spaces + '---help---'
        b) 7 spaces + '---help---'
        c) 8 spaces + '---help---'
        d) 1 space + 1 tab + '---help---'
        e) 1 tab + '---help---'    (correct indentation)
        f) 1 tab + 1 space + '---help---'
        g) 1 tab + 2 spaces + '---help---'
      
      In order to convert all of them to 1 tab + 'help', I ran the
      following commend:
      
        $ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/'
      Signed-off-by: NMasahiro Yamada <masahiroy@kernel.org>
      a7f7f624
  12. 05 6月, 2020 2 次提交
    • N
      Kconfig: add config option for asm goto w/ outputs · 587f1701
      Nick Desaulniers 提交于
      This allows C code to make use of compilers with support for output
      variables along the fallthrough path via preprocessor define:
      
        CONFIG_CC_HAS_ASM_GOTO_OUTPUT
      
      [ This is not used anywhere yet, and currently released compilers don't
        support this yet, but it's coming, and I have some local experimental
        patches to take advantage of it when it does   - Linus ]
      Signed-off-by: NNick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      587f1701
    • C
      init: allow distribution configuration of default init · ada4ab7a
      Chris Down 提交于
      Some init systems (eg.  systemd) have init at their own paths, for
      example, /usr/lib/systemd/systemd.  A compatibility symlink to one of the
      hardcoded init paths is provided by another package, usually named
      something like systemd-sysvcompat or similar.
      
      Currently distro maintainers who are hands-off on the bootloader are more
      or less required to include those compatibility links as part of their
      base distribution, because it's hard to migrate away from them since
      there's a risk some users will not get the message to set init= on the
      kernel command line appropriately.
      
      Moreover, for distributions where the init system is something the
      distribution itself is opinionated about (eg.  Arch, which has systemd in
      the required `base` package), we could usually reasonably configure this
      ahead of time when building the distribution kernel.  However, we
      currently simply don't have any way to configure the kernel to do this.
      Here's an example discussion where removing sysvcompat was discussed by
      distro maintainers[0].
      
      This patch adds a new Kconfig tunable, CONFIG_DEFAULT_INIT, which if set
      is tried before the hardcoded fallback list.  So the order of precedence
      is now thus:
      
      1. init= on command line (on failure: panic)
      2. CONFIG_DEFAULT_INIT (on failure: try #3)
      3. Hardcoded fallback list (on failure: panic)
      
      This new config parameter will allow distribution maintainers to move away
      from these compatibility links safely, without having to worry that their
      users might not have the right init=.
      
      There are also two other benefits of this over having the distribution
      maintain a symlink:
      
      1. One of the value propositions over simply having distributions
         maintain a /sbin/init symlink via a package is that it also frees
         distributions which have a preferred default, but not mandatory, init
         system from having their package manager fight with their users for
         control of /{s,}bin/init.  Instead, the distribution simply makes
         their preference known in CONFIG_DEFAULT_INIT, and if the user
         installs another init system and uninstalls the default one they can
         still make use of /{s,}bin/init and friends for their own uses. This
         makes more cases Just Work(tm) without the user having to perform
         extra configuration via init=.
      
      2. Since before this we don't know which path the distribution actually
         _intends_ to serve init from, we don't pr_err if it is simply
         missing, and usually will just silently put the user in a /bin/sh
         shell. Now that the distribution can make a declaration of intent, we
         can be more vocal when this init system fails to launch for any
         reason, even if it's simply because no file exists at that location,
         speeding up the palaver of init/mount dependency/etc debugging a bit.
      
      [0]: https://lists.archlinux.org/pipermail/arch-dev-public/2019-January/029435.htmlSigned-off-by: NChris Down <chris@chrisdown.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Link: http://lkml.kernel.org/r/20200522160234.GA1487022@chrisdown.nameSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ada4ab7a
  13. 04 6月, 2020 1 次提交
    • J
      mm: memcontrol: make swap tracking an integral part of memory control · 2d1c4980
      Johannes Weiner 提交于
      Without swap page tracking, users that are otherwise memory controlled can
      easily escape their containment and allocate significant amounts of memory
      that they're not being charged for.  That's because swap does readahead,
      but without the cgroup records of who owned the page at swapout, readahead
      pages don't get charged until somebody actually faults them into their
      page table and we can identify an owner task.  This can be maliciously
      exploited with MADV_WILLNEED, which triggers arbitrary readahead
      allocations without charging the pages.
      
      Make swap swap page tracking an integral part of memcg and remove the
      Kconfig options.  In the first place, it was only made configurable to
      allow users to save some memory.  But the overhead of tracking cgroup
      ownership per swap page is minimal - 2 byte per page, or 512k per 1G of
      swap, or 0.04%.  Saving that at the expense of broken containment
      semantics is not something we should present as a coequal option.
      
      The swapaccount=0 boot option will continue to exist, and it will
      eliminate the page_counter overhead and hide the swap control files, but
      it won't disable swap slot ownership tracking.
      
      This patch makes sure we always have the cgroup records at swapin time;
      the next patch will fix the actual bug by charging readahead swap pages at
      swapin time rather than at fault time.
      
      v2: fix double swap charge bug in cgroup1/cgroup2 code gating
      
      [hannes@cmpxchg.org: fix crash with cgroup_disable=memory]
        Link: http://lkml.kernel.org/r/20200521215855.GB815153@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
      Link: http://lkml.kernel.org/r/20200508183105.225460-16-hannes@cmpxchg.orgDebugged-by: NHugh Dickins <hughd@google.com>
      Debugged-by: NMichal Hocko <mhocko@kernel.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2d1c4980
  14. 19 5月, 2020 1 次提交
    • D
      pipe: Add general notification queue support · c73be61c
      David Howells 提交于
      Make it possible to have a general notification queue built on top of a
      standard pipe.  Notifications are 'spliced' into the pipe and then read
      out.  splice(), vmsplice() and sendfile() are forbidden on pipes used for
      notifications as post_one_notification() cannot take pipe->mutex.  This
      means that notifications could be posted in between individual pipe
      buffers, making iov_iter_revert() difficult to effect.
      
      The way the notification queue is used is:
      
       (1) An application opens a pipe with a special flag and indicates the
           number of messages it wishes to be able to queue at once (this can
           only be set once):
      
      	pipe2(fds, O_NOTIFICATION_PIPE);
      	ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
      
       (2) The application then uses poll() and read() as normal to extract data
           from the pipe.  read() will return multiple notifications if the
           buffer is big enough, but it will not split a notification across
           buffers - rather it will return a short read or EMSGSIZE.
      
           Notification messages include a length in the header so that the
           caller can split them up.
      
      Each message has a header that describes it:
      
      	struct watch_notification {
      		__u32	type:24;
      		__u32	subtype:8;
      		__u32	info;
      	};
      
      The type indicates the source (eg. mount tree changes, superblock events,
      keyring changes, block layer events) and the subtype indicates the event
      type (eg. mount, unmount; EIO, EDQUOT; link, unlink).  The info field
      indicates a number of things, including the entry length, an ID assigned to
      a watchpoint contributing to this buffer and type-specific flags.
      
      Supplementary data, such as the key ID that generated an event, can be
      attached in additional slots.  The maximum message size is 127 bytes.
      Messages may not be padded or aligned, so there is no guarantee, for
      example, that the notification type will be on a 4-byte bounary.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      c73be61c
  15. 17 5月, 2020 2 次提交
    • M
      bpfilter: check if $(CC) can link static libc in Kconfig · b1183b6d
      Masahiro Yamada 提交于
      On Fedora, linking static glibc requires the glibc-static RPM package,
      which is not part of the glibc-devel package.
      
      CONFIG_CC_CAN_LINK does not check the capability of static linking,
      so you can enable CONFIG_BPFILTER_UMH, then fail to build:
      
          HOSTLD  net/bpfilter/bpfilter_umh
        /usr/bin/ld: cannot find -lc
        collect2: error: ld returned 1 exit status
      
      Add CONFIG_CC_CAN_LINK_STATIC, and make CONFIG_BPFILTER_UMH depend
      on it.
      Reported-by: NValdis Kletnieks <valdis.kletnieks@vt.edu>
      Signed-off-by: NMasahiro Yamada <masahiroy@kernel.org>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      b1183b6d
    • M
      bpfilter: match bit size of bpfilter_umh to that of the kernel · 9371f86e
      Masahiro Yamada 提交于
      bpfilter_umh is built for the default machine bit of the compiler,
      which may not match to the bit size of the kernel.
      
      This happens in the scenario below:
      
      You can use biarch GCC that defaults to 64-bit for building the 32-bit
      kernel. In this case, Kbuild passes -m32 to teach the compiler to
      produce 32-bit kernel space objects. However, it is missing when
      building bpfilter_umh. It is built as a 64-bit ELF, and then embedded
      into the 32-bit kernel.
      
      The 32-bit kernel and 64-bit umh is a bad combination.
      
      In theory, we can have 32-bit umh running on 64-bit kernel, but we do
      not have a good reason to support such a usecase.
      
      The best is to match the bit size between them.
      
      Pass -m32 or -m64 to the umh build command if it is found in
      $(KBUILD_CFLAGS). Evaluate CC_CAN_LINK against the kernel bit-size.
      Signed-off-by: NMasahiro Yamada <masahiroy@kernel.org>
      9371f86e
  16. 15 5月, 2020 1 次提交
  17. 12 5月, 2020 3 次提交
  18. 10 5月, 2020 1 次提交
    • L
      Stop the ad-hoc games with -Wno-maybe-initialized · 78a5255f
      Linus Torvalds 提交于
      We have some rather random rules about when we accept the
      "maybe-initialized" warnings, and when we don't.
      
      For example, we consider it unreliable for gcc versions < 4.9, but also
      if -O3 is enabled, or if optimizing for size.  And then various kernel
      config options disabled it, because they know that they trigger that
      warning by confusing gcc sufficiently (ie PROFILE_ALL_BRANCHES).
      
      And now gcc-10 seems to be introducing a lot of those warnings too, so
      it falls under the same heading as 4.9 did.
      
      At the same time, we have a very straightforward way to _enable_ that
      warning when wanted: use "W=2" to enable more warnings.
      
      So stop playing these ad-hoc games, and just disable that warning by
      default, with the known and straight-forward "if you want to work on the
      extra compiler warnings, use W=123".
      
      Would it be great to have code that is always so obvious that it never
      confuses the compiler whether a variable is used initialized or not?
      Yes, it would.  In a perfect world, the compilers would be smarter, and
      our source code would be simpler.
      
      That's currently not the world we live in, though.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      78a5255f
  19. 16 4月, 2020 1 次提交
  20. 08 4月, 2020 2 次提交
  21. 01 4月, 2020 1 次提交
  22. 31 3月, 2020 1 次提交
  23. 30 3月, 2020 1 次提交
  24. 27 3月, 2020 1 次提交
  25. 12 3月, 2020 1 次提交
    • M
      int128: fix __uint128_t compiler test in Kconfig · 3a7c7331
      Masahiro Yamada 提交于
      The support for __uint128_t is dependent on the target bit size.
      
      GCC that defaults to the 32-bit can still build the 64-bit kernel
      with -m64 flag passed.
      
      However, $(cc-option,-D__SIZEOF_INT128__=0) is evaluated against the
      default machine bit, which may not match to the kernel it is building.
      
      Theoretically, this could be evaluated separately for 64BIT/32BIT.
      
        config CC_HAS_INT128
                bool
                default !$(cc-option,$(m64-flag) -D__SIZEOF_INT128__=0) if 64BIT
                default !$(cc-option,$(m32-flag) -D__SIZEOF_INT128__=0)
      
      I simplified it more because the 32-bit compiler is unlikely to support
      __uint128_t.
      
      Fixes: c12d3362 ("int128: move __uint128_t compiler test to Kconfig")
      Reported-by: NGeorge Spelvin <lkml@sdf.org>
      Signed-off-by: NMasahiro Yamada <masahiroy@kernel.org>
      Tested-by: NGeorge Spelvin <lkml@sdf.org>
      3a7c7331
  26. 06 3月, 2020 1 次提交
  27. 03 3月, 2020 2 次提交
    • Q
      kbuild: allow symbol whitelisting with TRIM_UNUSED_KSYMS · 1518c633
      Quentin Perret 提交于
      CONFIG_TRIM_UNUSED_KSYMS currently removes all unused exported symbols
      from ksymtab. This works really well when using in-tree drivers, but
      cannot be used in its current form if some of them are out-of-tree.
      
      Indeed, even if the list of symbols required by out-of-tree drivers is
      known at compile time, the only solution today to guarantee these don't
      get trimmed is to set CONFIG_TRIM_UNUSED_KSYMS=n. This not only wastes
      space, but also makes it difficult to control the ABI usable by vendor
      modules in distribution kernels such as Android. Being able to control
      the kernel ABI surface is particularly useful to ship a unique Generic
      Kernel Image (GKI) for all vendors, which is a first step in the
      direction of getting all vendors to contribute their code upstream.
      
      As such, attempt to improve the situation by enabling users to specify a
      symbol 'whitelist' at compile time. Any symbol specified in this
      whitelist will be kept exported when CONFIG_TRIM_UNUSED_KSYMS is set,
      even if it has no in-tree user. The whitelist is defined as a simple
      text file, listing symbols, one per line.
      Acked-by: NJessica Yu <jeyu@kernel.org>
      Acked-by: NNicolas Pitre <nico@fluxnic.net>
      Tested-by: NMatthias Maennich <maennich@google.com>
      Reviewed-by: NMatthias Maennich <maennich@google.com>
      Signed-off-by: NQuentin Perret <qperret@google.com>
      Signed-off-by: NMasahiro Yamada <masahiroy@kernel.org>
      1518c633
    • M
      kbuild: use KBUILD_DEFCONFIG as the fallback for DEFCONFIG_LIST · 2a86f661
      Masahiro Yamada 提交于
      Most of the Kconfig commands (except defconfig and all*config) read
      the .config file as a base set of CONFIG options.
      
      When it does not exist, the files in DEFCONFIG_LIST are searched in
      this order and loaded if found.
      
      I do not see much sense in the last two lines in DEFCONFIG_LIST.
      
      [1] ARCH_DEFCONFIG
      
      The entry for DEFCONFIG_LIST is guarded by 'depends on !UML'. So, the
      ARCH_DEFCONFIG definition in arch/x86/um/Kconfig is meaningless.
      
      arch/{sh,sparc,x86}/Kconfig define ARCH_DEFCONFIG depending on 32 or
      64 bit variant symbols. This is a little bit strange; ARCH_DEFCONFIG
      should be a fixed string because the base config file is loaded before
      the symbol evaluation stage.
      
      Using KBUILD_DEFCONFIG makes more sense because it is fixed before
      Kconfig is invoked. Fortunately, arch/{sh,sparc,x86}/Makefile define it
      in the same way, and it works as expected. Hence, replace ARCH_DEFCONFIG
      with "arch/$(SRCARCH)/configs/$(KBUILD_DEFCONFIG)".
      
      [2] arch/$(ARCH)/defconfig
      
      This file path is no longer valid. The defconfig files are always located
      in the arch configs/ directories.
      
        $ find arch -name defconfig | sort
        arch/alpha/configs/defconfig
        arch/arm64/configs/defconfig
        arch/csky/configs/defconfig
        arch/nds32/configs/defconfig
        arch/riscv/configs/defconfig
        arch/s390/configs/defconfig
        arch/unicore32/configs/defconfig
      
      The path arch/*/configs/defconfig is already covered by
      "arch/$(SRCARCH)/configs/$(KBUILD_DEFCONFIG)". So, this file path is
      not necessary.
      
      I moved the default KBUILD_DEFCONFIG to the top Makefile. Otherwise,
      the 7 architectures listed above would end up with endless loop of
      syncconfig.
      Signed-off-by: NMasahiro Yamada <masahiroy@kernel.org>
      2a86f661
  28. 26 2月, 2020 1 次提交
  29. 21 2月, 2020 2 次提交
  30. 11 2月, 2020 1 次提交
  31. 22 1月, 2020 1 次提交
  32. 20 1月, 2020 1 次提交
    • J
      Revert "um: Enable CONFIG_CONSTRUCTORS" · 87c9366e
      Johannes Berg 提交于
      This reverts commit 786b2384 ("um: Enable CONFIG_CONSTRUCTORS").
      
      There are two issues with this commit, uncovered by Anton in tests
      on some (Debian) systems:
      
      1) I completely forgot to call any constructors if CONFIG_CONSTRUCTORS
         isn't set. Don't recall now if it just wasn't needed on my system, or
         if I never tested this case.
      
      2) With that fixed, it works - with CONFIG_CONSTRUCTORS *unset*. If I
         set CONFIG_CONSTRUCTORS, it fails again, which isn't totally
         unexpected since whatever wanted to run is likely to have to run
         before the kernel init etc. that calls the constructors in this case.
      
      Basically, some constructors that gcc emits (libc has?) need to run
      very early during init; the failure mode otherwise was that the ptrace
      fork test already failed:
      
      ----------------------
      $ ./linux mem=512M
      Core dump limits :
      	soft - 0
      	hard - NONE
      Checking that ptrace can change system call numbers...check_ptrace : child exited with exitcode 6, while expecting 0; status 0x67f
      Aborted
      ----------------------
      
      Thinking more about this, it's clear that we simply cannot support
      CONFIG_CONSTRUCTORS in UML. All the cases we need now (gcov, kasan)
      involve not use of the __attribute__((constructor)), but instead
      some constructor code/entry generated by gcc. Therefore, we cannot
      distinguish between kernel constructors and system constructors.
      
      Thus, revert this commit.
      
      Cc: stable@vger.kernel.org [5.4+]
      Fixes: 786b2384 ("um: Enable CONFIG_CONSTRUCTORS")
      Reported-by: NAnton Ivanov <anton.ivanov@cambridgegreys.com>
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Acked-by: NAnton Ivanov <anton.ivanov@cambridgegreys.co.uk>
      Signed-off-by: NRichard Weinberger <richard@nod.at>
      87c9366e
  33. 14 1月, 2020 1 次提交
    • T
      lib/vdso: Prepare for time namespace support · 660fd04f
      Thomas Gleixner 提交于
      To support time namespaces in the vdso with a minimal impact on regular non
      time namespace affected tasks, the namespace handling needs to be hidden in
      a slow path.
      
      The most obvious place is vdso_seq_begin(). If a task belongs to a time
      namespace then the VVAR page which contains the system wide vdso data is
      replaced with a namespace specific page which has the same layout as the
      VVAR page. That page has vdso_data->seq set to 1 to enforce the slow path
      and vdso_data->clock_mode set to VCLOCK_TIMENS to enforce the time
      namespace handling path.
      
      The extra check in the case that vdso_data->seq is odd, e.g. a concurrent
      update of the vdso data is in progress, is not really affecting regular
      tasks which are not part of a time namespace as the task is spin waiting
      for the update to finish and vdso_data->seq to become even again.
      
      If a time namespace task hits that code path, it invokes the corresponding
      time getter function which retrieves the real VVAR page, reads host time
      and then adds the offset for the requested clock which is stored in the
      special VVAR page.
      
      If VDSO time namespace support is disabled the whole magic is compiled out.
      
      Initial testing shows that the disabled case is almost identical to the
      host case which does not take the slow timens path. With the special timens
      page installed the performance hit is constant time and in the range of
      5-7%.
      
      For the vdso functions which are not using the sequence count an
      unconditional check for vdso_data->clock_mode is added which switches to
      the real vdso when the clock_mode is VCLOCK_TIMENS.
      
      [avagin: Make do_hres_timens() work with raw clocks too: choose vdso_data
       pointer by CS_RAW offset.]
      Suggested-by: NAndy Lutomirski <luto@kernel.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrei Vagin <avagin@gmail.com>
      Signed-off-by: NDmitry Safonov <dima@arista.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20191112012724.250792-21-dima@arista.com
      
      660fd04f