1. 24 1月, 2021 1 次提交
    • C
      fs: add mount_setattr() · 2a186721
      Christian Brauner 提交于
      This implements the missing mount_setattr() syscall. While the new mount
      api allows to change the properties of a superblock there is currently
      no way to change the properties of a mount or a mount tree using file
      descriptors which the new mount api is based on. In addition the old
      mount api has the restriction that mount options cannot be applied
      recursively. This hasn't changed since changing mount options on a
      per-mount basis was implemented in [1] and has been a frequent request
      not just for convenience but also for security reasons. The legacy
      mount syscall is unable to accommodate this behavior without introducing
      a whole new set of flags because MS_REC | MS_REMOUNT | MS_BIND |
      MS_RDONLY | MS_NOEXEC | [...] only apply the mount option to the topmost
      mount. Changing MS_REC to apply to the whole mount tree would mean
      introducing a significant uapi change and would likely cause significant
      regressions.
      
      The new mount_setattr() syscall allows to recursively clear and set
      mount options in one shot. Multiple calls to change mount options
      requesting the same changes are idempotent:
      
      int mount_setattr(int dfd, const char *path, unsigned flags,
                        struct mount_attr *uattr, size_t usize);
      
      Flags to modify path resolution behavior are specified in the @flags
      argument. Currently, AT_EMPTY_PATH, AT_RECURSIVE, AT_SYMLINK_NOFOLLOW,
      and AT_NO_AUTOMOUNT are supported. If useful, additional lookup flags to
      restrict path resolution as introduced with openat2() might be supported
      in the future.
      
      The mount_setattr() syscall can be expected to grow over time and is
      designed with extensibility in mind. It follows the extensible syscall
      pattern we have used with other syscalls such as openat2(), clone3(),
      sched_{set,get}attr(), and others.
      The set of mount options is passed in the uapi struct mount_attr which
      currently has the following layout:
      
      struct mount_attr {
      	__u64 attr_set;
      	__u64 attr_clr;
      	__u64 propagation;
      	__u64 userns_fd;
      };
      
      The @attr_set and @attr_clr members are used to clear and set mount
      options. This way a user can e.g. request that a set of flags is to be
      raised such as turning mounts readonly by raising MOUNT_ATTR_RDONLY in
      @attr_set while at the same time requesting that another set of flags is
      to be lowered such as removing noexec from a mount tree by specifying
      MOUNT_ATTR_NOEXEC in @attr_clr.
      
      Note, since the MOUNT_ATTR_<atime> values are an enum starting from 0,
      not a bitmap, users wanting to transition to a different atime setting
      cannot simply specify the atime setting in @attr_set, but must also
      specify MOUNT_ATTR__ATIME in the @attr_clr field. So we ensure that
      MOUNT_ATTR__ATIME can't be partially set in @attr_clr and that @attr_set
      can't have any atime bits set if MOUNT_ATTR__ATIME isn't set in
      @attr_clr.
      
      The @propagation field lets callers specify the propagation type of a
      mount tree. Propagation is a single property that has four different
      settings and as such is not really a flag argument but an enum.
      Specifically, it would be unclear what setting and clearing propagation
      settings in combination would amount to. The legacy mount() syscall thus
      forbids the combination of multiple propagation settings too. The goal
      is to keep the semantics of mount propagation somewhat simple as they
      are overly complex as it is.
      
      The @userns_fd field lets user specify a user namespace whose idmapping
      becomes the idmapping of the mount. This is implemented and explained in
      detail in the next patch.
      
      [1]: commit 2e4b7fcd ("[PATCH] r/o bind mounts: honor mount writer counts at remount")
      
      Link: https://lore.kernel.org/r/20210121131959.646623-35-christian.brauner@ubuntu.com
      Cc: David Howells <dhowells@redhat.com>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-api@vger.kernel.org
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
      2a186721
  2. 20 12月, 2020 1 次提交
  3. 02 12月, 2020 1 次提交
  4. 01 12月, 2020 2 次提交
    • B
      net: Add SO_BUSY_POLL_BUDGET socket option · 7c951caf
      Björn Töpel 提交于
      This option lets a user set a per socket NAPI budget for
      busy-polling. If the options is not set, it will use the default of 8.
      Signed-off-by: NBjörn Töpel <bjorn.topel@intel.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: NJakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/bpf/20201130185205.196029-3-bjorn.topel@gmail.com
      7c951caf
    • B
      net: Introduce preferred busy-polling · 7fd3253a
      Björn Töpel 提交于
      The existing busy-polling mode, enabled by the SO_BUSY_POLL socket
      option or system-wide using the /proc/sys/net/core/busy_read knob, is
      an opportunistic. That means that if the NAPI context is not
      scheduled, it will poll it. If, after busy-polling, the budget is
      exceeded the busy-polling logic will schedule the NAPI onto the
      regular softirq handling.
      
      One implication of the behavior above is that a busy/heavy loaded NAPI
      context will never enter/allow for busy-polling. Some applications
      prefer that most NAPI processing would be done by busy-polling.
      
      This series adds a new socket option, SO_PREFER_BUSY_POLL, that works
      in concert with the napi_defer_hard_irqs and gro_flush_timeout
      knobs. The napi_defer_hard_irqs and gro_flush_timeout knobs were
      introduced in commit 6f8b12d6 ("net: napi: add hard irqs deferral
      feature"), and allows for a user to defer interrupts to be enabled and
      instead schedule the NAPI context from a watchdog timer. When a user
      enables the SO_PREFER_BUSY_POLL, again with the other knobs enabled,
      and the NAPI context is being processed by a softirq, the softirq NAPI
      processing will exit early to allow the busy-polling to be performed.
      
      If the application stops performing busy-polling via a system call,
      the watchdog timer defined by gro_flush_timeout will timeout, and
      regular softirq handling will resume.
      
      In summary; Heavy traffic applications that prefer busy-polling over
      softirq processing should use this option.
      
      Example usage:
      
        $ echo 2 | sudo tee /sys/class/net/ens785f1/napi_defer_hard_irqs
        $ echo 200000 | sudo tee /sys/class/net/ens785f1/gro_flush_timeout
      
      Note that the timeout should be larger than the userspace processing
      window, otherwise the watchdog will timeout and fall back to regular
      softirq processing.
      
      Enable the SO_BUSY_POLL/SO_PREFER_BUSY_POLL options on your socket.
      Signed-off-by: NBjörn Töpel <bjorn.topel@intel.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: NJakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/bpf/20201130185205.196029-2-bjorn.topel@gmail.com
      7fd3253a
  5. 24 11月, 2020 4 次提交
  6. 13 11月, 2020 1 次提交
  7. 26 10月, 2020 1 次提交
  8. 19 10月, 2020 1 次提交
    • M
      mm/madvise: introduce process_madvise() syscall: an external memory hinting API · ecb8ac8b
      Minchan Kim 提交于
      There is usecase that System Management Software(SMS) want to give a
      memory hint like MADV_[COLD|PAGEEOUT] to other processes and in the
      case of Android, it is the ActivityManagerService.
      
      The information required to make the reclaim decision is not known to the
      app.  Instead, it is known to the centralized userspace
      daemon(ActivityManagerService), and that daemon must be able to initiate
      reclaim on its own without any app involvement.
      
      To solve the issue, this patch introduces a new syscall
      process_madvise(2).  It uses pidfd of an external process to give the
      hint.  It also supports vector address range because Android app has
      thousands of vmas due to zygote so it's totally waste of CPU and power if
      we should call the syscall one by one for each vma.(With testing 2000-vma
      syscall vs 1-vector syscall, it showed 15% performance improvement.  I
      think it would be bigger in real practice because the testing ran very
      cache friendly environment).
      
      Another potential use case for the vector range is to amortize the cost
      ofTLB shootdowns for multiple ranges when using MADV_DONTNEED; this could
      benefit users like TCP receive zerocopy and malloc implementations.  In
      future, we could find more usecases for other advises so let's make it
      happens as API since we introduce a new syscall at this moment.  With
      that, existing madvise(2) user could replace it with process_madvise(2)
      with their own pid if they want to have batch address ranges support
      feature.
      
      ince it could affect other process's address range, only privileged
      process(PTRACE_MODE_ATTACH_FSCREDS) or something else(e.g., being the same
      UID) gives it the right to ptrace the process could use it successfully.
      The flag argument is reserved for future use if we need to extend the API.
      
      I think supporting all hints madvise has/will supported/support to
      process_madvise is rather risky.  Because we are not sure all hints make
      sense from external process and implementation for the hint may rely on
      the caller being in the current context so it could be error-prone.  Thus,
      I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
      
      If someone want to add other hints, we could hear the usecase and review
      it for each hint.  It's safer for maintenance rather than introducing a
      buggy syscall but hard to fix it later.
      
      So finally, the API is as follows,
      
            ssize_t process_madvise(int pidfd, const struct iovec *iovec,
                      unsigned long vlen, int advice, unsigned int flags);
      
          DESCRIPTION
            The process_madvise() system call is used to give advice or directions
            to the kernel about the address ranges from external process as well as
            local process. It provides the advice to address ranges of process
            described by iovec and vlen. The goal of such advice is to improve
            system or application performance.
      
            The pidfd selects the process referred to by the PID file descriptor
            specified in pidfd. (See pidofd_open(2) for further information)
      
            The pointer iovec points to an array of iovec structures, defined in
            <sys/uio.h> as:
      
              struct iovec {
                  void *iov_base;         /* starting address */
                  size_t iov_len;         /* number of bytes to be advised */
              };
      
            The iovec describes address ranges beginning at address(iov_base)
            and with size length of bytes(iov_len).
      
            The vlen represents the number of elements in iovec.
      
            The advice is indicated in the advice argument, which is one of the
            following at this moment if the target process specified by pidfd is
            external.
      
              MADV_COLD
              MADV_PAGEOUT
      
            Permission to provide a hint to external process is governed by a
            ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).
      
            The process_madvise supports every advice madvise(2) has if target
            process is in same thread group with calling process so user could
            use process_madvise(2) to extend existing madvise(2) to support
            vector address ranges.
      
          RETURN VALUE
            On success, process_madvise() returns the number of bytes advised.
            This return value may be less than the total number of requested
            bytes, if an error occurred. The caller should check return value
            to determine whether a partial advice occurred.
      
      FAQ:
      
      Q.1 - Why does any external entity have better knowledge?
      
      Quote from Sandeep
      
      "For Android, every application (including the special SystemServer)
      are forked from Zygote.  The reason of course is to share as many
      libraries and classes between the two as possible to benefit from the
      preloading during boot.
      
      After applications start, (almost) all of the APIs end up calling into
      this SystemServer process over IPC (binder) and back to the
      application.
      
      In a fully running system, the SystemServer monitors every single
      process periodically to calculate their PSS / RSS and also decides
      which process is "important" to the user for interactivity.
      
      So, because of how these processes start _and_ the fact that the
      SystemServer is looping to monitor each process, it does tend to *know*
      which address range of the application is not used / useful.
      
      Besides, we can never rely on applications to clean things up
      themselves.  We've had the "hey app1, the system is low on memory,
      please trim your memory usage down" notifications for a long time[1].
      They rely on applications honoring the broadcasts and very few do.
      
      So, if we want to avoid the inevitable killing of the application and
      restarting it, some way to be able to tell the OS about unimportant
      memory in these applications will be useful.
      
      - ssp
      
      Q.2 - How to guarantee the race(i.e., object validation) between when
      giving a hint from an external process and get the hint from the target
      process?
      
      process_madvise operates on the target process's address space as it
      exists at the instant that process_madvise is called.  If the space
      target process can run between the time the process_madvise process
      inspects the target process address space and the time that
      process_madvise is actually called, process_madvise may operate on
      memory regions that the calling process does not expect.  It's the
      responsibility of the process calling process_madvise to close this
      race condition.  For example, the calling process can suspend the
      target process with ptrace, SIGSTOP, or the freezer cgroup so that it
      doesn't have an opportunity to change its own address space before
      process_madvise is called.  Another option is to operate on memory
      regions that the caller knows a priori will be unchanged in the target
      process.  Yet another option is to accept the race for certain
      process_madvise calls after reasoning that mistargeting will do no
      harm.  The suggested API itself does not provide synchronization.  It
      also apply other APIs like move_pages, process_vm_write.
      
      The race isn't really a problem though.  Why is it so wrong to require
      that callers do their own synchronization in some manner?  Nobody
      objects to write(2) merely because it's possible for two processes to
      open the same file and clobber each other's writes --- instead, we tell
      people to use flock or something.  Think about mmap.  It never
      guarantees newly allocated address space is still valid when the user
      tries to access it because other threads could unmap the memory right
      before.  That's where we need synchronization by using other API or
      design from userside.  It shouldn't be part of API itself.  If someone
      needs more fine-grained synchronization rather than process level,
      there were two ideas suggested - cookie[2] and anon-fd[3].  Both are
      applicable via using last reserved argument of the API but I don't
      think it's necessary right now since we have already ways to prevent
      the race so don't want to add additional complexity with more
      fine-grained optimization model.
      
      To make the API extend, it reserved an unsigned long as last argument
      so we could support it in future if someone really needs it.
      
      Q.3 - Why doesn't ptrace work?
      
      Injecting an madvise in the target process using ptrace would not work
      for us because such injected madvise would have to be executed by the
      target process, which means that process would have to be runnable and
      that creates the risk of the abovementioned race and hinting a wrong
      VMA.  Furthermore, we want to act the hint in caller's context, not the
      callee's, because the callee is usually limited in cpuset/cgroups or
      even freezed state so they can't act by themselves quick enough, which
      causes more thrashing/kill.  It doesn't work if the target process are
      ptraced(e.g., strace, debugger, minidump) because a process can have at
      most one ptracer.
      
      [1] https://developer.android.com/topic/performance/memory"
      
      [2] process_getinfo for getting the cookie which is updated whenever
          vma of process address layout are changed - Daniel Colascione -
          https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224
      
      [3] anonymous fd which is used for the object(i.e., address range)
          validation - Michal Hocko -
          https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/
      
      [minchan@kernel.org: fix process_madvise build break for arm64]
        Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com
      [minchan@kernel.org: fix build error for mips of process_madvise]
        Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com
      [akpm@linux-foundation.org: fix patch ordering issue]
      [akpm@linux-foundation.org: fix arm64 whoops]
      [minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian]
      [akpm@linux-foundation.org: fix i386 build]
      [sfr@canb.auug.org.au: fix syscall numbering]
        Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au
      [sfr@canb.auug.org.au: madvise.c needs compat.h]
        Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au
      [minchan@kernel.org: fix mips build]
        Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com
      [yuehaibing@huawei.com: remove duplicate header which is included twice]
        Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com
      [minchan@kernel.org: do not use helper functions for process_madvise]
        Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com
      [akpm@linux-foundation.org: pidfd_get_pid() gained an argument]
      [sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"]
        Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.auSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NYueHaibing <yuehaibing@huawei.com>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NSuren Baghdasaryan <surenb@google.com>
      Reviewed-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Dias <joaodias@google.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Sandeep Patil <sspatil@google.com>
      Cc: SeongJae Park <sj38.park@gmail.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Florian Weimer <fw@deneb.enyo.de>
      Cc: <linux-man@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com
      Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org
      Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ecb8ac8b
  9. 03 10月, 2020 3 次提交
  10. 23 9月, 2020 1 次提交
  11. 15 9月, 2020 1 次提交
  12. 04 9月, 2020 1 次提交
  13. 20 7月, 2020 1 次提交
    • C
      net: remove compat_sys_{get,set}sockopt · 55db9c0e
      Christoph Hellwig 提交于
      Now that the ->compat_{get,set}sockopt proto_ops methods are gone
      there is no good reason left to keep the compat syscalls separate.
      
      This fixes the odd use of unsigned int for the compat_setsockopt
      optlen and the missing sock_use_custom_sol_socket.
      
      It would also easily allow running the eBPF hooks for the compat
      syscalls, but such a large change in behavior does not belong into
      a consolidation patch like this one.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      55db9c0e
  14. 17 6月, 2020 1 次提交
    • C
      arch: wire-up close_range() · 9b4feb63
      Christian Brauner 提交于
      This wires up the close_range() syscall into all arches at once.
      Suggested-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Jann Horn <jannh@google.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Dmitry V. Levin <ldv@altlinux.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-alpha@vger.kernel.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-ia64@vger.kernel.org
      Cc: linux-m68k@lists.linux-m68k.org
      Cc: linux-mips@vger.kernel.org
      Cc: linux-parisc@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: linux-s390@vger.kernel.org
      Cc: linux-sh@vger.kernel.org
      Cc: sparclinux@vger.kernel.org
      Cc: linux-xtensa@linux-xtensa.org
      Cc: linux-arch@vger.kernel.org
      Cc: x86@kernel.org
      9b4feb63
  15. 14 5月, 2020 1 次提交
    • M
      vfs: add faccessat2 syscall · c8ffd8bc
      Miklos Szeredi 提交于
      POSIX defines faccessat() as having a fourth "flags" argument, while the
      linux syscall doesn't have it.  Glibc tries to emulate AT_EACCESS and
      AT_SYMLINK_NOFOLLOW, but AT_EACCESS emulation is broken.
      
      Add a new faccessat(2) syscall with the added flags argument and implement
      both flags.
      
      The value of AT_EACCESS is defined in glibc headers to be the same as
      AT_REMOVEDIR.  Use this value for the kernel interface as well, together
      with the explanatory comment.
      
      Also add AT_EMPTY_PATH support, which is not documented by POSIX, but can
      be useful and is trivial to implement.
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      c8ffd8bc
  16. 22 2月, 2020 1 次提交
  17. 18 1月, 2020 1 次提交
    • A
      open: introduce openat2(2) syscall · fddb5d43
      Aleksa Sarai 提交于
      /* Background. */
      For a very long time, extending openat(2) with new features has been
      incredibly frustrating. This stems from the fact that openat(2) is
      possibly the most famous counter-example to the mantra "don't silently
      accept garbage from userspace" -- it doesn't check whether unknown flags
      are present[1].
      
      This means that (generally) the addition of new flags to openat(2) has
      been fraught with backwards-compatibility issues (O_TMPFILE has to be
      defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
      kernels gave errors, since it's insecure to silently ignore the
      flag[2]). All new security-related flags therefore have a tough road to
      being added to openat(2).
      
      Userspace also has a hard time figuring out whether a particular flag is
      supported on a particular kernel. While it is now possible with
      contemporary kernels (thanks to [3]), older kernels will expose unknown
      flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during
      openat(2) time matches modern syscall designs and is far more
      fool-proof.
      
      In addition, the newly-added path resolution restriction LOOKUP flags
      (which we would like to expose to user-space) don't feel related to the
      pre-existing O_* flag set -- they affect all components of path lookup.
      We'd therefore like to add a new flag argument.
      
      Adding a new syscall allows us to finally fix the flag-ignoring problem,
      and we can make it extensible enough so that we will hopefully never
      need an openat3(2).
      
      /* Syscall Prototype. */
        /*
         * open_how is an extensible structure (similar in interface to
         * clone3(2) or sched_setattr(2)). The size parameter must be set to
         * sizeof(struct open_how), to allow for future extensions. All future
         * extensions will be appended to open_how, with their zero value
         * acting as a no-op default.
         */
        struct open_how { /* ... */ };
      
        int openat2(int dfd, const char *pathname,
                    struct open_how *how, size_t size);
      
      /* Description. */
      The initial version of 'struct open_how' contains the following fields:
      
        flags
          Used to specify openat(2)-style flags. However, any unknown flag
          bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR)
          will result in -EINVAL. In addition, this field is 64-bits wide to
          allow for more O_ flags than currently permitted with openat(2).
      
        mode
          The file mode for O_CREAT or O_TMPFILE.
      
          Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.
      
        resolve
          Restrict path resolution (in contrast to O_* flags they affect all
          path components). The current set of flags are as follows (at the
          moment, all of the RESOLVE_ flags are implemented as just passing
          the corresponding LOOKUP_ flag).
      
          RESOLVE_NO_XDEV       => LOOKUP_NO_XDEV
          RESOLVE_NO_SYMLINKS   => LOOKUP_NO_SYMLINKS
          RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
          RESOLVE_BENEATH       => LOOKUP_BENEATH
          RESOLVE_IN_ROOT       => LOOKUP_IN_ROOT
      
      open_how does not contain an embedded size field, because it is of
      little benefit (userspace can figure out the kernel open_how size at
      runtime fairly easily without it). It also only contains u64s (even
      though ->mode arguably should be a u16) to avoid having padding fields
      which are never used in the future.
      
      Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE
      is no longer permitted for openat(2). As far as I can tell, this has
      always been a bug and appears to not be used by userspace (and I've not
      seen any problems on my machines by disallowing it). If it turns out
      this breaks something, we can special-case it and only permit it for
      openat(2) but not openat2(2).
      
      After input from Florian Weimer, the new open_how and flag definitions
      are inside a separate header from uapi/linux/fcntl.h, to avoid problems
      that glibc has with importing that header.
      
      /* Testing. */
      In a follow-up patch there are over 200 selftests which ensure that this
      syscall has the correct semantics and will correctly handle several
      attack scenarios.
      
      In addition, I've written a userspace library[4] which provides
      convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary
      because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care
      must be taken when using RESOLVE_IN_ROOT'd file descriptors with other
      syscalls). During the development of this patch, I've run numerous
      verification tests using libpathrs (showing that the API is reasonably
      usable by userspace).
      
      /* Future Work. */
      Additional RESOLVE_ flags have been suggested during the review period.
      These can be easily implemented separately (such as blocking auto-mount
      during resolution).
      
      Furthermore, there are some other proposed changes to the openat(2)
      interface (the most obvious example is magic-link hardening[5]) which
      would be a good opportunity to add a way for userspace to restrict how
      O_PATH file descriptors can be re-opened.
      
      Another possible avenue of future work would be some kind of
      CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace
      which openat2(2) flags and fields are supported by the current kernel
      (to avoid userspace having to go through several guesses to figure it
      out).
      
      [1]: https://lwn.net/Articles/588444/
      [2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
      [3]: commit 629e014b ("fs: completely ignore unknown open flags")
      [4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
      [5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/
      [6]: https://youtu.be/ggD-eb3yPVsSuggested-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      fddb5d43
  18. 17 1月, 2020 1 次提交
  19. 14 1月, 2020 1 次提交
  20. 05 12月, 2019 3 次提交
    • M
      arch: sembuf.h: make uapi asm/sembuf.h self-contained · 0fb9dc28
      Masahiro Yamada 提交于
      Userspace cannot compile <asm/sembuf.h> due to some missing type
      definitions.  For example, building it for x86 fails as follows:
      
          CC      usr/include/asm/sembuf.h.s
        In file included from <command-line>:32:0:
        usr/include/asm/sembuf.h:17:20: error: field `sem_perm' has incomplete type
          struct ipc64_perm sem_perm; /* permissions .. see ipc.h */
                            ^~~~~~~~
        usr/include/asm/sembuf.h:24:2: error: unknown type name `__kernel_time_t'
          __kernel_time_t sem_otime; /* last semop time */
          ^~~~~~~~~~~~~~~
        usr/include/asm/sembuf.h:25:2: error: unknown type name `__kernel_ulong_t'
          __kernel_ulong_t __unused1;
          ^~~~~~~~~~~~~~~~
        usr/include/asm/sembuf.h:26:2: error: unknown type name `__kernel_time_t'
          __kernel_time_t sem_ctime; /* last change time */
          ^~~~~~~~~~~~~~~
        usr/include/asm/sembuf.h:27:2: error: unknown type name `__kernel_ulong_t'
          __kernel_ulong_t __unused2;
          ^~~~~~~~~~~~~~~~
        usr/include/asm/sembuf.h:29:2: error: unknown type name `__kernel_ulong_t'
          __kernel_ulong_t sem_nsems; /* no. of semaphores in array */
          ^~~~~~~~~~~~~~~~
        usr/include/asm/sembuf.h:30:2: error: unknown type name `__kernel_ulong_t'
          __kernel_ulong_t __unused3;
          ^~~~~~~~~~~~~~~~
        usr/include/asm/sembuf.h:31:2: error: unknown type name `__kernel_ulong_t'
          __kernel_ulong_t __unused4;
          ^~~~~~~~~~~~~~~~
      
      It is just a matter of missing include directive.
      
      Include <asm/ipcbuf.h> to make it self-contained, and add it to
      the compile-test coverage.
      
      Link: http://lkml.kernel.org/r/20191030063855.9989-3-yamada.masahiro@socionext.comSigned-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0fb9dc28
    • M
      arch: msgbuf.h: make uapi asm/msgbuf.h self-contained · 9ef0e004
      Masahiro Yamada 提交于
      Userspace cannot compile <asm/msgbuf.h> due to some missing type
      definitions.  For example, building it for x86 fails as follows:
      
          CC      usr/include/asm/msgbuf.h.s
        In file included from usr/include/asm/msgbuf.h:6:0,
                         from <command-line>:32:
        usr/include/asm-generic/msgbuf.h:25:20: error: field `msg_perm' has incomplete type
          struct ipc64_perm msg_perm;
                            ^~~~~~~~
        usr/include/asm-generic/msgbuf.h:27:2: error: unknown type name `__kernel_time_t'
          __kernel_time_t msg_stime; /* last msgsnd time */
          ^~~~~~~~~~~~~~~
        usr/include/asm-generic/msgbuf.h:28:2: error: unknown type name `__kernel_time_t'
          __kernel_time_t msg_rtime; /* last msgrcv time */
          ^~~~~~~~~~~~~~~
        usr/include/asm-generic/msgbuf.h:29:2: error: unknown type name `__kernel_time_t'
          __kernel_time_t msg_ctime; /* last change time */
          ^~~~~~~~~~~~~~~
        usr/include/asm-generic/msgbuf.h:41:2: error: unknown type name `__kernel_pid_t'
          __kernel_pid_t msg_lspid; /* pid of last msgsnd */
          ^~~~~~~~~~~~~~
        usr/include/asm-generic/msgbuf.h:42:2: error: unknown type name `__kernel_pid_t'
          __kernel_pid_t msg_lrpid; /* last receive pid */
          ^~~~~~~~~~~~~~
      
      It is just a matter of missing include directive.
      
      Include <asm/ipcbuf.h> to make it self-contained, and add it to
      the compile-test coverage.
      
      Link: http://lkml.kernel.org/r/20191030063855.9989-2-yamada.masahiro@socionext.comSigned-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9ef0e004
    • M
      arch: ipcbuf.h: make uapi asm/ipcbuf.h self-contained · 5b009673
      Masahiro Yamada 提交于
      Userspace cannot compile <asm/ipcbuf.h> due to some missing type
      definitions.  For example, building it for x86 fails as follows:
      
          CC      usr/include/asm/ipcbuf.h.s
        In file included from usr/include/asm/ipcbuf.h:1:0,
                         from <command-line>:32:
        usr/include/asm-generic/ipcbuf.h:21:2: error: unknown type name `__kernel_key_t'
          __kernel_key_t  key;
          ^~~~~~~~~~~~~~
        usr/include/asm-generic/ipcbuf.h:22:2: error: unknown type name `__kernel_uid32_t'
          __kernel_uid32_t uid;
          ^~~~~~~~~~~~~~~~
        usr/include/asm-generic/ipcbuf.h:23:2: error: unknown type name `__kernel_gid32_t'
          __kernel_gid32_t gid;
          ^~~~~~~~~~~~~~~~
        usr/include/asm-generic/ipcbuf.h:24:2: error: unknown type name `__kernel_uid32_t'
          __kernel_uid32_t cuid;
          ^~~~~~~~~~~~~~~~
        usr/include/asm-generic/ipcbuf.h:25:2: error: unknown type name `__kernel_gid32_t'
          __kernel_gid32_t cgid;
          ^~~~~~~~~~~~~~~~
        usr/include/asm-generic/ipcbuf.h:26:2: error: unknown type name `__kernel_mode_t'
          __kernel_mode_t  mode;
          ^~~~~~~~~~~~~~~
        usr/include/asm-generic/ipcbuf.h:28:35: error: `__kernel_mode_t' undeclared here (not in a function)
          unsigned char  __pad1[4 - sizeof(__kernel_mode_t)];
                                           ^~~~~~~~~~~~~~~
        usr/include/asm-generic/ipcbuf.h:31:2: error: unknown type name `__kernel_ulong_t'
          __kernel_ulong_t __unused1;
          ^~~~~~~~~~~~~~~~
        usr/include/asm-generic/ipcbuf.h:32:2: error: unknown type name `__kernel_ulong_t'
          __kernel_ulong_t __unused2;
          ^~~~~~~~~~~~~~~~
      
      It is just a matter of missing include directive.
      
      Include <linux/posix_types.h> to make it self-contained, and add it to
      the compile-test coverage.
      
      Link: http://lkml.kernel.org/r/20191030063855.9989-1-yamada.masahiro@socionext.comSigned-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5b009673
  21. 15 11月, 2019 2 次提交
    • A
      y2038: ipc: remove __kernel_time_t reference from headers · caf5e32d
      Arnd Bergmann 提交于
      There are two structures based on time_t that conflict between libc and
      kernel: timeval and timespec. Both are now renamed to __kernel_old_timeval
      and __kernel_old_timespec.
      
      For time_t, the old typedef is still __kernel_time_t. There is nothing
      wrong with that name, but it would be nice to not use that going forward
      as this type is used almost only in deprecated interfaces because of
      the y2038 overflow.
      
      In the IPC headers (msgbuf.h, sembuf.h, shmbuf.h), __kernel_time_t is only
      used for the 64-bit variants, which are not deprecated.
      
      Change these to a plain 'long', which is the same type as __kernel_time_t
      on all 64-bit architectures anyway, to reduce the number of users of the
      old type.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      caf5e32d
    • A
      y2038: add __kernel_old_timespec and __kernel_old_time_t · 94c467dd
      Arnd Bergmann 提交于
      The 'struct timespec' definition can no longer be part of the uapi headers
      because it conflicts with a a now incompatible libc definition. Also,
      we really want to remove it in order to prevent new uses from creeping in.
      
      The same namespace conflict exists with time_t, which should also be
      removed. __kernel_time_t could be used safely, but adding 'old' in the
      name makes it clearer that this should not be used for new interfaces.
      
      Add a replacement __kernel_old_timespec structure and __kernel_old_time_t
      along the lines of __kernel_old_timeval.
      Acked-by: NDeepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      94c467dd
  22. 26 9月, 2019 2 次提交
    • M
      mm: introduce MADV_PAGEOUT · 1a4e58cc
      Minchan Kim 提交于
      When a process expects no accesses to a certain memory range for a long
      time, it could hint kernel that the pages can be reclaimed instantly but
      data should be preserved for future use.  This could reduce workingset
      eviction so it ends up increasing performance.
      
      This patch introduces the new MADV_PAGEOUT hint to madvise(2) syscall.
      MADV_PAGEOUT can be used by a process to mark a memory range as not
      expected to be used for a long time so that kernel reclaims *any LRU*
      pages instantly.  The hint can help kernel in deciding which pages to
      evict proactively.
      
      A note: It doesn't apply SWAP_CLUSTER_MAX LRU page isolation limit
      intentionally because it's automatically bounded by PMD size.  If PMD
      size(e.g., 256) makes some trouble, we could fix it later by limit it to
      SWAP_CLUSTER_MAX[1].
      
      - man-page material
      
      MADV_PAGEOUT (since Linux x.x)
      
      Do not expect access in the near future so pages in the specified
      regions could be reclaimed instantly regardless of memory pressure.
      Thus, access in the range after successful operation could cause
      major page fault but never lose the up-to-date contents unlike
      MADV_DONTNEED. Pages belonging to a shared mapping are only processed
      if a write access is allowed for the calling process.
      
      MADV_PAGEOUT cannot be applied to locked pages, Huge TLB pages, or
      VM_PFNMAP pages.
      
      [1] https://lore.kernel.org/lkml/20190710194719.GS29695@dhcp22.suse.cz/
      
      [minchan@kernel.org: clear PG_active on MADV_PAGEOUT]
        Link: http://lkml.kernel.org/r/20190802200643.GA181880@google.com
      [akpm@linux-foundation.org: resolve conflicts with hmm.git]
      Link: http://lkml.kernel.org/r/20190726023435.214162-5-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1a4e58cc
    • M
      mm: introduce MADV_COLD · 9c276cc6
      Minchan Kim 提交于
      Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.
      
      - Background
      
      The Android terminology used for forking a new process and starting an app
      from scratch is a cold start, while resuming an existing app is a hot
      start.  While we continually try to improve the performance of cold
      starts, hot starts will always be significantly less power hungry as well
      as faster so we are trying to make hot start more likely than cold start.
      
      To increase hot start, Android userspace manages the order that apps
      should be killed in a process called ActivityManagerService.
      ActivityManagerService tracks every Android app or service that the user
      could be interacting with at any time and translates that into a ranked
      list for lmkd(low memory killer daemon).  They are likely to be killed by
      lmkd if the system has to reclaim memory.  In that sense they are similar
      to entries in any other cache.  Those apps are kept alive for
      opportunistic performance improvements but those performance improvements
      will vary based on the memory requirements of individual workloads.
      
      - Problem
      
      Naturally, cached apps were dominant consumers of memory on the system.
      However, they were not significant consumers of swap even though they are
      good candidate for swap.  Under investigation, swapping out only begins
      once the low zone watermark is hit and kswapd wakes up, but the overall
      allocation rate in the system might trip lmkd thresholds and cause a
      cached process to be killed(we measured performance swapping out vs.
      zapping the memory by killing a process.  Unsurprisingly, zapping is 10x
      times faster even though we use zram which is much faster than real
      storage) so kill from lmkd will often satisfy the high zone watermark,
      resulting in very few pages actually being moved to swap.
      
      - Approach
      
      The approach we chose was to use a new interface to allow userspace to
      proactively reclaim entire processes by leveraging platform information.
      This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
      that are known to be cold from userspace and to avoid races with lmkd by
      reclaiming apps as soon as they entered the cached state.  Additionally,
      it could provide many chances for platform to use much information to
      optimize memory efficiency.
      
      To achieve the goal, the patchset introduce two new options for madvise.
      One is MADV_COLD which will deactivate activated pages and the other is
      MADV_PAGEOUT which will reclaim private pages instantly.  These new
      options complement MADV_DONTNEED and MADV_FREE by adding non-destructive
      ways to gain some free memory space.  MADV_PAGEOUT is similar to
      MADV_DONTNEED in a way that it hints the kernel that memory region is not
      currently needed and should be reclaimed immediately; MADV_COLD is similar
      to MADV_FREE in a way that it hints the kernel that memory region is not
      currently needed and should be reclaimed when memory pressure rises.
      
      This patch (of 5):
      
      When a process expects no accesses to a certain memory range, it could
      give a hint to kernel that the pages can be reclaimed when memory pressure
      happens but data should be preserved for future use.  This could reduce
      workingset eviction so it ends up increasing performance.
      
      This patch introduces the new MADV_COLD hint to madvise(2) syscall.
      MADV_COLD can be used by a process to mark a memory range as not expected
      to be used in the near future.  The hint can help kernel in deciding which
      pages to evict early during memory pressure.
      
      It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves
      
      	active file page -> inactive file LRU
      	active anon page -> inacdtive anon LRU
      
      Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file
      LRU's head because MADV_COLD is a little bit different symantic.
      MADV_FREE means it's okay to discard when the memory pressure because the
      content of the page is *garbage* so freeing such pages is almost zero
      overhead since we don't need to swap out and access afterward causes just
      minor fault.  Thus, it would make sense to put those freeable pages in
      inactive file LRU to compete other used-once pages.  It makes sense for
      implmentaion point of view, too because it's not swapbacked memory any
      longer until it would be re-dirtied.  Even, it could give a bonus to make
      them be reclaimed on swapless system.  However, MADV_COLD doesn't mean
      garbage so reclaiming them requires swap-out/in in the end so it's bigger
      cost.  Since we have designed VM LRU aging based on cost-model, anonymous
      cold pages would be better to position inactive anon's LRU list, not file
      LRU.  Furthermore, it would help to avoid unnecessary scanning if system
      doesn't have a swap device.  Let's start simpler way without adding
      complexity at this moment.  However, keep in mind, too that it's a caveat
      that workloads with a lot of pages cache are likely to ignore MADV_COLD on
      anonymous memory because we rarely age anonymous LRU lists.
      
      * man-page material
      
      MADV_COLD (since Linux x.x)
      
      Pages in the specified regions will be treated as less-recently-accessed
      compared to pages in the system with similar access frequencies.  In
      contrast to MADV_FREE, the contents of the region are preserved regardless
      of subsequent writes to pages.
      
      MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
      pages.
      
      [akpm@linux-foundation.org: resolve conflicts with hmm.git]
      Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9c276cc6
  23. 07 9月, 2019 1 次提交
    • A
      ipc: fix semtimedop for generic 32-bit architectures · 78e05972
      Arnd Bergmann 提交于
      As Vincent noticed, the y2038 conversion of semtimedop in linux-5.1
      broke when commit 00bf25d6 ("y2038: use time32 syscall names on
      32-bit") changed all system calls on all architectures that take
      a 32-bit time_t to point to the _time32 implementation, but left out
      semtimedop in the asm-generic header.
      
      This affects all 32-bit architectures using asm-generic/unistd.h:
      h8300, unicore32, openrisc, nios2, hexagon, c6x, arc, nds32 and csky.
      
      The notable exception is riscv32, which has dropped support for the
      time32 system calls entirely.
      Reported-by: NVincent Chen <deanbo422@gmail.com>
      Cc: stable@vger.kernel.org
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Aurelien Jacquiot <jacquiot.aurelien@gmail.com>
      Cc: Guo Ren <guoren@kernel.org>
      Fixes: 00bf25d6 ("y2038: use time32 syscall names on 32-bit")
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      78e05972
  24. 17 7月, 2019 3 次提交
  25. 15 7月, 2019 1 次提交
  26. 28 6月, 2019 1 次提交
    • C
      arch: wire-up pidfd_open() · 7615d9e1
      Christian Brauner 提交于
      This wires up the pidfd_open() syscall into all arches at once.
      Signed-off-by: NChristian Brauner <christian@brauner.io>
      Reviewed-by: NDavid Howells <dhowells@redhat.com>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jann Horn <jannh@google.com>
      Cc: Andy Lutomirsky <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-api@vger.kernel.org
      Cc: linux-alpha@vger.kernel.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-ia64@vger.kernel.org
      Cc: linux-m68k@lists.linux-m68k.org
      Cc: linux-mips@vger.kernel.org
      Cc: linux-parisc@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: linux-s390@vger.kernel.org
      Cc: linux-sh@vger.kernel.org
      Cc: sparclinux@vger.kernel.org
      Cc: linux-xtensa@linux-xtensa.org
      Cc: linux-arch@vger.kernel.org
      Cc: x86@kernel.org
      7615d9e1
  27. 15 6月, 2019 1 次提交
  28. 09 6月, 2019 1 次提交
    • C
      arch: wire-up clone3() syscall · 8f3220a8
      Christian Brauner 提交于
      Wire up the clone3() call on all arches that don't require hand-rolled
      assembly.
      
      Some of the arches look like they need special assembly massaging and it is
      probably smarter if the appropriate arch maintainers would do the actual
      wiring. Arches that are wired-up are:
      - x86{_32,64}
      - arm{64}
      - xtensa
      Signed-off-by: NChristian Brauner <christian@brauner.io>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Adrian Reber <adrian@lisas.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: x86@kernel.org
      8f3220a8