1. 19 Oct, 2020 1 commit
    • M
      mm/madvise: introduce process_madvise() syscall: an external memory hinting API · ecb8ac8b
      Minchan Kim committed
      There is a use case in which System Management Software (SMS) wants to
      give a memory hint such as MADV_[COLD|PAGEOUT] to other processes; in
      the case of Android, that software is the ActivityManagerService.
      
      The information required to make the reclaim decision is not known to the
      app.  Instead, it is known to the centralized userspace
      daemon(ActivityManagerService), and that daemon must be able to initiate
      reclaim on its own without any app involvement.
      
      To solve the issue, this patch introduces a new syscall,
      process_madvise(2).  It uses the pidfd of an external process to give
      the hint.  It also supports vectored address ranges, because an Android
      app has thousands of vmas due to zygote, so calling the syscall once per
      vma is a waste of CPU and power.  (In testing, 2000 single-vma syscalls
      vs. one vectored syscall showed a 15% performance improvement.  I think
      the gain would be bigger in real practice because the test ran in a very
      cache-friendly environment.)
      
      Another potential use case for the vector range is to amortize the cost
      of TLB shootdowns for multiple ranges when using MADV_DONTNEED; this could
      benefit users like TCP receive zerocopy and malloc implementations.  In
      the future, we may find more use cases for other advice values, so let's
      make it part of the API now, while we are introducing a new syscall.
      With that, existing madvise(2) users could replace it with
      process_madvise(2) on their own pid if they want batched address range
      support.
      
      Since it could affect another process's address range, only a privileged
      process (PTRACE_MODE_ATTACH_FSCREDS), or one that otherwise has the right
      to ptrace the target (e.g., by having the same UID), can use it
      successfully.  The flags argument is reserved for future use in case we
      need to extend the API.
      
      I think supporting every hint that madvise has, or will have, in
      process_madvise is rather risky, because we are not sure that all hints
      make sense from an external process, and the implementation of a hint
      may rely on the caller being in the current context, so it could be
      error-prone.  Thus, I have limited the hints to MADV_[COLD|PAGEOUT] in
      this patch.
      
      If someone wants to add other hints, we can hear the use case and review
      each hint individually.  That is safer for maintenance than introducing
      a buggy syscall that is hard to fix later.
      
      So finally, the API is as follows,
      
            ssize_t process_madvise(int pidfd, const struct iovec *iovec,
                      unsigned long vlen, int advice, unsigned int flags);
      
          DESCRIPTION
            The process_madvise() system call is used to give advice or directions
            to the kernel about the address ranges from external process as well as
            local process. It provides the advice to address ranges of process
            described by iovec and vlen. The goal of such advice is to improve
            system or application performance.
      
            The pidfd selects the process referred to by the PID file descriptor
            specified in pidfd. (See pidfd_open(2) for further information.)
      
            The pointer iovec points to an array of iovec structures, defined in
            <sys/uio.h> as:
      
              struct iovec {
                  void *iov_base;         /* starting address */
                  size_t iov_len;         /* number of bytes to be advised */
              };
      
            Each iovec element describes an address range beginning at the
            address iov_base and spanning iov_len bytes.
      
            The vlen represents the number of elements in iovec.
      
            The advice is indicated in the advice argument, which is one of the
            following at this moment if the target process specified by pidfd is
            external.
      
              MADV_COLD
              MADV_PAGEOUT
      
            Permission to provide a hint to external process is governed by a
            ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).
      
            process_madvise() supports every advice that madvise(2) has if the
            target process is in the same thread group as the calling process,
            so a user can use process_madvise(2) to extend existing madvise(2)
            usage with vector address ranges.
      
          RETURN VALUE
            On success, process_madvise() returns the number of bytes advised.
            This return value may be less than the total number of requested
            bytes, if an error occurred. The caller should check return value
            to determine whether a partial advice occurred.
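
      As a minimal userspace sketch of the API described above, assuming no
      libc wrapper is available and the call goes through syscall(2): the
      helper names advise_cold() and iov_total_len() are hypothetical, and the
      fallback syscall numbers and the MADV_COLD value are assumptions taken
      from the generic syscall table and may differ on some ABIs.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Fallbacks for older headers. These values are assumptions (generic
 * syscall table, e.g. x86_64 and arm64); check your ABI before relying
 * on them. */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_process_madvise
#define SYS_process_madvise 440
#endif
#ifndef MADV_COLD
#define MADV_COLD 20
#endif

/* Hypothetical helper: total number of bytes described by the vector,
 * useful for detecting a partial advice from the return value. */
size_t iov_total_len(const struct iovec *iov, size_t vlen)
{
    size_t total = 0;
    for (size_t i = 0; i < vlen; i++)
        total += iov[i].iov_len;
    return total;
}

/* Hypothetical helper: hint MADV_COLD for all ranges of process `pid`
 * in a single syscall. Returns the number of bytes advised, or -1 with
 * errno set (e.g. ENOSYS on kernels without the syscall, EPERM if the
 * caller lacks permission). */
ssize_t advise_cold(pid_t pid, const struct iovec *iov, size_t vlen)
{
    int pidfd = (int)syscall(SYS_pidfd_open, pid, 0U);
    if (pidfd < 0)
        return -1;

    ssize_t advised = (ssize_t)syscall(SYS_process_madvise, pidfd, iov,
                                       vlen, MADV_COLD, 0U);
    int saved_errno = errno;   /* close() must not clobber the error */
    close(pidfd);
    errno = saved_errno;
    return advised;
}
```

      A caller would compare the result of advise_cold() with
      iov_total_len() to decide whether a partial advice occurred.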
      
      FAQ:
      
      Q.1 - Why does any external entity have better knowledge?
      
      Quote from Sandeep
      
      "For Android, every application (including the special SystemServer)
      are forked from Zygote.  The reason of course is to share as many
      libraries and classes between the two as possible to benefit from the
      preloading during boot.
      
      After applications start, (almost) all of the APIs end up calling into
      this SystemServer process over IPC (binder) and back to the
      application.
      
      In a fully running system, the SystemServer monitors every single
      process periodically to calculate their PSS / RSS and also decides
      which process is "important" to the user for interactivity.
      
      So, because of how these processes start _and_ the fact that the
      SystemServer is looping to monitor each process, it does tend to *know*
      which address range of the application is not used / useful.
      
      Besides, we can never rely on applications to clean things up
      themselves.  We've had the "hey app1, the system is low on memory,
      please trim your memory usage down" notifications for a long time[1].
      They rely on applications honoring the broadcasts and very few do.
      
      So, if we want to avoid the inevitable killing of the application and
      restarting it, some way to be able to tell the OS about unimportant
      memory in these applications will be useful.
      
      - ssp
      
      Q.2 - How do we guard against the race (i.e., object validation) between
      an external process giving a hint and the target process receiving
      it?
      
      process_madvise operates on the target process's address space as it
      exists at the instant that process_madvise is called.  If the target
      process can run between the time the calling process inspects the
      target process's address space and the time that
      process_madvise is actually called, process_madvise may operate on
      memory regions that the calling process does not expect.  It's the
      responsibility of the process calling process_madvise to close this
      race condition.  For example, the calling process can suspend the
      target process with ptrace, SIGSTOP, or the freezer cgroup so that it
      doesn't have an opportunity to change its own address space before
      process_madvise is called.  Another option is to operate on memory
      regions that the caller knows a priori will be unchanged in the target
      process.  Yet another option is to accept the race for certain
      process_madvise calls after reasoning that mistargeting will do no
      harm.  The suggested API itself does not provide synchronization.  The
      same applies to other APIs such as move_pages and process_vm_writev.
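
      One of the options above, suspending the target with SIGSTOP, could be
      sketched as follows.  This is only an illustration under a strong
      assumption: the target is a child of the caller, so waitpid(WUNTRACED)
      can confirm the stop before any work is done.  The names
      with_child_stopped(), stopped_fn and demo_callback() are hypothetical;
      the callback stands in for "inspect the address space, then call
      process_madvise".

```c
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical callback type: the work to perform while the target is
 * stopped, e.g. reading /proc/<pid>/maps and issuing the hint. */
typedef int (*stopped_fn)(pid_t target, void *arg);

/* Placeholder callback used for illustration: returns the int that
 * `arg` points to instead of doing real work. */
int demo_callback(pid_t target, void *arg)
{
    (void)target;
    return *(int *)arg;
}

/* Stop a child process, run fn while it cannot change its own address
 * space, then resume it. Returns fn's result, or -1 if the child could
 * not be stopped. */
int with_child_stopped(pid_t target, stopped_fn fn, void *arg)
{
    int status;

    if (kill(target, SIGSTOP) != 0)
        return -1;

    /* SIGSTOP is delivered asynchronously; wait until the stop is
     * actually reported before touching the target. */
    if (waitpid(target, &status, WUNTRACED) != target ||
        !WIFSTOPPED(status)) {
        kill(target, SIGCONT);
        return -1;
    }

    int result = fn(target, arg);

    kill(target, SIGCONT);
    return result;
}
```

      For a target that is not a child, the caller would need another way to
      confirm the stop (or the freezer cgroup mentioned above).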
      
      The race isn't really a problem though.  Why is it so wrong to require
      that callers do their own synchronization in some manner?  Nobody
      objects to write(2) merely because it's possible for two processes to
      open the same file and clobber each other's writes --- instead, we tell
      people to use flock or something.  Think about mmap.  It never
      guarantees newly allocated address space is still valid when the user
      tries to access it because other threads could unmap the memory right
      before.  That's where we need synchronization through another API or
      through userspace design.  It shouldn't be part of the API itself.  If
      someone needs synchronization that is more fine-grained than process
      level, two ideas have been suggested: a cookie[2] and an anon-fd[3].
      Both could be implemented using the last, reserved argument of the API,
      but I don't think either is necessary right now, since we already have
      ways to prevent the race and I don't want to add complexity with a more
      fine-grained model.
      
      To keep the API extensible, the last argument is reserved so that we
      can support this in the future if someone really needs it.
      
      Q.3 - Why doesn't ptrace work?
      
      Injecting an madvise in the target process using ptrace would not work
      for us because such an injected madvise would have to be executed by the
      target process, which means that process would have to be runnable, and
      that creates the risk of the abovementioned race and of hinting a wrong
      VMA.  Furthermore, we want to act on the hint in the caller's context,
      not the callee's, because the callee is usually limited by
      cpuset/cgroups or even in a frozen state, so it can't act by itself
      quickly enough, which causes more thrashing/kills.  Nor does it work if
      the target process is already ptraced (e.g., by strace, a debugger, or
      minidump), because a process can have at most one ptracer.
      
      [1] https://developer.android.com/topic/performance/memory"
      
      [2] process_getinfo for getting the cookie which is updated whenever
          vma of process address layout are changed - Daniel Colascione -
          https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224
      
      [3] anonymous fd which is used for the object(i.e., address range)
          validation - Michal Hocko -
          https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/
      
      [minchan@kernel.org: fix process_madvise build break for arm64]
        Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com
      [minchan@kernel.org: fix build error for mips of process_madvise]
        Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com
      [akpm@linux-foundation.org: fix patch ordering issue]
      [akpm@linux-foundation.org: fix arm64 whoops]
      [minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian]
      [akpm@linux-foundation.org: fix i386 build]
      [sfr@canb.auug.org.au: fix syscall numbering]
        Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au
      [sfr@canb.auug.org.au: madvise.c needs compat.h]
        Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au
      [minchan@kernel.org: fix mips build]
        Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com
      [yuehaibing@huawei.com: remove duplicate header which is included twice]
        Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com
      [minchan@kernel.org: do not use helper functions for process_madvise]
        Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com
      [akpm@linux-foundation.org: pidfd_get_pid() gained an argument]
      [sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"]
        Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.au
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Dias <joaodias@google.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Sandeep Patil <sspatil@google.com>
      Cc: SeongJae Park <sj38.park@gmail.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Florian Weimer <fw@deneb.enyo.de>
      Cc: <linux-man@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com
      Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org
      Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 14 May, 2020 1 commit
    • M
      vfs: add faccessat2 syscall · c8ffd8bc
      Miklos Szeredi committed
      POSIX defines faccessat() as having a fourth "flags" argument, while the
      Linux syscall doesn't have it.  Glibc tries to emulate AT_EACCESS and
      AT_SYMLINK_NOFOLLOW, but AT_EACCESS emulation is broken.
      
      Add a new faccessat2(2) syscall with the added flags argument and implement
      both flags.
      
      The value of AT_EACCESS is defined in glibc headers to be the same as
      AT_REMOVEDIR.  Use this value for the kernel interface as well, together
      with the explanatory comment.
      
      Also add AT_EMPTY_PATH support, which is not documented by POSIX, but can
      be useful and is trivial to implement.
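
      A minimal userspace sketch of calling the new syscall directly, assuming
      the libc in use does not yet route faccessat() flags to it.  The helper
      name access_effective() is hypothetical; the fallback syscall number is
      an assumption from the generic syscall table, and the AT_EACCESS
      fallback uses the AT_REMOVEDIR value mentioned above.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older headers; both values are assumptions. */
#ifndef SYS_faccessat2
#define SYS_faccessat2 439   /* generic syscall table number */
#endif
#ifndef AT_EACCESS
#define AT_EACCESS 0x200     /* same value as AT_REMOVEDIR */
#endif

/* Hypothetical helper: check access using the effective IDs rather
 * than the real ones. Returns 0 if access is permitted, -1 with errno
 * set otherwise (ENOSYS on kernels without faccessat2). */
int access_effective(int dirfd, const char *path, int mode)
{
    return (int)syscall(SYS_faccessat2, dirfd, path, mode, AT_EACCESS);
}
```

      For example, access_effective(AT_FDCWD, "some/file", R_OK) asks whether
      the effective user may read the file.
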
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  3. 20 Mar, 2020 1 commit
  4. 18 Jan, 2020 1 commit
    • A
      open: introduce openat2(2) syscall · fddb5d43
      Aleksa Sarai committed
      /* Background. */
      For a very long time, extending openat(2) with new features has been
      incredibly frustrating. This stems from the fact that openat(2) is
      possibly the most famous counter-example to the mantra "don't silently
      accept garbage from userspace" -- it doesn't check whether unknown flags
      are present[1].
      
      This means that (generally) the addition of new flags to openat(2) has
      been fraught with backwards-compatibility issues (O_TMPFILE has to be
      defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
      kernels gave errors, since it's insecure to silently ignore the
      flag[2]). All new security-related flags therefore have a tough road to
      being added to openat(2).
      
      Userspace also has a hard time figuring out whether a particular flag is
      supported on a particular kernel. While it is now possible with
      contemporary kernels (thanks to [3]), older kernels will expose unknown
      flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during
      openat(2) time matches modern syscall designs and is far more
      fool-proof.
      
      In addition, the newly-added path resolution restriction LOOKUP flags
      (which we would like to expose to user-space) don't feel related to the
      pre-existing O_* flag set -- they affect all components of path lookup.
      We'd therefore like to add a new flag argument.
      
      Adding a new syscall allows us to finally fix the flag-ignoring problem,
      and we can make it extensible enough so that we will hopefully never
      need an openat3(2).
      
      /* Syscall Prototype. */
        /*
         * open_how is an extensible structure (similar in interface to
         * clone3(2) or sched_setattr(2)). The size parameter must be set to
         * sizeof(struct open_how), to allow for future extensions. All future
         * extensions will be appended to open_how, with their zero value
         * acting as a no-op default.
         */
        struct open_how { /* ... */ };
      
        int openat2(int dfd, const char *pathname,
                    struct open_how *how, size_t size);
      
      /* Description. */
      The initial version of 'struct open_how' contains the following fields:
      
        flags
          Used to specify openat(2)-style flags. However, any unknown flag
          bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR)
          will result in -EINVAL. In addition, this field is 64-bits wide to
          allow for more O_ flags than currently permitted with openat(2).
      
        mode
          The file mode for O_CREAT or O_TMPFILE.
      
          Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.
      
        resolve
          Restrict path resolution (in contrast to O_* flags they affect all
          path components). The current set of flags are as follows (at the
          moment, all of the RESOLVE_ flags are implemented as just passing
          the corresponding LOOKUP_ flag).
      
          RESOLVE_NO_XDEV       => LOOKUP_NO_XDEV
          RESOLVE_NO_SYMLINKS   => LOOKUP_NO_SYMLINKS
          RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
          RESOLVE_BENEATH       => LOOKUP_BENEATH
          RESOLVE_IN_ROOT       => LOOKUP_IN_ROOT
      
      open_how does not contain an embedded size field, because it is of
      little benefit (userspace can figure out the kernel open_how size at
      runtime fairly easily without it). It also only contains u64s (even
      though ->mode arguably should be a u16) to avoid having padding fields
      which are never used in the future.
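
      A minimal userspace sketch of the interface described above, calling
      the syscall through syscall(2).  The struct open_how_sketch mirror, the
      SKETCH_RESOLVE_BENEATH constant and the open_beneath() helper are
      hypothetical names; the three-u64 layout follows the description above,
      while the syscall number and the flag value are assumptions that should
      be checked against the installed uapi headers.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_openat2
#define SYS_openat2 437   /* assumption: generic syscall table number */
#endif

/* Local mirror of the initial open_how layout (u64 fields only). It is
 * renamed so that it cannot clash with the real uapi definition. */
struct open_how_sketch {
    uint64_t flags;     /* openat(2)-style O_* flags, strictly checked */
    uint64_t mode;      /* must be 0 without O_CREAT or O_TMPFILE */
    uint64_t resolve;   /* RESOLVE_* path resolution restrictions */
};

/* Assumption: numeric value of RESOLVE_BENEATH in the uapi header. */
#define SKETCH_RESOLVE_BENEATH 0x08ULL

/* Hypothetical helper: open `path` relative to `dirfd`, refusing any
 * resolution that would escape the directory tree under dirfd.
 * Returns a file descriptor, or -1 with errno set. */
int open_beneath(int dirfd, const char *path, uint64_t flags)
{
    struct open_how_sketch how;

    memset(&how, 0, sizeof(how));   /* zero means "no-op default" */
    how.flags = flags;
    how.resolve = SKETCH_RESOLVE_BENEATH;

    /* The size argument tells the kernel which struct version we use. */
    return (int)syscall(SYS_openat2, dirfd, path, &how, sizeof(how));
}
```

      With this sketch, a path such as "../outside" is rejected rather than
      resolved, and unknown bits in flags produce an error instead of being
      silently ignored.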
      
      Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE
      is no longer permitted for openat(2). As far as I can tell, this has
      always been a bug and appears to not be used by userspace (and I've not
      seen any problems on my machines by disallowing it). If it turns out
      this breaks something, we can special-case it and only permit it for
      openat(2) but not openat2(2).
      
      After input from Florian Weimer, the new open_how and flag definitions
      are inside a separate header from uapi/linux/fcntl.h, to avoid problems
      that glibc has with importing that header.
      
      /* Testing. */
      In a follow-up patch there are over 200 selftests which ensure that this
      syscall has the correct semantics and will correctly handle several
      attack scenarios.
      
      In addition, I've written a userspace library[4] which provides
      convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary
      because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care
      must be taken when using RESOLVE_IN_ROOT'd file descriptors with other
      syscalls). During the development of this patch, I've run numerous
      verification tests using libpathrs (showing that the API is reasonably
      usable by userspace).
      
      /* Future Work. */
      Additional RESOLVE_ flags have been suggested during the review period.
      These can be easily implemented separately (such as blocking auto-mount
      during resolution).
      
      Furthermore, there are some other proposed changes to the openat(2)
      interface (the most obvious example is magic-link hardening[5]) which
      would be a good opportunity to add a way for userspace to restrict how
      O_PATH file descriptors can be re-opened.
      
      Another possible avenue of future work would be some kind of
      CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace
      which openat2(2) flags and fields are supported by the current kernel
      (to avoid userspace having to go through several guesses to figure it
      out).
      
      [1]: https://lwn.net/Articles/588444/
      [2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
      [3]: commit 629e014b ("fs: completely ignore unknown open flags")
      [4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
      [5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/
      [6]: https://youtu.be/ggD-eb3yPVs
      
      Suggested-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  5. 14 Jan, 2020 1 commit
  6. 07 Jan, 2020 1 commit
  7. 28 Jun, 2019 1 commit
    • C
      arch: wire-up pidfd_open() · 7615d9e1
      Christian Brauner committed
      This wires up the pidfd_open() syscall into all arches at once.
      Signed-off-by: Christian Brauner <christian@brauner.io>
      Reviewed-by: David Howells <dhowells@redhat.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jann Horn <jannh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-api@vger.kernel.org
      Cc: linux-alpha@vger.kernel.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-ia64@vger.kernel.org
      Cc: linux-m68k@lists.linux-m68k.org
      Cc: linux-mips@vger.kernel.org
      Cc: linux-parisc@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: linux-s390@vger.kernel.org
      Cc: linux-sh@vger.kernel.org
      Cc: sparclinux@vger.kernel.org
      Cc: linux-xtensa@linux-xtensa.org
      Cc: linux-arch@vger.kernel.org
      Cc: x86@kernel.org
  8. 23 Jun, 2019 1 commit
  9. 21 Jun, 2019 1 commit
    • C
      arch: handle arches who do not yet define clone3 · d68dbb0c
      Christian Brauner committed
      This cleanly handles arches who do not yet define clone3.
      
      clone3() was initially placed under __ARCH_WANT_SYS_CLONE under the
      assumption that this would cleanly handle all architectures. It does
      not.
      Architectures such as nios2 or h8300 simply take the asm-generic syscall
      definitions and generate their syscall table from it. Since they don't
      define __ARCH_WANT_SYS_CLONE the build would fail complaining about
      sys_clone3 missing. The reason this doesn't happen for legacy clone is
      that nios2 and h8300 provide assembly stubs for sys_clone. This seems to
      be done for architectural reasons.
      
      The build failures for nios2 and h8300 were luckily caught in -next.
      The solution is to define __ARCH_WANT_SYS_CLONE3 that architectures can
      add. Additionally, we need a cond_syscall(clone3) for architectures such
      as nios2 or h8300 that generate their syscall table in the way I
      explained above.
      
      Fixes: 8f3220a8 ("arch: wire-up clone3() syscall")
      Signed-off-by: Christian Brauner <christian@brauner.io>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Adrian Reber <adrian@lisas.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: x86@kernel.org
  10. 19 Jun, 2019 1 commit
  11. 09 Jun, 2019 1 commit
    • C
      arch: wire-up clone3() syscall · 8f3220a8
      Christian Brauner committed
      Wire up the clone3() call on all arches that don't require hand-rolled
      assembly.
      
      Some of the arches look like they need special assembly massaging and it is
      probably smarter if the appropriate arch maintainers would do the actual
      wiring. Arches that are wired-up are:
      - x86{_32,64}
      - arm{64}
      - xtensa
      Signed-off-by: Christian Brauner <christian@brauner.io>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Adrian Reber <adrian@lisas.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: x86@kernel.org
  12. 17 May, 2019 1 commit
  13. 15 Apr, 2019 1 commit
  14. 07 Feb, 2019 1 commit
    • A
      y2038: add 64-bit time_t syscalls to all 32-bit architectures · 48166e6e
      Arnd Bergmann committed
      This adds 21 new system calls on each ABI that has 32-bit time_t
      today. All of these have the exact same semantics as their existing
      counterparts, and the new ones all have macro names that end in 'time64'
      for clarification.
      
      This gets us to the point of being able to safely use a C library
      that has 64-bit time_t in user space. There are still a couple of
      loose ends to tie up in various areas of the code, but this is the
      big one, and should be entirely uncontroversial at this point.
      
      In particular, there are four system calls (getitimer, setitimer,
      waitid, and getrusage) that don't have a 64-bit counterpart yet,
      but these can all be safely implemented in the C library by wrapping
      around the existing system calls because the 32-bit time_t they
      pass only counts elapsed time, not time since the epoch. They
      will be dealt with later.
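
      For context on why time since the epoch is the problematic case, the
      following sketch converts the largest value a signed 32-bit time_t can
      hold into a UTC date.  The helper name last_32bit_time() is
      hypothetical and the code is an illustration only, not part of this
      patch.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: break down the last second representable in a
 * signed 32-bit time_t. Returns 0 on success, -1 on failure. */
int last_32bit_time(struct tm *out)
{
    time_t t = (time_t)INT32_MAX;   /* 2147483647 seconds after the epoch */
    return gmtime_r(&t, out) ? 0 : -1;
}
```

      The resulting date falls in January 2038; one second later, a 32-bit
      counter of seconds since the epoch overflows, whereas a 32-bit count of
      elapsed time (as in getrusage) is unaffected by the calendar date.
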
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
  15. 26 Jan, 2019 2 commits
    • A
      ARM: add kexec_file_load system call number · 4ab65ba7
      Arnd Bergmann committed
      A couple of architectures including arm64 already implement the
      kexec_file_load system call, on many others we have assigned a system
      call number for it, but not implemented it yet.
      
      Adding the number in arch/arm/ lets us use the system call on arm64
      systems in compat mode, and also reduces the number of differences
      between architectures. If we want to implement kexec_file_load on ARM
      in the future, the number assignment means that kexec tools can already
      be built with the now current set of kernel headers.
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    • A
      ARM: add migrate_pages() system call · 78594b95
      Arnd Bergmann committed
      The migrate_pages system call has an assigned number on all architectures
      except ARM. When it got added initially in commit d80ade7b ("ARM:
      Fix warning: #warning syscall migrate_pages not implemented"), it was
      intentionally left out based on the observation that there are no 32-bit
      ARM NUMA systems.
      
      However, there are now arm64 NUMA machines that can in theory run 32-bit
      kernels (actually enabling NUMA there would require additional work)
      as well as 32-bit user space on 64-bit kernels, so that argument is no
      longer very strong.
      
      Assigning the number lets us use the system call on 64-bit kernels as well
      as providing a more consistent set of syscalls across architectures.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
  16. 04 Jan, 2019 2 commits
  17. 29 Aug, 2018 2 commits
    • A
      y2038: utimes: Rework #ifdef guards for compat syscalls · 4faea239
      Arnd Bergmann committed
      After changing over to 64-bit time_t syscalls, many architectures will
      want compat_sys_utimensat() but not respective handlers for utime(),
      utimes() and futimesat(). This adds a new __ARCH_WANT_SYS_UTIME32 to
      complement __ARCH_WANT_SYS_UTIME. For now, all 64-bit architectures that
      support CONFIG_COMPAT set it, but future 64-bit architectures will not
      (tile would not have needed it either, but got removed).
      
      As older 32-bit architectures get converted to using CONFIG_64BIT_TIME,
      they will have to use __ARCH_WANT_SYS_UTIME32 instead of
      __ARCH_WANT_SYS_UTIME. Architectures using the generic syscall ABI don't
      need either of them as they never had a utime syscall.
      
      Since the compat_utimbuf structure is now required outside of
      CONFIG_COMPAT, I'm moving it into compat_time.h.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      ---
      changed from last version:
      - renamed __ARCH_WANT_COMPAT_SYS_UTIME to __ARCH_WANT_SYS_UTIME32
    • A
      asm-generic: Remove unneeded __ARCH_WANT_SYS_LLSEEK macro · caf6f9c8
      Arnd Bergmann committed
      The sys_llseek system call is needed on all 32-bit architectures and
      none of the 64-bit ones, so we can remove the __ARCH_WANT_SYS_LLSEEK guard
      and simplify the include/asm-generic/unistd.h header further.
      
      Since 32-bit tasks can run either natively or in compat mode on 64-bit
      architectures, we have to check for both !CONFIG_64BIT and CONFIG_COMPAT.
      
      There are a few 64-bit architectures that also reference sys_llseek
      in their 64-bit ABI (e.g. sparc), but I verified that those all
      select CONFIG_COMPAT, so the #if check is still correct here. It's
      a bit odd to include it in the syscall table though, as it's the
      same as sys_lseek() on 64-bit, but with strange calling conventions.
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
  18. 11 Jul, 2018 1 commit
  19. 18 Apr, 2017 1 commit
    • A
      Remove compat_sys_getdents64() · 2611dc19
      Al Viro committed
      Unlike normal compat syscall variants, it is needed only for
      biarch architectures that have different alignment requirements for
      u64 in 32bit and 64bit ABI *and* have __put_user() that won't handle
      a store of 64bit value at 32bit-aligned address.  We used to have one
      such (ia64), but its biarch support has been gone since 2010 (after
      being broken in 2008, which went unnoticed since nobody had been using
      it).
      
      It had escaped removal at the same time only because back in 2004
      a patch that switched several syscalls on amd64 from private wrappers to
      generic compat ones had switched to use of compat_sys_getdents64(), which
      hadn't needed (or used) a compat wrapper on amd64.
      
      Let's bury it - it's at least 7 years overdue.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  20. 22 Mar, 2017 1 commit
  21. 02 Jun, 2016 1 commit
  22. 14 Oct, 2015 1 commit
  23. 27 Jan, 2015 1 commit
  24. 13 Jan, 2015 1 commit
  25. 07 Jan, 2015 1 commit
    • M
      arm64: Correct __NR_compat_syscalls for bpf · 0f9132ce
      Mark Rutland committed
      Commit 97b56be1 (arm64: compat: Enable bpf syscall) made the
      usual mistake of forgetting to update __NR_compat_syscalls. Due to this,
      when el0_sync_compat calls el0_svc_naked, the test against sc_nr
      (__NR_compat_syscalls) will fail, and we'll call ni_sys, returning
      -ENOSYS to userspace.
      
      This patch bumps __NR_compat_syscalls appropriately, enabling the use of
      the bpf syscall from compat tasks.
      
      Due to the reorganisation of unistd{,32}.h as part of commit
      f3e5c847 (arm64: Add __NR_* definitions for compat syscalls) it
      is not currently possible to include both headers and sanity-check the
      value of __NR_compat_syscalls at build-time to prevent this from
      happening again. Additional rework is required to make such niceties a
      possibility.
      
      Cc: Will Deacon <will.deacon@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
  26. 28 Nov, 2014 1 commit
  27. 19 Aug, 2014 1 commit
  28. 10 Jul, 2014 1 commit
    • C
      arm64: Add __NR_* definitions for compat syscalls · f3e5c847
      Catalin Marinas committed
      This patch adds __NR_* definitions to asm/unistd32.h, moves the
      __NR_compat_* definitions to asm/unistd.h and removes all the explicit
      unistd32.h includes apart from the one building the compat syscall
      table. The aim is to have the compat __NR_* definitions available but
      without colliding with the native syscall definitions (required by
      lib/compat_audit.c to avoid duplicating the audit header files between
      native and compat).
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
  29. 29 May, 2014 1 commit
  30. 04 Mar, 2014 1 commit
    • H
      compat: let architectures define __ARCH_WANT_COMPAT_SYS_GETDENTS64 · 0473c9b5
      Heiko Carstens committed
      For architecture dependent compat syscalls in common code an architecture
      must define something like __ARCH_WANT_<WHATEVER> if it wants to use the
      code.
      This however is not true for compat_sys_getdents64 for which architectures
      must define __ARCH_OMIT_COMPAT_SYS_GETDENTS64 if they do not want the code.
      
      This leads to the situation where all architectures, except mips, get the
      compat code but only x86_64, arm64 and the generic syscall architectures
      actually use it.
      
      So invert the logic, so that architectures actively must do something to
      get the compat code.
      
      This way a couple of architectures get rid of otherwise dead code.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
  31. 14 Feb, 2013 1 commit
    • A
      burying unused conditionals · d64008a8
      Al Viro committed
      __ARCH_WANT_SYS_RT_SIGACTION,
      __ARCH_WANT_SYS_RT_SIGSUSPEND,
      __ARCH_WANT_COMPAT_SYS_RT_SIGSUSPEND,
      __ARCH_WANT_COMPAT_SYS_SCHED_RR_GET_INTERVAL - not used anymore
      CONFIG_GENERIC_{SIGALTSTACK,COMPAT_RT_SIG{ACTION,QUEUEINFO,PENDING,PROCMASK}} -
      can be assumed always set.
  32. 20 Dec, 2012 1 commit
  33. 18 Dec, 2012 1 commit
  34. 29 Nov, 2012 1 commit
  35. 09 Nov, 2012 1 commit
  36. 17 Oct, 2012 1 commit
  37. 11 Oct, 2012 1 commit