1. 01 2月, 2020 21 次提交
  2. 30 1月, 2020 19 次提交
    • L
      Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma · 39bed42d
      Linus Torvalds 提交于
      Pull mmu_notifier updates from Jason Gunthorpe:
       "This small series revises the names in mmu_notifier to make the code
        clearer and more readable"
      
      * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
        mm/mmu_notifiers: Use 'interval_sub' as the variable for mmu_interval_notifier
        mm/mmu_notifiers: Use 'subscription' as the variable name for mmu_notifier
        mm/mmu_notifier: Rename struct mmu_notifier_mm to mmu_notifier_subscriptions
      39bed42d
    • L
      Merge tag 'threads-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux · 83fa805b
      Linus Torvalds 提交于
      Pull thread management updates from Christian Brauner:
       "Sargun Dhillon over the last cycle has worked on the pidfd_getfd()
        syscall.
      
        This syscall allows for the retrieval of file descriptors of a process
        based on its pidfd. A task needs to have ptrace_may_access()
        permissions with PTRACE_MODE_ATTACH_REALCREDS (suggested by Oleg and
        Andy) on the target.
      
        One of the main use-cases is in combination with seccomp's user
        notification feature. As a reminder, seccomp's user notification
        feature was made available in v5.0. It allows a task to retrieve a
        file descriptor for its seccomp filter. The file descriptor is usually
        handed of to a more privileged supervising process. The supervisor can
        then listen for syscall events caught by the seccomp filter of the
        supervisee and perform actions in lieu of the supervisee, usually
        emulating syscalls. pidfd_getfd() is needed to expand its uses.
      
        There are currently two major users that wait on pidfd_getfd() and one
        future user:
      
         - Netflix, Sargun said, is working on a service mesh where users
           should be able to connect to a dns-based VIP. When a user connects
           to e.g. 1.2.3.4:80 that runs e.g. service "foo" they will be
           redirected to an envoy process. This service mesh uses seccomp user
           notifications and pidfd to intercept all connect calls and instead
           of connecting them to 1.2.3.4:80 connects them to e.g.
           127.0.0.1:8080.
      
         - LXD uses the seccomp notifier heavily to intercept and emulate
           mknod() and mount() syscalls for unprivileged containers/processes.
           With pidfd_getfd() more uses-cases e.g. bridging socket connections
           will be possible.
      
         - The patchset has also seen some interest from the browser corner.
           Right now, Firefox is using a SECCOMP_RET_TRAP sandbox managed by a
           broker process. In the future glibc will start blocking all signals
           during dlopen() rendering this type of sandbox impossible. Hence,
           in the future Firefox will switch to a seccomp-user-nofication
           based sandbox which also makes use of file descriptor retrieval.
           The thread for this can be found at
           https://sourceware.org/ml/libc-alpha/2019-12/msg00079.html
      
        With pidfd_getfd() it is e.g. possible to bridge socket connections
        for the supervisee (binding to a privileged port) and taking actions
        on file descriptors on behalf of the supervisee in general.
      
        Sargun's first version was using an ioctl on pidfds but various people
        pushed for it to be a proper syscall which he duely implemented as
        well over various review cycles. Selftests are of course included.
        I've also added instructions how to deal with merge conflicts below.
      
        There's also a small fix coming from the kernel mentee project to
        correctly annotate struct sighand_struct with __rcu to fix various
        sparse warnings. We've received a few more such fixes and even though
        they are mostly trivial I've decided to postpone them until after -rc1
        since they came in rather late and I don't want to risk introducing
        build warnings.
      
        Finally, there's a new prctl() command PR_{G,S}ET_IO_FLUSHER which is
        needed to avoid allocation recursions triggerable by storage drivers
        that have userspace parts that run in the IO path (e.g. dm-multipath,
        iscsi, etc). These allocation recursions deadlock the device.
      
        The new prctl() allows such privileged userspace components to avoid
        allocation recursions by setting the PF_MEMALLOC_NOIO and
        PF_LESS_THROTTLE flags. The patch carries the necessary acks from the
        relevant maintainers and is routed here as part of prctl()
        thread-management."
      
      * tag 'threads-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
        prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaim
        sched.h: Annotate sighand_struct with __rcu
        test: Add test for pidfd getfd
        arch: wire up pidfd_getfd syscall
        pid: Implement pidfd_getfd syscall
        vfs, fdtable: Add fget_task helper
      83fa805b
    • L
      Merge tag 'for-5.6/io_uring-vfs-2020-01-29' of git://git.kernel.dk/linux-block · 896f8d23
      Linus Torvalds 提交于
      Pull io_uring updates from Jens Axboe:
      
       - Support for various new opcodes (fallocate, openat, close, statx,
         fadvise, madvise, openat2, non-vectored read/write, send/recv, and
         epoll_ctl)
      
       - Faster ring quiesce for fileset updates
      
       - Optimizations for overflow condition checking
      
       - Support for max-sized clamping
      
       - Support for probing what opcodes are supported
      
       - Support for io-wq backend sharing between "sibling" rings
      
       - Support for registering personalities
      
       - Lots of little fixes and improvements
      
      * tag 'for-5.6/io_uring-vfs-2020-01-29' of git://git.kernel.dk/linux-block: (64 commits)
        io_uring: add support for epoll_ctl(2)
        eventpoll: support non-blocking do_epoll_ctl() calls
        eventpoll: abstract out epoll_ctl() handler
        io_uring: fix linked command file table usage
        io_uring: support using a registered personality for commands
        io_uring: allow registering credentials
        io_uring: add io-wq workqueue sharing
        io-wq: allow grabbing existing io-wq
        io_uring/io-wq: don't use static creds/mm assignments
        io-wq: make the io_wq ref counted
        io_uring: fix refcounting with batched allocations at OOM
        io_uring: add comment for drain_next
        io_uring: don't attempt to copy iovec for READ/WRITE
        io_uring: honor IOSQE_ASYNC for linked reqs
        io_uring: prep req when do IOSQE_ASYNC
        io_uring: use labeled array init in io_op_defs
        io_uring: optimise sqe-to-req flags translation
        io_uring: remove REQ_F_IO_DRAINED
        io_uring: file switch work needs to get flushed on exit
        io_uring: hide uring_fd in ctx
        ...
      896f8d23
    • L
      Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 33c84e89
      Linus Torvalds 提交于
      Pull SCSI updates from James Bottomley:
       "This series is slightly unusual because it includes Arnd's compat
        ioctl tree here:
      
          1c46a2cf Merge tag 'block-ioctl-cleanup-5.6' into 5.6/scsi-queue
      
        Excluding Arnd's changes, this is mostly an update of the usual
        drivers: megaraid_sas, mpt3sas, qla2xxx, ufs, lpfc, hisi_sas.
      
        There are a couple of core and base updates around error propagation
        and atomicity in the attribute container base we use for the SCSI
        transport classes.
      
        The rest is minor changes and updates"
      
      * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (149 commits)
        scsi: hisi_sas: Rename hisi_sas_cq.pci_irq_mask
        scsi: hisi_sas: Add prints for v3 hw interrupt converge and automatic affinity
        scsi: hisi_sas: Modify the file permissions of trigger_dump to write only
        scsi: hisi_sas: Replace magic number when handle channel interrupt
        scsi: hisi_sas: replace spin_lock_irqsave/spin_unlock_restore with spin_lock/spin_unlock
        scsi: hisi_sas: use threaded irq to process CQ interrupts
        scsi: ufs: Use UFS device indicated maximum LU number
        scsi: ufs: Add max_lu_supported in struct ufs_dev_info
        scsi: ufs: Delete is_init_prefetch from struct ufs_hba
        scsi: ufs: Inline two functions into their callers
        scsi: ufs: Move ufshcd_get_max_pwr_mode() to ufshcd_device_params_init()
        scsi: ufs: Split ufshcd_probe_hba() based on its called flow
        scsi: ufs: Delete struct ufs_dev_desc
        scsi: ufs: Fix ufshcd_probe_hba() reture value in case ufshcd_scsi_add_wlus() fails
        scsi: ufs-mediatek: enable low-power mode for hibern8 state
        scsi: ufs: export some functions for vendor usage
        scsi: ufs-mediatek: add dbg_register_dump implementation
        scsi: qla2xxx: Fix a NULL pointer dereference in an error path
        scsi: qla1280: Make checking for 64bit support consistent
        scsi: megaraid_sas: Update driver version to 07.713.01.00-rc1
        ...
      33c84e89
    • L
      Merge tag 'for-5.6/dm-changes' of... · e9f8ca0a
      Linus Torvalds 提交于
      Merge tag 'for-5.6/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
      
      Pull device mapper updates from Mike Snitzer:
      
       - Fix DM core's potential for q->make_request_fn NULL pointer in the
         unlikely case that a DM device is created without a DM table and then
         accessed due to upper-layer userspace code or user error.
      
       - Fix DM thin-provisioning's metadata_pre_commit_callback to not use
         memory after it is free'd. Also refactor code to disallow changing
         the thin-pool's data device once in use -- doing so guarantees smae
         lifetime of pool's data device relative to the pool metadata.
      
       - Fix DM space maps used by DM thinp and DM cache to avoid reuse of a
         already used block. This race was identified with extremely heavy
         snapshot use in the context of DM thin provisioning.
      
       - Fix DM raid's table status relative to an active rebuild.
      
       - Fix DM crypt to use GFP_NOIO rather than GFP_NOFS in call to
         skcipher_request_alloc(). Also fix benbi IV constructor crash if used
         in authenticated mode.
      
       - Add DM crypt support for Elephant diffuser to allow for Bitlocker
         compatibility.
      
       - Fix DM verity target to not prefetch hash blocks for data that has
         already been verified.
      
       - Fix DM writecache's incorrect flush sequence during commit when in
         SSD mode.
      
       - Improve DM writecache's sequential write performance on SSDs.
      
       - Add DM zoned target support for zone sizes smaller than 128MiB.
      
       - Add DM multipath 'queue_if_no_path_timeout_secs' module param to
         allow timeout if path isn't reinstated. This allows users a kernel
         safety-net against IO hanging indefinitely, due to no active paths,
         that has historically only been provided by multipathd userspace.
      
       - Various DM code cleanups to use true/false rather than 1/0, a
         variable rename in dm-dust, and fix for a math error in comment for
         DM thin metadata's ondisk format.
      
      * tag 'for-5.6/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (21 commits)
        dm: fix potential for q->make_request_fn NULL pointer
        dm writecache: improve performance of large linear writes on SSDs
        dm mpath: Add timeout mechanism for queue_if_no_path
        dm thin: change data device's flush_bio to be member of struct pool
        dm thin: don't allow changing data device during thin-pool reload
        dm thin: fix use-after-free in metadata_pre_commit_callback
        dm thin metadata: use pool locking at end of dm_pool_metadata_close
        dm writecache: fix incorrect flush sequence when doing SSD mode commit
        dm crypt: fix benbi IV constructor crash if used in authenticated mode
        dm crypt: Implement Elephant diffuser for Bitlocker compatibility
        dm space map common: fix to ensure new block isn't already in use
        dm verity: don't prefetch hash blocks for already-verified data
        dm crypt: fix GFP flags passed to skcipher_request_alloc()
        dm thin metadata: Fix trivial math error in on-disk format documentation
        dm thin metadata: use true/false for bool variable
        dm snapshot: use true/false for bool variable
        dm bio prison v2: use true/false for bool variable
        dm mpath: use true/false for bool variable
        dm zoned: support zone sizes smaller than 128MiB
        dm raid: table line rebuild status fixes
        ...
      e9f8ca0a
    • L
      Merge tag 'docs-5.6' of git://git.lwn.net/linux · 05ef8b97
      Linus Torvalds 提交于
      Pull documentation updates from Jonathan Corbet:
       "It has been a relatively quiet cycle for documentation, but there's
        still a couple of things of note:
      
         - Conversion of the NFS documentation to RST
      
         - A new document on how to help with documentation (and a maintainer
           profile entry too)
      
        Plus the usual collection of typo fixes, etc"
      
      * tag 'docs-5.6' of git://git.lwn.net/linux: (40 commits)
        docs: filesystems: add overlayfs to index.rst
        docs: usb: remove some broken references
        scripts/find-unused-docs: Fix massive false positives
        docs: nvdimm: use ReST notation for subsection
        zram: correct documentation about sysfs node of huge page writeback
        Documentation: zram: various fixes in zram.rst
        Add a maintainer entry profile for documentation
        Add a document on how to contribute to the documentation
        docs: Keep up with the location of NoUri
        Documentation: Call out example SYM_FUNC_* usage as x86-specific
        Documentation: nfs: fault_injection: convert to ReST
        Documentation: nfs: pnfs-scsi-server: convert to ReST
        Documentation: nfs: convert pnfs-block-server to ReST
        Documentation: nfs: idmapper: convert to ReST
        Documentation: convert nfsd-admin-interfaces to ReST
        Documentation: nfs-rdma: convert to ReST
        Documentation: nfsroot.rst: COSMETIC: refill a paragraph
        Documentation: nfsroot.txt: convert to ReST
        Documentation: convert nfs.txt to ReST
        Documentation: filesystems: convert vfat.txt to RST
        ...
      05ef8b97
    • L
      Merge tag 'linux-kselftest-5.6-rc1-kunit' of... · 08a3ef8f
      Linus Torvalds 提交于
      Merge tag 'linux-kselftest-5.6-rc1-kunit' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
      
      Pull Kselftest kunit updates from Shuah Khan:
       "This kunit update consists of:
      
         - Support for building kunit as a module from Alan Maguire
      
         - AppArmor KUnit tests for policy unpack from Mike Salvatore"
      
      * tag 'linux-kselftest-5.6-rc1-kunit' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
        kunit: building kunit as a module breaks allmodconfig
        kunit: update documentation to describe module-based build
        kunit: allow kunit to be loaded as a module
        kunit: remove timeout dependence on sysctl_hung_task_timeout_seconds
        kunit: allow kunit tests to be loaded as a module
        kunit: hide unexported try-catch interface in try-catch-impl.h
        kunit: move string-stream.h to lib/kunit
        apparmor: add AppArmor KUnit tests for policy unpack
      08a3ef8f
    • L
      Merge tag 'linux-kselftest-5.6-rc1' of... · ce7ae9d9
      Linus Torvalds 提交于
      Merge tag 'linux-kselftest-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
      
      Pull Kselftest update from Shuah Khan:
       "This Kselftest update consists of several fixes to framework and
        individual tests.
      
        In addition, it enables LKDTM tests adding lkdtm target to kselftest
        Makefile"
      
      * tag 'linux-kselftest-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
        selftests/ftrace: fix glob selftest
        selftests: settings: tests can be in subsubdirs
        kselftest: Minimise dependency of get_size on C library interfaces
        selftests/livepatch: Remove unused local variable in set_ftrace_enabled()
        selftests/livepatch: Replace set_dynamic_debug() with setup_config() in README
        selftests/lkdtm: Add tests for LKDTM targets
        selftests: Uninitialized variable in test_cgcore_proc_migration()
        selftests: fix build behaviour on targets' failures
      ce7ae9d9
    • L
      Merge tag 'y2038-drivers-for-v5.6-signed' of... · 22b17db4
      Linus Torvalds 提交于
      Merge tag 'y2038-drivers-for-v5.6-signed' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground
      
      Pull y2038 updates from Arnd Bergmann:
       "Core, driver and file system changes
      
        These are updates to device drivers and file systems that for some
        reason or another were not included in the kernel in the previous
        y2038 series.
      
        I've gone through all users of time_t again to make sure the kernel is
        in a long-term maintainable state, replacing all remaining references
        to time_t with safe alternatives.
      
        Some related parts of the series were picked up into the nfsd, xfs,
        alsa and v4l2 trees. A final set of patches in linux-mm removes the
        now unused time_t/timeval/timespec types and helper functions after
        all five branches are merged for linux-5.6, ensuring that no new users
        get merged.
      
        As a result, linux-5.6, or my backport of the patches to 5.4 [1],
        should be the first release that can serve as a base for a 32-bit
        system designed to run beyond year 2038, with a few remaining caveats:
      
         - All user space must be compiled with a 64-bit time_t, which will be
           supported in the coming musl-1.2 and glibc-2.32 releases, along
           with installed kernel headers from linux-5.6 or higher.
      
         - Applications that use the system call interfaces directly need to
           be ported to use the time64 syscalls added in linux-5.1 in place of
           the existing system calls. This impacts most users of futex() and
           seccomp() as well as programming languages that have their own
           runtime environment not based on libc.
      
         - Applications that use a private copy of kernel uapi header files or
           their contents may need to update to the linux-5.6 version, in
           particular for sound/asound.h, xfs/xfs_fs.h, linux/input.h,
           linux/elfcore.h, linux/sockios.h, linux/timex.h and
           linux/can/bcm.h.
      
         - A few remaining interfaces cannot be changed to pass a 64-bit
           time_t in a compatible way, so they must be configured to use
           CLOCK_MONOTONIC times or (with a y2106 problem) unsigned 32-bit
           timestamps. Most importantly this impacts all users of 'struct
           input_event'.
      
         - All y2038 problems that are present on 64-bit machines also apply
           to 32-bit machines. In particular this affects file systems with
           on-disk timestamps using signed 32-bit seconds: ext4 with
           ext3-style small inodes, ext2, xfs (to be fixed soon) and ufs"
      
      [1] https://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground.git/log/?h=y2038-endgame
      
      * tag 'y2038-drivers-for-v5.6-signed' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (21 commits)
        Revert "drm/etnaviv: reject timeouts with tv_nsec >= NSEC_PER_SEC"
        y2038: sh: remove timeval/timespec usage from headers
        y2038: sparc: remove use of struct timex
        y2038: rename itimerval to __kernel_old_itimerval
        y2038: remove obsolete jiffies conversion functions
        nfs: fscache: use timespec64 in inode auxdata
        nfs: fix timstamp debug prints
        nfs: use time64_t internally
        sunrpc: convert to time64_t for expiry
        drm/etnaviv: avoid deprecated timespec
        drm/etnaviv: reject timeouts with tv_nsec >= NSEC_PER_SEC
        drm/msm: avoid using 'timespec'
        hfs/hfsplus: use 64-bit inode timestamps
        hostfs: pass 64-bit timestamps to/from user space
        packet: clarify timestamp overflow
        tsacct: add 64-bit btime field
        acct: stop using get_seconds()
        um: ubd: use 64-bit time_t where possible
        xtensa: ISS: avoid struct timeval
        dlm: use SO_SNDTIMEO_NEW instead of SO_SNDTIMEO_OLD
        ...
      22b17db4
    • L
      Merge tag 'printk-for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk · a4fe2b4d
      Linus Torvalds 提交于
      Pull printk update from Petr Mladek:
       "Prevent replaying log on all consoles"
      
      * tag 'printk-for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
        printk: fix exclusive_console replaying
      a4fe2b4d
    • J
      io_uring: add support for epoll_ctl(2) · 3e4827b0
      Jens Axboe 提交于
      This adds IORING_OP_EPOLL_CTL, which can perform the same work as the
      epoll_ctl(2) system call.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      3e4827b0
    • J
      eventpoll: support non-blocking do_epoll_ctl() calls · 39220e8d
      Jens Axboe 提交于
      Also make it available outside of epoll, along with the helper that
      decides if we need to copy the passed in epoll_event.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      39220e8d
    • J
      eventpoll: abstract out epoll_ctl() handler · 58e41a44
      Jens Axboe 提交于
      No functional changes in this patch.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      58e41a44
    • J
      io_uring: fix linked command file table usage · f86cd20c
      Jens Axboe 提交于
      We're not consistent in how the file table is grabbed and assigned if we
      have a command linked that requires the use of it.
      
      Add ->file_table to the io_op_defs[] array, and use that to determine
      when to grab the table instead of having the handlers set it if they
      need to defer. This also means we can kill the IO_WQ_WORK_NEEDS_FILES
      flag. We always initialize work->files, so io-wq can just check for
      that.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f86cd20c
    • L
      Merge tag 'erofs-for-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs · 3893c202
      Linus Torvalds 提交于
      Pull erofs updates from Gao Xiang:
       "A regression fix, several cleanups and (maybe) plus an upcoming new
        mount api convert patch as a part of vfs update are considered
        available for this cycle.
      
        All commits have been in linux-next and tested with no smoke out.
      
        Summary:
      
         - fix an out-of-bound read access introduced in v5.3, which could
           rarely cause data corruption
      
         - various cleanup patches"
      
      * tag 'erofs-for-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
        erofs: clean up z_erofs_submit_queue()
        erofs: fold in postsubmit_is_all_bypassed()
        erofs: fix out-of-bound read for shifted uncompressed block
        erofs: remove void tagging/untagging of workgroup pointers
        erofs: remove unused tag argument while registering a workgroup
        erofs: remove unused tag argument while finding a workgroup
        erofs: correct indentation of an assigned structure inside a function
      3893c202
    • L
      Merge branch 'work.adfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 53070406
      Linus Torvalds 提交于
      Pull adfs updates from Al Viro:
       "adfs stuff for this cycle"
      
      * 'work.adfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (42 commits)
        fs/adfs: bigdir: Fix an error code in adfs_fplus_read()
        Documentation: update adfs filesystem documentation
        fs/adfs: mostly divorse inode number from indirect disc address
        fs/adfs: super: add support for E and E+ floppy image formats
        fs/adfs: super: extract filesystem block probe
        fs/adfs: dir: remove debug in adfs_dir_update()
        fs/adfs: super: fix inode dropping
        fs/adfs: bigdir: implement directory update support
        fs/adfs: bigdir: calculate and validate directory checkbyte
        fs/adfs: bigdir: directory validation strengthening
        fs/adfs: bigdir: extract directory validation
        fs/adfs: bigdir: factor out directory entry offset calculation
        fs/adfs: newdir: split out directory commit from update
        fs/adfs: newdir: clean up adfs_f_update()
        fs/adfs: newdir: merge adfs_dir_read() into adfs_f_read()
        fs/adfs: newdir: improve directory validation
        fs/adfs: newdir: factor out directory format validation
        fs/adfs: dir: use pointers to access directory head/tails
        fs/adfs: dir: add more efficient iterate() per-format method
        fs/adfs: dir: switch to iterate_shared method
        ...
      53070406
    • L
      Merge branch 'work.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 6aee4bad
      Linus Torvalds 提交于
      Pull openat2 support from Al Viro:
       "This is the openat2() series from Aleksa Sarai.
      
        I'm afraid that the rest of namei stuff will have to wait - it got
        zero review the last time I'd posted #work.namei, and there had been a
        leak in the posted series I'd caught only last weekend. I was going to
        repost it on Monday, but the window opened and the odds of getting any
        review during that... Oh, well.
      
        Anyway, openat2 part should be ready; that _did_ get sane amount of
        review and public testing, so here it comes"
      
      From Aleksa's description of the series:
       "For a very long time, extending openat(2) with new features has been
        incredibly frustrating. This stems from the fact that openat(2) is
        possibly the most famous counter-example to the mantra "don't silently
        accept garbage from userspace" -- it doesn't check whether unknown
        flags are present[1].
      
        This means that (generally) the addition of new flags to openat(2) has
        been fraught with backwards-compatibility issues (O_TMPFILE has to be
        defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
        kernels gave errors, since it's insecure to silently ignore the
        flag[2]). All new security-related flags therefore have a tough road
        to being added to openat(2).
      
        Furthermore, the need for some sort of control over VFS's path
        resolution (to avoid malicious paths resulting in inadvertent
        breakouts) has been a very long-standing desire of many userspace
        applications.
      
        This patchset is a revival of Al Viro's old AT_NO_JUMPS[3] patchset
        (which was a variant of David Drysdale's O_BENEATH patchset[4] which
        was a spin-off of the Capsicum project[5]) with a few additions and
        changes made based on the previous discussion within [6] as well as
        others I felt were useful.
      
        In line with the conclusions of the original discussion of
        AT_NO_JUMPS, the flag has been split up into separate flags. However,
        instead of being an openat(2) flag it is provided through a new
        syscall openat2(2) which provides several other improvements to the
        openat(2) interface (see the patch description for more details). The
        following new LOOKUP_* flags are added:
      
        LOOKUP_NO_XDEV:
      
           Blocks all mountpoint crossings (upwards, downwards, or through
           absolute links). Absolute pathnames alone in openat(2) do not
           trigger this. Magic-link traversal which implies a vfsmount jump is
           also blocked (though magic-link jumps on the same vfsmount are
           permitted).
      
        LOOKUP_NO_MAGICLINKS:
      
           Blocks resolution through /proc/$pid/fd-style links. This is done
           by blocking the usage of nd_jump_link() during resolution in a
           filesystem. The term "magic-links" is used to match with the only
           reference to these links in Documentation/, but I'm happy to change
           the name.
      
           It should be noted that this is different to the scope of
           ~LOOKUP_FOLLOW in that it applies to all path components. However,
           you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
           will *not* fail (assuming that no parent component was a
           magic-link), and you will have an fd for the magic-link.
      
           In order to correctly detect magic-links, the introduction of a new
           LOOKUP_MAGICLINK_JUMPED state flag was required.
      
        LOOKUP_BENEATH:
      
           Disallows escapes to outside the starting dirfd's
           tree, using techniques such as ".." or absolute links. Absolute
           paths in openat(2) are also disallowed.
      
           Conceptually this flag is to ensure you "stay below" a certain
           point in the filesystem tree -- but this requires some additional
           to protect against various races that would allow escape using
           "..".
      
           Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it
           can trivially beam you around the filesystem (breaking the
           protection). In future, there might be similar safety checks done
           as in LOOKUP_IN_ROOT, but that requires more discussion.
      
        In addition, two new flags are added that expand on the above ideas:
      
        LOOKUP_NO_SYMLINKS:
      
           Does what it says on the tin. No symlink resolution is allowed at
           all, including magic-links. Just as with LOOKUP_NO_MAGICLINKS this
           can still be used with NOFOLLOW to open an fd for the symlink as
           long as no parent path had a symlink component.
      
        LOOKUP_IN_ROOT:
      
           This is an extension of LOOKUP_BENEATH that, rather than blocking
           attempts to move past the root, forces all such movements to be
           scoped to the starting point. This provides chroot(2)-like
           protection but without the cost of a chroot(2) for each filesystem
           operation, as well as being safe against race attacks that
           chroot(2) is not.
      
           If a race is detected (as with LOOKUP_BENEATH) then an error is
           generated, and similar to LOOKUP_BENEATH it is not permitted to
           cross magic-links with LOOKUP_IN_ROOT.
      
           The primary need for this is from container runtimes, which
           currently need to do symlink scoping in userspace[7] when opening
           paths in a potentially malicious container.
      
           There is a long list of CVEs that could have bene mitigated by
           having RESOLVE_THIS_ROOT (such as CVE-2017-1002101,
           CVE-2017-1002102, CVE-2018-15664, and CVE-2019-5736, just to name a
           few).
      
        In order to make all of the above more usable, I'm working on
        libpathrs[8] which is a C-friendly library for safe path resolution.
        It features a userspace-emulated backend if the kernel doesn't support
        openat2(2). Hopefully we can get userspace to switch to using it, and
        thus get openat2(2) support for free once it's ready.
      
        Future work would include implementing things like
        RESOLVE_NO_AUTOMOUNT and possibly a RESOLVE_NO_REMOTE (to allow
        programs to be sure they don't hit DoSes though stale NFS handles)"
      
      * 'work.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        Documentation: path-lookup: include new LOOKUP flags
        selftests: add openat2(2) selftests
        open: introduce openat2(2) syscall
        namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
        namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
        namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
        namei: LOOKUP_NO_XDEV: block mountpoint crossing
        namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
        namei: LOOKUP_NO_SYMLINKS: block symlink resolution
        namei: allow set_root() to produce errors
        namei: allow nd_jump_link() to produce errors
        nsfs: clean-up ns_get_path() signature to return int
        namei: only return -ECHILD from follow_dotdot_rcu()
      6aee4bad
    • L
      Merge branch 'urgent-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu · 15d66324
      Linus Torvalds 提交于
      Pull RCU warning removal from Paul McKenney:
       "A single commit that fixes an embarrassing bug discussed here:
      
            https://lore.kernel.org/lkml/20200125131425.GB16136@zn.tnic/
      
        which apparently also affects smaller systems"
      
      [ This was sent to Ingo, but since I see the issue on the laptop I use for
        testing during the merge window, I'm doing the pull directly     - Linus ]
      
      * 'urgent-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
        rcu: Forgive slow expedited grace periods at boot time
      15d66324
    • L
      Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 80b60e38
      Linus Torvalds 提交于
      Pull core fixes from Ingo Molnar:
       "Three objtool fixes, plus marking SFI as obsolete"
      
      * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        objtool: Skip samples subdirectory
        objtool: Fix ARCH=x86_64 build error
        objtool: Silence build output
        MAINTAINERS: Mark simple firmware interface (SFI) obsolete
      80b60e38