1. 29 Mar 2019, 2 commits
    • KVM: arm/arm64: Add KVM_ARM_VCPU_FINALIZE ioctl · 7dd32a0d
      Committed by Dave Martin
      Some aspects of vcpu configuration may be too complex to be
      completed inside KVM_ARM_VCPU_INIT.  Thus, there may be a
      requirement for userspace to do some additional configuration
      before various other ioctls will work in a consistent way.
      
      In particular this will be the case for SVE, where userspace will
      need to negotiate the set of vector lengths to be made available to
      the guest before the vcpu becomes fully usable.
      
      In order to provide an explicit way for userspace to confirm that
      it has finished setting up a particular vcpu feature, this patch
      adds a new ioctl KVM_ARM_VCPU_FINALIZE.
      
      When userspace has opted into a feature that requires finalization,
      typically by means of a feature flag passed to KVM_ARM_VCPU_INIT, a
      matching call to KVM_ARM_VCPU_FINALIZE is now required before
      KVM_RUN or KVM_GET_REG_LIST is allowed.  Individual features may
      impose additional restrictions where appropriate.
      
      No existing vcpu features are affected by this, so current
      userspace implementations will continue to work exactly as before,
      with no need to issue KVM_ARM_VCPU_FINALIZE.
      
      As implemented in this patch, KVM_ARM_VCPU_FINALIZE is currently a
      placeholder: no finalizable features exist yet, so the ioctl is not
      required and will always yield EINVAL.  Subsequent patches will add
      the finalization logic to make use of this ioctl for SVE.
      
      No functional change for existing userspace.
      Signed-off-by: Dave Martin <Dave.Martin@arm.com>
      Reviewed-by: Julien Thierry <julien.thierry@arm.com>
      Tested-by: zhang.lei <zhang.lei@jp.fujitsu.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
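      The init/finalize/run ordering this describes can be sketched as a tiny userspace model. The struct and function names below are hypothetical, and the error values are illustrative choices consistent with the text (EINVAL when nothing is finalizable), not the kernel's implementation:

```c
/*
 * Model (not kernel code) of the KVM_ARM_VCPU_FINALIZE ordering contract:
 * a feature requested at KVM_ARM_VCPU_INIT time that needs finalization
 * blocks KVM_RUN / KVM_GET_REG_LIST until KVM_ARM_VCPU_FINALIZE is called.
 */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct model_vcpu {
    bool needs_finalize;   /* a finalizable feature was requested */
    bool finalized;        /* KVM_ARM_VCPU_FINALIZE has been called */
};

/* KVM_ARM_VCPU_INIT: the feature flag decides if finalization is needed */
static void model_vcpu_init(struct model_vcpu *v, bool finalizable_feature)
{
    v->needs_finalize = finalizable_feature;
    v->finalized = false;
}

/* KVM_ARM_VCPU_FINALIZE: EINVAL when there is nothing to finalize */
static int model_vcpu_finalize(struct model_vcpu *v)
{
    if (!v->needs_finalize)
        return -EINVAL;
    v->finalized = true;
    return 0;
}

/* KVM_RUN / KVM_GET_REG_LIST gate (error value chosen for the model) */
static int model_vcpu_run(struct model_vcpu *v)
{
    if (v->needs_finalize && !v->finalized)
        return -EPERM;     /* finalization still pending: refuse */
    return 0;
}
```

      Existing userspace corresponds to the `finalizable_feature == false` path: finalize is never required, and run works as before.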
    • KVM: Allow 2048-bit register access via ioctl interface · 2b953ea3
      Committed by Dave Martin
      The Arm SVE architecture defines registers that are up to 2048 bits
      in size (with some possibility of further future expansion).
      
      In order to avoid the need for an excessively large number of
      ioctls when saving and restoring a vcpu's registers, this patch
      adds a #define to make support for individual 2048-bit registers
      through the KVM_{GET,SET}_ONE_REG ioctl interface official.  This
      will allow each SVE register to be accessed in a single call.
      
      There are sufficient spare bits in the register id size field for
      this change, so there is no ABI impact, providing that
      KVM_GET_REG_LIST does not enumerate any 2048-bit register unless
      userspace explicitly opts in to the relevant architecture-specific
      features.
      Signed-off-by: Dave Martin <Dave.Martin@arm.com>
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Tested-by: zhang.lei <zhang.lei@jp.fujitsu.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
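      The "spare bits in the register id size field" can be seen from the ONE_REG id encoding: bits 55:52 hold log2 of the access size in bytes. The constants below follow the pattern of the uapi <linux/kvm.h> values (KVM_REG_SIZE_U64 is 0x0030000000000000ULL) but are redefined locally so the sketch is self-contained:

```c
/* Sketch: decoding the access size from a KVM_{GET,SET}_ONE_REG id. */
#include <assert.h>
#include <stdint.h>

#define KVM_REG_SIZE_MASK   0x00f0000000000000ULL
#define KVM_REG_SIZE_SHIFT  52
#define KVM_REG_SIZE_U64    0x0030000000000000ULL  /* log2(8)   = 3 */
#define KVM_REG_SIZE_U2048  0x0080000000000000ULL  /* log2(256) = 8 */

/* Size in bytes of the register a given id addresses. */
static uint64_t kvm_reg_size_bytes(uint64_t id)
{
    return 1ULL << ((id & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT);
}
```

      With a 4-bit field the encoding tops out at 2^15-byte registers, which is why 2048-bit (256-byte) values fit with room to spare.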
  2. 08 Mar 2019, 3 commits
  3. 07 Mar 2019, 1 commit
    • io_uring: add support for IORING_OP_POLL · 221c5eb2
      Committed by Jens Axboe
      This is basically a direct port of bfe4037e, which implements a
      one-shot poll command through aio. Description below is based on that
      commit as well. However, instead of adding a POLL command and relying
      on io_cancel(2) to remove it, we mimic the epoll(2) interface of
      having a command to add a poll notification, IORING_OP_POLL_ADD,
      and one to remove it again, IORING_OP_POLL_REMOVE.
      
      To poll for a file descriptor the application should submit an sqe of
      type IORING_OP_POLL_ADD.  It will poll the fd for the events specified
      in the poll_events field.
      
      Unlike poll or epoll without EPOLLONESHOT, this interface always works
      in one-shot mode: once the sqe is completed, it has to be resubmitted.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Based-on-code-from: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 06 Mar 2019, 2 commits
    • mm/memfd: add an F_SEAL_FUTURE_WRITE seal to memfd · ab3948f5
      Committed by Joel Fernandes (Google)
      Android uses ashmem for sharing memory regions.  We are looking forward
      to migrating all usecases of ashmem to memfd so that we can possibly
      remove the ashmem driver in the future from staging while also
      benefiting from using memfd and contributing to it.  Note staging
      drivers are also not ABI and generally can be removed at anytime.
      
      One of the main usecases Android has is the ability to create a region
      and mmap it as writeable, then add protection against making any
      "future" writes while keeping the existing already mmap'ed
      writeable-region active.  This allows us to implement a usecase where
      receivers of the shared memory buffer can get a read-only view, while
      the sender continues to write to the buffer.  See CursorWindow
      documentation in Android for more details:
      
        https://developer.android.com/reference/android/database/CursorWindow
      
      This usecase cannot be implemented with the existing F_SEAL_WRITE seal.
      To support the usecase, this patch adds a new F_SEAL_FUTURE_WRITE seal
      which prevents any future mmap and write syscalls from succeeding while
      keeping the existing mmap active.
      
      A better way to do the F_SEAL_FUTURE_WRITE seal was discussed [1] last
      week, one where we don't need to modify core VFS structures to get the
      same behavior of the seal.  This solves several side-effects pointed
      out by Andy.  Self-tests are provided in a later patch to verify the
      expected semantics.
      
      [1] https://lore.kernel.org/lkml/20181111173650.GA256781@google.com/
      
      Thanks a lot to Andy for suggestions to improve code.
      
      Link: http://lkml.kernel.org/r/20190112203816.85534-2-joel@joelfernandes.org
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Acked-by: John Stultz <john.stultz@linaro.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: J. Bruce Fields <bfields@fieldses.org>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Marc-André Lureau <marcandre.lureau@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
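      The described semantics can be exercised from userspace with a short sketch, assuming a Linux 5.1+ kernel (fallback defines are provided in case the libc headers predate the seal): after F_SEAL_FUTURE_WRITE is applied, write(2) and new shared writable mappings are refused, while a mapping established beforehand keeps working.

```c
/* Sketch: sender keeps its writable mapping, future writers are sealed out. */
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef F_ADD_SEALS
#define F_ADD_SEALS (1024 + 9)
#endif
#ifndef F_SEAL_FUTURE_WRITE
#define F_SEAL_FUTURE_WRITE 0x0010
#endif
#ifndef MFD_ALLOW_SEALING
#define MFD_ALLOW_SEALING 0x0002U
#endif

static int demo(void)
{
    int fd = syscall(SYS_memfd_create, "future-write-demo", MFD_ALLOW_SEALING);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, 4096) != 0)
        return -2;

    /* Sender's view: established while the region is still writable. */
    char *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return -3;

    if (fcntl(fd, F_ADD_SEALS, F_SEAL_FUTURE_WRITE) != 0)
        return -4;

    /* Future write syscalls are refused... */
    if (write(fd, "x", 1) != -1 || errno != EPERM)
        return -5;
    /* ...and so are new shared writable mappings... */
    if (mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0) != MAP_FAILED)
        return -6;
    /* ...but the pre-existing mapping still accepts writes. */
    map[0] = 'y';

    munmap(map, 4096);
    close(fd);
    return 0;
}
```

      A receiver would mmap the same fd read-only, getting the read-only view the commit describes.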
    • mm: convert PG_balloon to PG_offline · ca215086
      Committed by David Hildenbrand
      PG_balloon was introduced to implement page migration/compaction for
      pages inflated in virtio-balloon.  Nowadays, it is only a marker that a
      page is part of virtio-balloon and therefore logically offline.
      
      We also want to make use of this flag in other balloon drivers - for
      inflated pages or when onlining a section but keeping some pages offline
      (e.g.  used right now by XEN and Hyper-V via set_online_page_callback()).
      
      We are going to expose this flag to dump tools like makedumpfile.  But
      instead of exposing PG_balloon, let's generalize the concept of marking
      pages as logically offline, so it can be reused for other purposes later
      on.
      
      Rename PG_balloon to PG_offline.  This is an indicator that the page is
      logically offline, the content stale and that it should not be touched
      (e.g.  a hypervisor would have to allocate backing storage in order for
      the guest to dump an unused page).  We can then e.g.  exclude such pages
      from dumps.
      
      We replace and reuse KPF_BALLOON (23), as this shouldn't really harm
      (and for now the semantics stay the same).  In following patches, we
      will make use of this bit also in other balloon drivers.  While at it,
      document PGTABLE.
      
      [akpm@linux-foundation.org: fix comment text, per David]
      Link: http://lkml.kernel.org/r/20181119101616.8901-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Pankaj gupta <pagupta@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Christian Hansen <chansen3@cisco.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Julien Freche <jfreche@vmware.com>
      Cc: Kairui Song <kasong@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Lianbo Jiang <lijiang@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xavier Deguillard <xdeguillard@vmware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 04 Mar 2019, 2 commits
    • net: ipv6: add socket option IPV6_ROUTER_ALERT_ISOLATE · 9036b2fe
      Committed by Francesco Ruggeri
      By default, an IPv6 socket with the IPV6_ROUTER_ALERT socket option set
      will receive all IPv6 RA packets from all namespaces.
      The IPV6_ROUTER_ALERT_ISOLATE socket option restricts the packets
      received by the socket to those from the socket's own namespace.
      Signed-off-by: Maxim Martynov <maxim@arista.com>
      Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sch_cake: Permit use of connmarks as tin classifiers · 0b5c7efd
      Committed by Kevin Darbyshire-Bryant
      Add the flag 'FWMARK' to enable use of firewall connmarks as the tin
      selector.  The connmark (skbuff->mark) needs to be in the range
      1..tin_cnt, i.e. for diffserv3 the mark needs to be 1..3.
      
      Background
      
      Typically CAKE uses DSCP as the basis for tin selection.  DSCP values
      are relatively easily changed as part of the egress path, usually with
      iptables & the mangle table, ingress is more challenging.  CAKE is often
      used on the WAN interface of a residential gateway where passthrough of
      DSCP from the ISP is either missing or set to unhelpful values thus use
      of ingress DSCP values for tin selection isn't helpful in that
      environment.
      
      An approach to solving the ingress tin selection problem is to use
      CAKE's understanding of tc filters.  Naive tc filters could match on
      source/destination port numbers and force tin selection that way, but
      multiple filters don't scale particularly well as each filter must be
      traversed whether it matches or not. e.g. a simple example to map 3
      firewall marks to tins:
      
      MAJOR=$( tc qdisc show dev $DEV | head -1 | awk '{print $3}' )
      tc filter add dev $DEV parent $MAJOR protocol all handle 0x01 fw action skbedit priority ${MAJOR}1
      tc filter add dev $DEV parent $MAJOR protocol all handle 0x02 fw action skbedit priority ${MAJOR}2
      tc filter add dev $DEV parent $MAJOR protocol all handle 0x03 fw action skbedit priority ${MAJOR}3
      
      Another option is to use eBPF cls_act with tc filters e.g.
      
      MAJOR=$( tc qdisc show dev $DEV | head -1 | awk '{print $3}' )
      tc filter add dev $DEV parent $MAJOR bpf da obj my-bpf-fwmark-to-class.o
      
      This has the disadvantages of a) needing someone to write & maintain
      the bpf program, b) a bpf toolchain to compile it and c) needing to
      hardcode the major number in the bpf program so it matches the cake
      instance (or forcing the cake instance to a particular major number)
      since the major number cannot be passed to the bpf program via tc
      command line.
      
      As already hinted at by the previous examples, it would be helpful
      to associate tins with something that survives the Internet path and
      ideally allows tin selection on both egress and ingress.  Netfilter's
      conntrack permits setting an identifying mark on a connection which
      can also be restored to an ingress packet with tc action connmark e.g.
      
      tc filter add dev eth0 parent ffff: protocol all prio 10 u32 \
      	match u32 0 0 flowid 1:1 action connmark action mirred egress redirect dev ifb1
      
      Since tc's connmark action restores the connmark into skb->mark, the
      previous solutions all build upon it, copying that mark in one form or
      another into the skb->priority field, where again CAKE picks it up.
      
      This change cuts out at least one of the (less intuitive &
      non-scalable) middlemen and permits direct access to skb->mark.
      Signed-off-by: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 03 Mar 2019, 1 commit
    • bpf: add bpf helper bpf_skb_ecn_set_ce · f7c917ba
      Committed by brakmo
      This patch adds a new bpf helper BPF_FUNC_skb_ecn_set_ce,
      "int bpf_skb_ecn_set_ce(struct sk_buff *skb)".  It is added to the
      BPF_PROG_TYPE_CGROUP_SKB type of bpf_prog, which currently can
      be attached to the ingress and egress path.  The helper is needed
      because this type of bpf_prog cannot modify the skb directly.
      
      This helper is used to set the ECN field of ECN-capable IP packets to ce
      (congestion encountered) in the IPv6 or IPv4 header of the skb.  It can
      be used by a bpf_prog to manage egress or ingress network bandwidth
      limits per cgroupv2 by inducing an ECN response in the TCP sender.
      This works best when using DCTCP.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
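      The helper's effect on the two ECN bits can be modeled in a few lines (this is a sketch of the semantics, not the kernel code): in both the IPv4 TOS byte and the IPv6 traffic class, the low two bits carry ECN, where ECT(0)=0b10 and ECT(1)=0b01 mark an ECN-capable packet, and CE is 0b11.

```c
/* Model of "set CE if the packet is ECN-capable". */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INET_ECN_MASK 0x3u
#define INET_ECN_CE   0x3u

/* Returns true if the field was changed to CE. */
static bool model_ecn_set_ce(uint8_t *tos)
{
    uint8_t ecn = *tos & INET_ECN_MASK;
    if (ecn == 0 || ecn == INET_ECN_CE)
        return false;          /* not ECN-capable, or already CE */
    *tos |= INET_ECN_CE;
    return true;
}
```

      A TCP sender using DCTCP reacts to the CE marks proportionally, which is what makes this useful for per-cgroup bandwidth management.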
  7. 28 Feb 2019, 7 commits
    • io_uring: add submission polling · 6c271ce2
      Committed by Jens Axboe
      This enables an application to do IO, without ever entering the kernel.
      By using the SQ ring to fill in new sqes and watching for completions
      on the CQ ring, we can submit and reap IOs without doing a single system
      call. The kernel side thread will poll for new submissions, and in case
      of HIPRI/polled IO, it'll also poll for completions.
      
      By default, we allow 1 second of active spinning.  This can be changed
      by passing in a different grace period at io_uring_register(2) time.
      If the thread exceeds this idle time without having any work to do, it
      will set:
      
      sq_ring->flags |= IORING_SQ_NEED_WAKEUP.
      
      The application will have to call io_uring_enter() to start things back
      up again. If IO is kept busy, that will never be needed. Basically an
      application that has this feature enabled will guard its
      io_uring_enter(2) call with:
      
      read_barrier();
      if (*sq_ring->flags & IORING_SQ_NEED_WAKEUP)
      	io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP);
      
      instead of calling it unconditionally.
      
      It's mandatory to use fixed files with this feature. Failure to do so
      will result in the application getting an -EBADF CQ entry when
      submitting IO.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
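      The wakeup guard shown above can be sketched as a self-contained function. The flag value matches the io_uring UAPI pattern (IORING_SQ_NEED_WAKEUP is bit 0), defined locally so the example stands alone:

```c
/* Sketch: only syscall when the kernel-side submission poller went idle. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IORING_SQ_NEED_WAKEUP (1U << 0)

/*
 * After queueing new sqes, the application needs io_uring_enter(2) only
 * when the poller thread has exceeded its grace period and set the flag.
 * In real code a read barrier belongs before this load.
 */
static bool need_enter_for_wakeup(const uint32_t *sq_flags)
{
    return (*sq_flags & IORING_SQ_NEED_WAKEUP) != 0;
}
```

      While IO is kept busy the flag stays clear, so the submission path stays entirely in userspace.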
    • io_uring: add file set registration · 6b06314c
      Committed by Jens Axboe
      We normally have to fget/fput for each IO we do on a file. Even with
      the batching we do, the cost of the atomic inc/dec of the file usage
      count adds up.
      
      This adds IORING_REGISTER_FILES, and IORING_UNREGISTER_FILES opcodes
      for the io_uring_register(2) system call. The arguments passed in must
      be an array of __s32 holding file descriptors, and nr_args should hold
      the number of file descriptors the application wishes to pin for the
      duration of the io_uring instance (or until IORING_UNREGISTER_FILES is
      called).
      
      When used, the application must set IOSQE_FIXED_FILE in the sqe->flags
      member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd
      to the index in the array passed in to IORING_REGISTER_FILES.
      
      Files are automatically unregistered when the io_uring instance is torn
      down. An application need only unregister if it wishes to register a new
      set of fds.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
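      The fixed-file lookup described above can be modeled in a few lines (a sketch of the concept, not the kernel's internals). IOSQE_FIXED_FILE is bit 0 of sqe->flags in the io_uring UAPI; the local table stands in for the registered fd array:

```c
/* Model: with IOSQE_FIXED_FILE set, sqe->fd indexes the registered table. */
#include <assert.h>
#include <stdint.h>

#define IOSQE_FIXED_FILE (1U << 0)

/* Resolve sqe->fd: returns the real fd, or -1 for a bad fixed index
 * (the kernel reports this as -EBADF on the CQ entry). */
static int resolve_fd(uint8_t sqe_flags, int32_t sqe_fd,
                      const int *registered, unsigned nr_registered)
{
    if (!(sqe_flags & IOSQE_FIXED_FILE))
        return sqe_fd;                       /* plain descriptor */
    if (sqe_fd < 0 || (unsigned)sqe_fd >= nr_registered)
        return -1;
    return registered[sqe_fd];
}
```

      The point of the indirection is that the fget/fput and its atomic refcount traffic happen once at registration time, not per IO.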
    • io_uring: add support for pre-mapped user IO buffers · edafccee
      Committed by Jens Axboe
      If we have fixed user buffers, we can map them into the kernel when we
      setup the io_uring. That avoids the need to do get_user_pages() for
      each and every IO.
      
      To utilize this feature, the application must call io_uring_register()
      after having setup an io_uring instance, passing in
      IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to
      an iovec array, and the nr_args should contain how many iovecs the
      application wishes to map.
      
      If successful, these buffers are now mapped into the kernel, eligible
      for IO. To use these fixed buffers, the application must use the
      IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
      set sqe->index to the desired buffer index. sqe->addr..sqe->addr+sqe->len
      must point to somewhere inside the indexed buffer.
      
      The application may register buffers throughout the lifetime of the
      io_uring instance. It can call io_uring_register() with
      IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
      buffers, and then register a new set. The application need not
      unregister buffers explicitly before shutting down the io_uring
      instance.
      
      It's perfectly valid to setup a larger buffer, and then sometimes only
      use parts of it for an IO. As long as the range is within the originally
      mapped region, it will work just fine.
      
      For now, buffers must not be file backed. If file backed buffers are
      passed in, the registration will fail with -1/EOPNOTSUPP. This
      restriction may be relaxed in the future.
      
      RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat
      arbitrary 1G per buffer size is also imposed.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: support for IO polling · def596e9
      Committed by Jens Axboe
      Add support for a polled io_uring instance. When a read or write is
      submitted to a polled io_uring, the application must poll for
      completions on the CQ ring through io_uring_enter(2). Polled IO may not
      generate IRQ completions, hence they need to be actively found by the
      application itself.
      
      To use polling, io_uring_setup() must be used with the
      IORING_SETUP_IOPOLL flag being set. It is illegal to mix and match
      polled and non-polled IO on an io_uring.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add fsync support · c992fe29
      Committed by Christoph Hellwig
      Add a new fsync opcode, which either syncs a range if one is passed,
      or the whole file if the offset and length fields are both cleared
      to zero.  A flag is provided to request fdatasync semantics, that is,
      only force out the metadata which is required to retrieve the file
      data, but not other metadata.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
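      The two behaviours the opcode exposes have direct userspace analogues, sketched below on a POSIX system: fsync(2) flushes data plus all metadata, fdatasync(2) only what is needed to read the data back. (The io_uring opcode additionally supports a byte range; the plain syscalls have no range variant.)

```c
/* Sketch: plain-syscall equivalents of the fsync opcode's two modes. */
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

static int demo(void)
{
    char path[] = "/tmp/fsync-demo-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;

    if (write(fd, "hello", 5) != 5)
        return -2;

    if (fdatasync(fd) != 0)   /* datasync-flag style: data + needed metadata */
        return -3;
    if (fsync(fd) != 0)       /* default style: full flush, all metadata */
        return -4;

    close(fd);
    unlink(path);
    return 0;
}
```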
    • Add io_uring IO interface · 2b188cc1
      Committed by Jens Axboe
      The submission queue (SQ) and completion queue (CQ) rings are shared
      between the application and the kernel. This eliminates the need to
      copy data back and forth to submit and complete IO.
      
      IO submissions use the io_uring_sqe data structure, and completions
      are generated in the form of io_uring_cqe data structures. The SQ
      ring is an index into the io_uring_sqe array, which makes it possible
      to submit a batch of IOs without them being contiguous in the ring.
      The CQ ring is always contiguous, as completion events are inherently
      unordered, and hence any io_uring_cqe entry can point back to an
      arbitrary submission.
      
      Two new system calls are added for this:
      
      io_uring_setup(entries, params)
      	Sets up an io_uring instance for doing async IO. On success,
      	returns a file descriptor that the application can mmap to
      	gain access to the SQ ring, CQ ring, and io_uring_sqes.
      
      io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
      	Initiates IO against the rings mapped to this fd, or waits for
      	them to complete, or both. The behavior is controlled by the
      	parameters passed in. If 'to_submit' is non-zero, then we'll
      	try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
      	kernel will wait for 'min_complete' events, if they aren't
      	already available. It's valid to set IORING_ENTER_GETEVENTS
      	and 'min_complete' == 0 at the same time, this allows the
      	kernel to return already completed events without waiting
      	for them. This is useful only for polling, as for IRQ
      	driven IO, the application can just check the CQ ring
      	without entering the kernel.
      
      With this setup, it's possible to do async IO with a single system
      call. Future developments will enable polled IO with this interface,
      and polled submission as well. The latter will enable an application
      to do IO without doing ANY system calls at all.
      
      For IRQ driven IO, an application only needs to enter the kernel for
      completions if it wants to wait for them to occur.
      
      Each io_uring is backed by a workqueue, to support buffered async IO
      as well. We will only punt to an async context if the command would
      need to wait for IO on the device side. Any data that can be accessed
      directly in the page cache is done inline. This avoids the slowness
      issue of usual threadpools, since cached data is accessed as quickly
      as a sync interface.
      
      Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
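      The SQ-ring indirection described above (the ring holds indices into the sqe array, so a batch of submissions need not occupy contiguous sqe slots) can be sketched as a small model; the struct and names here are illustrative, not the kernel's layout:

```c
/* Model: SQ ring entries are indices into the sqe array. */
#include <assert.h>
#include <stdint.h>

#define ENTRIES   8              /* power of two, as io_uring requires */
#define RING_MASK (ENTRIES - 1)

struct model_sq {
    uint32_t sqes[ENTRIES];      /* stand-in for struct io_uring_sqe slots */
    uint32_t ring[ENTRIES];      /* indices into sqes[] */
    uint32_t tail;               /* advanced by the application */
};

/* Queue the sqe stored at array slot idx, wherever that slot is. */
static void sq_push(struct model_sq *sq, uint32_t idx)
{
    sq->ring[sq->tail & RING_MASK] = idx;
    sq->tail++;
}

/* The array slot the kernel would consume next, given its head position. */
static uint32_t sq_peek(const struct model_sq *sq, uint32_t head)
{
    return sq->ring[head & RING_MASK];
}
```

      The CQ side needs no such indirection: completions arrive in arbitrary order anyway, so cqes are written contiguously and each carries user data pointing back to its submission.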
    • bpf: expose program stats via bpf_prog_info · 5f8f8b93
      Committed by Alexei Starovoitov
      Return bpf program run_time_ns and run_cnt via bpf_prog_info.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  8. 26 Feb 2019, 3 commits
  9. 25 Feb 2019, 2 commits
    • btrfs: introduce new ioctl to unregister a btrfs device · 228a73ab
      Committed by Anand Jain
      Support for a new command that can be used e.g. as
      
        $ btrfs device scan --forget [dev]
      
      (the final name may change though) to undo the effects of
      'btrfs device scan [dev]'.  For this purpose this patch proposes to
      use ioctl #5 as it was empty and is next to the SCAN ioctl.
      
      The new ioctl BTRFS_IOC_FORGET_DEV works only on the control device
      (/dev/btrfs-control) to unregister one or all devices, devices that are
      not mounted.
      
      The argument is struct btrfs_ioctl_vol_args; ::name specifies the
      device path.  To unregister all devices, the path is an empty string.
      
      Again, the devices are removed only if they aren't part of a mounted
      filesystem.
      
      This new ioctl provides:
      
      - release of unwanted btrfs_fs_devices and btrfs_devices structures
        from memory if the device is not going to be mounted
      
      - the ability to mount a filesystem in degraded mode, when one device
        is corrupted, like in a split-brain raid1
      
      - running test cases which would require reloading the kernel module,
        but this is not possible e.g. due to a mounted filesystem or a
        built-in module
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • net: phy: improve definition of __ETHTOOL_LINK_MODE_MASK_NBITS · e728fdf0
      Committed by Heiner Kallweit
      The way __ETHTOOL_LINK_MODE_MASK_NBITS is defined seems overly
      complicated; go with a standard approach instead.
      Whilst we're at it, move the comment to the right place.
      
      v2:
      - rebased
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
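      The "standard approach" referred to is the common C sentinel-enum idiom, shown here generically with made-up names rather than the actual ethtool entries: a trailing enumerator counts the list, so the bitmask width tracks additions automatically.

```c
/* Generic sketch of the sentinel-enumerator counting idiom. */
#include <assert.h>

enum link_mode {                    /* illustrative subset */
    LINK_MODE_10BASET_HALF,
    LINK_MODE_10BASET_FULL,
    LINK_MODE_100BASET_FULL,

    /* must stay last */
    __LINK_MODE_LAST
};

#define LINK_MODE_MASK_NBITS ((int)__LINK_MODE_LAST)
```

      Adding a new mode before the sentinel bumps the count with no further edits, which is exactly what a hand-maintained NBITS define cannot guarantee.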
  10. 22 Feb 2019, 2 commits
  11. 21 Feb 2019, 1 commit
  12. 20 Feb 2019, 2 commits
  13. 19 Feb 2019, 3 commits
  14. 18 Feb 2019, 1 commit
    • devlink: add flash update command · 76726ccb
      Committed by Jakub Kicinski
      Add devlink flash update command. Advanced NICs have firmware
      stored in flash and often cryptographically secured. Updating
      that flash is handled by management firmware. Ethtool has a
      flash update command which served us well, however, it has two
      shortcomings:
       - it takes rtnl_lock unnecessarily - really flash update has
         nothing to do with networking, so using a networking device
         as a handle is suboptimal, which leads us to the second one:
       - it requires a functioning netdev - in case device enters an
         error state and can't spawn a netdev (e.g. communication
         with the device fails) there is no netdev to use as a handle
         for flashing.
      
      Devlink already has the ability to report the firmware versions,
      now with the ability to update the firmware/flash we will be
      able to recover devices in bad state.
      
      To enable updates of sub-components of the FW allow passing
      component name.  This name should correspond to one of the
      versions reported in devlink info.
      
      v1: - replace target id with component name (Jiri).
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 15 Feb 2019, 2 commits
    • errqueue.h: Include time_types.h · 460a2db0
      Committed by Deepa Dinamani
      Now that we have a separate header for struct __kernel_timespec,
      include it directly without relying on userspace to do it.
      Reported-by: Ran Rozenstein <ranro@mellanox.com>
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • time: Add time_types.h · ca5e9aba
      Committed by Deepa Dinamani
      sys/time.h is the mandated include for many time-related
      defines.  However, linux/time.h overlaps sys/time.h
      significantly, and this makes including both from userspace,
      or one from the other, impossible.
      
      This also means that userspace can get away with including
      sys/time.h whenever it needs linux/time.h, and this is what
      usually happens in the user world.
      
      But we have new data types that we plan to use in the uapi time
      interfaces which are also defined in linux/time.h, and we are
      unable to use these types when sys/time.h is included.
      
      Hence, move the new types to a new header, time_types.h.
      We intend to eventually have all the uapi defines that the kernel
      uses defined in this header.
      Note that the plan is to replace uapi interfaces with timeval to
      use __kernel_old_timeval, timespec to use __kernel_old_timespec etc.
      Reported-by: Ran Rozenstein <ranro@mellanox.com>
      Fixes: 9718475e ("socket: Add SO_TIMESTAMPING_NEW")
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
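      The central type the new header carries can be checked with a short sketch, assuming Linux UAPI headers from v5.1 or later are installed: struct __kernel_timespec uses a 64-bit seconds field on every ABI, so its layout is identical on 32- and 64-bit builds.

```c
/* Sketch: __kernel_timespec is fixed-layout across ABIs. */
#include <assert.h>
#include <linux/time_types.h>

/* 64-bit tv_sec plus a 64-bit tv_nsec slot: 16 bytes everywhere. */
_Static_assert(sizeof(struct __kernel_timespec) == 16,
               "__kernel_timespec is not fixed-size");
_Static_assert(sizeof(((struct __kernel_timespec *)0)->tv_sec) == 8,
               "tv_sec is not 64-bit");
```

      Because the header pulls in none of the sys/time.h names, it can coexist with either include order, which is the point of the split.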
  16. 14 Feb 2019, 2 commits
    • bpf: add plumbing for BPF_LWT_ENCAP_IP in bpf_lwt_push_encap · 3e0bd37c
      Committed by Peter Oskolkov
      This patch adds all needed plumbing in preparation to allowing
      bpf programs to do IP encapping via bpf_lwt_push_encap. Actual
      implementation is added in the next patch in the patchset.
      
      Of note:
      - bpf_lwt_push_encap can now be called from BPF_PROG_TYPE_LWT_XMIT
        prog types in addition to BPF_PROG_TYPE_LWT_IN;
      - if the skb being encapped has GSO set, encapsulation is limited
        to IPIP/IP+GRE/IP+GUE (both IPv4 and IPv6);
      - as route lookups are different for ingress vs egress, the single
        external bpf_lwt_push_encap BPF helper is routed internally to
        either bpf_lwt_in_push_encap or bpf_lwt_xmit_push_encap BPF_CALLs,
        depending on prog type.
      
      v8 changes: fixed a typo.
      Signed-off-by: Peter Oskolkov <posk@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • net: phy: Add generic support for 2.5GBaseT and 5GBaseT · 7fd8afa8
      Committed by Maxime Chevallier
      The 802.3bz specification, based on previous work by the NBASE-T
      Alliance, defines the 2.5GBaseT and 5GBaseT link modes for ethernet
      traffic on cat5e, cat6 and cat7 cables.
      
      These modes integrate with the already defined C45 MDIO PMA/PMD
      register set that added 10G support, by defining some previously
      reserved bits and adding a new register (2.5G/5G Extended abilities).
      
      This commit adds the required definitions in include/uapi/linux/mdio.h
      to support these modes, and detect when a link-partner advertises them.
      
      It also adds support for these modes in the generic C45 PHY
      infrastructure.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
      7fd8afa8
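A driver detecting these abilities boils down to testing bits in the new register. The sketch below assumes the bit names this commit adds to include/uapi/linux/mdio.h (2.5G/5G Extended Abilities); check the header for the authoritative values:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match the definitions added in include/uapi/linux/mdio.h. */
#define MDIO_PMA_NG_EXTABLE_2_5GBT 0x0001 /* 2.5GBaseT ability */
#define MDIO_PMA_NG_EXTABLE_5GBT   0x0002 /* 5GBaseT ability */

struct ng_abilities {
	bool baset_2_5g;
	bool baset_5g;
};

/* Decode a read of the 2.5G/5G Extended Abilities register. */
static struct ng_abilities decode_ng_extable(uint16_t reg)
{
	struct ng_abilities a = {
		.baset_2_5g = (reg & MDIO_PMA_NG_EXTABLE_2_5GBT) != 0,
		.baset_5g   = (reg & MDIO_PMA_NG_EXTABLE_5GBT) != 0,
	};
	return a;
}
```

The generic C45 code then maps each set bit to the corresponding ethtool link mode when building the supported/advertised masks.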
17. 13 Feb 2019, 2 commits
    • C
      fuse: support clients that don't implement 'opendir' · d9a9ea94
Committed by Chad Austin
      Allow filesystems to return ENOSYS from opendir, preventing the kernel from
      sending opendir and releasedir messages in the future. This avoids
      userspace transitions when filesystems don't need to keep track of state
      per directory handle.
      
      A new capability flag, FUSE_NO_OPENDIR_SUPPORT, parallels
      FUSE_NO_OPEN_SUPPORT, indicating the new semantics for returning ENOSYS
      from opendir.
Signed-off-by: Chad Austin <chadaustin@fb.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      d9a9ea94
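The kernel-side behavior can be modeled in a few lines. This is a simplified sketch of the semantics described above, not the actual fuse implementation: the first ENOSYS reply to opendir sets a sticky flag on the connection, and every later opendir succeeds locally without a userspace round trip:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy model of the fuse connection state (names are illustrative). */
struct fuse_conn_model {
	bool no_opendir;   /* set once the fs replies ENOSYS to opendir */
	int  roundtrips;   /* userspace transitions actually performed */
};

/* Simulated userspace filesystem that does not implement opendir. */
static int fs_opendir(void)
{
	return -ENOSYS;
}

static int do_opendir(struct fuse_conn_model *fc)
{
	if (fc->no_opendir)
		return 0;              /* skip the userspace transition */

	fc->roundtrips++;
	int err = fs_opendir();
	if (err == -ENOSYS) {
		fc->no_opendir = true; /* remember for future opendirs */
		return 0;              /* treat ENOSYS as success */
	}
	return err;
}
```

Only one round trip ever reaches the filesystem; all subsequent opendir (and, per the commit, releasedir) requests are handled in the kernel.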
    • K
      inet_diag: fix reporting cgroup classid and fallback to priority · 1ec17dbd
Committed by Konstantin Khlebnikov
The idiag_ext field in struct inet_diag_req_v2, used as a bitmap of
requested extensions, has only 8 bits.  Thus extensions starting from
DCTCPINFO cannot be requested directly.  Some of them are included in
the response unconditionally or hook into one of the lower 8 bits.

Extension INET_DIAG_CLASS_ID has had no way to be requested from the
beginning.

This patch bundles it with INET_DIAG_TCLASS (IPv6 TOS), fixes space
reservation, and documents the behavior for other extensions.

This patch also adds a fallback to reporting the socket priority.  This
field is more widely used for traffic classification because IPv4
sockets automatically map TOS to priority and the default qdisc
pfifo_fast knows about that.  But priority can be changed via
setsockopt(SO_PRIORITY), so INET_DIAG_TOS isn't enough for predicting
the class.

cgroup2 also obsoletes the net_cls classid (it is always zero), but we
cannot reuse this field for reporting the cgroup2 id because that is
64-bit (ino+gen).

So, after this patch INET_DIAG_CLASS_ID will report the socket priority
for the most common setup, when net_cls isn't set and/or cgroup2 is in
use.
      
      Fixes: 0888e372 ("net: inet: diag: expose sockets cgroup classid")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
      1ec17dbd
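The 8-bit limit is easy to see in code: the request bit for extension N is 1 << (N - 1), so any extension numbered above 8 overflows a u8. The sketch below mirrors the enum ordering in include/uapi/linux/inet_diag.h (consult the header for the authoritative list):

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the start of the inet_diag extension enum; DCTCPINFO is the
 * first extension whose request bit no longer fits in the 8-bit
 * idiag_ext field. */
enum {
	INET_DIAG_NONE,      /* 0 */
	INET_DIAG_MEMINFO,   /* 1 -> bit 0 */
	INET_DIAG_INFO,
	INET_DIAG_VEGASINFO,
	INET_DIAG_CONG,
	INET_DIAG_TOS,
	INET_DIAG_TCLASS,
	INET_DIAG_SKMEMINFO,
	INET_DIAG_SHUTDOWN,  /* 8 -> bit 7, the last one that fits */
	INET_DIAG_DCTCPINFO, /* 9 -> bit 8, overflows a u8 */
};

/* Can extension ext be requested via the 8-bit idiag_ext bitmap? */
static int requestable_in_u8(int ext)
{
	return (1u << (ext - 1)) <= UINT8_MAX;
}
```

This is why INET_DIAG_CLASS_ID had to be bundled with a requestable extension (INET_DIAG_TCLASS) rather than given its own request bit.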
18. 11 Feb 2019, 2 commits
    • M
      bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock · 655a51e5
Committed by Martin KaFai Lau
This patch adds a helper function, BPF_FUNC_tcp_sock.  It is
currently available to cg_skb and sched_(cls|act) programs:
      
      struct bpf_tcp_sock *bpf_tcp_sock(struct bpf_sock *sk);
      
      int cg_skb_foo(struct __sk_buff *skb) {
      	struct bpf_tcp_sock *tp;
      	struct bpf_sock *sk;
      	__u32 snd_cwnd;
      
      	sk = skb->sk;
      	if (!sk)
      		return 1;
      
      	tp = bpf_tcp_sock(sk);
      	if (!tp)
      		return 1;
      
      	snd_cwnd = tp->snd_cwnd;
      	/* ... */
      
      	return 1;
      }
      
A 'struct bpf_tcp_sock' is also added to the uapi bpf.h to provide
read-only access.  bpf_tcp_sock has all the existing tcp_sock fields
that have already been exposed by bpf_sock_ops,
i.e. no new tcp_sock fields are exposed in bpf.h.
      
      This helper returns a pointer to the tcp_sock.  If it is not a tcp_sock
      or it cannot be traced back to a tcp_sock by sk_to_full_sk(), it
      returns NULL.  Hence, the caller needs to check for NULL before
      accessing it.
      
      The current use case is to expose members from tcp_sock
      to allow a cg_skb_bpf_prog to provide per cgroup traffic
      policing/shaping.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      655a51e5
    • M
      bpf: Add state, dst_ip4, dst_ip6 and dst_port to bpf_sock · aa65d696
Committed by Martin KaFai Lau
This patch adds "state", "dst_ip4", "dst_ip6" and "dst_port" to the
bpf_sock.  Userspace has already been using "state",
e.g. inet_diag (ss -t) and getsockopt(TCP_INFO).

This patch also allows narrow loads on the following existing fields:
"family", "type", "protocol" and "src_port".  Unlike the IP addresses,
the load offset is restricted to the first byte for them, but it
can be relaxed later if there is a use case.
      
This patch also folds __sock_filter_check_size() into
bpf_sock_is_valid_access() since it is not called
from anywhere else.  All bpf_sock checking is now in
one place.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      aa65d696
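A "narrow load" simply means loading fewer bytes than the field's full width at the field's offset; the verifier rewrites such accesses for the program. A plain-C sketch of the idea (not BPF code; note that which byte sits at offset 0 depends on host endianness):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Load a single byte at the base offset of a 4-byte field, the way a
 * narrow 1-byte access to e.g. bpf_sock->src_port would behave. */
static uint8_t narrow_load_u8(const uint32_t *field)
{
	uint8_t b;
	memcpy(&b, field, 1); /* 1-byte load at offset 0 of the field */
	return b;
}

/* Runtime endianness probe, so the expectation below stays portable. */
static int host_is_little_endian(void)
{
	uint32_t probe = 1;
	uint8_t first;
	memcpy(&first, &probe, 1);
	return first == 1;
}
```

On a little-endian host the byte at offset 0 is the least significant one, which is why the commit restricts the allowed offset to the first byte until a concrete use case motivates relaxing it.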