1. 04 Jul, 2018 11 commits
  2. 02 Jul, 2018 6 commits
  3. 30 Jun, 2018 7 commits
    • G
      tipc: extend sock diag for group communication · a1be5a20
      GhantaKrishnamurthy MohanKrishna committed
      This commit extends the existing TIPC socket diagnostics framework
      for information related to TIPC group communication.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a1be5a20
    • H
      net/smc: add SMC-D diag support · 4b1b7d3b
      Hans Wippel committed
      This patch adds diag support for SMC-D.
      Signed-off-by: Hans Wippel <hwippel@linux.ibm.com>
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
      Suggested-by: Thomas Richter <tmricht@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4b1b7d3b
    • H
      net/smc: add pnetid support for SMC-D and ISM · 1619f770
      Hans Wippel committed
      SMC-D relies on PNETIDs to find usable SMC-D/ISM devices for an SMC
      connection. This patch adds SMC-D/ISM support to the current PNETID
      implementation.
      Signed-off-by: Hans Wippel <hwippel@linux.ibm.com>
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
      Suggested-by: Thomas Richter <tmricht@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1619f770
    • H
      net/smc: add base infrastructure for SMC-D and ISM · c6ba7c9b
      Hans Wippel committed
      SMC supports two variants: SMC-R and SMC-D. For data transport, SMC-R
      uses RDMA devices, SMC-D uses so-called Internal Shared Memory (ISM)
      devices. An ISM device only allows shared memory communication between
      SMC instances on the same machine. For example, this allows virtual
      machines on the same host to communicate via SMC without RDMA devices.
      
      This patch adds the base infrastructure for SMC-D and ISM devices to
      the existing SMC code. It contains the following:
      
      * ISM driver interface:
        This interface allows an ISM driver to register ISM devices in SMC. In
        the process, the driver provides a set of device ops for each device.
        SMC uses these ops to execute SMC specific operations on or transfer
        data over the device.
      
      * Core SMC-D link group, connection, and buffer support:
        Link groups, SMC connections and SMC buffers (in smc_core) are
        extended to support SMC-D.
      
      * SMC type checks:
        Some type checks are added to prevent using SMC-R specific code for
        SMC-D and vice versa.
      
      To actually use SMC-D, additional changes to pnetid, CLC, CDC, etc. are
      required. These are added in follow-up patches.
      Signed-off-by: Hans Wippel <hwippel@linux.ibm.com>
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
      Suggested-by: Thomas Richter <tmricht@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c6ba7c9b
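The ISM driver interface described in this commit (a driver registers a device and supplies a set of ops that the SMC core then calls) can be illustrated with a small userspace sketch. Every name here (`toy_ism_dev`, `toy_ism_ops`, `toy_ism_register_dev`, `toy_smc_write`) is hypothetical and chosen for illustration; the kernel's actual structures and callbacks differ.

```c
#include <stddef.h>
#include <string.h>

struct toy_ism_dev;

/* Hypothetical per-device operations table supplied by the driver. */
struct toy_ism_ops {
    /* Copy len bytes from src into the device buffer at offset. */
    int (*move_data)(struct toy_ism_dev *dev, size_t offset,
                     const void *src, size_t len);
};

struct toy_ism_dev {
    const struct toy_ism_ops *ops;  /* set by the driver before registering */
    unsigned char dmb[64];          /* stand-in for a shared memory buffer */
};

#define TOY_MAX_DEVS 4
static struct toy_ism_dev *toy_dev_list[TOY_MAX_DEVS];
static size_t toy_dev_count;

/* Driver side: register a device together with its ops. */
int toy_ism_register_dev(struct toy_ism_dev *dev)
{
    if (!dev || !dev->ops || !dev->ops->move_data)
        return -1;
    if (toy_dev_count == TOY_MAX_DEVS)
        return -1;
    toy_dev_list[toy_dev_count++] = dev;
    return 0;
}

/* Core side: data is transferred only through the registered ops. */
int toy_smc_write(size_t dev_idx, size_t offset, const void *src, size_t len)
{
    struct toy_ism_dev *dev;

    if (dev_idx >= toy_dev_count)
        return -1;
    dev = toy_dev_list[dev_idx];
    return dev->ops->move_data(dev, offset, src, len);
}

/* Example driver callback: bounds-checked copy into the buffer. */
static int toy_move_data(struct toy_ism_dev *dev, size_t offset,
                         const void *src, size_t len)
{
    if (offset > sizeof(dev->dmb) || len > sizeof(dev->dmb) - offset)
        return -1;
    memcpy(dev->dmb + offset, src, len);
    return 0;
}

const struct toy_ism_ops toy_ops = { .move_data = toy_move_data };
```

The point of the indirection is that the core never touches the device buffer directly, so a different device type only needs a different ops table.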
    • U
      net/smc: add pnetid support · 0afff91c
      Ursula Braun committed
      s390 hardware supports the definition of a so-called Physical NETwork
      IDentifier (PNETID for short) per network device port. These PNETIDs
      can be used to identify network devices that are attached to the same
      physical network (broadcast domain).
      
      On s390 try to use the PNETID of the ethernet device port used for
      initial connecting, and derive the IB device port used for SMC RDMA
      traffic.
      
      On platforms without PNETID support fall back to the existing
      solution of a configured pnet table.
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0afff91c
    • Y
      tcp: add new SNMP counter for drops when try to queue in rcv queue · ea5d0c32
      Yafang Shao committed
      When sk_rmem_alloc is larger than the receive buffer and we can't
      schedule more memory for it, the skb will be dropped.
      
      In the above situation, if this skb was meant for the ofo queue,
      LINUX_MIB_TCPOFODROP is incremented to track it.
      
      If the skb was meant for the receive queue, however, nothing is recorded.
      So a new SNMP counter is introduced to track this behavior.
      
      LINUX_MIB_TCPRCVQDROP:  Number of packets meant to be queued in rcv queue
      			but dropped because socket rcvbuf limit hit.
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ea5d0c32
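The accounting rule in this commit message can be modelled in a few lines of userspace C. This is only a sketch of the condition as the message states it, not the kernel's code path; `toy_sock`, `toy_try_queue`, and the `TOY_MIB_*` names are hypothetical stand-ins.

```c
#include <stdbool.h>

/* Hypothetical counter indices standing in for the SNMP MIB entries. */
enum toy_mib { TOY_MIB_TCPOFODROP, TOY_MIB_TCPRCVQDROP, TOY_MIB_MAX };

struct toy_sock {
    unsigned int rmem_alloc;   /* receive memory currently charged */
    unsigned int rcvbuf;       /* receive buffer limit */
    bool can_schedule;         /* whether more memory can be scheduled */
    unsigned long mib[TOY_MIB_MAX];
};

/*
 * Try to queue a buffer of the given size. Returns true if queued.
 * On a drop, the counter matching the intended queue is incremented,
 * so receive-queue drops are no longer invisible.
 */
bool toy_try_queue(struct toy_sock *sk, unsigned int truesize,
                   bool out_of_order)
{
    if (sk->rmem_alloc > sk->rcvbuf && !sk->can_schedule) {
        sk->mib[out_of_order ? TOY_MIB_TCPOFODROP
                             : TOY_MIB_TCPRCVQDROP]++;
        return false;
    }
    sk->rmem_alloc += truesize;
    return true;
}
```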
    • D
      bpf: undo prog rejection on read-only lock failure · 85782e03
      Daniel Borkmann committed
      Partially undo commit 9facc336 ("bpf: reject any prog that failed
      read-only lock") since it caused a regression, that is, syzkaller was
      able to manage to cause a panic via fault injection deep in set_memory_ro()
      path by letting an allocation fail: In x86's __change_page_attr_set_clr()
      it was able to change the attributes of the primary mapping but not in
      the alias mapping via cpa_process_alias(), so the second, inner call
      to the __change_page_attr() via __change_page_attr_set_clr() had to split
      a larger page and failed in the alloc_pages() with the artificially triggered
      allocation error which is then propagated down to the call site.
      
      Thus, for set_memory_ro() this means that it returned with an error, but
      from debugging a probe_kernel_write() revealed EFAULT on that memory since
      the primary mapping succeeded to get changed. Therefore the subsequent
      hdr->locked = 0 reset triggered the panic as it was performed on read-only
      memory, so call-site assumptions were in fact wrong to assume that it would
      either succeed /or/ not succeed at all since there's no such rollback in
      set_memory_*() calls from partial change of mappings, in other words, we're
      left in a state that is "half done". A later undo via set_memory_rw() is
      succeeding though due to matching permissions on that part (aka due to the
      try_preserve_large_page() succeeding). While reproducing locally with
      explicitly triggering this error, the initial splitting only happens on
      rare occasions and in real world it would additionally need oom conditions,
      but that said, it could partially fail. Therefore, it is definitely wrong
      to bail out on set_memory_ro() error and reject the program with the
      set_memory_*() semantics we have today. Shouldn't have gone the extra mile
      since no other user in tree today in fact checks for any set_memory_*()
      errors, e.g. neither module_enable_ro() / module_disable_ro() for module
      RO/NX handling which is mostly default these days nor kprobes core with
      alloc_insn_page() / free_insn_page() as examples that could be invoked long
      after bootup and original 314beb9b ("x86: bpf_jit_comp: secure bpf jit
      against spraying attacks") did neither when it got first introduced to BPF
      so "improving" with bailing out was clearly not right when set_memory_*()
      cannot handle it today.
      
      Kees suggested that if set_memory_*() can fail, we should annotate it with
      __must_check, and all callers need to deal with it gracefully given those
      set_memory_*() markings aren't "advisory", but they're expected to actually
      do what they say. This might be an option worth pursuing in the future
      but would at the same time require that set_memory_*() calls from supporting
      archs are guaranteed to be "atomic" in that they provide rollback if part
      of the range fails, once that happened, the transition from RW -> RO could
      be made more robust that way, while subsequent RO -> RW transition /must/
      continue guaranteeing to always succeed the undo part.
      
      Reported-by: syzbot+a4eb8c7766952a1ca872@syzkaller.appspotmail.com
      Reported-by: syzbot+d866d1925855328eac3b@syzkaller.appspotmail.com
      Fixes: 9facc336 ("bpf: reject any prog that failed read-only lock")
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      85782e03
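The "half done" state described in this commit message can be reproduced in a toy model: a multi-page permission change that fails partway and has no rollback. The sketch below is a userspace illustration only; `toy_region`, `toy_set_memory_ro`, `toy_set_memory_rw`, and `toy_write_ok` are hypothetical names, and the injected failure index stands in for the failed page split.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGES 4

struct toy_region {
    bool read_only[TOY_PAGES];
    int fail_at;   /* page index at which the change fails, -1 for none */
};

/* No rollback: pages before fail_at stay read-only when this fails. */
int toy_set_memory_ro(struct toy_region *r)
{
    for (int i = 0; i < TOY_PAGES; i++) {
        if (i == r->fail_at)
            return -1;
        r->read_only[i] = true;
    }
    return 0;
}

/* The undo direction is modelled as always succeeding. */
void toy_set_memory_rw(struct toy_region *r)
{
    for (int i = 0; i < TOY_PAGES; i++)
        r->read_only[i] = false;
}

/* A write to a read-only page would fault; modelled as returning false. */
bool toy_write_ok(const struct toy_region *r, size_t page)
{
    return page < TOY_PAGES && !r->read_only[page];
}
```

In this model, a caller that sees the error from `toy_set_memory_ro()` and then writes to page 0, assuming nothing changed, hits exactly the kind of write to read-only memory the message describes.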
  4. 29 Jun, 2018 10 commits
    • S
      net/sched: add tunnel option support to act_tunnel_key · 0ed5269f
      Simon Horman committed
      Allow setting tunnel options using the act_tunnel_key action.
      
      Options are expressed as class:type:data and multiple options
      may be listed using a comma delimiter.
      
       # ip link add name geneve0 type geneve dstport 0 external
       # tc qdisc add dev eth0 ingress
       # tc filter add dev eth0 protocol ip parent ffff: \
           flower indev eth0 \
              ip_proto udp \
              action tunnel_key \
                  set src_ip 10.0.99.192 \
                  dst_ip 10.0.99.193 \
                  dst_port 6081 \
                  id 11 \
                  geneve_opts 0102:80:00800022,0102:80:00800022 \
          action mirred egress redirect dev geneve0
      Signed-off-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0ed5269f
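The `class:type:data` syntax with comma-delimited items, as used in the `geneve_opts` argument above, can be parsed along the following lines. This is a minimal userspace sketch, not the code from tc or the kernel; it assumes all three fields are hexadecimal (as in the example) and that the data field has an even number of hex digits. The names `toy_opt`, `toy_parse_one`, and `toy_parse_opts` are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

#define TOY_MAX_DATA 32

struct toy_opt {
    unsigned long opt_class;            /* 16-bit option class */
    unsigned long type;                 /* 8-bit option type */
    unsigned char data[TOY_MAX_DATA];   /* decoded option data */
    size_t data_len;
};

static int toy_hex_val(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Parse one "class:type:data" item of length len. Returns 0 on success. */
static int toy_parse_one(const char *s, size_t len, struct toy_opt *out)
{
    char buf[128];
    char *end, *p;
    size_t n;

    if (len == 0 || len >= sizeof(buf))
        return -1;
    memcpy(buf, s, len);
    buf[len] = '\0';

    out->opt_class = strtoul(buf, &end, 16);
    if (end == buf || *end != ':' || out->opt_class > 0xffff)
        return -1;
    p = end + 1;
    out->type = strtoul(p, &end, 16);
    if (end == p || *end != ':' || out->type > 0xff)
        return -1;
    p = end + 1;

    n = strlen(p);
    if (n == 0 || n % 2 || n / 2 > TOY_MAX_DATA)
        return -1;
    for (size_t i = 0; i < n; i += 2) {
        int hi = toy_hex_val(p[i]);
        int lo = toy_hex_val(p[i + 1]);

        if (hi < 0 || lo < 0)
            return -1;
        out->data[i / 2] = (unsigned char)((hi << 4) | lo);
    }
    out->data_len = n / 2;
    return 0;
}

/* Parse a comma-delimited list. Returns the option count, or -1 on error. */
int toy_parse_opts(const char *s, struct toy_opt *opts, int max)
{
    int count = 0;

    while (*s) {
        size_t len = strcspn(s, ",");

        if (count == max || toy_parse_one(s, len, &opts[count]))
            return -1;
        count++;
        s += len;
        if (*s == ',') {
            s++;
            if (!*s)
                return -1;   /* trailing comma */
        }
    }
    return count;
}
```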
    • P
      net: check tunnel option type in tunnel flags · 256c87c1
      Pieter Jansen van Vuuren committed
      Check the tunnel option type stored in tunnel flags when creating options
      for tunnels, thereby ensuring we do not set geneve, vxlan or erspan tunnel
      options on interfaces that are not associated with them.
      
      Make sure all users of the infrastructure set correct flags. For the BPF
      helper we have to set all bits to keep backward compatibility.
      Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      256c87c1
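The check described in this commit amounts to a bit test: each option type has its own flag, and a tunnel device only accepts options whose flag matches its own type. The sketch below uses hypothetical flag names and bit values (`TOY_GENEVE_OPT` and so on) rather than the kernel's definitions.

```c
/* Hypothetical flag bits, one per tunnel option type. */
#define TOY_GENEVE_OPT  0x1u
#define TOY_VXLAN_OPT   0x2u
#define TOY_ERSPAN_OPT  0x4u

/* All bits set: what a caller that cannot know the type would pass. */
#define TOY_OPTIONS_PRESENT \
    (TOY_GENEVE_OPT | TOY_VXLAN_OPT | TOY_ERSPAN_OPT)

/*
 * Returns nonzero if options carrying tun_flags may be applied on a
 * device whose own option type is dev_opt_type.
 */
int toy_opts_usable(unsigned int tun_flags, unsigned int dev_opt_type)
{
    return (tun_flags & dev_opt_type) != 0;
}
```

Setting all bits, as the message says the BPF helper must, makes the check pass for every device type, which is how the pre-existing behavior is preserved for that caller.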
    • J
      sg: remove ->sg_magic member · 9544bc53
      Jens Axboe committed
      This was introduced more than a decade ago when sg chaining was
      added, but we never really caught anything with it. The scatterlist
      entry size can be critical, since drivers allocate it, so remove
      the magic member. Recently it's been triggering allocation stalls
      and failures in NVMe.
      Tested-by: Jordan Glover <Golden_Miller83@protonmail.ch>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9544bc53
    • A
      aio: mark __aio_sigset::sigmask const · 2cd3ae21
      Avi Kivity committed
      io_pgetevents() will not change the signal mask.  Mark it const to make
      it clear and to reduce the need for casts in user code.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Avi Kivity <avi@scylladb.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      [hch: reapply the patch that got incorrectly reverted]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2cd3ae21
    • X
      sctp: add support for SCTP_REUSE_PORT sockopt · b0e9a2fe
      Xin Long committed
      This feature is actually already supported by sk->sk_reuse which can be
      set by socket level opt SO_REUSEADDR. But it's not working exactly as
      RFC6458 demands in section 8.1.27, like:
      
        - This option only supports one-to-one style SCTP sockets
        - This socket option must not be used after calling bind()
          or sctp_bindx().
      
      Besides, the SCTP_REUSE_PORT sockopt should be provided for user programs.
      Otherwise, programs using SCTP_REUSE_PORT from other systems will not
      work on Linux.
      
      To separate it from the socket level version, this patch adds 'reuse' in
      sctp_sock and it works pretty much as sk->sk_reuse, but with some extra
      setup limitations that are needed when it is being enabled.
      
      "It should be noted that the behavior of the socket-level socket option
      to reuse ports and/or addresses for SCTP sockets is unspecified", so it
      leaves SO_REUSEADDR as is for the compatibility.
      
      Note that the name SCTP_REUSE_PORT is somewhat confusing, as its
      functionality is nearly identical to SO_REUSEADDR, but with some
      extra restrictions. Here it uses 'reuse' in sctp_sock instead of
      'reuseport'. As for sk->sk_reuseport support for SCTP, it will be
      added in another patch.
      
      Thanks to Neil for making this clear.
      
      v1->v2:
        - add sctp_sk->reuse to separate it from the socket level version.
      v2->v3:
        - improve changelog according to Marcelo's suggestion.
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b0e9a2fe
    • T
      ila: Flush netlink command to clear xlat table · b6e71bde
      Tom Herbert committed
      Add ILA_CMD_FLUSH netlink command to clear the ILA translation table.
      Signed-off-by: Tom Herbert <tom@quantonium.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b6e71bde
    • D
      bpf: Change bpf_fib_lookup to return lookup status · 4c79579b
      David Ahern committed
      For ACLs implemented using either FIB rules or FIB entries, the BPF
      program needs the FIB lookup status to be able to drop the packet.
      Since the bpf_fib_lookup API has not reached a released kernel yet,
      change the return code to contain an encoding of the FIB lookup
      result and return the nexthop device index in the params struct.
      
      In addition, inform the BPF program of any post FIB lookup reason as
      to why the packet needs to go up the stack.
      
      The fib result for unicast routes must have an egress device, so remove
      the check that it is non-NULL.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      4c79579b
    • S
      include/linux/dax.h: dax_iomap_fault() returns vm_fault_t · f77bc3a8
      Souptick Joarder committed
      Commit 1c8f4220 ("mm: change return type to vm_fault_t") missed a
      conversion.  It's not a big problem at present because mainline is still
      using
      
      	typedef int vm_fault_t;
      
      Fixes: 1c8f4220 ("mm: change return type to vm_fault_t")
      Link: http://lkml.kernel.org/r/20180620172046.GA27894@jordon-HP-15-Notebook-PC
      Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f77bc3a8
    • M
      slub: fix failure when we delete and create a slab cache · d50d82fa
      Mikulas Patocka committed
      In kernel 4.17 I removed some code from dm-bufio that did slab cache
      merging (commit 21bb1327: "dm bufio: remove code that merges slab
      caches") - both slab and slub support merging caches with identical
      attributes, so dm-bufio now just calls kmem_cache_create and relies on
      implicit merging.
      
      This uncovered a bug in the slub subsystem - if we delete a cache and
      immediately create another cache with the same attributes, it fails
      because of duplicate filename in /sys/kernel/slab/.  The slub subsystem
      offloads freeing the cache to a workqueue - and if we create the new
      cache before the workqueue runs, it complains because of duplicate
      filename in sysfs.
      
      This patch fixes the bug by moving the call of kobject_del from
      sysfs_slab_remove_workfn to shutdown_cache.  kobject_del must be called
      while we hold slab_mutex - so that the sysfs entry is deleted before a
      cache with the same attributes could be created.
      
      Running device-mapper-test-suite with:
      
        dmtest run --suite thin-provisioning -n /commit_failure_causes_fallback/
      
      triggered:
      
        Buffer I/O error on dev dm-0, logical block 1572848, async page read
        device-mapper: thin: 253:1: metadata operation 'dm_pool_alloc_data_block' failed: error = -5
        device-mapper: thin: 253:1: aborting current metadata transaction
        sysfs: cannot create duplicate filename '/kernel/slab/:a-0000144'
        CPU: 2 PID: 1037 Comm: kworker/u48:1 Not tainted 4.17.0.snitm+ #25
        Hardware name: Supermicro SYS-1029P-WTR/X11DDW-L, BIOS 2.0a 12/06/2017
        Workqueue: dm-thin do_worker [dm_thin_pool]
        Call Trace:
         dump_stack+0x5a/0x73
         sysfs_warn_dup+0x58/0x70
         sysfs_create_dir_ns+0x77/0x80
         kobject_add_internal+0xba/0x2e0
         kobject_init_and_add+0x70/0xb0
         sysfs_slab_add+0xb1/0x250
         __kmem_cache_create+0x116/0x150
         create_cache+0xd9/0x1f0
         kmem_cache_create_usercopy+0x1c1/0x250
         kmem_cache_create+0x18/0x20
         dm_bufio_client_create+0x1ae/0x410 [dm_bufio]
         dm_block_manager_create+0x5e/0x90 [dm_persistent_data]
         __create_persistent_data_objects+0x38/0x940 [dm_thin_pool]
         dm_pool_abort_metadata+0x64/0x90 [dm_thin_pool]
         metadata_operation_failed+0x59/0x100 [dm_thin_pool]
         alloc_data_block.isra.53+0x86/0x180 [dm_thin_pool]
         process_cell+0x2a3/0x550 [dm_thin_pool]
         do_worker+0x28d/0x8f0 [dm_thin_pool]
         process_one_work+0x171/0x370
         worker_thread+0x49/0x3f0
         kthread+0xf8/0x130
         ret_from_fork+0x35/0x40
        kobject_add_internal failed for :a-0000144 with -EEXIST, don't try to register things with the same name in the same directory.
        kmem_cache_create(dm_bufio_buffer-16) failed with error -17
      
      Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1806151817130.6333@file01.intranet.prod.int.rdu2.redhat.com
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Reported-by: Mike Snitzer <snitzer@redhat.com>
      Tested-by: Mike Snitzer <snitzer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d50d82fa
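The race in this commit (a name is still registered because its removal was handed to a workqueue, so an immediate re-create fails with a duplicate-name error) can be shown with a small userspace model. This is a sketch under simplifying assumptions: a single pending removal, a fixed-size name table, and hypothetical names throughout (`toy_mgr`, `toy_cache_create`, `toy_cache_destroy`, `toy_run_workqueue`).

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define TOY_SLOTS 8
#define TOY_NAME_LEN 32

struct toy_mgr {
    char names[TOY_SLOTS][TOY_NAME_LEN];  /* empty string marks a free slot */
    char pending[TOY_NAME_LEN];           /* name awaiting deferred removal */
    bool has_pending;
};

static int toy_find(const struct toy_mgr *m, const char *name)
{
    for (int i = 0; i < TOY_SLOTS; i++)
        if (m->names[i][0] && strcmp(m->names[i], name) == 0)
            return i;
    return -1;
}

static void toy_entry_del(struct toy_mgr *m, const char *name)
{
    int i = toy_find(m, name);

    if (i >= 0)
        m->names[i][0] = '\0';
}

/* Register a name; fails with -EEXIST while an old entry still exists. */
int toy_cache_create(struct toy_mgr *m, const char *name)
{
    if (name[0] == '\0' || strlen(name) >= TOY_NAME_LEN)
        return -EINVAL;
    if (toy_find(m, name) >= 0)
        return -EEXIST;
    for (int i = 0; i < TOY_SLOTS; i++) {
        if (!m->names[i][0]) {
            strcpy(m->names[i], name);
            return 0;
        }
    }
    return -ENOMEM;
}

/* defer=true models the old behavior; defer=false models the fix. */
void toy_cache_destroy(struct toy_mgr *m, const char *name, bool defer)
{
    if (defer) {
        snprintf(m->pending, sizeof(m->pending), "%s", name);
        m->has_pending = true;
    } else {
        toy_entry_del(m, name);
    }
}

/* Runs the deferred removal, if any. */
void toy_run_workqueue(struct toy_mgr *m)
{
    if (m->has_pending) {
        toy_entry_del(m, m->pending);
        m->has_pending = false;
    }
}
```

With `defer` set, a create issued before `toy_run_workqueue()` fails; removing the entry synchronously inside destroy closes that window, which corresponds to moving the deletion under the same lock that serializes creation.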
    • L
      Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL · a11e1d43
      Linus Torvalds committed
      The poll() changes were not well thought out, and completely
      unexplained.  They also caused a huge performance regression, because
      "->poll()" was no longer a trivial file operation that just called down
      to the underlying file operations, but instead did at least two indirect
      calls.
      
      Indirect calls are sadly slow now with the Spectre mitigation, but the
      performance problem could at least be largely mitigated by changing the
      "->get_poll_head()" operation to just have a per-file-descriptor pointer
      to the poll head instead.  That gets rid of one of the new indirections.
      
      But that doesn't fix the new complexity that is completely unwarranted
      for the regular case.  The (undocumented) reason for the poll() changes
      was some alleged AIO poll race fixing, but we don't make the common case
      slower and more complex for some uncommon special case, so this all
      really needs way more explanations and most likely a fundamental
      redesign.
      
      [ This revert is a revert of about 30 different commits, not reverted
        individually because that would just be unnecessarily messy  - Linus ]
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a11e1d43
  5. 28 Jun, 2018 3 commits
  6. 27 Jun, 2018 3 commits