1. 10 8月, 2018 1 次提交
  2. 01 8月, 2018 2 次提交
    • B
      NFSv4 client live hangs after live data migration recovery · 0f90be13
      Bill Baker 提交于
      After a live data migration event at the NFS server, the client may send
      I/O requests to the wrong server, causing a live hang due to repeated
      recovery events.  On the wire, this will appear as an I/O request failing
      with NFS4ERR_BADSESSION, followed by successful CREATE_SESSION, repeatedly.
      NFS4ERR_BADSSESSION is returned because the session ID being used was
      issued by the other server and is not valid at the old server.
      
      The failure is caused by async worker threads having cached the transport
      (xprt) in the rpc_task structure.  After the migration recovery completes,
      the task is redispatched and the task resends the request to the wrong
      server based on the old value still present in tk_xprt.
      
      The solution is to recompute the tk_xprt field of the rpc_task structure
      so that the request goes to the correct server.
      Signed-off-by: NBill Baker <bill.baker@oracle.com>
      Reviewed-by: NChuck Lever <chuck.lever@oracle.com>
      Tested-by: NHelen Chao <helen.chao@oracle.com>
      Fixes: fb43d172 ("SUNRPC: Use the multipath iterator to assign a ...")
      Cc: stable@vger.kernel.org # v4.9+
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      0f90be13
    • D
      sunrpc: Change rpc_print_iostats to rpc_clnt_show_stats and handle rpc_clnt clones · 016583d7
      Dave Wysochanski 提交于
      The existing rpc_print_iostats has a few shortcomings.  First, the naming
      is not consistent with other functions in the kernel that display stats.
      Second, it is really displaying stats for an rpc_clnt structure as it
      displays both xprt stats and per-op stats.  Third, it does not handle
      rpc_clnt clones, which is important for the one in-kernel tree caller
      of this function, the NFS client's nfs_show_stats function.
      
      Fix all of the above by renaming the rpc_print_iostats to
      rpc_clnt_show_stats and looping through any rpc_clnt clones via
      cl_parent.
      
      Once this interface is fixed, this addresses a problem with NFSv4.
      Before this patch, the /proc/self/mountstats always showed incorrect
      counts for NFSv4 lease and session related opcodes such as SEQUENCE,
      RENEW, SETCLIENTID, CREATE_SESSION, etc.  These counts were always 0
      even though many ops would go over the wire.  The reason for this is
      there are multiple rpc_clnt structures allocated for any given NFSv4
      mount, and inside nfs_show_stats() we callled into rpc_print_iostats()
      which only handled one of them, nfs_server->client.  Fix these counts
      by calling sunrpc's new rpc_clnt_show_stats() function, which handles
      cloned rpc_clnt structs and prints the stats together.
      
      Note that one side-effect of the above is that multiple mounts from
      the same NFS server will show identical counts in the above ops due
      to the fact the one rpc_clnt (representing the NFSv4 client state)
      is shared across mounts.
      Signed-off-by: NDave Wysochanski <dwysocha@redhat.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      016583d7
  3. 31 7月, 2018 1 次提交
  4. 27 7月, 2018 1 次提交
  5. 24 7月, 2018 1 次提交
    • W
      ipv6: use fib6_info_hold_safe() when necessary · e873e4b9
      Wei Wang 提交于
      In the code path where only rcu read lock is held, e.g. in the route
      lookup code path, it is not safe to directly call fib6_info_hold()
      because the fib6_info may already have been deleted but still exists
      in the rcu grace period. Holding reference to it could cause double
      free and crash the kernel.
      
      This patch adds a new function fib6_info_hold_safe() and replace
      fib6_info_hold() in all necessary places.
      
      Syzbot reported 3 crash traces because of this. One of them is:
      8021q: adding VLAN 0 to HW filter on device team0
      IPv6: ADDRCONF(NETDEV_CHANGE): team0: link becomes ready
      dst_release: dst:(____ptrval____) refcnt:-1
      dst_release: dst:(____ptrval____) refcnt:-2
      WARNING: CPU: 1 PID: 4845 at include/net/dst.h:239 dst_hold include/net/dst.h:239 [inline]
      WARNING: CPU: 1 PID: 4845 at include/net/dst.h:239 ip6_setup_cork+0xd66/0x1830 net/ipv6/ip6_output.c:1204
      dst_release: dst:(____ptrval____) refcnt:-1
      Kernel panic - not syncing: panic_on_warn set ...
      
      CPU: 1 PID: 4845 Comm: syz-executor493 Not tainted 4.18.0-rc3+ #10
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
       panic+0x238/0x4e7 kernel/panic.c:184
      dst_release: dst:(____ptrval____) refcnt:-2
      dst_release: dst:(____ptrval____) refcnt:-3
       __warn.cold.8+0x163/0x1ba kernel/panic.c:536
      dst_release: dst:(____ptrval____) refcnt:-4
       report_bug+0x252/0x2d0 lib/bug.c:186
       fixup_bug arch/x86/kernel/traps.c:178 [inline]
       do_error_trap+0x1fc/0x4d0 arch/x86/kernel/traps.c:296
      dst_release: dst:(____ptrval____) refcnt:-5
       do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:316
       invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
      RIP: 0010:dst_hold include/net/dst.h:239 [inline]
      RIP: 0010:ip6_setup_cork+0xd66/0x1830 net/ipv6/ip6_output.c:1204
      Code: c1 ed 03 89 9d 18 ff ff ff 48 b8 00 00 00 00 00 fc ff df 41 c6 44 05 00 f8 e9 2d 01 00 00 4c 8b a5 c8 fe ff ff e8 1a f6 e6 fa <0f> 0b e9 6a fc ff ff e8 0e f6 e6 fa 48 8b 85 d0 fe ff ff 48 8d 78
      RSP: 0018:ffff8801a8fcf178 EFLAGS: 00010293
      RAX: ffff8801a8eba5c0 RBX: 0000000000000000 RCX: ffffffff869511e6
      RDX: 0000000000000000 RSI: ffffffff869515b6 RDI: 0000000000000005
      RBP: ffff8801a8fcf2c8 R08: ffff8801a8eba5c0 R09: ffffed0035ac8338
      R10: ffffed0035ac8338 R11: ffff8801ad6419c3 R12: ffff8801a8fcf720
      R13: ffff8801a8fcf6a0 R14: ffff8801ad6419c0 R15: ffff8801ad641980
       ip6_make_skb+0x2c8/0x600 net/ipv6/ip6_output.c:1768
       udpv6_sendmsg+0x2c90/0x35f0 net/ipv6/udp.c:1376
       inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798
       sock_sendmsg_nosec net/socket.c:641 [inline]
       sock_sendmsg+0xd5/0x120 net/socket.c:651
       ___sys_sendmsg+0x51d/0x930 net/socket.c:2125
       __sys_sendmmsg+0x240/0x6f0 net/socket.c:2220
       __do_sys_sendmmsg net/socket.c:2249 [inline]
       __se_sys_sendmmsg net/socket.c:2246 [inline]
       __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2246
       do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x446ba9
      Code: e8 cc bb 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007fb39a469da8 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
      RAX: ffffffffffffffda RBX: 00000000006dcc54 RCX: 0000000000446ba9
      RDX: 00000000000000b8 RSI: 0000000020001b00 RDI: 0000000000000003
      RBP: 00000000006dcc50 R08: 00007fb39a46a700 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 45c828efc7a64843
      R13: e6eeb815b9d8a477 R14: 5068caf6f713c6fc R15: 0000000000000001
      Dumping ftrace buffer:
         (ftrace buffer empty)
      Kernel Offset: disabled
      Rebooting in 86400 seconds..
      
      Fixes: 93531c67 ("net/ipv6: separate handling of FIB entries from dst based routes")
      Reported-by: syzbot+902e2a1bcd4f7808cef5@syzkaller.appspotmail.com
      Reported-by: syzbot+8ae62d67f647abeeceb9@syzkaller.appspotmail.com
      Reported-by: syzbot+3f08feb14086930677d0@syzkaller.appspotmail.com
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e873e4b9
  6. 23 7月, 2018 1 次提交
  7. 22 7月, 2018 4 次提交
    • D
      net/ipv6: Fix linklocal to global address with VRF · 24b711ed
      David Ahern 提交于
      Example setup:
          host: ip -6 addr add dev eth1 2001:db8:104::4
                 where eth1 is enslaved to a VRF
      
          switch: ip -6 ro add 2001:db8:104::4/128 dev br1
                  where br1 only has an LLA
      
                 ping6 2001:db8:104::4
                 ssh   2001:db8:104::4
      
      (NOTE: UDP works fine if the PKTINFO has the address set to the global
      address and ifindex is set to the index of eth1 with a destination an
      LLA).
      
      For ICMP, icmp6_iif needs to be updated to check if skb->dev is an
      L3 master. If it is then return the ifindex from rt6i_idev similar
      to what is done for loopback.
      
      For TCP, restore the original tcp_v6_iif definition which is needed in
      most places and add a new tcp_v6_iif_l3_slave that considers the
      l3_slave variability. This latter check is only needed for socket
      lookups.
      
      Fixes: 9ff74384 ("net: vrf: Handle ipv6 multicast and link-local addresses")
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      24b711ed
    • Y
      bpfilter: Fix mismatch in function argument types · f95de8aa
      YueHaibing 提交于
      Fix following warning:
      net/ipv4/bpfilter/sockopt.c:28:5: error: symbol 'bpfilter_ip_set_sockopt' redeclared with different type
      net/ipv4/bpfilter/sockopt.c:34:5: error: symbol 'bpfilter_ip_get_sockopt' redeclared with different type
      Signed-off-by: NYueHaibing <yuehaibing@huawei.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f95de8aa
    • L
      mm: make vm_area_alloc() initialize core fields · 490fc053
      Linus Torvalds 提交于
      Like vm_area_dup(), it initializes the anon_vma_chain head, and the
      basic mm pointer.
      
      The rest of the fields end up being different for different users,
      although the plan is to also initialize the 'vm_ops' field to a dummy
      entry.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      490fc053
    • L
      mm: use helper functions for allocating and freeing vm_area structs · 3928d4f5
      Linus Torvalds 提交于
      The vm_area_struct is one of the most fundamental memory management
      objects, but the management of it is entirely open-coded evertwhere,
      ranging from allocation and freeing (using kmem_cache_[z]alloc and
      kmem_cache_free) to initializing all the fields.
      
      We want to unify this in order to end up having some unified
      initialization of the vmas, and the first step to this is to at least
      have basic allocation functions.
      
      Right now those functions are literally just wrappers around the
      kmem_cache_*() calls.  This is a purely mechanical conversion:
      
          # new vma:
          kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL) -> vm_area_alloc()
      
          # copy old vma
          kmem_cache_alloc(vm_area_cachep, GFP_KERNEL) -> vm_area_dup(old)
      
          # free vma
          kmem_cache_free(vm_area_cachep, vma) -> vm_area_free(vma)
      
      to the point where the old vma passed in to the vm_area_dup() function
      isn't even used yet (because I've left all the old manual initialization
      alone).
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3928d4f5
  8. 21 7月, 2018 2 次提交
    • Y
      tcp: do not delay ACK in DCTCP upon CE status change · a0496ef2
      Yuchung Cheng 提交于
      Per DCTCP RFC8257 (Section 3.2) the ACK reflecting the CE status change
      has to be sent immediately so the sender can respond quickly:
      
      """ When receiving packets, the CE codepoint MUST be processed as follows:
      
         1.  If the CE codepoint is set and DCTCP.CE is false, set DCTCP.CE to
             true and send an immediate ACK.
      
         2.  If the CE codepoint is not set and DCTCP.CE is true, set DCTCP.CE
             to false and send an immediate ACK.
      """
      
      Previously DCTCP implementation may continue to delay the ACK. This
      patch fixes that to implement the RFC by forcing an immediate ACK.
      
      Tested with this packetdrill script provided by Larry Brakmo
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
      0.000 bind(3, ..., ...) = 0
      0.000 listen(3, 1) = 0
      
      0.100 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
      0.100 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
      0.110 < [ect0] . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
         +0 setsockopt(4, SOL_SOCKET, SO_DEBUG, [1], 4) = 0
      
      0.200 < [ect0] . 1:1001(1000) ack 1 win 257
      0.200 > [ect01] . 1:1(0) ack 1001
      
      0.200 write(4, ..., 1) = 1
      0.200 > [ect01] P. 1:2(1) ack 1001
      
      0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
      +0.005 < [ce] . 2001:3001(1000) ack 2 win 257
      
      +0.000 > [ect01] . 2:2(0) ack 2001
      // Previously the ACK below would be delayed by 40ms
      +0.000 > [ect01] E. 2:2(0) ack 3001
      
      +0.500 < F. 9501:9501(0) ack 4 win 257
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a0496ef2
    • Y
      tcp: do not cancel delay-AcK on DCTCP special ACK · 27cde44a
      Yuchung Cheng 提交于
      Currently when a DCTCP receiver delays an ACK and receive a
      data packet with a different CE mark from the previous one's, it
      sends two immediate ACKs acking previous and latest sequences
      respectly (for ECN accounting).
      
      Previously sending the first ACK may mark off the delayed ACK timer
      (tcp_event_ack_sent). This may subsequently prevent sending the
      second ACK to acknowledge the latest sequence (tcp_ack_snd_check).
      The culprit is that tcp_send_ack() assumes it always acknowleges
      the latest sequence, which is not true for the first special ACK.
      
      The fix is to not make the assumption in tcp_send_ack and check the
      actual ack sequence before cancelling the delayed ACK. Further it's
      safer to pass the ack sequence number as a local variable into
      tcp_send_ack routine, instead of intercepting tp->rcv_nxt to avoid
      future bugs like this.
      Reported-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      27cde44a
  9. 20 7月, 2018 3 次提交
  10. 19 7月, 2018 3 次提交
    • T
      net/mlx5: Fix QP fragmented buffer allocation · d7037ad7
      Tariq Toukan 提交于
      Fix bad alignment of SQ buffer in fragmented QP allocation.
      It should start directly after RQ buffer ends.
      
      Take special care of the end case where the RQ buffer does not occupy
      a whole page. RQ size is a power of two, so would be the case only for
      small RQ sizes (RQ size < PAGE_SIZE).
      
      Fix wrong assignments for sqb->size (mistakenly assigned RQ size),
      and for npages value of RQ and SQ.
      
      Fixes: 3a2f7033 ("net/mlx5: Use order-0 allocations for all WQ types")
      Signed-off-by: NTariq Toukan <tariqt@mellanox.com>
      Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
      d7037ad7
    • C
      ipv6: fix useless rol32 call on hash · 169dc027
      Colin Ian King 提交于
      The rol32 call is currently rotating hash but the rol'd value is
      being discarded. I believe the current code is incorrect and hash
      should be assigned the rotated value returned from rol32.
      
      Thanks to David Lebrun for spotting this.
      Signed-off-by: NColin Ian King <colin.king@canonical.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      169dc027
    • S
      PCI: OF: Fix I/O space page leak · a5fb9fb0
      Sergei Shtylyov 提交于
      When testing the R-Car PCIe driver on the Condor board, if the PCIe PHY
      driver was left disabled, the kernel crashed with this BUG:
      
        kernel BUG at lib/ioremap.c:72!
        Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
        Modules linked in:
        CPU: 0 PID: 39 Comm: kworker/0:1 Not tainted 4.17.0-dirty #1092
        Hardware name: Renesas Condor board based on r8a77980 (DT)
        Workqueue: events deferred_probe_work_func
        pstate: 80000005 (Nzcv daif -PAN -UAO)
        pc : ioremap_page_range+0x370/0x3c8
        lr : ioremap_page_range+0x40/0x3c8
        sp : ffff000008da39e0
        x29: ffff000008da39e0 x28: 00e8000000000f07
        x27: ffff7dfffee00000 x26: 0140000000000000
        x25: ffff7dfffef00000 x24: 00000000000fe100
        x23: ffff80007b906000 x22: ffff000008ab8000
        x21: ffff000008bb1d58 x20: ffff7dfffef00000
        x19: ffff800009c30fb8 x18: 0000000000000001
        x17: 00000000000152d0 x16: 00000000014012d0
        x15: 0000000000000000 x14: 0720072007200720
        x13: 0720072007200720 x12: 0720072007200720
        x11: 0720072007300730 x10: 00000000000000ae
        x9 : 0000000000000000 x8 : ffff7dffff000000
        x7 : 0000000000000000 x6 : 0000000000000100
        x5 : 0000000000000000 x4 : 000000007b906000
        x3 : ffff80007c61a880 x2 : ffff7dfffeefffff
        x1 : 0000000040000000 x0 : 00e80000fe100f07
        Process kworker/0:1 (pid: 39, stack limit = 0x        (ptrval))
        Call trace:
         ioremap_page_range+0x370/0x3c8
         pci_remap_iospace+0x7c/0xac
         pci_parse_request_of_pci_ranges+0x13c/0x190
         rcar_pcie_probe+0x4c/0xb04
         platform_drv_probe+0x50/0xbc
         driver_probe_device+0x21c/0x308
         __device_attach_driver+0x98/0xc8
         bus_for_each_drv+0x54/0x94
         __device_attach+0xc4/0x12c
         device_initial_probe+0x10/0x18
         bus_probe_device+0x90/0x98
         deferred_probe_work_func+0xb0/0x150
         process_one_work+0x12c/0x29c
         worker_thread+0x200/0x3fc
         kthread+0x108/0x134
         ret_from_fork+0x10/0x18
        Code: f9004ba2 54000080 aa0003fb 17ffff48 (d4210000)
      
      It turned out that pci_remap_iospace() wasn't undone when the driver's
      probe failed, and since devm_phy_optional_get() returned -EPROBE_DEFER,
      the probe was retried, finally causing the BUG due to trying to remap
      already remapped pages.
      
      Introduce the devm_pci_remap_iospace() managed API and replace the
      pci_remap_iospace() call with it to fix the bug.
      
      Fixes: dbf9826d ("PCI: generic: Convert to DT resource parsing API")
      Signed-off-by: NSergei Shtylyov <sergei.shtylyov@cogentembedded.com>
      [lorenzo.pieralisi@arm.com: split commit/updated the commit log]
      Signed-off-by: NLorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
      Reviewed-by: NLinus Walleij <linus.walleij@linaro.org>
      a5fb9fb0
  11. 18 7月, 2018 2 次提交
    • C
      aio: don't expose __aio_sigset in uapi · 9ba546c0
      Christoph Hellwig 提交于
      glibc uses a different defintion of sigset_t than the kernel does,
      and the current version would pull in both.  To fix this just do not
      expose the type at all - this somewhat mirrors pselect() where we
      do not even have a type for the magic sigmask argument, but just
      use pointer arithmetics.
      
      Fixes: 7a074e96 ("aio: implement io_pgetevents")
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reported-by: NAdrian Reber <adrian@lisas.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      9ba546c0
    • T
      netfilter: nf_tables: fix jumpstack depth validation · 26b2f552
      Taehee Yoo 提交于
      The level of struct nft_ctx is updated by nf_tables_check_loops().  That
      is used to validate jumpstack depth. But jumpstack validation routine
      doesn't update and validate recursively.  So, in some cases, chain depth
      can be bigger than the NFT_JUMP_STACK_SIZE.
      
      After this patch, The jumpstack validation routine is located in the
      nft_chain_validate(). When new rules or new set elements are added, the
      nft_table_validate() is called by the nf_tables_newrule and the
      nf_tables_newsetelem. The nft_table_validate() calls the
      nft_chain_validate() that visit all their children chains recursively.
      So it can update depth of chain certainly.
      
      Reproducer:
         %cat ./test.sh
         #!/bin/bash
         nft add table ip filter
         nft add chain ip filter input { type filter hook input priority 0\; }
         for ((i=0;i<20;i++)); do
      	nft add chain ip filter a$i
         done
      
         nft add rule ip filter input jump a1
      
         for ((i=0;i<10;i++)); do
      	nft add rule ip filter a$i jump a$((i+1))
         done
      
         for ((i=11;i<19;i++)); do
      	nft add rule ip filter a$i jump a$((i+1))
         done
      
         nft add rule ip filter a10 jump a11
      
      Result:
      [  253.931782] WARNING: CPU: 1 PID: 0 at net/netfilter/nf_tables_core.c:186 nft_do_chain+0xacc/0xdf0 [nf_tables]
      [  253.931915] Modules linked in: nf_tables nfnetlink ip_tables x_tables
      [  253.932153] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.18.0-rc3+ #48
      [  253.932153] RIP: 0010:nft_do_chain+0xacc/0xdf0 [nf_tables]
      [  253.932153] Code: 83 f8 fb 0f 84 c7 00 00 00 e9 d0 00 00 00 83 f8 fd 74 0e 83 f8 ff 0f 84 b4 00 00 00 e9 bd 00 00 00 83 bd 64 fd ff ff 0f 76 09 <0f> 0b 31 c0 e9 bc 02 00 00 44 8b ad 64 fd
      [  253.933807] RSP: 0018:ffff88011b807570 EFLAGS: 00010212
      [  253.933807] RAX: 00000000fffffffd RBX: ffff88011b807660 RCX: 0000000000000000
      [  253.933807] RDX: 0000000000000010 RSI: ffff880112b39d78 RDI: ffff88011b807670
      [  253.933807] RBP: ffff88011b807850 R08: ffffed0023700ece R09: ffffed0023700ecd
      [  253.933807] R10: ffff88011b80766f R11: ffffed0023700ece R12: ffff88011b807898
      [  253.933807] R13: ffff880112b39d80 R14: ffff880112b39d60 R15: dffffc0000000000
      [  253.933807] FS:  0000000000000000(0000) GS:ffff88011b800000(0000) knlGS:0000000000000000
      [  253.933807] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  253.933807] CR2: 00000000014f1008 CR3: 000000006b216000 CR4: 00000000001006e0
      [  253.933807] Call Trace:
      [  253.933807]  <IRQ>
      [  253.933807]  ? sched_clock_cpu+0x132/0x170
      [  253.933807]  ? __nft_trace_packet+0x180/0x180 [nf_tables]
      [  253.933807]  ? sched_clock_cpu+0x132/0x170
      [  253.933807]  ? debug_show_all_locks+0x290/0x290
      [  253.933807]  ? __lock_acquire+0x4835/0x4af0
      [  253.933807]  ? inet_ehash_locks_alloc+0x1a0/0x1a0
      [  253.933807]  ? unwind_next_frame+0x159e/0x1840
      [  253.933807]  ? __read_once_size_nocheck.constprop.4+0x5/0x10
      [  253.933807]  ? nft_do_chain_ipv4+0x197/0x1e0 [nf_tables]
      [  253.933807]  ? nft_do_chain+0x5/0xdf0 [nf_tables]
      [  253.933807]  nft_do_chain_ipv4+0x197/0x1e0 [nf_tables]
      [  253.933807]  ? nft_do_chain_arp+0xb0/0xb0 [nf_tables]
      [  253.933807]  ? __lock_is_held+0x9d/0x130
      [  253.933807]  nf_hook_slow+0xc4/0x150
      [  253.933807]  ip_local_deliver+0x28b/0x380
      [  253.933807]  ? ip_call_ra_chain+0x3e0/0x3e0
      [  253.933807]  ? ip_rcv_finish+0x1610/0x1610
      [  253.933807]  ip_rcv+0xbcc/0xcc0
      [  253.933807]  ? debug_show_all_locks+0x290/0x290
      [  253.933807]  ? ip_local_deliver+0x380/0x380
      [  253.933807]  ? __lock_is_held+0x9d/0x130
      [  253.933807]  ? ip_local_deliver+0x380/0x380
      [  253.933807]  __netif_receive_skb_core+0x1c9c/0x2240
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      26b2f552
  12. 17 7月, 2018 5 次提交
    • S
      tcp: Fix broken repair socket window probe patch · 31048d7a
      Stefan Baranoff 提交于
      Correct previous bad attempt at allowing sockets to come out of TCP
      repair without sending window probes. To avoid changing size of
      the repair variable in struct tcp_sock, this lets the decision for
      sending probes or not to be made when coming out of repair by
      introducing two ways to turn it off.
      
      v2:
      * Remove erroneous comment; defines now make behavior clear
      
      Fixes: 70b7ff13 ("tcp: allow user to create repair socket without window probes")
      Signed-off-by: NStefan Baranoff <sbaranoff@gmail.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NAndrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      31048d7a
    • R
      net/ethernet/freescale/fman: fix cross-build error · c1334597
      Randy Dunlap 提交于
        CC [M]  drivers/net/ethernet/freescale/fman/fman.o
      In file included from ../drivers/net/ethernet/freescale/fman/fman.c:35:
      ../include/linux/fsl/guts.h: In function 'guts_set_dmacr':
      ../include/linux/fsl/guts.h:165:2: error: implicit declaration of function 'clrsetbits_be32' [-Werror=implicit-function-declaration]
        clrsetbits_be32(&guts->dmacr, 3 << shift, device << shift);
        ^~~~~~~~~~~~~~~
      Signed-off-by: NRandy Dunlap <rdunlap@infradead.org>
      Cc: Madalin Bucur <madalin.bucur@nxp.com>
      Cc: netdev@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c1334597
    • H
      ipv6/mcast: init as INCLUDE when join SSM INCLUDE group · c7ea20c9
      Hangbin Liu 提交于
      This an IPv6 version patch of "ipv4/igmp: init group mode as INCLUDE when
      join source group". From RFC3810, part 6.1:
      
         If no per-interface state existed for that
         multicast address before the change (i.e., the change consisted of
         creating a new per-interface record), or if no state exists after the
         change (i.e., the change consisted of deleting a per-interface
         record), then the "non-existent" state is considered to have an
         INCLUDE filter mode and an empty source list.
      
      Which means a new multicast group should start with state IN(). Currently,
      for MLDv2 SSM JOIN_SOURCE_GROUP mode, we first call ipv6_sock_mc_join(),
      then ip6_mc_source(), which will trigger a TO_IN() message instead of
      ALLOW().
      
      The issue was exposed by commit a052517a ("net/multicast: should not
      send source list records when have filter mode change"). Before this change,
      we sent both ALLOW(A) and TO_IN(A). Now, we only send TO_IN(A).
      
      Fix it by adding a new parameter to init group mode. Also add some wrapper
      functions to avoid changing too much code.
      
      v1 -> v2:
      In the first version I only cleared the group change record. But this is not
      enough. Because when a new group join, it will init as EXCLUDE and trigger
      a filter mode change in ip/ip6_mc_add_src(), which will clear all source
      addresses sf_crcount. This will prevent early joined address sending state
      change records if multi source addressed joined at the same time.
      
      In v2 patch, I fixed it by directly initializing the mode to INCLUDE for SSM
      JOIN_SOURCE_GROUP. I also split the original patch into two separated patches
      for IPv4 and IPv6.
      
      There is also a difference between v4 and v6 version. For IPv6, when the
      interface goes down and up, we will send correct state change record with
      unspecified IPv6 address (::) with function ipv6_mc_up(). But after DAD is
      completed, we resend the change record TO_IN() in mld_send_initial_cr().
      Fix it by sending ALLOW() for INCLUDE mode in mld_send_initial_cr().
      
      Fixes: a052517a ("net/multicast: should not send source list records when have filter mode change")
      Reviewed-by: NStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: NHangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c7ea20c9
    • H
      ipv4/igmp: init group mode as INCLUDE when join source group · 6e2059b5
      Hangbin Liu 提交于
      Based on RFC3376 5.1
         If no interface
         state existed for that multicast address before the change (i.e., the
         change consisted of creating a new per-interface record), or if no
         state exists after the change (i.e., the change consisted of deleting
         a per-interface record), then the "non-existent" state is considered
         to have a filter mode of INCLUDE and an empty source list.
      
      Which means a new multicast group should start with state IN().
      
      Function ip_mc_join_group() works correctly for IGMP ASM(Any-Source Multicast)
      mode. It adds a group with state EX() and inits crcount to mc_qrv,
      so the kernel will send a TO_EX() report message after adding group.
      
      But for IGMPv3 SSM(Source-specific multicast) JOIN_SOURCE_GROUP mode, we
      split the group joining into two steps. First we join the group like ASM,
      i.e. via ip_mc_join_group(). So the state changes from IN() to EX().
      
      Then we add the source-specific address with INCLUDE mode. So the state
      changes from EX() to IN(A).
      
      Before the first step sends a group change record, we finished the second
      step. So we will only send the second change record. i.e. TO_IN(A).
      
      Regarding the RFC stands, we should actually send an ALLOW(A) message for
      SSM JOIN_SOURCE_GROUP as the state should mimic the 'IN() to IN(A)'
      transition.
      
      The issue was exposed by commit a052517a ("net/multicast: should not
      send source list records when have filter mode change"). Before this change,
      we used to send both ALLOW(A) and TO_IN(A). After this change we only send
      TO_IN(A).
      
      Fix it by adding a new parameter to init group mode. Also add new wrapper
      functions so we don't need to change too much code.
      
      v1 -> v2:
      In my first version I only cleared the group change record. But this is not
      enough. Because when a new group join, it will init as EXCLUDE and trigger
      an filter mode change in ip/ip6_mc_add_src(), which will clear all source
      addresses' sf_crcount. This will prevent early joined address sending state
      change records if multi source addressed joined at the same time.
      
      In v2 patch, I fixed it by directly initializing the mode to INCLUDE for SSM
      JOIN_SOURCE_GROUP. I also split the original patch into two separated patches
      for IPv4 and IPv6.
      
      Fixes: a052517a ("net/multicast: should not send source list records when have filter mode change")
      Reviewed-by: NStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: NHangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6e2059b5
    • P
      mm: don't do zero_resv_unavail if memmap is not allocated · d1b47a7c
      Pavel Tatashin 提交于
      Moving zero_resv_unavail before memmap_init_zone(), caused a regression on
      x86-32.
      
      The cause is that we access struct pages before they are allocated when
      CONFIG_FLAT_NODE_MEM_MAP is used.
      
      free_area_init_nodes()
        zero_resv_unavail()
          mm_zero_struct_page(pfn_to_page(pfn)); <- struct page is not alloced
        free_area_init_node()
          if CONFIG_FLAT_NODE_MEM_MAP
            alloc_node_mem_map()
              memblock_virt_alloc_node_nopanic() <- struct page alloced here
      
      On the other hand memblock_virt_alloc_node_nopanic() zeroes all the memory
      that it returns, so we do not need to do zero_resv_unavail() here.
      
      Fixes: e181ae0c ("mm: zero unavailable pages before memmap init")
      Signed-off-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Tested-by: NMatt Hart <matt@mattface.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d1b47a7c
  13. 15 7月, 2018 1 次提交
  14. 14 7月, 2018 2 次提交
  15. 13 7月, 2018 1 次提交
    • S
      net: Don't copy pfmemalloc flag in __copy_skb_header() · 8b700862
      Stefano Brivio 提交于
      The pfmemalloc flag indicates that the skb was allocated from
      the PFMEMALLOC reserves, and the flag is currently copied on skb
      copy and clone.
      
      However, an skb copied from an skb flagged with pfmemalloc
      wasn't necessarily allocated from PFMEMALLOC reserves, and on
      the other hand an skb allocated that way might be copied from an
      skb that wasn't.
      
      So we should not copy the flag on skb copy, and rather decide
      whether to allow an skb to be associated with sockets unrelated
      to page reclaim depending only on how it was allocated.
      
      Move the pfmemalloc flag before headers_start[0] using an
      existing 1-bit hole, so that __copy_skb_header() doesn't copy
      it.
      
      When cloning, we'll now take care of this flag explicitly,
      contravening to the warning comment of __skb_clone().
      
      While at it, restore the newline usage introduced by commit
      b1937227 ("net: reorganize sk_buff for faster
      __copy_skb_header()") to visually separate bytes used in
      bitfields after headers_start[0], that was gone after commit
      a9e419dc ("netfilter: merge ctinfo into nfct pointer storage
      area"), and describe the pfmemalloc flag in the kernel-doc
      structure comment.
      
      This doesn't change the size of sk_buff or cacheline boundaries,
      but consolidates the 15 bits hole before tc_index into a 2 bytes
      hole before csum, that could now be filled more easily.
      Reported-by: NPatrick Talbert <ptalbert@redhat.com>
      Fixes: c93bdd0e ("netvm: allow skb allocation to use PFMEMALLOC reserves")
      Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8b700862
  16. 11 7月, 2018 5 次提交
    • A
      drm_mode_create_lease_ioctl(): fix open-coded filp_clone_open() · b4e7a7a8
      Al Viro 提交于
      Failure of ->open() should *not* be followed by fput().  Fixed by
      using filp_clone_open(), which gets the cleanups right.
      
      Cc: stable@vger.kernel.org
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b4e7a7a8
    • M
      rseq: Remove unused types_32_64.h uapi header · 4f4c0acd
      Mathieu Desnoyers 提交于
      This header was introduced in the 4.18 merge window, and rseq does
      not need it anymore. Nuke it before the final release.
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: linux-api@vger.kernel.org
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Link: https://lkml.kernel.org/r/20180709195155.7654-6-mathieu.desnoyers@efficios.com
      4f4c0acd
    • M
      rseq: uapi: Declare rseq_cs field as union, update includes · ec9c82e0
      Mathieu Desnoyers 提交于
      Declaring the rseq_cs field as a union between __u64 and two __u32
      allows both 32-bit and 64-bit kernels to read the full __u64, and
      therefore validate that a 32-bit user-space cleared the upper 32
      bits, thus ensuring a consistent behavior between native 32-bit
      kernels and 32-bit compat tasks on 64-bit kernels.
      
      Check that the rseq_cs value read is < TASK_SIZE.
      
      The asm/byteorder.h header needs to be included by rseq.h, now
      that it is not using linux/types_32_64.h anymore.
      
      Considering that only __32 and __u64 types are declared in linux/rseq.h,
      the linux/types.h header should always be included for both kernel and
      user-space code: including stdint.h is just for u64 and u32, which are
      not used in this header at all.
      
      Use copy_from_user()/clear_user() to interact with a 64-bit field,
      because arm32 does not implement 64-bit __get_user, and ppc32 does not
      64-bit get_user. Considering that the rseq_cs pointer does not need to
      be loaded/stored with single-copy atomicity from the kernel anymore, we
      can simply use copy_from_user()/clear_user().
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: linux-api@vger.kernel.org
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Link: https://lkml.kernel.org/r/20180709195155.7654-5-mathieu.desnoyers@efficios.com
      ec9c82e0
    • M
      rseq: uapi: Update uapi comments · 0fb9a1ab
      Mathieu Desnoyers 提交于
      Update rseq uapi header comments to reflect that user-space need to do
      thread-local loads/stores from/to the struct rseq fields.
      
      As a consequence of this added requirement, the kernel does not need
      to perform loads/stores with single-copy atomicity.
      
      Update the comment associated to the "flags" fields to describe
      more accurately that it's only useful to facilitate single-stepping
      through rseq critical sections with debuggers.
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: linux-api@vger.kernel.org
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Link: https://lkml.kernel.org/r/20180709195155.7654-4-mathieu.desnoyers@efficios.com
      0fb9a1ab
    • M
      rseq: Use __u64 for rseq_cs fields, validate user inputs · e96d7135
      Mathieu Desnoyers 提交于
      Change the rseq ABI so rseq_cs start_ip, post_commit_offset and abort_ip
      fields are seen as 64-bit fields by both 32-bit and 64-bit kernels rather
      that ignoring the 32 upper bits on 32-bit kernels. This ensures we have a
      consistent behavior for a 32-bit binary executed on 32-bit kernels and in
      compat mode on 64-bit kernels.
      
      Validating the value of abort_ip field to be below TASK_SIZE ensures the
      kernel don't return to an invalid address when returning to userspace
      after an abort. I don't fully trust each architecture code to consistently
      deal with invalid return addresses.
      
      Validating the value of the start_ip and post_commit_offset fields
      prevents overflow on arithmetic performed on those values, used to
      check whether abort_ip is within the rseq critical section.
      
      If validation fails, the process is killed with a segmentation fault.
      
      When the signature encountered before abort_ip does not match the expected
      signature, return -EINVAL rather than -EPERM to be consistent with other
      input validation return codes from rseq_get_rseq_cs().
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: linux-api@vger.kernel.org
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Link: https://lkml.kernel.org/r/20180709195155.7654-2-mathieu.desnoyers@efficios.com
      e96d7135
  17. 09 7月, 2018 1 次提交
    • R
      bpf: include errno.h from bpf-cgroup.h · f292b87d
      Roman Gushchin 提交于
      Commit fdb5c453 ("bpf: fix attach type BPF_LIRC_MODE2 dependency
      wrt CONFIG_CGROUP_BPF") caused some build issues, detected by 0-DAY
      kernel test infrastructure.
      
      The problem is that cgroup_bpf_prog_attach/detach/query() functions
      can return -EINVAL error code, which is not defined. Fix this adding
      errno.h to includes.
      
      Fixes: fdb5c453 ("bpf: fix attach type BPF_LIRC_MODE2 dependency wrt CONFIG_CGROUP_BPF")
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Cc: Sean Young <sean@mess.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      f292b87d
  18. 08 7月, 2018 2 次提交
    • T
      xdp: XDP_REDIRECT should check IFF_UP and MTU · d8d7218a
      Toshiaki Makita 提交于
      Otherwise we end up with attempting to send packets from down devices
      or to send oversized packets, which may cause unexpected driver/device
      behaviour. Generic XDP has already done this check, so reuse the logic
      in native XDP.
      
      Fixes: 814abfab ("xdp: add bpf_redirect helper function")
      Signed-off-by: NToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      d8d7218a
    • J
      bpf: sockmap, convert bpf_compute_data_pointers to bpf_*_sk_skb · 0ea488ff
      John Fastabend 提交于
      In commit
      
        'bpf: bpf_compute_data uses incorrect cb structure' (8108a775)
      
      we added the routine bpf_compute_data_end_sk_skb() to compute the
      correct data_end values, but this has since been lost. In kernel
      v4.14 this was correct and the above patch was applied in it
      entirety. Then when v4.14 was merged into v4.15-rc1 net-next tree
      we lost the piece that renamed bpf_compute_data_pointers to the
      new function bpf_compute_data_end_sk_skb. This was done here,
      
      e1ea2f98 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
      
      When it conflicted with the following rename patch,
      
      6aaae2b6 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
      
      Finally, after a refactor I thought even the function
      bpf_compute_data_end_sk_skb() was no longer needed and it was
      erroneously removed.
      
      However, we never reverted the sk_skb_convert_ctx_access() usage of
      tcp_skb_cb which had been committed and survived the merge conflict.
      Here we fix this by adding back the helper and *_data_end_sk_skb()
      usage. Using the bpf_skc_data_end mapping is not correct because it
      expects a qdisc_skb_cb object but at the sock layer this is not the
      case. Even though it happens to work here because we don't overwrite
      any data in-use at the socket layer and the cb structure is cleared
      later this has potential to create some subtle issues. But, even
      more concretely the filter.c access check uses tcp_skb_cb.
      
      And by some act of chance though,
      
      struct bpf_skb_data_end {
              struct qdisc_skb_cb        qdisc_cb;             /*     0    28 */
      
              /* XXX 4 bytes hole, try to pack */
      
              void *                     data_meta;            /*    32     8 */
              void *                     data_end;             /*    40     8 */
      
              /* size: 48, cachelines: 1, members: 3 */
              /* sum members: 44, holes: 1, sum holes: 4 */
              /* last cacheline: 48 bytes */
      };
      
      and then tcp_skb_cb,
      
      struct tcp_skb_cb {
      	[...]
                      struct {
                              __u32      flags;                /*    24     4 */
                              struct sock * sk_redir;          /*    32     8 */
                              void *     data_end;             /*    40     8 */
                      } bpf;                                   /*          24 */
              };
      
      So when we use offset_of() to track down the byte offset we get 40 in
      either case and everything continues to work. Fix this mess and use
      correct structures its unclear how long this might actually work for
      until someone moves the structs around.
      Reported-by: NMartin KaFai Lau <kafai@fb.com>
      Fixes: e1ea2f98 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
      Fixes: 6aaae2b6 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      0ea488ff
  19. 07 7月, 2018 2 次提交
    • X
      uio: change to use the mutex lock instead of the spin lock · 543af586
      Xiubo Li 提交于
      We are hitting a regression with the following commit:
      
      commit a93e7b33
      Author: Hamish Martin <hamish.martin@alliedtelesis.co.nz>
      Date:   Mon May 14 13:32:23 2018 +1200
      
          uio: Prevent device destruction while fds are open
      
      The problem is the addition of spin_lock_irqsave in uio_write. This
      leads to hitting  uio_write -> copy_from_user -> _copy_from_user ->
      might_fault and the logs filling up with sleeping warnings.
      
      I also noticed some uio drivers allocate memory, sleep, grab mutexes
      from callouts like open() and release and uio is now doing
      spin_lock_irqsave while calling them.
      Reported-by: NMike Christie <mchristi@redhat.com>
      CC: Hamish Martin <hamish.martin@alliedtelesis.co.nz>
      Reviewed-by: NHamish Martin <hamish.martin@alliedtelesis.co.nz>
      Signed-off-by: NXiubo Li <xiubli@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      543af586
    • D
      net/sched: act_tunnel_key: fix NULL dereference when 'goto chain' is used · 38230a3e
      Davide Caratti 提交于
      the control action in the common member of struct tcf_tunnel_key must be a
      valid value, as it can contain the chain index when 'goto chain' is used.
      Ensure that the control action can be read as x->tcfa_action, when x is a
      pointer to struct tc_action and x->ops->type is TCA_ACT_TUNNEL_KEY, to
      prevent the following command:
      
       # tc filter add dev $h2 ingress protocol ip pref 1 handle 101 flower \
       > $tcflags dst_mac $h2mac action tunnel_key unset goto chain 1
      
      from causing a NULL dereference when a matching packet is received:
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
       PGD 80000001097ac067 P4D 80000001097ac067 PUD 103b0a067 PMD 0
       Oops: 0000 [#1] SMP PTI
       CPU: 0 PID: 3491 Comm: mausezahn Tainted: G            E     4.18.0-rc2.auguri+ #421
       Hardware name: Hewlett-Packard HP Z220 CMT Workstation/1790, BIOS K51 v01.58 02/07/2013
       RIP: 0010:tcf_action_exec+0xb8/0x100
       Code: 00 00 00 20 74 1d 83 f8 03 75 09 49 83 c4 08 4d 39 ec 75 bc 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 97 a8 00 00 00 <48> 8b 12 48 89 55 00 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3
       RSP: 0018:ffff95145ea03c40 EFLAGS: 00010246
       RAX: 0000000020000001 RBX: ffff9514499e5800 RCX: 0000000000000001
       RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000000
       RBP: ffff95145ea03e60 R08: 0000000000000000 R09: ffff95145ea03c9c
       R10: ffff95145ea03c78 R11: 0000000000000008 R12: ffff951456a69800
       R13: ffff951456a69808 R14: 0000000000000001 R15: ffff95144965ee40
       FS:  00007fd67ee11740(0000) GS:ffff95145ea00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000000 CR3: 00000001038a2006 CR4: 00000000001606f0
       Call Trace:
        <IRQ>
        fl_classify+0x1ad/0x1c0 [cls_flower]
        ? __update_load_avg_se.isra.47+0x1ca/0x1d0
        ? __update_load_avg_se.isra.47+0x1ca/0x1d0
        ? update_load_avg+0x665/0x690
        ? update_load_avg+0x665/0x690
        ? kmem_cache_alloc+0x38/0x1c0
        tcf_classify+0x89/0x140
        __netif_receive_skb_core+0x5ea/0xb70
        ? enqueue_entity+0xd0/0x270
        ? process_backlog+0x97/0x150
        process_backlog+0x97/0x150
        net_rx_action+0x14b/0x3e0
        __do_softirq+0xde/0x2b4
        do_softirq_own_stack+0x2a/0x40
        </IRQ>
        do_softirq.part.18+0x49/0x50
        __local_bh_enable_ip+0x49/0x50
        __dev_queue_xmit+0x4ab/0x8a0
        ? wait_woken+0x80/0x80
        ? packet_sendmsg+0x38f/0x810
        ? __dev_queue_xmit+0x8a0/0x8a0
        packet_sendmsg+0x38f/0x810
        sock_sendmsg+0x36/0x40
        __sys_sendto+0x10e/0x140
        ? do_vfs_ioctl+0xa4/0x630
        ? syscall_trace_enter+0x1df/0x2e0
        ? __audit_syscall_exit+0x22a/0x290
        __x64_sys_sendto+0x24/0x30
        do_syscall_64+0x5b/0x180
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7fd67e18dc93
       Code: 48 8b 0d 18 83 20 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 59 c7 20 00 00 75 13 49 89 ca b8 2c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 2b f7 ff ff 48 89 04 24
       RSP: 002b:00007ffe0189b748 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
       RAX: ffffffffffffffda RBX: 00000000020ca010 RCX: 00007fd67e18dc93
       RDX: 0000000000000062 RSI: 00000000020ca322 RDI: 0000000000000003
       RBP: 00007ffe0189b780 R08: 00007ffe0189b760 R09: 0000000000000014
       R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000062
       R13: 00000000020ca322 R14: 00007ffe0189b760 R15: 0000000000000003
       Modules linked in: act_tunnel_key act_gact cls_flower sch_ingress vrf veth act_csum(E) xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter intel_rapl snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_realtek coretemp snd_hda_codec_generic kvm_intel kvm irqbypass snd_hda_intel crct10dif_pclmul crc32_pclmul hp_wmi ghash_clmulni_intel pcbc snd_hda_codec aesni_intel sparse_keymap rfkill snd_hda_core snd_hwdep snd_seq crypto_simd iTCO_wdt gpio_ich iTCO_vendor_support wmi_bmof cryptd mei_wdt glue_helper snd_seq_device snd_pcm pcspkr snd_timer snd i2c_i801 lpc_ich sg soundcore wmi mei_me
        mei ie31200_edac nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod sr_mod cdrom i915 video i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci crc32c_intel libahci serio_raw sfc libata mtd drm ixgbe mdio i2c_core e1000e dca
       CR2: 0000000000000000
       ---[ end trace 1ab8b5b5d4639dfc ]---
       RIP: 0010:tcf_action_exec+0xb8/0x100
       Code: 00 00 00 20 74 1d 83 f8 03 75 09 49 83 c4 08 4d 39 ec 75 bc 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 97 a8 00 00 00 <48> 8b 12 48 89 55 00 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3
       RSP: 0018:ffff95145ea03c40 EFLAGS: 00010246
       RAX: 0000000020000001 RBX: ffff9514499e5800 RCX: 0000000000000001
       RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000000
       RBP: ffff95145ea03e60 R08: 0000000000000000 R09: ffff95145ea03c9c
       R10: ffff95145ea03c78 R11: 0000000000000008 R12: ffff951456a69800
       R13: ffff951456a69808 R14: 0000000000000001 R15: ffff95144965ee40
       FS:  00007fd67ee11740(0000) GS:ffff95145ea00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000000 CR3: 00000001038a2006 CR4: 00000000001606f0
       Kernel panic - not syncing: Fatal exception in interrupt
       Kernel Offset: 0x11400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
       ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
      
      Fixes: d0f6dd8a ("net/sched: Introduce act_tunnel_key")
      Signed-off-by: NDavide Caratti <dcaratti@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      38230a3e