1. 19 8月, 2016 13 次提交
    • E
      tcp: defer sacked assignment · dca0aaf8
      Eric Dumazet 提交于
      While chasing tcp_xmit_retransmit_queue() kasan issue, I found
      that we could avoid reading sacked field of skb that we wont send,
      possibly removing one cache line miss.
      
      Very minor change in slow path, but why not ? ;)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dca0aaf8
    • N
      net: bridge: export vlan flags with the stats · 61ba1a2d
      Nikolay Aleksandrov 提交于
      Use one of the vlan xstats padding fields to export the vlan flags. This is
      needed in order to be able to distinguish between master (bridge) and port
      vlan entries in user-space when dumping the bridge vlan stats.
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      61ba1a2d
    • N
      net: bridge: consolidate bridge and port linkxstats calls · d5ff8c41
      Nikolay Aleksandrov 提交于
      In the bridge driver we usually have the same function working for both
      port and bridge. In order to follow that logic and also avoid code
      duplication, consolidate the bridge_ and brport_ linkxstats calls into
      one since they share most of their code. As a side effect this allows us
      to dump the vlan stats also via the slave call which is in preparation for
      the upcoming per-port vlan stats and vlan flag dumping.
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d5ff8c41
    • H
      net_sched: act_vlan: Add priority option · 956af371
      Hadar Hen Zion 提交于
      The current vlan push action supports only vid and protocol options.
      Add priority option.
      
      Example script that adds vlan push action with vid and
      priority:
      
      tc filter add dev veth0 protocol ip parent ffff: \
      	   flower \
      	   	indev veth0 \
      	   action vlan push id 100 priority 5
      Signed-off-by: NHadar Hen Zion <hadarh@mellanox.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      956af371
    • H
      net_sched: flower: Add vlan support · 9399ae9a
      Hadar Hen Zion 提交于
      Enhance flower to support 802.1Q vlan protocol classification.
      Currently, the supported fields are vlan_id and vlan_priority.
      
      Example:
      
      	# add a flower filter with vlan id and priority classification
      	tc filter add dev ens4f0 protocol 802.1Q parent ffff: \
      		flower \
      		indev ens4f0 \
      		vlan_ethtype ipv4 \
      		vlan_id 100 \
      		vlan_prio 3 \
      	action vlan pop
      Signed-off-by: NHadar Hen Zion <hadarh@mellanox.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9399ae9a
    • H
      net_sched: flower: Avoid dissection of unmasked keys · 339ba878
      Hadar Hen Zion 提交于
      The current flower implementation checks the mask range and set all the
      keys included in that range as "used_keys", even if a specific key in
      the range has a zero mask.
      
      This behavior can cause a false positive return value of
      dissector_uses_key function and unnecessary dissection in
      __skb_flow_dissect.
      
      This patch checks explicitly the mask of each key and "used_keys" will
      be set accordingly.
      
      Fixes: 77b9900e ('tc: introduce Flower classifier')
      Signed-off-by: NHadar Hen Zion <hadarh@mellanox.com>
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      339ba878
    • H
      flow_dissector: Get vlan priority in addition to vlan id · f6a66927
      Hadar Hen Zion 提交于
      Add vlan priority check to the flow dissector by adding new flow
      dissector struct, flow_dissector_key_vlan which includes vlan tag
      fields.
      
      vlan_id and flow_label fields were under the same struct
      (flow_dissector_key_tags). It was a convenient setting since struct
      flow_dissector_key_tags is used by struct flow_keys and by setting
      vlan_id and flow_label under the same struct, we get precisely 24 or 48
      bytes in flow_keys from flow_dissector_key_basic.
      
      Now, when adding vlan priority support, the code will be cleaner if
      flow_label and vlan tag won't be under the same struct anymore.
      Signed-off-by: NHadar Hen Zion <hadarh@mellanox.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f6a66927
    • H
      flow_dissector: For stripped vlan, get vlan info from skb->vlan_tci · d5709f7a
      Hadar Hen Zion 提交于
      Early in the datapath skb_vlan_untag function is called, stripped
      the vlan from the skb and set skb->vlan_tci and skb->vlan_proto fields.
      
      The current dissection doesn't handle stripped vlan packets correctly.
      In some flows, vlan doesn't exist in skb->data anymore when applying
      flow dissection on the skb, fix that.
      
      In case vlan info wasn't stripped before applying flow_dissector (RPS
      flow for example), or in case of skb with multiple vlans (e.g. 802.1ad),
      get the vlan info from skb->data. The flow_dissector correctly skips
      any number of vlans and stores only the first level vlan.
      
      Fixes: 0744dd00 ('net: introduce skb_flow_dissect()')
      Signed-off-by: NHadar Hen Zion <hadarh@mellanox.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d5709f7a
    • J
      net: sched: avoid duplicates in qdisc dump · ea327469
      Jiri Kosina 提交于
      tc_dump_qdisc() performs dumping of the per-device qdiscs in two phases;
      first, the "standard" dev->qdisc is being dumped. Second, if there is/are
      ingress queue(s), they are being dumped as well.
      
      After conversion of netdevice's qdisc linked-list into hashtable, these
      two sets are not in two disjunctive sets/lists any more, but are both
      "reachable" directly from netdevice's hashtable. As a consequence, the
      "full-depth" dump of the ingress qdiscs results in immediately hitting the
      netdevice hashtable again, and duplicating the dump that has already been
      performed for dev->qdisc.
      What in fact needs to be dumped in case of ingress queue is "just" the
      top-level ingress qdisc, as everything else has been dumped already.
      
      Fix this by extending tc_dump_qdisc_root() in a way that it can be instructed
      whether it should (while performing the "full" per-netdev qdisc dump) perform
      the whole recursion, or just dump "additional" top-level (ingress) qdiscs
      without performing any kind of recursion.
      
      This fixes duplicate dumps such as
      
      	qdisc mq 0: root
      	qdisc pfifo_fast 0: parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
      	qdisc pfifo_fast 0: parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
      	qdisc pfifo_fast 0: parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
      	qdisc pfifo_fast 0: parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
      	qdisc clsact ffff: parent ffff:fff1
      	qdisc pfifo_fast 0: parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
      	qdisc pfifo_fast 0: parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
      	qdisc pfifo_fast 0: parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
      	qdisc pfifo_fast 0: parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
      
      Fixes: 59cc1f61 ("net: sched: convert qdisc linked list to hashtable")
      Reported-by: NDaniel Borkmann <daniel@iogearbox.net>
      Tested-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NJiri Kosina <jkosina@suse.cz>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ea327469
    • J
      net: sched: fix handling of singleton qdiscs with qdisc_hash · 69012ae4
      Jiri Kosina 提交于
      qdisc_match_from_root() is now iterating over per-netdevice qdisc
      hashtable instead of going through a linked-list of qdiscs (independently
      on the actual underlying netdev), which was the case before the switch to
      hashtable for qdiscs.
      
      For singleton qdiscs, there is no underlying netdev associated though, and
      therefore dumping a singleton qdisc will panic, as qdisc_dev(root) will
      always be NULL.
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000410
       IP: [<ffffffff8167efac>] qdisc_match_from_root+0x2c/0x70
       PGD 1aceba067 PUD 1aceb7067 PMD 0
       Oops: 0000 [#1] PREEMPT SMP
      [ ... ]
       task: ffff8801ec996e00 task.stack: ffff8801ec934000
       RIP: 0010:[<ffffffff8167efac>]  [<ffffffff8167efac>] qdisc_match_from_root+0x2c/0x70
       RSP: 0018:ffff8801ec937ab0  EFLAGS: 00010203
       RAX: 0000000000000408 RBX: ffff88025e612000 RCX: ffffffffffffffd8
       RDX: 0000000000000000 RSI: 00000000ffff0000 RDI: ffffffff81cf8100
       RBP: ffff8801ec937ab0 R08: 000000000001c160 R09: ffff8802668032c0
       R10: ffffffff81cf8100 R11: 0000000000000030 R12: 00000000ffff0000
       R13: ffff88025e612000 R14: ffffffff81cf3140 R15: 0000000000000000
       FS:  00007f24b9af6740(0000) GS:ffff88026f280000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000410 CR3: 00000001aceec000 CR4: 00000000001406e0
       Stack:
        ffff8801ec937ad0 ffffffff81681210 ffff88025dd51a00 00000000fffffff1
        ffff8801ec937b88 ffffffff81681e4e ffffffff81c42bc0 ffff880262431500
        ffffffff81cf3140 ffff88025dd51a10 ffff88025dd51a24 00000000ec937b38
       Call Trace:
        [<ffffffff81681210>] qdisc_lookup+0x40/0x50
        [<ffffffff81681e4e>] tc_modify_qdisc+0x21e/0x550
        [<ffffffff8166ae25>] rtnetlink_rcv_msg+0x95/0x220
        [<ffffffff81209602>] ? __kmalloc_track_caller+0x172/0x230
        [<ffffffff8166ad90>] ? rtnl_newlink+0x870/0x870
        [<ffffffff816897b7>] netlink_rcv_skb+0xa7/0xc0
        [<ffffffff816657c8>] rtnetlink_rcv+0x28/0x30
        [<ffffffff8168919b>] netlink_unicast+0x15b/0x210
        [<ffffffff81689569>] netlink_sendmsg+0x319/0x390
        [<ffffffff816379f8>] sock_sendmsg+0x38/0x50
        [<ffffffff81638296>] ___sys_sendmsg+0x256/0x260
        [<ffffffff811b1275>] ? __pagevec_lru_add_fn+0x135/0x280
        [<ffffffff811b1a90>] ? pagevec_lru_move_fn+0xd0/0xf0
        [<ffffffff811b1140>] ? trace_event_raw_event_mm_lru_insertion+0x180/0x180
        [<ffffffff811b1b85>] ? __lru_cache_add+0x75/0xb0
        [<ffffffff817708a6>] ? _raw_spin_unlock+0x16/0x40
        [<ffffffff811d8dff>] ? handle_mm_fault+0x39f/0x1160
        [<ffffffff81638b15>] __sys_sendmsg+0x45/0x80
        [<ffffffff81638b62>] SyS_sendmsg+0x12/0x20
        [<ffffffff810038e7>] do_syscall_64+0x57/0xb0
      
      Fix this by special-casing singleton qdiscs (those that don't have
      underlying netdevice) and introduce immediate handling of those rather
      than trying to go over an underlying netdevice. We're in the same
      situation in tc_dump_qdisc_root() and tc_dump_tclass_root().
      
      Ultimately, this will have to be slightly reworked so that we are actually
      able to show singleton qdiscs (noop) in the dump properly; but we're not
      currently doing that anyway, so no regression there, and better do this in
      a gradual manner.
      
      Fixes: 59cc1f61 ("net: sched: convert qdisc linked list to hashtable")
      Reported-by: NDaniel Borkmann <daniel@iogearbox.net>
      Tested-by: NDaniel Borkmann <daniel@iogearbox.net>
      Reported-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Tested-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NJiri Kosina <jkosina@suse.cz>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      69012ae4
    • J
      tipc: ensure that link congestion and wakeup use same criteria · 5a0950c2
      Jon Paul Maloy 提交于
      When a link is attempted woken up after congestion, it uses a different,
      more generous criteria than when it was originally declared congested.
      This has the effect that the link, and the sending process, sometimes
      will be woken up unnecessarily, just to immediately return to congestion
      when it turns out there is not not enough space in its send queue to
      host the pending message. This is a waste of CPU cycles.
      
      We now change the function link_prepare_wakeup() to use exactly the same
      criteria as tipc_link_xmit(). However, since we are now excluding the
      window limit from the wakeup calculation, and the current backlog limit
      for the lowest level is too small to house even a single maximum-size
      message, we have to expand this limit. We do this by evaluating an
      alternative, minimum value during the setting of the importance limits.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5a0950c2
    • J
      tipc: make bearer packet filtering generic · 0d051bf9
      Jon Paul Maloy 提交于
      In commit 5b7066c3 ("tipc: stricter filtering of packets in bearer
      layer") we introduced a method of filtering out messages while a bearer
      is being reset, to avoid that links may be re-created and come back in
      working state while we are still in the process of shutting them down.
      
      This solution works well, but is limited to only work with L2 media, which
      is insufficient with the increasing use of UDP as carrier media.
      
      We now replace this solution with a more generic one, by introducing a
      new flag "up" in the generic struct tipc_bearer. This field will be set
      and reset at the same locations as with the previous solution, while
      the packet filtering is moved to the generic code for the sending side.
      On the receiving side, the filtering is still done in media specific
      code, but now including the UDP bearer.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0d051bf9
    • C
      net: atm: remove redundant null pointer check on dev->name · 0d135e4f
      Colin Ian King 提交于
      dev->name is a char array of IFNAMSIZ elements, hence can never be
      null, so the null pointer check is redundant. Remove it.
      Signed-off-by: NColin Ian King <colin.king@canonical.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0d135e4f
  2. 18 8月, 2016 8 次提交
    • T
      kcm: Use stream parser · 9b73896a
      Tom Herbert 提交于
      Adapt KCM to use the stream parser. This mostly involves removing
      the RX handling and setting up the strparser using the interface.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9b73896a
    • T
      strparser: Stream parser for messages · 43a0c675
      Tom Herbert 提交于
      This patch introduces a utility for parsing application layer protocol
      messages in a TCP stream. This is a generalization of the mechanism
      implemented of Kernel Connection Multiplexor.
      
      The API includes a context structure, a set of callbacks, utility
      functions, and a data ready function.
      
      A stream parser instance is defined by a strparse structure that
      is bound to a TCP socket. The function to initialize the structure
      is:
      
      int strp_init(struct strparser *strp, struct sock *csk,
                    struct strp_callbacks *cb);
      
      csk is the TCP socket being bound to and cb are the parser callbacks.
      
      The upper layer calls strp_tcp_data_ready when data is ready on the lower
      socket for strparser to process. This should be called from a data_ready
      callback that is set on the socket:
      
      void strp_tcp_data_ready(struct strparser *strp);
      
      A parser is bound to a TCP socket by setting data_ready function to
      strp_tcp_data_ready so that all receive indications on the socket
      go through the parser. This is assumes that sk_user_data is set to
      the strparser structure.
      
      There are four callbacks.
       - parse_msg is called to parse the message (returns length or error).
       - rcv_msg is called when a complete message has been received
       - read_sock_done is called when data_ready function exits
       - abort_parser is called to abort the parser
      
      The input to parse_msg is an skbuff which contains next message under
      construction. The backend processing of parse_msg will parse the
      application layer protocol headers to determine the length of
      the message in the stream. The possible return values are:
      
         >0 : indicates length of successfully parsed message
         0  : indicates more data must be received to parse the message
         -ESTRPIPE : current message should not be processed by the
            kernel, return control of the socket to userspace which
            can proceed to read the messages itself
         other < 0 : Error is parsing, give control back to userspace
            assuming that synchronzation is lost and the stream
            is unrecoverable (application expected to close TCP socket)
      
      In the case of error return (< 0) strparse will stop the parser
      and report and error to userspace. The application must deal
      with the error. To handle the error the strparser is unbound
      from the TCP socket. If the error indicates that the stream
      TCP socket is at recoverable point (ESTRPIPE) then the application
      can read the TCP socket to process the stream. Once the application
      has dealt with the exceptions in the stream, it may again bind the
      socket to a strparser to continue data operations.
      
      Note that ENODATA may be returned to the application. In this case
      parse_msg returned -ESTRPIPE, however strparser was unable to maintain
      synchronization of the stream (i.e. some of the message in question
      was already read by the parser).
      
      strp_pause and strp_unpause are used to provide flow control. For
      instance, if rcv_msg is called but the upper layer can't immediately
      consume the message it can hold the message and pause strparser.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      43a0c675
    • T
      net: ipconfig: Fix more use after free · d2d371ae
      Thierry Reding 提交于
      While commit 9c706a49 ("net: ipconfig: fix use after free") avoids
      the use after free, the resulting code still ends up calling both the
      ic_setup_if() and ic_setup_routes() after calling ic_close_devs(), and
      access to the device is still required.
      
      Move the call to ic_close_devs() to the very end of the function.
      Signed-off-by: NThierry Reding <treding@nvidia.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d2d371ae
    • R
      net_sched: allow flushing tc police actions · b5ac8518
      Roman Mashak 提交于
      The act_police uses its own code to walk the
      action hashtable, which leads to that we could
      not flush standalone tc police actions, so just
      switch to tcf_generic_walker() like other actions.
      
      (Joint work from Roman and Cong.)
      Signed-off-by: NRoman Mashak <mrv@mojatatu.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b5ac8518
    • W
      net_sched: unify the init logic for act_police · 0852e455
      WANG Cong 提交于
      Jamal reported a crash when we create a police action
      with a specific index, this is because the init logic
      is not correct, we should always create one for this
      case. Just unify the logic with other tc actions.
      
      Fixes: a03e6fe5 ("act_police: fix a crash during removal")
      Reported-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0852e455
    • W
      net_sched: convert tcf_exts from list to pointer array · 22dc13c8
      WANG Cong 提交于
      As pointed out by Jamal, an action could be shared by
      multiple filters, so we can't use list to chain them
      any more after we get rid of the original tc_action.
      Instead, we could just save pointers to these actions
      in tcf_exts, since they are refcount'ed, so convert
      the list to an array of pointers.
      
      The "ugly" part is the action API still accepts list
      as a parameter, I just introduce a helper function to
      convert the array of pointers to a list, instead of
      relying on the C99 feature to iterate the array.
      
      Fixes: a85a970a ("net_sched: move tc_action into tcf_common")
      Reported-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      22dc13c8
    • W
      net_sched: remove an unnecessary list_del() · 824a7e88
      WANG Cong 提交于
      This list_del() for tc action is not needed actually,
      because we only use this list to chain bulk operations,
      therefore should not be carried for latter operations.
      
      Fixes: ec0595cc ("net_sched: get rid of struct tcf_common")
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      824a7e88
    • W
      net_sched: remove the leftover cleanup_a() · f07fed82
      WANG Cong 提交于
      After refactoring tc_action into tcf_common, we no
      longer need to cleanup temporary "actions" in list,
      they are permanently stored in the hashtable.
      
      Fixes: a85a970a ("net_sched: move tc_action into tcf_common")
      Reported-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f07fed82
  3. 16 8月, 2016 4 次提交
    • V
      tipc: fix NULL pointer dereference in shutdown() · d2fbdf76
      Vegard Nossum 提交于
      tipc_msg_create() can return a NULL skb and if so, we shouldn't try to
      call tipc_node_xmit_skb() on it.
      
          general protection fault: 0000 [#1] PREEMPT SMP KASAN
          CPU: 3 PID: 30298 Comm: trinity-c0 Not tainted 4.7.0-rc7+ #19
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
          task: ffff8800baf09980 ti: ffff8800595b8000 task.ti: ffff8800595b8000
          RIP: 0010:[<ffffffff830bb46b>]  [<ffffffff830bb46b>] tipc_node_xmit_skb+0x6b/0x140
          RSP: 0018:ffff8800595bfce8  EFLAGS: 00010246
          RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000003023b0e0
          RDX: 0000000000000000 RSI: dffffc0000000000 RDI: ffffffff83d12580
          RBP: ffff8800595bfd78 R08: ffffed000b2b7f32 R09: 0000000000000000
          R10: fffffbfff0759725 R11: 0000000000000000 R12: 1ffff1000b2b7f9f
          R13: ffff8800595bfd58 R14: ffffffff83d12580 R15: dffffc0000000000
          FS:  00007fcdde242700(0000) GS:ffff88011af80000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 00007fcddde1db10 CR3: 000000006874b000 CR4: 00000000000006e0
          DR0: 00007fcdde248000 DR1: 00007fcddd73d000 DR2: 00007fcdde248000
          DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000090602
          Stack:
           0000000000000018 0000000000000018 0000000041b58ab3 ffffffff83954208
           ffffffff830bb400 ffff8800595bfd30 ffffffff8309d767 0000000000000018
           0000000000000018 ffff8800595bfd78 ffffffff8309da1a 00000000810ee611
          Call Trace:
           [<ffffffff830c84a3>] tipc_shutdown+0x553/0x880
           [<ffffffff825b4a3b>] SyS_shutdown+0x14b/0x170
           [<ffffffff8100334c>] do_syscall_64+0x19c/0x410
           [<ffffffff83295ca5>] entry_SYSCALL64_slow_path+0x25/0x25
          Code: 90 00 b4 0b 83 c7 00 f1 f1 f1 f1 4c 8d 6d e0 c7 40 04 00 00 00 f4 c7 40 08 f3 f3 f3 f3 48 89 d8 48 c1 e8 03 c7 45 b4 00 00 00 00 <80> 3c 30 00 75 78 48 8d 7b 08 49 8d 75 c0 48 b8 00 00 00 00 00
          RIP  [<ffffffff830bb46b>] tipc_node_xmit_skb+0x6b/0x140
           RSP <ffff8800595bfce8>
          ---[ end trace 57b0484e351e71f1 ]---
      
      I feel like we should maybe return -ENOMEM or -ENOBUFS, but I'm not sure
      userspace is equipped to handle that. Anyway, this is better than a GPF
      and looks somewhat consistent with other tipc_msg_create() callers.
      Signed-off-by: NVegard Nossum <vegard.nossum@oracle.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d2fbdf76
    • O
      switchdev: Put export declaration in the right place · 2eb03e6c
      Or Gerlitz 提交于
      Move exporting of switchdev_port_same_parent_id to be right
      below it and not elsewhere.
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Reported-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2eb03e6c
    • S
      gre: set inner_protocol on xmit · 3d7b3320
      Simon Horman 提交于
      Ensure that the inner_protocol is set on transmit so that GSO segmentation,
      which relies on that field, works correctly.
      
      This is achieved by setting the inner_protocol in gre_build_header rather
      than each caller of that function. It ensures that the inner_protocol is
      set when gre_fb_xmit() is used to transmit GRE which was not previously the
      case.
      
      I have observed this is not the case when OvS transmits GRE using
      lwtunnel metadata (which it always does).
      
      Fixes: 38720352 ("gre: Use inner_proto to obtain inner header protocol")
      Cc: Pravin Shelar <pshelar@ovn.org>
      Acked-by: NAlexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: NSimon Horman <simon.horman@netronome.com>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3d7b3320
    • L
      net: ipv6: Fix ping to link-local addresses. · 5e457896
      Lorenzo Colitti 提交于
      ping_v6_sendmsg does not set flowi6_oif in response to
      sin6_scope_id or sk_bound_dev_if, so it is not possible to use
      these APIs to ping an IPv6 address on a different interface.
      Instead, it sets flowi6_iif, which is incorrect but harmless.
      
      Stop setting flowi6_iif, and support various ways of setting oif
      in the same priority order used by udpv6_sendmsg.
      
      Tested: https://android-review.googlesource.com/#/c/254470/Signed-off-by: NLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5e457896
  4. 15 8月, 2016 3 次提交
  5. 14 8月, 2016 6 次提交
    • S
      net: remove type_check from dev_get_nest_level() · 952fcfd0
      Sabrina Dubroca 提交于
      The idea for type_check in dev_get_nest_level() was to count the number
      of nested devices of the same type (currently, only macvlan or vlan
      devices).
      This prevented the false positive lockdep warning on configurations such
      as:
      
      eth0 <--- macvlan0 <--- vlan0 <--- macvlan1
      
      However, this doesn't prevent a warning on a configuration such as:
      
      eth0 <--- macvlan0 <--- vlan0
      eth1 <--- vlan1 <--- macvlan1
      
      In this case, all the locks end up with a nesting subclass of 1, so
      lockdep thinks that there is still a deadlock:
      
      - in the first case we have (macvlan_netdev_addr_lock_key, 1) and then
        take (vlan_netdev_xmit_lock_key, 1)
      - in the second case, we have (vlan_netdev_xmit_lock_key, 1) and then
        take (macvlan_netdev_addr_lock_key, 1)
      
      By removing the linktype check in dev_get_nest_level() and always
      incrementing the nesting depth, lockdep considers this configuration
      valid.
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      952fcfd0
    • M
      net: ipv6: Do not keep IPv6 addresses when IPv6 is disabled · bc561632
      Mike Manning 提交于
      If IPv6 is disabled when the option is set to keep IPv6
      addresses on link down, userspace is unaware of this as
      there is no such indication via netlink. The solution is to
      remove the IPv6 addresses in this case, which results in
      netlink messages indicating removal of addresses in the
      usual manner. This fix also makes the behavior consistent
      with the case of having IPv6 disabled first, which stops
      IPv6 addresses from being added.
      
      Fixes: f1705ec1 ("net: ipv6: Make address flushing on ifdown optional")
      Signed-off-by: NMike Manning <mmanning@brocade.com>
      Acked-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bc561632
    • V
      net/sctp: always initialise sctp_ht_iter::start_fail · 54236ab0
      Vegard Nossum 提交于
      sctp_transport_seq_start() does not currently clear iter->start_fail on
      success, but relies on it being zero when it is allocated (by
      seq_open_net()).
      
      This can be a problem in the following sequence:
      
          open() // allocates iter (and implicitly sets iter->start_fail = 0)
          read()
           - iter->start() // fails and sets iter->start_fail = 1
           - iter->stop() // doesn't call sctp_transport_walk_stop() (correct)
          read() again
           - iter->start() // succeeds, but doesn't change iter->start_fail
           - iter->stop() // doesn't call sctp_transport_walk_stop() (wrong)
      
      We should initialize sctp_ht_iter::start_fail to zero if ->start()
      succeeds, otherwise it's possible that we leave an old value of 1 there,
      which will cause ->stop() to not call sctp_transport_walk_stop(), which
      causes all sorts of problems like not calling rcu_read_unlock() (and
      preempt_enable()), eventually leading to more warnings like this:
      
          BUG: sleeping function called from invalid context at mm/slab.h:388
          in_atomic(): 0, irqs_disabled(): 0, pid: 16551, name: trinity-c2
          Preemption disabled at:[<ffffffff819bceb6>] rhashtable_walk_start+0x46/0x150
      
           [<ffffffff81149abb>] preempt_count_add+0x1fb/0x280
           [<ffffffff83295892>] _raw_spin_lock+0x12/0x40
           [<ffffffff819bceb6>] rhashtable_walk_start+0x46/0x150
           [<ffffffff82ec665f>] sctp_transport_walk_start+0x2f/0x60
           [<ffffffff82edda1d>] sctp_transport_seq_start+0x4d/0x150
           [<ffffffff81439e50>] traverse+0x170/0x850
           [<ffffffff8143aeec>] seq_read+0x7cc/0x1180
           [<ffffffff814f996c>] proc_reg_read+0xbc/0x180
           [<ffffffff813d0384>] do_loop_readv_writev+0x134/0x210
           [<ffffffff813d2a95>] do_readv_writev+0x565/0x660
           [<ffffffff813d6857>] vfs_readv+0x67/0xa0
           [<ffffffff813d6c16>] do_preadv+0x126/0x170
           [<ffffffff813d710c>] SyS_preadv+0xc/0x10
           [<ffffffff8100334c>] do_syscall_64+0x19c/0x410
           [<ffffffff83296225>] return_from_SYSCALL_64+0x0/0x6a
           [<ffffffffffffffff>] 0xffffffffffffffff
      
      Notice that this is a subtly different stacktrace from the one in commit
      5fc382d8 ("net/sctp: terminate rhashtable walk correctly").
      
      Cc: Xin Long <lucien.xin@gmail.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: NVegard Nossum <vegard.nossum@oracle.com>
      Acked-By: NNeil Horman <nhorman@tuxdriver.com>
      Acked-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      54236ab0
    • V
      net/irda: handle iriap_register_lsap() allocation failure · 5ba092ef
      Vegard Nossum 提交于
      If iriap_register_lsap() fails to allocate memory, self->lsap is
      set to NULL. However, none of the callers handle the failure and
      irlmp_connect_request() will happily dereference it:
      
          iriap_register_lsap: Unable to allocated LSAP!
          ================================================================================
          UBSAN: Undefined behaviour in net/irda/irlmp.c:378:2
          member access within null pointer of type 'struct lsap_cb'
          CPU: 1 PID: 15403 Comm: trinity-c0 Not tainted 4.8.0-rc1+ #81
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org
          04/01/2014
           0000000000000000 ffff88010c7e78a8 ffffffff82344f40 0000000041b58ab3
           ffffffff84f98000 ffffffff82344e94 ffff88010c7e78d0 ffff88010c7e7880
           ffff88010630ad00 ffffffff84a5fae0 ffffffff84d3f5c0 000000000000017a
          Call Trace:
           [<ffffffff82344f40>] dump_stack+0xac/0xfc
           [<ffffffff8242f5a8>] ubsan_epilogue+0xd/0x8a
           [<ffffffff824302bf>] __ubsan_handle_type_mismatch+0x157/0x411
           [<ffffffff83b7bdbc>] irlmp_connect_request+0x7ac/0x970
           [<ffffffff83b77cc0>] iriap_connect_request+0xa0/0x160
           [<ffffffff83b77f48>] state_s_disconnect+0x88/0xd0
           [<ffffffff83b78904>] iriap_do_client_event+0x94/0x120
           [<ffffffff83b77710>] iriap_getvaluebyclass_request+0x3e0/0x6d0
           [<ffffffff83ba6ebb>] irda_find_lsap_sel+0x1eb/0x630
           [<ffffffff83ba90c8>] irda_connect+0x828/0x12d0
           [<ffffffff833c0dfb>] SYSC_connect+0x22b/0x340
           [<ffffffff833c7e09>] SyS_connect+0x9/0x10
           [<ffffffff81007bd3>] do_syscall_64+0x1b3/0x4b0
           [<ffffffff845f946a>] entry_SYSCALL64_slow_path+0x25/0x25
          ================================================================================
      
      The bug seems to have been around since forever.
      
      There's more problems with missing error checks in iriap_init() (and
      indeed all of irda_init()), but that's a bigger problem that needs
      very careful review and testing. This patch will fix the most serious
      bug (as it's easily reached from unprivileged userspace).
      
      I have tested my patch with a reproducer.
      Signed-off-by: NVegard Nossum <vegard.nossum@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5ba092ef
    • D
      bpf: fix write helpers with regards to non-linear parts · 0ed661d5
      Daniel Borkmann 提交于
      Fix the bpf_try_make_writable() helper and all call sites we have in BPF,
      it's currently defect with regards to skbs when the write_len spans into
      non-linear parts, no matter if cloned or not.
      
      There are multiple issues at once. First, using skb_store_bits() is not
      correct since even if we have a cloned skb, page frags can still be shared.
      To really make them private, we need to pull them in via __pskb_pull_tail()
      first, which also gets us a private head via pskb_expand_head() implicitly.
      
      This is for helpers like bpf_skb_store_bytes(), bpf_l3_csum_replace(),
      bpf_l4_csum_replace(). Really, the only thing reasonable and working here
      is to call skb_ensure_writable() before any write operation. Meaning, via
      pskb_may_pull() it makes sure that parts we want to access are pulled in and
      if not does so plus unclones the skb implicitly. If our write_len still fits
      the headlen and we're cloned and our header of the clone is not writable,
      then we need to make a private copy via pskb_expand_head(). skb_store_bits()
      is a bit misleading and only safe to store into non-linear data in different
      contexts such as 357b40a1 ("[IPV6]: IPV6_CHECKSUM socket option can
      corrupt kernel memory").
      
      For above BPF helper functions, it means after fixed bpf_try_make_writable(),
      we've pulled in enough, so that we operate always based on skb->data. Thus,
      the call to skb_header_pointer() and skb_store_bits() becomes superfluous.
      In bpf_skb_store_bytes(), the len check is unnecessary too since it can
      only pass in maximum of BPF stack size, so adding offset is guaranteed to
      never overflow. Also bpf_l3/4_csum_replace() helpers must test for proper
      offset alignment since they use __sum16 pointer for writing resulting csum.
      
      The remaining helpers that change skb data not discussed here yet are
      bpf_skb_vlan_push(), bpf_skb_vlan_pop() and bpf_skb_change_proto(). The
      vlan helpers internally call either skb_ensure_writable() (pop case) and
      skb_cow_head() (push case, for head expansion), respectively. Similarly,
      bpf_skb_proto_xlat() takes care to not mangle page frags.
      
      Fixes: 608cd71a ("tc: bpf: generalize pedit action")
      Fixes: 91bc4822 ("tc: bpf: add checksum helpers")
      Fixes: 3697649f ("bpf: try harder on clones when writing into skb")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0ed661d5
    • C
      calipso: fix resource leak on calipso_genopt failure · b4c0e0c6
      Colin Ian King 提交于
      Currently, if calipso_genopt fails then the error exit path
      does not free the ipv6_opt_hdr new causing a memory leak. Fix
      this by kfree'ing new on the error exit path.
      Signed-off-by: NColin Ian King <colin.king@canonical.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b4c0e0c6
  6. 13 8月, 2016 2 次提交
  7. 12 8月, 2016 3 次提交
  8. 11 8月, 2016 1 次提交
    • D
      cfg80211: always notify userspace when wireless netdev is removed · 7f8ed01e
      Denis Kenzior 提交于
      This change alters the semantics of NL80211_CMD_DEL_INTERFACE events
      by always sending this event whenever a net_device object associated
      with a wdev is destroyed.  Prior to this change, this event was only
      emitted as a result of NL80211_CMD_DEL_INTERFACE command sent from
      userspace.  This allows userspace to reliably detect when wireless
      interfaces have been removed, e.g. due to USB removal events, etc.
      
      For wireless device objects without an associated net_device (e.g.
      NL80211_IFTYPE_P2P_DEVICE), the NL80211_CMD_DEL_INTERFACE event is
      now generated inside cfg80211_unregister_wdev.
      Signed-off-by: NDenis Kenzior <denkenz@gmail.com>
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      7f8ed01e