1. 29 9月, 2017 2 次提交
    • M
      bpf: Add name, load_time, uid and map_ids to bpf_prog_info · cb4d2b3f
      Martin KaFai Lau 提交于
      The patch adds name and load_time to struct bpf_prog_aux.  They
      are also exported to bpf_prog_info.
      
      The bpf_prog's name is passed by userspace during BPF_PROG_LOAD.
      The kernel only stores the first (BPF_PROG_NAME_LEN - 1) bytes
      and the name stored in the kernel is always \0 terminated.
      
      The kernel will reject name that contains characters other than
      isalnum() and '_'.  It will also reject name that is not null
      terminated.
      
      The existing 'user->uid' of the bpf_prog_aux is also exported to
      the bpf_prog_info as created_by_uid.
      
      The existing 'used_maps' of the bpf_prog_aux is exported to
      the newly added members 'nr_map_ids' and 'map_ids' of
      the bpf_prog_info.  On the input, nr_map_ids tells how
      big the userspace's map_ids buffer is.  On the output,
      nr_map_ids tells the exact user_map_cnt and it will only
      copy up to the userspace's map_ids buffer is allowed.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@fb.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cb4d2b3f
    • N
      net: bridge: add per-port group_fwd_mask with less restrictions · 5af48b59
      Nikolay Aleksandrov 提交于
      We need to be able to transparently forward most link-local frames via
      tunnels (e.g. vxlan, qinq). Currently the bridge's group_fwd_mask has a
      mask which restricts the forwarding of STP and LACP, but we need to be able
      to forward these over tunnels and control that forwarding on a per-port
      basis thus add a new per-port group_fwd_mask option which only disallows
      mac pause frames to be forwarded (they're always dropped anyway).
      The patch does not change the current default situation - all of the others
      are still restricted unless configured for forwarding.
      We have successfully tested this patch with LACP and STP forwarding over
      VxLAN and qinq tunnels.
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5af48b59
  2. 27 9月, 2017 1 次提交
    • D
      bpf: add meta pointer for direct access · de8f3a83
      Daniel Borkmann 提交于
      This work enables generic transfer of metadata from XDP into skb. The
      basic idea is that we can make use of the fact that the resulting skb
      must be linear and already comes with a larger headroom for supporting
      bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
      on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
      for adjusting a new pointer called xdp->data_meta. Thus, the packet has
      a flexible and programmable room for meta data, followed by the actual
      packet data. struct xdp_buff is therefore laid out that we first point
      to data_hard_start, then data_meta directly prepended to data followed
      by data_end marking the end of packet. bpf_xdp_adjust_head() takes into
      account whether we have meta data already prepended and if so, memmove()s
      this along with the given offset provided there's enough room.
      
      xdp->data_meta is optional and programs are not required to use it. The
      rationale is that when we process the packet in XDP (e.g. as DoS filter),
      we can push further meta data along with it for the XDP_PASS case, and
      give the guarantee that a clsact ingress BPF program on the same device
      can pick this up for further post-processing. Since we work with skb
      there, we can also set skb->mark, skb->priority or other skb meta data
      out of BPF, thus having this scratch space generic and programmable
      allows for more flexibility than defining a direct 1:1 transfer of
      potentially new XDP members into skb (it's also more efficient as we
      don't need to initialize/handle each of such new members). The facility
      also works together with GRO aggregation. The scratch space at the head
      of the packet can be multiple of 4 byte up to 32 byte large. Drivers not
      yet supporting xdp->data_meta can simply be set up with xdp->data_meta
      as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out,
      such that the subsequent match against xdp->data for later access is
      guaranteed to fail.
      
      The verifier treats xdp->data_meta/xdp->data the same way as we treat
      xdp->data/xdp->data_end pointer comparisons. The requirement for doing
      the compare against xdp->data is that it hasn't been modified from it's
      original address we got from ctx access. It may have a range marking
      already from prior successful xdp->data/xdp->data_end pointer comparisons
      though.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      de8f3a83
  3. 26 9月, 2017 2 次提交
    • P
      tun: enable napi_gro_frags() for TUN/TAP driver · 90e33d45
      Petar Penkov 提交于
      Add a TUN/TAP receive mode that exercises the napi_gro_frags()
      interface. This mode is available only in TAP mode, as the interface
      expects packets with Ethernet headers.
      
      Furthermore, packets follow the layout of the iovec_iter that was
      received. The first iovec is the linear data, and every one after the
      first is a fragment. If there are more fragments than the max number,
      drop the packet. Additionally, invoke eth_get_headlen() to exercise flow
      dissector code and to verify that the header resides in the linear data.
      
      The napi_gro_frags() mode requires setting the IFF_NAPI_FRAGS option.
      This is imposed because this mode is intended for testing via tools like
      syzkaller and packetdrill, and the increased flexibility it provides can
      introduce security vulnerabilities. This flag is accepted only if the
      device is in TAP mode and has the IFF_NAPI flag set as well. This is
      done because both of these are explicit requirements for correct
      operation in this mode.
      Signed-off-by: NPetar Penkov <peterpenkov96@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: davem@davemloft.net
      Cc: ppenkov@stanford.edu
      Acked-by: NMahesh Bandewar <maheshb@google,com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      90e33d45
    • P
      tun: enable NAPI for TUN/TAP driver · 94317099
      Petar Penkov 提交于
      Changes TUN driver to use napi_gro_receive() upon receiving packets
      rather than netif_rx_ni(). Adds flag IFF_NAPI that enables these
      changes and operation is not affected if the flag is disabled.  SKBs
      are constructed upon packet arrival and are queued to be processed
      later.
      
      The new path was evaluated with a benchmark with the following setup:
      Open two tap devices and a receiver thread that reads in a loop for
      each device. Start one sender thread and pin all threads to different
      CPUs. Send 1M minimum UDP packets to each device and measure sending
      time for each of the sending methods:
      	napi_gro_receive():	4.90s
      	netif_rx_ni():		4.90s
      	netif_receive_skb():	7.20s
      Signed-off-by: NPetar Penkov <peterpenkov96@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: davem@davemloft.net
      Cc: ppenkov@stanford.edu
      Acked-by: NMahesh Bandewar <maheshb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      94317099
  4. 22 9月, 2017 1 次提交
    • F
      net: ethtool: Add back transceiver type · 19cab887
      Florian Fainelli 提交于
      Commit 3f1ac7a7 ("net: ethtool: add new ETHTOOL_xLINKSETTINGS API")
      deprecated the ethtool_cmd::transceiver field, which was fine in
      premise, except that the PHY library was actually using it to report the
      type of transceiver: internal or external.
      
      Use the first word of the reserved field to put this __u8 transceiver
      field back in. It is made read-only, and we don't expect the
      ETHTOOL_xLINKSETTINGS API to be doing anything with this anyway, so this
      is mostly for the legacy path where we do:
      
      ethtool_get_settings()
      -> dev->ethtool_ops->get_link_ksettings()
         -> convert_link_ksettings_to_legacy_settings()
      
      to have no information loss compared to the legacy get_settings API.
      
      Fixes: 3f1ac7a7 ("net: ethtool: add new ETHTOOL_xLINKSETTINGS API")
      Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      19cab887
  5. 09 9月, 2017 4 次提交
  6. 07 9月, 2017 6 次提交
  7. 05 9月, 2017 16 次提交
  8. 04 9月, 2017 3 次提交
    • P
      netlink: add NLM_F_NONREC flag for deletion requests · 2335ba70
      Pablo Neira Ayuso 提交于
      In the last NFWS in Faro, Portugal, we discussed that netlink is lacking
      the semantics to request non recursive deletions, ie. do not delete an
      object iff it has child objects that hang from this parent object that
      the user requests to be deleted.
      
      We need this new flag to solve a problem for the iptables-compat
      backward compatibility utility, that runs iptables commands using the
      existing nf_tables netlink interface. Specifically, custom chains in
      iptables cannot be deleted if there are rules in it, however, nf_tables
      allows to remove any chain that is populated with content. To sort out
      this asymmetry, iptables-compat userspace sets this new NLM_F_NONREC
      flag to obtain the same semantics that iptables provides.
      
      This new flag should only be used for deletion requests. Note this new
      flag value overlaps with the existing:
      
      * NLM_F_ROOT for get requests.
      * NLM_F_REPLACE for new requests.
      
      However, those flags should not ever be used in deletion requests.
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      2335ba70
    • P
      netfilter: nft_limit: add stateful object type · a6912055
      Pablo M. Bermudo Garay 提交于
      Register a new limit stateful object type into the stateful object
      infrastructure.
      Signed-off-by: NPablo M. Bermudo Garay <pablombg@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      a6912055
    • V
      netfilter: xt_hashlimit: add rate match mode · bea74641
      Vishwanath Pai 提交于
      This patch adds a new feature to hashlimit that allows matching on the
      current packet/byte rate without rate limiting. This can be enabled
      with a new flag --hashlimit-rate-match. The match returns true if the
      current rate of packets is above/below the user specified value.
      
      The main difference between the existing algorithm and the new one is
      that the existing algorithm rate-limits the flow whereas the new
      algorithm does not. Instead it *classifies* the flow based on whether
      it is above or below a certain rate. I will demonstrate this with an
      example below. Let us assume this rule:
      
      iptables -A INPUT -m hashlimit --hashlimit-above 10/s -j new_chain
      
      If the packet rate is 15/s, the existing algorithm would ACCEPT 10
      packets every second and send 5 packets to "new_chain".
      
      But with the new algorithm, as long as the rate of 15/s is sustained,
      all packets will continue to match and every packet is sent to new_chain.
      
      This new functionality will let us classify different flows based on
      their current rate, so that further decisions can be made on them based on
      what the current rate is.
      
      This is how the new algorithm works:
      We divide time into intervals of 1 (sec/min/hour) as specified by
      the user. We keep track of the number of packets/bytes processed in the
      current interval. After each interval we reset the counter to 0.
      
      When we receive a packet for match, we look at the packet rate
      during the current interval and the previous interval to make a
      decision:
      
      if [ prev_rate < user and cur_rate < user ]
              return Below
      else
              return Above
      
      Where cur_rate is the number of packets/bytes seen in the current
      interval, prev is the number of packets/bytes seen in the previous
      interval and 'user' is the rate specified by the user.
      
      We also provide flexibility to the user for choosing the time
      interval using the option --hashilmit-interval. For example the user can
      keep a low rate like x/hour but still keep the interval as small as 1
      second.
      
      To preserve backwards compatibility we have to add this feature in a new
      revision, so I've created revision 3 for hashlimit. The two new options
      we add are:
      
      --hashlimit-rate-match
      --hashlimit-rate-interval
      
      I have updated the help text to add these new options. Also added a few
      tests for the new options.
      Suggested-by: NIgor Lubashev <ilubashe@akamai.com>
      Reviewed-by: NJosh Hunt <johunt@akamai.com>
      Signed-off-by: NVishwanath Pai <vpai@akamai.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      bea74641
  9. 02 9月, 2017 3 次提交
    • I
      tcp_diag: report TCP MD5 signing keys and addresses · c03fa9bc
      Ivan Delalande 提交于
      Report TCP MD5 (RFC2385) signing keys, addresses and address prefixes to
      processes with CAP_NET_ADMIN requesting INET_DIAG_INFO. Currently it is
      not possible to retrieve these from the kernel once they have been
      configured on sockets.
      Signed-off-by: NIvan Delalande <colona@arista.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c03fa9bc
    • D
      fsmap: fix documentation of FMR_OF_LAST · d897246d
      Darrick J. Wong 提交于
      The FMR_OF_LAST flag is set on the last fsmap record being returned for
      the dataset requested, contrary to what the header file says.  Fix the
      docs to reflect the behavior of all fsmap implementations.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      d897246d
    • S
      Introduce v3 namespaced file capabilities · 8db6c34f
      Serge E. Hallyn 提交于
      Root in a non-initial user ns cannot be trusted to write a traditional
      security.capability xattr.  If it were allowed to do so, then any
      unprivileged user on the host could map his own uid to root in a private
      namespace, write the xattr, and execute the file with privilege on the
      host.
      
      However supporting file capabilities in a user namespace is very
      desirable.  Not doing so means that any programs designed to run with
      limited privilege must continue to support other methods of gaining and
      dropping privilege.  For instance a program installer must detect
      whether file capabilities can be assigned, and assign them if so but set
      setuid-root otherwise.  The program in turn must know how to drop
      partial capabilities, and do so only if setuid-root.
      
      This patch introduces v3 of the security.capability xattr.  It builds a
      vfs_ns_cap_data struct by appending a uid_t rootid to struct
      vfs_cap_data.  This is the absolute uid_t (that is, the uid_t in user
      namespace which mounted the filesystem, usually init_user_ns) of the
      root id in whose namespaces the file capabilities may take effect.
      
      When a task asks to write a v2 security.capability xattr, if it is
      privileged with respect to the userns which mounted the filesystem, then
      nothing should change.  Otherwise, the kernel will transparently rewrite
      the xattr as a v3 with the appropriate rootid.  This is done during the
      execution of setxattr() to catch user-space-initiated capability writes.
      Subsequently, any task executing the file which has the noted kuid as
      its root uid, or which is in a descendent user_ns of such a user_ns,
      will run the file with capabilities.
      
      Similarly when asking to read file capabilities, a v3 capability will
      be presented as v2 if it applies to the caller's namespace.
      
      If a task writes a v3 security.capability, then it can provide a uid for
      the xattr so long as the uid is valid in its own user namespace, and it
      is privileged with CAP_SETFCAP over its namespace.  The kernel will
      translate that rootid to an absolute uid, and write that to disk.  After
      this, a task in the writer's namespace will not be able to use those
      capabilities (unless rootid was 0), but a task in a namespace where the
      given uid is root will.
      
      Only a single security.capability xattr may exist at a time for a given
      file.  A task may overwrite an existing xattr so long as it is
      privileged over the inode.  Note this is a departure from previous
      semantics, which required privilege to remove a security.capability
      xattr.  This check can be re-added if deemed useful.
      
      This allows a simple setxattr to work, allows tar/untar to work, and
      allows us to tar in one namespace and untar in another while preserving
      the capability, without risking leaking privilege into a parent
      namespace.
      
      Example using tar:
      
       $ cp /bin/sleep sleepx
       $ mkdir b1 b2
       $ lxc-usernsexec -m b:0:100000:1 -m b:1:$(id -u):1 -- chown 0:0 b1
       $ lxc-usernsexec -m b:0:100001:1 -m b:1:$(id -u):1 -- chown 0:0 b2
       $ lxc-usernsexec -m b:0:100000:1000 -- tar --xattrs-include=security.capability --xattrs -cf b1/sleepx.tar sleepx
       $ lxc-usernsexec -m b:0:100001:1000 -- tar --xattrs-include=security.capability --xattrs -C b2 -xf b1/sleepx.tar
       $ lxc-usernsexec -m b:0:100001:1000 -- getcap b2/sleepx
         b2/sleepx = cap_sys_admin+ep
       # /opt/ltp/testcases/bin/getv3xattr b2/sleepx
         v3 xattr, rootid is 100001
      
      A patch to linux-test-project adding a new set of tests for this
      functionality is in the nsfscaps branch at github.com/hallyn/ltp
      
      Changelog:
         Nov 02 2016: fix invalid check at refuse_fcap_overwrite()
         Nov 07 2016: convert rootid from and to fs user_ns
         (From ebiederm: mar 28 2017)
           commoncap.c: fix typos - s/v4/v3
           get_vfs_caps_from_disk: clarify the fs_ns root access check
           nsfscaps: change the code split for cap_inode_setxattr()
         Apr 09 2017:
             don't return v3 cap for caps owned by current root.
            return a v2 cap for a true v2 cap in non-init ns
         Apr 18 2017:
            . Change the flow of fscap writing to support s_user_ns writing.
            . Remove refuse_fcap_overwrite().  The value of the previous
              xattr doesn't matter.
         Apr 24 2017:
            . incorporate Eric's incremental diff
            . move cap_convert_nscap to setxattr and simplify its usage
         May 8, 2017:
            . fix leaking dentry refcount in cap_inode_getsecurity
      Signed-off-by: NSerge Hallyn <serge@hallyn.com>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      8db6c34f
  10. 01 9月, 2017 2 次提交