1. 07 Aug, 2015 1 commit
    • P
      netfilter: nf_tables: add nft_dup expression · d877f071
      Pablo Neira Ayuso committed
      This new expression uses the nf_dup engine to clone packets to a given gateway.
      Unlike xt_TEE, we use an index to indicate output interface which should be
      fine at this stage.
      
      Moreover, change to the preemption-safe this_cpu_read(nf_skb_duplicated) from
      nf_dup_ipv{4,6} to silence a lockdep splat.
      
      Based on the original tee expression from Arturo Borrero Gonzalez, although
      this patch has diverted quite a bit from this initial effort due to the
      change to support maps.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  2. 04 Aug, 2015 1 commit
  3. 03 Aug, 2015 1 commit
    • D
      ebpf: add skb->hash to offset map for usage in {cls, act}_bpf or filters · ba7591d8
      Daniel Borkmann committed
      Add skb->hash to the __sk_buff offset map, so it can be accessed from
      an eBPF program. We already do this for classic BPF filters, but not
      yet for eBPF. It might be useful as a demuxer in combination with
      helpers like bpf_clone_redirect(). Toy example:
      
        __section("cls-lb") int ingress_main(struct __sk_buff *skb)
        {
          unsigned int which = 3 + (skb->hash & 7);
          /* bpf_skb_store_bytes(skb, ...); */
          /* bpf_l{3,4}_csum_replace(skb, ...); */
          bpf_clone_redirect(skb, which, 0);
          return -1;
        }
      
      I was thinking whether to add skb_get_hash(), but then concluded the
      raw skb->hash seems fine in this case: we can directly access the hash
      w/o extra eBPF helper function call, it's filled out by many NICs on
      ingress, and in case the entropy level would not be sufficient, people
      can still implement their own specific sw fallback hash mix anyway.
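
      The interface selection in the toy example can be modelled in plain
      userspace C. This is only a sketch: pick_ifindex() and its base_ifindex
      parameter are hypothetical names, not kernel or eBPF helpers. It shows
      that the low three bits of the hash spread flows over eight consecutive
      ifindexes (3 through 10 when the base is 3, as in the example).

```c
#include <stdint.h>

/*
 * Userspace model of "3 + (skb->hash & 7)" from the toy example.
 * The low three bits of the flow hash select one of eight
 * consecutive interface indexes starting at base_ifindex.
 * (Hypothetical helper, for illustration only.)
 */
static unsigned int pick_ifindex(uint32_t hash, unsigned int base_ifindex)
{
	return base_ifindex + (hash & 7);
}
```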
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 01 Aug, 2015 3 commits
  5. 31 Jul, 2015 1 commit
    • H
      net/ipv6: add sysctl option accept_ra_min_hop_limit · 8013d1d7
      Hangbin Liu committed
      Commit 6fd99094 ("ipv6: Don't reduce hop limit for an interface")
      stopped accepting a hop limit from an RA when it is smaller than the
      current hop limit, for security reasons. But this behavior deviates from
      the RFC definition.
      
      RFC 4861, 6.3.4.  Processing Received Router Advertisements
         A Router Advertisement field (e.g., Cur Hop Limit, Reachable Time,
         and Retrans Timer) may contain a value denoting that it is
         unspecified.  In such cases, the parameter should be ignored and the
         host should continue using whatever value it is already using.
      
         If the received Cur Hop Limit value is non-zero, the host SHOULD set
         its CurHopLimit variable to the received value.
      
      So add the sysctl option accept_ra_min_hop_limit to let users choose the
      minimum hop limit value they accept from an RA, and set the default to 1
      to meet the RFC.
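
      The intended policy can be sketched as a small pure function. This is
      not the kernel code; apply_ra_hop_limit() and its parameters are
      hypothetical names, and the sketch assumes that a received value of 0
      means "unspecified" (per the RFC text quoted above) and that values below
      the configured minimum are ignored.

```c
/*
 * Returns the hop limit a host would use after processing a Router
 * Advertisement, under the assumptions stated above.
 *   cur           - hop limit currently in use
 *   ra_hop_limit  - Cur Hop Limit field from the RA (0 = unspecified)
 *   min_hop_limit - value of the accept_ra_min_hop_limit sysctl
 * (Hypothetical helper, for illustration only.)
 */
static int apply_ra_hop_limit(int cur, int ra_hop_limit, int min_hop_limit)
{
	if (ra_hop_limit == 0)
		return cur;		/* unspecified: keep current value */
	if (ra_hop_limit < min_hop_limit)
		return cur;		/* below the accepted minimum: ignore */
	return ra_hop_limit;
}
```

      With the default minimum of 1, every non-zero advertised value is
      accepted; raising the minimum restores the stricter behavior.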
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Acked-by: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 30 Jul, 2015 1 commit
    • M
      netfilter: nf_ct_sctp: minimal multihoming support · d7ee3519
      Michal Kubeček committed
      Currently nf_conntrack_proto_sctp module handles only packets between
      primary addresses used to establish the connection. Any packets between
      secondary addresses are classified as invalid so that usual firewall
      configurations drop them. Allowing HEARTBEAT and HEARTBEAT-ACK chunks to
      establish a new conntrack would allow traffic between secondary
      addresses to pass through. A more sophisticated solution based on the
      addresses advertised in the initial handshake (and possibly also later
      dynamic address addition and removal) would be much harder to implement.
      Moreover, in general we cannot assume to always see the initial
      handshake as it can be routed through a different path.
      
      The patch adds two new conntrack states:
      
        SCTP_CONNTRACK_HEARTBEAT_SENT  - a HEARTBEAT chunk seen but not acked
        SCTP_CONNTRACK_HEARTBEAT_ACKED - a HEARTBEAT acked by HEARTBEAT-ACK
      
      State transition rules:
      
      - HEARTBEAT_SENT responds to usual chunks the same way as NONE (so that
        the behaviour changes as little as possible)
      - HEARTBEAT_ACKED responds to usual chunks the same way as ESTABLISHED
        does, except the resulting state is HEARTBEAT_ACKED rather than
        ESTABLISHED
      - previously existing states except NONE are preserved when HEARTBEAT or
        HEARTBEAT-ACK is seen
      - NONE (in the initial direction) changes to HEARTBEAT_SENT on HEARTBEAT
        and to CLOSED on HEARTBEAT-ACK
      - HEARTBEAT_SENT changes to HEARTBEAT_ACKED on HEARTBEAT-ACK in the
        reply direction
      - HEARTBEAT_SENT and HEARTBEAT_ACKED are preserved on HEARTBEAT and
        HEARTBEAT-ACK otherwise
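
      The HEARTBEAT-related rules above can be sketched as a transition
      function. This is a simplified model, not the conntrack table from the
      patch: the enum and function names are hypothetical, only a few states
      are listed, and it assumes NONE is left unchanged in the reply direction
      (the rules above only describe the initial direction).

```c
/* Hypothetical, reduced set of states for illustration. */
enum ct_state {
	ST_NONE,
	ST_CLOSED,
	ST_ESTABLISHED,
	ST_HEARTBEAT_SENT,
	ST_HEARTBEAT_ACKED
};

enum hb_chunk { CH_HEARTBEAT, CH_HEARTBEAT_ACK };
enum ct_dir { DIR_ORIGINAL, DIR_REPLY };

/* Next state when a HEARTBEAT or HEARTBEAT-ACK chunk is seen. */
static enum ct_state hb_transition(enum ct_state cur, enum hb_chunk ch,
				   enum ct_dir dir)
{
	/* NONE in the initial direction: HEARTBEAT opens, HEARTBEAT-ACK closes. */
	if (cur == ST_NONE && dir == DIR_ORIGINAL)
		return ch == CH_HEARTBEAT ? ST_HEARTBEAT_SENT : ST_CLOSED;

	/* A HEARTBEAT-ACK in the reply direction acknowledges the HEARTBEAT. */
	if (cur == ST_HEARTBEAT_SENT && ch == CH_HEARTBEAT_ACK &&
	    dir == DIR_REPLY)
		return ST_HEARTBEAT_ACKED;

	/* Every other combination preserves the current state. */
	return cur;
}
```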
      
      Normally, vtag is set from the INIT chunk for the reply direction and
      from the INIT-ACK chunk for the originating direction (i.e. each of
      these defines vtag value for the opposite direction). For secondary
      conntracks, we can't rely on seeing INIT/INIT-ACK and even if we have
      seen them, we would need to connect two different conntracks. Therefore
      simplified logic is applied: vtag of first packet in each direction
      (HEARTBEAT in the originating and HEARTBEAT-ACK in reply direction) is
      saved and all following packets in that direction are compared with this
      saved value. While INIT and INIT-ACK define vtag for the opposite
      direction, vtags extracted from HEARTBEAT and HEARTBEAT-ACK are always
      for their direction.
      
      Default timeout values for new states are
      
        HEARTBEAT_SENT: 30 seconds (default hb_interval)
        HEARTBEAT_ACKED: 210 seconds (hb_interval * path_max_retry + max_rto)
      
      (We cannot expect to see the shutdown sequence so that, unlike
      ESTABLISHED, the HEARTBEAT_ACKED timeout shouldn't be too long.)
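
      The 210 second figure can be reproduced from the formula above. Only
      hb_interval (30 s) and the result are given in the text; the sketch
      assumes path_max_retry = 5 and max_rto = 60 s, which are the usual SCTP
      protocol defaults (RFC 4960 Path.Max.Retrans and RTO.Max).

```c
/* hb_interval is from the text above; the other two values are assumed. */
enum {
	HB_INTERVAL_SEC = 30,	/* default hb_interval */
	PATH_MAX_RETRY  = 5,	/* assumed: RFC 4960 Path.Max.Retrans */
	MAX_RTO_SEC     = 60	/* assumed: RFC 4960 RTO.Max */
};

/* hb_interval * path_max_retry + max_rto */
static int heartbeat_acked_timeout_sec(void)
{
	return HB_INTERVAL_SEC * PATH_MAX_RETRY + MAX_RTO_SEC;
}
```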
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  7. 27 Jul, 2015 1 commit
  8. 23 Jul, 2015 1 commit
  9. 22 Jul, 2015 8 commits
  10. 21 Jul, 2015 2 commits
    • A
      bpf: introduce bpf_skb_vlan_push/pop() helpers · 4e10df9a
      Alexei Starovoitov committed
      Allow eBPF programs attached to TC qdiscs to call skb_vlan_push/pop via
      helper functions. These functions may change skb->data/hlen which are
      cached by some JITs to improve performance of ld_abs/ld_ind instructions.
      Therefore JITs need to recognize bpf_skb_vlan_push/pop() calls,
      re-compute header len and re-cache skb->data/hlen back into cpu registers.
      Note, skb->data/hlen are not directly accessible from the programs,
      so any changes to skb->data done either by these helpers or by other
      TC actions are safe.
      
      eBPF JIT is supported by three architectures:
      - arm64 JIT is using bpf_load_pointer() without caching, so it's ok as-is.
      - x64 JIT re-caches skb->data/hlen unconditionally after vlan_push/pop calls
        (experiments showed that conditional re-caching is slower).
      - s390 JIT falls back to interpreter for now when bpf_skb_vlan_push() is present
        in the program (re-caching is tbd).
      
      These helpers allow more scalable handling of vlan from the programs.
      Instead of creating thousands of vlan netdevs on top of eth0 and attaching
      TC+ingress+bpf to all of them, the program can be attached to eth0 directly
      and manipulate vlans as necessary.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      ebpf: add helper to retrieve net_cls's classid cookie · 8d20aabe
      Daniel Borkmann committed
      It would be very useful to retrieve the net_cls's classid from an eBPF
      program to allow for a more fine-grained classification. It could be
      used directly or in conjunction with additional policies. For example,
      Docker, but also tooling such as cgexec, can easily run applications via
      net_cls cgroups:
      
        cgcreate -g net_cls:/foo
        echo 42 > foo/net_cls.classid
        cgexec -g net_cls:foo <prog>
      
      Thus, the respective classid cookie of foo can then be looked up on
      the egress path to apply further policies. The helper is designed such
      that a non-zero value returns the cgroup id.
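
      As a hedged illustration of how a program might interpret such a cookie:
      the net_cls cgroup documentation describes the classid as a 32-bit value
      of the form 0xAAAABBBB, where AAAA is the major and BBBB the minor handle
      number. The helper names below are hypothetical.

```c
#include <stdint.h>

/* Split a net_cls classid (0xAAAABBBB) into its major and minor parts. */
static uint16_t classid_major(uint32_t classid)
{
	return (uint16_t)(classid >> 16);
}

static uint16_t classid_minor(uint32_t classid)
{
	return (uint16_t)(classid & 0xffff);
}
```

      For the value 42 written in the example above, the major part is 0 and
      the minor part is 42 (0x2a).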
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Thomas Graf <tgraf@suug.ch>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 18 Jul, 2015 1 commit
  12. 16 Jul, 2015 1 commit
  13. 14 Jul, 2015 1 commit
  14. 09 Jul, 2015 1 commit
  15. 07 Jul, 2015 2 commits
  16. 01 Jul, 2015 2 commits
  17. 30 Jun, 2015 1 commit
  18. 26 Jun, 2015 1 commit
  19. 25 Jun, 2015 5 commits
    • D
      libnvdimm: pmem label sets and namespace instantiation. · bf9bccc1
      Dan Williams committed
      A complete label set is a PMEM-label per-dimm per-interleave-set where
      all the UUIDs match and the interleave set cookie matches the hosting
      interleave set.
      
      Present sysfs attributes for manipulation of a PMEM-namespace's
      'alt_name', 'uuid', and 'size' attributes.  A later patch will make
      these settings persistent by writing back the label.
      
      Note that PMEM allocations grow forwards from the start of an interleave
      set (lowest dimm-physical-address (DPA)).  BLK-namespaces that alias
      with a PMEM interleave set will grow allocations backward from the
      highest DPA.
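
      The two growth directions can be pictured with a toy allocator over a
      single free DPA range. This is only a sketch of the idea, not the
      libnvdimm resource-tree code; struct dpa_space and both functions are
      hypothetical.

```c
#include <stdint.h>

/* Free dimm-physical-address range [lo, hi). Hypothetical model. */
struct dpa_space {
	uint64_t lo;
	uint64_t hi;
};

/* PMEM: allocate forwards from the lowest free DPA. */
static uint64_t alloc_pmem(struct dpa_space *s, uint64_t size)
{
	uint64_t start;

	if (size > s->hi - s->lo)
		return UINT64_MAX;	/* not enough free space */
	start = s->lo;
	s->lo += size;
	return start;
}

/* BLK: allocate backwards from the highest free DPA. */
static uint64_t alloc_blk(struct dpa_space *s, uint64_t size)
{
	if (size > s->hi - s->lo)
		return UINT64_MAX;	/* not enough free space */
	s->hi -= size;
	return s->hi;
}
```

      Growing from opposite ends lets both kinds of allocation share the
      aliased capacity until they meet in the middle.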
      
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Neil Brown <neilb@suse.de>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • D
      libnvdimm: namespace indices: read and validate · 4a826c83
      Dan Williams committed
      The on-media label format [1] consists of two index blocks followed by
      an array of labels.  None of these structures are ever updated in place.
      A sequence number tracks the current active index and the next one to
      write, while labels are written to free slots.
      
          +------------+
          |            |
          |  nsindex0  |
          |            |
          +------------+
          |            |
          |  nsindex1  |
          |            |
          +------------+
          |   label0   |
          +------------+
          |   label1   |
          +------------+
          |            |
           ....nslot...
          |            |
          +------------+
          |   labelN   |
          +------------+
      
      After reading valid labels, store the dpa ranges they claim into
      per-dimm resource trees.
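
      How a sequence number can pick the active index block is sketched below.
      The details are assumptions rather than something stated in this commit
      message: the sketch assumes a 2-bit sequence that cycles 1, 2, 3, 1 with
      0 meaning invalid, and it resolves ties in favour of block 0 purely for
      simplicity. The function names are hypothetical.

```c
/* Next value in the assumed 1 -> 2 -> 3 -> 1 cycle; 0 stays invalid. */
static unsigned int seq_next(unsigned int seq)
{
	static const unsigned int next[] = { 0, 2, 3, 1 };

	return next[seq & 3];
}

/*
 * Returns 0 or 1 for the index block to treat as current,
 * or -1 if neither block carries a valid sequence number.
 */
static int current_index(unsigned int seq0, unsigned int seq1)
{
	seq0 &= 3;
	seq1 &= 3;

	if (!seq0 && !seq1)
		return -1;
	if (!seq0)
		return 1;
	if (!seq1)
		return 0;
	/* Block 1 is newer only if its sequence directly follows block 0's. */
	return seq1 == seq_next(seq0) ? 1 : 0;
}
```

      Because neither block is updated in place, a writer would fill the
      non-current block with seq_next() of the current sequence and thereby
      make it the active one.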
      
      [1]: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
      
      Cc: Neil Brown <neilb@suse.de>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • D
      libnvdimm: support for legacy (non-aliasing) nvdimms · 3d88002e
      Dan Williams committed
      The libnvdimm region driver is an intermediary driver that translates
      non-volatile "region"s into "namespace" sub-devices that are surfaced by
      persistent memory block-device drivers (PMEM and BLK).
      
      ACPI 6 introduces the concept that a given nvdimm may simultaneously
      offer multiple access modes to its media through direct PMEM load/store
      access, or windowed BLK mode.  Existing nvdimms mostly implement a PMEM
      interface, some offer a BLK-like mode, but never both as ACPI 6 defines.
      If an nvdimm is single interfaced, then there is no need for dimm
      metadata labels.  For these devices we can take the region boundaries
      directly to create a child namespace device (nd_namespace_io).
      Acked-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Toshi Kani <toshi.kani@hp.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • D
      libnvdimm, nvdimm: dimm driver and base libnvdimm device-driver infrastructure · 4d88a97a
      Dan Williams committed
      * Implement the device-model infrastructure for loading modules and
        attaching drivers to nvdimm devices.  This is a simple association of a
        nd-device-type number with a driver that has a bitmask of supported
        device types.  To facilitate userspace bind/unbind operations 'modalias'
        and 'devtype', that also appear in the uevent, are added as generic
        sysfs attributes for all nvdimm devices.  The reason for the device-type
        number is to support sub-types within a given parent devtype, be it a
        vendor-specific sub-type or otherwise.
      
      * The first consumer of this infrastructure is the driver
        for dimm devices.  It simply uses control messages to retrieve and
        store the configuration-data image (label set) from each dimm.
      
      Note: nd_device_register() arranges for asynchronous registration of
            nvdimm bus devices by default.
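
      The "device-type number matched against a driver bitmask" association
      described above can be sketched as follows. The struct and function are
      hypothetical stand-ins, not the libnvdimm definitions.

```c
#include <limits.h>

/* Hypothetical driver descriptor: one bit per supported device type. */
struct nd_driver_sketch {
	unsigned long type_mask;
};

/* Returns 1 if the driver's bitmask includes the given device-type number. */
static int driver_supports(const struct nd_driver_sketch *drv,
			   unsigned int devtype)
{
	if (devtype >= sizeof(drv->type_mask) * CHAR_BIT)
		return 0;	/* type number does not fit in the mask */
	return (int)((drv->type_mask >> devtype) & 1UL);
}
```

      A driver that handles device types 1 and 3 would set type_mask to
      (1UL << 1) | (1UL << 3).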
      
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Neil Brown <neilb@suse.de>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Toshi Kani <toshi.kani@hp.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • D
      libnvdimm: control (ioctl) messages for nvdimm_bus and nvdimm devices · 62232e45
      Dan Williams committed
      Most discovery/configuration of the nvdimm-subsystem is done via sysfs
      attributes.  However, some nvdimm_bus instances, particularly the
      ACPI.NFIT bus, define a small set of messages that can be passed to the
      platform.  For convenience we derive the initial libnvdimm-ioctl command
      formats directly from the NFIT DSM Interface Example formats.
      
          ND_CMD_SMART: media health and diagnostics
          ND_CMD_GET_CONFIG_SIZE: size of the label space
          ND_CMD_GET_CONFIG_DATA: read label space
          ND_CMD_SET_CONFIG_DATA: write label space
          ND_CMD_VENDOR: vendor-specific command passthrough
          ND_CMD_ARS_CAP: report address-range-scrubbing capabilities
          ND_CMD_ARS_START: initiate scrubbing
          ND_CMD_ARS_STATUS: report on scrubbing state
          ND_CMD_SMART_THRESHOLD: configure alarm thresholds for smart events
      
      If a platform later defines different commands than this set it is
      straightforward to extend support to those formats.
      
      Most of the commands target a specific dimm.  However, the
      address-range-scrubbing commands target the bus.  The 'commands'
      attribute in sysfs of an nvdimm_bus, or nvdimm, enumerates the supported
      commands for that object.
      
      Cc: <linux-acpi@vger.kernel.org>
      Cc: Robert Moore <robert.moore@intel.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Reported-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  20. 24 Jun, 2015 3 commits
    • P
      net: inet_diag: export IPV6_V6ONLY sockopt · 20462155
      Phil Sutter committed
      For AF_INET6 sockets, the value of struct ipv6_pinfo.ipv6only is
      exported to userspace. It indicates whether a socket bound to in6addr_any
      listens on IPv4 as well as IPv6. Since the socket is natively IPv6, it is not
      listed by e.g. 'ss -l -4'.
      
      This patch is accompanied by an appropriate one for iproute2 to enable
      the additional information in 'ss -e'.
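
      For context, the option being exported is the one an application sets
      with the standard sockets API. The sketch below sets IPV6_V6ONLY on a
      fresh AF_INET6 socket and reads it back; v6only_roundtrip() is a
      hypothetical helper name, and the inet_diag side of the patch is not
      shown.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Set IPV6_V6ONLY on a new AF_INET6 TCP socket and read the value back.
 * Returns the value reported by getsockopt(), or -1 on any error
 * (for example when IPv6 is unavailable on the host).
 */
static int v6only_roundtrip(int enable)
{
	int val = enable ? 1 : 0;
	socklen_t len = sizeof(val);
	int fd = socket(AF_INET6, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &val, sizeof(val)) < 0 ||
	    getsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &val, &len) < 0)
		val = -1;
	close(fd);
	return val;
}
```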
      Signed-off-by: Phil Sutter <phil@nwl.cc>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      net: ipv4 sysctl option to ignore routes when nexthop link is down · 0eeb075f
      Andy Gospodarek committed
      This feature is only enabled with the new per-interface or ipv4 global
      sysctls called 'ignore_routes_with_linkdown'.
      
      net.ipv4.conf.all.ignore_routes_with_linkdown = 0
      net.ipv4.conf.default.ignore_routes_with_linkdown = 0
      net.ipv4.conf.lo.ignore_routes_with_linkdown = 0
      ...
      
      When the above sysctls are set, the kernel will report to userspace that a route is
      dead and will no longer resolve to this nexthop when performing a fib
      lookup.  This will signal to userspace that the route will not be
      selected.  The signalling of a RTNH_F_DEAD is only passed to userspace
      if the sysctl is enabled and link is down.  This was done as without it
      the netlink listeners would have no idea whether or not a nexthop would
      be selected.   The kernel only sets RTNH_F_DEAD internally if the
      interface has IFF_UP cleared.
      
      With the new sysctl set, the following behavior can be observed
      (interface p8p1 is link-down):
      
      default via 10.0.5.2 dev p9p1
      10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
      70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
      80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 dead linkdown
      90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 dead linkdown
      90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
      90.0.0.1 via 70.0.0.2 dev p7p1  src 70.0.0.1
          cache
      local 80.0.0.1 dev lo  src 80.0.0.1
          cache <local>
      80.0.0.2 via 10.0.5.2 dev p9p1  src 10.0.5.15
          cache
      
      While the route does remain in the table (so it can be modified if
      needed rather than being wiped away as it would be if IFF_UP was
      cleared), the proper next-hop is chosen automatically when the link is
      down.  Now interface p8p1 is linked-up:
      
      default via 10.0.5.2 dev p9p1
      10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
      70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
      80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1
      90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1
      90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
      192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2
      90.0.0.1 via 80.0.0.2 dev p8p1  src 80.0.0.1
          cache
      local 80.0.0.1 dev lo  src 80.0.0.1
          cache <local>
      80.0.0.2 dev p8p1  src 80.0.0.1
          cache
      
      and the output changes to what one would expect.
      
      If the sysctl is not set, the following output would be expected when
      p8p1 is down:
      
      default via 10.0.5.2 dev p9p1
      10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
      70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
      80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 linkdown
      90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 linkdown
      90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
      
      Since the dead flag does not appear, there should be no expectation that
      the kernel would skip using this route due to link being down.
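
      The effect on next-hop choice for the two 90.0.0.0/24 routes above can
      be modelled in a few lines. This is a userspace sketch, not the fib
      lookup code: the struct, function, and SK_-prefixed constants are
      hypothetical, with the flag values chosen to mirror RTNH_F_DEAD (1) and
      RTNH_F_LINKDOWN (16) from the rtnetlink UAPI header.

```c
#include <stddef.h>

#define SK_RTNH_F_DEAD     1	/* mirrors RTNH_F_DEAD */
#define SK_RTNH_F_LINKDOWN 16	/* mirrors RTNH_F_LINKDOWN */

/* Hypothetical, simplified next-hop record. */
struct nexthop_sketch {
	const char *dev;
	unsigned int metric;
	unsigned int flags;
};

/*
 * Pick the usable next hop with the lowest metric. Link-down next hops
 * are skipped only when ignore_linkdown (the sysctl) is non-zero.
 */
static const struct nexthop_sketch *
select_nexthop(const struct nexthop_sketch *nh, size_t n, int ignore_linkdown)
{
	const struct nexthop_sketch *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (nh[i].flags & SK_RTNH_F_DEAD)
			continue;
		if (ignore_linkdown && (nh[i].flags & SK_RTNH_F_LINKDOWN))
			continue;
		if (!best || nh[i].metric < best->metric)
			best = &nh[i];
	}
	return best;
}
```

      With p8p1 (metric 1) link-down and p7p1 (metric 2) up, the sketch
      returns p7p1 when the sysctl is set and p8p1 when it is not, matching
      the route listings above.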
      
      v2: Split kernel changes into 2 patches, this actually makes a
      behavioral change if the sysctl is set.  Also took suggestion from Alex
      to simplify code by only checking sysctl during fib lookup and
      suggestion from Scott to add a per-interface sysctl.
      
      v3: Code clean-ups to make it more readable and efficient as well as a
      reverse path check fix.
      
      v4: Drop binary sysctl
      
      v5: Whitespace fixups from Dave
      
      v6: Style changes from Dave and checkpatch suggestions
      
      v7: One more checkpatch fixup
      Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
      Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
      Acked-by: Scott Feldman <sfeldma@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      net: track link-status of ipv4 nexthops · 8a3d0316
      Andy Gospodarek committed
      Add a fib flag called RTNH_F_LINKDOWN to any ipv4 nexthops that are
      reachable via an interface where carrier is off.  No action is taken,
      but additional flags are passed to userspace to indicate carrier status.
      
      This also includes a cleanup to fib_disable_ip to more clearly indicate
      what event made the function call to replace the more cryptic force
      option previously used.
      
      v2: Split out kernel functionality into 2 patches, this patch simply
      sets and clears new nexthop flag RTNH_F_LINKDOWN.
      
      v3: Cleanups suggested by Alex as well as a bug noticed in
      fib_sync_down_dev and fib_sync_up when multipath was not enabled.
      
      v5: Whitespace and variable declaration fixups suggested by Dave.
      
      v6: Style fixups noticed by Dave; ran checkpatch to be sure I got them
      all.
      Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
      Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
      Acked-by: Scott Feldman <sfeldma@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  21. 23 Jun, 2015 2 commits