1. 15 Oct, 2017 1 commit
  2. 14 Oct, 2017 1 commit
    • A
      i40e/i40evf: don't trust VF to reset itself · 17a9422d
      Alan Brady committed
      When using 'ethtool -L' on a VF to change number of requested queues
      from PF, we shouldn't trust the VF to reset itself after making the
      request.  Doing it that way opens the door for a potentially malicious
      VF to do nasty things to the PF which should never be the case.
      
      This makes it such that after VF makes a successful request, PF will
      then reset the VF to institute required changes.  Only if the request
      fails will PF send a message back to VF letting it know the request was
      unsuccessful.
      
      Testing-hints:
      There should be no real functional changes.  This is simply hardening
      against a potentially malicious VF.
      Signed-off-by: Alan Brady <alan.brady@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
  3. 13 Oct, 2017 1 commit
  4. 11 Oct, 2017 3 commits
  5. 10 Oct, 2017 5 commits
  6. 09 Oct, 2017 2 commits
    • S
      netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1' · 98589a09
      Shmulik Ladkani committed
      Commit 2c16d603 ("netfilter: xt_bpf: support ebpf") introduced
      support for attaching an eBPF object by an fd, with the
      'bpf_mt_check_v1' ABI expecting the '.fd' to be specified upon each
      IPT_SO_SET_REPLACE call.
      
      However this breaks subsequent iptables calls:
      
       # iptables -A INPUT -m bpf --object-pinned /sys/fs/bpf/xxx -j ACCEPT
       # iptables -A INPUT -s 5.6.7.8 -j ACCEPT
       iptables: Invalid argument. Run `dmesg' for more information.
      
      That's because iptables works by loading existing rules using
      IPT_SO_GET_ENTRIES to userspace, then issuing IPT_SO_SET_REPLACE with
      the replacement set.
      
      However, the loaded 'xt_bpf_info_v1' has an arbitrary '.fd' number
      (from the initial "iptables -m bpf" invocation) - so when 2nd invocation
      occurs, userspace passes a bogus fd number, which causes
      'bpf_mt_check_v1' to fail.
      
      One suggested solution [1] was to hack iptables userspace, to perform an
      "entries fixup" immediately after IPT_SO_GET_ENTRIES, by opening a new,
      process-local fd for every 'xt_bpf_info_v1' entry seen.
      
      However, in [2] both Pablo Neira Ayuso and Willem de Bruijn suggested
      deprecating the xt_bpf_info_v1 ABI dealing with pinned ebpf objects.
      
      This fix changes the XT_BPF_MODE_FD_PINNED behavior to ignore the given
      '.fd' and instead perform an in-kernel lookup for the bpf object given
      the provided '.path'.
      
      It also defines an alias for the XT_BPF_MODE_FD_PINNED mode, named
      XT_BPF_MODE_PATH_PINNED, to better reflect the fact that the user is
      expected to provide the path of the pinned object.
      
      Existing XT_BPF_MODE_FD_ELF behavior (non-pinned fd mode) is preserved.
      
      References: [1] https://marc.info/?l=netfilter-devel&m=150564724607440&w=2
                  [2] https://marc.info/?l=netfilter-devel&m=150575727129880&w=2
      Reported-by: Rafael Buchbinder <rafi@rbk.ms>
      Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • R
      bridge: add new BR_NEIGH_SUPPRESS port flag to suppress arp and nd flood · 821f1b21
      Roopa Prabhu committed
      This patch adds a new bridge port flag BR_NEIGH_SUPPRESS to
      suppress ARP and ND flood on bridge ports. It implements
      RFC 7432, section 10
      (https://tools.ietf.org/html/rfc7432#section-10)
      for Ethernet VPN deployments. It is similar to the existing
      BR_PROXYARP* flags but has a few semantic differences to conform
      to the EVPN standard. Unlike the existing flags, this new flag suppresses
      flood of all neighbor discovery packets (ARP and ND) to tunnel ports.
      It supports both VLAN filtering and non-VLAN filtering bridges.
      
      In the case of EVPN, it is mainly used to avoid flooding
      of ARP and ND packets to tunnel ports like vxlan.
      
      This patch adds netlink and sysfs support to set this bridge port
      flag.
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 08 Oct, 2017 3 commits
  8. 07 Oct, 2017 1 commit
  9. 06 Oct, 2017 1 commit
    • E
      tcp: new list for sent but unacked skbs for RACK recovery · e2080072
      Eric Dumazet committed
      This patch adds a new queue (list) that tracks the sent but not yet
      acked or SACKed skbs for a TCP connection. The list is chronologically
      ordered by skb->skb_mstamp (the head is the oldest sent skb).
      
      This list will be used to optimize TCP Rack recovery, which checks
      an skb's timestamp to judge if it has been lost and needs to be
      retransmitted. Since TCP write queue is ordered by sequence instead
      of sent time, RACK has to scan over the write queue to catch all
      eligible packets to detect lost retransmission, and iterates through
      SACKed skbs repeatedly.
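
The benefit of ordering by send time can be sketched in userspace C. This is a simplified model under stated assumptions, not the kernel implementation: `struct pkt`, `mark_lost_by_time()` and the loss rule (a packet counts as lost when it was sent more than `reo_wnd_us` before the most recently delivered packet) are hypothetical stand-ins for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an skb on the sent-but-unacked list. */
struct pkt {
    uint64_t sent_us;   /* send timestamp in microseconds */
    bool lost;          /* set once the packet is judged lost */
    struct pkt *next;   /* next packet in send-time order, oldest first */
};

/*
 * Walk the list from the oldest packet and mark every packet that was sent
 * more than reo_wnd_us before rack_sent_us (the send time of the most
 * recently delivered packet). Because the list is ordered by send time,
 * the walk stops at the first packet that is still too recent.
 * Returns the number of packets visited.
 */
size_t mark_lost_by_time(struct pkt *head, uint64_t rack_sent_us,
                         uint64_t reo_wnd_us)
{
    size_t visited = 0;

    for (struct pkt *p = head; p != NULL; p = p->next) {
        visited++;
        if (p->sent_us + reo_wnd_us >= rack_sent_us)
            break;      /* every later entry was sent even more recently */
        p->lost = true;
    }
    return visited;
}
```

With a queue ordered by sequence number instead, the early `break` would not be safe, because a retransmitted packet with an old sequence number can carry a recent timestamp.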
      
      Special care for rare events:
      1. TCP repair fakes skb transmission so the send queue needs to be adjusted
      2. SACK reneging would require re-inserting SACKed skbs into the
         send queue. For now I believe it's not worth the complexity to
         make RACK work perfectly on SACK reneging, so we do nothing here.
      3. Fast Open: currently for non-TFO, send-queue correctly queues
         the pure SYN packet. For TFO which queues a pure SYN and
         then a data packet, send-queue only queues the data packet but
         not the pure SYN due to the structure of TFO code. This is okay
         because the SYN receiver would never respond with a SACK on a
         missing SYN (i.e. SYN is never fast-retransmitted by SACK/RACK).
      
      In order to not grow sk_buff, we use a union for the new list and
      _skb_refdst/destructor fields. This is a bit complicated because
      we need to make sure _skb_refdst and destructor are properly zeroed
      before skb is cloned/copied at transmit, and before being freed.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 05 Oct, 2017 6 commits
  11. 04 Oct, 2017 13 commits
    • T
      powerpc/watchdog: Make use of watchdog_nmi_probe() · 34ddaa3e
      Thomas Gleixner committed
      The rework of the core hotplug code triggers the WARN_ON in
      start_wd_on_cpu() on powerpc because it is called multiple times for
      the boot CPU.
      
      The first call is via:
      
        start_wd_on_cpu+0x80/0x2f0
        watchdog_nmi_reconfigure+0x124/0x170
        softlockup_reconfigure_threads+0x110/0x130
        lockup_detector_init+0xbc/0xe0
        kernel_init_freeable+0x18c/0x37c
        kernel_init+0x2c/0x160
        ret_from_kernel_thread+0x5c/0xbc
      
      And then again via the CPU hotplug registration:
      
        start_wd_on_cpu+0x80/0x2f0
        cpuhp_invoke_callback+0x194/0x620
        cpuhp_thread_fun+0x7c/0x1b0
        smpboot_thread_fn+0x290/0x2a0
        kthread+0x168/0x1b0
        ret_from_kernel_thread+0x5c/0xbc
      
      This can be avoided by setting up the CPU hotplug state with nocalls and
      moving the initialization to the watchdog_nmi_probe() function. That
      initializes the hotplug callbacks without invoking the callback, and the
      following core initialization function then configures the watchdog for the
      online CPUs (in this case CPU0) via softlockup_reconfigure_threads().
      Reported-and-tested-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: linuxppc-dev@lists.ozlabs.org
    • T
      watchdog/core, powerpc: Replace watchdog_nmi_reconfigure() · 6b9dc480
      Thomas Gleixner committed
      The recent cleanup of the watchdog code split watchdog_nmi_reconfigure()
      into two stages. One to stop the NMI and one to restart it after
      reconfiguration. That was done by adding a boolean 'run' argument to the
      code, which is functionally correct but not necessarily a piece of art.
      
      Replace it by two explicit functions: watchdog_nmi_stop() and
      watchdog_nmi_start().
      
      Fixes: 6592ad2f ("watchdog/core, powerpc: Make watchdog_nmi_reconfigure() two stage")
      Requested-by: Linus 'Nursing his pet-peeve' Torvalds <torvalds@linuxfoundation.org>
      Signed-off-by: Thomas 'Mopping up garbage' Gleixner <tglx@linutronix.de>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1710021957480.2114@nanos
    • L
      mmc: Delete bounce buffer handling · de3ee99b
      Linus Walleij committed
      In May, Steven sent a patch deleting the bounce buffer handling
      and the CONFIG_MMC_BLOCK_BOUNCE option.
      
      I chose the less invasive path of making it a runtime config
      option, and we merged that successfully for kernel v4.12.
      
      The code is however just standing in the way and taking up
      space for seemingly no gain on any systems in wide use today.
      
      Pierre says the code was there to improve speed on TI SDHCI
      controllers on certain HP laptops and possibly some Ricoh
      controllers as well. Early SDHCI controllers lacked the
      scatter-gather feature, which made software bounce buffers
      a significant speed boost.
      
      We are clearly talking about the list of SDHCI PCI-based
      MMC/SD card readers found in the pci_ids[] list in
      drivers/mmc/host/sdhci-pci-core.c.
      
      The TI SDHCI derivative is not supported by the upstream
      kernel. This leaves the Ricoh.
      
      What we can however notice is that the x86 defconfigs in the
      kernel did not enable CONFIG_MMC_BLOCK_BOUNCE option, which
      means that any such laptop would have to have a custom
      configured kernel to actually take advantage of this
      bounce buffer speed-up. It simply seems like there was
      a speed optimization for the Ricoh controllers that no one
      was using. (I have not checked the distro defconfigs but
      I am pretty sure the situation is the same there.)
      
      Bounce buffers increased performance on the OMAP HSMMC
      at one point, and were part of the original submission in
      commit a45c6cb8 ("[ARM] 5369/1: omap mmc: Add new
         omap hsmmc controller for 2430 and 34xx, v3")
      
      This optimization was removed in
      commit 0ccd76d4 ("omap_hsmmc: Implement scatter-gather
         emulation")
      which found that scatter-gather emulation provided even
      better performance.
      
      The same was introduced for SDHCI in
      commit 2134a922 ("sdhci: scatter-gather (ADMA) support")
      
      I am pretty positively convinced that software
      scatter-gather emulation will do for any host controller what
      the bounce buffers were doing. Essentially, the bounce buffer
      was a reimplementation of software scatter-gather-emulation in
      the MMC subsystem, and it should be done away with.
      
      Cc: Pierre Ossman <pierre@ossman.eu>
      Cc: Juha Yrjola <juha.yrjola@solidboot.com>
      Cc: Steven J. Hill <Steven.Hill@cavium.com>
      Cc: Shawn Lin <shawn.lin@rock-chips.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Suggested-by: Steven J. Hill <Steven.Hill@cavium.com>
      Suggested-by: Shawn Lin <shawn.lin@rock-chips.com>
      Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
    • M
      include/linux/fs.h: fix comment about struct address_space · 32e57c29
      Mike Rapoport committed
      Before commit 9c5d760b ("mm: split gfp_mask and mapping flags into
      separate fields") the private_* fields of struct address_space were
      grouped together and using "ditto" in comments describing the last
      fields was correct.
      
      With the introduction of gfp_mask between private_lock and private_list,
      "ditto" references the wrong description.
      
      Fix it by using the elaborate description.
      
      Link: http://lkml.kernel.org/r/1507009987-8746-1-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Y
      mm/memory_hotplug: change pfn_to_section_nr/section_nr_to_pfn macro to inline function · 1dd2bfc8
      YASUAKI ISHIMATSU committed
      pfn_to_section_nr() and section_nr_to_pfn() are defined as macros.
      pfn_to_section_nr() has no issue even if it is defined as a macro.  But
      section_nr_to_pfn() has an overflow issue if sec is defined as int.
      
      section_nr_to_pfn() just shifts sec by PFN_SECTION_SHIFT.  If sec is
      defined as unsigned long, section_nr_to_pfn() returns pfn as 64 bit value.
      But if sec is defined as int, section_nr_to_pfn() returns pfn as 32 bit
      value.
      
      __remove_section() calculates start_pfn using section_nr_to_pfn() and
      scn_nr defined as int.  So if the hot-removed memory address is above
      16TB, an overflow occurs and section_nr_to_pfn() does not calculate the
      correct pfn.
      
      To make callers use proper arg, the patch changes the macros to inline
      functions.
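
The truncation can be reproduced in a small userspace sketch. Several things here are assumptions made for illustration: `PFN_SECTION_SHIFT` is taken to be 15 (128 MiB sections with 4 KiB pages), the helper names `pfn_via_macro()` and `pfn_via_inline()` are hypothetical, and the section number is an unsigned 32-bit value rather than `int` so that the wrap-around is well defined in C instead of being signed-overflow undefined behaviour.

```c
#include <stdint.h>

/* Assumed value: 128 MiB sections with 4 KiB pages. */
#define PFN_SECTION_SHIFT 15

/* Macro form: the shift is evaluated in the type of the argument. */
#define SECTION_NR_TO_PFN_MACRO(sec) ((sec) << PFN_SECTION_SHIFT)

/* Inline-function form: the argument is widened to 64 bits before the shift. */
static inline uint64_t section_nr_to_pfn_inline(uint64_t sec)
{
    return sec << PFN_SECTION_SHIFT;
}

/* A 32-bit section number passed through the macro loses the high bits. */
uint64_t pfn_via_macro(uint32_t sec)
{
    return SECTION_NR_TO_PFN_MACRO(sec);
}

/* The same section number passed through the function keeps them. */
uint64_t pfn_via_inline(uint32_t sec)
{
    return section_nr_to_pfn_inline(sec);
}
```

Under these assumptions, section number 2^17 starts exactly at 16 TiB: the macro path shifts within 32 bits and yields pfn 0, while the function path yields pfn 2^32.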
      
      Fixes: 815121d2 ("memory_hotplug: clear zone when removing the memory")
      Link: http://lkml.kernel.org/r/e643a387-e573-6bbf-d418-c60c8ee3d15e@gmail.com
      Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • O
      exec: load_script: kill the onstack interp[BINPRM_BUF_SIZE] array · c2315c18
      Oleg Nesterov committed
      Patch series "exec: binfmt_misc: fix use-after-free, kill
      iname[BINPRM_BUF_SIZE]".
      
      It looks like this code was always wrong, then commit 948b701a
      ("binfmt_misc: add persistent opened binary handler for containers")
      added more problems.
      
      This patch (of 6):
      
      load_script() can simply use i_name instead; it points into bprm->buf[]
      and nobody can change this memory until we call prepare_binprm().
      
      The only complication is that we need to also change the signature of
      bprm_change_interp() but this change looks good too.
      
      While at it, do whitespace/style cleanups.
      
      NOTE: the real motivation for this change is that people want to
      increase BINPRM_BUF_SIZE, we need to change load_misc_binary() too but
      this looks more complicated because afaics it is very buggy.
      
      Link: http://lkml.kernel.org/r/20170918163446.GA26793@redhat.com
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Travis Gummels <tgummels@redhat.com>
      Cc: Ben Woodard <woodard@redhat.com>
      Cc: Jim Foraker <foraker1@llnl.gov>
      Cc: <tdhooge@llnl.gov>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • S
      android: binder: drop lru lock in isolate callback · a1b2289c
      Sherry Yang committed
      Drop the global lru lock in isolate callback before calling
      zap_page_range which calls cond_resched, and re-acquire the global lru
      lock before returning.  Also change return code to LRU_REMOVED_RETRY.
      
      Use mmput_async when failing to acquire the mmap sem in an atomic context.
      
      Fix "BUG: sleeping function called from invalid context"
      errors when CONFIG_DEBUG_ATOMIC_SLEEP is enabled.
      
      Also restore mmput_async, which was initially introduced in commit
      ec8d7c14 ("mm, oom_reaper: do not mmput synchronously from the oom
      reaper context"), and was removed in commit 21292580 ("mm: oom: let
      oom_reap_task and exit_mmap run concurrently").
      
      Link: http://lkml.kernel.org/r/20170914182231.90908-1-sherryy@android.com
      Fixes: f2517eb7 ("android: binder: Add global lru shrinker to binder")
      Signed-off-by: Sherry Yang <sherryy@android.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reported-by: Kyle Yan <kyan@codeaurora.org>
      Acked-by: Arve Hjønnevåg <arve@android.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Martijn Coenen <maco@google.com>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Riley Andrews <riandrews@android.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hoeun Ryu <hoeun.ryu@gmail.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm, oom_reaper: skip mm structs with mmu notifiers · 4d4bbd85
      Michal Hocko committed
      Andrea has noticed that the oom_reaper doesn't invalidate the range via
      mmu notifiers (mmu_notifier_invalidate_range_start/end) and that can
      corrupt the memory of the kvm guest for example.
      
      tlb_flush_mmu_tlbonly already invokes mmu notifiers but that is not
      sufficient as per Andrea:
      
       "mmu_notifier_invalidate_range cannot be used in replacement of
        mmu_notifier_invalidate_range_start/end. For KVM
        mmu_notifier_invalidate_range is a noop and rightfully so. A MMU
        notifier implementation has to implement either ->invalidate_range
        method or the invalidate_range_start/end methods, not both. And if you
        implement invalidate_range_start/end like KVM is forced to do, calling
        mmu_notifier_invalidate_range in common code is a noop for KVM.
      
        For those MMU notifiers that can get away only implementing
        ->invalidate_range, the ->invalidate_range is implicitly called by
        mmu_notifier_invalidate_range_end(). And only those secondary MMUs
        that share the same pagetable with the primary MMU (like AMD iommuv2)
        can get away only implementing ->invalidate_range"
      
      As the callback is allowed to sleep and the implementation is out of
      the hands of the MM, it is safer to simply bail out if there is an mmu
      notifier registered.  In order to not fail too early, make the
      mm_has_notifiers check under the oom_lock and have a little nap before
      failing to give the current oom victim some more time to exit.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20170913113427.2291-1-mhocko@kernel.org
      Fixes: aac45363 ("mm, oom: introduce oom reaper")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      include/linux/mm.h: fix typo in VM_MPX definition · fa87b91c
      Kirill A. Shutemov committed
      There's a typo in recent change of VM_MPX definition.  We want it to be
      VM_HIGH_ARCH_4, not VM_HIGH_ARCH_BIT_4.
      
      This bug does cause visible regressions.  In arch_vma_name the vmflags
      are tested against VM_MPX.  With the incorrect value of VM_MPX, a number
      of vmas (such as the stack) test positive and end up being marked as
      "[mpx]" in /proc/N/maps instead of their correct names.
      
      This confuses tools like rr which expect to be able to find familiar
      vmas.
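
Why the stack tests positive can be shown with a short sketch. The flag values below are written from memory of include/linux/mm.h of that era and should be treated as assumptions, as should the helper `has_mpx_flag()` and the sample `STACK_LIKE_FLAGS`, which are hypothetical names used only for illustration. The point is that a bit number (36) used in place of a bit mask (1 << 36) overlaps unrelated low flag bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed flag values, recalled from include/linux/mm.h. */
#define VM_READ            0x00000001ULL
#define VM_WRITE           0x00000002ULL
#define VM_EXEC            0x00000004ULL
#define VM_MAYREAD         0x00000010ULL
#define VM_MAYWRITE        0x00000020ULL

#define VM_HIGH_ARCH_BIT_4 36                            /* a bit number */
#define VM_HIGH_ARCH_4     (1ULL << VM_HIGH_ARCH_BIT_4)  /* a bit mask */

#define VM_MPX_TYPO        VM_HIGH_ARCH_BIT_4  /* 36 == 0x24 == VM_EXEC | VM_MAYWRITE */
#define VM_MPX_FIXED       VM_HIGH_ARCH_4

/* Hypothetical sample of flags a writable, stack-like mapping might carry. */
#define STACK_LIKE_FLAGS   (VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE)

/* Mirrors the kind of test described above: flags checked against VM_MPX. */
bool has_mpx_flag(uint64_t vm_flags, uint64_t vm_mpx)
{
    return (vm_flags & vm_mpx) != 0;
}
```

With the typo, any mapping that has VM_EXEC or VM_MAYWRITE set matches; with the corrected mask, only mappings that carry bit 36 do.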
      
      Fixes: df3735c5 ("x86,mpx: make mpx depend on x86-64 to free up VMA flag")
      Link: http://lkml.kernel.org/r/20170918140253.36856-1-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kyle Huey <me@kylehuey.com>
      Cc: <stable@vger.kernel.org> [4.14+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • F
      net: core: decouple ifalias get/set from rtnl lock · 6c557001
      Florian Westphal committed
      Device alias can be set by either rtnetlink (rtnl is held) or sysfs.
      
      rtnetlink holds the rtnl mutex, sysfs acquires it for this purpose.
      Add an extra mutex for it and use RCU to protect concurrent accesses.
      
      This allows the sysfs path to not take rtnl and would later allow
      not holding it when dumping ifalias.
      
      Based on suggestion from Eric Dumazet.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Y
      ipv4: ipmr: Add the parent ID field to VIF struct · 5d8b3e69
      Yotam Gigi committed
      In order to allow the ipmr module to do partial multicast forwarding
      according to the device parent ID, add the device parent ID field to the
      VIF struct. This way, the forwarding path can use the parent ID field
      without invoking switchdev calls, which requires the RTNL lock.
      
      When a new VIF is added, set the device parent ID field in it by invoking
      the switchdev_port_attr_get call.
      Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Y
      skbuff: Add the offload_mr_fwd_mark field · abf4bb6b
      Yotam Gigi committed
      Similarly to the offload_fwd_mark field, the offload_mr_fwd_mark field is
      used to allow partial offloading of MFC multicast routes.
      
      Switchdev drivers can offload MFC multicast routes to the hardware by
      registering to the FIB notification chain. When one of the route output
      interfaces is not offload-able, i.e. has different parent ID, the route
      cannot be fully offloaded by the hardware. Examples of non-offload-able
      devices are a management NIC, dummy device, pimreg device, etc.
      
      Similar problem exists in the bridge module, as one bridge can hold
      interfaces with different parent IDs. At the bridge, the problem is solved
      by the offload_fwd_mark skb field.
      
      Currently, when a route cannot go through full offload, the only solution
      for a switchdev driver is not to offload it at all and let the packet go
      through slow path.
      
      Using the offload_mr_fwd_mark field, a driver can indicate that a packet
      was already forwarded by hardware to all the devices with the same parent
      ID as the input device. Further patches in this patch-set are going to
      enhance ipmr to skip multicast forwarding to devices with the same parent
      ID if a packet is marked with that field.
      
      The reason why the already existing "offload_fwd_mark" bit cannot be used
      is that a switchdev driver would want to make the distinction between a
      packet that has already gone through L2 forwarding but did not go through
      multicast forwarding, and a packet that has already gone through both L2
      and multicast forwarding.
      
      For example: when a packet is ingressing from a switchport enslaved to a
      bridge, which is configured with multicast forwarding, the following
      scenarios are possible:
       - The packet can be trapped to the CPU due to exception while multicast
         forwarding (for example, MTU error). In that case, it had already gone
         through L2 forwarding in the hardware, thus a switchdev driver would
         want to set the skb->offload_fwd_mark and not the
         skb->offload_mr_fwd_mark.
       - The packet can also be trapped due to a pimreg/dummy device used as one
         of the output interfaces. In that case, it can go through both L2 and
         (partial) multicast forwarding inside the hardware, thus a switchdev
         driver would want to set both the skb->offload_fwd_mark and
         skb->offload_mr_fwd_mark.
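
The two scenarios above can be summarised in a small sketch. It is illustrative only: `struct fwd_marks`, `enum trap_reason` and `marks_for_trap()` are hypothetical names that model the two skb bits as plain booleans, and no real driver code is implied.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two skb mark bits discussed above. */
struct fwd_marks {
    bool offload_fwd_mark;     /* L2 forwarding already done in hardware */
    bool offload_mr_fwd_mark;  /* multicast forwarding already done in hardware */
};

/* Hypothetical reasons for a packet being trapped to the CPU. */
enum trap_reason {
    TRAP_MR_EXCEPTION,         /* e.g. MTU error during multicast forwarding */
    TRAP_NON_OFFLOADABLE_OIF   /* e.g. pimreg/dummy device among the outputs */
};

/* Decide which marks a driver would set for a trapped packet. */
struct fwd_marks marks_for_trap(enum trap_reason reason)
{
    /* In both scenarios the packet has already been L2-forwarded. */
    struct fwd_marks m = { .offload_fwd_mark = true,
                           .offload_mr_fwd_mark = false };

    /* Only the second scenario also went through (partial) multicast forwarding. */
    if (reason == TRAP_NON_OFFLOADABLE_OIF)
        m.offload_mr_fwd_mark = true;
    return m;
}
```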
      Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 03 Oct, 2017 3 commits