1. 10 2月, 2015 1 次提交
    • Y
      IB/mlx4: Reset flow support for IB kernel ULPs · 35f05dab
      Yishai Hadas 提交于
      The driver exposes interfaces that directly relate to HW state. Upon fatal
      error, consumers of these interfaces (ULPs) that rely on completion of
      all their posted work-request could hang, thereby introducing dependencies
      in shutdown order.  To prevent this from happening, we manage the
      relevant resources (CQs, QPs) that are used by the device. Upon a fatal error,
      we now generate simulated completions for outstanding WQEs that were not
      completed at the time the HW was reset.
      
      It includes invoking the completion event handler for all involved CQs so that
      the ULPs will poll those CQs. When polled we return simulated CQEs with
      IB_WC_WR_FLUSH_ERR return code enabling ULPs to clean up their resources and
      not wait forever for completions upon receiving remove_one.
      
      The above change requires an extra check in the data path to make sure that when
      device is in error state, the simulated CQEs will be returned and no further
      WQEs will be posted.
      Signed-off-by: NYishai Hadas <yishaih@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      35f05dab
  2. 09 2月, 2015 3 次提交
    • E
      net:rfs: adjust table size checking · 93c1af6c
      Eric Dumazet 提交于
      Make sure root user does not try something stupid.
      
      Also make sure mask field in struct rps_sock_flow_table
      does not share a cache line with the potentially often dirtied
      flow table.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Fixes: 567e4b79 ("net: rfs: add hash collision detection")
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      93c1af6c
    • E
      net: rfs: add hash collision detection · 567e4b79
      Eric Dumazet 提交于
      Receive Flow Steering is a nice solution but suffers from
      hash collisions when a mix of connected and unconnected traffic
      is received on the host, when flow hash table is populated.
      
      Also, clearing flow in inet_release() makes RFS not very good
      for short lived flows, as many packets can follow close().
      (FIN , ACK packets, ...)
      
      This patch extends the information stored into global hash table
      to not only include cpu number, but upper part of the hash value.
      
      I use a 32bit value, and dynamically split it in two parts.
      
      For host with less than 64 possible cpus, this gives 6 bits for the
      cpu number, and 26 (32-6) bits for the upper part of the hash.
      
      Since hash bucket selection use low order bits of the hash, we have
      a full hash match, if /proc/sys/net/core/rps_sock_flow_entries is big
      enough.
      
      If the hash found in flow table does not match, we fallback to RPS (if
      it is enabled for the rxqueue).
      
      This means that a packet for an non connected flow can avoid the
      IPI through a unrelated/victim CPU.
      
      This also means we no longer have to clear the table at socket
      close time, and this helps short lived flows performance.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      567e4b79
    • S
      net: fix a typo in skb_checksum_validate_zero_check · 096a4cfa
      Sabrina Dubroca 提交于
      Remove trailing underscore.
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      096a4cfa
  3. 08 2月, 2015 3 次提交
  4. 05 2月, 2015 6 次提交
    • H
      rhashtable: Introduce rhashtable_walk_* · f2dba9c6
      Herbert Xu 提交于
      Some existing rhashtable users get too intimate with it by walking
      the buckets directly.  This prevents us from easily changing the
      internals of rhashtable.
      
      This patch adds the helpers rhashtable_walk_init/exit/start/next/stop
      which will replace these custom walkers.
      
      They are meant to be usable for both procfs seq_file walks as well
      as walking by a netlink dump.  The iterator structure should fit
      inside a netlink dump cb structure, with at least one element to
      spare.
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f2dba9c6
    • M
      net/mlx4_core: Port aggregation upper layer interface · 53f33ae2
      Moni Shoua 提交于
      Supply interface functions to bond and unbond ports of a mlx4 internal
      interfaces. Example for such an interface is the one registered by the
      mlx4 IB driver under RoCE.
      
      There are
      
      1. Functions to go in/out to/from bonded mode
      2. Function to remap virtual ports to physical ports
      
      The bond_mutex prevents simultaneous access to data that keep status of
      the device in bonded mode.
      
      The upper mlx4 interface marks to the mlx4 core module that they
      want to be subject for such bonding by setting the MLX4_INTFF_BONDING
      flag. Interface which goes to/from bonded mode is re-created.
      
      The mlx4 Ethernet driver does not set this flag when registering the
      interface, the IB driver does.
      Signed-off-by: NMoni Shoua <monis@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      53f33ae2
    • M
      net/mlx4_core: Port aggregation low level interface · 59e14e32
      Moni Shoua 提交于
      Implement the hardware interface required for port aggregation.
      
      1. Disable RX port check on receive - don't perform a validity check
      that matches to QP's port and the port where the packet is received.
      
      2. Virtual to physical port remap - configure virtual to physical port
      mapping. Port remap capability for virtual functions.
      Signed-off-by: NMoni Shoua <monis@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      59e14e32
    • M
      net/core: Add event for a change in slave state · 61bd3857
      Moni Shoua 提交于
      Add event which provides an indication on a change in the state
      of a bonding slave. The event handler should cast the pointer to the
      appropriate type (struct netdev_bonding_info) in order to get the
      full info about the slave.
      Signed-off-by: NMoni Shoua <monis@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      61bd3857
    • T
      net: add skb functions to process remote checksum offload · dcdc8994
      Tom Herbert 提交于
      This patch adds skb_remcsum_process and skb_gro_remcsum_process to
      perform the appropriate adjustments to the skb when receiving
      remote checksum offload.
      
      Updated vxlan and gue to use these functions.
      
      Tested: Ran TCP_RR and TCP_STREAM netperf for VXLAN and GUE, did
      not see any change in performance.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dcdc8994
    • E
      xps: fix xps for stacked devices · 2bd82484
      Eric Dumazet 提交于
      A typical qdisc setup is the following :
      
      bond0 : bonding device, using HTB hierarchy
      eth1/eth2 : slaves, multiqueue NIC, using MQ + FQ qdisc
      
      XPS allows to spread packets on specific tx queues, based on the cpu
      doing the send.
      
      Problem is that dequeues from bond0 qdisc can happen on random cpus,
      due to the fact that qdisc_run() can dequeue a batch of packets.
      
      CPUA -> queue packet P1 on bond0 qdisc, P1->ooo_okay=1
      CPUA -> queue packet P2 on bond0 qdisc, P2->ooo_okay=0
      
      CPUB -> dequeue packet P1 from bond0
              enqueue packet on eth1/eth2
      CPUC -> dequeue packet P2 from bond0
              enqueue packet on eth1/eth2 using sk cache (ooo_okay is 0)
      
      get_xps_queue() then might select wrong queue for P1, since current cpu
      might be different than CPUA.
      
      P2 might be sent on the old queue (stored in sk->sk_tx_queue_mapping),
      if CPUC runs a bit faster (or CPUB spins a bit on qdisc lock)
      
      Effect of this bug is TCP reorders, and more generally not optimal
      TX queue placement. (A victim bulk flow can be migrated to the wrong TX
      queue for a while)
      
      To fix this, we have to record sender cpu number the first time
      dev_queue_xmit() is called for one tx skb.
      
      We can union napi_id (used on receive path) and sender_cpu,
      granted we clear sender_cpu in skb_scrub_packet() (credit to Willem for
      this union idea)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2bd82484
  5. 04 2月, 2015 7 次提交
  6. 03 2月, 2015 2 次提交
  7. 02 2月, 2015 3 次提交
    • R
      bridge: add flags argument to ndo_bridge_setlink and ndo_bridge_dellink · add511b3
      Roopa Prabhu 提交于
      bridge flags are needed inside ndo_bridge_setlink/dellink handlers to
      avoid another call to parse IFLA_AF_SPEC inside these handlers
      
      This is used later in this series
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      add511b3
    • R
      netdev: introduce new NETIF_F_HW_SWITCH_OFFLOAD feature flag for switch device offloads · aafb3e98
      Roopa Prabhu 提交于
      This is a high level feature flag for all switch asic offloads
      
      switch drivers set this flag on switch ports. Logical devices like
      bridge, bonds, vxlans can inherit this flag from their slaves/ports.
      
      The patch also adds the flag to NETIF_F_ONE_FOR_ALL, so that it gets
      propagated to the upperdevices (bridges and bonds).
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aafb3e98
    • L
      sched: don't cause task state changes in nested sleep debugging · 00845eb9
      Linus Torvalds 提交于
      Commit 8eb23b9f ("sched: Debug nested sleeps") added code to report
      on nested sleep conditions, which we generally want to avoid because the
      inner sleeping operation can re-set the thread state to TASK_RUNNING,
      but that will then cause the outer sleep loop not actually sleep when it
      calls schedule.
      
      However, that's actually valid traditional behavior, with the inner
      sleep being some fairly rare case (like taking a sleeping lock that
      normally doesn't actually need to sleep).
      
      And the debug code would actually change the state of the task to
      TASK_RUNNING internally, which makes that kind of traditional and
      working code not work at all, because now the nested sleep doesn't just
      sometimes cause the outer one to not block, but will cause it to happen
      every time.
      
      In particular, it will cause the cardbus kernel daemon (pccardd) to
      basically busy-loop doing scheduling, converting a laptop into a heater,
      as reported by Bruno Prémont.  But there may be other legacy uses of
      that nested sleep model in other drivers that are also likely to never
      get converted to the new model.
      
      This fixes both cases:
      
       - don't set TASK_RUNNING when the nested condition happens (note: even
         if WARN_ONCE() only _warns_ once, the return value isn't whether the
         warning happened, but whether the condition for the warning was true.
         So despite the warning only happening once, the "if (WARN_ON(..))"
         would trigger for every nested sleep.
      
       - in the cases where we knowingly disable the warning by using
         "sched_annotate_sleep()", don't change the task state (that is used
         for all core scheduling decisions), instead use '->task_state_change'
         that is used for the debugging decision itself.
      
      (Credit for the second part of the fix goes to Oleg Nesterov: "Can't we
      avoid this subtle change in behaviour DEBUG_ATOMIC_SLEEP adds?" with the
      suggested change to use 'task_state_change' as part of the test)
      Reported-and-bisected-by: NBruno Prémont <bonbons@linux-vserver.org>
      Tested-by: NRafael J Wysocki <rjw@rjwysocki.net>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>,
      Cc: Ilya Dryomov <ilya.dryomov@inktank.com>,
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Hurley <peter@hurleysoftware.com>,
      Cc: Davidlohr Bueso <dave@stgolabs.net>,
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      00845eb9
  8. 31 1月, 2015 1 次提交
  9. 30 1月, 2015 2 次提交
    • S
      dev: add per net_device packet type chains · 7866a621
      Salam Noureddine 提交于
      When many pf_packet listeners are created on a lot of interfaces the
      current implementation using global packet type lists scales poorly.
      This patch adds per net_device packet type lists to fix this problem.
      
      The patch was originally written by Eric Biederman for linux-2.6.29.
      Tested on linux-3.16.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NSalam Noureddine <noureddine@arista.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7866a621
    • L
      vm: add VM_FAULT_SIGSEGV handling support · 33692f27
      Linus Torvalds 提交于
      The core VM already knows about VM_FAULT_SIGBUS, but cannot return a
      "you should SIGSEGV" error, because the SIGSEGV case was generally
      handled by the caller - usually the architecture fault handler.
      
      That results in lots of duplication - all the architecture fault
      handlers end up doing very similar "look up vma, check permissions, do
      retries etc" - but it generally works.  However, there are cases where
      the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV.
      
      In particular, when accessing the stack guard page, libsigsegv expects a
      SIGSEGV.  And it usually got one, because the stack growth is handled by
      that duplicated architecture fault handler.
      
      However, when the generic VM layer started propagating the error return
      from the stack expansion in commit fee7e49d ("mm: propagate error
      from stack expansion even for guard page"), that now exposed the
      existing VM_FAULT_SIGBUS result to user space.  And user space really
      expected SIGSEGV, not SIGBUS.
      
      To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those
      duplicate architecture fault handlers about it.  They all already have
      the code to handle SIGSEGV, so it's about just tying that new return
      value to the existing code, but it's all a bit annoying.
      
      This is the mindless minimal patch to do this.  A more extensive patch
      would be to try to gather up the mostly shared fault handling logic into
      one generic helper routine, and long-term we really should do that
      cleanup.
      
      Just from this patch, you can generally see that most architectures just
      copied (directly or indirectly) the old x86 way of doing things, but in
      the meantime that original x86 model has been improved to hold the VM
      semaphore for shorter times etc and to handle VM_FAULT_RETRY and other
      "newer" things, so it would be a good idea to bring all those
      improvements to the generic case and teach other architectures about
      them too.
      Reported-and-tested-by: NTakashi Iwai <tiwai@suse.de>
      Tested-by: NJan Engelhardt <jengelh@inai.de>
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots"
      Cc: linux-arch@vger.kernel.org
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      33692f27
  10. 29 1月, 2015 3 次提交
  11. 28 1月, 2015 5 次提交
    • P
      perf: Tighten (and fix) the grouping condition · c3c87e77
      Peter Zijlstra 提交于
      The fix from 9fc81d87 ("perf: Fix events installation during
      moving group") was incomplete in that it failed to recognise that
      creating a group with events for different CPUs is semantically
      broken -- they cannot be co-scheduled.
      
      Furthermore, it leads to real breakage where, when we create an event
      for CPU Y and then migrate it to form a group on CPU X, the code gets
      confused where the counter is programmed -- triggered in practice
      as well by me via the perf fuzzer.
      
      Fix this by tightening the rules for creating groups. Only allow
      grouping of counters that can be co-scheduled in the same context.
      This means for the same task and/or the same cpu.
      
      Fixes: 9fc81d87 ("perf: Fix events installation during moving group")
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20150123125834.090683288@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      c3c87e77
    • J
      quota: Switch ->get_dqblk() and ->set_dqblk() to use bytes as space units · 14bf61ff
      Jan Kara 提交于
      Currently ->get_dqblk() and ->set_dqblk() use struct fs_disk_quota which
      tracks space limits and usage in 512-byte blocks. However VFS quotas
      track usage in bytes (as some filesystems require that) and we need to
      somehow pass this information. Upto now it wasn't a problem because we
      didn't do any unit conversion (thus VFS quota routines happily stuck
      number of bytes into d_bcount field of struct fd_disk_quota). Only if
      you tried to use Q_XGETQUOTA or Q_XSETQLIM for VFS quotas (or Q_GETQUOTA
      / Q_SETQUOTA for XFS quotas), you got bogus results. Hardly anyone
      tried this but reportedly some Samba users hit the problem in practice.
      So when we want interfaces compatible we need to fix this.
      
      We bite the bullet and define another quota structure used for passing
      information from/to ->get_dqblk()/->set_dqblk. It's somewhat sad we have
      to have more conversion routines in fs/quota/quota.c and another copying
      of quota structure slows down getting of quota information by about 2%
      but it seems cleaner than overloading e.g. units of d_bcount to bytes.
      
      CC: stable@vger.kernel.org
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJan Kara <jack@suse.cz>
      14bf61ff
    • J
      net/mlx4_core: Adjust command timeouts to conform to the firmware spec · 5a031086
      Jack Morgenstein 提交于
      The firmware spec states that the timeout for all commands should be 60 seconds.
      
      In the past, the spec indicated that there were several classes of timeout
      (short, medium, and long).  The driver has these different timeout classes.
      We leave the class differentiation in the driver as-is (to protect against any
      future spec changes), but set the timeout for all classes to be 60 seconds.
      
      In addition, we fix a few commands which had hard-coded numeric timeouts specified.
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NAmir Vadai <amirv@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5a031086
    • J
      net/mlx4_core: Add bad-cable event support · be6a6b43
      Jack Morgenstein 提交于
      If the firmware can detect a bad cable, allow it to generate an
      event, and print the problem in the log.
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NAmir Vadai <amirv@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      be6a6b43
    • C
      NFC: st21nfca: Adding support for secure element · 2130fb97
      Christophe Ricard 提交于
      st21nfca has 1 physical SWP line and can support up to 2 secure elements
      (UICC & eSE) thanks to an external switch managed with a gpio.
      
      The platform integrator needs to specify thanks to 2 initialization
      properties, uicc-present and ese-present, if it is suppose to have uicc
      and/or ese. Of course if the platform does not have an external switch,
      only one kind of secure element can be supported. Those parameters are
      under platform integrator responsibilities.
      
      During initialization, the white_list will be set according to those
      parameters.
      
      The discovery_se function will assume a secure element is physically
      present according to uicc-present and ese-present values and will add it
      to the secure element list. On ese activation, the atr is retrieved to
      calculate a command exchange timeout based on the first atr(TB) value.
      
      The se_io will allow to transfer data over SWP. 2 kind of events may appear
      after a data is sent over:
      - ST21NFCA_EVT_TRANSMIT_DATA when receiving an apdu answer
      - ST21NFCA_EVT_WTX_REQUEST when the secure element needs more time than
      expected to compute a command. If this timeout expired, a first recovery
      tentative consist to send a simple software reset proprietary command.
      If this tentative still fail, a second recovery tentative consist to send
      a hardware reset proprietary command.
      This function is only relevant for eSE like secure element.
      
      This patch also change the way a pipe is referenced. There can be
      different pipe connected to the same gate with different host destination
      (ex: CONNECTIVITY). In order to keep host information every pipe are
      reference with a tuple (gate, host). In order to reduce changes, we are
      keeping unchanged the way a gate is addressed on the Terminal Host.
      However, this is working because we consider the apdu reader gate is only
      present on the eSE slot also the connectivity gate cannot give a reliable
      value; it will give the latest stored pipe value.
      Signed-off-by: NChristophe Ricard <christophe-h.ricard@st.com>
      Signed-off-by: NSamuel Ortiz <sameo@linux.intel.com>
      2130fb97
  12. 27 1月, 2015 4 次提交