1. 14 December 2014 (11 commits)
    • mm/debug-pagealloc: prepare boottime configurable on/off · e30825f1
      Committed by Joonsoo Kim
      Until now, debug-pagealloc has needed extra flags in struct page, so the
      whole source tree must be recompiled whenever we decide to use it.  This
      is really painful, because recompiling takes time, and sometimes a
      rebuild is not possible at all because third-party modules depend on
      struct page.  So we can't use this useful feature in many cases.
      
      Now we have the page extension feature, which lets us keep extra flags
      outside of struct page.  This gets rid of the third-party module issue
      mentioned above, and it lets us determine at boot time whether extra
      memory for the page extension is needed.  With these properties, a
      kernel built with CONFIG_DEBUG_PAGEALLOC can leave debug-pagealloc
      unused at boot time with low computational overhead.  This will help
      our development process greatly.
      
      This patch is the preparation step towards that goal.  debug-pagealloc
      originally uses an extra field of struct page; after this patch, it
      will use a field of struct page_ext instead.  Because memory for
      page_ext is allocated later than the page allocator is initialized
      under CONFIG_SPARSEMEM, the debug-pagealloc feature must be disabled
      temporarily until page_ext has been initialized.  This patch implements
      that, as sketched below.
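      
      A rough illustration of that gating (the flag name here is an
      assumption, not the patch's actual symbol):
      
          static bool page_ext_ready __read_mostly; /* set after page_ext init */
      
          void __kernel_map_pages(struct page *page, int numpages, int enable)
          {
                  if (!page_ext_ready)
                          return; /* page_ext memory not allocated yet: no-op */
                  /* ... set/clear the debug flag in each page's page_ext ... */
          }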
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e30825f1
    • mm/page_ext: resurrect struct page extending code for debugging · eefa864b
      Committed by Joonsoo Kim
      When we debug something, we'd like to attach some information to every
      page.  For this purpose we sometimes modify struct page itself, but
      that has drawbacks.  First, it requires a recompile, which makes us
      hesitate to use this powerful debug feature, so the development process
      is slowed down.  Second, it is sometimes impossible to rebuild the
      kernel because of third-party module dependencies.  Third, system
      behaviour can change greatly after a recompile, because it changes the
      size of struct page, and that structure is accessed by every part of
      the kernel; keeping struct page as it is makes it easier to reproduce
      an erroneous situation.
      
      This feature is intended to overcome the problems mentioned above.  It
      allocates the per-page extended data somewhere other than struct page
      itself, and that memory can be accessed through the accessor functions
      provided by this code.  During the boot process, it checks whether the
      huge chunk of memory is needed at all and, if not, avoids allocating
      it entirely.  With this advantage, we can include the feature in the
      kernel by default and avoid the rebuilds and the problems that come
      with them.
      
      Until now, memcg used this technique.  But memcg now embeds its
      variable in struct page itself, and its code for extending struct page
      has been removed.  I'd like to use this code to develop a debug
      feature, so this patch resurrects it.
      
      To make this work well, this patch introduces two callbacks for
      clients.  One is the need callback, which is mandatory if the user
      wants to avoid useless memory allocation at boot time.  The other is
      the optional init callback, which is used to do proper initialization
      after memory is allocated.  A detailed explanation of the purpose of
      these functions is in the code comments; please refer to them.
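      
      A minimal sketch of a client of these callbacks (the operations struct
      follows the description above; the client itself is hypothetical):
      
          static bool my_debug_requested; /* e.g. set from a boot parameter */
      
          static bool my_debug_need(void)
          {
                  /* Mandatory: return false to skip the boot-time allocation. */
                  return my_debug_requested;
          }
      
          static void my_debug_init(void)
          {
                  /* Optional: one-time setup after page_ext memory exists. */
          }
      
          struct page_ext_operations my_debug_ops = {
                  .need = my_debug_need,
                  .init = my_debug_init,
          };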
      
      Everything else is the same as the previous extension code in memcg.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eefa864b
    • mm, gfp: escalatedly define GFP_HIGHUSER and GFP_HIGHUSER_MOVABLE · 2d48366b
      Committed by Jianyu Zhan
      GFP_USER, GFP_HIGHUSER and GFP_HIGHUSER_MOVABLE build on one another
      incrementally, as their names imply:
      
      GFP_USER                                  = GFP_USER
      GFP_USER + __GFP_HIGHMEM                  = GFP_HIGHUSER
      GFP_USER + __GFP_HIGHMEM + __GFP_MOVABLE  = GFP_HIGHUSER_MOVABLE
      
      So just define GFP_HIGHUSER and GFP_HIGHUSER_MOVABLE incrementally to
      reflect this fact.  It also makes the definitions clearer and textually
      warns against any future break-up of this incremental relationship.
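      
      Concretely, the definitions after this change would look something
      like this:
      
          #define GFP_HIGHUSER         (GFP_USER | __GFP_HIGHMEM)
          #define GFP_HIGHUSER_MOVABLE (GFP_HIGHUSER | __GFP_MOVABLE)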
      Signed-off-by: Jianyu Zhan <jianyu.zhan@emc.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2d48366b
    • include/linux/kmemleak.h: needs slab.h · 66f2ca7e
      Committed by Andrew Morton
      include/linux/kmemleak.h: In function 'kmemleak_alloc_recursive':
      include/linux/kmemleak.h:43: error: 'SLAB_NOLEAKTRACE' undeclared (first use in this function)
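      
      The fix is presumably just to pull in the header that defines
      SLAB_NOLEAKTRACE:
      
          #include <linux/slab.h> /* for SLAB_NOLEAKTRACE */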
      
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      66f2ca7e
    • mm/memcontrol.c: remove the unused arg in __memcg_kmem_get_cache() · 056b7cce
      Committed by Zhang Zhen
      The gfp argument was passed in but never used in this function.
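      
      A sketch of the resulting prototype (illustrative; the exact
      declaration may differ):
      
          /* before: ...get_cache(struct kmem_cache *cachep, gfp_t gfp); */
          struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep);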
      Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      056b7cce
    • mm: move swp_entry_t definition to include/linux/mm_types.h · bd6dace7
      Committed by Tejun Heo
      swp_entry_t being defined in include/linux/swap.h instead of
      include/linux/mm_types.h causes a cyclic include dependency later, when
      include/linux/page_cgroup.h is included from the writeback path.  Move
      the definition to include/linux/mm_types.h.
      
      While at it, reformat the comment above it.
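      
      The definition being moved is essentially:
      
          /* A swap entry has to fit into an unsigned long, as the entry is
           * hidden in the "index" field of the swapper address space. */
          typedef struct {
                  unsigned long val;
          } swp_entry_t;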
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bd6dace7
    • memcg: turn memcg_kmem_skip_account into a bit field · 6f185c29
      Committed by Vladimir Davydov
      It isn't supposed to stack, so turn it into a bit-field to save 4 bytes on
      the task_struct.
      
      Also, remove the memcg_stop/resume_kmem_account helpers - it is clearer
      to set/clear the flag inline.  As for the lengthy comment on the
      helpers, which this patch also removes, we already have a compact yet
      accurate explanation in memcg_schedule_cache_create; there is no need
      for yet another one.
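      
      A sketch of the resulting field in struct task_struct (surrounding
      members omitted):
      
          struct task_struct {
                  /* ... */
          #ifdef CONFIG_MEMCG_KMEM
                  unsigned memcg_kmem_skip_account:1; /* was a full integer */
          #endif
                  /* ... */
          };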
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6f185c29
    • lib: bitmap: add alignment offset for bitmap_find_next_zero_area() · 5e19b013
      Committed by Michal Nazarewicz
      Add a bitmap_find_next_zero_area_off() function which works like
      bitmap_find_next_zero_area(), except that it allows an offset to be
      specified when alignment is checked.  This lets the caller request a
      bit such that its number plus the offset is aligned according to the
      mask.
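      
      The new signature, and the alignment rule it applies, roughly: the
      offset is added before aligning and subtracted back out afterwards, so
      it is (index + offset) that ends up aligned to the mask:
      
          unsigned long bitmap_find_next_zero_area_off(unsigned long *map,
                                                       unsigned long size,
                                                       unsigned long start,
                                                       unsigned int nr,
                                                       unsigned long align_mask,
                                                       unsigned long align_offset);
      
          /* core of the search, approximately: */
          index = __ALIGN_MASK(index + align_offset, align_mask) - align_offset;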
      
      [gregory.0xf0@gmail.com: Retrieved from https://patchwork.linuxtv.org/patch/6254/ and updated documentation]
      Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
      Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
      Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Gregory Fong <gregory.0xf0@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kukjin Kim <kgene.kim@samsung.com>
      Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5e19b013
    • mm/rmap: share the i_mmap_rwsem · 3dec0ba0
      Committed by Davidlohr Bueso
      As with the anon memory counterpart, we can share the mapping's lock
      ownership, since the interval tree is not modified during the walk,
      only the file page.
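      
      A sketch of the shared-lock pattern (read-side helper names follow the
      i_mmap_lock_write() convention introduced earlier in this series):
      
          i_mmap_lock_read(mapping); /* shared: the tree is not modified */
          vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
                  /* ... process the file page in each mapping ... */
          }
          i_mmap_unlock_read(mapping);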
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3dec0ba0
    • mm: convert i_mmap_mutex to rwsem · c8c06efa
      Committed by Davidlohr Bueso
      The i_mmap_mutex is a close cousin of the anon vma lock; both protect
      similar data, one for file-backed pages and the other for anon memory.
      To this end, this lock can also be an rwsem.  In addition, there are
      important opportunities to share the lock when there are no tree
      modifications.
      
      This conversion is straightforward.  For now, all users take the write
      lock.
      
      [sfr@canb.auug.org.au: update fremap.c]
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c8c06efa
    • mm,fs: introduce helpers around the i_mmap_mutex · 8b28f621
      Committed by Davidlohr Bueso
      This series is a continuation of the conversion of the i_mmap_mutex to
      an rwsem, following what we have for the anon memory counterpart, and
      it incorporates Hugh's feedback from the first iteration.
      
      Ultimately, the most obvious paths that require exclusive ownership of
      the lock are those that modify the VMA interval tree, via the
      vma_interval_tree_insert() and vma_interval_tree_remove() families.
      Cases such as unmapping, where the pte contents change but the tree
      remains untouched, should be safe for sharing the i_mmap_rwsem.
      
      The code itself is straightforward; however, the devil is very much in
      the details.  While it has been tested on a number of workloads without
      anything exploding, I would not be surprised if there are some less
      documented or known assumptions about the lock that could suffer from
      these changes.  Or maybe I'm just missing something, but either way I
      believe it's at the point where it could use more eyes and, hopefully,
      some time in linux-next.
      
      Because the lock type conversion is the heart of this patchset,
      it's worth noting a few comparisons between mutex vs rwsem (xadd):
      
        (i) Same size, no extra footprint.
      
        (ii) Both have CONFIG_XXX_SPIN_ON_OWNER capabilities for
             exclusive lock ownership.
      
        (iii) Both can be slightly unfair wrt exclusive ownership, with
              writer lock stealing properties, not necessarily respecting
              FIFO order for granting the lock when contended.
      
        (iv) Mutexes can be slightly faster than rwsems when
             the lock is non-contended.
      
        (v) Both suck at performance for debug (slowpaths), which
            shouldn't matter anyway.
      
      Sharing the lock is obviously beneficial, and sem writer ownership is
      close enough to mutexes.  The biggest winner of these changes is
      migration.
      
      As for concrete numbers, the following performance results are for a
      4-socket 60-core IvyBridge-EX with 130 GB of RAM.
      
      Both the alltests and disk (xfs+ramdisk) workloads of the aim7 suite do
      quite well with this set, with a steady ~60% throughput (jpm) increase
      for alltests and up to ~30% for disk at high concurrency.  Lower
      workload user counts (< 100) do not show much difference at all, so at
      least there are no regressions.
      
                          3.18-rc1            3.18-rc1-i_mmap_rwsem
      alltests-100     17918.72 (  0.00%)    28417.97 ( 58.59%)
      alltests-200     16529.39 (  0.00%)    26807.92 ( 62.18%)
      alltests-300     16591.17 (  0.00%)    26878.08 ( 62.00%)
      alltests-400     16490.37 (  0.00%)    26664.63 ( 61.70%)
      alltests-500     16593.17 (  0.00%)    26433.72 ( 59.30%)
      alltests-600     16508.56 (  0.00%)    26409.20 ( 59.97%)
      alltests-700     16508.19 (  0.00%)    26298.58 ( 59.31%)
      alltests-800     16437.58 (  0.00%)    26433.02 ( 60.81%)
      alltests-900     16418.35 (  0.00%)    26241.61 ( 59.83%)
      alltests-1000    16369.00 (  0.00%)    26195.76 ( 60.03%)
      alltests-1100    16330.11 (  0.00%)    26133.46 ( 60.03%)
      alltests-1200    16341.30 (  0.00%)    26084.03 ( 59.62%)
      alltests-1300    16304.75 (  0.00%)    26024.74 ( 59.61%)
      alltests-1400    16231.08 (  0.00%)    25952.35 ( 59.89%)
      alltests-1500    16168.06 (  0.00%)    25850.58 ( 59.89%)
      alltests-1600    16142.56 (  0.00%)    25767.42 ( 59.62%)
      alltests-1700    16118.91 (  0.00%)    25689.58 ( 59.38%)
      alltests-1800    16068.06 (  0.00%)    25599.71 ( 59.32%)
      alltests-1900    16046.94 (  0.00%)    25525.92 ( 59.07%)
      alltests-2000    16007.26 (  0.00%)    25513.07 ( 59.38%)
      
      disk-100          7582.14 (  0.00%)     7257.48 ( -4.28%)
      disk-200          6962.44 (  0.00%)     7109.15 (  2.11%)
      disk-300          6435.93 (  0.00%)     6904.75 (  7.28%)
      disk-400          6370.84 (  0.00%)     6861.26 (  7.70%)
      disk-500          6353.42 (  0.00%)     6846.71 (  7.76%)
      disk-600          6368.82 (  0.00%)     6806.75 (  6.88%)
      disk-700          6331.37 (  0.00%)     6796.01 (  7.34%)
      disk-800          6324.22 (  0.00%)     6788.00 (  7.33%)
      disk-900          6253.52 (  0.00%)     6750.43 (  7.95%)
      disk-1000         6242.53 (  0.00%)     6855.11 (  9.81%)
      disk-1100         6234.75 (  0.00%)     6858.47 ( 10.00%)
      disk-1200         6312.76 (  0.00%)     6845.13 (  8.43%)
      disk-1300         6309.95 (  0.00%)     6834.51 (  8.31%)
      disk-1400         6171.76 (  0.00%)     6787.09 (  9.97%)
      disk-1500         6139.81 (  0.00%)     6761.09 ( 10.12%)
      disk-1600         4807.12 (  0.00%)     6725.33 ( 39.90%)
      disk-1700         4669.50 (  0.00%)     5985.38 ( 28.18%)
      disk-1800         4663.51 (  0.00%)     5972.99 ( 28.08%)
      disk-1900         4674.31 (  0.00%)     5949.94 ( 27.29%)
      disk-2000         4668.36 (  0.00%)     5834.93 ( 24.99%)
      
      In addition, there is a 67.5% increase in successfully migrated NUMA
      pages, improving node locality.
      
      The patch layout is simple but designed for bisection (in case reversion
      is needed if the changes break upstream) and easier review:
      
      o Patches 1-4 convert the i_mmap lock from mutex to rwsem.
      o Patches 5-10 share the lock in specific paths, each patch
        details the rationale behind why it should be safe.
      
      This patchset has been tested with: postgres 9.4 (with brand new hugetlb
      support), hugetlbfs test suite (all tests pass, in fact more tests pass
      with these changes than with an upstream kernel), ltp, aim7 benchmarks,
      memcached and iozone with the -B option for mmap'ing.  *Untested* paths
      are nommu, memory-failure, uprobes and xip.
      
      This patch (of 8):
      
      Various parts of the kernel acquire and release this mutex, so add
      i_mmap_lock_write() and i_mmap_unlock_write() helper functions that
      encapsulate this logic.  The next patch will make use of these.
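      
      A sketch of the helpers; at this point in the series the lock is still
      a mutex (the rwsem conversion comes in a later patch):
      
          static inline void i_mmap_lock_write(struct address_space *mapping)
          {
                  mutex_lock(&mapping->i_mmap_mutex);
          }
      
          static inline void i_mmap_unlock_write(struct address_space *mapping)
          {
                  mutex_unlock(&mapping->i_mmap_mutex);
          }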
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8b28f621
  2. 12 December 2014 (5 commits)
    • pstore-ram: Allow optional mapping with pgprot_noncached · 027bc8b0
      Committed by Tony Lindgren
      On some ARMs the memory can be mapped pgprot_noncached() and still work
      for atomic operations.  As pointed out by Colin Cross
      <ccross@android.com>, in some cases you do want to use
      pgprot_noncached(), if the SoC supports it, so you can see a debug
      printk issued just before a write hangs the system.
      
      On ARMs, atomic operations on strongly ordered memory are
      implementation-defined.  So let's provide an optional kernel parameter
      for configuring pgprot_noncached(), and use pgprot_writecombine() by
      default.
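      
      The selection might look roughly like this (the parameter name is an
      assumption):
      
          /* mem_type != 0 requests a noncached mapping */
          pgprot_t prot = mem_type ? pgprot_noncached(PAGE_KERNEL)
                                   : pgprot_writecombine(PAGE_KERNEL);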
      
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Rob Herring <robherring2@gmail.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Anton Vorontsov <anton@enomsg.org>
      Cc: Colin Cross <ccross@android.com>
      Cc: Olof Johansson <olof@lixom.net>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: stable@vger.kernel.org
      Acked-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Tony Lindgren <tony@atomide.com>
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      027bc8b0
    • net/mlx4: Add support for A0 steering · 7d077cd3
      Committed by Matan Barak
      Add the required firmware commands for A0 steering and a way to enable
      that. The firmware support focuses on INIT_HCA, QUERY_HCA, QUERY_PORT,
      QUERY_DEV_CAP and QUERY_FUNC_CAP commands. Those commands are used
      to configure and query the device.
      
      The different A0 DMFS (steering) modes are:
      
      Static - optimized performance, but flow steering rules are
      limited. This mode must be chosen explicitly by the user in order
      to be used.
      
      Dynamic - this mode must be chosen explicitly by the user. In this
      mode, the FW works in optimized steering mode for as long as it can
      and afterwards automatically drops back to classic (full) DMFS.
      
      Disable - this mode must be chosen explicitly by the user. The user
      instructs the system not to use optimized steering, even if the FW
      supports Dynamic A0 DMFS (and thus would be able to use optimized
      steering in Default A0 DMFS mode).
      
      Default - this mode is chosen implicitly. If the FW supports Dynamic
      A0 DMFS, the device works in that mode; otherwise, it works in
      Disable A0 DMFS mode.
      
      Under an SRIOV configuration, when the A0 steering mode is enabled,
      older guest VF drivers that aren't using the RX QP allocation flag
      (MLX4_RESERVE_A0_QP) will get a QP from the general range and
      fail when attempting to register a steering rule. To avoid that,
      once in A0 static mode the PF context behaviour is changed to
      require support for the allocation flag in VF drivers too.
      
      In order to enable A0 steering, we use the log_num_mgm_entry_size
      parameter. If the value of the parameter is not positive, we treat its
      absolute value as a bit field. Setting bit 2 of this bit field enables
      static A0 steering.
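      
      A hedged sketch of that decoding (the bit position follows the text;
      the helper is hypothetical):
      
          int v = log_num_mgm_entry_size;
      
          if (v <= 0 && (-v) & (1 << 2)) /* bit 2 of the absolute value */
                  enable_static_a0_steering(); /* hypothetical helper */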
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7d077cd3
    • net/mlx4: Add A0 hybrid steering · d57febe1
      Committed by Matan Barak
      A0 hybrid steering is a form of high-performance flow steering.
      In this mode, mlx4 cards use fast, limited, table-based steering
      to enable fast steering of unicast packets to a QP.
      
      In order to implement A0 hybrid steering we allocate resources
      from different zones:
      (1) General range
      (2) Special MAC-assigned QPs [RSS, Raw-Ethernet], each with its own region.
      
      When we create an RSS QP or a raw Ethernet (A0-steerable and BF-ready)
      QP, we try hard to allocate the QP from range (2); otherwise, we try
      hard not to allocate from that range. However, when the system is
      pushed to its limits and every resource is needed, the allocator uses
      every region it can.
      
      Meaning, when we run out of raw-eth QPs, the allocator allocates from the
      general range (and the special A0 area is no longer active). If we run out
      of RSS QPs, the mechanism tries to allocate from the raw-eth QP zone. If
      that is also exhausted, the allocator will allocate from the general range
      (and the A0 region is no longer active).
      
      Note that if a raw-eth QP is allocated from the general range, the
      allocator attempts to pick a range such that bits 6 and 7 (the
      blueflame bits) in the QP number are not set.
      
      When the feature is used in SRIOV, the VF has to notify the PF what
      kind of QP attributes it needs. In order to do that, along with the
      "Eth QP blueflame" bit, we reserve a new "A0 steerable QP" bit. Based
      on the combination of these bits, the PF tries to allocate a suitable QP.
      
      In order to maintain backward compatibility (with older PFs), the PF
      notifies which QP attributes it supports via QUERY_FUNC_CAP command.
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d57febe1
    • net/mlx4: Change QP allocation scheme · ddae0349
      Committed by Eugenia Emantayev
      When using BF (Blue-Flame), the QPN overrides the VLAN, CV, and SV fields
      in the WQE. Thus, BF may only be used for QPNs with bits 6,7 unset.
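      
      The eligibility rule stated above, as a sketch (the helper name is
      hypothetical):
      
          static inline bool qpn_is_bf_capable(u32 qpn)
          {
                  return (qpn & 0xc0) == 0; /* bits 6 and 7 must be unset */
          }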
      
      The current Ethernet driver code reserves a Tx QP range with 256b alignment.
      
      This is wrong because if there are more than 64 Tx QPs in use,
      QPNs >= base + 65 will have bits 6/7 set.
      
      This problem is not specific to the Ethernet driver; any entity that
      tries to reserve more than 64 BF-enabled QPs should fail. Also, using
      ranges is not necessary here and is wasteful.
      
      The new mechanism introduced here will support reservation for
      "Eth QPs eligible for BF" for all drivers: bare-metal, multi-PF, and VFs
      (when hypervisors support WC in VMs). The flow we use is:
      
      1. In mlx4_en, allocate Tx QPs one by one instead of a range allocation,
         and request "BF enabled QPs" if BF is supported for the function
      
      2. In the ALLOC_RES FW command, change param1 to:
      a. param1[23:0]  - number of QPs
      b. param1[31:24] - flags controlling QP reservation
      
      Bit 31 refers to Ethernet blueflame-capable QPs. Those QPs must have
      bits 6 and 7 unset in order to be used in Ethernet.
      
      Bits 24-30 of the flags are currently reserved.
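      
      The encoding described above, sketched (variable names are
      illustrative):
      
          u32 param1 = (num_qps & 0xffffff) | /* param1[23:0]: QP count */
                       (flags << 24);         /* param1[31:24]: flags;  */
                                              /* bit 31 = Eth BF QPs    */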
      
      When a function tries to allocate a QP, it states the required
      attributes for this QP.  Those attributes are considered "best-effort".
      If an attribute, such as an Ethernet BF-enabled QP, is a must-have, the
      function has to check that the attribute is supported before attempting
      the allocation.
      
      In a lower layer of the code, mlx4_qp_reserve_range masks out the bits
      which are unsupported. If SRIOV is used, the PF validates those
      attributes and masks out unsupported attributes as well. In order to
      learn which attributes are supported, the VF uses the QUERY_FUNC_CAP
      command; its mailbox is filled by the PF, which reports the QP
      allocation attributes it supports.
      Signed-off-by: Eugenia Emantayev <eugenia@mellanox.co.il>
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ddae0349
    • net/mlx4_core: Use tasklet for user-space CQ completion events · 3dca0f42
      Committed by Matan Barak
      Previously, we fired all our completion callbacks straight from our ISR.
      
      Some of those callbacks were lightweight (for example, mlx4_en's and
      IPoIB's napi callbacks), but some of them did more work (for example,
      the user-space RDMA stack uverbs' completion handler). Besides that,
      doing more than the minimal work in an ISR is generally considered
      wrong, and it can even lead to a hard lockup of the system: when the
      hardware generates a lot of completion events, the loop over those
      events can run so long that the system watchdog detects a hard lockup.
      
      To avoid that, add a new way of invoking completion-event callbacks:
      in the interrupt itself, we add the CQs that received a completion
      event to a per-EQ list and schedule a tasklet. In tasklet context we
      loop over all the CQs in the list and invoke the user callbacks, as
      sketched below.
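      
      A generic sketch of that ISR-to-tasklet hand-off (structure and field
      names are illustrative, not mlx4's actual ones):
      
          struct my_cq {
                  struct list_head tasklet_entry;
                  void (*comp)(struct my_cq *cq); /* user completion callback */
          };
      
          struct my_eq {
                  spinlock_t tasklet_lock;
                  struct list_head tasklet_list;
                  struct tasklet_struct tasklet;
          };
      
          static void eq_isr_queue_cq(struct my_eq *eq, struct my_cq *cq)
          {
                  spin_lock(&eq->tasklet_lock);   /* in hard-IRQ context */
                  list_add_tail(&cq->tasklet_entry, &eq->tasklet_list);
                  spin_unlock(&eq->tasklet_lock);
                  tasklet_schedule(&eq->tasklet); /* defer the real work */
          }
      
          static void eq_tasklet_fn(unsigned long data)
          {
                  struct my_eq *eq = (struct my_eq *)data;
                  struct my_cq *cq, *tmp;
                  LIST_HEAD(local);
      
                  spin_lock_irq(&eq->tasklet_lock);
                  list_splice_init(&eq->tasklet_list, &local);
                  spin_unlock_irq(&eq->tasklet_lock);
      
                  list_for_each_entry_safe(cq, tmp, &local, tasklet_entry) {
                          list_del(&cq->tasklet_entry);
                          cq->comp(cq);           /* runs outside the ISR */
                  }
          }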
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3dca0f42
  3. 11 December 2014 (24 commits)