1. 22 Nov 2014: 10 commits
  2. 20 Nov 2014: 6 commits
  3. 19 Nov 2014: 2 commits
    • bpf: allow eBPF programs to use maps · d0003ec0
      Committed by Alexei Starovoitov
      Expose the bpf_map_lookup_elem(), bpf_map_update_elem() and bpf_map_delete_elem()
      map accessors to eBPF programs.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add 'flags' attribute to BPF_MAP_UPDATE_ELEM command · 3274f520
      Committed by Alexei Starovoitov
      The current meaning of the BPF_MAP_UPDATE_ELEM syscall command is:
      either update an existing map element or create a new one.
      Initially the plan was to add a new command to handle the case of
      'create new element if it didn't exist', but the 'flags' style looks
      cleaner and the overall diff is much smaller (more code is reused), so add a
      'flags' attribute to the BPF_MAP_UPDATE_ELEM command with the following meaning:
       #define BPF_ANY	0 /* create new element or update existing */
       #define BPF_NOEXIST	1 /* create new element if it didn't exist */
       #define BPF_EXIST	2 /* update existing element */
      
      A bpf_update_elem(fd, key, value, BPF_NOEXIST) call can fail with EEXIST
      if the element already exists.
      
      A bpf_update_elem(fd, key, value, BPF_EXIST) call can fail with ENOENT
      if the element doesn't exist.
      
      Userspace will call it as:
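      #include <linux/bpf.h>
      #include <sys/syscall.h>
      #include <unistd.h>

      /* helpers assumed by this example; not part of the patch */
      static __u64 ptr_to_u64(void *ptr) { return (__u64)(unsigned long)ptr; }
      static int bpf(int cmd, union bpf_attr *attr, unsigned int size)
      {
          return syscall(__NR_bpf, cmd, attr, size);
      }

      int bpf_update_elem(int fd, void *key, void *value, __u64 flags)
      {
          union bpf_attr attr = {
              .map_fd = fd,
              .key = ptr_to_u64(key),
              .value = ptr_to_u64(value),
              .flags = flags, /* comma, not semicolon, inside an initializer */
          };

          return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
      }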
      
      The first two bits of 'flags' encode the style of the bpf_update_elem() command.
      Bits 2-63 are reserved for future use.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
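The three update modes can be pictured with a small userspace model. This is a sketch under stated assumptions, not the kernel's hash-map code: `struct toy_map`, `toy_find()` and `toy_update_elem()` are hypothetical, the map holds `u32` keys and `u64` values, and returning `-EINVAL` for reserved flag bits and `-E2BIG` for a full map are assumptions modelled on kernel conventions.

```c
#include <errno.h>
#include <stdint.h>

/* Flag values as listed in the commit message above. */
#define BPF_ANY     0 /* create new element or update existing */
#define BPF_NOEXIST 1 /* create new element if it didn't exist */
#define BPF_EXIST   2 /* update existing element */

/* Hypothetical toy map: a tiny fixed-size table of u32 keys -> u64 values. */
#define TOY_MAX_ENTRIES 4

struct toy_map {
	int      used[TOY_MAX_ENTRIES];
	uint32_t key[TOY_MAX_ENTRIES];
	uint64_t value[TOY_MAX_ENTRIES];
};

/* Returns the slot index holding `key`, or -1 if it is absent. */
static int toy_find(const struct toy_map *m, uint32_t key)
{
	for (int i = 0; i < TOY_MAX_ENTRIES; i++)
		if (m->used[i] && m->key[i] == key)
			return i;
	return -1;
}

/* Returns 0 on success or a negative errno, mirroring kernel conventions. */
static int toy_update_elem(struct toy_map *m, uint32_t key, uint64_t value,
			   uint64_t flags)
{
	int idx;

	if (flags > BPF_EXIST)          /* bits 2-63 are reserved */
		return -EINVAL;

	idx = toy_find(m, key);
	if (idx >= 0 && flags == BPF_NOEXIST)
		return -EEXIST;         /* element already exists */
	if (idx < 0 && flags == BPF_EXIST)
		return -ENOENT;         /* element doesn't exist */

	if (idx < 0) {
		/* create a new element in the first free slot */
		for (idx = 0; idx < TOY_MAX_ENTRIES && m->used[idx]; idx++)
			;
		if (idx == TOY_MAX_ENTRIES)
			return -E2BIG;  /* map is full (assumed error) */
		m->used[idx] = 1;
		m->key[idx] = key;
	}
	m->value[idx] = value;
	return 0;
}
```

With an empty map, a `BPF_EXIST` update of key 1 fails with `-ENOENT`, a `BPF_NOEXIST` update then succeeds, and a second `BPF_NOEXIST` update of the same key fails with `-EEXIST`.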
  4. 18 Nov 2014: 1 commit
  5. 17 Nov 2014: 3 commits
    • ieee802154: rename and move WPAN_NUM_ defines · cb41c8dd
      Committed by Alexander Aring
      This patch moves the 802.15.4 constraint defines WPAN_NUM_* into
      "net/ieee802154.h", which should contain all necessary 802.15.4-related
      information. It also renames these defines to the common names
      IEEE802154_MAX_CHANNEL and IEEE802154_MAX_PAGE.
      Signed-off-by: Alexander Aring <alex.aring@gmail.com>
      Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
    • mlx4: use netdev_rss_key_fill() helper · b9d1ab7e
      Committed by Eric Dumazet
      Using a well-known RSS key increases the attack surface.
      Switch to a random one, using the generic helper so that all
      ports share a common key.
      
      Also provide ethtool -x support to fetch the RSS key.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Amir Vadai <amirv@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
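A rough userspace analogue of the shared-key idea, assuming a 52-byte host key that is generated once and copied to every caller. `host_rss_key_fill()` is a hypothetical name, and `rand()` is only a placeholder; the kernel helper uses a proper random source. Every port that calls it gets the same bytes, which is what lets all mlx4 ports share one key.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* 52 bytes: the larger of the two common RSS key sizes (40 or 52 bytes). */
#define HOST_RSS_KEY_LEN 52

static uint8_t host_rss_key[HOST_RSS_KEY_LEN];
static int host_rss_key_ready;

/*
 * Hypothetical per-host RSS key helper: the key is generated lazily on the
 * first call, and every caller receives a copy of the same bytes.
 * Requests longer than the host key are clamped in this sketch.
 */
static void host_rss_key_fill(void *buf, size_t len)
{
	if (!host_rss_key_ready) {
		for (size_t i = 0; i < HOST_RSS_KEY_LEN; i++)
			host_rss_key[i] = (uint8_t)rand(); /* placeholder RNG */
		host_rss_key_ready = 1;
	}
	if (len > HOST_RSS_KEY_LEN)
		len = HOST_RSS_KEY_LEN;
	memcpy(buf, host_rss_key, len);
}
```

A driver asking for 40 bytes receives the first 40 bytes of the host key, consistent with the `Tested:` output of commit 960fb622 in this listing, where the `ethtool -x` key is a prefix of the `/proc/sys/net/core/netdev_rss_key` value.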
    • net: provide a per host RSS key generic infrastructure · 960fb622
      Committed by Eric Dumazet
      RSS (Receive Side Scaling) typically uses a Toeplitz hash and a 40- or
      52-byte RSS key.
      
      Some drivers use a constant (and well-known) key, while others use a random
      key per port, which makes bonding setups hard to tune. Well-known keys increase
      the attack surface, considering that the number of queues is usually a power of two.
      
      This patch provides infrastructure to help drivers do the right thing.
      
      netdev_rss_key_fill() should be used by drivers to initialize their RSS key,
      even if they provide ethtool -X support to let the user redefine the key later.
      
      A new /proc/sys/net/core/netdev_rss_key file can be used to get the host
      RSS key, even for drivers not providing ethtool -x support, in case some
      applications want to precisely set up flows to match some RX queues.
      
      Tested:
      
      myhost:~# cat /proc/sys/net/core/netdev_rss_key
      11:63:99:bb:79:fb:a5:a7:07:45:b2:20:bf:02:42:2d:08:1a:dd:19:2b:6b:23:ac:56:28:9d:70:c3:ac:e8:16:4b:b7:c1:10:53:a4:78:41:36:40:74:b6:15:ca:27:44:aa:b3:4d:72
      
      myhost:~# ethtool -x eth0
      RX flow hash indirection table for eth0 with 8 RX ring(s):
          0:      0     1     2     3     4     5     6     7
      RSS hash key:
      11:63:99:bb:79:fb:a5:a7:07:45:b2:20:bf:02:42:2d:08:1a:dd:19:2b:6b:23:ac:56:28:9d:70:c3:ac:e8:16:4b:b7:c1:10:53:a4:78:41
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
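To make the Toeplitz hash and the "power of two queues" remark concrete, here is a minimal software sketch. `toeplitz_hash()` and `rss_queue()` are hypothetical names; real NICs compute the hash in hardware, and the 128-entry indirection table filled round-robin (entry i maps to ring i % n_rings) is an assumption, since table sizes vary by driver. For every set input bit, most significant bit first, the hash XORs in the 32-bit key window starting at that bit position.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Toeplitz hash sketch. The key should be at least len + 4 bytes long
 * (a 40-byte key covers inputs of up to 36 bytes); missing key bytes are
 * treated as zero.
 */
static uint32_t toeplitz_hash(const uint8_t *key, size_t key_len,
			      const uint8_t *data, size_t len)
{
	uint32_t result = 0;
	/* window holds key bits [bit, bit + 32) for the current input bit */
	uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
			  ((uint32_t)key[2] << 8) | key[3];
	size_t next = 4; /* next key byte to shift into the window */

	for (size_t i = 0; i < len; i++) {
		uint8_t next_byte = next < key_len ? key[next++] : 0;
		for (int b = 7; b >= 0; b--) {
			if (data[i] & (1u << b))
				result ^= window;
			/* slide the window one key bit to the right */
			window = (window << 1) | ((next_byte >> b) & 1u);
		}
	}
	return result;
}

/* Hypothetical 128-entry indirection table filled round-robin. */
#define INDIR_SIZE 128

static unsigned int rss_queue(uint32_t hash, unsigned int n_rings)
{
	unsigned int entry = hash & (INDIR_SIZE - 1); /* low bits pick entry */
	return entry % n_rings;                       /* default fill */
}
```

Because the hash is linear in the input, anyone who knows the key can predict which low hash bits, and therefore which queue, a crafted flow lands on; a per-host random key removes that predictability.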
  6. 16 Nov 2014: 2 commits
  7. 15 Nov 2014: 2 commits
    • inetdevice: fixed signed integer overflow · 84bc8868
      Committed by Vincent BENAYOUN
      There could be a signed overflow in the code changed by this patch.
      
      The expression (32-logmask) lies between 0 and 31 inclusive, and it
      may be equal to 31. In that case the left shift produces a signed
      integer overflow, which is undefined behavior according to the C99
      standard. A simple fix is to replace the signed int 1 with the
      unsigned int 1U.
      Signed-off-by: Vincent BENAYOUN <vincent.benayoun@trust-in-soft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
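A host-byte-order sketch of the fixed expression, assuming it is the netmask-from-prefix-length computation in include/linux/inetdevice.h (the kernel helper additionally converts the result with htonl()); `make_mask_host()` is a hypothetical name. With logmask == 1 the shift count is 31, where `1 << 31` would overflow a signed int but `1U << 31` is well defined.

```c
#include <stdint.h>

/* Build an IPv4 netmask (host byte order) from a prefix length 0..32. */
static uint32_t make_mask_host(int logmask)
{
	/* logmask == 0 is special-cased: a shift by 32 would be undefined */
	if (logmask)
		return ~((1U << (32 - logmask)) - 1); /* 1U: no signed overflow */
	return 0;
}
```

For example, a /24 prefix yields 0xffffff00 and a /1 prefix (the shift-by-31 case) yields 0x80000000.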
    • Revert "fast_hash: avoid indirect function calls" · a77f9c5d
      Committed by Jay Vosburgh
      This reverts commit e5a2c899.
      
      	Commit e5a2c899 introduced an alternative_call, arch_fast_hash2,
      that selects between __jhash2 and __intel_crc4_2_hash based on the
      X86_FEATURE_XMM4_2.
      
      	Unfortunately, the alternative_call system does not appear to be
      suitable for use with C functions, as register usage is not handled
      properly for the called functions.  The __jhash2 function in particular
      clobbers registers that are not preserved when called via
      alternative_call, resulting in a panic for direct callers of
      arch_fast_hash2 on older CPUs lacking sse4_2.  It is possible that
      __intel_crc4_2_hash works merely by chance because it uses fewer
      registers.
      
      	This commit was suggested as the source of the problem by Jesse
      Gross <jesse@nicira.com>.
      Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
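The revert goes back to ordinary indirect function calls. A minimal sketch of that pattern, assuming a function pointer chosen once at start-up from a CPU-feature flag; `generic_hash2()`, `fast_hash2()`, `setup_fast_hash()` and `arch_fast_hash2_sketch()` are hypothetical placeholders, not the kernel's jhash2 or CRC32-based code.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t (*hash2_fn)(const uint32_t *data, uint32_t len, uint32_t seed);

/* Placeholder for the generic (jhash2-like) implementation. */
static uint32_t generic_hash2(const uint32_t *data, uint32_t len, uint32_t seed)
{
	uint32_t h = seed;
	for (uint32_t i = 0; i < len; i++)
		h = h * 31u + data[i];
	return h;
}

/* Placeholder for the SSE4.2-accelerated implementation. */
static uint32_t fast_hash2(const uint32_t *data, uint32_t len, uint32_t seed)
{
	uint32_t h = seed ^ 0x9e3779b9u;
	for (uint32_t i = 0; i < len; i++)
		h = (h ^ data[i]) * 0x01000193u;
	return h;
}

/* Selected once at start-up; generic version is the safe default. */
static hash2_fn arch_hash2 = generic_hash2;

static void setup_fast_hash(bool cpu_has_sse4_2)
{
	if (cpu_has_sse4_2)
		arch_hash2 = fast_hash2;
}

static uint32_t arch_fast_hash2_sketch(const uint32_t *data, uint32_t len,
				       uint32_t seed)
{
	/*
	 * An ordinary C call through a function pointer: the compiler assumes
	 * the callee may clobber any caller-saved register, as the ABI
	 * requires, so either implementation is safe to call here.
	 */
	return arch_hash2(data, len, seed);
}
```

The cost is one indirect branch per call; the commit message suggests that this is preferable to the register-usage problems seen with alternative_call.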
  8. 14 Nov 2014: 7 commits
    • mem-hotplug: reset node managed pages when hot-adding a new pgdat · f784a3f1
      Committed by Tang Chen
      In free_area_init_core(), zone->managed_pages is set to an approximate
      value for lowmem, and will be adjusted when the bootmem allocator frees
      pages into the buddy system.
      
      But free_area_init_core() is also called by hotadd_new_pgdat() when
      hot-adding memory.  As a result, zone->managed_pages of the newly added
      node's pgdat is set to an approximate value in the very beginning.
      
      Even if the memory on that node has not been onlined,
      /sys/devices/system/node/nodeXXX/meminfo shows a wrong value:
      
        hot-add node2 (memory not onlined)
        cat /sys/devices/system/node/node2/meminfo
        Node 2 MemTotal:       33554432 kB
        Node 2 MemFree:               0 kB
        Node 2 MemUsed:        33554432 kB
        Node 2 Active:                0 kB
      
      This patch fixes this problem by resetting the node's managed pages to 0
      after hot-adding a new node.
      
      1. Move reset_managed_pages_done from reset_node_managed_pages() to
         reset_all_zones_managed_pages()
      2. Make reset_node_managed_pages() non-static
      3. Call reset_node_managed_pages() in hotadd_new_pgdat() after pgdat
         is initialized
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: <stable@vger.kernel.org>	[3.16+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
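The three numbered steps above can be modelled in userspace to show why the "done" flag has to move. This is a heavily simplified sketch: `struct zone` and `struct pgdat` here are hypothetical stand-ins with only a `managed_pages` field, `MAX_NR_ZONES` is fixed at 4 for illustration, and `hotadd_new_pgdat_sketch()` only mimics the approximate initialisation followed by the reset.

```c
#include <stdbool.h>

#define MAX_NR_ZONES 4   /* assumption for the sketch */

/* Hypothetical, heavily simplified stand-ins for struct zone / pg_data_t. */
struct zone { unsigned long managed_pages; };
struct pgdat { struct zone node_zones[MAX_NR_ZONES]; };

static bool reset_managed_pages_done;

/* Step 2: non-static and unconditional, so memory hot-add can call it. */
void reset_node_managed_pages(struct pgdat *pgdat)
{
	for (struct zone *z = pgdat->node_zones;
	     z < pgdat->node_zones + MAX_NR_ZONES; z++)
		z->managed_pages = 0;
}

/* Step 1: the "done" check now guards only the boot-time reset. */
void reset_all_zones_managed_pages(struct pgdat *nodes, int nr_nodes)
{
	if (reset_managed_pages_done)
		return;
	for (int i = 0; i < nr_nodes; i++)
		reset_node_managed_pages(&nodes[i]);
	reset_managed_pages_done = true;
}

/* Step 3: the hot-add path resets the new node after initialising it. */
void hotadd_new_pgdat_sketch(struct pgdat *pgdat, unsigned long approx_pages)
{
	/* free_area_init_core() would leave an approximate value here */
	for (int i = 0; i < MAX_NR_ZONES; i++)
		pgdat->node_zones[i].managed_pages = approx_pages;
	reset_node_managed_pages(pgdat);
}
```

If the flag check stayed inside `reset_node_managed_pages()`, a call made after boot would return early and the hot-added node would keep its approximate value, which is the wrong MemTotal/MemUsed shown in the commit message.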
    • mm/page_alloc: fix incorrect isolation behavior by rechecking migratetype · ad53f92e
      Committed by Joonsoo Kim
      Before describing the bugs themselves, I first explain the definition of freepage.
      
       1. pages on buddy list are counted as freepage.
       2. pages on isolate migratetype buddy list are *not* counted as freepage.
       3. pages on cma buddy list are counted as CMA freepage, too.
      
      Now, I describe the problems and the related patches.
      
      Patch 1: There are race conditions in getting the pageblock migratetype
      that result in misplacement of freepages on the buddy list, an incorrect
      freepage count and unavailability of freepages.
      
      Patch 2: Freepages on the pcp list could have stale cached information
      used to determine which buddy list they go to.  This causes misplacement
      of freepages on the buddy list and an incorrect freepage count.
      
      Patch 4: Merging between freepages on pageblocks of different
      migratetypes will cause freepage accounting problems.  This patch fixes it.
      
      Without patchset [3], the above problems don't happen in my CMA
      allocation test, because CMA reserved pages aren't used at all, so there
      is no chance for the above races.
      
      With patchset [3], I did a simple CMA allocation test and got the
      result below:
      
       - Virtual machine, 4 cpus, 1024 MB memory, 256 MB CMA reservation
       - run kernel build (make -j16) in the background
       - 30 CMA allocation attempts (8MB * 30 = 240MB) at 5 sec intervals
       - Result: more than 5000 freepages are missing from the count
      
      With patchset [3] and this patchset, I found that no freepage count is
      missed, so I conclude that the problems are solved.
      
      These problems also occur in my simple memory offlining test.
      
      This patch (of 4):
      
      There are two paths to reach the core free function of the buddy
      allocator, __free_one_page(): one is free_one_page()->__free_one_page()
      and the other is free_hot_cold_page()->free_pcppages_bulk()->__free_one_page().
      Each path has a race condition causing serious problems.  This patch
      focuses on the first type of freepath; a following patch will solve the
      problem in the second type of freepath.
      
      In the first type of freepath, we get the migratetype of the freeing
      page without holding the zone lock, so it could be racy.  There are two
      cases of this race.
      
       1. pages are added to the isolate buddy list after the original
          migratetype is restored
      
          CPU1                                   CPU2
      
          get migratetype => return MIGRATE_ISOLATE
          call free_one_page() with MIGRATE_ISOLATE
      
                                      grab the zone lock
                                      unisolate pageblock
                                      release the zone lock
      
          grab the zone lock
          call __free_one_page() with MIGRATE_ISOLATE
          freepage go into isolate buddy list,
          although pageblock is already unisolated
      
      This may cause two problems.  One is that we can't use this page anymore
      until the next isolation attempt of this pageblock, because the freepage
      is on the isolate buddy list.  The other is that freepage accounting could
      be wrong due to merging between different buddy lists.  Freepages on the
      isolate buddy list aren't counted as freepages, but ones on the normal
      buddy list are.  If a merge happens, a buddy freepage on the normal buddy
      list is inevitably moved to the isolate buddy list without any
      consideration of freepage accounting, so the count could be incorrect.
      
       2. pages are added to the normal buddy list while the pageblock is
          isolated.  This is similar to the case above.
      
      This also may cause two problems.  One is that we can't keep these
      freepages from being allocated.  Although this pageblock is isolated,
      the freepage would be added to the normal buddy list, so it could be
      allocated without any restriction.  The other problem is the same as in
      case 1, that is, incorrect freepage accounting.
      
      This race condition can be prevented by checking the migratetype again
      while holding the zone lock.  Because this is a somewhat heavy operation
      and isn't needed in the common case, we want to avoid rechecking as much
      as possible.  So this patch introduces a new variable, nr_isolate_pageblock,
      in struct zone to check whether there is an isolated pageblock.  With this,
      we can avoid rechecking the migratetype in the common case and do it only
      if there is an isolated pageblock or the migratetype is MIGRATE_ISOLATE.
      This solves the above-mentioned problems.
      
      Changes from v3:
      Add one more check in free_one_page() that checks whether the migratetype
      is MIGRATE_ISOLATE. Without this, the above-mentioned case 1 could happen.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Heesub Shin <heesub.shin@samsung.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Ritesh Harjani <ritesh.list@gmail.com>
      Cc: Gioh Kim <gioh.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
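The recheck rule in the last paragraph can be sketched as a single-threaded userspace model. Everything here is hypothetical and heavily simplified: one pageblock per zone, two migratetypes, counters instead of real buddy lists, and comments marking where the kernel would hold the zone lock. The test replays the two interleavings from the commit message.

```c
#include <stdbool.h>

enum migratetype { MIGRATE_MOVABLE, MIGRATE_ISOLATE, NR_MT };

/* Hypothetical single-pageblock zone; the real struct zone is far richer. */
struct zone_sketch {
	int nr_isolate_pageblock;      /* counter added by the patch */
	enum migratetype block_mt;     /* authoritative pageblock migratetype */
	unsigned long nr_free[NR_MT];  /* pages on each buddy list */
	unsigned long freepage_count;  /* isolated pages are not counted */
};

static void set_isolated(struct zone_sketch *z, bool isolate)
{
	/* spin_lock(&zone->lock) in the kernel */
	if (isolate && z->block_mt != MIGRATE_ISOLATE) {
		z->block_mt = MIGRATE_ISOLATE;
		z->nr_isolate_pageblock++;
	} else if (!isolate && z->block_mt == MIGRATE_ISOLATE) {
		z->block_mt = MIGRATE_MOVABLE;
		z->nr_isolate_pageblock--;
	}
	/* spin_unlock(&zone->lock) */
}

/* `mt` was read earlier WITHOUT the zone lock, so it may be stale. */
static void free_one_page_sketch(struct zone_sketch *z, enum migratetype mt)
{
	/* spin_lock(&zone->lock) */
	if (z->nr_isolate_pageblock > 0 || mt == MIGRATE_ISOLATE)
		mt = z->block_mt;           /* recheck under the lock */
	z->nr_free[mt]++;                   /* stands in for __free_one_page() */
	if (mt != MIGRATE_ISOLATE)
		z->freepage_count++;
	/* spin_unlock(&zone->lock) */
}
```

In the common case (no isolated pageblock and a non-isolate migratetype) the recheck is skipped entirely, which is the cost saving the commit message describes.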
    • rhashtable: Drop gfp_flags arg in insert/remove functions · 6eba8224
      Committed by Thomas Graf
      Reallocation is only required for shrinking and expanding, both of which
      rely on a mutex for synchronization, and callers of rhashtable_init() are
      in non-atomic context. Therefore, there is no reason to continue passing
      allocation hints through the API.
      
      Instead, use GFP_KERNEL and add __GFP_NOWARN | __GFP_NORETRY to allow a
      silent fallback to vzalloc() without the OOM killer jumping in, as
      pointed out by Eric Dumazet and Eric W. Biederman.
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_core: Support more than 64 VFs · de966c59
      Committed by Matan Barak
      We now allow up to 126 VFs. Note, though, that certain firmware
      versions only allow up to 80 VFs, and old HCAs only support 64 VFs.
      In these cases, we limit the maximum number of VFs to 64.
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_core: Flexible (asymmetric) allocation of EQs and MSI-X vectors for PF/VFs · 7ae0e400
      Committed by Matan Barak
      Previously, the driver queried the firmware to get the number of
      supported EQs. Under SRIOV, since this was done before the driver
      notified the firmware how many VFs it actually needs, the firmware had
      to assume a worst-case scenario and always allocated four EQs per VF,
      where one was used for events and the others for completions.
      
      Now, when the firmware supports the asymmetric allocation scheme, denoted
      by exposing num_sys_eqs > 0 (--> MLX4_DEV_CAP_FLAG2_SYS_EQS), we use the
      QUERY_FUNC command to query the firmware before enabling SRIOV. Thus we
      can get more EQs and MSI-X vectors per function.
      
      Moreover, when running in the new firmware/driver mode, the restriction
      that the number of EQs must be a power of two is lifted.
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rhashtable: Add parent argument to mutex_is_held · 7b4ce235
      Committed by Herbert Xu
      Currently mutex_is_held can only test locks that are global, since it
      takes no arguments.  This prevents rhashtable from being used in places
      where locks are local, e.g., per-namespace locks.
      
      This patch adds a parent field to mutex_is_held and rhashtable_params
      so that local locks can be used (and tested).
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
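The effect of the parent argument can be shown with a small sketch. The structure and function names here (`struct rht_params_sketch`, `struct netns_sketch`, `table_lock_is_held()`) are hypothetical, and a plain `bool` stands in for a real mutex; the point is only that the callback receives the owning object, so each namespace can answer for its own lock.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified sketch of the parameter shape described above. */
struct rht_params_sketch {
	void *parent;                          /* e.g. the owning namespace */
	int (*mutex_is_held)(void *parent);    /* lock-assertion callback */
};

/* Hypothetical per-namespace object owning its own lock. */
struct netns_sketch {
	bool lock_held;                        /* stands in for a real mutex */
	struct rht_params_sketch params;
};

static int netns_mutex_is_held(void *parent)
{
	struct netns_sketch *ns = parent;      /* recover the owner */
	return ns->lock_held;
}

static void netns_init(struct netns_sketch *ns)
{
	ns->lock_held = false;
	ns->params.parent = ns;
	ns->params.mutex_is_held = netns_mutex_is_held;
}

/* What table code would check before modifying buckets. */
static int table_lock_is_held(const struct rht_params_sketch *p)
{
	return p->mutex_is_held ? p->mutex_is_held(p->parent) : 1;
}
```

Without the parent pointer, a single argument-less callback could only consult one global lock, which is exactly the limitation the commit removes.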
    • rhashtable: Move mutex_is_held under PROVE_LOCKING · 1b2f309d
      Committed by Herbert Xu
      The rhashtable function mutex_is_held is only used when PROVE_LOCKING
      is enabled.  This patch makes the mutex_is_held field in rhashtable
      optional depending on PROVE_LOCKING.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 13 Nov 2014: 3 commits
    • mac802154: add interframe spacing time handling · 61f2dcba
      Committed by Alexander Aring
      This patch adds interframe spacing time handling to the mac802154
      layer. The interframe spacing time is the period between consecutive
      transmissions. This patch adds a high resolution timer to mac802154,
      which is started on xmit complete with the corresponding interframe
      spacing expiry time if ifs_handling is true. This is configurable
      because interframe spacing may be handled either by the transceiver or
      by mac802154. In the timer callback we wake the netdev queue again,
      which prevents a new frame from being transmitted within the
      interframe spacing time.
      
      For synchronous drivers we add no interframe spacing handling. All
      synchronous xmit drivers currently lack this support; I suppose it works
      anyway because of the latency of the workqueue that is needed to call
      spi_sync.
      Signed-off-by: Alexander Aring <alex.aring@gmail.com>
      Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
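The "corresponding interframe spacing expire time" is not spelled out above. A sketch of one plausible computation, assuming the IEEE 802.15.4 rule that frames up to aMaxSIFSFrameSize octets are followed by a short IFS and longer frames by a long IFS, with constants for the 2.4 GHz O-QPSK PHY (12 and 40 symbols, 16 us per symbol). `ifs_delay_ns()` is a hypothetical helper; check the standard and your PHY before relying on these values.

```c
#include <stdint.h>

/* Assumed IEEE 802.15.4 constants for the 2.4 GHz O-QPSK PHY. */
#define MAX_SIFS_FRAME_SIZE 18u     /* octets */
#define MIN_SIFS_SYMBOLS    12u     /* short interframe spacing */
#define MIN_LIFS_SYMBOLS    40u     /* long interframe spacing */
#define SYMBOL_DURATION_NS  16000u  /* 16 us per symbol */

/* Delay (ns) for a timer armed on xmit complete before waking the queue. */
static uint64_t ifs_delay_ns(unsigned int frame_len)
{
	unsigned int symbols = frame_len > MAX_SIFS_FRAME_SIZE ?
			       MIN_LIFS_SYMBOLS : MIN_SIFS_SYMBOLS;
	return (uint64_t)symbols * SYMBOL_DURATION_NS;
}
```

Under these assumptions a short frame (18 octets or less) waits 192 us and a longer frame waits 640 us; in mac802154 the value would arm the high resolution timer whose callback wakes the netdev queue.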
    • nfs: fix pnfs direct write memory leak · 8c393f9a
      Committed by Peng Tao
      For pNFS direct writes, the layout driver may dynamically allocate
      ds_cinfo.buckets, so we need to take care to free them when freeing dreq.
      
      Ideally this would be done inside the layout driver, where ds_cinfo.buckets
      are allocated. But the buckets are attached to dreq and reused across LD IO
      iterations, so I feel it's OK to free them in the generic layer.
      
      Cc: stable@vger.kernel.org [v3.4+]
      Signed-off-by: Peng Tao <tao.peng@primarydata.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
    • net: phy: add module_phy_driver macro · c31accd1
      Committed by Johan Hovold
      Add a helper macro for PHY drivers which do not do anything special in
      module init/exit. This will allow us to eliminate a lot of boilerplate
      code.
      Signed-off-by: Johan Hovold <johan@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
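A userspace sketch of the boilerplate such a macro removes. The real macro uses the kernel's module_init()/module_exit() and phy_drivers_register()/phy_drivers_unregister(); here those are mocked (the mocks, `mock_module_init`, `mock_module_exit`, `registered` and the demo driver table are all hypothetical) so the macro's shape can be compiled and exercised outside the kernel.

```c
#include <stddef.h>

/* --- Userspace mocks (hypothetical) for the kernel pieces involved --- */
#define __init
#define __exit
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))
#define module_init(fn) int (*const mock_module_init)(void) = fn
#define module_exit(fn) void (*const mock_module_exit)(void) = fn

struct phy_driver { const char *name; };

static int registered; /* number of currently registered drivers */

static int phy_drivers_register(struct phy_driver *drv, int n)
{
	(void)drv;
	registered += n;
	return 0;
}

static void phy_drivers_unregister(struct phy_driver *drv, int n)
{
	(void)drv;
	registered -= n;
}

/* --- Helper macro in the spirit of the one this commit adds --- */
#define module_phy_driver(__phy_drivers)				\
static int __init phy_module_init(void)					\
{									\
	return phy_drivers_register(__phy_drivers,			\
				    ARRAY_SIZE(__phy_drivers));		\
}									\
module_init(phy_module_init);						\
static void __exit phy_module_exit(void)				\
{									\
	phy_drivers_unregister(__phy_drivers,				\
			       ARRAY_SIZE(__phy_drivers));		\
}									\
module_exit(phy_module_exit)

/* --- A driver now needs only its table plus one line --- */
static struct phy_driver demo_drivers[] = {
	{ .name = "Demo PHY A" },
	{ .name = "Demo PHY B" },
};

module_phy_driver(demo_drivers);
```

Each PHY driver that previously spelled out its own init/exit pair can collapse them into the single `module_phy_driver(...)` line.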
  10. 12 Nov 2014: 4 commits
    • net: Remove __skb_alloc_page and __skb_alloc_pages · 160d2aba
      Committed by Alexander Duyck
      Remove the two functions which are now dead code.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Add device Rx page allocation function · 71dfda58
      Committed by Alexander Duyck
      This patch implements __dev_alloc_pages and __dev_alloc_page.  These are
      meant to replace the __skb_alloc_pages and __skb_alloc_page functions.  The
      reason for doing this is that it occurred to me that __skb_alloc_page is
      supposed to be passed an sk_buff pointer, but it is NULL in all cases where
      it is used.  Worse, in the case of ixgbe it is passed NULL via the
      sk_buff pointer in the rx_buffer info structure, which means the compiler is
      not correctly stripping it out.
      
      The naming for these functions is based on dev_alloc_skb and __dev_alloc_skb.
      There was originally a netdev_alloc_page; however, that was passed a
      net_device pointer and this function is not, so I thought it best to follow
      that naming scheme, since that is the same difference as between
      dev_alloc_skb and netdev_alloc_skb.
      
      For anything greater than order 0 it is assumed that we want a compound
      page, so __GFP_COMP is set for all allocations, as we expect a compound
      page when assigning a page frag.
      
      The other change in this patch is to exploit the behaviors of the page
      allocator in how it handles flags.  For example, we can always set
      __GFP_COMP and __GFP_MEMALLOC, since they are ignored if they are not
      applicable or are overridden by another flag.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
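Why "always set the flags" is safe can be modelled with a few bit operations. This is a sketch under stated assumptions: the `SK_GFP_*` bit values are hypothetical (real gfp_t values differ), and the two override rules modelled here, that order-0 pages are never compound and that `__GFP_NOMEMALLOC` overrides `__GFP_MEMALLOC`, are our reading of "ignored if they are not applicable or are overridden by another flag".

```c
/* Hypothetical flag bits for the sketch; real gfp_t values differ. */
#define SK_GFP_COMP        0x1u
#define SK_GFP_MEMALLOC    0x2u
#define SK_GFP_NOMEMALLOC  0x4u

/* Caller side: always OR in the same flags, whatever the order. */
static unsigned int dev_alloc_flags(unsigned int gfp)
{
	return gfp | SK_GFP_COMP | SK_GFP_MEMALLOC;
}

/* Allocator side (assumed rules): how the flags end up being honoured. */
static unsigned int effective_flags(unsigned int gfp, unsigned int order)
{
	if (order == 0)
		gfp &= ~SK_GFP_COMP;        /* order-0 pages are not compound */
	if (gfp & SK_GFP_NOMEMALLOC)
		gfp &= ~SK_GFP_MEMALLOC;    /* NOMEMALLOC wins over MEMALLOC */
	return gfp;
}
```

The design choice is that the caller never branches on order or on the caller's flags; the allocator already discards what does not apply.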
    • ieee820154: add short_addr setting support · 9830c62a
      Committed by Alexander Aring
      This patch adds support for setting the short address via the nl802154
      framework. It also adds a comment, because 0xffff seems to be a valid
      setting meaning that we don't have a short address. This is a valid
      setting, but we need more checks in upper layers to disallow this
      address as a source address. Also, the current netlink interface doesn't
      allow setting short_addr to 0xffff. The same applies to the 0xfffe short
      address, which describes an unallocated short address.
      Signed-off-by: Alexander Aring <alex.aring@gmail.com>
      Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
    • ieee820154: add pan_id setting support · 702bf371
      Committed by Alexander Aring
      This patch adds support for setting pan_id via the nl802154 framework.
      It also adds a comment, because setting 0xffff as pan_id seems to be a
      valid setting, while pan_id 0xffff as a source PAN is invalid. I am not
      sure about this setting yet, but the current netlink interface treats it
      as invalid, so we do the same for now. Maybe we need to change that
      when we have coordinator support and association support.
      Signed-off-by: Alexander Aring <alex.aring@gmail.com>
      Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
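The two nl802154 commits above reject the same special values the existing netlink interface rejects. A minimal sketch of those checks; `check_short_addr()` and `check_pan_id()` are hypothetical helpers, and returning `-EINVAL` is an assumption based on common kernel practice rather than something the commit messages state.

```c
#include <errno.h>
#include <stdint.h>

/* Reject 0xffff (no short address) and 0xfffe (not allocated). */
static int check_short_addr(uint16_t short_addr)
{
	if (short_addr == 0xffff || short_addr == 0xfffe)
		return -EINVAL;
	return 0;
}

/* Reject pan_id 0xffff, matching the current netlink interface. */
static int check_pan_id(uint16_t pan_id)
{
	if (pan_id == 0xffff)
		return -EINVAL;
	return 0;
}
```

As both commit messages note, these restrictions may need revisiting once coordinator and association support exist.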