1. 01 10月, 2012 8 次提交
    • J
      mlx4: Paravirtualize Node Guids for slaves · afa8fd1d
      Jack Morgenstein 提交于
      This is necessary in order to support > 1 VF/PF in a VM for software
      that uses the node guid as a discriminator, such as librdmacm.
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      afa8fd1d
    • J
      mlx4: Activate SR-IOV mode for IB · 026149cb
      Jack Morgenstein 提交于
      Remove the error returns for IB ports from mlx4_ib_add,
      mlx4_INIT_PORT_wrapper, and mlx4_CLOSE_PORT_wrapper.
      
      Currently, SRIOV is supported only for devices for which the
      link layer is IB on all ports; RoCE support will be added later.
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      026149cb
    • J
      net/mlx4_core: Adjustments to SET_PORT for IB SR-IOV · efcd235d
      Jack Morgenstein 提交于
      1. Slaves may not set the IS_SM capability for the port.
      2. DEV_MGMT may not be set in multifunction mode.
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      efcd235d
    • J
      mlx4_core: Add IB port-state machine and port mgmt event propagation · 993c401e
      Jack Morgenstein 提交于
      For an IB port, a slave should not show port active until that slave
      has a valid alias-guid (provided by the subnet manager).  Therefore
      the port-up event should be passed to a slave only after both the port
      is up, and the slave's alias-guid has been set.
      
      Also, provide the infrastructure for propagating port-management
      events (client-reregister, etc) to slaves.
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      993c401e
    • J
      mlx4: Implement QP paravirtualization and maintain phys_pkey_cache for smp_snoop · 54679e14
      Jack Morgenstein 提交于
      This requires:
      
      1. Replacing the paravirtualized P_Key index (inserted by the guest)
         with the real P_Key index.
      
      2. For UD QPs, placing the guest's true source GID index in the
         address path structure mgid field, and setting the ud_force_mgid
         bit so that the mgid is taken from the QP context and not from the
         WQE when posting sends.
      
      3. For UC and RC QPs, placing the guest's true source GID index in the
         address path structure mgid field.
      
      4. For tunnel and proxy QPs, setting the Q_Key value reserved for that
         proxy/tunnel pair.
      
      Since not all the above adjustments occur in all the QP transitions,
      the QP transitions require separate wrapper functions.
      
      Secondly, initialize the P_Key virtualization table to its default
      values: Master virtualized table is 1-1 with the real P_Key table,
      guest virtualized table has P_Key index 0 mapped to the real P_Key
      index 0, and all the other P_Key indices mapped to the reserved
      (invalid) P_Key at index 127.
      
      Finally, add logic in smp_snoop for maintaining the phys_P_Key_cache.
      and generating events on the master only if a P_Key actually changed.
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      54679e14
    • J
      IB/mlx4: Initialize SR-IOV IB support for slaves in master context · fc06573d
      Jack Morgenstein 提交于
      Allocate SR-IOV paravirtualization resources and MAD demuxing contexts
      on the master.
      
      This has two parts.  The first part is to initialize the structures to
      contain the contexts.  This is done at master startup time in
      mlx4_ib_init_sriov().
      
      The second part is to actually create the tunneling resources required
      on the master to support a slave.  This is performed the master
      detects that a slave has started up (MLX4_DEV_EVENT_SLAVE_INIT event
      generated when a slave initializes its comm channel).
      
      For the master, there is no such startup event, so it creates its own
      tunneling resources when it starts up.  In addition, the master also
      creates the real special QPs.  The ib_core layer on the master causes
      creation of proxy special QPs, since the master is also
      paravirtualized at the ib_core layer.
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      fc06573d
    • J
      mlx4_core: Add proxy and tunnel QPs to the reserved QP area · e2c76824
      Jack Morgenstein 提交于
      In addition, pass the proxy and tunnel QP numbers to slaves so the
      driver can perform special QP paravirtualization.
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      e2c76824
    • J
      IB/mlx4: SR-IOV IB context objects and proxy/tunnel SQP support · 1ffeb2eb
      Jack Morgenstein 提交于
      1. Introduce the basic SR-IOV parvirtualization context objects for
         multiplexing and demultiplexing MADs.
      2. Introduce support for the new proxy and tunnel QP types.
      
      This patch introduces the objects required by the master for managing
      QP paravirtualization for guests.
      
      struct mlx4_ib_sriov is created by the master only.
      It is a container for the following:
      
      1. All the info required by the PPF to multiplex and de-multiplex MADs
         (including those from the PF). (struct mlx4_ib_demux_ctx demux)
      2. All the info required to manage alias GUIDs (i.e., the GUID at
         index 0 that each guest perceives.  In fact, this is not the GUID
         which is actually at index 0, but is, in fact, the GUID which is at
         index[<VF number>] in the physical table.
      3. structures which are used to manage CM paravirtualization
      4. structures for managing the real special QPs when running in SR-IOV
         mode.  The real SQPs are controlled by the PPF in this case.  All
         SQPs created and controlled by the ib core layer are proxy SQP.
      
      struct mlx4_ib_demux_ctx contains the information per port needed
      to manage paravirtualization:
      
      1. All multicast paravirt info
      2. All tunnel-qp paravirt info for the port.
      3. GUID-table and GUID-prefix for the port
      4. work queues.
      
      struct mlx4_ib_demux_pv_ctx contains all the info for managing the
      paravirtualized QPs for one slave/port.
      
      struct mlx4_ib_demux_pv_qp contains the info need to run an individual
      QP (either tunnel qp or real SQP).
      
      Note:  We made use of the 2 most significant bits in enum
      mlx4_ib_qp_flags (based on enum ib_qp_create_flags in ib_verbs.h).
      We need these bits in the low-level driver for internal purposes.
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      1ffeb2eb
  2. 21 9月, 2012 1 次提交
    • M
      xfrm_user: ensure user supplied esn replay window is valid · ecd79187
      Mathias Krause 提交于
      The current code fails to ensure that the netlink message actually
      contains as many bytes as the header indicates. If a user creates a new
      state or updates an existing one but does not supply the bytes for the
      whole ESN replay window, the kernel copies random heap bytes into the
      replay bitmap, the ones happen to follow the XFRMA_REPLAY_ESN_VAL
      netlink attribute. This leads to following issues:
      
      1. The replay window has random bits set confusing the replay handling
         code later on.
      
      2. A malicious user could use this flaw to leak up to ~3.5kB of heap
         memory when she has access to the XFRM netlink interface (requires
         CAP_NET_ADMIN).
      
      Known users of the ESN replay window are strongSwan and Steffen's
      iproute2 patch (<http://patchwork.ozlabs.org/patch/85962/>). The latter
      uses the interface with a bitmap supplied while the former does not.
      strongSwan is therefore prone to run into issue 1.
      
      To fix both issues without breaking existing userland allow using the
      XFRMA_REPLAY_ESN_VAL netlink attribute with either an empty bitmap or a
      fully specified one. For the former case we initialize the in-kernel
      bitmap with zero, for the latter we copy the user supplied bitmap. For
      state updates the full bitmap must be supplied.
      
      To prevent overflows in the bitmap length calculation the maximum size
      of bmp_len is limited to 128 by this patch -- resulting in a maximum
      replay window of 4096 packets. This should be sufficient for all real
      life scenarios (RFC 4303 recommends a default replay window size of 64).
      
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Martin Willi <martin@revosec.ch>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Signed-off-by: NMathias Krause <minipli@googlemail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ecd79187
  3. 19 9月, 2012 2 次提交
    • G
      linux/kernel.h: Fix warning seen with W=1 due to change in DIV_ROUND_CLOSEST · 263a523d
      Guenter Roeck 提交于
      After commit b6d86d3d (Fix DIV_ROUND_CLOSEST to support negative dividends),
      the following warning is seen if the kernel is compiled with W=1 (-Wextra):
      
      warning: comparison of unsigned expression >= 0 is always true
      
      The warning is due to the test '((typeof(x))-1) >= 0', which is used to detect
      if the variable type is unsigned. Research on the web suggests that the warning
      disappears if '>' instead of '>=' is used for the comparison.
      
      Tests after changing the macro along that line show that the warning is gone,
      and that the result is still correct:
      
      i=-4: DIV_ROUND_CLOSEST(i, 2)=-2
      i=-3: DIV_ROUND_CLOSEST(i, 2)=-2
      i=-2: DIV_ROUND_CLOSEST(i, 2)=-1
      i=-1: DIV_ROUND_CLOSEST(i, 2)=-1
      i=0: DIV_ROUND_CLOSEST(i, 2)=0
      i=1: DIV_ROUND_CLOSEST(i, 2)=1
      i=2: DIV_ROUND_CLOSEST(i, 2)=1
      i=3: DIV_ROUND_CLOSEST(i, 2)=2
      i=4: DIV_ROUND_CLOSEST(i, 2)=2
      
      Code size is the same as before.
      Signed-off-by: NGuenter Roeck <linux@roeck-us.net>
      Tested-by: NMauro Carvalho Chehab <mchehab@redhat.com>
      Acked-by: NJean Delvare <khali@linux-fr.org>
      263a523d
    • M
      vfs: dcache: use DCACHE_DENTRY_KILLED instead of DCACHE_DISCONNECTED in d_kill() · b161dfa6
      Miklos Szeredi 提交于
      IBM reported a soft lockup after applying the fix for the rename_lock
      deadlock.  Commit c83ce989 ("VFS: Fix the nfs sillyrename regression
      in kernel 2.6.38") was found to be the culprit.
      
      The nfs sillyrename fix used DCACHE_DISCONNECTED to indicate that the
      dentry was killed.  This flag can be set on non-killed dentries too,
      which results in infinite retries when trying to traverse the dentry
      tree.
      
      This patch introduces a separate flag: DCACHE_DENTRY_KILLED, which is
      only set in d_kill() and makes try_to_ascend() test only this flag.
      
      IBM reported successful test results with this patch.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b161dfa6
  4. 18 9月, 2012 2 次提交
    • A
      compiler.h: add __visible · 9a858dc7
      Andi Kleen 提交于
      gcc 4.6+ has support for a externally_visible attribute that prevents the
      optimizer from optimizing unused symbols away.  Add a __visible macro to
      use it with that compiler version or later.
      
      This is used (at least) by the "Link Time Optimization" patchset.
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9a858dc7
    • J
      mm/ia64: fix a memory block size bug · 05cf9639
      Jianguo Wu 提交于
      I found following definition in include/linux/memory.h, in my IA64
      platform, SECTION_SIZE_BITS is equal to 32, and MIN_MEMORY_BLOCK_SIZE
      will be 0.
      
        #define MIN_MEMORY_BLOCK_SIZE     (1 << SECTION_SIZE_BITS)
      
      Because MIN_MEMORY_BLOCK_SIZE is int type and length of 32bits,
      so MIN_MEMORY_BLOCK_SIZE(1 << 32) will will equal to 0.
      Actually when SECTION_SIZE_BITS >= 31, MIN_MEMORY_BLOCK_SIZE will be wrong.
      This will cause wrong system memory infomation in sysfs.
      I think it should be:
      
        #define MIN_MEMORY_BLOCK_SIZE     (1UL << SECTION_SIZE_BITS)
      
      And "echo offline > memory0/state" will cause following call trace:
      
        kernel BUG at mm/memory_hotplug.c:885!
        sh[6455]: bugcheck! 0 [1]
        Pid: 6455, CPU 0, comm:                   sh
        psr : 0000101008526030 ifs : 8000000000000fa4 ip  : [<a0000001008c40f0>]    Not tainted (3.6.0-rc1)
        ip is at offline_pages+0x210/0xee0
        Call Trace:
          show_stack+0x80/0xa0
          show_regs+0x640/0x920
          die+0x190/0x2c0
          die_if_kernel+0x50/0x80
          ia64_bad_break+0x3d0/0x6e0
          ia64_native_leave_kernel+0x0/0x270
          offline_pages+0x210/0xee0
          alloc_pages_current+0x180/0x2a0
      Signed-off-by: NJianguo Wu <wujianguo@huawei.com>
      Signed-off-by: NJiang Liu <jiang.liu@huawei.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      05cf9639
  5. 17 9月, 2012 1 次提交
  6. 16 9月, 2012 1 次提交
    • M
      mfd: core: Push irqdomain mapping out into devices · 0848c94f
      Mark Brown 提交于
      Currently the MFD core supports remapping MFD cell interrupts using an
      irqdomain but only if the MFD is being instantiated using device tree
      and only if the device tree bindings use the pattern of registering IPs
      in the device tree with compatible properties.  This will be actively
      harmful for drivers which support non-DT platforms and use this pattern
      for their DT bindings as it will mean that the core will silently change
      remapping behaviour and it is also limiting for drivers which don't do
      DT with this particular pattern.  There is also a potential fragility if
      there are interrupts not associated with MFD cells and all the cells are
      omitted from the device tree for some reason.
      
      Instead change the code to take an IRQ domain as an optional argument,
      allowing drivers to take the decision about the parent domain for their
      interrupts.  The one current user of this feature is ab8500-core, it has
      the domain lookup pushed out into the driver.
      Signed-off-by: NMark Brown <broonie@opensource.wolfsonmicro.com>
      Signed-off-by: NSamuel Ortiz <sameo@linux.intel.com>
      0848c94f
  7. 14 9月, 2012 1 次提交
  8. 13 9月, 2012 1 次提交
  9. 12 9月, 2012 1 次提交
    • R
      i2c: pnx: Fix read transactions of >= 2 bytes · c076ada4
      Roland Stigge 提交于
      On transactions with n>=2 bytes, the controller actually wrongly clocks in n+1
      bytes. This is caused by the (wrong) assumption that RFE in the Status Register
      is 1 iff there is no byte already ordered (via a dummy TX byte). This lead to
      the implementation of synchronized byte ordering, e.g.:
      
      Dummy-TX - RX - Dummy-TX - RX - ...
      
      But since RFE actually stays high after some Dummy-TX, it rather looks like:
      
      Dummy-TX - Dummy-TX - RX - Dummy-TX - RX - (RX)
      
      The last RX byte is clocked in by the bus controller, but ignored by the kernel
      when filling the userspace buffer.
      
      This patch fixes the issue by asking for RX via Dummy-TX asynchronously.
      Introducing a separate counter for TX bytes.
      Signed-off-by: NRoland Stigge <stigge@antcom.de>
      Signed-off-by: NWolfram Sang <w.sang@pengutronix.de>
      c076ada4
  10. 08 9月, 2012 2 次提交
  11. 07 9月, 2012 2 次提交
    • T
      SUNRPC: Fix a UDP transport regression · f39c1bfb
      Trond Myklebust 提交于
      Commit 43cedbf0 (SUNRPC: Ensure that
      we grab the XPRT_LOCK before calling xprt_alloc_slot) is causing
      hangs in the case of NFS over UDP mounts.
      
      Since neither the UDP or the RDMA transport mechanism use dynamic slot
      allocation, we can skip grabbing the socket lock for those transports.
      Add a new rpc_xprt_op to allow switching between the TCP and UDP/RDMA
      case.
      
      Note that the NFSv4.1 back channel assigns the slot directly
      through rpc_run_bc_task, so we can ignore that case.
      Reported-by: NDick Streefland <dick.streefland@altium.nl>
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      Cc: stable@vger.kernel.org [>= 3.1]
      f39c1bfb
    • B
      kobject: fix oops with "input0: bad kobj_uevent_env content in show_uevent()" · 60e233a5
      Bjørn Mork 提交于
      Fengguang Wu <fengguang.wu@intel.com> writes:
      
      > After the __devinit* removal series, I can still get kernel panic in
      > show_uevent(). So there are more sources of bug..
      >
      > Debug patch:
      >
      > @@ -343,8 +343,11 @@ static ssize_t show_uevent(struct device
      >                 goto out;
      >
      >         /* copy keys to file */
      > -       for (i = 0; i < env->envp_idx; i++)
      > +       dev_err(dev, "uevent %d env[%d]: %s/.../%s\n", env->buflen, env->envp_idx, top_kobj->name, dev->kobj.name);
      > +       for (i = 0; i < env->envp_idx; i++) {
      > +               printk(KERN_ERR "uevent %d env[%d]: %s\n", (int)count, i, env->envp[i]);
      >                 count += sprintf(&buf[count], "%s\n", env->envp[i]);
      > +       }
      >
      > Oops message, the env[] is again not properly initilized:
      >
      > [   44.068623] input input0: uevent 61 env[805306368]: input0/.../input0
      > [   44.069552] uevent 0 env[0]: (null)
      
      This is a completely different CONFIG_HOTPLUG problem, only
      demonstrating another reason why CONFIG_HOTPLUG should go away.  I had a
      hard time trying to disable it anyway ;-)
      
      The problem this time is lots of code assuming that a call to
      add_uevent_var() will guarantee that env->buflen > 0.  This is not true
      if CONFIG_HOTPLUG is unset.  So things like this end up overwriting
      env->envp_idx because the array index is -1:
      
      	if (add_uevent_var(env, "MODALIAS="))
      		return -ENOMEM;
              len = input_print_modalias(&env->buf[env->buflen - 1],
      				   sizeof(env->buf) - env->buflen,
      				   dev, 0);
      
      Don't know what the best action is, given that there seem to be a *lot*
      of this around the kernel.  This patch "fixes" the problem for me, but I
      don't know if it can be considered an appropriate fix.
      
      [ It is the correct fix for now, for 3.7 forcing CONFIG_HOTPLUG to
      always be on is the longterm fix, but it's too late for 3.6 and older
      kernels to resolve this that way - gregkh ]
      Reported-by: NFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NBjørn Mork <bjorn@mork.no>
      Tested-by: NFengguang Wu <fengguang.wu@intel.com>
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      60e233a5
  12. 06 9月, 2012 1 次提交
  13. 05 9月, 2012 2 次提交
  14. 04 9月, 2012 2 次提交
  15. 02 9月, 2012 2 次提交
  16. 31 8月, 2012 1 次提交
  17. 29 8月, 2012 1 次提交
  18. 23 8月, 2012 1 次提交
    • A
      ARM: omap: allow building omap44xx without SMP · c7a9b09b
      Arnd Bergmann 提交于
      The new omap4 cpuidle implementation currently requires
      ARCH_NEEDS_CPU_IDLE_COUPLED, which only works on SMP.
      
      This patch makes it possible to build a non-SMP kernel
      for that platform. This is not normally desired for
      end-users but can be useful for testing.
      
      Without this patch, building rand-0y2jSKT results in:
      
      drivers/cpuidle/coupled.c: In function 'cpuidle_coupled_poke':
      drivers/cpuidle/coupled.c:317:3: error: implicit declaration of function '__smp_call_function_single' [-Werror=implicit-function-declaration]
      
      It's not clear if this patch is the best solution for
      the problem at hand. I have made sure that we can now
      build the kernel in all configurations, but that does
      not mean it will actually work on an OMAP44xx.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NSantosh Shilimkar <santosh.shilimkar@ti.com>
      Tested-by: NSantosh Shilimkar <santosh.shilimkar@ti.com>
      Cc: Kevin Hilman <khilman@ti.com>
      Cc: Tony Lindgren <tony@atomide.com>
      c7a9b09b
  19. 22 8月, 2012 4 次提交
    • A
      introduce kref_put_mutex() · 8ad5db8a
      Al Viro 提交于
      equivalent of
      	mutex_lock(mutex);
      	if (!kref_put(kref, release))
      		mutex_unlock(mutex);
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      8ad5db8a
    • A
      mfd: Move tps65217 regulator plat data handling to regulator · 1922b0f2
      AnilKumar Ch 提交于
      Regulator platform data handling was mistakenly added to MFD
      driver. So we will see build errors if we compile MFD drivers
      without CONFIG_REGULATOR. This patch moves regulator platform
      data handling from TPS65217 MFD driver to regulator driver.
      
      This makes MFD driver independent of REGULATOR framework so
      build error is fixed if CONFIG_REGULATOR is not set.
      
      drivers/built-in.o: In function `tps65217_probe':
      tps65217.c:(.devinit.text+0x13e37): undefined reference
      to `of_regulator_match'
      
      This patch also fix allocation size of tps65217 platform data.
      Current implementation allocates a struct tps65217_board for each
      regulator specified in the device tree. But the structure itself
      provides array of regulators so one instance of it is sufficient.
      Signed-off-by: NAnilKumar Ch <anilkumar@ti.com>
      1922b0f2
    • M
      mm: compaction: Abort async compaction if locks are contended or taking too long · c67fe375
      Mel Gorman 提交于
      Jim Schutt reported a problem that pointed at compaction contending
      heavily on locks.  The workload is straight-forward and in his own words;
      
      	The systems in question have 24 SAS drives spread across 3 HBAs,
      	running 24 Ceph OSD instances, one per drive.  FWIW these servers
      	are dual-socket Intel 5675 Xeons w/48 GB memory.  I've got ~160
      	Ceph Linux clients doing dd simultaneously to a Ceph file system
      	backed by 12 of these servers.
      
      Early in the test everything looks fine
      
        procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
         r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
        31 15          0     287216        576   38606628    0    0     2  1158    2   14   1  3  95  0  0
        27 15          0     225288        576   38583384    0    0    18 2222016 203357 134876  11 56  17 15  0
        28 17          0     219256        576   38544736    0    0    11 2305932 203141 146296  11 49  23 17  0
         6 18          0     215596        576   38552872    0    0     7 2363207 215264 166502  12 45  22 20  0
        22 18          0     226984        576   38596404    0    0     3 2445741 223114 179527  12 43  23 22  0
      
      and then it goes to pot
      
        procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
         r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
        163  8          0     464308        576   36791368    0    0    11 22210  866  536   3 13  79  4  0
        207 14          0     917752        576   36181928    0    0   712 1345376 134598 47367   7 90   1  2  0
        123 12          0     685516        576   36296148    0    0   429 1386615 158494 60077   8 84   5  3  0
        123 12          0     598572        576   36333728    0    0  1107 1233281 147542 62351   7 84   5  4  0
        622  7          0     660768        576   36118264    0    0   557 1345548 151394 59353   7 85   4  3  0
        223 11          0     283960        576   36463868    0    0    46 1107160 121846 33006   6 93   1  1  0
      
      Note that system CPU usage is very high blocks being written out has
      dropped by 42%. He analysed this with perf and found
      
        perf record -g -a sleep 10
        perf report --sort symbol --call-graph fractal,5
          34.63%  [k] _raw_spin_lock_irqsave
                  |
                  |--97.30%-- isolate_freepages
                  |          compaction_alloc
                  |          unmap_and_move
                  |          migrate_pages
                  |          compact_zone
                  |          compact_zone_order
                  |          try_to_compact_pages
                  |          __alloc_pages_direct_compact
                  |          __alloc_pages_slowpath
                  |          __alloc_pages_nodemask
                  |          alloc_pages_vma
                  |          do_huge_pmd_anonymous_page
                  |          handle_mm_fault
                  |          do_page_fault
                  |          page_fault
                  |          |
                  |          |--87.39%-- skb_copy_datagram_iovec
                  |          |          tcp_recvmsg
                  |          |          inet_recvmsg
                  |          |          sock_recvmsg
                  |          |          sys_recvfrom
                  |          |          system_call
                  |          |          __recv
                  |          |          |
                  |          |           --100.00%-- (nil)
                  |          |
                  |           --12.61%-- memcpy
                   --2.70%-- [...]
      
      There was other data but primarily it is all showing that compaction is
      contended heavily on the zone->lock and zone->lru_lock.
      
      commit [b2eef8c0: mm: compaction: minimise the time IRQs are disabled
      while isolating pages for migration] noted that it was possible for
      migration to hold the lru_lock for an excessive amount of time. Very
      broadly speaking this patch expands the concept.
      
      This patch introduces compact_checklock_irqsave() to check if a lock
      is contended or the process needs to be scheduled. If either condition
      is true then async compaction is aborted and the caller is informed.
      The page allocator will fail a THP allocation if compaction failed due
      to contention. This patch also introduces compact_trylock_irqsave()
      which will acquire the lock only if it is not contended and the process
      does not need to schedule.
      Reported-by: NJim Schutt <jaschut@sandia.gov>
      Tested-by: NJim Schutt <jaschut@sandia.gov>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c67fe375
    • W
      string: do not export memweight() to userspace · c3a5ce04
      WANG Cong 提交于
      Fix the following warning:
      
        usr/include/linux/string.h:8: userspace cannot reference function or variable defined in the kernel
      Signed-off-by: NWANG Cong <xiyou.wangcong@gmail.com>
      Acked-by: NAkinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c3a5ce04
  20. 20 8月, 2012 2 次提交
  21. 17 8月, 2012 2 次提交