1. 21 Jul, 2011 7 commits
  2. 19 Jul, 2011 1 commit
  3. 15 Jul, 2011 1 commit
    • net: remove NETIF_F_ALL_TX_OFFLOADS · 62f2a3a4
      Michał Mirosław committed
      There is no software fallback implemented for SCTP or FCoE checksumming,
      and so it should not be passed on by software devices like bridge or bonding.
      
      For VLAN devices, this is different. First, the driver for underlying device
      should be prepared to get offloaded packets even when the feature is disabled
      (especially if it advertises it in vlan_features). Second, devices under
      VLANs do not get replaced without tearing down the VLAN first.
      
      This fixes a mess I accidentally introduced while converting bonding to
      ndo_fix_features.
      
      NETIF_F_SOFT_FEATURES are removed from BOND_VLAN_FEATURES because they
      are unused as of commit 712ae51a.
      Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 14 Jul, 2011 1 commit
  5. 12 Jul, 2011 1 commit
  6. 09 Jul, 2011 1 commit
    • w1: ds1wm: add a reset recovery parameter · f607e7fc
      Jean-François Dagenais committed
      This fixes a regression in 3.0 reported by Paul Parsons regarding the
      removal of the msleep(1) in the ds1wm_reset() function:
      
      : The linux-3.0-rc4 DS1WM 1-wire driver is logging "bus error, retrying"
      : error messages on an HP iPAQ hx4700 PDA (XScale-PXA270):
      :
      : <snip>
      : Driver for 1-wire Dallas network protocol.
      : DS1WM w1 busmaster driver - (c) 2004 Szabolcs Gyurko
      : 1-Wire driver for the DS2760 battery monitor  chip  - (c) 2004-2005, Szabolcs Gyurko
      : ds1wm ds1wm: pass: 1 bus error, retrying
      : ds1wm ds1wm: pass: 2 bus error, retrying
      : ds1wm ds1wm: pass: 3 bus error, retrying
      : ds1wm ds1wm: pass: 4 bus error, retrying
      : ds1wm ds1wm: pass: 5 bus error, retrying
      : ...
      :
      : The visible result is that the battery charging LED is erratic; sometimes
      : it works, mostly it doesn't.
      :
      : The linux-2.6.39 DS1WM 1-wire driver worked OK.  I haven't tried 3.0-rc1,
      : 3.0-rc2, or 3.0-rc3.
      
      This sleep should not be required on normal circuitry provided the
      pull-ups on the bus are correctly adapted to the slaves.  Unfortunately,
      this is not always the case.  The sleep is restored but as a parameter to
      the probe function in the pdata.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Reported-by: Paul Parsons <lost.distance@yahoo.com>
      Tested-by: Paul Parsons <lost.distance@yahoo.com>
      Signed-off-by: Jean-François Dagenais <dagenaisj@sonatest.com>
      Cc: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 08 Jul, 2011 2 commits
    • FS-Cache: Add a helper to bulk uncache pages on an inode · c902ce1b
      David Howells committed
      Add an FS-Cache helper to bulk uncache pages on an inode.  This will
      only work for the circumstance where the pages in the cache correspond
      1:1 with the pages attached to an inode's page cache.
      
      This is required for CIFS and NFS: When disabling inode cookie, we were
      returning the cookie and setting cifsi->fscache to NULL but failed to
      invalidate any previously mapped pages.  This resulted in "Bad page
      state" errors and manifested in other kinds of errors when running
      fsstress.  Fix it by uncaching mapped pages when we disable the inode
      cookie.
      
      This patch should fix the following oops and "Bad page state" errors
      seen during fsstress testing.
      
        ------------[ cut here ]------------
        kernel BUG at fs/cachefiles/namei.c:201!
        invalid opcode: 0000 [#1] SMP
        Pid: 5, comm: kworker/u:0 Not tainted 2.6.38.7-30.fc15.x86_64 #1 Bochs Bochs
        RIP: 0010: cachefiles_walk_to_object+0x436/0x745 [cachefiles]
        RSP: 0018:ffff88002ce6dd00  EFLAGS: 00010282
        RAX: ffff88002ef165f0 RBX: ffff88001811f500 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: 0000000000000100 RDI: 0000000000000282
        RBP: ffff88002ce6dda0 R08: 0000000000000100 R09: ffffffff81b3a300
        R10: 0000ffff00066c0a R11: 0000000000000003 R12: ffff88002ae54840
        R13: ffff88002ae54840 R14: ffff880029c29c00 R15: ffff88001811f4b0
        FS:  00007f394dd32720(0000) GS:ffff88002ef00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 00007fffcb62ddf8 CR3: 000000001825f000 CR4: 00000000000006e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Process kworker/u:0 (pid: 5, threadinfo ffff88002ce6c000, task ffff88002ce55cc0)
        Stack:
         0000000000000246 ffff88002ce55cc0 ffff88002ce6dd58 ffff88001815dc00
         ffff8800185246c0 ffff88001811f618 ffff880029c29d18 ffff88001811f380
         ffff88002ce6dd50 ffffffff814757e4 ffff88002ce6dda0 ffffffff8106ac56
        Call Trace:
         cachefiles_lookup_object+0x78/0xd4 [cachefiles]
         fscache_lookup_object+0x131/0x16d [fscache]
         fscache_object_work_func+0x1bc/0x669 [fscache]
         process_one_work+0x186/0x298
         worker_thread+0xda/0x15d
         kthread+0x84/0x8c
         kernel_thread_helper+0x4/0x10
        RIP  cachefiles_walk_to_object+0x436/0x745 [cachefiles]
        ---[ end trace 1d481c9af1804caa ]---
      
      I tested the uncaching by the following means:
      
       (1) Create a big file on my NFS server (104857600 bytes).
      
       (2) Read the file into the cache with md5sum on the NFS client.  Look in
           /proc/fs/fscache/stats:
      
      	Pages  : mrk=25601 unc=0
      
       (3) Open the file for read/write ("bash 5<>/warthog/bigfile").  Look in proc
           again:
      
      	Pages  : mrk=25601 unc=25601
      Reported-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reviewed-and-Tested-by: Suresh Jayaraman <sjayaraman@suse.de>
      cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • genirq: replace irq_gc_ack() with {set,clr}_bit variants (fwd) · 659fb32d
      Simon Guinot committed
      This fixes a regression introduced by e59347a1 "arm: orion:
      Use generic irq chip".
      
      Depending on the device, an interrupt is acknowledged either by setting
      or by clearing a bit in a dedicated register. Replacing irq_gc_ack() with
      {set,clr}_bit variants allows both cases to be handled.
      
      Note that this patch affects the following SoCs: Davinci, Samsung and
      Orion. Except for the last, the change is minor: irq_gc_ack() is just
      renamed to irq_gc_ack_set_bit().
      
      For the Orion SoCs, edge GPIO interrupt support is currently
      broken. irq_gc_ack() tries to acknowledge such an interrupt by setting
      the corresponding cause register bit. The Orion GPIO device expects the
      opposite. To fix this issue, the irq_gc_ack_clr_bit() variant is used.
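
The difference between the two acknowledge styles can be sketched in plain userspace C. This is only a model, not the kernel implementation: the helper names and the two "device" functions are hypothetical, and the register semantics (write-one-to-clear versus write-zero-to-clear) are an assumption based on the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Value written to the ack register by a "set bit" style acknowledge. */
static uint32_t ack_set_bit_value(unsigned int bit)
{
    assert(bit < 32);
    return UINT32_C(1) << bit;
}

/* Value written by a "clear bit" style acknowledge: every bit except one. */
static uint32_t ack_clr_bit_value(unsigned int bit)
{
    assert(bit < 32);
    return ~(UINT32_C(1) << bit);
}

/* Hypothetical device A: writing 1 to a bit clears that pending bit. */
static uint32_t write_one_to_clear(uint32_t pending, uint32_t val)
{
    return pending & ~val;
}

/* Hypothetical device B: writing 0 clears a bit, writing 1 leaves it alone. */
static uint32_t write_zero_to_clear(uint32_t pending, uint32_t val)
{
    return pending & val;
}
```

With pending bits 0xA, acknowledging bit 1 leaves 0x8 when each device gets its matching variant. Feeding the "set bit" value to the write-zero-to-clear device instead wipes the other pending bit and keeps the one that should have been acknowledged, which is the kind of breakage the commit describes for Orion.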
      
      Tested on Network Space v2.
      Reported-by: Joey Oravec <joravec@drewtech.com>
      Signed-off-by: Simon Guinot <sguinot@lacie.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
  8. 05 Jul, 2011 1 commit
  9. 30 Jun, 2011 1 commit
  10. 28 Jun, 2011 7 commits
    • mm: fix assertion mapping->nrpages == 0 in end_writeback() · 08142579
      Jan Kara committed
      Under heavy memory and filesystem load, users observe the assertion
      mapping->nrpages == 0 in end_writeback() trigger.  This can be caused by
      page reclaim reclaiming the last page from a mapping in the following
      race:
      
      	CPU0				CPU1
        ...
        shrink_page_list()
          __remove_mapping()
            __delete_from_page_cache()
              radix_tree_delete()
      					evict_inode()
      					  truncate_inode_pages()
      					    truncate_inode_pages_range()
      					      pagevec_lookup() - finds nothing
      					  end_writeback()
      					    mapping->nrpages != 0 -> BUG
              page->mapping = NULL
              mapping->nrpages--
      
      Fix the problem by doing a reliable check of mapping->nrpages under
      mapping->tree_lock in end_writeback().
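
A minimal userspace sketch of the idea follows, assuming (as in the race diagram) that the reclaim side updates the counter inside the same critical section as the tree removal. The struct, the toy spinlock and all function names are hypothetical stand-ins, not the kernel's code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for the mapping; field names mirror the commit text. */
struct mapping_model {
    atomic_flag tree_lock;   /* toy spinlock modelling mapping->tree_lock */
    unsigned long nrpages;
};

static void mapping_init(struct mapping_model *m, unsigned long nrpages)
{
    atomic_flag_clear(&m->tree_lock);
    m->nrpages = nrpages;
}

static void model_lock(struct mapping_model *m)
{
    while (atomic_flag_test_and_set_explicit(&m->tree_lock, memory_order_acquire))
        ; /* spin until the holder releases the lock */
}

static void model_unlock(struct mapping_model *m)
{
    atomic_flag_clear_explicit(&m->tree_lock, memory_order_release);
}

/* Reclaim side: tree removal and counter update share one critical section. */
static void remove_page(struct mapping_model *m)
{
    model_lock(m);
    assert(m->nrpages > 0);
    m->nrpages--;
    model_unlock(m);
}

/* Eviction side: read the counter only while holding the lock. */
static bool mapping_empty_locked(struct mapping_model *m)
{
    model_lock(m);
    bool empty = (m->nrpages == 0);
    model_unlock(m);
    return empty;
}
```

Because the check takes the lock, it cannot observe the window in which the page has left the tree but the counter has not yet been decremented; it simply waits for the remover to finish.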
      
      Analyzed by Jay <jinshan.xiong@whamcloud.com>, lost in LKML, and dug out
      by Miklos Szeredi <mszeredi@suse.de>.
      
      Cc: Jay <jinshan.xiong@whamcloud.com>
      Cc: Miklos Szeredi <mszeredi@suse.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • include/linux/compat.h: declare compat_sys_sendmmsg() · 507c5f12
      Chris Metcalf committed
      This is required for tilegx to be able to use the compat unistd.h header
      where compat_sys_sendmmsg() is now mentioned.
      Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: add shmem_read_mapping_page_gfp · d9d90e5e
      Hugh Dickins committed
      Although it is used (by i915) on nothing but tmpfs, read_cache_page_gfp()
      is unsuited to tmpfs, because it inserts a page into pagecache before
      calling the filesystem's ->readpage: tmpfs may have pages in swapcache
      which only it knows how to locate and switch to filecache.
      
      At present tmpfs provides a ->readpage method, and copes with this by
      copying pages; but soon we can simplify it by removing its ->readpage.
      Provide shmem_read_mapping_page_gfp() now, ready for that transition.
      
      Export shmem_read_mapping_page_gfp() and add it to list in shmem_fs.h,
      with shmem_read_mapping_page() inline for the common mapping_gfp case.
      
      (shmem_read_mapping_page_gfp or shmem_read_cache_page_gfp? Generally the
      read_mapping_page functions use the mapping's ->readpage, and the
      read_cache_page functions use the supplied filler, so I think
      read_cache_page_gfp was slightly misnamed.)
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: take control of its truncate_range · 94c1e62d
      Hugh Dickins committed
      2.6.35's new truncate convention gave tmpfs the opportunity to control
      its file truncation, no longer enforced from outside by vmtruncate().
      We shall want to build upon that, to handle pagecache and swap together.
      
      Slightly redefine the ->truncate_range interface: let it now be called
      between the unmap_mapping_range()s, with the filesystem responsible for
      doing the truncate_inode_pages_range() from it - just as the filesystem
      is nowadays responsible for doing that from its ->setattr.
      
      Let's rename shmem_notify_change() to shmem_setattr().  Instead of
      calling the generic truncate_setsize(), bring that code in so we can
      call shmem_truncate_range() - which will later be updated to perform its
      own variant of truncate_inode_pages_range().
      
      Remove the punch_hole unmap_mapping_range() from shmem_truncate_range():
      now that the COW's unmap_mapping_range() comes after ->truncate_range,
      there is no need to call it a third time.
      
      Export shmem_truncate_range() and add it to the list in shmem_fs.h, so
      that i915_gem_object_truncate() can call it explicitly in future; get
      this patch in first, then update drm/i915 once this is available (until
      then, i915 will just be doing the truncate_inode_pages() twice).
      
      Though introduced five years ago, no other filesystem is implementing
      ->truncate_range, and its only other user is madvise(,,MADV_REMOVE): we
      expect to convert it to fallocate(,FALLOC_FL_PUNCH_HOLE,,) shortly,
      whereupon ->truncate_range can be removed from inode_operations -
      shmem_truncate_range() will help i915 across that transition too.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: move shmem prototypes to shmem_fs.h · 072441e2
      Hugh Dickins committed
      Before adding any more global entry points into shmem.c, gather such
      prototypes into shmem_fs.h.  Remove mm's own declarations from swap.h,
      but for now leave the ones in mm.h: because shmem_file_setup() and
      shmem_zero_setup() are called from various places, and we should not
      force other subsystems to update immediately.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Fix some kernel-doc warnings · 4d258b25
      Vitaliy Ivanov committed
      Fix 'make htmldocs' warnings:
      
        Warning(/include/linux/hrtimer.h:153): No description found for parameter 'clockid'
        Warning(/include/linux/device.h:604): Excess struct/union/enum/typedef member 'of_match' description in 'device'
        Warning(/include/net/sock.h:349): Excess struct/union/enum/typedef member 'sk_rmem_alloc' description in 'sock'
      Signed-off-by: Vitaliy Ivanov <vitalivanov@gmail.com>
      Acked-by: Grant Likely <grant.likely@secretlab.ca>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Fix node_start/end_pfn() definition for mm/page_cgroup.c · c6830c22
      KAMEZAWA Hiroyuki committed
      Commit 21a3c964 uses node_start/end_pfn(nid) to detect the start/end
      of nodes. However, it is not defined in linux/mmzone.h but in
      /arch/???/include/mmzone.h, which is included only under
      CONFIG_NEED_MULTIPLE_NODES=y.
      
      As a result, we see
        mm/page_cgroup.c: In function 'page_cgroup_init':
        mm/page_cgroup.c:308: error: implicit declaration of function 'node_start_pfn'
        mm/page_cgroup.c:309: error: implicit declaration of function 'node_end_pfn'
      
      So, fixing page_cgroup.c is one idea...
      
      But node_start_pfn()/node_end_pfn() is a very generic macro and
      should be implemented in the same manner for all archs.
      (m32r has a different implementation...)
      
      This patch removes the definitions of node_start/end_pfn() from each arch
      and defines a unified one in linux/mmzone.h. It is no longer under
      CONFIG_NEED_MULTIPLE_NODES.
      
      A result of macro expansion is here (mm/page_cgroup.c)
      
      for !NUMA
        start_pfn = ((&contig_page_data)->node_start_pfn);
        end_pfn = ({ pg_data_t *__pgdat = (&contig_page_data); __pgdat->node_start_pfn + __pgdat->node_spanned_pages;});
      
      for NUMA (x86-64)
        start_pfn = ((node_data[nid])->node_start_pfn);
        end_pfn = ({ pg_data_t *__pgdat = (node_data[nid]); __pgdat->node_start_pfn + __pgdat->node_spanned_pages;});
      
      Changelog:
       - fixed to avoid using "nid" twice in node_end_pfn() macro.
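
The changelog point about not using "nid" twice can be illustrated with a small userspace model built from the expansion shown above. It relies on GNU C statement expressions (GCC or Clang). `NODE_DATA()` is written here as a plain function over a made-up two-node table, and the pfn values are invented for the example.

```c
#include <assert.h>

/* Cut-down stand-in for pg_data_t with only the two fields used here. */
typedef struct pglist_data_model {
    unsigned long node_start_pfn;
    unsigned long node_spanned_pages;
} pg_data_t;

/* Invented layout: node 0 spans pfn 0..1023, node 1 spans 1024..3071. */
static pg_data_t node_data_model[2] = { { 0, 1024 }, { 1024, 2048 } };

/* Hypothetical lookup helper standing in for the per-arch NODE_DATA(). */
static pg_data_t *NODE_DATA(int nid)
{
    assert(nid >= 0 && nid < 2);
    return &node_data_model[nid];
}

#define node_start_pfn(nid) (NODE_DATA(nid)->node_start_pfn)

/* The argument is evaluated once and cached in a local pointer. */
#define node_end_pfn(nid) ({                                          \
    pg_data_t *__pgdat = NODE_DATA(nid);                              \
    __pgdat->node_start_pfn + __pgdat->node_spanned_pages;            \
})
```

Calling `node_end_pfn(nid++)` advances `nid` exactly once. A naive definition that expanded `nid` in both terms would advance it twice and mix fields from two different nodes.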
      Reported-and-acked-by: Randy Dunlap <randy.dunlap@oracle.com>
      Reported-and-tested-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 24 Jun, 2011 1 commit
    • fsl-diu-fb: remove check for pixel clock ranges · 39785eb1
      Timur Tabi committed
      The Freescale DIU framebuffer driver defines two constants, MIN_PIX_CLK and
      MAX_PIX_CLK, that are supposed to represent the lower and upper limits of
      the pixel clock.  These values, however, are true only for one platform
      clock rate (533MHz) and only for the MPC8610.  So the actual range for
      the pixel clock is chip-specific, which means the current values are almost
      always wrong.  The chance of an out-of-range pixel clock being used is also
      remote.
      
      Rather than try to detect an out-of-range clock in the DIU driver, we depend
      on the board-specific pixel clock function (e.g. p1022ds_set_pixel_clock)
      to clamp the pixel clock to a supported value.
      Signed-off-by: Timur Tabi <timur@freescale.com>
      Signed-off-by: Paul Mundt <lethal@linux-sh.org>
  12. 22 Jun, 2011 2 commits
    • PM: Fix async resume following suspend failure · 6d0e0e84
      Alan Stern committed
      The PM core doesn't handle suspend failures correctly when it comes to
      asynchronously suspended devices.  These devices are moved onto the
      dpm_suspended_list as soon as the corresponding async thread is
      started up, and they remain on the list even if they fail to suspend
      or the sleep transition is cancelled before they get suspended.  As a
      result, when the PM core unwinds the transition, it tries to resume
      the devices even though they were never suspended.
      
      This patch (as1474) fixes the problem by adding a new "is_suspended"
      flag to dev_pm_info.  Devices are resumed only if the flag is set.
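
As a rough illustration, the flag logic might look like the following userspace sketch. The struct and function names are hypothetical; only the `is_suspended` field name comes from the commit text, and real driver callbacks are replaced by a simple counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy device: only the state needed to show the flag's effect. */
struct dev_model {
    bool is_suspended;   /* set only after a successful suspend */
    int resume_calls;    /* how many times the "driver" resume ran */
};

/* Models a suspend attempt; driver_result is the driver's return code. */
static int model_suspend(struct dev_model *dev, int driver_result)
{
    assert(dev != NULL);
    if (driver_result == 0)
        dev->is_suspended = true;
    return driver_result;
}

/* Models the unwind path: resume only what was actually suspended. */
static void model_resume(struct dev_model *dev)
{
    assert(dev != NULL);
    if (!dev->is_suspended)
        return;
    dev->resume_calls++;
    dev->is_suspended = false;
}
```

If one device suspends successfully and another fails, unwinding calls the resume path once for the first device and never for the second.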
      
      [rjw:
       * Moved the dev->power.is_suspended check into device_resume(),
         because we need to complete dev->power.completion and clear
         dev->power.is_prepared too for devices whose
         dev->power.is_suspended flags are unset.
       * Fixed __device_suspend() to avoid setting dev->power.is_suspended
         if async_error is different from zero.]
      Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: stable@kernel.org
    • PM: Rename dev_pm_info.in_suspend to is_prepared · f76b168b
      Alan Stern committed
      This patch (as1473) renames the "in_suspend" field in struct
      dev_pm_info to "is_prepared", in preparation for an upcoming change.
      The new name is more descriptive of what the field really means.
      Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: stable@kernel.org
  13. 21 Jun, 2011 2 commits
    • vfs: i_state needs to be 'unsigned long' for now · 79568f5b
      Linus Torvalds committed
      Commit 13e12d14 ("vfs: reorganize 'struct inode' layout a bit")
      moved things around a bit and changed i_state to be unsigned int instead of
      unsigned long.  That was to help structure layout for the 64-bit case,
      and shrink 'struct inode' a bit (admittedly that only happened when
      spinlock debugging was on and i_flags didn't pack with i_lock).
      
      However, Meelis Roos reports that this results in unaligned exceptions
      on sparc, and it turns out that the bit-locking primitives that we use
      for the I_NEW bit want to use the bitops.  Which want 'unsigned long',
      not 'unsigned int'.
      
      We really should fix the bit locking code to not have that kind of
      requirement, but that's a much bigger change.  So for now, revert that
      field back to 'unsigned long' (but keep the other re-ordering changes
      from the commit that caused this).
      
      Andi points out that we have played games with this in 'struct page', so
      it's solvable with other hacks too, but since right now the struct inode
      size advantage only happens with some rare config options, it's not
      worth fighting.
      
      It _would_ be worth fixing the bitlocking code, though.  Especially
      since there is no type safety in the bitlocking code (this never caused
      any warnings, and worked fine on x86-64, because the bitlocks take a
      'void *' and x86-64 doesn't care that deeply about alignment).  So it's
      currently a very easy problem to trigger by mistake and never notice.
      Reported-by: Meelis Roos <mroos@linux.ee>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • NFSv4.1: file layout must consider pg_bsize for coalescing · 19345cb2
      Benny Halevy committed
      Otherwise we end up overflowing the rpc buffer size on the receive end.
      Signed-off-by: Benny Halevy <benny@tonian.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
  14. 20 Jun, 2011 2 commits
  15. 19 Jun, 2011 1 commit
  16. 18 Jun, 2011 2 commits
  17. 17 Jun, 2011 3 commits
  18. 16 Jun, 2011 4 commits
    • gpio: add GPIOF_ values regardless on kconfig settings · c001fb72
      Randy Dunlap committed
      Make GPIOF_ defined values available even when neither GPIOLIB nor
      GENERIC_GPIO is enabled by moving them to <linux/gpio.h>.
      
      Fixes these build errors in linux-next:
      sound/soc/codecs/ak4641.c:524: error: 'GPIOF_OUT_INIT_LOW' undeclared (first use in this function)
      sound/soc/codecs/wm8915.c:2921: error: 'GPIOF_OUT_INIT_LOW' undeclared (first use in this function)
      Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
    • uts: make default hostname configurable, rather than always using "(none)" · bd5dc17b
      Josh Triplett committed
      The "hostname" tool falls back to setting the hostname to "localhost" if
      /etc/hostname does not exist.  Distribution init scripts have the same
      fallback.  However, if userspace never calls sethostname, such as when
      booting with init=/bin/sh, or otherwise booting a minimal system without
      the usual init scripts, the default hostname of "(none)" remains,
      unhelpfully appearing in various places such as prompts ("root@(none):~#")
      and logs.  Furthermore, "(none)" doesn't typically resolve to anything
      useful.
      
      Make the default hostname configurable.  This removes the need for the
      standard fallback, provides a useful default for systems that never call
      sethostname, and makes minimal systems that much more useful with less
      configuration.  Distributions could choose to use "localhost" here to
      avoid the fallback, while embedded systems may wish to use a specific
      target hostname.
      Signed-off-by: Josh Triplett <josh@joshtriplett.org>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: David Miller <davem@davemloft.net>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Cc: Kel Modderman <kel@otaku42.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • BUILD_BUG_ON_ZERO: fix sparse breakage · ca39599c
      Dr. David Alan Gilbert committed
      BUILD_BUG_ON_ZERO and BUILD_BUG_ON_NULL must return values, even in the
      CHECKER case; otherwise various users of them become syntactically invalid.
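
To show why the macro has to yield a value in both configurations, here is a hedged userspace sketch. The non-checker definition uses the well-known negative bit-field width trick and depends on GNU C treating a struct with no named members as having size zero. The `__CHECKER__` branch value, `CHECKED_SHIFT()` and `bit_mask()` are illustrative assumptions, not the kernel's text.

```c
#include <assert.h>
#include <stddef.h>

#ifdef __CHECKER__
/* Static checker build: skip the bit-field trick but still yield a value. */
#define BUILD_BUG_ON_ZERO(e) ((size_t)0)
#else
/* A true condition gives a negative bit-field width and fails to compile. */
#define BUILD_BUG_ON_ZERO(e) (sizeof(struct { int:-!!(e); }))
#endif

/* Hypothetical user: the macro sits inside a larger expression. */
#define CHECKED_SHIFT(n) ((n) + BUILD_BUG_ON_ZERO((n) >= 32))

static unsigned long bit_mask(void)
{
    /* The check contributes nothing to the result when it passes. */
    assert(BUILD_BUG_ON_ZERO(0) == 0);
    return 1UL << CHECKED_SHIFT(5);
}
```

If the checker-side definition expanded to nothing, `CHECKED_SHIFT(5)` would become `((5) + )`, which does not parse. That is the kind of syntactic breakage the commit message refers to.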
      Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: increase RECLAIM_DISTANCE to 30 · 32e45ff4
      KOSAKI Motohiro committed
      Recently, Robert Mueller reported (http://lkml.org/lkml/2010/9/12/236)
      that zone_reclaim_mode doesn't work properly on his new NUMA server (Dual
      Xeon E5520 + Intel S5520UR MB).  He is using Cyrus IMAPd and it's built on
      a very traditional single-process model.
      
        * a master process which reads config files and manages the other
          process
        * multiple imapd processes, one per connection
        * multiple pop3d processes, one per connection
        * multiple lmtpd processes, one per connection
        * periodical "cleanup" processes.
      
      There are thousands of independent processes.  The problem is that recent
      Intel motherboards turn on zone_reclaim_mode by default, and traditional
      prefork-model software doesn't work well with it.  Unfortunately, such models
      are still typical even in the 21st century.  We can't ignore them.
      
      This patch raises the zone_reclaim_mode threshold to 30.  30 doesn't have
      any specific meaning, but 20 means one-hop QPI/Hypertransport, and
      such relatively cheap 2-4 socket machines are often used for traditional
      servers as above.  The intention is that these machines don't use
      zone_reclaim_mode.
      
      Note: ia64 and Power have arch specific RECLAIM_DISTANCE definitions.
      This patch doesn't change such high-end NUMA machine behavior.
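
The effect of the threshold change can be sketched as a single comparison. This is an assumption about how the default is derived (zone reclaim turned on when some node distance exceeds RECLAIM_DISTANCE), written as a hypothetical userspace helper. The value 10 for the local distance follows the usual NUMA distance table convention.

```c
#include <assert.h>
#include <stdbool.h>

#define LOCAL_DISTANCE 10   /* conventional distance of a node to itself */

/*
 * Hypothetical helper: would zone reclaim be enabled by default for a
 * machine whose largest reported node distance is node_distance?
 */
static bool zone_reclaim_default(int node_distance, int reclaim_distance)
{
    assert(node_distance >= LOCAL_DISTANCE);
    return node_distance > reclaim_distance;
}
```

Under this model, a machine reporting a distance of 21 gets zone reclaim with the old threshold of 20 but not with the new threshold of 30, which matches the concern about hardware that sets the distance to 21 on purpose.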
      
      Dave Hansen said:
      
      : I know specifically of pieces of x86 hardware that set the information
      : in the BIOS to '21' *specifically* so they'll get the zone_reclaim_mode
      : behavior which that implies.
      :
      : They've done performance testing and run very large and scary benchmarks
      : to make sure that they _want_ this turned on.  What this means for them
      : is that they'll probably be de-optimized, at least on newer versions of
      : the kernel.
      :
      : If you want to do this for particular systems, maybe _that_'s what we
      : should do.  Have a list of specific configurations that need the
      : defaults overridden either because they're buggy, or they have an
      : unusual hardware configuration not really reflected in the distance
      : table.
      
      And later said:
      
      : The original change in the hardware tables was for the benefit of a
      : benchmark.  Said benchmark isn't going to get run on mainline until the
      : next batch of enterprise distros drops, at which point the hardware where
      : this was done will be irrelevant for the benchmark.  I'm sure any new
      : hardware will just set this distance to another yet arbitrary value to
      : make the kernel do what it wants.  :)
      :
      : Also, when the hardware got _set_ to this initially, I complained.  So, I
      : guess I'm getting my way now, with this patch.  I'm cool with it.
      Reported-by: Robert Mueller <robm@fastmail.fm>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Acked-by: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>