1. 17 January 2011 (1 commit)
    • sanitize vfsmount refcounting changes · f03c6599
      Authored by Al Viro
      Instead of splitting refcount between (per-cpu) mnt_count
      and (SMP-only) mnt_longrefs, make all references contribute
      to mnt_count again and keep track of how many are longterm
      ones.
      
      Accounting rules for longterm count:
      	* 1 for each fs_struct.root.mnt
      	* 1 for each fs_struct.pwd.mnt
      	* 1 for having non-NULL ->mnt_ns
      	* decrement to 0 happens only under vfsmount lock exclusive
      
      That allows nice common case for mntput() - since we can't drop the
      final reference until after mnt_longterm has reached 0 due to the rules
      above, mntput() can grab vfsmount lock shared and check mnt_longterm.
      If it turns out to be non-zero (which is the common case), we know
      that this is not the final mntput() and can just blindly decrement
      percpu mnt_count.  Otherwise we grab vfsmount lock exclusive and
      do usual decrement-and-check of percpu mnt_count.
      
      For fs_struct.c we have mnt_make_longterm() and mnt_make_shortterm();
      namespace.c uses the latter in places where we don't already hold
      vfsmount lock exclusive and opencodes a few remaining spots where
      we need to manipulate mnt_longterm.
      
      Note that we mostly revert the code outside of fs/namespace.c back
      to what we used to have; in particular, normal code doesn't need
      to care about two kinds of references, etc.  And we get to keep
      the optimization Nick's variant had bought us...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  2. 16 January 2011 (7 commits)
    • Unexport do_add_mount() and add in follow_automount(), not ->d_automount() · ea5b778a
      Authored by David Howells
      Unexport do_add_mount() and make ->d_automount() return the vfsmount to be
      added rather than calling do_add_mount() itself.  follow_automount() will then
      do the addition.
      
      This slightly complicates things as ->d_automount() normally wants to add the
      new vfsmount to an expiration list and start an expiration timer.  The problem
      with that is that the vfsmount will be deleted if it has a refcount of 1 and
      the timer will not repeat if the expiration list is empty.
      
      To this end, we require the vfsmount to be returned from d_automount() with a
      refcount of (at least) 2.  One of these refs will be dropped unconditionally.
      In addition, follow_automount() must get a 3rd ref around the call to
      do_add_mount() lest it eat a ref and return an error, leaving the mount we
      have open to being expired as we would otherwise have only 1 ref on it.
      
      d_automount() should also add the vfsmount to the expiration list (by
      calling mnt_set_expiry()) and start the expiration timer before returning, if
      this mechanism is to be used.  The vfsmount will be unlinked from the
      expiration list by follow_automount() if do_add_mount() fails.
      
      This patch also fixes the call to do_add_mount() for AFS to propagate the mount
      flags from the parent vfsmount.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • Allow d_manage() to be used in RCU-walk mode · ab90911f
      Authored by David Howells
      Allow d_manage() to be called from pathwalk when it is in RCU-walk mode as well
      as when it is in Ref-walk mode.  This permits __follow_mount_rcu() to call
      d_manage() directly.  d_manage() needs a parameter to indicate that it is in
      RCU-walk mode as it isn't allowed to sleep if in that mode (but should return
      -ECHILD instead).
      
      autofs4_d_manage() can then be set to retain RCU-walk mode if the daemon
      accesses it and otherwise request dropping back to ref-walk mode.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • autofs4: Bump version · 1972580b
      Authored by Ian Kent
      Increase the autofs module sub-version so we can tell what kernel
      implementation is being used from user space debug logging.
      Signed-off-by: Ian Kent <raven@themaw.net>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • NFS: Use d_automount() rather than abusing follow_link() · 36d43a43
      Authored by David Howells
      Make NFS use the new d_automount() dentry operation rather than abusing
      follow_link() on directories.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      Acked-by: Ian Kent <raven@themaw.net>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • Add an AT_NO_AUTOMOUNT flag to suppress terminal automount · 6f45b656
      Authored by David Howells
      Add an AT_NO_AUTOMOUNT flag to suppress terminal automounting of automount
      point directories.  This can be used by fstatat() users to permit the
      gathering of attributes on an automount point and also prevent
      mass-automounting of a directory of automount points by ls.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Ian Kent <raven@themaw.net>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
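      From userspace the flag is passed to fstatat(). A minimal sketch, assuming a Linux system with glibc exposing AT_NO_AUTOMOUNT; "." is just a stand-in path (on a plain directory the flag is a no-op, on an automount point it returns the point's own attributes instead of triggering the mount):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>

/* Stat a path without triggering a terminal automount.  The helper
 * name is ours; fstatat() and AT_NO_AUTOMOUNT are the real API. */
static int stat_no_automount(const char *path, struct stat *st)
{
	return fstatat(AT_FDCWD, path, st, AT_NO_AUTOMOUNT);
}
```

      This is what lets an `ls` of a directory full of automount points gather attributes without mass-mounting everything underneath it.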
    • Add a dentry op to allow processes to be held during pathwalk transit · cc53ce53
      Authored by David Howells
      Add a dentry op (d_manage) to permit a filesystem to hold a process and make it
      sleep when it tries to transit away from one of that filesystem's directories
      during a pathwalk.  The operation is keyed off a new dentry flag
      (DCACHE_MANAGE_TRANSIT).
      
      The filesystem is allowed to be selective about which processes it holds and
      which it permits to continue on or prohibits from transiting from each flagged
      directory.  This will allow autofs to hold up client processes whilst letting
      its userspace daemon through to maintain the directory or the stuff behind it
      or mounted upon it.
      
      The ->d_manage() dentry operation:
      
      	int (*d_manage)(struct path *path, bool mounting_here);
      
      takes a pointer to the directory about to be transited away from and a flag
      indicating whether the transit is undertaken by do_add_mount() or
      do_move_mount() skipping through a pile of filesystems mounted on a mountpoint.
      
      It should return 0 if successful and to let the process continue on its way;
      -EISDIR to prohibit the caller from skipping to overmounted filesystems or
      automounting, and to use this directory; or some other error code to return to
      the user.
      
      ->d_manage() is called with namespace_sem writelocked if mounting_here is true
      and no other locks held, so it may sleep.  However, if mounting_here is true,
      it may not initiate or wait for a mount or unmount upon the parameter
      directory, even if the act is actually performed by userspace.
      
      Within fs/namei.c, follow_managed() is extended to check with d_manage() first
      on each managed directory, before transiting away from it or attempting to
      automount upon it.
      
      follow_down() is renamed follow_down_one() and should only be used where the
      filesystem deliberately intends to avoid management steps (e.g. autofs).
      
      A new follow_down() is added that incorporates the loop done by all other
      callers of follow_down() (do_add/move_mount(), autofs and NFSD; whilst AFS, NFS
      and CIFS do use it, their use is removed by converting them to use
      d_automount()).  The new follow_down() calls d_manage() as appropriate.  It
      also takes an extra parameter to indicate if it is being called from mount code
      (with namespace_sem writelocked) which it passes to d_manage().  follow_down()
      ignores automount points so that it can be used to mount on them.
      
      __follow_mount_rcu() is made to abort rcu-walk mode if it hits a directory with
      DCACHE_MANAGE_TRANSIT set on the basis that we're probably going to have to
      sleep.  It would be possible to enter d_manage() in rcu-walk mode too, and have
      that determine whether to abort or not itself.  That would allow the autofs
      daemon to continue on in rcu-walk mode.
      
      Note that DCACHE_MANAGE_TRANSIT on a directory should be cleared when it isn't
      required as every transit from that directory will cause d_manage() to be
      invoked.  It can always be set again when necessary.
      
      ==========================
      WHAT THIS MEANS FOR AUTOFS
      ==========================
      
      Autofs currently uses the lookup() inode op and the d_revalidate() dentry op to
      trigger the automounting of indirect mounts, and both of these can be called
      with i_mutex held.
      
      autofs knows that the i_mutex will be held by the caller in lookup(), and so
      can drop it before invoking the daemon - but this isn't so for d_revalidate(),
      since the lock is only held on _some_ of the code paths that call it.  This
      means that autofs can't risk dropping i_mutex from its d_revalidate() function
      before it calls the daemon.
      
      The bug could manifest itself as, for example, a process that's trying to
      validate an automount dentry that gets made to wait because that dentry is
      expired and needs cleaning up:
      
      	mkdir         S ffffffff8014e05a     0 32580  24956
      	Call Trace:
      	 [<ffffffff885371fd>] :autofs4:autofs4_wait+0x674/0x897
      	 [<ffffffff80127f7d>] avc_has_perm+0x46/0x58
      	 [<ffffffff8009fdcf>] autoremove_wake_function+0x0/0x2e
      	 [<ffffffff88537be6>] :autofs4:autofs4_expire_wait+0x41/0x6b
      	 [<ffffffff88535cfc>] :autofs4:autofs4_revalidate+0x91/0x149
      	 [<ffffffff80036d96>] __lookup_hash+0xa0/0x12f
      	 [<ffffffff80057a2f>] lookup_create+0x46/0x80
      	 [<ffffffff800e6e31>] sys_mkdirat+0x56/0xe4
      
      versus the automount daemon which wants to remove that dentry, but can't
      because the normal process is holding the i_mutex lock:
      
      	automount     D ffffffff8014e05a     0 32581      1              32561
      	Call Trace:
      	 [<ffffffff80063c3f>] __mutex_lock_slowpath+0x60/0x9b
      	 [<ffffffff8000ccf1>] do_path_lookup+0x2ca/0x2f1
      	 [<ffffffff80063c89>] .text.lock.mutex+0xf/0x14
      	 [<ffffffff800e6d55>] do_rmdir+0x77/0xde
      	 [<ffffffff8005d229>] tracesys+0x71/0xe0
      	 [<ffffffff8005d28d>] tracesys+0xd5/0xe0
      
      which means that the system is deadlocked.
      
      This patch allows autofs to hold up normal processes whilst the daemon goes
      ahead and does things to the dentry tree behind the automounter point without
      risking a deadlock as almost no locks are held in d_manage() and none in
      d_automount().
      Signed-off-by: David Howells <dhowells@redhat.com>
      Was-Acked-by: Ian Kent <raven@themaw.net>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
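      The way follow_managed() consults ->d_manage() can be sketched as a userspace model. The struct layouts below are invented for the sketch; only the flag name, the op signature and the -EISDIR convention come from the commit text:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative model, not kernel code: pathwalk checks with d_manage()
 * before transiting a directory flagged DCACHE_MANAGE_TRANSIT. */
#define DCACHE_MANAGE_TRANSIT 0x1u

struct path;

struct dentry {
	unsigned int flags;
	int (*d_manage)(struct path *path, bool mounting_here);
};

struct path {
	struct dentry *dentry;
};

static int follow_managed_sim(struct path *path, bool mounting_here)
{
	struct dentry *d = path->dentry;

	if ((d->flags & DCACHE_MANAGE_TRANSIT) && d->d_manage)
		return d->d_manage(path, mounting_here);
	return 0;	/* nothing to manage: walk straight through */
}

/* A d_manage() that prohibits transit, the way an autofs dentry can
 * tell the caller to use this directory directly. */
static int refuse_transit(struct path *path, bool mounting_here)
{
	(void)path;
	(void)mounting_here;
	return -EISDIR;
}
```

      A 0 return lets the walk continue; -EISDIR stops it at the flagged directory, and any other error is handed back to the user, matching the return convention described above.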
    • Add a dentry op to handle automounting rather than abusing follow_link() · 9875cf80
      Authored by David Howells
      Add a dentry op (d_automount) to handle automounting directories rather than
      abusing the follow_link() inode operation.  The operation is keyed off a new
      dentry flag (DCACHE_NEED_AUTOMOUNT).
      
      This also makes it easier to add an AT_ flag to suppress terminal segment
      automount during pathwalk and removes the need for the kludge code in the
      pathwalk algorithm to handle directories with follow_link() semantics.
      
      The ->d_automount() dentry operation:
      
      	struct vfsmount *(*d_automount)(struct path *mountpoint);
      
      takes a pointer to the directory to be mounted upon, which is expected to
      provide sufficient data to determine what should be mounted.  If successful, it
      should return the vfsmount struct it creates (which it should also have added
      to the namespace using do_add_mount() or similar).  If there's a collision with
      another automount attempt, NULL should be returned.  If the directory specified
      by the parameter should be used directly rather than being mounted upon,
      -EISDIR should be returned.  In any other case, an error code should be
      returned.
      
      The ->d_automount() operation is called with no locks held and may sleep.  At
      this point the pathwalk algorithm will be in ref-walk mode.
      
      Within fs/namei.c itself, a new pathwalk subroutine (follow_automount()) is
      added to handle mountpoints.  It will return -EREMOTE if the automount flag was
      set, but no d_automount() op was supplied, -ELOOP if we've encountered too many
      symlinks or mountpoints, -EISDIR if the walk point should be used without
      mounting and 0 if successful.  The path will be updated to point to the mounted
      filesystem if a successful automount took place.
      
      __follow_mount() is replaced by follow_managed() which is more generic
      (especially with the patch that adds ->d_manage()).  This handles transits from
      directories during pathwalk, including automounting and skipping over
      mountpoints (and holding processes with the next patch).
      
      __follow_mount_rcu() will jump out of RCU-walk mode if it encounters an
      automount point with nothing mounted on it.
      
      follow_dotdot*() does not handle automounts as you don't want to trigger them
      whilst following "..".
      
      I've also extracted the mount/don't-mount logic from autofs4 and included it
      here.  It makes the mount go ahead anyway if someone calls open() or creat(),
      tries to traverse the directory, tries to chdir/chroot/etc. into the directory,
      or sticks a '/' on the end of the pathname.  If they do a stat(), however,
      they'll only trigger the automount if they didn't also say O_NOFOLLOW.
      
      I've also added an inode flag (S_AUTOMOUNT) so that filesystems can mark their
      inodes as automount points.  This flag is automatically propagated to the
      dentry as DCACHE_NEED_AUTOMOUNT by __d_instantiate().  This saves NFS and could
      save AFS a private flag bit apiece, but is not strictly necessary.  It would be
      preferable to do the propagation in d_set_d_op(), but that doesn't normally
      have access to the inode.
      
      [AV: fixed breakage in case if __follow_mount_rcu() fails and nameidata_drop_rcu()
      succeeds in RCU case of do_lookup(); we need to fall through to non-RCU case after
      that, rather than just returning with ungrabbed *path]
      Signed-off-by: David Howells <dhowells@redhat.com>
      Was-Acked-by: Ian Kent <raven@themaw.net>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
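      The return-code protocol of follow_automount() can be condensed into a userspace sketch. Structures and helper names here are invented; only the decision logic follows the description above (-EREMOTE when the dentry is flagged but no op is supplied, NULL from the op meaning a collision with another automount attempt, 0 with the path updated on success):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DCACHE_NEED_AUTOMOUNT 0x2u

struct vfsmount { int id; };

struct path;
struct dentry {
	unsigned int flags;
	struct vfsmount *(*d_automount)(struct path *mountpoint);
};
struct path {
	struct dentry *dentry;
	struct vfsmount *mnt;
};

/* Simplified decision logic of follow_automount(); error-pointer and
 * -EISDIR handling from the full description are omitted. */
static int follow_automount_sim(struct path *path)
{
	if (!(path->dentry->flags & DCACHE_NEED_AUTOMOUNT))
		return 0;
	if (!path->dentry->d_automount)
		return -EREMOTE;

	struct vfsmount *mnt = path->dentry->d_automount(path);
	if (!mnt)
		return 0;	/* collided with another automount attempt */
	path->mnt = mnt;	/* the walk continues onto the new mount */
	return 0;
}

static struct vfsmount the_mount = { 42 };

static struct vfsmount *demo_automount(struct path *mountpoint)
{
	(void)mountpoint;
	return &the_mount;
}
```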
  3. 15 January 2011 (5 commits)
  4. 14 January 2011 (27 commits)
    • Revert update for dirty_ratio for memcg. · 836cb711
      Authored by KAMEZAWA Hiroyuki
      The flags added by commit db16d5ec
      have no users now. We believe we'll use them soon, but considering
      patch review, the change itself should be folded into the incoming
      set of "dirty ratio for memcg" patches.
      
      So, it's better to drop this change from current mainline tree.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Greg Thelen <gthelen@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • power_supply: Add MAX17042 Fuel Gauge Driver · 359ab9f5
      Authored by MyungJoo Ham
      The MAX17042 is a fuel gauge with an I2C interface for lithium-ion
      batteries. Unlike its predecessor, the MAX17040, the MAX17042 uses
      16-bit registers. It also has many more features than the MAX17040,
      e.g. a thermistor, current and accumulated-current measurement,
      battery internal resistance estimation, averaged measurements, and
      others.
      
      This patch implements a driver for MAX17042.
      In this initial release, we have implemented the most basic features of
      a fuel gauge: measure the battery capacity and voltage.
      Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com>
      Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
      Signed-off-by: Anton Vorontsov <cbouatmailru@gmail.com>
    • kernel: fix hlist_bl again · 32385c7c
      Authored by Russell King
      __d_rehash is dereferencing an almost-NULL pointer on my ARM926.
      CONFIG_SMP=n and CONFIG_DEBUG_SPINLOCK=y.
      
      The faulting instruction is:    strne   r3, [r2, #4]
      and as can be seen from the register dump below, r2 is 0x00000001, hence
      the faulting 0x00000005 address.
      
      __d_rehash is essentially:
      
             spin_lock_bucket(b);
             entry->d_flags &= ~DCACHE_UNHASHED;
             hlist_bl_add_head_rcu(&entry->d_hash, &b->head);
             spin_unlock_bucket(b);
      
      which is:
      
             bit_spin_lock(0, (unsigned long *)&b->head.first);
             entry->d_flags &= ~DCACHE_UNHASHED;
             hlist_bl_add_head_rcu(&entry->d_hash, &b->head);
             __bit_spin_unlock(0, (unsigned long *)&b->head.first);
      
      bit_spin_lock(0, ptr) sets bit 0 of *ptr, in this case b->head.first if
      CONFIG_SMP or CONFIG_DEBUG_SPINLOCK is set:
      
      #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
             while (unlikely(test_and_set_bit_lock(bitnum, addr))) {
                     while (test_bit(bitnum, addr)) {
                             preempt_enable();
                             cpu_relax();
                             preempt_disable();
                     }
             }
      #endif
      
       So, b->head.first starts off NULL, and becomes non-NULL (address 1).
      hlist_bl_add_head_rcu() does this:
      
      static inline void hlist_bl_add_head_rcu(struct hlist_bl_node *n,
                                             struct hlist_bl_head *h)
      {
             first = hlist_bl_first(h);
             n->next = first;
             if (first)
                     first->pprev = &n->next;
      
      It is the store to first->pprev which is faulting.
      
      hlist_bl_first():
      
      static inline struct hlist_bl_node *hlist_bl_first(struct hlist_bl_head *h)
      {
             return (struct hlist_bl_node *)
                     ((unsigned long)h->first & ~LIST_BL_LOCKMASK);
      }
      
      but:
      #if defined(CONFIG_SMP)
      #define LIST_BL_LOCKMASK        1UL
      #else
      #define LIST_BL_LOCKMASK        0UL
      #endif
      
      So, we have one piece of code which sets bit 0 of addresses, and another
      bit of code which doesn't clear it before dereferencing the pointer if
      !CONFIG_SMP && CONFIG_DEBUG_SPINLOCK.  With the patch below, I can again
       successfully boot the kernel on my Versatile PB/926 platform.
       Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
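      The pointer arithmetic of the bug can be reproduced in a few lines of userspace C. This is a model of the mismatch, not the kernel code: on !SMP with DEBUG_SPINLOCK, bit_spin_lock() still sets bit 0 of head.first, but LIST_BL_LOCKMASK was 0UL, so hlist_bl_first() handed back a pointer with the lock bit still set (the 0x00000001 becoming the faulting 0x00000005 above):

```c
#include <assert.h>
#include <stdint.h>

#define LOCK_BIT 1UL

/* what bit_spin_lock() effectively does to head.first */
static uintptr_t bit_lock(uintptr_t first)
{
	return first | LOCK_BIT;
}

/* UP definition before the fix: a mask of 0UL strips nothing */
static uintptr_t first_up_buggy(uintptr_t head)
{
	return head & ~0UL;
}

/* after the fix the mask matches the bit the lock actually sets */
static uintptr_t first_fixed(uintptr_t head)
{
	return head & ~LOCK_BIT;
}
```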
    • mfd: ab8500-core chip version cut 2.0 support · 92d50a41
      Authored by Mattias Wallin
      This patch adds support for chip version 2.0 or cut 2.0.
      One new interrupt latch register - latch 12 - is introduced.
      Signed-off-by: Mattias Wallin <mattias.wallin@stericsson.com>
      Acked-by: Linus Walleij <linus.walleij@stericsson.com>
      Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
    • regulator: Support MAX8998/LP3974 DVS-GPIO · 735a3d9e
      Authored by MyungJoo Ham
      The previous driver did not support BUCK1-DVS3, BUCK1-DVS4, and
      BUCK2-DVS2 modes. This patch adds such modes and an option to block
      setting buck1/2 voltages out of the preset values.
      Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com>
      Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
      Acked-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
      Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
    • mfd: Support LP3974 RTC · 337ce5d1
      Authored by MyungJoo Ham
      The first releases of LP3974 have a large delay in RTC registers,
      which requires a 2-second delay after writing to an RTC register
      (recommended by National Semiconductor's engineers)
      before reading it.
      
      If the "rtc_delay" field of the platform data is true, the RTC driver
      assumes that such delays are required. Although we have not yet seen
      LP3974s that do not require such delays, we assume they will be
      released soon (or already have been); such chips are supported by
      "lp3974" without setting "rtc_delay" in the platform data.
      
      This patch adds delays with msleep when writing values to RTC registers
      if the platform data has rtc_delay set.
      Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com>
      Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
      Reviewed-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
      Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
    • mfd: MAX8998/LP3974 hibernation support · cdd137c9
      Authored by MyungJoo Ham
      This patch makes the driver save and restore register values
      for hibernation.
      Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com>
      Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
      Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
    • mfd: ab8500-core ioresources irq for subdrivers added · e098aded
      Authored by Mattias Wallin
      This patch adds the ioresources used by subdrivers to
      retrieve their interrupt.
      Signed-off-by: Mattias Wallin <mattias.wallin@stericsson.com>
      Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
    • mfd: Provide pm_runtime_no_callbacks flag in cell data · 4c90aa94
      Authored by Mark Brown
      Allow MFD cells to have pm_runtime_no_callbacks() called on them during
      registration. This causes the runtime PM framework to ignore them,
      allowing use of runtime PM to suspend the device as a whole even if
      not all drivers for the MFD can usefully implement runtime PM. For
      example, RTCs are likely to run continuously regardless of the power
      state of the system.
      Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
      Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
    • mfd: Add WM8326 support · 412dc11d
      Authored by Mark Brown
      The WM8326 is a high performance variant of the WM832x series with
      no software visible differences.
      Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
      Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
    • etherdevice.h: Add is_unicast_ether_addr function · 51e7eed7
      Authored by Tobias Klauser
      From a check for !is_multicast_ether_addr it is not always obvious that
      we're checking for a unicast address. So add this helper function to
      make those code paths easier to read.
      Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
      Acked-by: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
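      A userspace sketch of the helper pair (mirroring the kernel's definitions in spirit, not copied from etherdevice.h): an Ethernet address is multicast iff the least-significant bit of its first octet is set, so unicast is just the negation, but the named helper makes call sites self-describing:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Multicast bit is bit 0 of the first octet of the MAC address. */
static bool is_multicast_ether_addr(const uint8_t addr[6])
{
	return addr[0] & 0x01;
}

/* The new helper: a plainly-named negation of the multicast check. */
static bool is_unicast_ether_addr(const uint8_t addr[6])
{
	return !is_multicast_ether_addr(addr);
}
```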
    • net: remove dev_txq_stats_fold() · 1ac9ad13
      Authored by Eric Dumazet
      After recent changes (per-cpu stats on vlan/tunnels...), we no longer
      need the per-struct-netdev_queue tx_bytes/tx_packets/tx_dropped counters.
      
      The only remaining users are ixgbe, sch_teql, gianfar and macvlan:
      
      1) ixgbe can be converted to use existing tx_ring counters.
      
      2) macvlan incremented txq->tx_dropped, it can use the
      dev->stats.tx_dropped counter.
      
      3) sch_teql : almost revert ab35cd4b (Use net_device internal stats)
          Now we have ndo_get_stats64(), use it, even for "unsigned long"
      fields (No need to bring back a struct net_device_stats)
      
      4) gianfar adds a stats structure per tx queue to hold
      tx_bytes/tx_packets
      
      This removes a lockdep warning (and possible lockup) in rndis gadget,
      calling dev_get_stats() from hard IRQ context.
      
      Ref: http://www.spinics.net/lists/netdev/msg149202.html
      Reported-by: Neil Jones <neiljay@gmail.com>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Jarek Poplawski <jarkao2@gmail.com>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      CC: Sandeep Gopalpet <sandeep.kumar@freescale.com>
      CC: Michal Nazarewicz <mina86@mina86.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fs: hlist UP debug fixup · 2c675598
      Authored by Nick Piggin
      Po-Yu Chuang <ratbert.chuang@gmail.com> noticed that hlist_bl_set_first could
      crash on a UP system when LIST_BL_LOCKMASK is 0, because
      
      	LIST_BL_BUG_ON(!((unsigned long)h->first & LIST_BL_LOCKMASK));
      
      always evaluates to true.
      
      Fix the expression, and also avoid a dependency between bit spinlock
      implementation and list bl code (list code shouldn't know anything
      except that bit 0 is set when adding and removing elements). Eventually
      if a good use case comes up, we might use this list to store 1 or more
      arbitrary bits of data, so it really shouldn't be tied to locking either,
      but for now they are helpful for debugging.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
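      The failure mode in numbers, as an illustrative sketch (the predicate bodies are the expressions from the commit text, lifted out of the LIST_BL_BUG_ON macro): with LIST_BL_LOCKMASK == 0UL on UP, (first & mask) is always 0, so the old !(first & mask) check is true for every pointer, whereas comparing against the mask itself is benign on UP and still catches a missing lock bit on SMP:

```c
#include <assert.h>

/* Returns nonzero when the debug check would fire. */
static int old_check_fires(unsigned long first, unsigned long mask)
{
	return !(first & mask);		/* on UP (mask 0): always fires */
}

static int new_check_fires(unsigned long first, unsigned long mask)
{
	return (first & mask) != mask;	/* on UP (mask 0): never fires */
}
```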
    • nfsd: don't support msnfs export option · 9ce137ee
      Authored by J. Bruce Fields
      We've long had these pointless #ifdef MSNFS's sprinkled throughout the
      code--pointless because MSNFS is always defined (and we give no config
      option to make that easy to change).  So we could just remove the
      ifdef's and compile the resulting code unconditionally.
      
      But as long as we're there: why not just rip out this code entirely?
      The only purpose is to implement the "msnfs" export option which turns
      on Windows-like behavior in some cases, and:
      
      	- the export option isn't documented anywhere;
      	- the userland utilities (which would need to be able to parse
      	  "msnfs" in an export file) don't support it;
      	- I don't know how to maintain this, as I don't know what the
      	  proper behavior is; and
      	- google shows no evidence that anyone has ever used this.
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • memcg: fix memory migration of shmem swapcache · 50de1dd9
      Authored by Daisuke Nishimura
      In the current implementation mem_cgroup_end_migration() decides whether
      the page migration has succeeded or not by checking "oldpage->mapping".
      
      But if we are trying to migrate a shmem swapcache, its page->mapping
      is NULL from the beginning, so the check would be invalid.  As a result,
      mem_cgroup_end_migration() assumes the migration has succeeded even if
      it's not, so "newpage" would be freed while it's not uncharged.
      
      This patch fixes it by passing mem_cgroup_end_migration() the result of
      the page migration.
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: add lock to synchronize page accounting and migration · dbd4ea78
      Authored by KAMEZAWA Hiroyuki
      Introduce a new bit spin lock, PCG_MOVE_LOCK, to synchronize the page
      accounting and migration code.  This reworks the locking scheme of
      _update_stat() and _move_account() by adding new lock bit PCG_MOVE_LOCK,
      which is always taken under IRQ disable.
      
      1. If pages are being migrated from a memcg, then updates to that
         memcg page statistics are protected by grabbing PCG_MOVE_LOCK using
         move_lock_page_cgroup().  In an upcoming commit, memcg dirty page
         accounting will be updating memcg page accounting (specifically: num
         writeback pages) from IRQ context (softirq).  Avoid a deadlocking
         nested spin lock attempt by disabling irq on the local processor when
         grabbing the PCG_MOVE_LOCK.
      
      2. lock for update_page_stat is used only for avoiding race with
         move_account().  So, IRQ awareness of lock_page_cgroup() itself is not
         a problem.  The problem is between mem_cgroup_update_page_stat() and
         mem_cgroup_move_account_page().
      
      Trade-off:
        * Changing lock_page_cgroup() to always disable IRQ (or
          local_bh) has some impacts on performance and I think
          it's bad to disable IRQ when it's not necessary.
        * adding a new lock makes move_account() slower.  Score is
          here.
      
      Performance Impact: moving a 8G anon process.
      
      Before:
      	real    0m0.792s
      	user    0m0.000s
      	sys     0m0.780s
      
      After:
      	real    0m0.854s
      	user    0m0.000s
      	sys     0m0.842s
      
      This score is bad but planned patches for optimization can reduce
      this impact.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Andrea Righi <arighi@develer.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: create extensible page stat update routines · 2a7106f2
      Authored by Greg Thelen
      Replace usage of the mem_cgroup_update_file_mapped() memcg
      statistic update routine with two new routines:
      * mem_cgroup_inc_page_stat()
      * mem_cgroup_dec_page_stat()
      
      As before, only the file_mapped statistic is managed.  However, these more
      general interfaces allow for new statistics to be more easily added.  New
      statistics are added with memcg dirty page accounting.
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Signed-off-by: Andrea Righi <arighi@develer.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
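      The interface split can be sketched in userspace C. The routine names come from the commit; the enum, the backing array and the function bodies are invented for illustration (the real routines update per-memcg per-cpu counters):

```c
#include <assert.h>

/* Hypothetical stat index; the real set starts with file_mapped and
 * grows with the memcg dirty-page accounting series. */
enum mem_cgroup_page_stat_item {
	MEMCG_NR_FILE_MAPPED,
	MEMCG_NR_STAT_ITEMS,
};

static long page_stat[MEMCG_NR_STAT_ITEMS];

/* One signed generic updater... */
static void mem_cgroup_update_page_stat(enum mem_cgroup_page_stat_item idx,
					int val)
{
	page_stat[idx] += val;
}

/* ...wrapped by the two new self-describing entry points. */
static void mem_cgroup_inc_page_stat(enum mem_cgroup_page_stat_item idx)
{
	mem_cgroup_update_page_stat(idx, 1);
}

static void mem_cgroup_dec_page_stat(enum mem_cgroup_page_stat_item idx)
{
	mem_cgroup_update_page_stat(idx, -1);
}
```

      Adding a new statistic then only means adding an enum value, which is the extensibility the commit is after.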
    • memcg: add page_cgroup flags for dirty page tracking · db16d5ec
      Authored by Greg Thelen
      This patchset provides the ability for each cgroup to have independent
      dirty page limits.
      
      Limiting dirty memory is like fixing the max amount of dirty (hard to
      reclaim) page cache used by a cgroup.  So, in case of multiple cgroup
      writers, they will not be able to consume more than their designated share
      of dirty pages and will be forced to perform write-out if they cross that
      limit.
      
      The patches are based on a series proposed by Andrea Righi in Mar 2010.
      
      Overview:
      
      - Add page_cgroup flags to record when pages are dirty, in writeback, or nfs
        unstable.
      
      - Extend mem_cgroup to record the total number of pages in each of the
        interesting dirty states (dirty, writeback, unstable_nfs).
      
      - Add dirty parameters similar to the system-wide  /proc/sys/vm/dirty_*
        limits to mem_cgroup.  The mem_cgroup dirty parameters are accessible
        via cgroupfs control files.
      
      - Consider both system and per-memcg dirty limits in page writeback when
        deciding to queue background writeback or block for foreground writeback.
      
      Known shortcomings:
      
      - When a cgroup dirty limit is exceeded, then bdi writeback is employed to
        writeback dirty inodes.  Bdi writeback considers inodes from any cgroup, not
        just inodes contributing dirty pages to the cgroup exceeding its limit.
      
      - When memory.use_hierarchy is set, then dirty limits are disabled.  This is an
        implementation detail.  An enhanced implementation is needed to check the
        chain of parents to ensure that no dirty limit is exceeded.
      
      Performance data:
      - A page fault microbenchmark was used to measure performance; it can be
        run in read or write mode:
              f = open(foo.$cpu)
              truncate(f, 4096)
              alarm(60)
              while (1) {
                      p = mmap(f, 4096)
                      if (write)
                              *p = 1
                      else
                              x = *p
                      munmap(p)
              }
      
      - The workload was called for several points in the patch series in different
        modes:
        - s_read is a single threaded reader
        - s_write is a single threaded writer
        - p_read is a 16 thread reader, each operating on a different file
        - p_write is a 16 thread writer, each operating on a different file
      
      - Measurements were collected on a 16 core non-numa system using "perf stat
        --repeat 3".  The -a option was used for parallel (p_*) runs.
      
      - All numbers are page fault rate (M/sec).  Higher is better.
      
      - To compare the performance of a kernel without memcg, compare the first and
        last rows; neither has memcg configured, and the first row does not include
        any of these memcg patches.
      
      - To compare the performance of using memcg dirty limits, compare the baseline
        (2nd row, titled "w/ memcg") with the code and memcg enabled (2nd-to-last
        row, titled "all patches").
      
                                 root_cgroup                    child_cgroup
                       s_read s_write p_read p_write   s_read s_write p_read p_write
      mmotm w/o memcg   0.428  0.390   0.429  0.388
      mmotm w/ memcg    0.411  0.378   0.391  0.362     0.412  0.377   0.385  0.363
      all patches       0.384  0.360   0.370  0.348     0.381  0.363   0.368  0.347
      all patches
        w/o memcg       0.431  0.402   0.427  0.395
      
      This patch:
      
      Add additional flags to page_cgroup to track dirty pages within a
      mem_cgroup.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrea Righi <arighi@develer.com>
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      db16d5ec
    • M
      mm: migration: use rcu_dereference_protected when dereferencing the radix tree... · 29c1f677
      Mel Gorman 提交于
      mm: migration: use rcu_dereference_protected when dereferencing the radix tree slot during file page migration
      
      migrate_pages() -> unmap_and_move() only calls rcu_read_lock() for
      anonymous pages, as introduced by git commit
      989f89c5 ("fix rcu_read_lock() in page
      migraton").  The point of the RCU protection there is part of getting a
      stable reference to anon_vma and is only held for anon pages as file pages
      are locked which is sufficient protection against freeing.
      
      However, while a file page's mapping is being migrated, the radix tree is
      double checked to ensure it is the expected page.  This uses
      radix_tree_deref_slot() -> rcu_dereference() without the RCU lock held
      triggering the following warning.
      
      [  173.674290] ===================================================
      [  173.676016] [ INFO: suspicious rcu_dereference_check() usage. ]
      [  173.676016] ---------------------------------------------------
      [  173.676016] include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!
      [  173.676016]
      [  173.676016] other info that might help us debug this:
      [  173.676016]
      [  173.676016]
      [  173.676016] rcu_scheduler_active = 1, debug_locks = 0
      [  173.676016] 1 lock held by hugeadm/2899:
      [  173.676016]  #0:  (&(&inode->i_data.tree_lock)->rlock){..-.-.}, at: [<c10e3d2b>] migrate_page_move_mapping+0x40/0x1ab
      [  173.676016]
      [  173.676016] stack backtrace:
      [  173.676016] Pid: 2899, comm: hugeadm Not tainted 2.6.37-rc5-autobuild
      [  173.676016] Call Trace:
      [  173.676016]  [<c128cc01>] ? printk+0x14/0x1b
      [  173.676016]  [<c1063502>] lockdep_rcu_dereference+0x7d/0x86
      [  173.676016]  [<c10e3db5>] migrate_page_move_mapping+0xca/0x1ab
      [  173.676016]  [<c10e41ad>] migrate_page+0x23/0x39
      [  173.676016]  [<c10e491b>] buffer_migrate_page+0x22/0x107
      [  173.676016]  [<c10e48f9>] ? buffer_migrate_page+0x0/0x107
      [  173.676016]  [<c10e425d>] move_to_new_page+0x9a/0x1ae
      [  173.676016]  [<c10e47e6>] migrate_pages+0x1e7/0x2fa
      
      This patch introduces radix_tree_deref_slot_protected() which calls
      rcu_dereference_protected().  Users of it must pass in the
      mapping->tree_lock that is protecting this dereference.  Holding the tree
      lock protects against parallel updaters of the radix tree meaning that
      rcu_dereference_protected is allowable.
      
      [akpm@linux-foundation.org: remove unneeded casts]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Milton Miller <miltonm@bga.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: <stable@kernel.org>		[2.6.37.early]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      29c1f677
    • A
      thp: add compound_trans_head() helper · 22e5c47e
      Andrea Arcangeli 提交于
      Clean up some code with a common compound_trans_head() helper.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Avi Kivity <avi@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      22e5c47e
    • A
      thp: khugepaged: make khugepaged aware about madvise · 60ab3244
      Andrea Arcangeli 提交于
      MADV_HUGEPAGE and MADV_NOHUGEPAGE were fully effective only if run after
      mmap and before touching the memory.  While this is enough for most
      usages, it's little effort to make madvise more dynamic at runtime on an
      existing mapping by making khugepaged aware about madvise.
      
      MADV_HUGEPAGE: register in khugepaged immediately without waiting a page
      fault (that may not ever happen if all pages are already mapped and the
      "enabled" knob was set to madvise during the initial page faults).
      
      MADV_NOHUGEPAGE: skip vmas marked VM_NOHUGEPAGE in khugepaged to stop
      collapsing pages where not needed.
      
      [akpm@linux-foundation.org: tweak comment]
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      60ab3244
    • A
      thp: madvise(MADV_NOHUGEPAGE) · a664b2d8
      Andrea Arcangeli 提交于
      Add madvise MADV_NOHUGEPAGE to mark regions that are not important to be
      hugepage backed.  Return -EINVAL if the vma is not of an anonymous type,
      or the feature isn't built into the kernel.  Never silently return
      success.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a664b2d8
    • A
      thp: compound_trans_order · 37c2ac78
      Andrea Arcangeli 提交于
      Make reading compound_trans_order safe.  This is a no-op for
      CONFIG_TRANSPARENT_HUGEPAGE=n.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      37c2ac78
    • R
      thp: fix anon memory statistics with transparent hugepages · 2c888cfb
      Rik van Riel 提交于
      Count each transparent hugepage as HPAGE_PMD_NR pages in the LRU
      statistics, so the Active(anon) and Inactive(anon) statistics in
      /proc/meminfo are correct.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2c888cfb
    • A
      thp: use compaction in kswapd for GFP_ATOMIC order > 0 · 5a03b051
      Andrea Arcangeli 提交于
      This takes advantage of memory compaction to properly generate pages of
      order > 0 if regular page reclaim fails and priority level becomes more
      severe and we don't reach the proper watermarks.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5a03b051
    • A
      thp: mmu_notifier_test_young · 8ee53820
      Andrea Arcangeli 提交于
      For GRU and EPT, we need gup-fast to set the referenced bit too (this is why
      it's correct to return 0 when shadow_access_mask is zero: it requires
      gup-fast to set the referenced bit).  A qemu-kvm access already sets the
      young bit in the pte if it isn't zero-copy; if it's zero-copy or a shadow
      paging EPT minor fault, we rely on gup-fast to signal that the page is in
      use.
      
      We also need to check the young bits on the secondary pagetables for NPT
      and not nested shadow mmu as the data may never get accessed again by the
      primary pte.
      
      Without this closer accuracy, we'd have to remove the heuristic that
      avoids collapsing hugepages in hugepage virtual regions that have not even
      a single subpage in use.
      
      ->test_young is fully backwards compatible with GRU and other usages that
      don't have young bits in pagetables set by the hardware and that should
      nuke the secondary mmu mappings when ->clear_flush_young runs, just like
      EPT does.
      
      Removing the heuristic that checks the young bit in
      khugepaged/collapse_huge_page entirely probably wouldn't be so bad either,
      but I thought it was worth keeping, and this makes it reliable.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8ee53820
    • A
      thp: avoid breaking huge pmd invariants in case of vma_adjust failures · 94fcc585
      Andrea Arcangeli 提交于
      A huge pmd can only be mapped if the corresponding 2M virtual range is
      fully contained in the vma.  At times the VM calls split_vma twice: if the
      first split_vma succeeds and the second fails, the first split_vma remains
      in effect and is not rolled back.  For split_vma or vma_adjust to fail,
      an allocation failure is needed, so it's a very unlikely event (the out of
      memory killer would normally fire before any allocation failure is visible
      to kernel and userland, and if an out of memory condition happens it's
      unlikely to happen exactly here).  Nevertheless it's safer to ensure that
      no huge pmd can be left around if the vma is adjusted in a way that can't
      fit hugepages anymore at the new vm_start/vm_end addresses.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      94fcc585