1. 21 1月, 2011 3 次提交
  2. 20 1月, 2011 6 次提交
    • J
      netfilter: xtables: remove duplicate member · ba12b130
      Jan Engelhardt 提交于
      Accidentally missed removing the old out-of-union "inverse" member,
      which caused the struct size to change which then gives size mismatch
      warnings when using an old iptables.
      
      It is interesting to see that gcc did not warn about this before.
      (Filed http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47376 )
      Signed-off-by: NJan Engelhardt <jengelh@medozas.de>
      ba12b130
    • J
      netfilter: xtables: remove extraneous header that slipped in · 5d844928
      Jan Engelhardt 提交于
      Commit 0b8ad876 (netfilter: xtables: add missing header files to export
      list) erroneously added this.
      Signed-off-by: NJan Engelhardt <jengelh@medozas.de>
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      5d844928
    • J
      net_sched: implement a root container qdisc sch_mqprio · b8970f0b
      John Fastabend 提交于
      This implements a mqprio queueing discipline that by default creates
      a pfifo_fast qdisc per tx queue and provides the needed configuration
      interface.
      
      Using the mqprio qdisc the number of tcs currently in use along
      with the range of queues alloted to each class can be configured. By
      default skbs are mapped to traffic classes using the skb priority.
      This mapping is configurable.
      
      Configurable parameters,
      
      struct tc_mqprio_qopt {
      	__u8    num_tc;
      	__u8    prio_tc_map[TC_BITMASK + 1];
      	__u8    hw;
      	__u16   count[TC_MAX_QUEUE];
      	__u16   offset[TC_MAX_QUEUE];
      };
      
      Here the count/offset pairing give the queue alignment and the
      prio_tc_map gives the mapping from skb->priority to tc.
      
      The hw bit determines if the hardware should configure the count
      and offset values. If the hardware bit is set then the operation
      will fail if the hardware does not implement the ndo_setup_tc
      operation. This is to avoid undetermined states where the hardware
      may or may not control the queue mapping. Also minimal bounds
      checking is done on the count/offset to verify a queue does not
      exceed num_tx_queues and that queue ranges do not overlap. Otherwise
      it is left to user policy or hardware configuration to create
      useful mappings.
      
      It is expected that hardware QOS schemes can be implemented by
      creating appropriate mappings of queues in ndo_tc_setup().
      
      One expected use case is drivers will use the ndo_setup_tc to map
      queue ranges onto 802.1Q traffic classes. This provides a generic
      mechanism to map network traffic onto these traffic classes and
      removes the need for lower layer drivers to know specifics about
      traffic types.
      Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b8970f0b
    • J
      net: implement mechanism for HW based QOS · 4f57c087
      John Fastabend 提交于
      This patch provides a mechanism for lower layer devices to
      steer traffic using skb->priority to tx queues. This allows
      for hardware based QOS schemes to use the default qdisc without
      incurring the penalties related to global state and the qdisc
      lock. While reliably receiving skbs on the correct tx ring
      to avoid head of line blocking resulting from shuffling in
      the LLD. Finally, all the goodness from txq caching and xps/rps
      can still be leveraged.
      
      Many drivers and hardware exist with the ability to implement
      QOS schemes in the hardware but currently these drivers tend
      to rely on firmware to reroute specific traffic, a driver
      specific select_queue or the queue_mapping action in the
      qdisc.
      
      By using select_queue for this drivers need to be updated for
      each and every traffic type and we lose the goodness of much
      of the upstream work. Firmware solutions are inherently
      inflexible. And finally if admins are expected to build a
      qdisc and filter rules to steer traffic this requires knowledge
      of how the hardware is currently configured. The number of tx
      queues and the queue offsets may change depending on resources.
      Also this approach incurs all the overhead of a qdisc with filters.
      
      With the mechanism in this patch users can set skb priority using
      expected methods ie setsockopt() or the stack can set the priority
      directly. Then the skb will be steered to the correct tx queues
      aligned with hardware QOS traffic classes. In the normal case with
      single traffic class and all queues in this class everything
      works as is until the LLD enables multiple tcs.
      
      To steer the skb we mask out the lower 4 bits of the priority
      and allow the hardware to configure upto 15 distinct classes
      of traffic. This is expected to be sufficient for most applications
      at any rate it is more then the 8021Q spec designates and is
      equal to the number of prio bands currently implemented in
      the default qdisc.
      
      This in conjunction with a userspace application such as
      lldpad can be used to implement 8021Q transmission selection
      algorithms one of these algorithms being the extended transmission
      selection algorithm currently being used for DCB.
      Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4f57c087
    • V
      net_device: add support for network device groups · cbda10fa
      Vlad Dogaru 提交于
      Net devices can now be grouped, enabling simpler manipulation from
      userspace. This patch adds a group field to the net_device structure, as
      well as rtnetlink support to query and modify it.
      Signed-off-by: NVlad Dogaru <ddvlad@rosedu.org>
      Acked-by: NJamal Hadi Salim <hadi@cyberus.ca>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cbda10fa
    • J
      netfilter: xtables: connlimit revision 1 · cc4fc022
      Jan Engelhardt 提交于
      This adds destination address-based selection. The old "inverse"
      member is overloaded (memory-wise) with a new "flags" variable,
      similar to how J.Park did it with xt_string rev 1. Since revision 0
      userspace only sets flag 0x1, no great changes are made to explicitly
      test for different revisions.
      Signed-off-by: NJan Engelhardt <jengelh@medozas.de>
      cc4fc022
  3. 19 1月, 2011 3 次提交
    • P
      netfilter: nf_conntrack_tstamp: add flow-based timestamp extension · a992ca2a
      Pablo Neira Ayuso 提交于
      This patch adds flow-based timestamping for conntracks. This
      conntrack extension is disabled by default. Basically, we use
      two 64-bits variables to store the creation timestamp once the
      conntrack has been confirmed and the other to store the deletion
      time. This extension is disabled by default, to enable it, you
      have to:
      
      echo 1 > /proc/sys/net/netfilter/nf_conntrack_timestamp
      
      This patch allows to save memory for user-space flow-based
      loogers such as ulogd2. In short, ulogd2 does not need to
      keep a hashtable with the conntrack in user-space to know
      when they were created and destroyed, instead we use the
      kernel timestamp. If we want to have a sane IPFIX implementation
      in user-space, this nanosecs resolution timestamps are also
      useful. Other custom user-space applications can benefit from
      this via libnetfilter_conntrack.
      
      This patch modifies the /proc output to display the delta time
      in seconds since the flow start. You can also obtain the
      flow-start date by means of the conntrack-tools.
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      a992ca2a
    • E
      net: filter: dont block softirqs in sk_run_filter() · 80f8f102
      Eric Dumazet 提交于
      Packet filter (BPF) doesnt need to disable softirqs, being fully
      re-entrant and lock-less.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      80f8f102
    • J
      netfilter: nf_conntrack: nf_conntrack snmp helper · 93557f53
      Jiri Olsa 提交于
      Adding support for SNMP broadcast connection tracking. The SNMP
      broadcast requests are now paired with the SNMP responses.
      Thus allowing using SNMP broadcasts with firewall enabled.
      
      Please refer to the following conversation:
      http://marc.info/?l=netfilter-devel&m=125992205006600&w=2
      
      Patrick McHardy wrote:
      > > The best solution would be to add generic broadcast tracking, the
      > > use of expectations for this is a bit of abuse.
      > > The second best choice I guess would be to move the help() function
      > > to a shared module and generalize it so it can be used for both.
      This patch implements the "second best choice".
      
      Since the netbios-ns conntrack module uses the same helper
      functionality as the snmp, only one helper function is added
      for both snmp and netbios-ns modules into the new object -
      nf_conntrack_broadcast.
      Signed-off-by: NJiri Olsa <jolsa@redhat.com>
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      93557f53
  4. 18 1月, 2011 6 次提交
  5. 17 1月, 2011 10 次提交
    • N
      fs: fix address space warnings in ioctl_fiemap() · ecf5632d
      Namhyung Kim 提交于
      The fi_extents_start field of struct fiemap_extent_info is a
      user pointer but was not marked as __user. This makes sparse
      emit following warnings:
      
        CHECK   fs/ioctl.c
      fs/ioctl.c:114:26: warning: incorrect type in argument 1 (different address spaces)
      fs/ioctl.c:114:26:    expected void [noderef] <asn:1>*dst
      fs/ioctl.c:114:26:    got struct fiemap_extent *[assigned] dest
      fs/ioctl.c:202:14: warning: incorrect type in argument 1 (different address spaces)
      fs/ioctl.c:202:14:    expected void const volatile [noderef] <asn:1>*<noident>
      fs/ioctl.c:202:14:    got struct fiemap_extent *[assigned] fi_extents_start
      fs/ioctl.c:212:27: warning: incorrect type in argument 1 (different address spaces)
      fs/ioctl.c:212:27:    expected void [noderef] <asn:1>*dst
      fs/ioctl.c:212:27:    got char *<noident>
      
      Also add 'ufiemap' variable to eliminate unnecessary casts.
      Signed-off-by: NNamhyung Kim <namhyung@gmail.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      ecf5632d
    • S
      fs: Remove unlikely() from fput_light() · c2b3e74b
      Steven Rostedt 提交于
      In fput_light(), there's an unlikely(fput_needed), which running on
      my normal desktop doing firefox, xchat, evolution and part of my distcc farm,
      and running the annotate branch profiler shows that the unlikely is not
      very unlikely.
      
       correct incorrect  %        Function             File              Line
       ------- ---------  -        --------             ----              ----
             0       48 100 fput_light                file.h               26
      115828710 897415279  88 fput_light              file.h               26
      865271179 5286128445  85 fput_light             file.h               26
      19568539  8923664  31 fput_light                file.h               26
      12353677  3562279  22 fput_light                file.h               26
        267691    67062  20 fput_light                file.h               26
      15014853   348172   2 fput_light                file.h               26
        209258      205   0 fput_light                file.h               26
       1364164        0   0 fput_light                file.h               26
      
      Which gives 1032903812 times it was correct and 6203351846 times it was
      incorrect, or 85% incorrect.
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      c2b3e74b
    • C
      fallocate should be a file operation · 2fe17c10
      Christoph Hellwig 提交于
      Currently all filesystems except XFS implement fallocate asynchronously,
      while XFS forced a commit.  Both of these are suboptimal - in case of O_SYNC
      I/O we really want our allocation on disk, especially for the !KEEP_SIZE
      case where we actually grow the file with user-visible zeroes.  On the
      other hand always commiting the transaction is a bad idea for fast-path
      uses of fallocate like for example in recent Samba versions.   Given
      that block allocation is a data plane operation anyway change it from
      an inode operation to a file operation so that we have the file structure
      available that lets us check for O_SYNC.
      
      This also includes moving the code around for a few of the filesystems,
      and remove the already unnedded S_ISDIR checks given that we only wire
      up fallocate for regular files.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2fe17c10
    • T
      RDMA: Update workqueue usage · f0626710
      Tejun Heo 提交于
      * ib_wq is added, which is used as the common workqueue for infiniband
        instead of the system workqueue.  All system workqueue usages
        including flush_scheduled_work() callers are converted to use and
        flush ib_wq.
      
      * cancel_delayed_work() + flush_scheduled_work() converted to
        cancel_delayed_work_sync().
      
      * qib_wq is removed and ib_wq is used instead.
      
      This is to prepare for deprecation of flush_scheduled_work().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NRoland Dreier <rolandd@cisco.com>
      f0626710
    • R
      ARM: PL08x: cleanup comments · 94ae8522
      Russell King - ARM Linux 提交于
      Cleanup the formatting of comments, remove some which don't make sense
      anymore.
      Signed-off-by: NRussell King <rmk+kernel@arm.linux.org.uk>
      [fix conflict with 96a608a4]
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      94ae8522
    • A
      fix non-x86 build failure in pmdp_get_and_clear · b3697c02
      Andrea Arcangeli 提交于
      pmdp_get_and_clear/pmdp_clear_flush/pmdp_splitting_flush were trapped as
      BUG() and they were defined only to diminish the risk of build issues on
      not-x86 archs and to be consistent with the generic pte methods previously
      defined in include/asm-generic/pgtable.h.
      
      But they are causing more trouble than they were supposed to solve, so
      it's simpler not to define them when THP is off.
      
      This is also correcting the export of pmdp_splitting_flush which is
      currently unused (x86 isn't using the generic implementation in
      mm/pgtable-generic.c and no other arch needs that [yet]).
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Sam Ravnborg <sam@ravnborg.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b3697c02
    • R
      PCI / ACPI: Fix build of the AER driver for CONFIG_ACPI unset · fc8fe1e9
      Rafael J. Wysocki 提交于
      After commit 415e12b2 ("PCI/ACPI: Request _OSC control once for each
      root bridge (v3)") include/linux/pci-acpi.h is included by
      drivers/pci/pcie/aer/aerdrv.c and if CONFIG_ACPI is unset, the bogus and
      unnecessary alternative definition of acpi_find_root_bridge_handle()
      causes a build error to occur.
      
      Remove the offending piece of garbage.
      Reported-and-tested-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fc8fe1e9
    • A
      sanitize vfsmount refcounting changes · f03c6599
      Al Viro 提交于
      Instead of splitting refcount between (per-cpu) mnt_count
      and (SMP-only) mnt_longrefs, make all references contribute
      to mnt_count again and keep track of how many are longterm
      ones.
      
      Accounting rules for longterm count:
      	* 1 for each fs_struct.root.mnt
      	* 1 for each fs_struct.pwd.mnt
      	* 1 for having non-NULL ->mnt_ns
      	* decrement to 0 happens only under vfsmount lock exclusive
      
      That allows nice common case for mntput() - since we can't drop the
      final reference until after mnt_longterm has reached 0 due to the rules
      above, mntput() can grab vfsmount lock shared and check mnt_longterm.
      If it turns out to be non-zero (which is the common case), we know
      that this is not the final mntput() and can just blindly decrement
      percpu mnt_count.  Otherwise we grab vfsmount lock exclusive and
      do usual decrement-and-check of percpu mnt_count.
      
      For fs_struct.c we have mnt_make_longterm() and mnt_make_shortterm();
      namespace.c uses the latter in places where we don't already hold
      vfsmount lock exclusive and opencodes a few remaining spots where
      we need to manipulate mnt_longterm.
      
      Note that we mostly revert the code outside of fs/namespace.c back
      to what we used to have; in particular, normal code doesn't need
      to care about two kinds of references, etc.  And we get to keep
      the optimization Nick's variant had bought us...
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      f03c6599
    • T
      netfilter: create audit records for x_tables replaces · fbabf31e
      Thomas Graf 提交于
      The setsockopt() syscall to replace tables is already recorded
      in the audit logs. This patch stores additional information
      such as table name and netfilter protocol.
      
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Eric Paris <eparis@parisplace.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NThomas Graf <tgraf@redhat.com>
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      fbabf31e
    • T
      netfilter: audit target to record accepted/dropped packets · 43f393ca
      Thomas Graf 提交于
      This patch adds a new netfilter target which creates audit records
      for packets traversing a certain chain.
      
      It can be used to record packets which are rejected administraively
      as follows:
      
        -N AUDIT_DROP
        -A AUDIT_DROP -j AUDIT --type DROP
        -A AUDIT_DROP -j DROP
      
      a rule which would typically drop or reject a packet would then
      invoke the new chain to record packets before dropping them.
      
        -j AUDIT_DROP
      
      The module is protocol independant and works for iptables, ip6tables
      and ebtables.
      
      The following information is logged:
       - netfilter hook
       - packet length
       - incomming/outgoing interface
       - MAC src/dst/proto for ethernet packets
       - src/dst/protocol address for IPv4/IPv6
       - src/dst port for TCP/UDP/UDPLITE
       - icmp type/code
      
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Eric Paris <eparis@parisplace.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NThomas Graf <tgraf@redhat.com>
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      43f393ca
  6. 16 1月, 2011 8 次提交
    • G
      dt/flattree: Return virtual address from early_init_dt_alloc_memory_arch() · 672c5446
      Grant Likely 提交于
      The physical address is never used by the device tree code when
      allocating memory for unflattening.  Change the architecture's alloc
      hook to return the virutal address instead.
      Signed-off-by: NGrant Likely <grant.likely@secretlab.ca>
      672c5446
    • D
      Unexport do_add_mount() and add in follow_automount(), not ->d_automount() · ea5b778a
      David Howells 提交于
      Unexport do_add_mount() and make ->d_automount() return the vfsmount to be
      added rather than calling do_add_mount() itself.  follow_automount() will then
      do the addition.
      
      This slightly complicates things as ->d_automount() normally wants to add the
      new vfsmount to an expiration list and start an expiration timer.  The problem
      with that is that the vfsmount will be deleted if it has a refcount of 1 and
      the timer will not repeat if the expiration list is empty.
      
      To this end, we require the vfsmount to be returned from d_automount() with a
      refcount of (at least) 2.  One of these refs will be dropped unconditionally.
      In addition, follow_automount() must get a 3rd ref around the call to
      do_add_mount() lest it eat a ref and return an error, leaving the mount we
      have open to being expired as we would otherwise have only 1 ref on it.
      
      d_automount() should also add the the vfsmount to the expiration list (by
      calling mnt_set_expiry()) and start the expiration timer before returning, if
      this mechanism is to be used.  The vfsmount will be unlinked from the
      expiration list by follow_automount() if do_add_mount() fails.
      
      This patch also fixes the call to do_add_mount() for AFS to propagate the mount
      flags from the parent vfsmount.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      ea5b778a
    • D
      Allow d_manage() to be used in RCU-walk mode · ab90911f
      David Howells 提交于
      Allow d_manage() to be called from pathwalk when it is in RCU-walk mode as well
      as when it is in Ref-walk mode.  This permits __follow_mount_rcu() to call
      d_manage() directly.  d_manage() needs a parameter to indicate that it is in
      RCU-walk mode as it isn't allowed to sleep if in that mode (but should return
      -ECHILD instead).
      
      autofs4_d_manage() can then be set to retain RCU-walk mode if the daemon
      accesses it and otherwise request dropping back to ref-walk mode.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      ab90911f
    • I
      autofs4: Bump version · 1972580b
      Ian Kent 提交于
      Increase the autofs module sub-version so we can tell what kernel
      implementation is being used from user space debug logging.
      Signed-off-by: NIan Kent <raven@themaw.net>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      1972580b
    • D
      NFS: Use d_automount() rather than abusing follow_link() · 36d43a43
      David Howells 提交于
      Make NFS use the new d_automount() dentry operation rather than abusing
      follow_link() on directories.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      Acked-by: NIan Kent <raven@themaw.net>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      36d43a43
    • D
      Add an AT_NO_AUTOMOUNT flag to suppress terminal automount · 6f45b656
      David Howells 提交于
      Add an AT_NO_AUTOMOUNT flag to suppress terminal automounting of automount
      point directories.  This can be used by fstatat() users to permit the
      gathering of attributes on an automount point and also prevent
      mass-automounting of a directory of automount points by ls.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NIan Kent <raven@themaw.net>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      6f45b656
    • D
      Add a dentry op to allow processes to be held during pathwalk transit · cc53ce53
      David Howells 提交于
      Add a dentry op (d_manage) to permit a filesystem to hold a process and make it
      sleep when it tries to transit away from one of that filesystem's directories
      during a pathwalk.  The operation is keyed off a new dentry flag
      (DCACHE_MANAGE_TRANSIT).
      
      The filesystem is allowed to be selective about which processes it holds and
      which it permits to continue on or prohibits from transiting from each flagged
      directory.  This will allow autofs to hold up client processes whilst letting
      its userspace daemon through to maintain the directory or the stuff behind it
      or mounted upon it.
      
      The ->d_manage() dentry operation:
      
      	int (*d_manage)(struct path *path, bool mounting_here);
      
      takes a pointer to the directory about to be transited away from and a flag
      indicating whether the transit is undertaken by do_add_mount() or
      do_move_mount() skipping through a pile of filesystems mounted on a mountpoint.
      
      It should return 0 if successful and to let the process continue on its way;
      -EISDIR to prohibit the caller from skipping to overmounted filesystems or
      automounting, and to use this directory; or some other error code to return to
      the user.
      
      ->d_manage() is called with namespace_sem writelocked if mounting_here is true
      and no other locks held, so it may sleep.  However, if mounting_here is true,
      it may not initiate or wait for a mount or unmount upon the parameter
      directory, even if the act is actually performed by userspace.
      
      Within fs/namei.c, follow_managed() is extended to check with d_manage() first
      on each managed directory, before transiting away from it or attempting to
      automount upon it.
      
      follow_down() is renamed follow_down_one() and should only be used where the
      filesystem deliberately intends to avoid management steps (e.g. autofs).
      
      A new follow_down() is added that incorporates the loop done by all other
      callers of follow_down() (do_add/move_mount(), autofs and NFSD; whilst AFS, NFS
      and CIFS do use it, their use is removed by converting them to use
      d_automount()).  The new follow_down() calls d_manage() as appropriate.  It
      also takes an extra parameter to indicate if it is being called from mount code
      (with namespace_sem writelocked) which it passes to d_manage().  follow_down()
      ignores automount points so that it can be used to mount on them.
      
      __follow_mount_rcu() is made to abort rcu-walk mode if it hits a directory with
      DCACHE_MANAGE_TRANSIT set on the basis that we're probably going to have to
      sleep.  It would be possible to enter d_manage() in rcu-walk mode too, and have
      that determine whether to abort or not itself.  That would allow the autofs
      daemon to continue on in rcu-walk mode.
      
      Note that DCACHE_MANAGE_TRANSIT on a directory should be cleared when it isn't
      required as every tranist from that directory will cause d_manage() to be
      invoked.  It can always be set again when necessary.
      
      ==========================
      WHAT THIS MEANS FOR AUTOFS
      ==========================
      
      Autofs currently uses the lookup() inode op and the d_revalidate() dentry op to
      trigger the automounting of indirect mounts, and both of these can be called
      with i_mutex held.
      
      autofs knows that the i_mutex will be held by the caller in lookup(), and so
      can drop it before invoking the daemon - but this isn't so for d_revalidate(),
      since the lock is only held on _some_ of the code paths that call it.  This
      means that autofs can't risk dropping i_mutex from its d_revalidate() function
      before it calls the daemon.
      
      The bug could manifest itself as, for example, a process that's trying to
      validate an automount dentry that gets made to wait because that dentry is
      expired and needs cleaning up:
      
      	mkdir         S ffffffff8014e05a     0 32580  24956
      	Call Trace:
      	 [<ffffffff885371fd>] :autofs4:autofs4_wait+0x674/0x897
      	 [<ffffffff80127f7d>] avc_has_perm+0x46/0x58
      	 [<ffffffff8009fdcf>] autoremove_wake_function+0x0/0x2e
      	 [<ffffffff88537be6>] :autofs4:autofs4_expire_wait+0x41/0x6b
      	 [<ffffffff88535cfc>] :autofs4:autofs4_revalidate+0x91/0x149
      	 [<ffffffff80036d96>] __lookup_hash+0xa0/0x12f
      	 [<ffffffff80057a2f>] lookup_create+0x46/0x80
      	 [<ffffffff800e6e31>] sys_mkdirat+0x56/0xe4
      
      versus the automount daemon which wants to remove that dentry, but can't
      because the normal process is holding the i_mutex lock:
      
      	automount     D ffffffff8014e05a     0 32581      1              32561
      	Call Trace:
      	 [<ffffffff80063c3f>] __mutex_lock_slowpath+0x60/0x9b
      	 [<ffffffff8000ccf1>] do_path_lookup+0x2ca/0x2f1
      	 [<ffffffff80063c89>] .text.lock.mutex+0xf/0x14
      	 [<ffffffff800e6d55>] do_rmdir+0x77/0xde
      	 [<ffffffff8005d229>] tracesys+0x71/0xe0
      	 [<ffffffff8005d28d>] tracesys+0xd5/0xe0
      
      which means that the system is deadlocked.
      
      This patch allows autofs to hold up normal processes whilst the daemon goes
      ahead and does things to the dentry tree behind the automouter point without
      risking a deadlock as almost no locks are held in d_manage() and none in
      d_automount().
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Was-Acked-by: NIan Kent <raven@themaw.net>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      cc53ce53
    • D
      Add a dentry op to handle automounting rather than abusing follow_link() · 9875cf80
      David Howells 提交于
      Add a dentry op (d_automount) to handle automounting directories rather than
      abusing the follow_link() inode operation.  The operation is keyed off a new
      dentry flag (DCACHE_NEED_AUTOMOUNT).
      
      This also makes it easier to add an AT_ flag to suppress terminal segment
      automount during pathwalk and removes the need for the kludge code in the
      pathwalk algorithm to handle directories with follow_link() semantics.
      
      The ->d_automount() dentry operation:
      
      	struct vfsmount *(*d_automount)(struct path *mountpoint);
      
      takes a pointer to the directory to be mounted upon, which is expected to
      provide sufficient data to determine what should be mounted.  If successful, it
      should return the vfsmount struct it creates (which it should also have added
      to the namespace using do_add_mount() or similar).  If there's a collision with
      another automount attempt, NULL should be returned.  If the directory specified
      by the parameter should be used directly rather than being mounted upon,
      -EISDIR should be returned.  In any other case, an error code should be
      returned.
      
      The ->d_automount() operation is called with no locks held and may sleep.  At
      this point the pathwalk algorithm will be in ref-walk mode.
      
      Within fs/namei.c itself, a new pathwalk subroutine (follow_automount()) is
      added to handle mountpoints.  It will return -EREMOTE if the automount flag was
      set, but no d_automount() op was supplied, -ELOOP if we've encountered too many
      symlinks or mountpoints, -EISDIR if the walk point should be used without
      mounting and 0 if successful.  The path will be updated to point to the mounted
      filesystem if a successful automount took place.
      
      __follow_mount() is replaced by follow_managed() which is more generic
      (especially with the patch that adds ->d_manage()).  This handles transits from
      directories during pathwalk, including automounting and skipping over
      mountpoints (and holding processes with the next patch).
      
      __follow_mount_rcu() will jump out of RCU-walk mode if it encounters an
      automount point with nothing mounted on it.
      
      follow_dotdot*() does not handle automounts as you don't want to trigger them
      whilst following "..".
      
      I've also extracted the mount/don't-mount logic from autofs4 and included it
      here.  It makes the mount go ahead anyway if someone calls open() or creat(),
      tries to traverse the directory, tries to chdir/chroot/etc. into the directory,
      or sticks a '/' on the end of the pathname.  If they do a stat(), however,
      they'll only trigger the automount if they didn't also say O_NOFOLLOW.
      
      I've also added an inode flag (S_AUTOMOUNT) so that filesystems can mark their
      inodes as automount points.  This flag is automatically propagated to the
      dentry as DCACHE_NEED_AUTOMOUNT by __d_instantiate().  This saves NFS and could
      save AFS a private flag bit apiece, but is not strictly necessary.  It would be
      preferable to do the propagation in d_set_d_op(), but that doesn't normally
      have access to the inode.
      
      [AV: fixed breakage in case if __follow_mount_rcu() fails and nameidata_drop_rcu()
      succeeds in RCU case of do_lookup(); we need to fall through to non-RCU case after
      that, rather than just returning with ungrabbed *path]
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Was-Acked-by: NIan Kent <raven@themaw.net>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      9875cf80
  7. 15 1月, 2011 4 次提交