- 21 1月, 2011 4 次提交
-
-
由 Eric Dumazet 提交于
This patch converts stab qdisc management to RCU, so that we can perform the qdisc_calculate_pkt_len() call before getting qdisc lock. This shortens the lock's held time in __dev_xmit_skb(). This permits more qdiscs to get TCQ_F_CAN_BYPASS status, avoiding lot of cache misses and so reducing latencies. Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com> CC: Patrick McHardy <kaber@trash.net> CC: Jesper Dangaard Brouer <hawk@diku.dk> CC: Jarek Poplawski <jarkao2@gmail.com> CC: Jamal Hadi Salim <hadi@cyberus.ca> CC: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
In commit 37112105 (net: QDISC_STATE_RUNNING dont need atomic bit ops) I moved QDISC_STATE_RUNNING flag to __state container, located in the cache line containing qdisc lock and often dirtied fields. I now move TCQ_F_THROTTLED bit too, so that we let first cache line read mostly, and shared by all cpus. This should speedup HTB/CBQ for example. Not using test_bit()/__clear_bit()/__test_and_set_bit allows to use an "unsigned int" for __state container, reducing by 8 bytes Qdisc size. Introduce helpers to hide implementation details. Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com> CC: Patrick McHardy <kaber@trash.net> CC: Jesper Dangaard Brouer <hawk@diku.dk> CC: Jarek Poplawski <jarkao2@gmail.com> CC: Jamal Hadi Salim <hadi@cyberus.ca> CC: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Patrick McHardy 提交于
net/built-in.o: In function `nf_conntrack_init_net': net/netfilter/nf_conntrack_core.c:1521: undefined reference to `nf_conntrack_tstamp_init' net/netfilter/nf_conntrack_core.c:1531: undefined reference to `nf_conntrack_tstamp_fini' Add dummy inline functions for the =n case to fix this. Reported-by: NJohn Fastabend <john.r.fastabend@intel.com> Signed-off-by: NPatrick McHardy <kaber@trash.net>
-
由 Jan Engelhardt 提交于
Resolve these warnings on `make headers_check`: usr/include/linux/netfilter/xt_CT.h:7: found __[us]{8,16,32,64} type without #include <linux/types.h> ... Signed-off-by: NJan Engelhardt <jengelh@medozas.de>
-
- 20 1月, 2011 6 次提交
-
-
由 Jan Engelhardt 提交于
Accidentally missed removing the old out-of-union "inverse" member, which caused the struct size to change which then gives size mismatch warnings when using an old iptables. It is interesting to see that gcc did not warn about this before. (Filed http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47376 ) Signed-off-by: NJan Engelhardt <jengelh@medozas.de>
-
由 Jan Engelhardt 提交于
Commit 0b8ad876 (netfilter: xtables: add missing header files to export list) erroneously added this. Signed-off-by: NJan Engelhardt <jengelh@medozas.de> Signed-off-by: NPatrick McHardy <kaber@trash.net>
-
由 John Fastabend 提交于
This implements a mqprio queueing discipline that by default creates a pfifo_fast qdisc per tx queue and provides the needed configuration interface. Using the mqprio qdisc the number of tcs currently in use along with the range of queues alloted to each class can be configured. By default skbs are mapped to traffic classes using the skb priority. This mapping is configurable. Configurable parameters, struct tc_mqprio_qopt { __u8 num_tc; __u8 prio_tc_map[TC_BITMASK + 1]; __u8 hw; __u16 count[TC_MAX_QUEUE]; __u16 offset[TC_MAX_QUEUE]; }; Here the count/offset pairing give the queue alignment and the prio_tc_map gives the mapping from skb->priority to tc. The hw bit determines if the hardware should configure the count and offset values. If the hardware bit is set then the operation will fail if the hardware does not implement the ndo_setup_tc operation. This is to avoid undetermined states where the hardware may or may not control the queue mapping. Also minimal bounds checking is done on the count/offset to verify a queue does not exceed num_tx_queues and that queue ranges do not overlap. Otherwise it is left to user policy or hardware configuration to create useful mappings. It is expected that hardware QOS schemes can be implemented by creating appropriate mappings of queues in ndo_tc_setup(). One expected use case is drivers will use the ndo_setup_tc to map queue ranges onto 802.1Q traffic classes. This provides a generic mechanism to map network traffic onto these traffic classes and removes the need for lower layer drivers to know specifics about traffic types. Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 John Fastabend 提交于
This patch provides a mechanism for lower layer devices to steer traffic using skb->priority to tx queues. This allows for hardware based QOS schemes to use the default qdisc without incurring the penalties related to global state and the qdisc lock. While reliably receiving skbs on the correct tx ring to avoid head of line blocking resulting from shuffling in the LLD. Finally, all the goodness from txq caching and xps/rps can still be leveraged. Many drivers and hardware exist with the ability to implement QOS schemes in the hardware but currently these drivers tend to rely on firmware to reroute specific traffic, a driver specific select_queue or the queue_mapping action in the qdisc. By using select_queue for this drivers need to be updated for each and every traffic type and we lose the goodness of much of the upstream work. Firmware solutions are inherently inflexible. And finally if admins are expected to build a qdisc and filter rules to steer traffic this requires knowledge of how the hardware is currently configured. The number of tx queues and the queue offsets may change depending on resources. Also this approach incurs all the overhead of a qdisc with filters. With the mechanism in this patch users can set skb priority using expected methods ie setsockopt() or the stack can set the priority directly. Then the skb will be steered to the correct tx queues aligned with hardware QOS traffic classes. In the normal case with single traffic class and all queues in this class everything works as is until the LLD enables multiple tcs. To steer the skb we mask out the lower 4 bits of the priority and allow the hardware to configure upto 15 distinct classes of traffic. This is expected to be sufficient for most applications at any rate it is more then the 8021Q spec designates and is equal to the number of prio bands currently implemented in the default qdisc. This in conjunction with a userspace application such as lldpad can be used to implement 8021Q transmission selection algorithms one of these algorithms being the extended transmission selection algorithm currently being used for DCB. Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Vlad Dogaru 提交于
Net devices can now be grouped, enabling simpler manipulation from userspace. This patch adds a group field to the net_device structure, as well as rtnetlink support to query and modify it. Signed-off-by: NVlad Dogaru <ddvlad@rosedu.org> Acked-by: NJamal Hadi Salim <hadi@cyberus.ca> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jan Engelhardt 提交于
This adds destination address-based selection. The old "inverse" member is overloaded (memory-wise) with a new "flags" variable, similar to how J.Park did it with xt_string rev 1. Since revision 0 userspace only sets flag 0x1, no great changes are made to explicitly test for different revisions. Signed-off-by: NJan Engelhardt <jengelh@medozas.de>
-
- 19 1月, 2011 3 次提交
-
-
由 Pablo Neira Ayuso 提交于
This patch adds flow-based timestamping for conntracks. This conntrack extension is disabled by default. Basically, we use two 64-bits variables to store the creation timestamp once the conntrack has been confirmed and the other to store the deletion time. This extension is disabled by default, to enable it, you have to: echo 1 > /proc/sys/net/netfilter/nf_conntrack_timestamp This patch allows to save memory for user-space flow-based loogers such as ulogd2. In short, ulogd2 does not need to keep a hashtable with the conntrack in user-space to know when they were created and destroyed, instead we use the kernel timestamp. If we want to have a sane IPFIX implementation in user-space, this nanosecs resolution timestamps are also useful. Other custom user-space applications can benefit from this via libnetfilter_conntrack. This patch modifies the /proc output to display the delta time in seconds since the flow start. You can also obtain the flow-start date by means of the conntrack-tools. Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: NPatrick McHardy <kaber@trash.net>
-
由 Eric Dumazet 提交于
Packet filter (BPF) doesnt need to disable softirqs, being fully re-entrant and lock-less. Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jiri Olsa 提交于
Adding support for SNMP broadcast connection tracking. The SNMP broadcast requests are now paired with the SNMP responses. Thus allowing using SNMP broadcasts with firewall enabled. Please refer to the following conversation: http://marc.info/?l=netfilter-devel&m=125992205006600&w=2 Patrick McHardy wrote: > > The best solution would be to add generic broadcast tracking, the > > use of expectations for this is a bit of abuse. > > The second best choice I guess would be to move the help() function > > to a shared module and generalize it so it can be used for both. This patch implements the "second best choice". Since the netbios-ns conntrack module uses the same helper functionality as the snmp, only one helper function is added for both snmp and netbios-ns modules into the new object - nf_conntrack_broadcast. Signed-off-by: NJiri Olsa <jolsa@redhat.com> Signed-off-by: NPatrick McHardy <kaber@trash.net>
-
- 18 1月, 2011 6 次提交
-
-
由 Florian Westphal 提交于
If an skb is to be NF_QUEUE'd, but no program has opened the queue, the packet is dropped. This adds a v2 target revision of xt_NFQUEUE that allows packets to continue through the ruleset instead. Because the actual queueing happens outside of the target context, the 'bypass' flag has to be communicated back to the netfilter core. Unfortunately the only choice to do this without adding a new function argument is to use the target function return value (i.e. the verdict). In the NF_QUEUE case, the upper 16bit already contain the queue number to use. The previous patch reduced NF_VERDICT_MASK to 0xff, i.e. we now have extra room for a new flag. If a hook issued a NF_QUEUE verdict, then the netfilter core will continue packet processing if the queueing hook returns -ESRCH (== "this queue does not exist") and the new NF_VERDICT_FLAG_QUEUE_BYPASS flag is set in the verdict value. Note: If the queue exists, but userspace does not consume packets fast enough, the skb will still be dropped. Signed-off-by: NFlorian Westphal <fwestphal@astaro.com> Signed-off-by: NPatrick McHardy <kaber@trash.net>
-
由 Florian Westphal 提交于
NF_VERDICT_MASK is currently 0xffff. This is because the upper 16 bits are used to store errno (for NF_DROP) or the queue number (NF_QUEUE verdict). As there are up to 0xffff different queues available, there is no more room to store additional flags. At the moment there are only 6 different verdicts, i.e. we can reduce NF_VERDICT_MASK to 0xff to allow storing additional flags in the 0xff00 space. NF_VERDICT_BITS would then be reduced to 8, but because the value is exported to userspace, this might cause breakage; e.g.: e.g. 'queuenr = (1 << NF_VERDICT_BITS) | NF_QUEUE' would now break. Thus, remove NF_VERDICT_BITS usage in the kernel and move the old value to the 'userspace compat' section. Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NPatrick McHardy <kaber@trash.net>
-
由 Changli Gao 提交于
My previous patch (netfilter: nf_nat: don't use atomic bit operation) made a mistake when converting atomic_set to a normal bit 'or'. IPS_*_BIT should be replaced with IPS_*. Signed-off-by: NChangli Gao <xiaosuo@gmail.com> Cc: Tim Gardner <tim.gardner@canonical.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: NPatrick McHardy <kaber@trash.net>
-
由 Jan Engelhardt 提交于
Signed-off-by: NJan Engelhardt <jengelh@medozas.de>
-
由 Jan Engelhardt 提交于
Signed-off-by: NJan Engelhardt <jengelh@medozas.de>
-
由 Roberto Sassu 提交于
The definition of ECRYPTFS_SUPER_MAGIC has been moved to the include file 'linux/magic.h' to become available to other kernel subsystems. Signed-off-by: NRoberto Sassu <roberto.sassu@polito.it> Signed-off-by: NTyler Hicks <tyhicks@linux.vnet.ibm.com>
-
- 17 1月, 2011 10 次提交
-
-
由 Namhyung Kim 提交于
The fi_extents_start field of struct fiemap_extent_info is a user pointer but was not marked as __user. This makes sparse emit following warnings: CHECK fs/ioctl.c fs/ioctl.c:114:26: warning: incorrect type in argument 1 (different address spaces) fs/ioctl.c:114:26: expected void [noderef] <asn:1>*dst fs/ioctl.c:114:26: got struct fiemap_extent *[assigned] dest fs/ioctl.c:202:14: warning: incorrect type in argument 1 (different address spaces) fs/ioctl.c:202:14: expected void const volatile [noderef] <asn:1>*<noident> fs/ioctl.c:202:14: got struct fiemap_extent *[assigned] fi_extents_start fs/ioctl.c:212:27: warning: incorrect type in argument 1 (different address spaces) fs/ioctl.c:212:27: expected void [noderef] <asn:1>*dst fs/ioctl.c:212:27: got char *<noident> Also add 'ufiemap' variable to eliminate unnecessary casts. Signed-off-by: NNamhyung Kim <namhyung@gmail.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Steven Rostedt 提交于
In fput_light(), there's an unlikely(fput_needed), which running on my normal desktop doing firefox, xchat, evolution and part of my distcc farm, and running the annotate branch profiler shows that the unlikely is not very unlikely. correct incorrect % Function File Line ------- --------- - -------- ---- ---- 0 48 100 fput_light file.h 26 115828710 897415279 88 fput_light file.h 26 865271179 5286128445 85 fput_light file.h 26 19568539 8923664 31 fput_light file.h 26 12353677 3562279 22 fput_light file.h 26 267691 67062 20 fput_light file.h 26 15014853 348172 2 fput_light file.h 26 209258 205 0 fput_light file.h 26 1364164 0 0 fput_light file.h 26 Which gives 1032903812 times it was correct and 6203351846 times it was incorrect, or 85% incorrect. Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: NSteven Rostedt <rostedt@goodmis.org> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Christoph Hellwig 提交于
Currently all filesystems except XFS implement fallocate asynchronously, while XFS forced a commit. Both of these are suboptimal - in case of O_SYNC I/O we really want our allocation on disk, especially for the !KEEP_SIZE case where we actually grow the file with user-visible zeroes. On the other hand always commiting the transaction is a bad idea for fast-path uses of fallocate like for example in recent Samba versions. Given that block allocation is a data plane operation anyway change it from an inode operation to a file operation so that we have the file structure available that lets us check for O_SYNC. This also includes moving the code around for a few of the filesystems, and remove the already unnedded S_ISDIR checks given that we only wire up fallocate for regular files. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Tejun Heo 提交于
* ib_wq is added, which is used as the common workqueue for infiniband instead of the system workqueue. All system workqueue usages including flush_scheduled_work() callers are converted to use and flush ib_wq. * cancel_delayed_work() + flush_scheduled_work() converted to cancel_delayed_work_sync(). * qib_wq is removed and ib_wq is used instead. This is to prepare for deprecation of flush_scheduled_work(). Signed-off-by: NTejun Heo <tj@kernel.org> Signed-off-by: NRoland Dreier <rolandd@cisco.com>
-
由 Russell King - ARM Linux 提交于
Cleanup the formatting of comments, remove some which don't make sense anymore. Signed-off-by: NRussell King <rmk+kernel@arm.linux.org.uk> [fix conflict with 96a608a4] Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Andrea Arcangeli 提交于
pmdp_get_and_clear/pmdp_clear_flush/pmdp_splitting_flush were trapped as BUG() and they were defined only to diminish the risk of build issues on not-x86 archs and to be consistent with the generic pte methods previously defined in include/asm-generic/pgtable.h. But they are causing more trouble than they were supposed to solve, so it's simpler not to define them when THP is off. This is also correcting the export of pmdp_splitting_flush which is currently unused (x86 isn't using the generic implementation in mm/pgtable-generic.c and no other arch needs that [yet]). Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Sam Ravnborg <sam@ravnborg.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: "David S. Miller" <davem@davemloft.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Rafael J. Wysocki 提交于
After commit 415e12b2 ("PCI/ACPI: Request _OSC control once for each root bridge (v3)") include/linux/pci-acpi.h is included by drivers/pci/pcie/aer/aerdrv.c and if CONFIG_ACPI is unset, the bogus and unnecessary alternative definition of acpi_find_root_bridge_handle() causes a build error to occur. Remove the offending piece of garbage. Reported-and-tested-by: NStephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Al Viro 提交于
Instead of splitting refcount between (per-cpu) mnt_count and (SMP-only) mnt_longrefs, make all references contribute to mnt_count again and keep track of how many are longterm ones. Accounting rules for longterm count: * 1 for each fs_struct.root.mnt * 1 for each fs_struct.pwd.mnt * 1 for having non-NULL ->mnt_ns * decrement to 0 happens only under vfsmount lock exclusive That allows nice common case for mntput() - since we can't drop the final reference until after mnt_longterm has reached 0 due to the rules above, mntput() can grab vfsmount lock shared and check mnt_longterm. If it turns out to be non-zero (which is the common case), we know that this is not the final mntput() and can just blindly decrement percpu mnt_count. Otherwise we grab vfsmount lock exclusive and do usual decrement-and-check of percpu mnt_count. For fs_struct.c we have mnt_make_longterm() and mnt_make_shortterm(); namespace.c uses the latter in places where we don't already hold vfsmount lock exclusive and opencodes a few remaining spots where we need to manipulate mnt_longterm. Note that we mostly revert the code outside of fs/namespace.c back to what we used to have; in particular, normal code doesn't need to care about two kinds of references, etc. And we get to keep the optimization Nick's variant had bought us... Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Thomas Graf 提交于
The setsockopt() syscall to replace tables is already recorded in the audit logs. This patch stores additional information such as table name and netfilter protocol. Cc: Patrick McHardy <kaber@trash.net> Cc: Eric Paris <eparis@parisplace.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: NThomas Graf <tgraf@redhat.com> Signed-off-by: NPatrick McHardy <kaber@trash.net>
-
由 Thomas Graf 提交于
This patch adds a new netfilter target which creates audit records for packets traversing a certain chain. It can be used to record packets which are rejected administraively as follows: -N AUDIT_DROP -A AUDIT_DROP -j AUDIT --type DROP -A AUDIT_DROP -j DROP a rule which would typically drop or reject a packet would then invoke the new chain to record packets before dropping them. -j AUDIT_DROP The module is protocol independant and works for iptables, ip6tables and ebtables. The following information is logged: - netfilter hook - packet length - incomming/outgoing interface - MAC src/dst/proto for ethernet packets - src/dst/protocol address for IPv4/IPv6 - src/dst port for TCP/UDP/UDPLITE - icmp type/code Cc: Patrick McHardy <kaber@trash.net> Cc: Eric Paris <eparis@parisplace.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: NThomas Graf <tgraf@redhat.com> Signed-off-by: NPatrick McHardy <kaber@trash.net>
-
- 16 1月, 2011 8 次提交
-
-
由 Grant Likely 提交于
The physical address is never used by the device tree code when allocating memory for unflattening. Change the architecture's alloc hook to return the virutal address instead. Signed-off-by: NGrant Likely <grant.likely@secretlab.ca>
-
由 David Howells 提交于
Unexport do_add_mount() and make ->d_automount() return the vfsmount to be added rather than calling do_add_mount() itself. follow_automount() will then do the addition. This slightly complicates things as ->d_automount() normally wants to add the new vfsmount to an expiration list and start an expiration timer. The problem with that is that the vfsmount will be deleted if it has a refcount of 1 and the timer will not repeat if the expiration list is empty. To this end, we require the vfsmount to be returned from d_automount() with a refcount of (at least) 2. One of these refs will be dropped unconditionally. In addition, follow_automount() must get a 3rd ref around the call to do_add_mount() lest it eat a ref and return an error, leaving the mount we have open to being expired as we would otherwise have only 1 ref on it. d_automount() should also add the the vfsmount to the expiration list (by calling mnt_set_expiry()) and start the expiration timer before returning, if this mechanism is to be used. The vfsmount will be unlinked from the expiration list by follow_automount() if do_add_mount() fails. This patch also fixes the call to do_add_mount() for AFS to propagate the mount flags from the parent vfsmount. Signed-off-by: NDavid Howells <dhowells@redhat.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 David Howells 提交于
Allow d_manage() to be called from pathwalk when it is in RCU-walk mode as well as when it is in Ref-walk mode. This permits __follow_mount_rcu() to call d_manage() directly. d_manage() needs a parameter to indicate that it is in RCU-walk mode as it isn't allowed to sleep if in that mode (but should return -ECHILD instead). autofs4_d_manage() can then be set to retain RCU-walk mode if the daemon accesses it and otherwise request dropping back to ref-walk mode. Signed-off-by: NDavid Howells <dhowells@redhat.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Ian Kent 提交于
Increase the autofs module sub-version so we can tell what kernel implementation is being used from user space debug logging. Signed-off-by: NIan Kent <raven@themaw.net> Signed-off-by: NDavid Howells <dhowells@redhat.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 David Howells 提交于
Make NFS use the new d_automount() dentry operation rather than abusing follow_link() on directories. Signed-off-by: NDavid Howells <dhowells@redhat.com> Acked-by: NTrond Myklebust <Trond.Myklebust@netapp.com> Acked-by: NIan Kent <raven@themaw.net> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 David Howells 提交于
Add an AT_NO_AUTOMOUNT flag to suppress terminal automounting of automount point directories. This can be used by fstatat() users to permit the gathering of attributes on an automount point and also prevent mass-automounting of a directory of automount points by ls. Signed-off-by: NDavid Howells <dhowells@redhat.com> Acked-by: NIan Kent <raven@themaw.net> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 David Howells 提交于
Add a dentry op (d_manage) to permit a filesystem to hold a process and make it sleep when it tries to transit away from one of that filesystem's directories during a pathwalk. The operation is keyed off a new dentry flag (DCACHE_MANAGE_TRANSIT). The filesystem is allowed to be selective about which processes it holds and which it permits to continue on or prohibits from transiting from each flagged directory. This will allow autofs to hold up client processes whilst letting its userspace daemon through to maintain the directory or the stuff behind it or mounted upon it. The ->d_manage() dentry operation: int (*d_manage)(struct path *path, bool mounting_here); takes a pointer to the directory about to be transited away from and a flag indicating whether the transit is undertaken by do_add_mount() or do_move_mount() skipping through a pile of filesystems mounted on a mountpoint. It should return 0 if successful and to let the process continue on its way; -EISDIR to prohibit the caller from skipping to overmounted filesystems or automounting, and to use this directory; or some other error code to return to the user. ->d_manage() is called with namespace_sem writelocked if mounting_here is true and no other locks held, so it may sleep. However, if mounting_here is true, it may not initiate or wait for a mount or unmount upon the parameter directory, even if the act is actually performed by userspace. Within fs/namei.c, follow_managed() is extended to check with d_manage() first on each managed directory, before transiting away from it or attempting to automount upon it. follow_down() is renamed follow_down_one() and should only be used where the filesystem deliberately intends to avoid management steps (e.g. autofs). A new follow_down() is added that incorporates the loop done by all other callers of follow_down() (do_add/move_mount(), autofs and NFSD; whilst AFS, NFS and CIFS do use it, their use is removed by converting them to use d_automount()). The new follow_down() calls d_manage() as appropriate. It also takes an extra parameter to indicate if it is being called from mount code (with namespace_sem writelocked) which it passes to d_manage(). follow_down() ignores automount points so that it can be used to mount on them. __follow_mount_rcu() is made to abort rcu-walk mode if it hits a directory with DCACHE_MANAGE_TRANSIT set on the basis that we're probably going to have to sleep. It would be possible to enter d_manage() in rcu-walk mode too, and have that determine whether to abort or not itself. That would allow the autofs daemon to continue on in rcu-walk mode. Note that DCACHE_MANAGE_TRANSIT on a directory should be cleared when it isn't required as every tranist from that directory will cause d_manage() to be invoked. It can always be set again when necessary. ========================== WHAT THIS MEANS FOR AUTOFS ========================== Autofs currently uses the lookup() inode op and the d_revalidate() dentry op to trigger the automounting of indirect mounts, and both of these can be called with i_mutex held. autofs knows that the i_mutex will be held by the caller in lookup(), and so can drop it before invoking the daemon - but this isn't so for d_revalidate(), since the lock is only held on _some_ of the code paths that call it. This means that autofs can't risk dropping i_mutex from its d_revalidate() function before it calls the daemon. The bug could manifest itself as, for example, a process that's trying to validate an automount dentry that gets made to wait because that dentry is expired and needs cleaning up: mkdir S ffffffff8014e05a 0 32580 24956 Call Trace: [<ffffffff885371fd>] :autofs4:autofs4_wait+0x674/0x897 [<ffffffff80127f7d>] avc_has_perm+0x46/0x58 [<ffffffff8009fdcf>] autoremove_wake_function+0x0/0x2e [<ffffffff88537be6>] :autofs4:autofs4_expire_wait+0x41/0x6b [<ffffffff88535cfc>] :autofs4:autofs4_revalidate+0x91/0x149 [<ffffffff80036d96>] __lookup_hash+0xa0/0x12f [<ffffffff80057a2f>] lookup_create+0x46/0x80 [<ffffffff800e6e31>] sys_mkdirat+0x56/0xe4 versus the automount daemon which wants to remove that dentry, but can't because the normal process is holding the i_mutex lock: automount D ffffffff8014e05a 0 32581 1 32561 Call Trace: [<ffffffff80063c3f>] __mutex_lock_slowpath+0x60/0x9b [<ffffffff8000ccf1>] do_path_lookup+0x2ca/0x2f1 [<ffffffff80063c89>] .text.lock.mutex+0xf/0x14 [<ffffffff800e6d55>] do_rmdir+0x77/0xde [<ffffffff8005d229>] tracesys+0x71/0xe0 [<ffffffff8005d28d>] tracesys+0xd5/0xe0 which means that the system is deadlocked. This patch allows autofs to hold up normal processes whilst the daemon goes ahead and does things to the dentry tree behind the automouter point without risking a deadlock as almost no locks are held in d_manage() and none in d_automount(). Signed-off-by: NDavid Howells <dhowells@redhat.com> Was-Acked-by: NIan Kent <raven@themaw.net> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 David Howells 提交于
Add a dentry op (d_automount) to handle automounting directories rather than abusing the follow_link() inode operation. The operation is keyed off a new dentry flag (DCACHE_NEED_AUTOMOUNT). This also makes it easier to add an AT_ flag to suppress terminal segment automount during pathwalk and removes the need for the kludge code in the pathwalk algorithm to handle directories with follow_link() semantics. The ->d_automount() dentry operation: struct vfsmount *(*d_automount)(struct path *mountpoint); takes a pointer to the directory to be mounted upon, which is expected to provide sufficient data to determine what should be mounted. If successful, it should return the vfsmount struct it creates (which it should also have added to the namespace using do_add_mount() or similar). If there's a collision with another automount attempt, NULL should be returned. If the directory specified by the parameter should be used directly rather than being mounted upon, -EISDIR should be returned. In any other case, an error code should be returned. The ->d_automount() operation is called with no locks held and may sleep. At this point the pathwalk algorithm will be in ref-walk mode. Within fs/namei.c itself, a new pathwalk subroutine (follow_automount()) is added to handle mountpoints. It will return -EREMOTE if the automount flag was set, but no d_automount() op was supplied, -ELOOP if we've encountered too many symlinks or mountpoints, -EISDIR if the walk point should be used without mounting and 0 if successful. The path will be updated to point to the mounted filesystem if a successful automount took place. __follow_mount() is replaced by follow_managed() which is more generic (especially with the patch that adds ->d_manage()). This handles transits from directories during pathwalk, including automounting and skipping over mountpoints (and holding processes with the next patch). __follow_mount_rcu() will jump out of RCU-walk mode if it encounters an automount point with nothing mounted on it. follow_dotdot*() does not handle automounts as you don't want to trigger them whilst following "..". I've also extracted the mount/don't-mount logic from autofs4 and included it here. It makes the mount go ahead anyway if someone calls open() or creat(), tries to traverse the directory, tries to chdir/chroot/etc. into the directory, or sticks a '/' on the end of the pathname. If they do a stat(), however, they'll only trigger the automount if they didn't also say O_NOFOLLOW. I've also added an inode flag (S_AUTOMOUNT) so that filesystems can mark their inodes as automount points. This flag is automatically propagated to the dentry as DCACHE_NEED_AUTOMOUNT by __d_instantiate(). This saves NFS and could save AFS a private flag bit apiece, but is not strictly necessary. It would be preferable to do the propagation in d_set_d_op(), but that doesn't normally have access to the inode. [AV: fixed breakage in case if __follow_mount_rcu() fails and nameidata_drop_rcu() succeeds in RCU case of do_lookup(); we need to fall through to non-RCU case after that, rather than just returning with ungrabbed *path] Signed-off-by: NDavid Howells <dhowells@redhat.com> Was-Acked-by: NIan Kent <raven@themaw.net> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 15 1月, 2011 3 次提交
-
-
由 Dan Williams 提交于
drivers/dma/amba-pl08x.c: In function 'pl08x_start_txd': drivers/dma/amba-pl08x.c:205: warning: dereferencing 'void *' pointer We never dereference llis_va aside from assigning it to a struct pl08x_lli pointer or calculating the address of array element 0. Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Russell King - ARM Linux 提交于
desc->tx_submit's return type is dma_cookie_t, not int. Therefore, dmaengine_submit() should match this return type as it's just wrapping this detail. Signed-off-by: NRussell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Dave Airlie 提交于
This reverts commit dfe63bb0. This commit was causing nouveau not to work properly, for -rc1 I'd prefer it worked and we can look if this is useful for 2.6.39. Cc: James Simmons <jsimmons@infradead.org> Signed-off-by: NDave Airlie <airlied@redhat.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-