1. 15 3月, 2011 1 次提交
  2. 25 2月, 2011 1 次提交
  3. 22 2月, 2011 1 次提交
  4. 16 2月, 2011 1 次提交
    • P
      ipvs: make "no destination available" message more informative · 41ac51ee
      Patrick Schaaf 提交于
      When IP_VS schedulers do not find a destination, they output a terse
      "WLC: no destination available" message through kernel syslog, which I
      can not only make sense of because syslog puts them in a logfile
      together with keepalived checker results.
      
      This patch makes the output a bit more informative, by telling you which
      virtual service failed to find a destination.
      
      Example output:
      
      kernel: [1539214.552233] IPVS: wlc: TCP 192.168.8.30:22 - no destination available
      kernel: [1539299.674418] IPVS: wlc: FWM 22 0x00000016 - no destination available
      
      I have tested the code for IPv4 and FWM services, as you can see from
      the example; I do not have an IPv6 setup to test the third code path
      with.
      
      To avoid code duplication, I put a new function ip_vs_scheduler_err()
      into ip_vs_sched.c, and use that from the schedulers instead of calling
      IP_VS_ERR_RL directly.
      Signed-off-by: NPatrick Schaaf <netdev@bof.de>
      Signed-off-by: NSimon Horman <horms@verge.net.au>
      41ac51ee
  5. 03 2月, 2011 2 次提交
  6. 02 2月, 2011 2 次提交
  7. 01 2月, 2011 6 次提交
    • J
      netfilter: xtables: "set" match and "SET" target support · d956798d
      Jozsef Kadlecsik 提交于
      The patch adds the combined module of the "SET" target and "set" match
      to netfilter. Both the previous and the current revisions are supported.
      Signed-off-by: NJozsef Kadlecsik <kadlec@blackhole.kfki.hu>
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      d956798d
    • J
      netfilter: ipset: list:set set type support · f830837f
      Jozsef Kadlecsik 提交于
      The module implements the list:set type support in two flavours:
      without and with timeout. The sets has two sides: for the userspace,
      they store the names of other (non list:set type of) sets: one can add,
      delete and test set names. For the kernel, it forms an ordered union of
      the member sets: the members sets are tried in order when elements are
      added, deleted and tested and the process stops at the first success.
      Signed-off-by: NJozsef Kadlecsik <kadlec@blackhole.kfki.hu>
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      f830837f
    • J
      netfilter: ipset: hash:ip set type support · 6c027889
      Jozsef Kadlecsik 提交于
      The module implements the hash:ip type support in four flavours:
      for IPv4 or IPv6, both without and with timeout support.
      
      All the hash types are based on the "array hash" or ahash structure
      and functions as a good compromise between minimal memory footprint
      and speed. The hashing uses arrays to resolve clashes. The hash table
      is resized (doubled) when searching becomes too long. Resizing can be
      triggered by userspace add commands only and those are serialized by
      the nfnl mutex. During resizing the set is read-locked, so the only
      possible concurrent operations are the kernel side readers. Those are
      protected by RCU locking.
      
      Because of the four flavours and the other hash types, the functions
      are implemented in general forms in the ip_set_ahash.h header file
      and the real functions are generated before compiling by macro expansion.
      Thus the dereferencing of low-level functions and void pointer arguments
      could be avoided: the low-level functions are inlined, the function
      arguments are pointers of type-specific structures.
      Signed-off-by: NJozsef Kadlecsik <kadlec@blackhole.kfki.hu>
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      6c027889
    • J
      netfilter: ipset: bitmap:ip set type support · 72205fc6
      Jozsef Kadlecsik 提交于
      The module implements the bitmap:ip set type in two flavours, without
      and with timeout support. In this kind of set one can store IPv4
      addresses (or network addresses) from a given range.
      
      In order not to waste memory, the timeout version does not rely on
      the kernel timer for every element to be timed out but on garbage
      collection. All set types use this mechanism.
      Signed-off-by: NJozsef Kadlecsik <kadlec@blackhole.kfki.hu>
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      72205fc6
    • J
      netfilter: ipset: IP set core support · a7b4f989
      Jozsef Kadlecsik 提交于
      The patch adds the IP set core support to the kernel.
      
      The IP set core implements a netlink (nfnetlink) based protocol by which
      one can create, destroy, flush, rename, swap, list, save, restore sets,
      and add, delete, test elements from userspace. For simplicity (and backward
      compatibilty and for not to force ip(6)tables to be linked with a netlink
      library) reasons a small getsockopt-based protocol is also kept in order
      to communicate with the ip(6)tables match and target.
      
      The netlink protocol passes all u16, etc values in network order with
      NLA_F_NET_BYTEORDER flag. The protocol enforces the proper use of the
      NLA_F_NESTED and NLA_F_NET_BYTEORDER flags.
      
      For other kernel subsystems (netfilter match and target) the API contains
      the functions to add, delete and test elements in sets and the required calls
      to get/put refereces to the sets before those operations can be performed.
      
      The set types (which are implemented in independent modules) are stored
      in a simple RCU protected list. A set type may have variants: for example
      without timeout or with timeout support, for IPv4 or for IPv6. The sets
      (i.e. the pointers to the sets) are stored in an array. The sets are
      identified by their index in the array, which makes possible easy and
      fast swapping of sets. The array is protected indirectly by the nfnl
      mutex from nfnetlink. The content of the sets are protected by the rwlock
      of the set.
      
      There are functional differences between the add/del/test functions
      for the kernel and userspace:
      
      - kernel add/del/test: works on the current packet (i.e. one element)
      - kernel test: may trigger an "add" operation  in order to fill
        out unspecified parts of the element from the packet (like MAC address)
      - userspace add/del: works on the netlink message and thus possibly
        on multiple elements from the IPSET_ATTR_ADT container attribute.
      - userspace add: may trigger resizing of a set
      Signed-off-by: NJozsef Kadlecsik <kadlec@blackhole.kfki.hu>
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      a7b4f989
    • J
      netfilter: NFNL_SUBSYS_IPSET id and NLA_PUT_NET* macros · f703651e
      Jozsef Kadlecsik 提交于
      The patch adds the NFNL_SUBSYS_IPSET id and NLA_PUT_NET* macros to the
      vanilla kernel.
      Signed-off-by: NJozsef Kadlecsik <kadlec@blackhole.kfki.hu>
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      f703651e
  8. 21 1月, 2011 2 次提交
  9. 20 1月, 2011 6 次提交
    • J
      netfilter: xtables: remove duplicate member · ba12b130
      Jan Engelhardt 提交于
      Accidentally missed removing the old out-of-union "inverse" member,
      which caused the struct size to change which then gives size mismatch
      warnings when using an old iptables.
      
      It is interesting to see that gcc did not warn about this before.
      (Filed http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47376 )
      Signed-off-by: NJan Engelhardt <jengelh@medozas.de>
      ba12b130
    • J
      netfilter: xtables: remove extraneous header that slipped in · 5d844928
      Jan Engelhardt 提交于
      Commit 0b8ad876 (netfilter: xtables: add missing header files to export
      list) erroneously added this.
      Signed-off-by: NJan Engelhardt <jengelh@medozas.de>
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      5d844928
    • J
      net_sched: implement a root container qdisc sch_mqprio · b8970f0b
      John Fastabend 提交于
      This implements a mqprio queueing discipline that by default creates
      a pfifo_fast qdisc per tx queue and provides the needed configuration
      interface.
      
      Using the mqprio qdisc the number of tcs currently in use along
      with the range of queues alloted to each class can be configured. By
      default skbs are mapped to traffic classes using the skb priority.
      This mapping is configurable.
      
      Configurable parameters,
      
      struct tc_mqprio_qopt {
      	__u8    num_tc;
      	__u8    prio_tc_map[TC_BITMASK + 1];
      	__u8    hw;
      	__u16   count[TC_MAX_QUEUE];
      	__u16   offset[TC_MAX_QUEUE];
      };
      
      Here the count/offset pairing give the queue alignment and the
      prio_tc_map gives the mapping from skb->priority to tc.
      
      The hw bit determines if the hardware should configure the count
      and offset values. If the hardware bit is set then the operation
      will fail if the hardware does not implement the ndo_setup_tc
      operation. This is to avoid undetermined states where the hardware
      may or may not control the queue mapping. Also minimal bounds
      checking is done on the count/offset to verify a queue does not
      exceed num_tx_queues and that queue ranges do not overlap. Otherwise
      it is left to user policy or hardware configuration to create
      useful mappings.
      
      It is expected that hardware QOS schemes can be implemented by
      creating appropriate mappings of queues in ndo_tc_setup().
      
      One expected use case is drivers will use the ndo_setup_tc to map
      queue ranges onto 802.1Q traffic classes. This provides a generic
      mechanism to map network traffic onto these traffic classes and
      removes the need for lower layer drivers to know specifics about
      traffic types.
      Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b8970f0b
    • J
      net: implement mechanism for HW based QOS · 4f57c087
      John Fastabend 提交于
      This patch provides a mechanism for lower layer devices to
      steer traffic using skb->priority to tx queues. This allows
      for hardware based QOS schemes to use the default qdisc without
      incurring the penalties related to global state and the qdisc
      lock. While reliably receiving skbs on the correct tx ring
      to avoid head of line blocking resulting from shuffling in
      the LLD. Finally, all the goodness from txq caching and xps/rps
      can still be leveraged.
      
      Many drivers and hardware exist with the ability to implement
      QOS schemes in the hardware but currently these drivers tend
      to rely on firmware to reroute specific traffic, a driver
      specific select_queue or the queue_mapping action in the
      qdisc.
      
      By using select_queue for this drivers need to be updated for
      each and every traffic type and we lose the goodness of much
      of the upstream work. Firmware solutions are inherently
      inflexible. And finally if admins are expected to build a
      qdisc and filter rules to steer traffic this requires knowledge
      of how the hardware is currently configured. The number of tx
      queues and the queue offsets may change depending on resources.
      Also this approach incurs all the overhead of a qdisc with filters.
      
      With the mechanism in this patch users can set skb priority using
      expected methods ie setsockopt() or the stack can set the priority
      directly. Then the skb will be steered to the correct tx queues
      aligned with hardware QOS traffic classes. In the normal case with
      single traffic class and all queues in this class everything
      works as is until the LLD enables multiple tcs.
      
      To steer the skb we mask out the lower 4 bits of the priority
      and allow the hardware to configure upto 15 distinct classes
      of traffic. This is expected to be sufficient for most applications
      at any rate it is more then the 8021Q spec designates and is
      equal to the number of prio bands currently implemented in
      the default qdisc.
      
      This in conjunction with a userspace application such as
      lldpad can be used to implement 8021Q transmission selection
      algorithms one of these algorithms being the extended transmission
      selection algorithm currently being used for DCB.
      Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4f57c087
    • V
      net_device: add support for network device groups · cbda10fa
      Vlad Dogaru 提交于
      Net devices can now be grouped, enabling simpler manipulation from
      userspace. This patch adds a group field to the net_device structure, as
      well as rtnetlink support to query and modify it.
      Signed-off-by: NVlad Dogaru <ddvlad@rosedu.org>
      Acked-by: NJamal Hadi Salim <hadi@cyberus.ca>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cbda10fa
    • J
      netfilter: xtables: connlimit revision 1 · cc4fc022
      Jan Engelhardt 提交于
      This adds destination address-based selection. The old "inverse"
      member is overloaded (memory-wise) with a new "flags" variable,
      similar to how J.Park did it with xt_string rev 1. Since revision 0
      userspace only sets flag 0x1, no great changes are made to explicitly
      test for different revisions.
      Signed-off-by: NJan Engelhardt <jengelh@medozas.de>
      cc4fc022
  10. 19 1月, 2011 3 次提交
    • P
      netfilter: nf_conntrack_tstamp: add flow-based timestamp extension · a992ca2a
      Pablo Neira Ayuso 提交于
      This patch adds flow-based timestamping for conntracks. This
      conntrack extension is disabled by default. Basically, we use
      two 64-bits variables to store the creation timestamp once the
      conntrack has been confirmed and the other to store the deletion
      time. This extension is disabled by default, to enable it, you
      have to:
      
      echo 1 > /proc/sys/net/netfilter/nf_conntrack_timestamp
      
      This patch allows to save memory for user-space flow-based
      loogers such as ulogd2. In short, ulogd2 does not need to
      keep a hashtable with the conntrack in user-space to know
      when they were created and destroyed, instead we use the
      kernel timestamp. If we want to have a sane IPFIX implementation
      in user-space, this nanosecs resolution timestamps are also
      useful. Other custom user-space applications can benefit from
      this via libnetfilter_conntrack.
      
      This patch modifies the /proc output to display the delta time
      in seconds since the flow start. You can also obtain the
      flow-start date by means of the conntrack-tools.
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      a992ca2a
    • E
      net: filter: dont block softirqs in sk_run_filter() · 80f8f102
      Eric Dumazet 提交于
      Packet filter (BPF) doesnt need to disable softirqs, being fully
      re-entrant and lock-less.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      80f8f102
    • J
      netfilter: nf_conntrack: nf_conntrack snmp helper · 93557f53
      Jiri Olsa 提交于
      Adding support for SNMP broadcast connection tracking. The SNMP
      broadcast requests are now paired with the SNMP responses.
      Thus allowing using SNMP broadcasts with firewall enabled.
      
      Please refer to the following conversation:
      http://marc.info/?l=netfilter-devel&m=125992205006600&w=2
      
      Patrick McHardy wrote:
      > > The best solution would be to add generic broadcast tracking, the
      > > use of expectations for this is a bit of abuse.
      > > The second best choice I guess would be to move the help() function
      > > to a shared module and generalize it so it can be used for both.
      This patch implements the "second best choice".
      
      Since the netbios-ns conntrack module uses the same helper
      functionality as the snmp, only one helper function is added
      for both snmp and netbios-ns modules into the new object -
      nf_conntrack_broadcast.
      Signed-off-by: NJiri Olsa <jolsa@redhat.com>
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      93557f53
  11. 18 1月, 2011 6 次提交
  12. 17 1月, 2011 9 次提交
    • N
      fs: fix address space warnings in ioctl_fiemap() · ecf5632d
      Namhyung Kim 提交于
      The fi_extents_start field of struct fiemap_extent_info is a
      user pointer but was not marked as __user. This makes sparse
      emit following warnings:
      
        CHECK   fs/ioctl.c
      fs/ioctl.c:114:26: warning: incorrect type in argument 1 (different address spaces)
      fs/ioctl.c:114:26:    expected void [noderef] <asn:1>*dst
      fs/ioctl.c:114:26:    got struct fiemap_extent *[assigned] dest
      fs/ioctl.c:202:14: warning: incorrect type in argument 1 (different address spaces)
      fs/ioctl.c:202:14:    expected void const volatile [noderef] <asn:1>*<noident>
      fs/ioctl.c:202:14:    got struct fiemap_extent *[assigned] fi_extents_start
      fs/ioctl.c:212:27: warning: incorrect type in argument 1 (different address spaces)
      fs/ioctl.c:212:27:    expected void [noderef] <asn:1>*dst
      fs/ioctl.c:212:27:    got char *<noident>
      
      Also add 'ufiemap' variable to eliminate unnecessary casts.
      Signed-off-by: NNamhyung Kim <namhyung@gmail.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      ecf5632d
    • S
      fs: Remove unlikely() from fput_light() · c2b3e74b
      Steven Rostedt 提交于
      In fput_light(), there's an unlikely(fput_needed), which running on
      my normal desktop doing firefox, xchat, evolution and part of my distcc farm,
      and running the annotate branch profiler shows that the unlikely is not
      very unlikely.
      
       correct incorrect  %        Function             File              Line
       ------- ---------  -        --------             ----              ----
             0       48 100 fput_light                file.h               26
      115828710 897415279  88 fput_light              file.h               26
      865271179 5286128445  85 fput_light             file.h               26
      19568539  8923664  31 fput_light                file.h               26
      12353677  3562279  22 fput_light                file.h               26
        267691    67062  20 fput_light                file.h               26
      15014853   348172   2 fput_light                file.h               26
        209258      205   0 fput_light                file.h               26
       1364164        0   0 fput_light                file.h               26
      
      Which gives 1032903812 times it was correct and 6203351846 times it was
      incorrect, or 85% incorrect.
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      c2b3e74b
    • C
      fallocate should be a file operation · 2fe17c10
      Christoph Hellwig 提交于
      Currently all filesystems except XFS implement fallocate asynchronously,
      while XFS forced a commit.  Both of these are suboptimal - in case of O_SYNC
      I/O we really want our allocation on disk, especially for the !KEEP_SIZE
      case where we actually grow the file with user-visible zeroes.  On the
      other hand always commiting the transaction is a bad idea for fast-path
      uses of fallocate like for example in recent Samba versions.   Given
      that block allocation is a data plane operation anyway change it from
      an inode operation to a file operation so that we have the file structure
      available that lets us check for O_SYNC.
      
      This also includes moving the code around for a few of the filesystems,
      and remove the already unnedded S_ISDIR checks given that we only wire
      up fallocate for regular files.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2fe17c10
    • T
      RDMA: Update workqueue usage · f0626710
      Tejun Heo 提交于
      * ib_wq is added, which is used as the common workqueue for infiniband
        instead of the system workqueue.  All system workqueue usages
        including flush_scheduled_work() callers are converted to use and
        flush ib_wq.
      
      * cancel_delayed_work() + flush_scheduled_work() converted to
        cancel_delayed_work_sync().
      
      * qib_wq is removed and ib_wq is used instead.
      
      This is to prepare for deprecation of flush_scheduled_work().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NRoland Dreier <rolandd@cisco.com>
      f0626710
    • R
      ARM: PL08x: cleanup comments · 94ae8522
      Russell King - ARM Linux 提交于
      Cleanup the formatting of comments, remove some which don't make sense
      anymore.
      Signed-off-by: NRussell King <rmk+kernel@arm.linux.org.uk>
      [fix conflict with 96a608a4]
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      94ae8522
    • A
      fix non-x86 build failure in pmdp_get_and_clear · b3697c02
      Andrea Arcangeli 提交于
      pmdp_get_and_clear/pmdp_clear_flush/pmdp_splitting_flush were trapped as
      BUG() and they were defined only to diminish the risk of build issues on
      not-x86 archs and to be consistent with the generic pte methods previously
      defined in include/asm-generic/pgtable.h.
      
      But they are causing more trouble than they were supposed to solve, so
      it's simpler not to define them when THP is off.
      
      This is also correcting the export of pmdp_splitting_flush which is
      currently unused (x86 isn't using the generic implementation in
      mm/pgtable-generic.c and no other arch needs that [yet]).
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Sam Ravnborg <sam@ravnborg.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b3697c02
    • R
      PCI / ACPI: Fix build of the AER driver for CONFIG_ACPI unset · fc8fe1e9
      Rafael J. Wysocki 提交于
      After commit 415e12b2 ("PCI/ACPI: Request _OSC control once for each
      root bridge (v3)") include/linux/pci-acpi.h is included by
      drivers/pci/pcie/aer/aerdrv.c and if CONFIG_ACPI is unset, the bogus and
      unnecessary alternative definition of acpi_find_root_bridge_handle()
      causes a build error to occur.
      
      Remove the offending piece of garbage.
      Reported-and-tested-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fc8fe1e9
    • A
      sanitize vfsmount refcounting changes · f03c6599
      Al Viro 提交于
      Instead of splitting refcount between (per-cpu) mnt_count
      and (SMP-only) mnt_longrefs, make all references contribute
      to mnt_count again and keep track of how many are longterm
      ones.
      
      Accounting rules for longterm count:
      	* 1 for each fs_struct.root.mnt
      	* 1 for each fs_struct.pwd.mnt
      	* 1 for having non-NULL ->mnt_ns
      	* decrement to 0 happens only under vfsmount lock exclusive
      
      That allows nice common case for mntput() - since we can't drop the
      final reference until after mnt_longterm has reached 0 due to the rules
      above, mntput() can grab vfsmount lock shared and check mnt_longterm.
      If it turns out to be non-zero (which is the common case), we know
      that this is not the final mntput() and can just blindly decrement
      percpu mnt_count.  Otherwise we grab vfsmount lock exclusive and
      do usual decrement-and-check of percpu mnt_count.
      
      For fs_struct.c we have mnt_make_longterm() and mnt_make_shortterm();
      namespace.c uses the latter in places where we don't already hold
      vfsmount lock exclusive and opencodes a few remaining spots where
      we need to manipulate mnt_longterm.
      
      Note that we mostly revert the code outside of fs/namespace.c back
      to what we used to have; in particular, normal code doesn't need
      to care about two kinds of references, etc.  And we get to keep
      the optimization Nick's variant had bought us...
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      f03c6599
    • T
      netfilter: create audit records for x_tables replaces · fbabf31e
      Thomas Graf 提交于
      The setsockopt() syscall to replace tables is already recorded
      in the audit logs. This patch stores additional information
      such as table name and netfilter protocol.
      
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Eric Paris <eparis@parisplace.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NThomas Graf <tgraf@redhat.com>
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      fbabf31e