Fastai transforms

I have directly copied and pasted part of the transforms.py module from the fastai library, because pytorch_widedeep only needs its Tokenizer and Vocab classes; copying them avoids extra dependencies. Credit for all the code in the fastai_transforms module of this pytorch-widedeep package goes to Jeremy Howard and the fastai team. I include the documentation here only for completeness, but I strongly advise the user to read the fastai documentation.

Tokenizer

Tokenizer(
    tok_func=SpacyTokenizer,
    lang="en",
    pre_rules=None,
    post_rules=None,
    special_cases=None,
    n_cpus=None,
)

Class to combine a series of rules and a tokenizer function to tokenize text with multiprocessing.

Setting some of the parameters of this class may require some familiarity with the source code. A short construction example follows the source listing below.

Parameters:

  • tok_func (Callable) –

    Tokenizer class or callable that, given the language, returns a tokenizer instance. See pytorch_widedeep.utils.fastai_transforms.SpacyTokenizer

  • lang (str) –

    Text's Language

  • pre_rules (Optional[ListRules]) –

    Custom type: Collection[Callable[[str], str]]. These are Callable objects that will be applied to the text (str) directly, as rule(t), before it is tokenized.

  • post_rules (Optional[ListRules]) –

    Custom type: Collection[Callable[[str], str]]. These are Callable objects that will be applied to the tokens as rule(tokens) after the text has been tokenized.

  • special_cases (Optional[Collection[str]]) –

    special cases to be added to the tokenizer via spaCy's add_special_case method

  • n_cpus (Optional[int]) –

    number of CPUs to use during the tokenization process

Source code in pytorch_widedeep/utils/fastai_transforms.py
def __init__(
    self,
    tok_func: Callable = SpacyTokenizer,
    lang: str = "en",
    pre_rules: Optional[ListRules] = None,
    post_rules: Optional[ListRules] = None,
    special_cases: Optional[Collection[str]] = None,
    n_cpus: Optional[int] = None,
):
    self.tok_func, self.lang, self.special_cases = tok_func, lang, special_cases
    self.pre_rules = ifnone(pre_rules, defaults.text_pre_rules)
    self.post_rules = ifnone(post_rules, defaults.text_post_rules)
    self.special_cases = (
        special_cases if special_cases is not None else defaults.text_spec_tok
    )
    self.n_cpus = ifnone(n_cpus, defaults.cpus)
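
The following is a minimal construction sketch, not part of the original fastai documentation; to_lower is a hypothetical custom pre-rule. Note that passing pre_rules replaces the default pre-rules (see the ifnone call above) rather than extending them, so the defaults must be re-added explicitly if they are still wanted.

>>> from pytorch_widedeep.utils import Tokenizer
>>> def to_lower(t: str) -> str:  # hypothetical custom pre-rule
...     return t.lower()
>>> tok = Tokenizer(pre_rules=[to_lower], n_cpus=1)
>>> tokens = tok.process_all(['Machine learning is great'])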

process_text

process_text(t, tok)

Process and tokenize one text t with tokenizer tok.

Parameters:

  • t (str) –

    text to be processed and tokenized

  • tok (BaseTokenizer) –

    Instance of BaseTokenizer. See pytorch_widedeep.utils.fastai_transforms.BaseTokenizer

Returns:

  • List[str]

    List of tokens

Source code in pytorch_widedeep/utils/fastai_transforms.py
def process_text(self, t: str, tok: BaseTokenizer) -> List[str]:
    r"""Process and tokenize one text ``t`` with tokenizer ``tok``.

    Parameters
    ----------
    t: str
        text to be processed and tokenized
    tok: ``BaseTokenizer``
        Instance of `BaseTokenizer`. See
        `pytorch_widedeep.utils.fastai_transforms.BaseTokenizer`

    Returns
    -------
    List[str]
        List of tokens
    """
    for rule in self.pre_rules:
        t = rule(t)
    toks = tok.tokenizer(t)
    for rule in self.post_rules:
        toks = rule(toks)
    return toks
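
As a short sketch of calling process_text directly (assuming, as in fastai, that tok_func is instantiated with the language code, which is how process_all builds the tokenizer internally):

>>> from pytorch_widedeep.utils import Tokenizer
>>> tok = Tokenizer()
>>> spacy_tok = tok.tok_func(tok.lang)  # e.g. SpacyTokenizer('en')
>>> tok.process_text('Machine learning is great', spacy_tok)
['xxmaj', 'machine', 'learning', 'is', 'great']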

process_all

process_all(texts)

Process a list of texts. Parallel execution of process_text.

Examples:

>>> from pytorch_widedeep.utils import Tokenizer
>>> texts = ['Machine learning is great', 'but building stuff is even better']
>>> tok = Tokenizer()
>>> tok.process_all(texts)
[['xxmaj', 'machine', 'learning', 'is', 'great'], ['but', 'building', 'stuff', 'is', 'even', 'better']]

ℹ️ NOTE: the token TK_MAJ (xxmaj) is used to indicate that the next word begins with a capital letter in the original text. For more details on special tokens please see the fastai docs.

Returns:

  • List[List[str]]

    List containing lists of tokens. One list per "document"

Source code in pytorch_widedeep/utils/fastai_transforms.py
def process_all(self, texts: Collection[str]) -> List[List[str]]:
    r"""Process a list of texts. Parallel execution of ``process_text``.

    Examples
    --------
    >>> from pytorch_widedeep.utils import Tokenizer
    >>> texts = ['Machine learning is great', 'but building stuff is even better']
    >>> tok = Tokenizer()
    >>> tok.process_all(texts)
    [['xxmaj', 'machine', 'learning', 'is', 'great'], ['but', 'building', 'stuff', 'is', 'even', 'better']]

    :information_source: **NOTE**:
    Note the token ``TK_MAJ`` (`xxmaj`), used to indicate the
    next word begins with a capital in the original text. For more
    details of special tokens please see the [``fastai`` docs](https://docs.fast.ai/text.core.html#Tokenizing).

    Returns
    -------
    List[List[str]]
        List containing lists of tokens. One list per "_document_"

    """

    if self.n_cpus <= 1:
        return self._process_all_1(texts)
    with ProcessPoolExecutor(self.n_cpus) as e:
        return sum(
            e.map(self._process_all_1, partition_by_cores(texts, self.n_cpus)), []
        )
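
When multiprocessing is not wanted (e.g. in some notebook or Windows setups), n_cpus=1 takes the sequential branch shown above:

>>> from pytorch_widedeep.utils import Tokenizer
>>> tok = Tokenizer(n_cpus=1)
>>> tok.process_all(['Machine learning is great'])
[['xxmaj', 'machine', 'learning', 'is', 'great']]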

Vocab

Vocab(itos)

Contains the correspondence between numbers and tokens.

Parameters:

  • itos (Collection[str]) –

    index to str. Collection of strings that are the tokens of the vocabulary

Attributes:

  • stoi (defaultdict) –

    str to index. Dictionary containing the tokens of the vocabulary and their corresponding index

Source code in pytorch_widedeep/utils/fastai_transforms.py
def __init__(self, itos: Collection[str]):
    self.itos = itos
    self.stoi = defaultdict(int, {v: k for k, v in enumerate(self.itos)})
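
A small sketch of the itos/stoi correspondence (the token list below is made up for illustration). Because stoi is a defaultdict(int), tokens that are not in the vocabulary map to index 0, which is normally reserved for the unknown token 'xxunk':

>>> from pytorch_widedeep.utils import Vocab
>>> vocab = Vocab(['xxunk', 'xxpad', 'machine', 'learning'])
>>> vocab.stoi['machine']
2
>>> vocab.stoi['never_seen_token']
0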

numericalize

numericalize(t)

Convert a list of tokens t to their ids.

Returns:

  • List[int]

    List of numericalized tokens (i.e. token ids)

Source code in pytorch_widedeep/utils/fastai_transforms.py
def numericalize(self, t: Collection[str]) -> List[int]:
    """Convert a list of tokens ``t`` to their ids.

    Returns
    -------
    List[int]
        List of '_numericalsed_' tokens
    """
    return [self.stoi[w] for w in t]

textify

textify(nums, sep=' ')

Convert a list of nums (or indexes) to their tokens.

Returns:

  • str | List[str]

    The tokens joined by sep, or the list of tokens if sep is None

Source code in pytorch_widedeep/utils/fastai_transforms.py
def textify(self, nums: Collection[int], sep=" ") -> List[str]:
    """Convert a list of ``nums`` (or indexes) to their tokens.

    Returns
    -------
    List[str]
        List of tokens
    """
    return sep.join([self.itos[i] for i in nums]) if sep is not None else [self.itos[i] for i in nums]  # type: ignore

save

save(path)

Save the attribute self.itos in path

Source code in pytorch_widedeep/utils/fastai_transforms.py
def save(self, path):
    """Save the  attribute ``self.itos`` in ``path``"""
    pickle.dump(self.itos, open(path, "wb"))

create classmethod

create(tokens, max_vocab, min_freq, pad_idx=None)

Create a vocabulary object from a set of tokens.

Parameters:

  • tokens (Tokens) –

    Custom type: Collection[Collection[str]] see pytorch_widedeep.wdtypes. Collection of collections of strings (e.g. a list of tokenized sentences)

  • max_vocab (int) –

    maximum vocabulary size

  • min_freq (int) –

    minimum frequency a token must have to be included in the vocabulary

  • pad_idx (Optional[int]) –

    padding index. If None, fastai's Tokenizer leaves index 0 for the unknown token ('xxunk') and defaults to 1 for the padding token ('xxpad').

Examples:

>>> from pytorch_widedeep.utils import Tokenizer, Vocab
>>> texts = ['Machine learning is great', 'but building stuff is even better']
>>> tokens = Tokenizer().process_all(texts)
>>> vocab = Vocab.create(tokens, max_vocab=18, min_freq=1)
>>> vocab.numericalize(['machine', 'learning', 'is', 'great'])
[10, 11, 9, 12]
>>> vocab.textify([10, 11, 9, 12])
'machine learning is great'

ℹ️ NOTE: note the many special tokens that fastai's tokenizer adds. These are particularly useful when building language models and/or in classification/regression tasks. Please see the fastai docs.

Returns:

  • Vocab

    An instance of a Vocab object

Source code in pytorch_widedeep/utils/fastai_transforms.py
@classmethod
def create(
    cls,
    tokens: Tokens,
    max_vocab: int,
    min_freq: int,
    pad_idx: Optional[int] = None,
) -> "Vocab":
    r"""Create a vocabulary object from a set of tokens.

    Parameters
    ----------
    tokens: Tokens
        Custom type: ``Collection[Collection[str]]``  see
        `pytorch_widedeep.wdtypes`. Collection of collection of
        strings (e.g. list of tokenized sentences)
    max_vocab: int
        maximum vocabulary size
    pad_idx: int, Optional, default = None
        padding index. If `None`, Fastai's Tokenizer leaves the 0 index
        for the unknown token (_'xxunk'_) and defaults to 1 for the padding
        token (_'xxpad'_).

    Examples
    --------
    >>> from pytorch_widedeep.utils import Tokenizer, Vocab
    >>> texts = ['Machine learning is great', 'but building stuff is even better']
    >>> tokens = Tokenizer().process_all(texts)
    >>> vocab = Vocab.create(tokens, max_vocab=18, min_freq=1)
    >>> vocab.numericalize(['machine', 'learning', 'is', 'great'])
    [10, 11, 9, 12]
    >>> vocab.textify([10, 11, 9, 12])
    'machine learning is great'

    :information_source: **NOTE**:
    Note the many special tokens that ``fastai``'s' tokenizer adds. These
    are particularly useful when building Language models and/or in
    classification/Regression tasks. Please see the [``fastai`` docs](https://docs.fast.ai/text.core.html#Tokenizing).

    Returns
    -------
    Vocab
        An instance of a `Vocab` object
    """

    freq = Counter(p for o in tokens for p in o)
    itos = [o for o, c in freq.most_common(max_vocab) if c >= min_freq]
    for o in reversed(defaults.text_spec_tok):
        if o in itos:
            itos.remove(o)
        itos.insert(0, o)

    if pad_idx is not None:
        itos.remove(PAD)
        itos.insert(pad_idx, PAD)

    itos = itos[:max_vocab]
    if (
        len(itos) < max_vocab
    ):  # Make sure vocab size is a multiple of 8 for fast mixed precision training
        while len(itos) % 8 != 0:
            itos.append("xxfake")
    return cls(itos)
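
A hedged sketch of the pad_idx argument, reusing the tokens from the example above: by default 'xxunk' sits at index 0 and 'xxpad' at index 1, and passing pad_idx=0 moves the padding token to index 0. Note also that, when the resulting vocabulary is smaller than max_vocab, create appends 'xxfake' tokens until its length is a multiple of 8, which helps with mixed precision training.

>>> vocab = Vocab.create(tokens, max_vocab=18, min_freq=1, pad_idx=0)
>>> vocab.stoi['xxpad']
0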

load classmethod

load(path)

Load an instance of Vocab contained in path

Source code in pytorch_widedeep/utils/fastai_transforms.py
@classmethod
def load(cls, path):
    """Load an intance of :obj:`Vocab` contained in ``path``"""
    itos = pickle.load(open(path, "rb"))
    return cls(itos)
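
A quick save/load roundtrip sketch, reusing the vocab object from the create example above ('vocab.p' is just a hypothetical path):

>>> vocab.save('vocab.p')
>>> new_vocab = Vocab.load('vocab.p')
>>> new_vocab.itos == vocab.itos
True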