Fastai transforms

I have directly copied and pasted part of the transforms.py module from the fastai library, because pytorch_widedeep only needs its Tokenizer and Vocab classes; copying them avoids extra dependencies. Credit for all the code in the fastai_transforms module of this pytorch-widedeep package goes to Jeremy Howard and the fastai team. I include the documentation here only for completeness, but I strongly advise the user to read the fastai documentation.

Tokenizer

Tokenizer(
    tok_func=SpacyTokenizer,
    lang="en",
    pre_rules=None,
    post_rules=None,
    special_cases=None,
    n_cpus=None,
)

Class to combine a series of rules and a tokenizer function to tokenize text with multiprocessing.

Setting some of the parameters of this class may require some familiarity with the source code. A short construction example follows the source listing below.

Parameters:

  • tok_func (Callable) –

    Tokenizer class or callable that, given the language, returns a tokenizer instance. See pytorch_widedeep.utils.fastai_transforms.SpacyTokenizer

  • lang (str) –

    Text's Language

  • pre_rules (Optional[ListRules]) –

    Custom type: Collection[Callable[[str], str]]. These are Callable objects that will be applied to the text (str) directly, as rule(t), before it is tokenized.

  • post_rules (Optional[ListRules]) –

    Custom type: Collection[Callable[[str], str]]. These are Callable objects that will be applied to the tokens as rule(tokens) after the text has been tokenized.

  • special_cases (Optional[Collection[str]]) –

    special cases to be added to the tokenizer via spaCy's add_special_case method

  • n_cpus (Optional[int]) –

    number of CPUs to use during the tokenization process

Source code in pytorch_widedeep/utils/fastai_transforms.py
def __init__(
    self,
    tok_func: Callable = SpacyTokenizer,
    lang: str = "en",
    pre_rules: Optional[ListRules] = None,
    post_rules: Optional[ListRules] = None,
    special_cases: Optional[Collection[str]] = None,
    n_cpus: Optional[int] = None,
):
    self.tok_func, self.lang, self.special_cases = tok_func, lang, special_cases
    self.pre_rules = ifnone(pre_rules, defaults.text_pre_rules)
    self.post_rules = ifnone(post_rules, defaults.text_post_rules)
    self.special_cases = (
        special_cases if special_cases is not None else defaults.text_spec_tok
    )
    self.n_cpus = ifnone(n_cpus, defaults.cpus)
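
The following is a minimal construction sketch, not part of the original fastai documentation; to_lower is a hypothetical custom pre-rule. Note that passing pre_rules replaces the default pre-rules (see the ifnone call above) rather than extending them, so the defaults must be re-added explicitly if they are still wanted.

>>> from pytorch_widedeep.utils import Tokenizer
>>> def to_lower(t: str) -> str:  # hypothetical custom pre-rule
...     return t.lower()
>>> tok = Tokenizer(pre_rules=[to_lower], n_cpus=1)
>>> tokens = tok.process_all(['Machine learning is great'])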

process_text

process_text(t, tok)

Process and tokenize one text t with tokenizer tok.

Parameters:

  • t (str) –

    text to be processed and tokenized

  • tok (BaseTokenizer) –

    Instance of BaseTokenizer. See pytorch_widedeep.utils.fastai_transforms.BaseTokenizer

Returns:

  • List[str]

    List of tokens

Source code in pytorch_widedeep/utils/fastai_transforms.py
def process_text(self, t: str, tok: BaseTokenizer) -> List[str]:
    r"""Process and tokenize one text ``t`` with tokenizer ``tok``.

    Parameters
    ----------
    t: str
        text to be processed and tokenized
    tok: ``BaseTokenizer``
        Instance of `BaseTokenizer`. See
        `pytorch_widedeep.utils.fastai_transforms.BaseTokenizer`

    Returns
    -------
    List[str]
        List of tokens
    """
    for rule in self.pre_rules:
        t = rule(t)
    toks = tok.tokenizer(t)
    for rule in self.post_rules:
        toks = rule(toks)
    return toks
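
As a short sketch of calling process_text directly (assuming, as in fastai, that tok_func is instantiated with the language code, which is how process_all builds the tokenizer internally):

>>> from pytorch_widedeep.utils import Tokenizer
>>> tok = Tokenizer()
>>> spacy_tok = tok.tok_func(tok.lang)  # e.g. SpacyTokenizer('en')
>>> tok.process_text('Machine learning is great', spacy_tok)
['xxmaj', 'machine', 'learning', 'is', 'great']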

process_all

process_all(texts)

Process a list of texts. Parallel execution of process_text.

Examples:

>>> from pytorch_widedeep.utils import Tokenizer
>>> texts = ['Machine learning is great', 'but building stuff is even better']
>>> tok = Tokenizer()
>>> tok.process_all(texts)
[['xxmaj', 'machine', 'learning', 'is', 'great'], ['but', 'building', 'stuff', 'is', 'even', 'better']]

ℹ️ NOTE: the token TK_MAJ (xxmaj) is used to indicate that the next word begins with a capital letter in the original text. For more details on special tokens please see the fastai docs.

Returns:

  • List[List[str]]

    List containing lists of tokens. One list per "document"

Source code in pytorch_widedeep/utils/fastai_transforms.py
def process_all(self, texts: Collection[str]) -> List[List[str]]:
    r"""Process a list of texts. Parallel execution of ``process_text``.

    Examples
    --------
    >>> from pytorch_widedeep.utils import Tokenizer
    >>> texts = ['Machine learning is great', 'but building stuff is even better']
    >>> tok = Tokenizer()
    >>> tok.process_all(texts)
    [['xxmaj', 'machine', 'learning', 'is', 'great'], ['but', 'building', 'stuff', 'is', 'even', 'better']]

    :information_source: **NOTE**:
    Note the token ``TK_MAJ`` (`xxmaj`), used to indicate the
    next word begins with a capital in the original text. For more
    details of special tokens please see the [``fastai`` docs](https://docs.fast.ai/text.core.html#Tokenizing).

    Returns
    -------
    List[List[str]]
        List containing lists of tokens. One list per "_document_"

    """

    if self.n_cpus <= 1:
        return self._process_all_1(texts)
    with ProcessPoolExecutor(self.n_cpus) as e:
        return sum(
            e.map(self._process_all_1, partition_by_cores(texts, self.n_cpus)), []
        )
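
When multiprocessing is not wanted (e.g. in some notebook or Windows setups), n_cpus=1 takes the sequential branch shown above:

>>> from pytorch_widedeep.utils import Tokenizer
>>> tok = Tokenizer(n_cpus=1)
>>> tok.process_all(['Machine learning is great'])
[['xxmaj', 'machine', 'learning', 'is', 'great']]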

Vocab

Vocab(itos)

Contains the correspondence between numbers and tokens.

Parameters:

  • itos (Collection[str]) –

    index to str. Collection of strings that are the tokens of the vocabulary

Attributes:

  • stoi (defaultdict) –

    str to index. Dictionary containing the tokens of the vocabulary and their corresponding index

Source code in pytorch_widedeep/utils/fastai_transforms.py
def __init__(self, itos: Collection[str]):
    self.itos = itos
    self.stoi = defaultdict(int, {v: k for k, v in enumerate(self.itos)})
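
A small sketch of the itos/stoi correspondence (the token list below is made up for illustration). Because stoi is a defaultdict(int), tokens that are not in the vocabulary map to index 0, which is normally reserved for the unknown token 'xxunk':

>>> from pytorch_widedeep.utils import Vocab
>>> vocab = Vocab(['xxunk', 'xxpad', 'machine', 'learning'])
>>> vocab.stoi['machine']
2
>>> vocab.stoi['never_seen_token']
0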

numericalize

numericalize(t)

Convert a list of tokens t to their ids.

Returns:

  • List[int]

    List of numericalized tokens (i.e. token ids)

Source code in pytorch_widedeep/utils/fastai_transforms.py
def numericalize(self, t: Collection[str]) -> List[int]:
    """Convert a list of tokens ``t`` to their ids.

    Returns
    -------
    List[int]
        List of '_numericalsed_' tokens
    """
    return [self.stoi[w] for w in t]

textify

textify(nums, sep=' ')

Convert a list of nums (or indexes) to their tokens.

Returns:

  • str | List[str]

    The tokens joined by sep, or the list of tokens if sep is None

Source code in pytorch_widedeep/utils/fastai_transforms.py
def textify(self, nums: Collection[int], sep=" ") -> List[str]:
    """Convert a list of ``nums`` (or indexes) to their tokens.

    Returns
    -------
    List[str]
        List of tokens
    """
    return sep.join([self.itos[i] for i in nums]) if sep is not None else [self.itos[i] for i in nums]  # type: ignore

save

save(path)

Save the attribute self.itos in path

Source code in pytorch_widedeep/utils/fastai_transforms.py
def save(self, path):
    """Save the  attribute ``self.itos`` in ``path``"""
    pickle.dump(self.itos, open(path, "wb"))

create classmethod

create(tokens, max_vocab, min_freq, pad_idx=None)

Create a vocabulary object from a set of tokens.

Parameters:

  • tokens (Tokens) –

    Custom type: Collection[Collection[str]] see pytorch_widedeep.wdtypes. Collection of collections of strings (e.g. a list of tokenized sentences)

  • max_vocab (int) –

    maximum vocabulary size

  • min_freq (int) –

    minimum frequency a token must have to be included in the vocabulary

  • pad_idx (Optional[int]) –

    padding index. If None, fastai's Tokenizer leaves index 0 for the unknown token ('xxunk') and defaults to 1 for the padding token ('xxpad').

Examples:

>>> from pytorch_widedeep.utils import Tokenizer, Vocab
>>> texts = ['Machine learning is great', 'but building stuff is even better']
>>> tokens = Tokenizer().process_all(texts)
>>> vocab = Vocab.create(tokens, max_vocab=18, min_freq=1)
>>> vocab.numericalize(['machine', 'learning', 'is', 'great'])
[10, 11, 9, 12]
>>> vocab.textify([10, 11, 9, 12])
'machine learning is great'

ℹ️ NOTE: note the many special tokens that fastai's tokenizer adds. These are particularly useful when building language models and/or in classification/regression tasks. Please see the fastai docs.

Returns:

  • Vocab

    An instance of a Vocab object

Source code in pytorch_widedeep/utils/fastai_transforms.py
@classmethod
def create(
    cls,
    tokens: Tokens,
    max_vocab: int,
    min_freq: int,
    pad_idx: Optional[int] = None,
) -> "Vocab":
    r"""Create a vocabulary object from a set of tokens.

    Parameters
    ----------
    tokens: Tokens
        Custom type: ``Collection[Collection[str]]``  see
        `pytorch_widedeep.wdtypes`. Collection of collection of
        strings (e.g. list of tokenized sentences)
    max_vocab: int
        maximum vocabulary size
    pad_idx: int, Optional, default = None
        padding index. If `None`, Fastai's Tokenizer leaves the 0 index
        for the unknown token (_'xxunk'_) and defaults to 1 for the padding
        token (_'xxpad'_).

    Examples
    --------
    >>> from pytorch_widedeep.utils import Tokenizer, Vocab
    >>> texts = ['Machine learning is great', 'but building stuff is even better']
    >>> tokens = Tokenizer().process_all(texts)
    >>> vocab = Vocab.create(tokens, max_vocab=18, min_freq=1)
    >>> vocab.numericalize(['machine', 'learning', 'is', 'great'])
    [10, 11, 9, 12]
    >>> vocab.textify([10, 11, 9, 12])
    'machine learning is great'

    :information_source: **NOTE**:
    Note the many special tokens that ``fastai``'s' tokenizer adds. These
    are particularly useful when building Language models and/or in
    classification/Regression tasks. Please see the [``fastai`` docs](https://docs.fast.ai/text.core.html#Tokenizing).

    Returns
    -------
    Vocab
        An instance of a `Vocab` object
    """

    freq = Counter(p for o in tokens for p in o)
    itos = [o for o, c in freq.most_common(max_vocab) if c >= min_freq]
    for o in reversed(defaults.text_spec_tok):
        if o in itos:
            itos.remove(o)
        itos.insert(0, o)

    if pad_idx is not None:
        itos.remove(PAD)
        itos.insert(pad_idx, PAD)

    itos = itos[:max_vocab]
    if (
        len(itos) < max_vocab
    ):  # Make sure vocab size is a multiple of 8 for fast mixed precision training
        while len(itos) % 8 != 0:
            itos.append("xxfake")
    return cls(itos)
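
A hedged sketch of the pad_idx argument, reusing the tokens from the example above: by default 'xxunk' sits at index 0 and 'xxpad' at index 1, and passing pad_idx=0 moves the padding token to index 0. Note also that, when the resulting vocabulary is smaller than max_vocab, create appends 'xxfake' tokens until its length is a multiple of 8, which helps with mixed precision training.

>>> vocab = Vocab.create(tokens, max_vocab=18, min_freq=1, pad_idx=0)
>>> vocab.stoi['xxpad']
0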

load classmethod

load(path)

Load an instance of Vocab contained in path

Source code in pytorch_widedeep/utils/fastai_transforms.py
@classmethod
def load(cls, path):
    """Load an intance of :obj:`Vocab` contained in ``path``"""
    itos = pickle.load(open(path, "rb"))
    return cls(itos)
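
A quick save/load roundtrip sketch, reusing the vocab object from the create example above ('vocab.p' is just a hypothetical path):

>>> vocab.save('vocab.p')
>>> new_vocab = Vocab.load('vocab.p')
>>> new_vocab.itos == vocab.itos
True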