Data Stats

Data statistics and analysis utilities.

`WordSequenceStats` `dataclass`

Statistics for a tokenized sequence.

Attributes:

Name	Type	Description
`dataset_name`	`str`	Name of the dataset.
`subset_hierarchy`	`list[str]`	Subset hierarchy of the split.
`split_name`	`str`	Name of the data split.
`language`	`str`	Language code.
`sequence_id`	`bytes`	Unique identifier for the sequence.
`document_index`	`int`	Index of the document in the dataset.
`document_position`	`int`	Position index of a sentence or paragraph within the document.
`total_token_count`	`int`	Total number of tokens in the sequence.
`number_count`	`int`	Number of numeric tokens.
`punctuation_count`	`int`	Number of punctuation tokens.
`word_count`	`int`	Number of word tokens (excluding numbers and punctuation).

Source code in meld/data_stats.py

@dataclass(slots=True)
class WordSequenceStats:
    """
    Statistics for a tokenized sequence.

    Attributes:
        dataset_name: Name of the dataset.
        subset_hierarchy: Subset hierarchy of the split.
        split_name: Name of the data split.
        language: Language code.
        sequence_id: Unique identifier for the sequence.
        document_index: Index of the document in the dataset.
        document_position: Position index of a sentence or paragraph
            within the document.
        total_token_count: Total number of tokens in the sequence.
        number_count: Number of numeric tokens.
        punctuation_count: Number of punctuation tokens.
        word_count: Number of word tokens (excluding numbers and
            punctuation).
    """

    dataset_name: str
    subset_hierarchy: list[str]
    split_name: str
    language: str
    sequence_id: bytes
    document_index: int
    document_position: int

    total_token_count: int
    number_count: int
    punctuation_count: int
    word_count: int

`word_tokenize_dataset(dataset, workers=12)`

Tokenizes all sequences in a dataset and yields sequence-level word frequency statistics.

Iterates over every subset and split, skipping subsets whose language is in the ignored set (e.g. MULTI). Logs progress after each subset.

Parameters:

Name	Type	Description	Default
`dataset`	`Dataset`	Dataset to tokenize.	required
`workers`	`int`	Number of workers to use for parallel word tokenization. Note that a high worker count substantially increases memory consumption for some tokenizers.	`12`

Yields:

Type	Description
`WordSequenceStats`	WordSequenceStats for each sequence.

Source code in meld/data_stats.py

def word_tokenize_dataset(dataset: Dataset, workers: int = 12) -> Iterator[WordSequenceStats]:
    """
    Tokenizes all sequences in a dataset and yields sequence-level word frequency statistics.

    Processes each subset and split, skipping ignored languages (e.g. MULTI). Logs progress after each subset.

    Args:
        dataset: Dataset to tokenize.
        workers: Number of workers to use for parallel word tokenization.
            Note that a high worker count substantially increases memory consumption for some tokenizers.

    Yields:
        WordSequenceStats for each sequence.
    """

    for subset in dataset:
        for split in subset.splits:
            if subset.metadata.language in _IGNORED_LANGUAGES:
                continue

            stats = _collect_split_token_stats(
                dataset.metadata.name,
                subset,
                split,
                workers,
            )

            yield from stats

        logger.info(f"Processed subset {dataset.metadata.name}->{'->'.join(subset.hierarchy)}")
