Data Stats

Data statistics and analysis utilities.

WordSequenceStats dataclass

Statistics for a tokenized sequence.

Attributes:

Name                Type       Description
dataset_name        str        Name of the dataset.
subset_hierarchy    list[str]  Subset hierarchy of the split.
split_name          str        Name of the data split.
language            str        Language code.
sequence_id         bytes      Unique identifier for the sequence.
document_index      int        Index of the document in the dataset.
document_position   int        Position index of a sentence or paragraph within the document.
total_token_count   int        Frequency of all tokens.
number_count        int        Frequency of numeric tokens.
punctuation_count   int        Frequency of punctuation tokens.
word_count          int        Frequency of words (excluding numbers and punctuation).

Source code in meld/data_stats.py
@dataclass(slots=True)
class WordSequenceStats:
    """
    Statistics for a tokenized sequence.

    Attributes:
        dataset_name: Name of the dataset.
        subset_hierarchy: Subset hierarchy of the split.
        split_name: Name of the data split.
        language: Language code.
        sequence_id: Unique identifier for the sequence.
        document_index: Index of the document in the dataset.
        document_position: Position index of a sentence or paragraph
            within the document.
        total_token_count: Frequency of all tokens.
        number_count: Frequency of numeric tokens.
        punctuation_count: Frequency of punctuation tokens.
        word_count: Frequency of words (excluding numbers and
            punctuation).
    """

    dataset_name: str
    subset_hierarchy: list[str]
    split_name: str
    language: str
    sequence_id: bytes
    document_index: int
    document_position: int

    total_token_count: int
    number_count: int
    punctuation_count: int
    word_count: int

word_tokenize_dataset(dataset, workers=12)

Tokenizes all sequences in a dataset and yields sequence-level word frequency statistics.

Processes each subset and split, skipping ignored languages (e.g. MULTI). Logs progress after each subset.

Parameters:

Name     Type     Description                                                                                                                                      Default
dataset  Dataset  Dataset to tokenize.                                                                                                                             required
workers  int      Number of workers to use for parallel word tokenization. Note that a high worker count will increase memory consumption substantially for some tokenizers.  12

Yields:

Type               Description
WordSequenceStats  WordSequenceStats for each sequence.

Source code in meld/data_stats.py
def word_tokenize_dataset(dataset: Dataset, workers: int = 12) -> Iterator[WordSequenceStats]:
    """
    Tokenizes all sequences in a dataset and yields sequence-level word frequency statistics.

    Processes each subset and split, skipping ignored languages (e.g. MULTI). Logs progress after each subset.

    Args:
        dataset: Dataset to tokenize.
        workers: Number of workers to use for parallel word tokenization.
            Note that a high worker count will increase memory consumption substantially for some tokenizers.

    Yields:
        WordSequenceStats for each sequence.
    """

    for subset in dataset:
        for split in subset.splits:
            if subset.metadata.language in _IGNORED_LANGUAGES:
                continue

            stats = _collect_split_token_stats(
                dataset.metadata.name,
                subset,
                split,
                workers,
            )

            yield from stats

        logger.info(f"Processed subset {dataset.metadata.name}->{'->'.join(subset.hierarchy)}")
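
Because the function is a generator, its results can be aggregated one record at a time without holding every sequence in memory. The sketch below sums counts per language. Note the assumptions: `fake_stats` is a hypothetical stand-in for `word_tokenize_dataset(dataset)`, `Stats` is a reduced stand-in record carrying only the fields read here, `totals_by_language` is an illustrative helper that is not part of the documented module, and all numbers are invented.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Stats:
    """Hypothetical stand-in carrying only the fields this sketch reads."""

    language: str
    total_token_count: int
    word_count: int


def fake_stats() -> Iterator[Stats]:
    """Stand-in for word_tokenize_dataset(dataset); values are invented."""
    yield Stats("en", 12, 9)
    yield Stats("en", 8, 7)
    yield Stats("de", 10, 6)


def totals_by_language(stream: Iterable[Stats]) -> dict[str, Counter]:
    """Sum token and word counts per language, one record at a time."""
    totals: dict[str, Counter] = defaultdict(Counter)
    for s in stream:
        totals[s.language]["tokens"] += s.total_token_count
        totals[s.language]["words"] += s.word_count
        totals[s.language]["sequences"] += 1
    return dict(totals)


totals = totals_by_language(fake_stats())
```

With the real function, the stream would presumably be passed in as `totals_by_language(word_tokenize_dataset(dataset, workers=4))`, choosing a lower worker count if memory is tight.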