Data Stats
Data statistics and analysis utilities.
WordSequenceStats (dataclass)
Statistics for a tokenized sequence.
Attributes:
| Name | Type | Description |
|---|---|---|
| `dataset_name` | `str` | Name of the dataset. |
| `subset_hierarchy` | `list[str]` | Subset hierarchy of the split. |
| `split_name` | `str` | Name of the data split. |
| `language` | `str` | Language code. |
| `sequence_id` | `bytes` | Unique identifier for the sequence. |
| `document_index` | `int` | Index of the document in the dataset. |
| `document_position` | `int` | Position index of a sentence or paragraph within the document. |
| `total_token_count` | `int` | Frequency of all tokens. |
| `number_count` | `int` | Frequency of numeric tokens. |
| `punctuation_count` | `int` | Frequency of punctuation tokens. |
| `word_count` | `int` | Frequency of words (excluding numbers and punctuation). |
Source code in meld/data_stats.py
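To illustrate how the count attributes relate to each other, here is a minimal sketch. `WordSequenceStatsSketch` is a hypothetical stand-in that only mirrors the documented field names and types, and `count_tokens` uses a deliberately simple classification rule (all digits means number, all ASCII punctuation means punctuation, anything else is a word). Neither is taken from `meld/data_stats.py`, and the real tokenizer and classification may differ.

```python
import string
from dataclasses import dataclass


@dataclass
class WordSequenceStatsSketch:
    """Hypothetical stand-in mirroring the documented attributes."""

    dataset_name: str
    subset_hierarchy: list[str]
    split_name: str
    language: str
    sequence_id: bytes
    document_index: int
    document_position: int
    total_token_count: int
    number_count: int
    punctuation_count: int
    word_count: int


def count_tokens(tokens: list[str]) -> tuple[int, int, int, int]:
    """Return (total, numbers, punctuation, words) under an assumed, simplistic rule."""
    numbers = sum(1 for t in tokens if t.isdigit())
    punctuation = sum(
        1 for t in tokens if t and all(c in string.punctuation for c in t)
    )
    # Assumption: every token that is neither a number nor punctuation is a word.
    words = len(tokens) - numbers - punctuation
    return len(tokens), numbers, punctuation, words


# Pre-tokenized example sequence; all values below are illustrative only.
tokens = ["In", "2023", ",", "we", "saw", "3", "releases", "."]
total, numbers, punctuation, words = count_tokens(tokens)

stats = WordSequenceStatsSketch(
    dataset_name="example_dataset",
    subset_hierarchy=["example_subset"],
    split_name="train",
    language="en",
    sequence_id=b"seq-0",
    document_index=0,
    document_position=0,
    total_token_count=total,
    number_count=numbers,
    punctuation_count=punctuation,
    word_count=words,
)

print(stats.total_token_count, stats.number_count,
      stats.punctuation_count, stats.word_count)  # 8 2 2 4
```

In this sketch the three category counts sum to `total_token_count`; whether the real implementation guarantees that is not stated in the documentation above.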
word_tokenize_dataset(dataset, workers=12)
Tokenizes all sequences in a dataset and yields sequence-level word frequency statistics.
Processes each subset and split, skipping ignored languages (e.g. MULTI). Logs progress after each subset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `dataset` | `Dataset` | Dataset to tokenize. | required |
| `workers` | `int` | Number of workers to use for parallel word tokenization. Note that a high worker count will increase memory consumption substantially for some tokenizers. | `12` |
Yields:
| Type | Description |
|---|---|
| `WordSequenceStats` | WordSequenceStats for each sequence. |
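Because the function yields one record per sequence, callers can aggregate results incrementally instead of holding them all in memory. The sketch below shows one way to sum word counts per language. `SeqStats` and `fake_stats_stream` are hypothetical stand-ins so the example runs without the `meld` package; in real use the stream would presumably come from `word_tokenize_dataset(dataset, workers=...)`.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class SeqStats:
    """Hypothetical subset of the WordSequenceStats fields used in this sketch."""

    language: str
    total_token_count: int
    word_count: int


def fake_stats_stream() -> Iterator[SeqStats]:
    """Stand-in generator; replaces the real tokenization call for illustration."""
    yield SeqStats(language="en", total_token_count=8, word_count=4)
    yield SeqStats(language="en", total_token_count=5, word_count=4)
    yield SeqStats(language="de", total_token_count=6, word_count=5)


def words_per_language(stream: Iterable[SeqStats]) -> Counter:
    """Sum word_count per language, consuming the stream one record at a time."""
    totals: Counter = Counter()
    for seq in stream:
        totals[seq.language] += seq.word_count
    return totals


totals = words_per_language(fake_stats_stream())
print(dict(totals))  # {'en': 8, 'de': 5}
```

Consuming the generator in a single pass keeps memory use proportional to the number of languages, not the number of sequences.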