Data
Main entrypoint for data download and management.
available_datasets()
Lists all available datasets.
Returns:
| Type | Description |
|---|---|
| `list[str]` | A list containing the names of all datasets included in the package. |
bibliography_entries(datasets=None)
Collects bibliography entries as BibTeX strings for the given datasets, or for MELD itself.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `datasets` | `list[str] \| None` | A list of datasets to collect bibliography entries for, or `None` to collect only the MELD entry. | `None` |
Returns:
| Type | Description |
|---|---|
| `list[str]` | A list of BibTeX strings. If `datasets` is `None`, only the MELD entry is returned; otherwise all entries for the given list of datasets are returned in order. |
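
As a usage sketch, the returned BibTeX strings can be written to a `.bib` file. The helper `write_bib_file` and the placeholder entries below are hypothetical and not part of the package; in real use the list would come from `bibliography_entries(...)`, which is assumed to be importable from `meld.data` based on the source path given on this page.

```python
import tempfile
from pathlib import Path


def write_bib_file(entries: list[str], path: Path) -> int:
    """Write BibTeX entries to `path`, separated by blank lines.

    Hypothetical helper, not part of the package.
    Returns the number of entries written.
    """
    cleaned = [entry.strip() for entry in entries if entry.strip()]
    path.write_text("\n\n".join(cleaned) + "\n", encoding="utf-8")
    return len(cleaned)


# In real use (assumed import path):
#   from meld.data import bibliography_entries
#   entries = bibliography_entries(["CoNLL-2003"])
# Placeholder strings stand in for the returned entries here.
entries = [
    "@misc{placeholder_a, title = {Placeholder A}}",
    "@misc{placeholder_b, title = {Placeholder B}}",
]

with tempfile.TemporaryDirectory() as tmp:
    bib_path = Path(tmp) / "references.bib"
    written = write_bib_file(entries, bib_path)
    bib_text = bib_path.read_text(encoding="utf-8")
```

Because the function is documented to return entries in the order of the requested datasets, the file keeps that order as well.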
Source code in meld/data.py
compute_word_counts(data_directory, output, append=False, workers=12)
Computes word counts using a word tokenizer for each dataset split and writes statistics to a parquet file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data_directory` | `Path` | Directory containing processed benchmark datasets. | required |
| `output` | `Path` | Path to output parquet file for word count statistics. | required |
| `append` | `bool` | Whether to append to an existing output file instead of overwriting. | `False` |
| `workers` | `int` | Number of workers to use for parallel word tokenization. Note that a high worker count will increase memory consumption substantially for some tokenizers. | `12` |
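
To illustrate the role of `workers`, the following minimal sketch counts words per split in parallel. It is not the package's implementation: the regex tokenizer, the thread pool, and the names `count_words` and `word_counts_per_split` are assumptions made for illustration, and no parquet file is written.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Stand-in tokenizer: the tokenizer used by the package is not shown here.
WORD_RE = re.compile(r"\w+")


def count_words(text: str) -> int:
    """Count word tokens in one document with a simple regex."""
    return len(WORD_RE.findall(text))


def word_counts_per_split(
    splits: dict[str, list[str]], workers: int = 12
) -> dict[str, int]:
    """Sum word counts for each split, tokenizing documents in parallel."""
    totals: dict[str, int] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for split, documents in splits.items():
            totals[split] = sum(pool.map(count_words, documents))
    return totals


splits = {
    "train": ["Angela Merkel visited Paris.", "No entities here."],
    "test": ["Short text."],
}
counts = word_counts_per_split(splits, workers=2)
```

A lower `workers` value trades throughput for a smaller memory footprint, which matches the caution in the parameter description.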
Source code in meld/data.py
download(data_directory, datasets=None, force_reprocess=False, meld_open_repo='kgnlp/meld-open', sentence_span_path=None)
Downloads NER datasets and processes them into the standardized benchmark format in the specified directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data_directory` | `Path` | Directory where the datasets will be stored. | required |
| `datasets` | `Sequence[str] \| None` | List of dataset names and/or profiles to download. Dataset profiles and names may be mixed. For example, `["meld:open", "CoNLL-2003"]` will download all datasets in the "meld:open" list and "CoNLL-2003". If | `None` |
| `force_reprocess` | `bool` | Whether to reprocess the datasets even if they are already processed on disk. | `False` |
| `meld_open_repo` | `str \| None` | Repository ID on the Hugging Face Hub or path to preprocessed datasets in MELD format, which will be loaded directly, bypassing processing from source for these datasets. If set to | `'kgnlp/meld-open'` |
| `sentence_span_path` | `Path \| None` | Reproduces the sentence tokenization bundled with the package and stores spans for each sentence in the given directory. Intended for full reproducibility and addition of new datasets. | `None` |
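
The mixing of profiles and dataset names described for `datasets` can be sketched as a simple expansion step. This is an illustration under stated assumptions, not the package's logic: the `PROFILES` table, its placeholder members, and the function `expand_datasets` are hypothetical, since the real contents of "meld:open" are not listed in this section.

```python
from collections.abc import Sequence

# Hypothetical profile table with placeholder members.
PROFILES: dict[str, list[str]] = {
    "meld:open": ["placeholder-dataset-a", "placeholder-dataset-b"],
}


def expand_datasets(requested: Sequence[str]) -> list[str]:
    """Expand profiles to dataset names, keeping order and dropping duplicates."""
    expanded: list[str] = []
    for name in requested:
        # A profile contributes all of its members; a plain name contributes itself.
        for dataset in PROFILES.get(name, [name]):
            if dataset not in expanded:
                expanded.append(dataset)
    return expanded


selection = expand_datasets(["meld:open", "CoNLL-2003", "placeholder-dataset-a"])
```

In this sketch a dataset named both directly and through a profile is processed once, in the position where it first appears.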
Source code in meld/data.py
main(args=None)
Main entry point for the MELD data management CLI.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `args` | `Sequence[str] \| None` | Command line arguments. If `None`, arguments are parsed from `sys.argv`. | `None` |
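
An entry point that accepts `args=None` is a common pattern with Python's `argparse`, where `parse_args(None)` reads `sys.argv[1:]`. The sketch below shows that pattern; whether `main` is built on `argparse` is an assumption, and the program name and options are hypothetical, chosen only to mirror parameters documented on this page.

```python
import argparse
from typing import Optional, Sequence


def parse_cli(args: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse CLI arguments; with args=None argparse falls back to sys.argv[1:]."""
    parser = argparse.ArgumentParser(prog="meld-data-sketch")
    # Hypothetical options, not the real CLI of the package.
    parser.add_argument("data_directory")
    parser.add_argument("--force-reprocess", action="store_true")
    return parser.parse_args(args)


# Passing an explicit list makes the entry point easy to call from tests.
parsed = parse_cli(["./data", "--force-reprocess"])
```

Accepting an explicit argument list keeps the command line testable without touching the process-wide `sys.argv`.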
Source code in meld/data.py
merge_data(data_directory, output=None, label_config=None, merge_documents=False)
Merges data from multiple datasets into a single parquet output.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data_directory` | `Path` | Directory containing processed benchmark datasets. | required |
| `output` | `Path \| IO[bytes] \| None` | Output path for merged data, or stdout if `None`. | `None` |
| `label_config` | `dict[str, str] \| None` | Configuration mapping dataset names to their tagsets for multi-tagset datasets. | `None` |
| `merge_documents` | `bool` | Whether to merge multiple sentences/paragraphs into single documents. | `False` |
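
An `output` argument typed as `Path | IO[bytes] | None` can be resolved to a single binary stream before writing. The following is a minimal sketch of one way to do that; the helper `open_output` is hypothetical and is not taken from the package.

```python
import io
import sys
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import IO, Iterator, Optional, Union


@contextmanager
def open_output(
    output: Optional[Union[Path, IO[bytes]]] = None,
) -> Iterator[IO[bytes]]:
    """Yield a binary stream for a path, an existing stream, or stdout."""
    if output is None:
        # Binary data goes to the underlying byte buffer of stdout.
        yield sys.stdout.buffer
    elif isinstance(output, Path):
        with output.open("wb") as handle:
            yield handle
    else:
        # Caller-owned stream: write to it but leave closing to the caller.
        yield output


buffer = io.BytesIO()
with open_output(buffer) as stream:
    stream.write(b"example bytes")

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "merged.bin"
    with open_output(target) as stream:
        stream.write(b"example bytes")
    file_bytes = target.read_bytes()
```

Writing to stdout by default makes it possible to pipe the merged output into another command, while a `Path` or an in-memory buffer suits scripted use.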
Source code in meld/data.py
sample_data(data_directory, language, subset_size, output=None, split='train', tagset_config=None, merge_documents=False, keep_documents_without_entities=True, keep_discontinuous_spans=False, target_num_tokens=None, aggregation_tokenizer='google/gemma-3-27b-it')
Samples and processes data from a specified directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data_directory` | `Path` | The path to the directory containing the benchmark data. | required |
| `language` | `str` | The ISO 639-3 code of the target language to sample. | required |
| `subset_size` | `int` | The number of samples to extract per dataset. | required |
| `output` | `Path \| IO[bytes] \| None` | The destination for the output, either a file path or a writable IO object. Defaults to standard output. | `None` |
| `split` | `str` | The dataset split to process, e.g. "train" or "validation". | `'train'` |
| `tagset_config` | `dict[str, str] \| None` | Indicates which tagset to use for datasets with multiple tag sets for each sample. E.g. | `None` |
| `merge_documents` | `bool` | Whether to merge documents consisting of multiple sentences or paragraphs into a single sample. | `False` |
| `keep_documents_without_entities` | `bool` | Whether to keep documents without entities. | `True` |
| `keep_discontinuous_spans` | `bool` | Whether to keep discontinuous spans. By default, only continuous spans are kept and flattened into simplified span annotations. | `False` |
| `target_num_tokens` | `int \| None` | If | `None` |
| `aggregation_tokenizer` | `str` | Tokenizer used for counting tokens if `target_num_tokens` is set. | `'google/gemma-3-27b-it'` |
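
The effect of `keep_documents_without_entities` and `keep_discontinuous_spans` can be sketched on a toy document. The document layout used here (an `entities` list whose items carry a `label` and a list of `(start, end)` fragments) and the function `filter_document` are assumptions made for illustration; the package's actual data format is not described in this section.

```python
from typing import Optional


def filter_document(
    doc: dict,
    keep_documents_without_entities: bool = True,
    keep_discontinuous_spans: bool = False,
) -> Optional[dict]:
    """Apply the two filters to one document; return None if it is dropped."""
    entities = doc["entities"]
    if not keep_discontinuous_spans:
        # Keep single-fragment entities only and flatten them
        # to (start, end, label) tuples.
        entities = [
            (e["fragments"][0][0], e["fragments"][0][1], e["label"])
            for e in entities
            if len(e["fragments"]) == 1
        ]
    if not entities and not keep_documents_without_entities:
        return None
    return {"text": doc["text"], "entities": entities}


doc = {
    "text": "Anna has pain in the left and right knee.",
    "entities": [
        {"label": "PER", "fragments": [(0, 4)]},              # "Anna"
        {"label": "BODY", "fragments": [(21, 25), (36, 40)]},  # "left" + "knee"
    ],
}
filtered = filter_document(doc)
empty = filter_document(
    {"text": "Nothing to tag.", "entities": []},
    keep_documents_without_entities=False,
)
```

In this sketch the discontinuous entity is dropped before the empty-document check, so a document whose only entities are discontinuous would also count as having no entities; the order used by the package may differ.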