Manifest
Dataset manifest and configuration classes.
Agreement
Inter-annotator agreement metrics, if reported.
Attributes:
| Name | Type | Description |
|---|---|---|
| value | `dict[str, float] \| float` | Agreement score (single value or per-label scores). |
| metric | `str \| None` | Name of the agreement metric used. |
Source code in meld/manifest.py
AnnotationMetadata
Metadata about the annotation process.
Attributes:
| Name | Type | Description |
|---|---|---|
| type | `str \| None` | Type of annotation (e.g., manual, heuristic). |
| annotator_count | `int \| str \| None` | Number of annotators. |
| features | `list[str] \| None` | Notable features of the annotation process. |
| agreement | `Agreement \| list[Agreement] \| None` | Inter-annotator agreement metrics reported for the data. |
Source code in meld/manifest.py
merge(other)
Merges this annotation metadata with another instance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | `Self \| None` | Another AnnotationMetadata instance to merge with or None. | required |
Returns:
| Type | Description |
|---|---|
Self
|
A new AnnotationMetadata with values from |
Self
|
precedence. |
Source code in meld/manifest.py
ByteOffsetJSONLArguments
Configuration arguments for byte offset-based JSONL data.
Attributes:
| Name | Type | Description |
|---|---|---|
| text_key | `str` | Key for the text field in JSON objects. |
| offsets_key | `str` | Key for the byte offsets field in JSON objects. |
| annotated_span_target_key | `str \| None` | Optional key containing the expected string representation of each span, used for validation. |
Source code in meld/manifest.py
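Byte offsets differ from character offsets as soon as the text contains non-ASCII characters. The following minimal sketch illustrates the difference; it assumes UTF-8 encoding and half-open `[start, end)` offsets, neither of which is stated on this page, and both helper names are hypothetical rather than part of `meld`.

```python
def span_from_byte_offsets(text: str, start: int, end: int) -> str:
    """Return the substring covered by UTF-8 byte offsets [start, end).

    Hypothetical helper: the encoding and the half-open convention are
    assumptions made for this illustration.
    """
    encoded = text.encode("utf-8")
    return encoded[start:end].decode("utf-8")


def byte_to_char_offset(text: str, byte_offset: int) -> int:
    """Convert a UTF-8 byte offset into a character offset."""
    return len(text.encode("utf-8")[:byte_offset].decode("utf-8"))


text = "Zürich is in Switzerland"
# "ü" takes two bytes in UTF-8, so "Zürich" spans bytes 0..7 but characters 0..6.
print(span_from_byte_offsets(text, 0, 7))
print(byte_to_char_offset(text, 7))
```

Comparing the decoded span against an expected string is the kind of check that `annotated_span_target_key` appears intended to support.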
ByteOffsetJSONLConfiguration
Configuration for the byte offset-based JSONL reader.
Attributes:
| Name | Type | Description |
|---|---|---|
| type | `Literal['byte_offset_jsonl']` | The configuration type identifier ("byte_offset_jsonl"). |
| arguments | `ByteOffsetJSONLArguments` | ByteOffsetJSONL-specific arguments. |
Source code in meld/manifest.py
CoNLLArguments
Configuration arguments for CoNLL format data processing.
Attributes:
| Name | Type | Description |
|---|---|---|
| shards_are_documents | `bool` | Whether each shard file represents a single, complete document. |
| dialect | `CoNLLDialectNames` | The CoNLL dialect variant to use. |
| delimiter | `str` | Field delimiter in the CoNLL file. |
| label_map | `dict[str, dict[BIOField, int]] \| None` | Optional mapping of tagsets to label index mappings for CoNLL-style data that uses tag indices. |
| bioes_to_bio | `bool` | Whether to convert BIOES tags to BIO format. |
| enforce_blank_lines | `bool` | Whether to enforce that blank lines between sentences do not contain whitespace. |
| preprocessor | `Literal['e-ner', 'stackoverflow_ner', 'pioner', 'nytk_nerkor'] \| None` | Optional preprocessor for specific datasets that is run on the raw data before parsing. |
Source code in meld/manifest.py
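To illustrate what the `bioes_to_bio` option refers to, here is a minimal sketch of the usual BIOES-to-BIO mapping, in which `S-` becomes `B-` and `E-` becomes `I-`. The function below is a hypothetical stand-in and not necessarily how `meld` implements the conversion.

```python
def bioes_to_bio(tags: list[str]) -> list[str]:
    """Map BIOES tags onto BIO tags (illustrative helper, not the meld API)."""
    converted = []
    for tag in tags:
        prefix, separator, label = tag.partition("-")
        if separator and prefix == "S":
            # A single-token entity starts (and ends) at this token.
            converted.append(f"B-{label}")
        elif separator and prefix == "E":
            # The end of a multi-token entity is an ordinary inside tag in BIO.
            converted.append(f"I-{label}")
        else:
            # "O", "B-..." and "I-..." are identical in both schemes.
            converted.append(tag)
    return converted


print(bioes_to_bio(["S-PER", "O", "B-LOC", "I-LOC", "E-LOC"]))
```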
CoNLLConfiguration
Configuration for the CoNLL-style data reader.
Attributes:
| Name | Type | Description |
|---|---|---|
| type | `Literal['conll']` | The configuration type identifier ("conll"). |
| arguments | `CoNLLArguments` | CoNLL-specific arguments. |
| detokenizer_type | `DetokenizerType` | Strategy for detokenizing tokens. |
Source code in meld/manifest.py
ConvertStep
A data pipe step that converts data from a source format to the normalized MELD format using a source-format-specific reader.
Attributes:
| Name | Type | Description |
|---|---|---|
| step | `Literal['convert']` | The step type identifier ("convert"). |
| reader | `ReaderConfiguration` | The reader configuration to use for parsing. |
| filter_empty_documents | `bool` | Whether to explicitly remove all documents with empty text received from the reader. If set to |
Source code in meld/manifest.py
DataSource
A source from which the original raw or annotated data was collected.
Attributes:
| Name | Type | Description |
|---|---|---|
| source | `str` | Name or identifier of the source. |
| url | `str` | URL to the source. |
Source code in meld/manifest.py
DatasetConvertStep
A step that converts Hugging Face datasets to the normalized MELD format.
Attributes:
| Name | Type | Description |
|---|---|---|
| step | `Literal['columns']` | The step type identifier ("columns"). |
| tagsets | `dict[str, str \| TagSet] \| None` | Mapping of tagset names to column names or TagSet configurations. |
| sequence_type | `Literal['sentence', 'passage']` | Whether sequences should be treated as sentences or passages. |
| detokenizer_type | `DetokenizerType` | Strategy for detokenizing text. |
| bio_type | `Literal['iob', 'iob_type_only']` | Type of BIO tags to parse. Options are "iob" (standard IOB format) or "iob_type_only" (IOB format without an "I-" or "B-" prefix). Defaults to "iob". |
| filter_empty_documents | `bool` | Whether to explicitly remove all documents with empty text received from the reader. If set to |
Source code in meld/manifest.py
DatasetPartition
Configuration for a dataset partition.
Attributes:
| Name | Type | Description |
|---|---|---|
| type | `Literal['subset', 'split']` | Whether this is a subset or split. |
| name | `str` | Name of the partition. |
Source code in meld/manifest.py
DownloadStep
A data pipe step that downloads files from URLs with checksum verification.
Attributes:
| Name | Type | Description |
|---|---|---|
| step | `Literal['download']` | The step type identifier ("download"). |
| urls | `list[URLWithChecksum]` | List of URLs with their expected checksums. |
Source code in meld/manifest.py
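Checksum verification of a downloaded file can be sketched with the standard library alone. The helper names below are hypothetical and the error handling is an assumption; the sketch only shows the general technique of comparing a file's SHA256 digest against the `sha256` value recorded next to each URL.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the hex-encoded SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path: str, expected_sha256: str) -> None:
    """Raise ValueError if the file does not match the expected digest.

    Hypothetical helper: the exception type is an assumption of this sketch.
    """
    actual = sha256_of_file(path)
    if actual != expected_sha256.lower():
        raise ValueError(f"Checksum mismatch for {path}: got {actual}")
```

Reading in chunks keeps memory use constant even for large archives.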
EBMNLPStandoffArguments
Configuration arguments for the EBM-NLP dataset's standoff format.
Attributes:
| Name | Type | Description |
|---|---|---|
| label_map | `dict[str, dict[str, int]]` | Mapping of label names to integer indices for each tagset. |
| broad_label_map | `dict[str, dict[str, int]]` | Mapping of broad label names to integer indices for each tagset. |
Source code in meld/manifest.py
EBMNLPStandoffConfiguration
Configuration for the EBM-NLP standoff reader.
Attributes:
| Name | Type | Description |
|---|---|---|
| type | `Literal['ebm_nlp_standoff']` | The configuration type identifier ("ebm_nlp_standoff"). |
| arguments | `EBMNLPStandoffArguments` | EBMNLPStandoff-specific arguments. |
Source code in meld/manifest.py
ExtractStep
A step that extracts files from a compressed archive.
Attributes:
| Name | Type | Description |
|---|---|---|
| step | `Literal['extract']` | The step type identifier ("extract"). |
| from_file | `str` | The compressed file to extract from. |
| files | `list[str]` | List of file paths to extract. |
| use_globs | `bool` | Whether to treat file paths as glob patterns. |
Source code in meld/manifest.py
FileSubset
Configuration for the processing data pipe and splits of a subset.
Attributes:
| Name | Type | Description |
|---|---|---|
| train | `SplitFiles` | Training split file specifications. |
| validation | `SplitFiles` | Validation split file specifications. |
| test | `SplitFiles` | Test split file specifications. |
| data_pipe | `Annotated[list[GenericDataPipeStep], Field(min_length=1)] \| None` | Optional data pipe for this specific subset. If not defined or set to |
| directory | `str` | Optional subdirectory containing the data files. |
| language | `str \| None` | Optional language code for this subset. This should be an ISO 639-3 language code if possible. |
Source code in meld/manifest.py
Format
Metadata providing details about the source data format of the dataset.
Attributes:
| Name | Type | Description |
|---|---|---|
| text | `Literal['pre-tokenized', 'original']` | Whether text is pre-tokenized or original documents are preserved. |
| tags | `Literal['bio', 'bioes', 'spans', 'discontinuous_spans']` | Type of tag format (BIO, BIOES, spans, or discontinuous spans). |
| text_properties | `list` | Additional metadata concerning the text content of the dataset. |
| token_format | `Literal['wikiann'] \| None` | Optional tokenizer details for pre-tokenized datasets (e.g., WikiANN using hashes to represent whitespace in certain languages). |
| tag_format | `Literal['indices'] \| None` | Optional tag format (e.g., whether tags are represented by indices). |
| tag_alignment | `Literal['offsets', 'byte_offsets'] \| None` | How tags are aligned to text (offsets or byte offsets). |
| token_alignment | `Literal['offsets'] \| None` | How tokens are aligned to text. |
| text_alignment | `Literal['offsets'] \| None` | How text segments are aligned. |
Source code in meld/manifest.py
GenericArguments
Configuration arguments for generic data loading.
Attributes:
| Name | Type | Description |
|---|---|---|
| download_data_pipe | `list[GenericDataPipeStep]` | Data pipe for downloading and processing data. |
Source code in meld/manifest.py
GenericLoader
Loader configuration for generic data loading and processing from web sources or local files.
Attributes:
| Name | Type | Description |
|---|---|---|
| loader | `Literal['generic']` | The loader type identifier ("generic"). |
| arguments | `GenericArguments` | Generic data processing arguments. |
Source code in meld/manifest.py
GitLoader
Loader configuration for Git repositories.
Attributes:
| Name | Type | Description |
|---|---|---|
| loader | `Literal['git']` | The loader type identifier ("git"). |
| arguments | `GitLoaderArguments` | Git data processing arguments. |
Source code in meld/manifest.py
GitLoaderArguments
Configuration arguments for downloading data from a Git repository.
Attributes:
| Name | Type | Description |
|---|---|---|
| repo | `str` | Git repository URL. |
| revision | `str` | Repository version (preferably a commit hash). |
| subsets | `dict[str, FileSubset]` | Mapping of subset names to subset configurations. |
| base_language | `str \| None` | Base language code for the dataset. This should be an ISO 639-3 language code if possible. |
| default_data_pipe | `list[GenericDataPipeStep]` | Default data pipe to use for processing. Each subset can override the default data pipe for subset-specific processing. |
| keep_repo | `bool` | Whether to keep the cloned repository after relevant files have been extracted. |
Source code in meld/manifest.py
GitStep
A data pipe step that clones a Git repository and extracts the given files.
Attributes:
| Name | Type | Description |
|---|---|---|
| step | `Literal['git']` | The step type identifier ("git"). |
| repo | `str` | The Git repository URL to clone. |
| revision | `str` | The commit hash to check out. |
| files | `list[str]` | List of file paths to extract from the repository. Relative to |
| directory | `str` | Optional base directory to which the paths in |
| keep_repo | `bool` | Whether to keep the cloned repository on disk after extraction. |
Source code in meld/manifest.py
GoogleDocsStep
A data pipe step that downloads files from Google Docs.
Attributes:
| Name | Type | Description |
|---|---|---|
| step | `Literal['google_docs']` | The step type identifier ("google_docs"). |
| urls | `list[URLWithTarget]` | List of Google Docs URLs with target filenames and checksums. |
Source code in meld/manifest.py
HuggingfaceArguments
Configuration arguments for loading data from Hugging Face datasets.
Note: Only one of language_column, language_from_subset_name, and base_language can be specified at a time.
Attributes:
| Name | Type | Description |
|---|---|---|
| repo | `str` | Hugging Face dataset repository ID. |
| revision | `str` | Repository version (ideally a commit hash). |
| text_column | `str` | Name of the text column in the dataset. |
| tag_column | `str` | Name of the tag column in the dataset. |
| train_name | `str \| None` | Name of the training split. |
| validation_name | `str \| None` | Name of the validation split. |
| test_name | `str \| None` | Name of the test split. |
| base_language | `str \| None` | Base language code for the dataset. This should be an ISO 639-3 language code if possible. |
| language_column | `str \| None` | Optional name of the column containing language codes for splitting the dataset into language subsets. Will be converted to ISO 639-3 automatically, if possible. |
| language_from_subset_name | `str \| None` | Pattern to dynamically extract the language from subset names. Will be converted to ISO 639-3 automatically, if possible. |
| fast_subset_load | `bool` | Whether to use optimized loading for datasets with many subsets (such as WikiANN). |
| trust_remote_code | `bool` | Whether to trust remote code execution. |
| split_naming_pattern | `str \| None` | Pattern for naming splits. |
| data_pipe | `list[DataPipeStep]` | Data pipe steps to apply to the dataset. |
| data_files | `SubsetDataFiles \| dict[str, list[str] \| str] \| None` | Manual data file specifications which will override the manual file discovery of the |
Source code in meld/manifest.py
validate_base_language_and_splits()
Validates that at least one split is configured and that the language configuration is consistent.
Returns:
| Type | Description |
|---|---|
| `Self` | Self for method chaining. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If no splits are named or if language configurations conflict. |
Source code in meld/manifest.py
HuggingfaceLoader
Loader configuration for Hugging Face datasets.
Attributes:
| Name | Type | Description |
|---|---|---|
| loader | `Literal['huggingface']` | The loader type identifier ("huggingface"). |
| arguments | `HuggingfaceArguments` | Hugging Face data processing arguments. |
Source code in meld/manifest.py
Licenses
License information for annotations and text.
Attributes:
| Name | Type | Description |
|---|---|---|
| annotations | `str \| dict[str, str]` | License for annotations (string or per-source mapping). |
| text | `str \| dict[str, str]` | License for text content (string or per-source mapping). |
Source code in meld/manifest.py
licenses()
Collects all unique licenses in a sorted list.
Returns:
| Type | Description |
|---|---|
| `list[str]` | All unique licenses in lexicographically sorted order. |
Source code in meld/manifest.py
MELDDataset
Complete dataset definition, including metadata and preprocessing data pipes for integration into MELD.
Attributes:
| Name | Type | Description |
|---|---|---|
| citekeys | `list[str]` | Citation keys for this dataset in the included BibTeX bibliography. |
| source | `Loader` | Data loader configuration. |
| format | `Format` | Data format specification. |
| metadata | `list[SubMetadata]` | List of dataset-, subset-, and split-specific metadata that can be resolved via a CSS-style cascade to reduce repetition. |
| settings | `dict[str, list[DatasetPartition]] \| None` | Any evaluation settings defined for the dataset (such as coarse-grained and fine-grained, few-shot, etc.) and which subsets or splits they include. |
| note | `str \| None` | Optional notes about this dataset. |
| use_shared_cache | `str \| None` | Whether to use a shared cache for downloaded resources in cases where multiple datasets are downloaded from the same source. |
Source code in meld/manifest.py
Metadata
General metadata for a dataset or subset.
Attributes:
| Name | Type | Description |
|---|---|---|
| license | `str \| Licenses \| None` | License information (string or |
| annotation | `AnnotationMetadata \| None` | Annotation process metadata. |
| primary_domain | `str \| None` | Primary domain of the data (e.g., medical, legal). |
| other_domains | `list[str] \| None` | Additional broad domains present in the data. |
| finegrained_domains | `list[str] \| None` | Fine-grained domains present in the data. |
| data_sources | `list[str \| DataSource] \| None` | List of data sources. |
| dataset_lineage | `list[str] \| None` | Provenance information if the data was derived from one or multiple previously published datasets. |
| label_set_standard | `str \| None` | Standard or convention of the label set (such as OntoNotes or XBRL tags). |
| document_boundaries | `Literal['full', 'partial', 'none'] \| None` | Whether the original documents or parts of documents can be restored from the data based on available boundary information or file structure. |
| sentence_boundaries | `SentenceBoundaryType` | Whether the data is segmented into sentences or sections. |
Source code in meld/manifest.py
licenses()
Collects all unique licenses of the data in a sorted list.
Returns:
| Type | Description |
|---|---|
| `list[str]` | All unique licenses in lexicographically sorted order. |
Source code in meld/manifest.py
merge(other)
Merges this metadata with another instance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | `Metadata` | Another Metadata instance to merge with. | required |
Returns:
| Type | Description |
|---|---|
| `Metadata` | A new Metadata with values from other taking precedence where None. |
Source code in meld/manifest.py
NestedDataPipeStep
A data pipe step that defines a nested data pipe.
Attributes:
| Name | Type | Description |
|---|---|---|
| step | `Literal['data_pipe']` | The step type identifier ("data_pipe"). |
| data_pipe | `list[GenericDataPipeStep]` | List of other data pipe steps to execute, potentially including further nested data pipe steps. |
Source code in meld/manifest.py
OffsetCSVArguments
Configuration arguments for offset-based CSV data.
Attributes:
| Name | Type | Description |
|---|---|---|
| text_column | `str` | Name of the column containing the text. |
| offsets_column | `str` | Name of the column containing character offsets for entity span annotations. |
Source code in meld/manifest.py
OffsetCSVConfiguration
Configuration for the offset-based CSV data reader.
Attributes:
| Name | Type | Description |
|---|---|---|
| type | `Literal['offset_csv']` | The configuration type identifier ("offset_csv"). |
| arguments | `OffsetCSVArguments` | OffsetCSV-specific arguments. |
Source code in meld/manifest.py
PlainSpanArguments
Configuration arguments for plain span data.
Attributes:
| Name | Type | Description |
|---|---|---|
| span_format | `Literal['json', 'python']` | Format of span annotations (JSON or Python dictionary style). |
Source code in meld/manifest.py
PlainSpanConfiguration
Configuration for the plain span reader.
Attributes:
| Name | Type | Description |
|---|---|---|
| type | `Literal['plain_spans']` | The configuration type identifier ("plain_spans"). |
| arguments | `PlainSpanArguments` | PlainSpan-specific arguments. |
Source code in meld/manifest.py
ReadSplitStep
A data pipe step that splits data based on external metadata files.
Attributes:
| Name | Type | Description |
|---|---|---|
| step | `Literal['read_splits']` | The step type identifier ("read_splits"). |
| language | `str` | The language of the data. This should be an ISO 639-3 language code if possible. |
| split_files | `str \| dict[str, str]` | Path or mapping of split names to file paths. |
| splits_reader | `Literal['legalnero_standoff_split_columns', 'agriner_standoff_split_json', 'somesci_standoff_split_json']` | The split metadata reader implementation to use. |
| directories | `str \| dict[str, str]` | Optional directory or mapping of directories for files. |
| split_name_map | `dict[str, str] \| None` | Optional mapping to rename splits. |
| subset | `str \| None` | Optional subset name within the data. |
| metadata | `dict[str, Any]` | Additional metadata for this step. |
Source code in meld/manifest.py
ReaderConfigurationWithoutArguments
Configuration for reader formats that require no additional arguments.
Attributes:
| Name | Type | Description |
|---|---|---|
| type | `Literal['bioc_xml', 'pubtator', 'scirex_jsonl', 'scier_jsonl', 'arabic_cross_dialectal_json', 'dataset_spans']` | The configuration type identifier for the desired reader format. |
Source code in meld/manifest.py
SofcStandoffArguments
Configuration arguments for the SOFC dataset's standoff-style format.
Attributes:
| Name | Type | Description |
|---|---|---|
| label_source | `Literal['frames', 'entities']` | Source of labels to use as entity annotations for a given subset (frames or entities). |
Source code in meld/manifest.py
SofcStandoffConfiguration
Configuration for the SOFC standoff reader.
Attributes:
| Name | Type | Description |
|---|---|---|
| type | `Literal['sofc_standoff']` | The configuration type identifier ("sofc_standoff"). |
| arguments | `SofcStandoffArguments` | SofcStandoff-specific arguments. |
Source code in meld/manifest.py
SplitSelector
Selector for dataset splits containing the hierarchical subset path and an optional split name.
Attributes:
| Name | Type | Description |
|---|---|---|
| subset_hierarchy | `list[str]` | Path through the subset hierarchy (list of subset names). |
| split | `str \| None` | Optional split name (train, validation, test, or None). |
Source code in meld/manifest.py
matches(hierarchy, split)
Checks if this selector matches the given hierarchy and split.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| hierarchy | `list[str]` | The subset hierarchy to check against. | required |
| split | `str` | The split name to check against. | required |
Returns:
| Type | Description |
|---|---|
| `bool` | True if this selector matches. |
Source code in meld/manifest.py
parse(selector)
classmethod
Parses a selector string into a SplitSelector.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| selector | `str` | Selector string in the format "subset1.subset2.split". An asterisk acts as a wildcard that matches any subset on that level of the hierarchy, such as "subset1.*.split". | required |
Returns:
| Type | Description |
|---|---|
| `Self` | A new SplitSelector instance. |
Source code in meld/manifest.py
specificity()
Calculates the specificity of this selector.
Returns:
| Type | Description |
|---|---|
| `tuple[int, int]` | Specificity of the selector in the form (1 or 0 indicating whether a specific split was specified, count of non-wildcard selectors). |
Source code in meld/manifest.py
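How the three selector methods fit together can be sketched with a small stand-in class. Several details are assumptions made for illustration and are not confirmed by this page: the final component is treated as a split name only if it is one of `train`, `validation`, or `test`, and a selector matches any hierarchy that it prefixes. `SelectorSketch` is a hypothetical name, not the `SplitSelector` class itself.

```python
from __future__ import annotations

from dataclasses import dataclass

# Assumption: only these names are recognized as splits when parsing.
SPLIT_NAMES = {"train", "validation", "test"}


@dataclass(frozen=True)
class SelectorSketch:
    subset_hierarchy: tuple[str, ...]
    split: str | None = None

    @classmethod
    def parse(cls, selector: str) -> SelectorSketch:
        """Split "subset1.subset2.split" into a hierarchy and an optional split."""
        parts = [part for part in selector.split(".") if part]
        split = parts.pop() if parts and parts[-1] in SPLIT_NAMES else None
        return cls(tuple(parts), split)

    def matches(self, hierarchy: list[str], split: str) -> bool:
        """Check the split name, then compare subsets level by level ("*" matches anything)."""
        if self.split is not None and self.split != split:
            return False
        if len(self.subset_hierarchy) > len(hierarchy):
            return False
        return all(
            wanted == "*" or wanted == actual
            for wanted, actual in zip(self.subset_hierarchy, hierarchy)
        )

    def specificity(self) -> tuple[int, int]:
        """Return (split specified?, number of non-wildcard subset names)."""
        concrete = sum(1 for name in self.subset_hierarchy if name != "*")
        return (int(self.split is not None), concrete)


selector = SelectorSketch.parse("en.*.train")
print(selector.matches(["en", "news"], "train"), selector.specificity())
```

Sorting selectors by their specificity tuple is one way to realize a CSS-style cascade in which more specific selectors are applied last.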
SplitStep
A data pipe step that splits data into train/validation/test sets.
Attributes:
| Name | Type | Description |
|---|---|---|
| step | `Literal['splits']` | The step type identifier ("splits"). |
| language | `str` | The language of the data. This should be an ISO 639-3 language code if possible. |
| directory | `str` | Optional subdirectory containing the data files. |
| train | `SplitFiles` | Training split file specifications. |
| validation | `SplitFiles` | Validation split file specifications. |
| test | `SplitFiles` | Test split file specifications. |
| subset | `str \| None` | Optional subset name within the data. |
| metadata | `dict[str, Any]` | Additional metadata for this split. |
Source code in meld/manifest.py
validate_splits()
Validates that at least one split is defined.
Returns:
| Type | Description |
|---|---|
| `Self` | The |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If no splits are defined. |
Source code in meld/manifest.py
StandoffArguments
Configuration arguments for standoff annotated data.
Attributes:
| Name | Type | Description |
|---|---|---|
| offsets_without_newlines | `bool` | Whether to exclude newlines from offset calculations. |
Source code in meld/manifest.py
StandoffConfiguration
Configuration for the standoff annotation reader.
Attributes:
| Name | Type | Description |
|---|---|---|
| type | `Literal['standoff']` | The configuration type identifier ("standoff"). |
| arguments | `StandoffArguments` | Standoff-specific arguments. |
Source code in meld/manifest.py
SubMetadata
Bases: Metadata
Metadata that applies to all splits matching the given selectors.
Attributes:
| Name | Type | Description |
|---|---|---|
| license | `str \| Licenses \| None` | License information (string or |
| annotation | `AnnotationMetadata \| None` | Annotation process metadata. |
| primary_domain | `str \| None` | Primary domain of the data (e.g., medical, legal). |
| other_domains | `list[str] \| None` | Additional broad domains present in the data. |
| finegrained_domains | `list[str] \| None` | Fine-grained domains present in the data. |
| data_sources | `list[str \| DataSource] \| None` | List of data sources. |
| dataset_lineage | `list[str] \| None` | Provenance information if the data was derived from one or multiple previously published datasets. |
| label_set_standard | `str \| None` | Standard or convention of the label set (such as OntoNotes or XBRL tags). |
| document_boundaries | `Literal['full', 'partial', 'none'] \| None` | Whether the original documents or parts of documents can be restored from the data based on available boundary information or file structure. |
| sentence_boundaries | `SentenceBoundaryType` | Whether the data is segmented into sentences or sections. |
| split | `list[Annotated[SplitSelector, BeforeValidator(parse)]]` | List of split selectors that this metadata applies to. |
Source code in meld/manifest.py
SubsetDataFiles
Configuration for dataset files organized by subset.
Attributes:
| Name | Type | Description |
|---|---|---|
| subsets | `dict[str, dict[str, list[str] \| str]]` | Mapping of subset names to split file specifications. |
Source code in meld/manifest.py
TagSet
Configuration of a tagset for data conversion and normalization.
Attributes:
| Name | Type | Description |
|---|---|---|
| label_map | `dict[BIOField, int] \| None` | Optional mapping of labels to integer indices for handling formats where labels are represented by indices. |
| column | `str \| None` | Optional column name to use for this tagset. |
Source code in meld/manifest.py
URLWithChecksum
A URL with its SHA256 checksum for verification.
Attributes:
| Name | Type | Description |
|---|---|---|
| url | `str` | The URL to download from. |
| sha256 | `str` | The expected SHA256 checksum of the downloaded file. |
Source code in meld/manifest.py
URLWithTarget
A URL with target filename and SHA256 checksum for Google Docs downloads.
Attributes:
| Name | Type | Description |
|---|---|---|
| url | `str` | The Google Docs URL to download from. |
| target_filename | `str` | The filename to save the downloaded file as. |
| sha256 | `str` | The expected SHA256 checksum of the downloaded file. |
Source code in meld/manifest.py
load_label_map(path_or_dict=None)
Loads and deserializes a label map from a file or dictionary. Loads the included normalized label mapping if path_or_dict is None.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path_or_dict | `PathLike \| str \| dict[str, Any] \| None` | Path to a label map JSON file, or a dictionary to validate. If | `None` |
Returns:
| Type | Description |
|---|---|
| `LabelMap` | Parsed nested dictionary containing a mapping for each dataset, subset, and tagset. |
Source code in meld/manifest.py
load_manifest(path_or_dict=None)
Loads and deserializes a dataset manifest from a file or dictionary.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path_or_dict | `PathLike \| str \| dict[str, Any] \| None` | Path to a manifest JSON file, or a dictionary to validate. | `None` |
Returns:
| Type | Description |
|---|---|
| `DatasetManifest` | Parsed |