CONLL¶
CoNLL format definitions and readers.
CoNLL
dataclass
¶
Simple CoNLL-style format with form and NER tag columns.
Attributes:
| Name | Type | Description |
|---|---|---|
| form | `str` | The token text. |
| ner | `BIOField` | The NER tag in BIO format. |
Source code in meld/conll.py
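As an illustration of the two-column layout, here is a minimal standard-library sketch. `TokenRow` and `parse_row` are hypothetical names used only for this example and are not part of `meld`; the real `CoNLL` dataclass validates `ner` as a `BIOField`, whereas this sketch keeps it as a plain string.

```python
from dataclasses import dataclass


@dataclass
class TokenRow:
    """Hypothetical stand-in for a two-column row: token text plus NER tag."""
    form: str
    ner: str


def parse_row(line: str, delimiter: str = "\t") -> TokenRow:
    """Split one delimited line into its form and NER columns."""
    fields = line.rstrip("\n").split(delimiter)
    if len(fields) != 2:
        raise ValueError(f"expected 2 columns, got {len(fields)}: {line!r}")
    return TokenRow(form=fields[0], ner=fields[1])
```

For example, `parse_row("Berlin\tB-LOC")` gives a row whose `form` is `"Berlin"` and whose `ner` is `"B-LOC"`.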
CoNLL2003
dataclass
¶
Standard CoNLL2003 format with -DOCSTART- handling.
Attributes:
| Name | Type | Description |
|---|---|---|
| docstart | `str` | Class variable indicating the document start marker. |
| ignore_docstart | `bool` | Class variable indicating whether to ignore document start markers. |
| form | `str` | The token form/word. |
| pos | `MaybeHyphen` | Part-of-speech tag. |
| syntactic_chunk | `Annotated[BIO \| None, BeforeValidator(from_optional_string), maybe_empty('-'), PlainSerializer(str, when_used='unless-none')]` | Syntactic chunk tag in BIO format. |
| ner | `BIOField` | The Named Entity Recognition tag in BIO format. |
Source code in meld/conll.py
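The role of the `-DOCSTART-` marker can be sketched without the library. The function below is a hypothetical illustration, not the `meld` implementation: it assumes space-separated columns, treats any line starting with `-DOCSTART-` as a document boundary, and treats blank lines as sentence boundaries.

```python
from typing import Iterable, Iterator

DOCSTART = "-DOCSTART-"  # document start marker found in CoNLL-2003 files


def split_documents(lines: Iterable[str]) -> Iterator[list[list[list[str]]]]:
    """Yield documents as lists of sentences; each sentence is a list of rows."""
    document: list[list[list[str]]] = []
    sentence: list[list[str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(DOCSTART):
            # A marker closes the open sentence and the open document.
            if sentence:
                document.append(sentence)
                sentence = []
            if document:
                yield document
                document = []
        elif not line.strip():
            # A blank line closes the open sentence.
            if sentence:
                document.append(sentence)
                sentence = []
        else:
            sentence.append(line.split())
    if sentence:
        document.append(sentence)
    if document:
        yield document
```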
CoNLL2003IgnoreDocstart
dataclass
¶
Bases: CoNLL2003
CoNLL2003 format that ignores -DOCSTART- markers.
Attributes:
| Name | Type | Description |
|---|---|---|
| ignore_docstart | `bool` | Class variable indicating whether to ignore document start markers. |
Source code in meld/conll.py
CoNLL2003Pioner
dataclass
¶
CoNLL2003-style format with underscore placeholders and no -DOCSTART- for the pioNER dataset.
Attributes:
| Name | Type | Description |
|---|---|---|
| form | `str` | The token text. |
| pos | `MaybeUnderscore` | Part-of-speech tag. |
| syntactic_chunk | `Annotated[BIO \| None, BeforeValidator(from_optional_string), maybe_empty(), PlainSerializer(str, when_used='unless-none')]` | Syntactic chunk tag in BIO format. |
| ner | `BIOField` | The NER tag in BIO format. |
Source code in meld/conll.py
CoNLL2003TwoColumn
dataclass
¶
Two-column CoNLL2003-style format with only word form and NER columns.
Attributes:
| Name | Type | Description |
|---|---|---|
| docstart | `str` | Class variable indicating the document start marker. |
| ignore_docstart | `bool` | Class variable indicating whether to ignore document start markers. |
| form | `str` | The token text. |
| ner | `BIOField` | The NER tag in BIO format. |
Source code in meld/conll.py
CoNLLBioFirst
dataclass
¶
Simple CoNLL-style format in which NER tags are stored in the first column.
Attributes:
| Name | Type | Description |
|---|---|---|
| ner | `BIOField` | The NER tag in BIO format. |
| form | `str` | The token text. |
Source code in meld/conll.py
CoNLLColumnarRegistry
¶
CoNLLDialectRegistry
¶
CoNLLHerodotos
dataclass
¶
Two-column CoNLL-style format used by the Herodotos-Project-NER dataset. It is similar to CoNLLBioFirst but with inverted IOB tags (e.g., PERS-B instead of B-PERS) and "0" instead of "O".
Attributes:
| Name | Type | Description |
|---|---|---|
| ner | `Annotated[BIO, BeforeValidator(_herodotos_inverted_bio_from_string), PlainSerializer(str)]` | The Named Entity Recognition tag in inverted BIO format. |
| form | `str` | The token form/word. |
Source code in meld/conll.py
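The tag inversion described above can be sketched as follows. `invert_herodotos_tag` is a hypothetical helper written for this example; how the library's private `_herodotos_inverted_bio_from_string` validator behaves is not documented here, so the error handling is an assumption.

```python
def invert_herodotos_tag(tag: str) -> str:
    """Convert an inverted tag such as 'PERS-B' to 'B-PERS'; map '0' to 'O'."""
    if tag in ("0", "O"):
        return "O"
    # Split on the last hyphen: the label comes first, the B/I prefix last.
    label, sep, prefix = tag.rpartition("-")
    if not sep or not label or prefix not in ("B", "I"):
        raise ValueError(f"not an inverted BIO tag: {tag!r}")
    return f"{prefix}-{label}"
```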
CoNLLJNLPBA
dataclass
¶
CoNLL format for JNLPBA dataset with MEDLINE document markers.
Attributes:
| Name | Type | Description |
|---|---|---|
| docstart | `str` | Class variable indicating the document start marker. |
| ignore_docstart | `bool` | Class variable indicating whether to ignore document start markers. |
| form | `str` | The token text. |
| ner | `BIOField` | The NER tag in BIO format. |
Source code in meld/conll.py
CoNLLMetadata
dataclass
¶
Stores CoNLL format metadata, handling both regular comments and metadata encoded as key-value pairs.
Attributes:
| Name | Type | Description |
|---|---|---|
| meta | `dict[str, str]` | Dictionary of key-value metadata. |
| comments | `list[str]` | List of other comment strings. |
Source code in meld/conll.py
__contains__(metadata_key)
¶
Check if a metadata key exists.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metadata_key | `str` | The key to check. | required |
Returns:
| Type | Description |
|---|---|
| `bool` | True if the key exists, False otherwise. |
__getitem__(metadata_key)
¶
Retrieve metadata by key.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metadata_key | `str` | The key to look up. | required |
Returns:
| Type | Description |
|---|---|
| `str` | The metadata value. |
Raises:
| Type | Description |
|---|---|
| `KeyError` | If the key is not found. |
Source code in meld/conll.py
get_meta(metadata_key)
¶
Retrieve metadata by key.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metadata_key | `str` | The key to look up. | required |
Returns:
| Type | Description |
|---|---|
| `str \| None` | The metadata value, or None if not found. |
with_key_value(comments)
classmethod
¶
Create CoNLLMetadata from an iterable of plain comments and parsed key-value pairs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| comments | `Iterable[tuple[str, str] \| str]` | Iterable of tuples (key, value) or plain comment strings. | required |
Returns:
| Type | Description |
|---|---|
| `Self` | A new CoNLLMetadata instance. |
Source code in meld/conll.py
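To make the container's documented behaviour concrete, here is a minimal sketch that mirrors the interface described above. `MetadataSketch` is a hypothetical stand-in, not the library class; only the method names and the documented return and error behaviour are taken from this page.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional, Tuple, Union


@dataclass
class MetadataSketch:
    """Hypothetical stand-in for a metadata container with two buckets."""
    meta: dict = field(default_factory=dict)      # key-value metadata
    comments: list = field(default_factory=list)  # plain comment strings

    def __contains__(self, metadata_key: str) -> bool:
        return metadata_key in self.meta

    def __getitem__(self, metadata_key: str) -> str:
        return self.meta[metadata_key]  # raises KeyError when missing

    def get_meta(self, metadata_key: str) -> Optional[str]:
        return self.meta.get(metadata_key)  # None when missing

    @classmethod
    def with_key_value(
        cls, comments: Iterable[Union[Tuple[str, str], str]]
    ) -> "MetadataSketch":
        instance = cls()
        for item in comments:
            if isinstance(item, tuple):
                key, value = item
                instance.meta[key] = value
            else:
                instance.comments.append(item)
        return instance
```

The split between `__getitem__` and `get_meta` follows the usual mapping convention: indexing fails loudly, while the getter returns `None` for a missing key.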
CoNLLStackOverflowNER
dataclass
¶
CoNLL format for the StackOverflow-NER dataset.
Attributes:
| Name | Type | Description |
|---|---|---|
| docstart | `str` | Class variable indicating the document start marker. |
| ignore_docstart | `bool` | Class variable indicating whether to ignore document start markers. |
| form | `str` | The token text. |
| ner | `BIOField` | The NER tag in BIO format. |
| form_2 | `str` | Secondary form field. |
| markdown | `BIOField` | Markdown syntax metadata in BIO format. |
Source code in meld/conll.py
CoNLLU
dataclass
¶
CoNLL-U format with 10 standard columns. For reference, see: https://universaldependencies.org/format.html
Attributes:
| Name | Type | Description |
|---|---|---|
| id | `Annotated[int \| Decimal \| range, PlainValidator(conllu_id)]` | Token ID (int, Decimal for empty nodes, or range for multi-word tokens). |
| form | `str` | Token text. |
| lemma | `str` | Lemma or stem. |
| upos | `MaybeUnderscore` | Universal part-of-speech tag. |
| xpos | `MaybeUnderscore` | Language-specific part-of-speech tag. |
| feats | `Annotated[dict[str, str] \| None, key_value_list(), maybe_empty()]` | Dictionary of morphological features. |
| head | `Annotated[int \| None, maybe_empty()]` | Head token ID. |
| deprel | `MaybeUnderscore` | Dependency relation. |
| deps | `Annotated[dict[int, str] \| None, key_value_list(key_value_sep=':'), maybe_empty()]` | Enhanced dependency graph in the form of a list of head-deprel pairs. |
| misc | `Annotated[dict[str, str] \| None, key_value_list(), maybe_empty()]` | Dictionary of other miscellaneous information. |
Source code in meld/conll.py
is_document_start(metadata)
staticmethod
¶
Check if the metadata indicates the start of a new document.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metadata | `CoNLLMetadata` | The CoNLLMetadata to check. | required |
Returns:
| Type | Description |
|---|---|
| `bool` | True if the metadata indicates the start of a new document. |
Source code in meld/conll.py
parse_comments(comments)
staticmethod
¶
Parse CoNLL-U style comments into CoNLLMetadata.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| comments | `list[str]` | List of comment strings starting with "#". | required |
Returns:
| Type | Description |
|---|---|
| `CoNLLMetadata` | Parsed CoNLLMetadata. |
Source code in meld/conll.py
CoNLLUPlus
dataclass
¶
CoNLL-U Plus format with NER tags used by the NYTK-NerKor dataset. For reference, see: https://universaldependencies.org/ext-format.html
Attributes:
| Name | Type | Description |
|---|---|---|
| form | `str` | The token text. |
| lemma | `str` | Lemma or stem. |
| upos | `MaybeUnderscore` | Universal part-of-speech tag. |
| xpos | `MaybeUnderscore` | Language-specific part-of-speech tag. |
| feats | `Annotated[dict[str, str] \| None, key_value_list(), maybe_empty()]` | Dictionary of morphological features. |
| ner | `BIOField` | The NER tag in BIO format. |
| emmorph_lemma | `str` | Lemma derived from the Hungarian emMorph morphological analyzer. |
Source code in meld/conll.py
is_document_start(_)
staticmethod
¶
Check if the metadata indicates the start of a new document.
Returns:
| Type | Description |
|---|---|
| `bool` | Always returns True for CoNLLUPlus. |
parse_comments(comments)
staticmethod
¶
Parse CoNLL-U style comments into CoNLLMetadata.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| comments | `list[str]` | List of comment strings starting with "#". | required |
Returns:
| Type | Description |
|---|---|
| `CoNLLMetadata` | Parsed CoNLLMetadata. |
Source code in meld/conll.py
CoNLLWeiboNER
dataclass
¶
CoNLL-style format from the WeiboNER dataset in which word position indices are stored as part of the word form column. Word position indices are removed during deserialization.
Attributes:
| Name | Type | Description |
|---|---|---|
| form | `Annotated[str, BeforeValidator(_strip_word_position)]` | The token form/word with word position indices removed. |
| ner | `BIOField` | The NER tag in BIO format. |
Source code in meld/conll.py
CoNLLWithPOS
dataclass
¶
CoNLL-style format with POS and NER tags.
Attributes:
| Name | Type | Description |
|---|---|---|
| form | `str` | The token text. |
| pos | `str` | The part-of-speech tag. |
| ner | `BIOField` | The NER tag in BIO format. |
Source code in meld/conll.py
ColumnarCoNLL
¶
Bases: Protocol
Protocol for CoNLL-style formats that support column-based segmentation.
Source code in meld/conll.py
segment_columns(rows)
staticmethod
¶
Segment sentence-indexed rows into sentence-level groups.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| rows | `Iterable[dict[str, Any]]` | Iterable of indexed rows. | required |
Returns:
| Type | Description |
|---|---|
| `Iterable[list[dict[str, Any]]]` | Iterable of sentence-level row groups. |
Source code in meld/conll.py
CommentedCoNLL
¶
Bases: Protocol
Protocol for CoNLL-style formats that support comment-based metadata.
Source code in meld/conll.py
is_document_start(metadata)
staticmethod
¶
Check if metadata indicates the start of a new document.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metadata | `CoNLLMetadata` | The CoNLLMetadata to check. | required |
Returns:
| Type | Description |
|---|---|
| `bool` | True at the start of a new document. |
Source code in meld/conll.py
parse_comments(comments)
staticmethod
¶
Parse CoNLL metadata from comments.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| comments | `list[str]` | List of comment strings. | required |
Returns:
| Type | Description |
|---|---|
| `CoNLLMetadata` | Parsed CoNLLMetadata. |
DocumentSeparatedCoNLL
¶
Bases: Protocol
Protocol for CoNLL-style formats separated by document markers.
Attributes:
| Name | Type | Description |
|---|---|---|
| docstart | `str` | Class variable indicating the document start marker. |
| ignore_docstart | `bool` | Class variable indicating whether to ignore document start markers. |
Source code in meld/conll.py
FlatIndices
dataclass
¶
CoNLL-style columnar format for data without blank lines, indicating documents and sentences via index columns.
Attributes:
| Name | Type | Description |
|---|---|---|
| ner | `int` | NER label. |
| form | `str` | The token text. |
| doc_idx | `int` | Document index. |
| sent_idx | `int` | Sentence index. |
Source code in meld/conll.py
segment_columns(rows)
staticmethod
¶
Segment flat index rows into sentence-level groups.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| rows | `Iterable[dict[str, Any]]` | Iterable of flat index rows. | required |
Returns:
| Type | Description |
|---|---|
| `Iterable[list[dict[str, Any]]]` | Iterable of sentence-level row groups. |
Source code in meld/conll.py
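One way to picture index-based segmentation is with `itertools.groupby`. The sketch below is an assumption about the general technique, not the library's code: `group_by_sentence` is a hypothetical name, and it assumes rows arrive already ordered so that each sentence's rows are consecutive.

```python
from itertools import groupby
from typing import Any, Iterable, Iterator


def group_by_sentence(
    rows: Iterable[dict[str, Any]],
) -> Iterator[list[dict[str, Any]]]:
    """Group consecutive rows that share (doc_idx, sent_idx) into one sentence."""
    def sentence_key(row: dict[str, Any]) -> tuple[int, int]:
        return (row["doc_idx"], row["sent_idx"])

    for _, group in groupby(rows, key=sentence_key):
        yield list(group)
```

Because `groupby` only merges adjacent items, unsorted input would split one sentence into several groups; sorting by the same key first avoids that.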
NerSuiteCoNLL
dataclass
¶
NERSuite CoNLL-style format with character indices mapping tokens to the source text. For reference, see: https://nersuite.nlplab.org/advanced_usage.html
Attributes:
| Name | Type | Description |
|---|---|---|
| ner | `BIOField` | The NER tag in BIO format. |
| start | `int` | Start character offset of the token in the source text. |
| end | `int` | End character offset of the token in the source text. |
| form | `str` | The token text. |
| lemma | `str` | Lemma or stem. |
| pos | `MaybeUnderscore` | Part-of-speech tag. |
| chunk | `BIOField` | Chunk tag in BIO format. |
Source code in meld/conll.py
RowParser
¶
Parser for CoNLL-style rows supporting arbitrary CoNLL dialects.
Type Parameters:
T The CoNLL dialect class used for parsing rows.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dialect | `type[T]` | The CoNLL dialect class to parse rows with. | required |
Source code in meld/conll.py
validate_row(row)
¶
Validate and parse a row using the parser's CoNLL dialect.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| row | `list[str]` | List of fields in the row. | required |
Returns:
| Type | Description |
|---|---|
| `T` | Parsed row using the parser's CoNLL dialect. |
Source code in meld/conll.py
Sentence
dataclass
¶
A parsed CoNLL-style sentence with optional metadata.
Type Parameters:
T The type of rows in the sentence.
Attributes:
| Name | Type | Description |
|---|---|---|
| rows | `list[T]` | List of parsed CoNLL-style rows. |
| meta | `CoNLLMetadata \| None` | Optional CoNLLMetadata for the sentence. |
Source code in meld/conll.py
UNERCoNLLU
dataclass
¶
Universal NER CoNLL-U-style format with UNER tags and (optionally) original NER tags.
Attributes:
| Name | Type | Description |
|---|---|---|
| id | `int \| Decimal` | Token ID. |
| form | `str` | The token text. |
| ner | `BIOField` | The NER tag in BIO format. |
| original_ner | `Annotated[list[BIO] \| None, BeforeValidator(_maybe_multi_label), maybe_empty('-')]` | Original NER tags for converted datasets. |
| annotator | `MaybeHyphen` | Optional name of the annotator. |
Source code in meld/conll.py
is_document_start(metadata)
staticmethod
¶
Check if the metadata indicates the start of a new document.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metadata | `CoNLLMetadata` | The CoNLLMetadata to check. | required |
Returns:
| Type | Description |
|---|---|
| `bool` | True if the metadata indicates the start of a new document. |
Source code in meld/conll.py
parse_comments(comments)
staticmethod
¶
Parse CoNLL-U style comments into CoNLLMetadata.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| comments | `list[str]` | List of comment strings starting with "#". | required |
Returns:
| Type | Description |
|---|---|
| `CoNLLMetadata` | Parsed CoNLLMetadata. |
Source code in meld/conll.py
UNERCoNLLUNewPar
dataclass
¶
Bases: UNERCoNLLU
Variant of the Universal NER CoNLL-U-style format that uses newpar comments as the document separator.
Attributes:
| Name | Type | Description |
|---|---|---|
| id | `int \| Decimal` | Token ID. |
| form | `str` | The token text. |
| ner | `BIOField` | The NER tag in BIO format. |
| original_ner | `Annotated[list[BIO] \| None, BeforeValidator(_maybe_multi_label), maybe_empty('-')]` | Original NER tags for converted datasets. |
| annotator | `MaybeHyphen` | Optional name of the annotator. |
Source code in meld/conll.py
is_document_start(metadata)
staticmethod
¶
Check if the metadata indicates the start of a new document.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metadata | `CoNLLMetadata` | The CoNLLMetadata to check. | required |
Returns:
| Type | Description |
|---|---|
| `bool` | True if the metadata indicates the start of a new document. |
Source code in meld/conll.py
conllu_id(token_id)
¶
Parses CoNLL-U token IDs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| token_id | `str \| float \| Decimal` | Token ID string representation (may contain hyphen for multi-word tokens or decimal for empty nodes). | required |
Returns:
| Type | Description |
|---|---|
| `int \| Decimal \| range` | An integer for simple IDs, Decimal for empty nodes, or range for multi-word tokens. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | When a decimal ID is less than or equal to 0. |
Source code in meld/conll.py
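The three ID shapes can be illustrated with a small standard-library sketch. `parse_token_id` is a hypothetical function, not the library's `conllu_id`: it accepts strings only, and the choice to make the returned `range` include the last token of a multi-word span is an assumption of this example.

```python
from decimal import Decimal
from typing import Union


def parse_token_id(token_id: str) -> Union[int, Decimal, range]:
    """Parse '3' to int, '8.1' to Decimal, and '1-2' to a range of token IDs."""
    if "-" in token_id:
        # Multi-word token span such as "1-2" (assumed inclusive of the end).
        start, end = token_id.split("-", 1)
        return range(int(start), int(end) + 1)
    if "." in token_id:
        # Empty node such as "8.1".
        value = Decimal(token_id)
        if value <= 0:
            raise ValueError(f"decimal ID must be greater than 0: {token_id!r}")
        return value
    return int(token_id)
```

Using `Decimal` rather than `float` keeps an ID like `8.1` exact, so it compares equal to the same ID parsed elsewhere.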
key_value_list(list_sep='|', key_value_sep='=', allow_empty=True)
¶
Create a Pydantic BeforeValidator for parsing key-value list strings into dictionaries.
The input string is parsed as a list of key-value pairs, where each pair is separated
by list_sep and the key and value within each pair are separated by key_value_sep.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| list_sep | `str` | Separator between key-value pairs. | `'\|'` |
| key_value_sep | `str` | Separator between keys and values. | `'='` |
| allow_empty | `bool` | Whether an empty list input should be allowed or raise a | `True` |
Returns:
| Type | Description |
|---|---|
| `BeforeValidator` | A Pydantic BeforeValidator that parses key-value list strings. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If |
| `ValueError` | In the returned validator if the input is |
Source code in meld/conll.py
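The string-to-dictionary step described above can be sketched in plain Python. `parse_key_value_list` is a hypothetical function that shows only the splitting logic; it does not build a Pydantic validator, leaves keys and values as strings, and its handling of malformed pairs is an assumption.

```python
def parse_key_value_list(
    text: str, list_sep: str = "|", key_value_sep: str = "="
) -> dict[str, str]:
    """Parse a string such as 'Case=Nom|Number=Sing' into a dictionary."""
    result: dict[str, str] = {}
    for pair in text.split(list_sep):
        # Split on the first separator only, so values may contain it.
        key, sep, value = pair.partition(key_value_sep)
        if not sep or not key:
            raise ValueError(f"not a key-value pair: {pair!r}")
        result[key] = value
    return result
```

With `key_value_sep=":"`, the same function handles a CoNLL-U `deps` value such as `2:nsubj|4:nsubj`.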
maybe_empty(empty_value='_')
¶
Create a Pydantic BeforeValidator that converts strings containing a CoNLL-style empty value to None.
The validator raises a ValueError if the input is the empty string, unless empty_value is itself set to the empty string.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| empty_value | `str` | The string value representing empty fields. | `'_'` |
Returns:
| Type | Description |
|---|---|
| `BeforeValidator` | A Pydantic BeforeValidator that transforms empty values to None. |
Source code in meld/conll.py
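The conversion rule stated above can be written as a plain function. `empty_to_none` is a hypothetical name for this sketch; it reproduces only the documented behaviour and is not wrapped in a Pydantic `BeforeValidator`.

```python
from typing import Optional


def empty_to_none(value: str, empty_value: str = "_") -> Optional[str]:
    """Return None for the placeholder value and the input unchanged otherwise."""
    if value == "" and empty_value != "":
        # A truly empty field is an error unless "" is the chosen placeholder.
        raise ValueError("unexpected empty string")
    return None if value == empty_value else value
```

For instance, `empty_to_none("_")` returns `None`, while `empty_to_none("-", empty_value="-")` covers dialects that use a hyphen as the placeholder.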
parse(lines, dialect=CoNLL, delimiter='\t', enforce_blank_lines=True, use_comment_document_boundary=True)
¶
Parse CoNLL-style lines into documents with sentences.
Type Parameters:
T The CoNLL dialect class used for parsing rows.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| lines | `Iterable[str]` | Iterable of line strings. | required |
| dialect | `type[T]` | The CoNLL dialect class to use for parsing. | `CoNLL` |
| delimiter | `str` | Field delimiter separating the CoNLL-style columns. | `'\t'` |
| enforce_blank_lines | `bool` | Whether to enforce that blank lines between segments are empty. Otherwise, lines that contain only whitespace will be treated as blank. | `True` |
| use_comment_document_boundary | `bool` | Whether to parse document boundaries from CoNLL-U-style comments. | `True` |
Returns:
| Type | Description |
|---|---|
| `Iterator[list[Sentence[T]]]` | Iterator over documents (lists of sentences). |
Raises:
| Type | Description |
|---|---|
| `NotImplementedError` | When combining -DOCSTART- with CoNLL-U style comments. |
Source code in meld/conll.py
parse_columns(rows, dialect)
¶
Parse columnar rows into sentences using the specified CoNLL dialect.
Type Parameters:
C The CoNLL dialect class used for parsing rows.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| rows | `Iterable[dict[str, Any]]` | Iterable of row dictionaries. | required |
| dialect | `type[C]` | The CoNLL dialect class to use for parsing. | required |
Yields: Parsed sentences.
Source code in meld/conll.py
parse_key_value(comment)
¶
Parses a CoNLL-U comment line into a key-value pair if possible and returns the input string otherwise.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| comment | `str` | Comment line from CoNLL-U format. | required |
Returns:
| Type | Description |
|---|---|
| `tuple[str, str] \| str` | (key, value) tuple for key=value format, or a plain string otherwise. |
Source code in meld/conll.py
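A rough sketch of this behaviour follows. `split_comment` is a hypothetical name, and the details are assumptions: it strips the leading `#` and surrounding whitespace before looking for `=`, and it returns the original string untouched when no key-value pair is found.

```python
from typing import Tuple, Union


def split_comment(comment: str) -> Union[Tuple[str, str], str]:
    """Return (key, value) for '# key = value' comments, else the input string."""
    body = comment.lstrip("#").strip()
    # Split on the first "=" only, so values may themselves contain "=".
    key, sep, value = body.partition("=")
    if sep and key.strip():
        return key.strip(), value.strip()
    return comment
```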
space_separated_segments(lines, enforce_blank_lines=True)
¶
Split lines into segments separated by blank lines.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| lines | `Iterable[str]` | Iterable of line strings. | required |
| enforce_blank_lines | `bool` | Whether to enforce blank lines as segment separators. | `True` |
Yields: Segments as lists of lines.
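Blank-line segmentation can be sketched with a small generator. `blank_line_segments` is a hypothetical function, not the library's implementation. What happens to a whitespace-only line when `enforce_blank_lines` is true is not stated on this page, so raising a `ValueError` in that case is an assumption of this example.

```python
from typing import Iterable, Iterator


def blank_line_segments(
    lines: Iterable[str], enforce_blank_lines: bool = True
) -> Iterator[list[str]]:
    """Yield runs of non-blank lines separated by blank lines."""
    segment: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "" or (not enforce_blank_lines and not line.strip()):
            # Separator: emit the segment collected so far, if any.
            if segment:
                yield segment
                segment = []
        elif not line.strip():
            # Assumed behaviour: reject separators that contain whitespace.
            raise ValueError("separator line contains whitespace")
        else:
            segment.append(line)
    if segment:
        yield segment
```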