CONLL

CoNLL format definitions and readers.

CoNLL dataclass

Simple CoNLL-style format with form and NER tag columns.

Attributes:

Name Type Description
form str

The token text.

ner BIOField

The NER tag in BIO format.

Source code in meld/conll.py
@CoNLLDialectRegistry.register("conll")
@dataclass(slots=True)
class CoNLL:
    """
    Simple CoNLL-style format with form and NER tag columns.

    Attributes:
        form: The token text.
        ner: The NER tag in BIO format.
    """

    form: str
    ner: BIOField
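
A minimal sketch of the two-column layout this dialect describes, assuming whitespace-separated columns and blank lines between sentences. `Row` and `read_two_column` are hypothetical stand-ins and not part of `meld`; the real class validates `ner` as a `BIOField` instead of keeping a plain string.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """Hypothetical stand-in for the dialect: token text, then NER tag."""

    form: str
    ner: str  # plain string here; the documented field is a BIOField


def read_two_column(text: str) -> list[list[Row]]:
    """Group rows into sentences, treating blank lines as sentence breaks."""
    sentences: list[list[Row]] = []
    current: list[Row] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the sentence collected so far.
            if current:
                sentences.append(current)
                current = []
            continue
        form, ner = line.split()
        current.append(Row(form, ner))
    if current:
        sentences.append(current)
    return sentences


sample = "Ada B-PER\nLovelace I-PER\n\nLondon B-LOC\n"
sentences = read_two_column(sample)
```

The column order follows the field order of the dataclass, which is why `CoNLLBioFirst` further down only needs to swap the two fields.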

CoNLL2003 dataclass

Standard CoNLL2003 format with -DOCSTART- handling.

Attributes:

Name Type Description
docstart str

Class variable indicating the document start marker.

ignore_docstart bool

Class variable indicating whether to ignore document start markers.

form str

The token form/word.

pos MaybeHyphen

Part-of-speech tag.

syntactic_chunk Annotated[BIO | None, BeforeValidator(from_optional_string), maybe_empty('-'), PlainSerializer(str, when_used='unless-none')]

Syntactic chunk tag in BIO format.

ner BIOField

The Named Entity Recognition tag in BIO format.

Source code in meld/conll.py
@CoNLLDialectRegistry.register("conll2003")
@dataclass(slots=True)
class CoNLL2003:
    """
    Standard CoNLL2003 format with -DOCSTART- handling.

    Attributes:
        docstart: Class variable indicating the document start marker.
        ignore_docstart: Class variable indicating whether to ignore
            document start markers.
        form: The token form/word.
        pos: Part-of-speech tag.
        syntactic_chunk: Syntactic chunk tag in BIO format.
        ner: The Named Entity Recognition tag in BIO format.
    """

    docstart: ClassVar[str] = "-DOCSTART-"
    ignore_docstart: ClassVar[bool] = False

    form: str
    pos: MaybeHyphen
    syntactic_chunk: Annotated[
        BIO | None,
        BeforeValidator(BIO.from_optional_string),
        maybe_empty("-"),
        PlainSerializer(str, when_used="unless-none"),
    ]
    ner: BIOField

CoNLL2003IgnoreDocstart dataclass

Bases: CoNLL2003

CoNLL2003 format that ignores -DOCSTART- markers.

Attributes:

Name Type Description
ignore_docstart bool

Class variable indicating whether to ignore document start markers.

Source code in meld/conll.py
@CoNLLDialectRegistry.register("conll2003_ignore_docstart")
@dataclass(slots=True)
class CoNLL2003IgnoreDocstart(CoNLL2003):
    """
    CoNLL2003 format that ignores -DOCSTART- markers.

    Attributes:
        ignore_docstart: Class variable indicating whether to ignore
            document start markers.
    """

    ignore_docstart: ClassVar[bool] = True
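
The reader logic that consumes `docstart` and `ignore_docstart` is not shown in this section. The sketch below is one plausible reading of the two class variables, assuming a marker line opens a new document unless `ignore_docstart` is set, in which case the marker is dropped. `split_documents` is a hypothetical helper, not a `meld` function.

```python
def split_documents(lines, docstart="-DOCSTART-", ignore_docstart=False):
    """Split token lines into documents at marker lines (assumed semantics)."""
    documents = [[]]
    for line in lines:
        if line.startswith(docstart):
            # Marker line: drop it, and open a new document unless ignored.
            if not ignore_docstart and documents[-1]:
                documents.append([])
            continue
        if line.strip():
            documents[-1].append(line)
    # Discard a leading or trailing document that never received a token.
    return [doc for doc in documents if doc]


sample_lines = [
    "-DOCSTART- -X- -X- O",
    "",
    "EU NNP B-NP B-ORG",
    "",
    "-DOCSTART- -X- -X- O",
    "",
    "Peter NNP B-NP B-PER",
]
separate = split_documents(sample_lines)
merged = split_documents(sample_lines, ignore_docstart=True)
```

Under the same assumption, passing `docstart="###MEDLINE:"` would cover the marker used by `CoNLLJNLPBA`, because the check is a prefix match.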

CoNLL2003Pioner dataclass

CoNLL2003-style format with underscore placeholders and no -DOCSTART- for the pioNER dataset.

Attributes:

Name Type Description
form str

The token text.

pos MaybeUnderscore

Part-of-speech tag.

syntactic_chunk Annotated[BIO | None, BeforeValidator(from_optional_string), maybe_empty(), PlainSerializer(str, when_used='unless-none')]

Syntactic chunk tag in BIO format.

ner BIOField

The NER tag in BIO format.

Source code in meld/conll.py
@CoNLLDialectRegistry.register("conll2003_pioner")
@dataclass(slots=True)
class CoNLL2003Pioner:
    """
    CoNLL2003-style format with underscore placeholders and no -DOCSTART- for the pioNER dataset.

    Attributes:
        form: The token text.
        pos: Part-of-speech tag.
        syntactic_chunk: Syntactic chunk tag in BIO format.
        ner: The NER tag in BIO format.
    """

    form: str
    pos: MaybeUnderscore
    syntactic_chunk: Annotated[
        BIO | None,
        BeforeValidator(BIO.from_optional_string),
        maybe_empty(),
        PlainSerializer(str, when_used="unless-none"),
    ]
    ner: BIOField

CoNLL2003TwoColumn dataclass

Two-column CoNLL2003-style format with only word form and NER columns.

Attributes:

Name Type Description
docstart str

Class variable indicating the document start marker.

ignore_docstart bool

Class variable indicating whether to ignore document start markers.

form str

The token text.

ner BIOField

The NER tag in BIO format.

Source code in meld/conll.py
@CoNLLDialectRegistry.register("conll2003_two_column")
@dataclass(slots=True)
class CoNLL2003TwoColumn:
    """
    Two-column CoNLL2003-style format with only word form and NER columns.

    Attributes:
        docstart: Class variable indicating the document start marker.
        ignore_docstart: Class variable indicating whether to ignore
            document start markers.
        form: The token text.
        ner: The NER tag in BIO format.
    """

    docstart: ClassVar[str] = "-DOCSTART-"
    ignore_docstart: ClassVar[bool] = False

    form: str
    ner: BIOField

CoNLLBioFirst dataclass

Simple CoNLL-style format in which NER tags are stored in the first column.

Attributes:

Name Type Description
ner BIOField

The NER tag in BIO format.

form str

The token text.

Source code in meld/conll.py
@CoNLLDialectRegistry.register("conll_bio_first")
@dataclass(slots=True)
class CoNLLBioFirst:
    """
    Simple CoNLL-style format in which NER tags are stored in the first column.

    Attributes:
        ner: The NER tag in BIO format.
        form: The token text.
    """

    ner: BIOField
    form: str

CoNLLColumnarRegistry

Bases: Registry[T]

Registry for columnar CoNLL dialects.

Source code in meld/conll.py
class CoNLLColumnarRegistry[T: ColumnarCoNLL](Registry[T]):
    """Registry for columnar CoNLL dialects."""

CoNLLDialectRegistry

Bases: Registry[T]

Registry for CoNLL dialects.

Source code in meld/conll.py
class CoNLLDialectRegistry[T: CommonRowType](Registry[T]):
    """Registry for CoNLL dialects."""

CoNLLHerodotos dataclass

Two-column CoNLL-style format used by the Herodotos-Project-NER dataset. It is similar to CoNLLBioFirst but with inverted IOB tags (e.g., PERS-B instead of B-PERS) and "0" instead of "O".

Attributes:

Name Type Description
ner Annotated[BIO, BeforeValidator(_herodotos_inverted_bio_from_string), PlainSerializer(str)]

The Named Entity Recognition tag in inverted BIO format.

form str

The token form/word.

Source code in meld/conll.py
@CoNLLDialectRegistry.register("conll_herodotos")
@dataclass(slots=True)
class CoNLLHerodotos:
    """
    Two-column CoNLL-style format used by the Herodotos-Project-NER dataset. It is similar to `CoNLLBioFirst` but with inverted IOB tags (e.g., PERS-B instead of B-PERS) and "0" instead of "O".

    Attributes:
        ner: The Named Entity Recognition tag in inverted BIO format.
        form: The token form/word.
    """

    ner: Annotated[BIO, BeforeValidator(_herodotos_inverted_bio_from_string), PlainSerializer(str)]
    form: str
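
The validator `_herodotos_inverted_bio_from_string` is not listed here. As a rough illustration of the conversion the description implies, the sketch below rewrites an inverted tag into the conventional order; `invert_herodotos_tag` is a hypothetical name, and returning a plain string rather than a `BIO` object is a simplification.

```python
def invert_herodotos_tag(tag: str) -> str:
    """Rewrite "PERS-B" as "B-PERS" and map the digit "0" to the letter "O"."""
    if tag in ("0", "O"):
        return "O"
    # Split on the last hyphen so labels that contain hyphens stay intact.
    label, _, prefix = tag.rpartition("-")
    if prefix not in ("B", "I") or not label:
        raise ValueError(f"not an inverted BIO tag: {tag!r}")
    return f"{prefix}-{label}"


converted = [invert_herodotos_tag(t) for t in ["PERS-B", "PERS-I", "0"]]
```

A tag that is already in the conventional order, such as `B-PERS`, raises `ValueError` in this sketch instead of being passed through.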

CoNLLJNLPBA dataclass

CoNLL format for JNLPBA dataset with MEDLINE document markers.

Attributes:

Name Type Description
docstart str

Class variable indicating the document start marker.

ignore_docstart bool

Class variable indicating whether to ignore document start markers.

form str

The token text.

ner BIOField

The NER tag in BIO format.

Source code in meld/conll.py
@CoNLLDialectRegistry.register("conll_jnlpba")
@dataclass(slots=True)
class CoNLLJNLPBA:
    """
    CoNLL format for JNLPBA dataset with MEDLINE document markers.

    Attributes:
        docstart: Class variable indicating the document start marker.
        ignore_docstart: Class variable indicating whether to ignore
            document start markers.
        form: The token text.
        ner: The NER tag in BIO format.
    """

    docstart: ClassVar[str] = "###MEDLINE:"
    ignore_docstart: ClassVar[bool] = False

    form: str
    ner: BIOField

CoNLLMetadata dataclass

Stores CoNLL format metadata, handling both regular comments and metadata encoded as key-value pairs.

Attributes:

Name Type Description
meta dict[str, str]

Dictionary of key-value metadata.

comments list[str]

List of other comment strings.

Source code in meld/conll.py
@dataclass(slots=True)
class CoNLLMetadata:
    """
    Stores CoNLL format metadata, handling both regular comments and metadata encoded as key-value pairs.

    Attributes:
        meta: Dictionary of key-value metadata.
        comments: List of other comment strings.
    """

    meta: dict[str, str] = field(default_factory=dict)
    comments: list[str] = field(default_factory=list)

    def get_meta(self, metadata_key: str) -> str | None:
        """
        Retrieve metadata by key.

        Args:
            metadata_key: The key to look up.

        Returns:
            The metadata value, or None if not found.
        """

        return self.meta.get(metadata_key)

    def __getitem__(self, metadata_key: str) -> str:
        """
        Retrieve metadata by key.

        Args:
            metadata_key: The key to look up.

        Returns:
            The metadata value.

        Raises:
            KeyError: If the key is not found.
        """

        return self.meta[metadata_key]

    def __contains__(self, metadata_key: str) -> bool:
        """
        Check if a metadata key exists.

        Args:
            metadata_key: The key to check.

        Returns:
            `True` if the key exists in the key-value store.
        """

        return metadata_key in self.meta

    @classmethod
    def with_key_value(cls, comments: Iterable[tuple[str, str] | str]) -> Self:
        """
        Create CoNLLMetadata from an iterable of plain comments and parsed key-value pairs.

        Args:
            comments: Iterable of tuples (key, value) or plain comment
                strings.

        Returns:
            A new CoNLLMetadata instance.
        """

        meta = {}
        general = []
        for comment in comments:
            if isinstance(comment, tuple):
                meta[comment[0]] = comment[1]
            else:
                general.append(comment)

        return cls(meta, general)

__contains__(metadata_key)

Check if a metadata key exists.

Parameters:

Name Type Description Default
metadata_key str

The key to check.

required

Returns:

Type Description
bool

True if the key exists in the key-value store.

Source code in meld/conll.py
def __contains__(self, metadata_key: str) -> bool:
    """
    Check if a metadata key exists.

    Args:
        metadata_key: The key to check.

    Returns:
        `True` if the key exists in the key-value store.
    """

    return metadata_key in self.meta

__getitem__(metadata_key)

Retrieve metadata by key.

Parameters:

Name Type Description Default
metadata_key str

The key to look up.

required

Returns:

Type Description
str

The metadata value.

Raises:

Type Description
KeyError

If the key is not found.

Source code in meld/conll.py
def __getitem__(self, metadata_key: str) -> str:
    """
    Retrieve metadata by key.

    Args:
        metadata_key: The key to look up.

    Returns:
        The metadata value.

    Raises:
        KeyError: If the key is not found.
    """

    return self.meta[metadata_key]

get_meta(metadata_key)

Retrieve metadata by key.

Parameters:

Name Type Description Default
metadata_key str

The key to look up.

required

Returns:

Type Description
str | None

The metadata value, or None if not found.

Source code in meld/conll.py
def get_meta(self, metadata_key: str) -> str | None:
    """
    Retrieve metadata by key.

    Args:
        metadata_key: The key to look up.

    Returns:
        The metadata value, or None if not found.
    """

    return self.meta.get(metadata_key)

with_key_value(comments) classmethod

Create CoNLLMetadata from an iterable of plain comments and parsed key-value pairs.

Parameters:

Name Type Description Default
comments Iterable[tuple[str, str] | str]

Iterable of tuples (key, value) or plain comment strings.

required

Returns:

Type Description
Self

A new CoNLLMetadata instance.

Source code in meld/conll.py
@classmethod
def with_key_value(cls, comments: Iterable[tuple[str, str] | str]) -> Self:
    """
    Create CoNLLMetadata from an iterable of plain comments and parsed key-value pairs.

    Args:
        comments: Iterable of tuples (key, value) or plain comment
            strings.

    Returns:
        A new CoNLLMetadata instance.
    """

    meta = {}
    general = []
    for comment in comments:
        if isinstance(comment, tuple):
            meta[comment[0]] = comment[1]
        else:
            general.append(comment)

    return cls(meta, general)
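
To show how the accessors behave together, here is a condensed stand-in that mirrors the `CoNLLMetadata` listing above. The class name `Metadata` is hypothetical and type annotations are trimmed so that the snippet runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class Metadata:
    """Condensed stand-in for CoNLLMetadata, mirroring the listing above."""

    meta: dict = field(default_factory=dict)
    comments: list = field(default_factory=list)

    def get_meta(self, key):
        return self.meta.get(key)

    def __getitem__(self, key):
        return self.meta[key]

    def __contains__(self, key):
        return key in self.meta

    @classmethod
    def with_key_value(cls, comments):
        meta, general = {}, []
        for comment in comments:
            if isinstance(comment, tuple):
                # A repeated key overwrites the earlier value.
                meta[comment[0]] = comment[1]
            else:
                general.append(comment)
        return cls(meta, general)


md = Metadata.with_key_value([("sent_id", "1"), "free-form note", ("sent_id", "2")])
```

Indexing (`md["sent_id"]`) raises `KeyError` for a missing key, while `get_meta` returns `None`, so the latter is the safer choice when a key is optional.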

CoNLLStackOverflowNER dataclass

CoNLL format for the StackOverflow-NER dataset.

Attributes:

Name Type Description
docstart str

Class variable indicating the document start marker.

ignore_docstart bool

Class variable indicating whether to ignore document start markers.

form str

The token text.

ner BIOField

The NER tag in BIO format.

form_2 str

Secondary form field.

markdown BIOField

Markdown syntax metadata in BIO format.

Source code in meld/conll.py
@CoNLLDialectRegistry.register("conll_stackoverflow_ner")
@dataclass(slots=True)
class CoNLLStackOverflowNER:
    """
    CoNLL format for the StackOverflow-NER dataset.

    Attributes:
        docstart: Class variable indicating the document start marker.
        ignore_docstart: Class variable indicating whether to ignore
            document start markers.
        form: The token text.
        ner: The NER tag in BIO format.
        form_2: Secondary form field.
        markdown: Markdown syntax metadata in BIO format.
    """

    docstart: ClassVar[str] = "-DOCSTART-"
    ignore_docstart: ClassVar[bool] = False

    form: str
    ner: BIOField
    form_2: str
    markdown: BIOField

CoNLLU dataclass

CoNLL-U format with 10 standard columns. For reference, see: https://universaldependencies.org/format.html

Attributes:

Name Type Description
id Annotated[int | Decimal | range, PlainValidator(conllu_id)]

Token ID (int, Decimal for empty nodes, or range for multi-word tokens).

form str

Token text.

lemma str

Lemma or stem.

upos MaybeUnderscore

Universal part-of-speech tag.

xpos MaybeUnderscore

Language-specific part-of-speech tag.

feats Annotated[dict[str, str] | None, key_value_list(), maybe_empty()]

Dictionary of morphological features.

head Annotated[int | None, maybe_empty()]

Head token ID.

deprel MaybeUnderscore

Dependency relation.

deps Annotated[dict[int, str] | None, key_value_list(key_value_sep=':'), maybe_empty()]

Enhanced dependency graph in the form of a list of head-deprel pairs.

misc Annotated[dict[str, str] | None, key_value_list(), maybe_empty()]

Dictionary of other miscellaneous information.

Source code in meld/conll.py
@dataclass(slots=True)
class CoNLLU:
    """
    CoNLL-U format with 10 standard columns. For reference, see: https://universaldependencies.org/format.html

    Attributes:
        id: Token ID (int, Decimal for empty nodes, or range for
            multi-word tokens).
        form: Token text.
        lemma: Lemma or stem.
        upos: Universal part-of-speech tag.
        xpos: Language-specific part-of-speech tag.
        feats: Dictionary of morphological features.
        head: Head token ID.
        deprel: Dependency relation.
        deps: Enhanced dependency graph in the form of a list of head-deprel pairs.
        misc: Dictionary of other miscellaneous information.
    """

    id: Annotated[int | Decimal | range, PlainValidator(conllu_id)]
    # Handle literal underscores as just underscores
    # Making this decision is left to the implementation according to https://universaldependencies.org/format.html
    form: str
    lemma: str
    upos: MaybeUnderscore
    xpos: MaybeUnderscore
    feats: Annotated[dict[str, str] | None, key_value_list(), maybe_empty()]
    head: Annotated[int | None, maybe_empty()]
    deprel: MaybeUnderscore
    deps: Annotated[dict[int, str] | None, key_value_list(key_value_sep=":"), maybe_empty()]
    # Handle SpaceAfter=No
    misc: Annotated[dict[str, str] | None, key_value_list(), maybe_empty()]

    @staticmethod
    def parse_comments(comments: list[str]) -> CoNLLMetadata:
        """
        Parse CoNLL-U style comments into CoNLLMetadata.

        Args:
            comments: List of comment strings starting with "#".

        Returns:
            Parsed CoNLLMetadata.
        """

        return CoNLLMetadata.with_key_value(parse_key_value(comment) for comment in comments)

    @staticmethod
    def is_document_start(metadata: CoNLLMetadata) -> bool:
        """
        Check if the metadata indicates the start of a new document.

        Args:
            metadata: The CoNLLMetadata to check.

        Returns:
            `True` if a "newdoc id" is present, indicating the start of
            a new document.
        """

        return metadata.get_meta("newdoc id") is not None
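
The `conllu_id` validator is not shown in this section. The sketch below illustrates the three ID shapes the `id` field accepts, following the CoNLL-U convention of `3` for a word, `8.1` for an empty node, and `1-2` for a multi-word token. `parse_conllu_id` is a hypothetical name, and treating the range as covering both endpoints is an assumption.

```python
from decimal import Decimal


def parse_conllu_id(value: str) -> "int | Decimal | range":
    """Map a CoNLL-U ID column to int, Decimal, or range (illustrative only)."""
    if "-" in value:
        start, end = value.split("-")
        # Assumption: "1-2" spans tokens 1 and 2, so the stop is end + 1.
        return range(int(start), int(end) + 1)
    if "." in value:
        # Empty nodes use decimal IDs such as "8.1".
        return Decimal(value)
    return int(value)


word_id = parse_conllu_id("3")
empty_node_id = parse_conllu_id("8.1")
multiword_id = parse_conllu_id("1-2")
```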

is_document_start(metadata) staticmethod

Check if the metadata indicates the start of a new document.

Parameters:

Name Type Description Default
metadata CoNLLMetadata

The CoNLLMetadata to check.

required

Returns:

Type Description
bool

True if a "newdoc id" is present, indicating the start of a new document.

Source code in meld/conll.py
@staticmethod
def is_document_start(metadata: CoNLLMetadata) -> bool:
    """
    Check if the metadata indicates the start of a new document.

    Args:
        metadata: The CoNLLMetadata to check.

    Returns:
        `True` if a "newdoc id" is present, indicating the start of
        a new document.
    """

    return metadata.get_meta("newdoc id") is not None

parse_comments(comments) staticmethod

Parse CoNLL-U style comments into CoNLLMetadata.

Parameters:

Name Type Description Default
comments list[str]

List of comment strings starting with "#".

required

Returns:

Type Description
CoNLLMetadata

Parsed CoNLLMetadata.

Source code in meld/conll.py
@staticmethod
def parse_comments(comments: list[str]) -> CoNLLMetadata:
    """
    Parse CoNLL-U style comments into CoNLLMetadata.

    Args:
        comments: List of comment strings starting with "#".

    Returns:
        Parsed CoNLLMetadata.
    """

    return CoNLLMetadata.with_key_value(parse_key_value(comment) for comment in comments)
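
`parse_key_value` is referenced above but not listed. As a sketch of what such a helper might do, assuming CoNLL-U comments of the form `# key = value`, the hypothetical `split_comment` below returns a `(key, value)` tuple when an equals sign is present and the bare comment text otherwise, which matches the input shape `with_key_value` expects.

```python
def split_comment(comment: str) -> "tuple[str, str] | str":
    """Split "# key = value" into (key, value); return other comments as text."""
    body = comment.lstrip("#").strip()
    # Split on the first "=" only, so values may themselves contain "=".
    key, sep, value = body.partition("=")
    if not sep:
        return body
    return key.strip(), value.strip()


comments = ["# newdoc id = doc-7", "# sent_id = 1", "# text = a = b", "# newpar"]
parsed = [split_comment(c) for c in comments]

# Mirror is_document_start: a document starts when "newdoc id" is a known key.
meta = dict(item for item in parsed if isinstance(item, tuple))
starts_document = meta.get("newdoc id") is not None
```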

CoNLLUPlus dataclass

CoNLL-U Plus format with NER tags used by the NYTK-NerKor dataset. For reference, see: https://universaldependencies.org/ext-format.html

Attributes:

Name Type Description
form str

The token text.

lemma str

Lemma or stem.

upos MaybeUnderscore

Universal part-of-speech tag.

xpos MaybeUnderscore

Language-specific part-of-speech tag.

feats Annotated[dict[str, str] | None, key_value_list(), maybe_empty()]

Dictionary of morphological features.

ner BIOField

The NER tag in BIO format.

emmorph_lemma str

Lemma derived from the Hungarian emMorph morphological analyzer.

Source code in meld/conll.py
@CoNLLDialectRegistry.register("conllu_plus")
@dataclass(slots=True)
class CoNLLUPlus:
    """
    CoNLL-U Plus format with NER tags used by the NYTK-NerKor dataset. For reference, see: https://universaldependencies.org/ext-format.html

    Attributes:
        form: The token text.
        lemma: Lemma or stem.
        upos: Universal part-of-speech tag.
        xpos: Language-specific part-of-speech tag.
        feats: Dictionary of morphological features.
        ner: The NER tag in BIO format.
        emmorph_lemma: Lemma derived from the Hungarian emMorph
            morphological analyzer.
    """

    form: str
    lemma: str
    upos: MaybeUnderscore
    xpos: MaybeUnderscore
    feats: Annotated[dict[str, str] | None, key_value_list(), maybe_empty()]
    ner: BIOField
    emmorph_lemma: str

    @staticmethod
    def parse_comments(comments: list[str]) -> CoNLLMetadata:
        """
        Parse CoNLL-U style comments into CoNLLMetadata.

        Args:
            comments: List of comment strings starting with "#".

        Returns:
            Parsed CoNLLMetadata.
        """

        return CoNLLMetadata.with_key_value(parse_key_value(comment) for comment in comments)

    @staticmethod
    def is_document_start(_: CoNLLMetadata) -> bool:
        """
        Check if the metadata indicates the start of a new document.

        Returns:
            Always returns True for CoNLLUPlus.
        """

        return True
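
The `feats` column combines two behaviours that are not listed here: `key_value_list()` and `maybe_empty()`. A minimal sketch of the combined effect follows, assuming the CoNLL-U conventions of `|` between pairs, `=` between key and value, and `_` for an empty column. `parse_feats` is a hypothetical name and the defaults are assumptions.

```python
def parse_feats(value: str, pair_sep: str = "|", key_value_sep: str = "=") -> "dict[str, str] | None":
    """Turn "Case=Nom|Number=Sing" into a dict; an underscore means no value."""
    if value == "_":
        return None
    # Split each pair once, so a value containing the separator stays intact.
    return dict(pair.split(key_value_sep, 1) for pair in value.split(pair_sep))


features = parse_feats("Case=Nom|Number=Sing")
no_features = parse_feats("_")
```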

is_document_start(_) staticmethod

Check if the metadata indicates the start of a new document.

Returns:

Type Description
bool

Always returns True for CoNLLUPlus.

Source code in meld/conll.py
@staticmethod
def is_document_start(_: CoNLLMetadata) -> bool:
    """
    Check if the metadata indicates the start of a new document.

    Returns:
        Always returns True for CoNLLUPlus.
    """

    return True

parse_comments(comments) staticmethod

Parse CoNLL-U style comments into CoNLLMetadata.

Parameters:

Name Type Description Default
comments list[str]

List of comment strings starting with "#".

required

Returns:

Type Description
CoNLLMetadata

Parsed CoNLLMetadata.

Source code in meld/conll.py
@staticmethod
def parse_comments(comments: list[str]) -> CoNLLMetadata:
    """
    Parse CoNLL-U style comments into CoNLLMetadata.

    Args:
        comments: List of comment strings starting with "#".

    Returns:
        Parsed CoNLLMetadata.
    """

    return CoNLLMetadata.with_key_value(parse_key_value(comment) for comment in comments)

CoNLLWeiboNER dataclass

CoNLL-style format from the WeiboNER dataset in which word position indices are stored as part of the word form column. Word position indices are removed during deserialization.

Attributes:

Name Type Description
form Annotated[str, BeforeValidator(_strip_word_position)]

The token form/word with word position indices removed.

ner BIOField

The NER tag in BIO format.

Source code in meld/conll.py
@CoNLLDialectRegistry.register("conll_weibo_ner")
@dataclass(slots=True)
class CoNLLWeiboNER:
    """
    CoNLL-style format from the WeiboNER dataset in which word position indices are stored as part of the word form column.
    Word position indices are removed during deserialization.

    Attributes:
        form: The token form/word with word position indices removed.
        ner: The NER tag in BIO format.
    """

    form: Annotated[str, BeforeValidator(_strip_word_position)]
    ner: BIOField
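
The `_strip_word_position` validator is not shown. The sketch below assumes the position index is a run of digits appended to the token, as in `科0` and `技1`, and removes it while keeping at least one leading character. `strip_word_position` and the regular expression are illustrative assumptions, not the library's implementation.

```python
import re

# Lazily match the token, then a trailing run of digits (the assumed index).
_POSITION_SUFFIX = re.compile(r"^(.+?)\d+$")


def strip_word_position(form: str) -> str:
    """Drop a trailing position index; leave forms without one unchanged."""
    match = _POSITION_SUFFIX.match(form)
    return match.group(1) if match else form


stripped = [strip_word_position(f) for f in ["科0", "技1", "。"]]
```

Because at least one character is always kept, a digit token followed by its index (for example `10`, read as the token `1` at position `0`) is reduced to the token alone under this assumption.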

CoNLLWithPOS dataclass

CoNLL-style format with POS and NER tags.

Attributes:

Name Type Description
form str

The token text.

pos str

The part-of-speech tag.

ner BIOField

The NER tag in BIO format.

Source code in meld/conll.py
@CoNLLDialectRegistry.register("conll_with_pos")
@dataclass(slots=True)
class CoNLLWithPOS:
    """
    CoNLL-style format with POS and NER tags.

    Attributes:
        form: The token text.
        pos: The part-of-speech tag.
        ner: The NER tag in BIO format.
    """

    form: str
    pos: str
    ner: BIOField

ColumnarCoNLL

Bases: Protocol

Protocol for CoNLL-style formats that support column-based segmentation.

Source code in meld/conll.py
@runtime_checkable
class ColumnarCoNLL(Protocol):
    """
    Protocol for CoNLL-style formats that support column-based segmentation."""

    @staticmethod
    def segment_columns(rows: Iterable[dict[str, Any]]) -> Iterable[list[dict[str, Any]]]:
        """
        Segment sentence-indexed rows into sentence-level groups.

        Args:
            rows: Iterable of indexed rows.

        Returns:
            Iterable of sentence-level row groups.
        """
        ...

segment_columns(rows) staticmethod

Segment sentence-indexed rows into sentence-level groups.

Parameters:

Name Type Description Default
rows Iterable[dict[str, Any]]

Iterable of indexed rows.

required

Returns:

Type Description
Iterable[list[dict[str, Any]]]

Iterable of sentence-level row groups.

Source code in meld/conll.py
@staticmethod
def segment_columns(rows: Iterable[dict[str, Any]]) -> Iterable[list[dict[str, Any]]]:
    """
    Segment sentence-indexed rows into sentence-level groups.

    Args:
        rows: Iterable of indexed rows.

    Returns:
        Iterable of sentence-level row groups.
    """
    ...

CommentedCoNLL

Bases: Protocol

Protocol for CoNLL-style formats that support comment-based metadata.

Source code in meld/conll.py
@runtime_checkable
class CommentedCoNLL(Protocol):
    """
    Protocol for CoNLL-style formats that support comment-based metadata."""

    @staticmethod
    def parse_comments(comments: list[str]) -> CoNLLMetadata:
        """
        Parse CoNLL metadata from comments.

        Args:
            comments: List of comment strings.

        Returns:
            Parsed CoNLLMetadata.
        """
        ...

    @staticmethod
    def is_document_start(metadata: CoNLLMetadata) -> bool:
        """
        Check if metadata indicates the start of a new document.

        Args:
            metadata: The CoNLLMetadata to check.

        Returns:
            True at the start of a new document.
        """
        ...

is_document_start(metadata) staticmethod

Check if metadata indicates the start of a new document.

Parameters:

Name Type Description Default
metadata CoNLLMetadata

The CoNLLMetadata to check.

required

Returns:

Type Description
bool

True at the start of a new document.

Source code in meld/conll.py
@staticmethod
def is_document_start(metadata: CoNLLMetadata) -> bool:
    """
    Check if metadata indicates the start of a new document.

    Args:
        metadata: The CoNLLMetadata to check.

    Returns:
        True at the start of a new document.
    """
    ...

parse_comments(comments) staticmethod

Parse CoNLL metadata from comments.

Parameters:

Name Type Description Default
comments list[str]

List of comment strings.

required

Returns:

Type Description
CoNLLMetadata

Parsed CoNLLMetadata.

Source code in meld/conll.py
@staticmethod
def parse_comments(comments: list[str]) -> CoNLLMetadata:
    """
    Parse CoNLL metadata from comments.

    Args:
        comments: List of comment strings.

    Returns:
        Parsed CoNLLMetadata.
    """
    ...

DocumentSeparatedCoNLL

Bases: Protocol

Protocol for CoNLL-style formats separated by document markers.

Attributes:

Name Type Description
docstart str

Class variable indicating the document start marker.

ignore_docstart bool

Class variable indicating whether to ignore document start markers.

Source code in meld/conll.py
@runtime_checkable
class DocumentSeparatedCoNLL(Protocol):
    """
    Protocol for CoNLL-style formats separated by document markers.

    Attributes:
        docstart: Class variable indicating the document start marker.
        ignore_docstart: Class variable indicating whether to ignore
            document start markers.
    """

    docstart: ClassVar[str]
    ignore_docstart: ClassVar[bool]
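
Because the protocol is `runtime_checkable`, a reader can use `isinstance` to decide whether a dialect carries document markers. The sketch below reproduces the protocol shape with hypothetical class names; note that the runtime check only tests that the attributes exist, not that they have the annotated types.

```python
from typing import ClassVar, Protocol, runtime_checkable


@runtime_checkable
class DocumentSeparated(Protocol):
    """Stand-in with the same members as DocumentSeparatedCoNLL."""

    docstart: ClassVar[str]
    ignore_docstart: ClassVar[bool]


class WithMarker:
    # Hypothetical dialect that declares both class variables.
    docstart: ClassVar[str] = "-DOCSTART-"
    ignore_docstart: ClassVar[bool] = False


class WithoutMarker:
    # Hypothetical dialect with no document marker at all.
    pass


has_marker = isinstance(WithMarker(), DocumentSeparated)
lacks_marker = isinstance(WithoutMarker(), DocumentSeparated)
```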

FlatIndices dataclass

CoNLL-style columnar format for data without blank lines, indicating documents and sentences via index columns.

Attributes:

Name Type Description
ner int

NER label.

form str

The token text.

doc_idx int

Document index.

sent_idx int

Sentence index.

Source code in meld/conll.py
@CoNLLColumnarRegistry.register("flat")
@dataclass(slots=True)
class FlatIndices:
    """
    CoNLL-style columnar format for data without blank lines, indicating documents and sentences via index columns.

    Attributes:
        ner: NER label.
        form: The token text.
        doc_idx: Document index.
        sent_idx: Sentence index.
    """

    ner: int
    form: str
    doc_idx: int
    sent_idx: int

    @staticmethod
    def segment_columns(rows: Iterable[dict[str, Any]]) -> Iterable[list[dict[str, Any]]]:
        """
        Segment flat index rows into sentence-level groups.

        Args:
            rows: Iterable of flat index rows.

        Returns:
            Iterable of sentence-level row groups.
        """

        current_index = 0
        sentence_rows = []
        for row in rows:
            index = row["sent_idx"]
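            if index != current_index:
                current_index = index
                # Guard against yielding an empty group when the first
                # sentence index is not 0.
                if sentence_rows:
                    yield sentence_rows
                sentence_rows = []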

            sentence_rows.append(row)

        if sentence_rows:
            yield sentence_rows

segment_columns(rows) staticmethod

Segment flat index rows into sentence-level groups.

Parameters:

Name Type Description Default
rows Iterable[dict[str, Any]]

Iterable of flat index rows.

required

Returns:

Type Description
Iterable[list[dict[str, Any]]]

Iterable of sentence-level row groups.

Source code in meld/conll.py
@staticmethod
def segment_columns(rows: Iterable[dict[str, Any]]) -> Iterable[list[dict[str, Any]]]:
    """
    Segment flat index rows into sentence-level groups.

    Args:
        rows: Iterable of flat index rows.

    Returns:
        Iterable of sentence-level row groups.
    """

    current_index = 0
    sentence_rows = []
    for row in rows:
        index = row["sent_idx"]
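        if index != current_index:
            current_index = index
            # Guard against yielding an empty group when the first
            # sentence index is not 0.
            if sentence_rows:
                yield sentence_rows
            sentence_rows = []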

        sentence_rows.append(row)

    if sentence_rows:
        yield sentence_rows
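
An alternative sketch, not the library's implementation: the same segmentation can be expressed with `itertools.groupby`. This variant keys on both index columns, so it also starts a new group when only `doc_idx` changes, and it never emits an empty group. `segment_by_indices` is a hypothetical name, and like the original it assumes rows arrive in file order.

```python
from itertools import groupby
from typing import Any, Iterable, Iterator


def segment_by_indices(rows: Iterable[dict[str, Any]]) -> Iterator[list[dict[str, Any]]]:
    """Yield consecutive rows that share the same (doc_idx, sent_idx) pair."""
    for _, group in groupby(rows, key=lambda row: (row["doc_idx"], row["sent_idx"])):
        yield list(group)


rows = [
    {"ner": 0, "form": "Ada", "doc_idx": 0, "sent_idx": 1},
    {"ner": 0, "form": "wrote", "doc_idx": 0, "sent_idx": 1},
    {"ner": 0, "form": "Yes", "doc_idx": 0, "sent_idx": 2},
    {"ner": 0, "form": "No", "doc_idx": 1, "sent_idx": 2},
]
groups = list(segment_by_indices(rows))
group_forms = [[row["form"] for row in group] for group in groups]
```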

NerSuiteCoNLL dataclass

NERSuite CoNLL-style format with character indices mapping tokens to the source text. For reference, see: https://nersuite.nlplab.org/advanced_usage.html

Attributes:

Name Type Description
ner BIOField

The NER tag in BIO format.

start int

Start character offset of the token in the source text.

end int

End character offset of the token in the source text.

form str

The token text.

lemma str

Lemma or stem.

pos MaybeUnderscore

Part-of-speech tag.

chunk BIOField

Chunk tag in BIO format.

Source code in meld/conll.py
@CoNLLDialectRegistry.register("ner_suite")
@dataclass(slots=True)
class NerSuiteCoNLL:
    """
    NERSuite CoNLL-style format with character indices mapping tokens to the source text. For reference, see: https://nersuite.nlplab.org/advanced_usage.html

    Attributes:
        ner: The NER tag in BIO format.
        start: Start character offset of the token in the source text.
        end: End character offset of the token in the source text.
        form: The token text.
        lemma: Lemma or stem.
        pos: Part-of-speech tag.
        chunk: Chunk tag in BIO format.
    """

    ner: BIOField
    start: int
    end: int
    form: str
    lemma: str
    pos: MaybeUnderscore
    chunk: BIOField
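
The `start` and `end` columns let each token be checked against the source text. The sketch below assumes zero-based offsets with an exclusive `end`, so that `text[start:end]` reproduces `form`; that convention is an assumption here and should be confirmed against the NERSuite documentation. `find_offset_mismatches` is a hypothetical helper.

```python
def find_offset_mismatches(text: str, tokens: list[tuple[int, int, str]]) -> list[tuple[int, int, str]]:
    """Return the (start, end, form) triples whose span does not match the form."""
    return [(start, end, form) for start, end, form in tokens if text[start:end] != form]


text = "IL-2 gene expression"
tokens = [(0, 4, "IL-2"), (5, 9, "gene"), (10, 20, "expression")]
mismatches = find_offset_mismatches(text, tokens)
```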

RowParser

Parser for CoNLL-style rows supporting arbitrary CoNLL dialects.

Type Parameters:

T The CoNLL dialect class used for parsing rows.

Parameters:

Name Type Description Default
dialect type[T]

The CoNLL dialect class to parse rows with.

required

Source code in meld/conll.py
class RowParser[T]:
    """
    Parser for CoNLL-style rows supporting arbitrary CoNLL dialects.

    **Type Parameters:**

    T
        The CoNLL dialect class used for parsing rows.

    Args:
        dialect: The CoNLL dialect class to parse rows with.
    """

    def __init__(self, dialect: type[T]) -> None:
        self.row_parser = TypeAdapter(dialect)
        fields = dataclasses.fields(dialect)  # type: ignore
        self.fields = [field.name for field in fields]

    def validate_row(self, row: list[str]) -> T:
        """
        Validate and parse a row using the parser's CoNLL dialect.

        Args:
            row: List of fields in the row.

        Returns:
            Parsed row using the parser's CoNLL dialect.
        """

        return self.row_parser.validate_python({field: value for field, value in zip(self.fields, row)})

validate_row(row)

Validate and parse a row using the parser's CoNLL dialect.

Parameters:

Name Type Description Default
row list[str]

List of fields in the row.

required

Returns:

Type Description
T

Parsed row using the parser's CoNLL dialect.

Source code in meld/conll.py
def validate_row(self, row: list[str]) -> T:
    """
    Validate and parse a row using the parser's CoNLL dialect.

    Args:
        row: List of fields in the row.

    Returns:
        Parsed row using the parser's CoNLL dialect.
    """

    return self.row_parser.validate_python({field: value for field, value in zip(self.fields, row)})
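
The mapping step inside `validate_row` can be tried without pydantic. The sketch below uses a hypothetical `TokenRow` dataclass and shows two properties of the approach: `dataclasses.fields` skips `ClassVar` attributes such as `docstart`, so they never count as columns, and `zip` stops at the shorter input, so surplus columns are dropped and missing ones are simply absent from the mapping.

```python
import dataclasses
from dataclasses import dataclass
from typing import ClassVar


@dataclass
class TokenRow:
    """Hypothetical three-column dialect with one class-level marker."""

    docstart: ClassVar[str] = "-DOCSTART-"  # not a column

    form: str
    pos: str
    ner: str


def row_to_mapping(dialect: type, row: list[str]) -> dict[str, str]:
    """Pair field names with column values in declaration order."""
    names = [f.name for f in dataclasses.fields(dialect)]
    return dict(zip(names, row))


mapping = row_to_mapping(TokenRow, ["London", "NNP", "B-LOC"])
token = TokenRow(**mapping)
short = row_to_mapping(TokenRow, ["London", "NNP"])
long = row_to_mapping(TokenRow, ["London", "NNP", "B-LOC", "extra"])
```

In `RowParser` the mapping is handed to a pydantic `TypeAdapter`, which rejects a mapping with a missing required field, so a short row fails validation there rather than producing a partial object.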

Sentence dataclass

A parsed CoNLL-style sentence with optional metadata.

Type Parameters:

T The type of rows in the sentence.

Attributes:

Name Type Description
rows list[T]

List of parsed CoNLL-style rows.

meta CoNLLMetadata | None

Optional CoNLLMetadata for the sentence.

Source code in meld/conll.py
@dataclass(slots=True)
class Sentence[T: RowType]:
    """
    A parsed CoNLL-style sentence with optional metadata.

    **Type Parameters:**

    T
        The type of rows in the sentence.

    Attributes:
        rows: List of parsed CoNLL-style rows.
        meta: Optional CoNLLMetadata for the sentence.
    """

    rows: list[T]
    meta: CoNLLMetadata | None = None

UNERCoNLLU dataclass

Universal NER CoNLL-U-style format with UNER tags and (optionally) original NER tags.

Attributes:

Name Type Description
id int | Decimal

Token ID.

form str

The token text.

ner BIOField

The NER tag in BIO format.

original_ner Annotated[list[BIO] | None, BeforeValidator(_maybe_multi_label), maybe_empty('-')]

Original NER tags for converted datasets.

annotator MaybeHyphen

Optional name of the annotator.

Source code in meld/conll.py
@CoNLLDialectRegistry.register("conllu_uner")
@dataclass(slots=True)
class UNERCoNLLU:
    """
    Universal NER CoNLL-U-style format with UNER tags and (optionally) original NER tags.

    Attributes:
        id: Token ID.
        form: The token text.
        ner: The NER tag in BIO format.
        original_ner: Original NER tags for converted datasets.
        annotator: Optional name of the annotator.
    """

    id: int | Decimal
    form: str
    ner: BIOField
    original_ner: Annotated[list[BIO] | None, BeforeValidator(_maybe_multi_label), maybe_empty("-")]
    annotator: MaybeHyphen

    @staticmethod
    def parse_comments(comments: list[str]) -> CoNLLMetadata:
        """
        Parse CoNLL-U style comments into CoNLLMetadata.

        Args:
            comments: List of comment strings starting with "#".

        Returns:
            Parsed CoNLLMetadata.
        """

        return CoNLLMetadata.with_key_value(parse_key_value(comment) for comment in comments)

    @staticmethod
    def is_document_start(metadata: CoNLLMetadata) -> bool:
        """
        Check if the metadata indicates the start of a new document.

        Args:
            metadata: The CoNLLMetadata to check.

        Returns:
            `True` if a "newdoc id" is present, indicating the start of
            a new document.
        """

        return metadata.get_meta("newdoc id") is not None

is_document_start(metadata) staticmethod

Check if the metadata indicates the start of a new document.

Parameters:

Name Type Description Default
metadata CoNLLMetadata

The CoNLLMetadata to check.

required

Returns:

Type Description
bool

True if a "newdoc id" is present, indicating the start of a new document.

Source code in meld/conll.py
@staticmethod
def is_document_start(metadata: CoNLLMetadata) -> bool:
    """
    Check if the metadata indicates the start of a new document.

    Args:
        metadata: The CoNLLMetadata to check.

    Returns:
        `True` if a "newdoc id" is present, indicating the start of
        a new document.
    """

    return metadata.get_meta("newdoc id") is not None

parse_comments(comments) staticmethod

Parse CoNLL-U style comments into CoNLLMetadata.

Parameters:

Name Type Description Default
comments list[str]

List of comment strings starting with "#".

required

Returns:

Type Description
CoNLLMetadata

Parsed CoNLLMetadata.

Source code in meld/conll.py
@staticmethod
def parse_comments(comments: list[str]) -> CoNLLMetadata:
    """
    Parse CoNLL-U style comments into CoNLLMetadata.

    Args:
        comments: List of comment strings starting with "#".

    Returns:
        Parsed CoNLLMetadata.
    """

    return CoNLLMetadata.with_key_value(parse_key_value(comment) for comment in comments)

UNERCoNLLUNewPar dataclass

Bases: UNERCoNLLU

Variant of the Universal NER CoNLL-U-style format using newpar comments as a document separator.

Attributes:

Name Type Description
id int | Decimal

Token ID.

form str

The token text.

ner BIOField

The NER tag in BIO format.

original_ner Annotated[list[BIO] | None, BeforeValidator(_maybe_multi_label), maybe_empty('-')]

Original NER tags for converted datasets.

annotator MaybeHyphen

Optional name of the annotator.

Source code in meld/conll.py
@CoNLLDialectRegistry.register("conllu_uner_newpar")
@dataclass(slots=True)
class UNERCoNLLUNewPar(UNERCoNLLU):
    """
    Variant of the Universal NER CoNLL-U-style format using `newpar` comments as a document separator.

    Attributes:
        id: Token ID.
        form: The token text.
        ner: The NER tag in BIO format.
        original_ner: Original NER tags for converted datasets.
        annotator: Optional name of the annotator.
    """

    @staticmethod
    def is_document_start(metadata: CoNLLMetadata) -> bool:
        """
        Check if the metadata indicates the start of a new document.

        Args:
            metadata: The CoNLLMetadata to check.

        Returns:
            `True` if a "newpar" comment is present, indicating the start of
            a new document.
        """

        return any(comment.strip() == "newpar" for comment in metadata.comments)

is_document_start(metadata) staticmethod

Check if the metadata indicates the start of a new document.

Parameters:

Name Type Description Default
metadata CoNLLMetadata

The CoNLLMetadata to check.

required

Returns:

Type Description
bool

True if a "newpar" comment is present, indicating the start of a new document.

Source code in meld/conll.py
@staticmethod
def is_document_start(metadata: CoNLLMetadata) -> bool:
    """
    Check if the metadata indicates the start of a new document.

    Args:
        metadata: The CoNLLMetadata to check.

    Returns:
        `True` if a "newpar" comment is present, indicating the start of
        a new document.
    """

    return any(comment.strip() == "newpar" for comment in metadata.comments)

conllu_id(token_id)

Parses CoNLL-U token IDs.

Parameters:

Name Type Description Default
token_id str | float | Decimal

Token ID string representation (may contain hyphen for multi-word tokens or decimal for empty nodes).

required

Returns:

Type Description
int | Decimal | range

An integer for simple IDs, Decimal for empty nodes, or range for multi-word tokens.

Raises:

Type Description
ValueError

When a decimal ID is less than or equal to 0.

Source code in meld/conll.py
def conllu_id(token_id: str | float | Decimal) -> int | Decimal | range:
    """
    Parses CoNLL-U token IDs.

    Args:
        token_id: Token ID string representation (may contain hyphen for
            multi-word tokens or decimal for empty nodes).

    Returns:
        An integer for simple IDs, Decimal for empty nodes, or range for multi-
        word tokens.

    Raises:
        ValueError: When a decimal ID is less than or equal to 0.
    """

    if isinstance(token_id, str):
        if len(id_range := token_id.split("-", 1)) > 1:
            return range(*map(int, id_range))
        try:
            return int(token_id)
        except ValueError:
            decimal = Decimal(token_id)
            if decimal <= 0:
                raise ValueError("CoNLL-U decimal IDs have to be greater than 0")
            return decimal
    if isinstance(token_id, float):
        if token_id <= 0:
            raise ValueError("CoNLL-U decimal IDs have to be greater than 0")
        return Decimal(token_id)

    return token_id
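
The three return shapes for string input can be illustrated with a small restatement of the string branch above. `sketch_conllu_id` is a hypothetical name used only for this sketch, not part of the module.

```python
from decimal import Decimal
from typing import Union


def sketch_conllu_id(token_id: str) -> Union[int, Decimal, range]:
    """Illustrative restatement of the string branch of conllu_id."""
    parts = token_id.split("-", 1)
    if len(parts) > 1:
        # "3-4" becomes range(3, 4), mirroring range(*map(int, ...)) above
        start, end = map(int, parts)
        return range(start, end)
    try:
        return int(token_id)
    except ValueError:
        value = Decimal(token_id)
        if value <= 0:
            raise ValueError("CoNLL-U decimal IDs have to be greater than 0")
        return value


simple = sketch_conllu_id("7")        # 7
empty_node = sketch_conllu_id("2.1")  # Decimal("2.1")
multiword = sketch_conllu_id("3-4")   # range(3, 4)
```

Python ranges are half-open, so `range(3, 4)` contains only `3`. Code that needs both endpoints of a multi-word token span can read `multiword.start` and `multiword.stop`.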

key_value_list(list_sep='|', key_value_sep='=', allow_empty=True)

Create a Pydantic BeforeValidator for parsing key-value list strings into dictionaries.

The input string is parsed as a list of key-value pairs, where each pair is separated by list_sep and the key and value within each pair are separated by key_value_sep.

Parameters:

Name Type Description Default
list_sep str

Separator between key-value pairs.

'|'
key_value_sep str

Separator between keys and values.

'='
allow_empty bool

Whether an empty list input should be allowed or raise a ValueError.

True

Returns:

Type Description
BeforeValidator

A Pydantic BeforeValidator that parses key-value list strings.

Raises:

Type Description
ValueError

If list_sep and key_value_sep are the same.

ValueError

Raised by the returned validator if the input is None and allow_empty is False.

Source code in meld/conll.py
def key_value_list(list_sep: str = "|", key_value_sep: str = "=", allow_empty: bool = True) -> BeforeValidator:
    """
    Create a Pydantic BeforeValidator for parsing key-value list strings into dictionaries.

    The input string is parsed as a list of key-value pairs, where each pair is separated
    by `list_sep` and the key and value within each pair are separated by `key_value_sep`.

    Args:
        list_sep: Separator between key-value pairs.
        key_value_sep: Separator between keys and values.
        allow_empty: Whether an empty list input should be allowed or
            raise a `ValueError`.

    Returns:
        A Pydantic BeforeValidator that parses key-value list strings.

    Raises:
        ValueError: If `list_sep` and `key_value_sep` are the same.
        ValueError: Raised by the returned validator if the input is `None` and `allow_empty` is `False`.
    """

    if list_sep == key_value_sep:
        raise ValueError(f"List and key value separator need to differ but were {list_sep!r} == {key_value_sep!r}")

    def transform(string: str | None) -> dict[str, str] | None:
        if string is None:
            if not allow_empty:
                raise ValueError("Key-value list can not be empty")
            return None

        return dict(pair.split(key_value_sep, 1) for pair in string.split(list_sep))

    return BeforeValidator(transform)
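
The splitting step inside the validator can be tried without Pydantic. The helper below is a hypothetical stand-alone version of the `dict(...)` expression in `transform`, using the default separators.

```python
def split_key_values(string: str, list_sep: str = "|", key_value_sep: str = "=") -> dict[str, str]:
    """Split "k1=v1|k2=v2" into a dictionary, as the inner transform does."""
    # maxsplit=1 keeps any further separators inside the value
    return dict(pair.split(key_value_sep, 1) for pair in string.split(list_sep))


feats = split_key_values("Case=Nom|Number=Sing")
# feats == {"Case": "Nom", "Number": "Sing"}

nested = split_key_values("Gloss=a=b")
# nested == {"Gloss": "a=b"}
```

A pair that lacks the key-value separator (for example `"Case"`) produces a one-element list, which `dict()` rejects with a `ValueError`.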

maybe_empty(empty_value='_')

Create a Pydantic BeforeValidator that converts strings containing a CoNLL-style empty value to None.

The validator will raise a ValueError if the input string is the empty string, unless empty_value is also set to the empty string.

Parameters:

Name Type Description Default
empty_value str

The string value representing empty fields.

'_'

Returns:

Type Description
BeforeValidator

A Pydantic BeforeValidator that transforms empty values to None.

Source code in meld/conll.py
def maybe_empty(empty_value: str = "_") -> BeforeValidator:
    """
    Create a Pydantic BeforeValidator that converts strings containing a CoNLL-style empty value to `None`.

    The validator will raise a `ValueError` if the input string is the empty string, unless `empty_value` is also set to the empty string.

    Args:
        empty_value: The string value representing empty fields.

    Returns:
        A Pydantic BeforeValidator that transforms empty values to None.
    """

    def transform(string: str) -> str | None:
        if not string and empty_value:
            raise ValueError(f"Value must be non-empty. All empty values must contain {empty_value!r} explicitly")
        return string if string != empty_value else None

    return BeforeValidator(transform)
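
The behavior of the inner `transform` can be checked in isolation. The sketch below restates it without the `BeforeValidator` wrapper; `make_transform` is a hypothetical name for illustration.

```python
from typing import Callable, Optional


def make_transform(empty_value: str = "_") -> Callable[[str], Optional[str]]:
    """Build the plain function that maybe_empty wraps in a validator."""

    def transform(string: str) -> Optional[str]:
        # A truly empty string is rejected unless "" is the empty marker itself
        if not string and empty_value:
            raise ValueError(f"Value must be non-empty, use {empty_value!r} for empty fields")
        return string if string != empty_value else None

    return transform


underscore = make_transform()      # default marker "_"
hyphen = make_transform("-")       # marker "-" is used by the hyphen-based fields above

placeholder = underscore("_")      # None
regular = underscore("NOUN")       # "NOUN"
other_marker = hyphen("_")         # "_" is ordinary text when the marker is "-"
```

Only an exact match with the marker is converted to `None`, so a value such as `"__"` passes through unchanged.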

parse(lines, dialect=CoNLL, delimiter='\t', enforce_blank_lines=True, use_comment_document_boundary=True)

Parse CoNLL-style lines into documents with sentences.

Type Parameters:

T The CoNLL dialect class used for parsing rows.

Parameters:

Name Type Description Default
lines Iterable[str]

Iterable of line strings.

required
dialect type[T]

The CoNLL dialect class to use for parsing.

CoNLL
delimiter str

Field delimiter separating the CoNLL-style columns.

'\t'
enforce_blank_lines bool

Whether to enforce that blank lines between segments are empty. Otherwise, lines that contain only whitespace will be treated as blank.

True
use_comment_document_boundary bool

Whether to parse document boundaries from CoNLL-U-style comments.

True

Returns:

Type Description
Iterator[list[Sentence[T]]]

Iterator over documents (lists of sentences).

Raises:

Type Description
NotImplementedError

When combining -DOCSTART- with CoNLL-U style comments.

Source code in meld/conll.py
def parse[T: RowType](
    lines: Iterable[str],
    dialect: type[T] = CoNLL,
    delimiter: str = "\t",
    enforce_blank_lines: bool = True,
    use_comment_document_boundary: bool = True,
) -> Iterator[list[Sentence[T]]]:
    """
    Parse CoNLL-style lines into documents with sentences.

    **Type Parameters:**

    T
        The CoNLL dialect class used for parsing rows.

    Args:
        lines: Iterable of line strings.
        dialect: The CoNLL dialect class to use for parsing.
        delimiter: Field delimiter separating the CoNLL-style columns.
        enforce_blank_lines: Whether to enforce that blank lines between
            segments are empty. Otherwise, lines that contain only
            whitespace will be treated as blank.
        use_comment_document_boundary: Whether to parse document
            boundaries from CoNLL-U-style comments.

    Returns:
        Iterator over documents (lists of sentences).

    Raises:
        NotImplementedError: When combining -DOCSTART- with CoNLL-U
            style comments.
    """

    if isinstance(dialect, CommentedCoNLL):
        yield from _parse_with_comments(lines, dialect, delimiter, use_comment_document_boundary)
        return

    parser = RowParser(dialect)

    if isinstance(dialect, DocumentSeparatedCoNLL):
        docstart = dialect.docstart
        ignore_docstart = dialect.ignore_docstart
    else:
        docstart = None
        ignore_docstart = True

    document_sentences = []
    for segment in space_separated_segments(lines, enforce_blank_lines):
        if docstart is not None and segment[0].startswith(docstart):
            # Ignore and skip -DOCSTART- lines
            if ignore_docstart:
                continue

            # Avoids yielding an empty document at the start of the file
            if document_sentences:
                yield document_sentences
                document_sentences = []

            # Skip docstart only segments entirely and otherwise remove the docstart header
            if len(segment) == 1:
                continue

            segment = segment[1:]

        sentence = Sentence([parser.validate_row(line.split(delimiter)) for line in segment])

        if ignore_docstart:
            yield [sentence]
        else:
            document_sentences.append(sentence)

    # Handle trailing document
    if document_sentences:
        yield document_sentences
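
The document grouping performed in the `-DOCSTART-` branch can be sketched on raw segments, with row validation left out. `group_documents` and the sample lines are hypothetical; the sketch assumes `ignore_docstart` is `False`.

```python
from typing import Iterable, Iterator

DOCSTART = "-DOCSTART-"


def group_documents(segments: Iterable[list[str]]) -> Iterator[list[list[str]]]:
    """Group blank-line separated segments into documents at -DOCSTART- markers."""
    document: list[list[str]] = []
    for segment in segments:
        if segment[0].startswith(DOCSTART):
            # A marker closes the running document, if there is one
            if document:
                yield document
                document = []
            # A segment holding only the marker carries no sentence
            if len(segment) == 1:
                continue
            segment = segment[1:]
        document.append(segment)
    # Trailing document after the last marker
    if document:
        yield document


sample = [
    ["-DOCSTART- -X- O O"],
    ["EU\tB-ORG", "rejects\tO"],
    ["German\tB-MISC"],
    ["-DOCSTART- -X- O O"],
    ["Peter\tB-PER"],
]
documents = list(group_documents(sample))
# Two documents: the first with two sentences, the second with one
```

When a dialect has no document marker, or ignores it, the code above takes the other path and yields every sentence as its own single-sentence document.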

parse_columns(rows, dialect)

Parse columnar rows into sentences using the specified CoNLL dialect.

Type Parameters:

C The CoNLL dialect class used for parsing rows.

Parameters:

Name Type Description Default
rows Iterable[dict[str, Any]]

Iterable of row dictionaries.

required
dialect type[C]

The CoNLL dialect class to use for parsing.

required

Yields: Parsed sentences.

Source code in meld/conll.py
def parse_columns[C: ColumnarCoNLL](rows: Iterable[dict[str, Any]], dialect: type[C]) -> Iterator[Sentence[C]]:
    """
    Parse columnar rows into sentences using the specified CoNLL dialect.

    **Type Parameters:**

    C
        The CoNLL dialect class used for parsing rows.

    Args:
        rows: Iterable of row dictionaries.
        dialect: The CoNLL dialect class to use for parsing.
    Yields:
        Parsed sentences.
    """

    row_parser = TypeAdapter(dialect)

    for segment in dialect.segment_columns(rows):
        yield Sentence([row_parser.validate_python(row) for row in segment])

parse_key_value(comment)

Parses a CoNLL-U comment line into a key-value pair if possible and returns the comment text without its leading "#" otherwise.

Parameters:

Name Type Description Default
comment str

Comment line from CoNLL-U format.

required

Returns:

Type Description
tuple[str, str] | str

(key, value) tuple for key=value format, or a plain string otherwise.

Source code in meld/conll.py
def parse_key_value(comment: str) -> tuple[str, str] | str:
    """
    Parses a CoNLL-U comment line into a key-value pair if possible and returns the comment text without its leading "#" otherwise.

    Args:
        comment: Comment line from CoNLL-U format.

    Returns:
        (key, value) tuple for key=value format, or a plain string
        otherwise.
    """

    match = re.fullmatch(r"#\s*([^=]+?)\s*=\s*(.+?)\s*", comment)
    if match is None:
        return comment.removeprefix("#").lstrip()

    key, value = match.groups()
    return key, value
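
The regular expression can be exercised on a few typical comment lines. The sketch below repeats the pattern from the source under the hypothetical name `split_comment`; it needs Python 3.9 or later for `str.removeprefix`.

```python
import re

PATTERN = re.compile(r"#\s*([^=]+?)\s*=\s*(.+?)\s*")


def split_comment(comment: str):
    """Return (key, value) for "# key = value" comments, else the bare text."""
    match = PATTERN.fullmatch(comment)
    if match is None:
        return comment.removeprefix("#").lstrip()
    return match.groups()


pair = split_comment("# newdoc id = doc1")  # ("newdoc id", "doc1")
text = split_comment("# text = a = b")      # ("text", "a = b")
bare = split_comment("# newpar")            # "newpar"
```

The key may not contain `=`, so the split happens at the first `=` and any later ones stay in the value. Whitespace around the key and the value is trimmed by the surrounding `\s*` groups.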

space_separated_segments(lines, enforce_blank_lines=True)

Split lines into segments separated by blank lines.

Parameters:

Name Type Description Default
lines Iterable[str]

Iterable of line strings.

required
enforce_blank_lines bool

Whether only empty lines count as segment separators. If False, whitespace-only lines also count.

True

Yields: Segments as lists of lines.

Source code in meld/conll.py
def space_separated_segments(lines: Iterable[str], enforce_blank_lines: bool = True) -> Iterator[list[str]]:
    """
    Split lines into segments separated by blank lines.

    Args:
        lines: Iterable of line strings.
        enforce_blank_lines: Whether only empty lines count as segment
            separators. If `False`, whitespace-only lines also count.
    Yields:
        Segments as lists of lines.
    """

    allow_whitespace_in_blank = not enforce_blank_lines
    current_segment = []
    for line in lines:
        line = line.strip("\r\n")
        if not line or (allow_whitespace_in_blank and not line.strip()):
            if current_segment:
                yield current_segment
                current_segment = []
        else:
            current_segment.append(line)

    # Handle final segment for non-compliant files that lack a final blank line
    if current_segment:
        yield current_segment
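
The effect of `enforce_blank_lines` is easiest to see on input that contains a whitespace-only line. The generator below is a condensed, hypothetical restatement of the function above (`segments` and `strict` are illustrative names).

```python
from typing import Iterable, Iterator


def segments(lines: Iterable[str], strict: bool = True) -> Iterator[list[str]]:
    """Condensed restatement of space_separated_segments for illustration."""
    current: list[str] = []
    for line in lines:
        line = line.strip("\r\n")
        # In strict mode only a truly empty line ends a segment
        blank = not line or (not strict and not line.strip())
        if blank:
            if current:
                yield current
                current = []
        else:
            current.append(line)
    if current:
        yield current


lines = ["a\tO\n", "   \n", "b\tO\n", "\n", "c\tO\n"]
strict_result = list(segments(lines, strict=True))
lenient_result = list(segments(lines, strict=False))
```

In strict mode the whitespace-only line stays inside the first segment as if it were a row, so it would reach row parsing. In lenient mode it acts as a separator and the input splits into three segments.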