Datasets for Farsi (Persian) Natural Language Processing (NLP)

Dependency parsing

Dependency parsing is the task of extracting a dependency parse of a sentence: a representation of its grammatical structure that defines the relationships between “head” words and the words that modify those heads.


      | +-------dobj---------+
      | |                    |
nsubj | |   +------det-----+ | +-----nmod------+
+--+  | |   |              | | |               |
|  |  | |   |      +-nmod-+| | |      +-case-+ |
+  |  + |   +      +      || + |      +      | |
I  prefer  the  morning   flight  through  Denver

Relations among the words are illustrated above the sentence with directed, labeled arcs from heads to dependents (+ indicates the dependent).
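
The arcs in the diagram can also be written down as data. The following is a minimal sketch, not a standard API: each word stores the 1-based position of its head and the label of the arc that reaches it. The names `sentence`, `heads`, `labels`, `arcs` and `dependents` are chosen here for illustration. Treating "prefer" as the root with head index 0 is an assumption, since the diagram does not label that arc.

```python
# Words of the example sentence, in order.
sentence = ["I", "prefer", "the", "morning", "flight", "through", "Denver"]

# heads[i] is the 1-based position of the head of word i+1.
# 0 marks the root (assumed to be "prefer"; the diagram leaves it unlabeled).
heads = [2, 0, 5, 5, 2, 7, 5]

# labels[i] is the relation on the arc from the head to word i+1.
labels = ["nsubj", "root", "det", "nmod", "dobj", "case", "nmod"]


def arcs(words, heads, labels):
    """Return (head_word, label, dependent_word) triples, one per word."""
    triples = []
    for i, (head, label) in enumerate(zip(heads, labels)):
        head_word = "ROOT" if head == 0 else words[head - 1]
        triples.append((head_word, label, words[i]))
    return triples


def dependents(words, heads, head_position):
    """Return the words whose head is the word at 1-based head_position."""
    return [words[i] for i, head in enumerate(heads) if head == head_position]


if __name__ == "__main__":
    for head_word, label, dependent in arcs(sentence, heads, labels):
        print(f"{label}({head_word}, {dependent})")
```

Storing one head per word mirrors the tree constraint that every word has exactly one incoming arc, which is also how the treebanks listed below encode their annotations.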

| Title | Description |
| --- | --- |
| Persian CoNLL 2017 [website] [download] | Automatic segmentation, tokenization, and morphological and syntactic annotations of raw texts, provided for the CoNLL 2017 Shared Task in UD Parsing. |
| Uppsala Persian Dependency Treebank: UPDT [website] [download] | UPDT is a dependency-based, syntactically annotated corpus. The treebank consists of 6,000 sentences (151,671 tokens) of written text in CoNLL format. It was developed through a bootstrapping procedure involving the open-source, data-driven dependency parser MaltParser and manual validation of the annotation. |
| Persian Syntactic Dependency Treebank | This treebank has 29,982 annotated sentences, including samples from almost all verbs of the Persian valency lexicon. |
| The Persian Universal Dependency Treebank (PerUDT) | PerUDT is the result of automatic conversion of the Persian Dependency Treebank (PerDT) with extensive manual corrections. The original treebank consists of 29K sentences sampled from contemporary Persian text in different genres, including news, academic papers, magazine articles, and fiction. |
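
Treebanks of this kind are usually distributed as CoNLL-style text files. As a rough sketch, the reader below assumes the ten-column, tab-separated CoNLL-U layout used by Universal Dependencies releases (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC); individual datasets above may use an older CoNLL variant with different columns, so check the files before relying on it. The function name `read_conllu` and the inline `SAMPLE` (built from the English example sentence, not taken from any of the treebanks) are illustrative only.

```python
def read_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of token dicts)."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# text = ..." are skipped.
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        token_id = cols[0]
        if "-" in token_id or "." in token_id:
            # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
            continue
        current.append({
            "id": int(token_id),
            "form": cols[1],
            # HEAD may be "_" in unannotated files; keep it as None then.
            "head": int(cols[6]) if cols[6].isdigit() else None,
            "deprel": cols[7],
        })
    if current:
        sentences.append(current)
    return sentences


# Illustrative sample only: unused columns are filled with "_".
_ROWS = [
    ("1", "I", "2", "nsubj"), ("2", "prefer", "0", "root"),
    ("3", "the", "5", "det"), ("4", "morning", "5", "nmod"),
    ("5", "flight", "2", "dobj"), ("6", "through", "7", "case"),
    ("7", "Denver", "5", "nmod"),
]
SAMPLE = "# text = I prefer the morning flight through Denver\n" + "\n".join(
    "\t".join([i, form, "_", "_", "_", "_", head, rel, "_", "_"])
    for i, form, head, rel in _ROWS
) + "\n\n"

if __name__ == "__main__":
    for token in read_conllu(SAMPLE)[0]:
        print(token["id"], token["form"], token["head"], token["deprel"])
```

To read a downloaded file, pass its contents to the same function, for example `read_conllu(open(path, encoding="utf-8").read())`, where `path` is whatever location the treebank was saved to.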