Datasets for Farsi (Persian) Natural Language Processing (NLP)

Parallel Corpora

A parallel corpus is a collection of texts, each of which is translated into one or more languages other than the original. In the simplest case only two languages are involved: one corpus is an exact translation of the other. Some parallel corpora, however, cover several languages.
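
As a rough illustration of the two-language case, the sketch below reads sentence pairs from two line-aligned plain-text files, where line N of one file is the translation of line N of the other. This layout is an assumption made for the example, not the documented format of any corpus listed here, and the function name `read_parallel` and the demo file names are hypothetical.

```python
import os
import tempfile
from itertools import zip_longest
from typing import Iterator, Tuple


def read_parallel(src_path: str, tgt_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned files.

    Assumes one sentence per line, UTF-8 encoding, and that line N of the
    source file is aligned with line N of the target file.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(tgt_path, encoding="utf-8") as tgt:
        # zip_longest pads the shorter file with None, so a length
        # mismatch is detected instead of being silently truncated.
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {line_no}")
            s, t = s.strip(), t.strip()
            if s and t:  # skip pairs where either side is blank
                yield s, t


if __name__ == "__main__":
    # Tiny made-up demo files (hypothetical names, not a real corpus).
    demo_dir = tempfile.mkdtemp()
    en_path = os.path.join(demo_dir, "demo.en")
    fa_path = os.path.join(demo_dir, "demo.fa")
    with open(en_path, "w", encoding="utf-8") as f:
        f.write("Hello\nThank you\n")
    with open(fa_path, "w", encoding="utf-8") as f:
        f.write("سلام\nمتشکرم\n")
    for en, fa in read_parallel(en_path, fa_path):
        print(en, "|", fa)
```

Corpora distributed in other layouts (for example, tab-separated pairs in a single file or XML-based formats) would need a different reader; check the documentation of each dataset before assuming line alignment.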

A Persian-English parallel corpus with about 1 million sentence pairs collected from masterpieces of literature.
PEPC: Parallel English-Persian Corpus
[website] [download_bidirectional] [download_onedirectional]
PEPC is a collection of parallel English and Persian sentences extracted from Wikipedia documents using a bidirectional translation method.
TEP: Tehran English-Persian parallel corpus
[website] [download]
The first free English-Persian corpus, provided by the Natural Language and Text Processing Laboratory at the University of Tehran.