farsinlp.github.io

Datasets for Farsi (Persian) Natural Language Processing (NLP)

Parallel Corpora

A parallel corpus is a collection of texts, each of which is accompanied by its translation into one or more other languages. In the simplest case only two languages are involved: one text is an exact translation of the other. Some parallel corpora, however, cover several languages.
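
To make the two-language case concrete, here is a minimal sketch of reading sentence pairs from a bilingual corpus. It assumes a line-aligned layout, where line i on the English side is the translation of line i on the Persian side. That layout is common for parallel corpora, but it is an assumption here and is not confirmed for any of the datasets listed below. The function name `align_sentences` and the sample sentences are hypothetical.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, Tuple


def align_sentences(
    source_lines: Iterable[str], target_lines: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned sequences.

    Assumption: line i of the source is the translation of line i of the
    target. Raises ValueError if one side has more lines than the other.
    """
    missing = object()  # sentinel marking the shorter side
    paired = zip_longest(source_lines, target_lines, fillvalue=missing)
    for number, (src, tgt) in enumerate(paired, start=1):
        if src is missing or tgt is missing:
            raise ValueError(f"sides differ in length at line {number}")
        src, tgt = src.strip(), tgt.strip()
        # Skip pairs where either side is blank after stripping whitespace.
        if src and tgt:
            yield src, tgt


# Tiny in-memory stand-in for the two sides of a corpus (hypothetical data).
english = ["Hello.\n", "How are you?\n"]
persian = ["سلام.\n", "حال شما چطور است؟\n"]
pairs = list(align_sentences(english, persian))
```

With real data, the two lists would typically be replaced by open file handles, for example `open(path, encoding="utf-8")`, since Persian text needs an explicit Unicode encoding on some platforms.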

Title	Description
MIZAN [website] [download]	A Persian-English parallel corpus with about 1 million sentence pairs collected from masterpieces of literature.
PEPC: Parallel English-Persian Corpus [website] [download_bidirectional] [download_onedirectional]	PEPC is a collection of parallel English and Persian sentences extracted from Wikipedia documents using a bidirectional translation method.
TEP: Tehran English-Persian parallel corpus [website] [download]	The first free English-Persian corpus, provided by the Natural Language and Text Processing Laboratory, University of Tehran.