Datasets for Farsi (Persian) Natural Language Processing (NLP)

Raw Text Corpora

Title Description
Persian Wikipedia Corpus
[website] [download]
A complete copy of Persian Wikipedia pages, as plain text (without wikitext markup) with metadata embedded in JSON. The current version is based on the fawiki-20181001 dump and contains 1,160,676 useful articles.
MirasText
[website] [paper] [download]
MirasText is the result of crawling more than 250 Persian news websites. It has more than 2.8 million articles and over 1.4 billion content words.
Farsi News Datasets
[website] [download_hamshahri] [download_radiofarda]
These datasets have been extracted from the RSS feed of two Farsi news agency websites: Hamshahri and RadioFarda.
VOA Corpus
[website] [download]
A Farsi corpus of 7.9 million words, extracted from VOA between 2003 and 2008.
A large collection of Persian raw text
About 80GB of Persian raw text, collected from a variety of sources, particularly CommonCrawl.
W2C – Web to Corpus – Corpora
A set of corpora for 120 languages, automatically collected from Wikipedia and the web.
dotIR Collection
dotIR is a standard Persian test collection suitable for evaluating web information retrieval algorithms on the Iranian web. It contains many Persian web pages, including their text, links, and metadata, stored in XML format. It was prepared to be a good representative of the Iranian web.
irBlogs
irBlogs contains more than 5,000,000 posts and a relation graph covering more than 600,000 Persian weblogs. It can be used in applications such as information retrieval, studying the Persian language in online social networks, and graph algorithms. It also includes 45 queries with relevance judgments created by different users through the UTIRE evaluation system. Several weblog retrieval algorithms were used to build the judgment pool, and users judged 24,339 weblogs in total (on average 540 weblogs per query).
Persian Poems Corpus
This corpus consists of text documents for 48 Persian poets. It comes in three formats: original, normalized (restricted to the 32 main letters of the Farsi alphabet), and with stop words removed. It contains 1,216,286 mesras of Farsi poems and 8,102,119 words, of which 148,588 are unique.
This corpus consists of text documents for 12 Persian poets, crawled from the ganjoor.net website.
Naab
Naab contains about 130GB of data, 250 million paragraphs, and 15 billion words.
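
The Persian Poems Corpus entry above mentions a normalized format restricted to the 32 main letters of the Farsi alphabet, along with total and unique word counts. The corpus authors' exact rules are not documented here, so the following is only a minimal Python sketch of what such a step might look like. The names `normalize` and `corpus_stats`, the mapping of Arabic yeh, Arabic kaf, and alef with madda to their plain Persian counterparts, and the choice to turn every other character (including the zero-width non-joiner) into a space are all assumptions made for illustration.

```python
# The 32 letters of the Persian alphabet, written as Unicode escapes
# so the exact code points are unambiguous.
PERSIAN_ALPHABET = frozenset(
    "\u0627\u0628\u067E\u062A\u062B\u062C\u0686\u062D"
    "\u062E\u062F\u0630\u0631\u0632\u0698\u0633\u0634"
    "\u0635\u0636\u0637\u0638\u0639\u063A\u0641\u0642"
    "\u06A9\u06AF\u0644\u0645\u0646\u0648\u0647\u06CC"
)

# Assumed variant mapping (not taken from the corpus documentation):
# Arabic yeh -> Persian yeh, Arabic kaf -> Persian kaf,
# alef with madda -> plain alef.
VARIANTS = str.maketrans({
    "\u064A": "\u06CC",
    "\u0643": "\u06A9",
    "\u0622": "\u0627",
})


def normalize(text):
    """Keep only the 32 Persian letters; everything else becomes a space."""
    text = text.translate(VARIANTS)
    kept = "".join(ch if ch in PERSIAN_ALPHABET else " " for ch in text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(kept.split())


def corpus_stats(lines, stop_words=frozenset()):
    """Return (total_tokens, unique_tokens) after normalization.

    `stop_words` is a hypothetical, caller-supplied set; no particular
    stop-word list is implied.
    """
    tokens = [
        token
        for line in lines
        for token in normalize(line).split()
        if token not in stop_words
    ]
    return len(tokens), len(set(tokens))
```

Calling `corpus_stats` over the lines of a poem file would give counts comparable in spirit to the word and unique-word figures quoted for the corpus, although the numbers will differ if the original normalization rules are not the ones assumed here.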