Datasets for Farsi (Persian) Natural Language Processing (NLP)

Text Summarization

Text summarization is the process of condensing a document into a shorter summary that preserves its most important information. Summarization can be either extractive or abstractive. The former constructs a summary by concatenating extracts taken from the original document, whereas the latter involves a generation phase (e.g., paraphrasing) that produces new text not necessarily found in the original document.
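To make the extractive case concrete, here is a minimal sketch in Python. It is only an illustration, not the method used by any of the datasets listed below: the function name `extractive_summary` is hypothetical, and scoring sentences by average word frequency is an assumed heuristic chosen for simplicity. The sentence splitter is a plain regex that also recognizes the Persian question mark (؟); real Farsi text would likely need a proper tokenizer.

```python
import re
from collections import Counter


def extractive_summary(text: str, num_sentences: int = 2) -> str:
    """Pick the highest-scoring sentences and return them in original order.

    Assumption: a sentence is "important" if its words are frequent in the
    document. This is a toy heuristic used only for illustration.
    """
    # Split after sentence-ending punctuation (Latin marks and Persian "؟").
    sentences = [s.strip() for s in re.split(r"(?<=[.!?؟])\s+", text) if s.strip()]
    if len(sentences) <= num_sentences:
        return " ".join(sentences)

    # Word frequencies over the whole document (\w matches Persian letters too).
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence: str) -> float:
        tokens = re.findall(r"\w+", sentence.lower())
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    # Rank sentence indices by score, keep the top ones, restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)
```

Every sentence in the output is copied verbatim from the input, which is what distinguishes this from an abstractive system; the latter would typically use a trained sequence-to-sequence model to generate new wording.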

Title Description
Persian abstractive text summarization
A dataset for Farsi abstractive/extractive summarization tasks (like cnn_dailymail for English) with 93,207 records.
Tebyan Dataset
The Tebyan Dataset contains 92,289 document-summary pairs collected from the Tebyan website. The articles cover a variety of subjects and are not limited to news.
Wiki Summary
A summarization dataset extracted from Persian Wikipedia.