farsinlp.github.io

Datasets for Farsi (Persian) Natural Language Processing (NLP)

Text Summarization

Text summarization is the process of condensing a document into a shorter summary that preserves the most important information from the original. Summarization can be either extractive or abstractive. The former constructs a summary by concatenating extracts taken from the original document, whereas the latter involves a generation phase (e.g., paraphrasing) that produces new text not necessarily found in the original document.
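To make the extractive case concrete, here is a minimal sketch of a frequency-based extractive summarizer. It is an illustration only and is not tied to any dataset listed on this page: the function name `extractive_summary`, the naive sentence splitter, and the average-word-frequency score are all assumptions chosen for brevity. The abstractive case is not shown, since it needs a trained generation model.

```python
import re
from collections import Counter


def extractive_summary(text: str, num_sentences: int = 2) -> str:
    """Return the highest-scoring sentences of `text`, kept in original order.

    Hypothetical helper for illustration; the scoring rule is an assumption.
    """
    # Naive sentence split after ., !, ? or the Persian question mark.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?؟])\s+", text) if s.strip()]
    if len(sentences) <= num_sentences:
        return " ".join(sentences)

    # Word frequencies over the whole document (\w is Unicode-aware in Python 3).
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"\w+", sentence.lower())
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by score, keep the best ones, restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)


doc = "Cats sleep a lot. Cats eat fish. The weather was mild yesterday. Cats chase mice."
summary = extractive_summary(doc, num_sentences=2)
```

Every sentence in the result is copied verbatim from the input, which is what distinguishes an extractive summary from an abstractive one. For real Persian text, a proper sentence tokenizer and normalizer would likely be needed in place of the regular expressions used here.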

| Title | Description |
|---|---|
| Persian abstractive text summarization [website] | A dataset for Farsi abstractive/extractive summarization tasks (like cnn_dailymail for English) with 93,207 records. |
| Tebyan Dataset [website] | The Tebyan Dataset contains 92,289 document-summary pairs collected from the Tebyan website. The articles cover various subjects and are not limited to news. |
| Wiki Summary [website] | A summarization dataset extracted from Persian Wikipedia. |