farsinlp.github.io

Datasets for Farsi (Persian) Natural Language Processing (NLP)

Pre-trained Language Models

The intuition behind pre-trained language models is to build a single general-purpose model that learns the language from large corpora and can then be fine-tuned or queried to perform specific tasks in that language.
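As a minimal sketch of this pattern, the snippet below loads a pre-trained Persian model and asks it to fill in a masked word. It assumes the Hugging Face transformers library and that the ParsBERT checkpoint is published on the Hugging Face Hub under the identifier HooshvareLab/bert-fa-base-uncased; adjust the model name to whichever checkpoint you actually use.

```python
# Minimal sketch of reusing a pre-trained Persian language model,
# assuming the `transformers` library and the (assumed) Hub checkpoint
# "HooshvareLab/bert-fa-base-uncased" (ParsBERT).
from transformers import pipeline

# Load the pre-trained model once; it can then be reused or fine-tuned
# for many downstream Persian tasks.
fill_mask = pipeline("fill-mask", model="HooshvareLab/bert-fa-base-uncased")

# Ask the model to predict the masked word in a Persian sentence.
for prediction in fill_mask("ما در هوش مصنوعی [MASK] می کنیم."):
    print(prediction["token_str"], prediction["score"])
```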

| Title | Description |
| --- | --- |
| ParsBERT [website] [paper] | ParsBERT is a monolingual language model based on Google's BERT architecture. It is pre-trained on large Persian corpora covering a variety of writing styles and subjects (e.g., scientific texts, novels, news), comprising more than 3.9M documents, 73M sentences, and 1.3B words. |
| ParsPer [website] [download] | ParsPer was created by training the graph-based MateParser, with a selected configuration, on the entire Uppsala Persian Dependency Treebank (UPDT). |
| GPT-2 Persian [website] [demo] | Bolbolzaban/gpt2-persian is a generative language model trained on 27 GB of text collected from various Farsi websites, using hyperparameters similar to the standard gpt2-medium configuration with two differences (see the generation sketch after this table). [more details] |
| ALBERT-Persian [website] | The model is based on Google's ALBERT BASE Version 2.0 and, like ParsBERT, was trained on a variety of writing styles and subjects (e.g., scientific texts, novels, news), comprising more than 3.9M documents, 73M sentences, and 1.3B words. [more details] |
| ARMAN [website] | ARMAN is a language model with pre-training objectives specifically designed to perform well on Persian abstractive summarization. [more details] |
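Most of these checkpoints can be loaded through the Hugging Face transformers library. The sketch below generates text with GPT-2 Persian, assuming the checkpoint is published on the Hub as bolbolzaban/gpt2-persian; the prompt and generation settings are only illustrative.

```python
# Minimal generation sketch, assuming the GPT-2 Persian checkpoint is
# available on the Hugging Face Hub as "bolbolzaban/gpt2-persian".
from transformers import pipeline

generator = pipeline("text-generation", model="bolbolzaban/gpt2-persian")

# Generate a short continuation for an illustrative Persian prompt.
outputs = generator(
    "در یک اتفاق شگفت انگیز، پژوهشگران",
    max_length=50,
    num_return_sequences=1,
)
print(outputs[0]["generated_text"])
```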