Datasets for Farsi (Persian) Natural Language Processing (NLP)

Text Classification

Text classification is the task of assigning an appropriate category to a sentence or document. The categories depend on the chosen dataset; in the datasets listed here they include news topics, publication dates, and product categories.
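As a rough illustration of the task, the sketch below trains a tiny multinomial Naive Bayes classifier with add-one smoothing, using only the Python standard library. Everything in it is an assumption made for illustration: the four training sentences and the `sports` / `economy` labels are invented, not taken from any dataset in this list, and the whitespace tokenizer is a placeholder for proper Persian normalization and tokenization.

```python
import math
from collections import Counter, defaultdict

# Toy examples invented for illustration; not drawn from any dataset listed here.
TRAIN = [
    ("تیم ملی فوتبال بازی را برد", "sports"),
    ("بازیکن فوتبال گل زد", "sports"),
    ("قیمت دلار در بازار افزایش یافت", "economy"),
    ("بانک مرکزی نرخ سود را اعلام کرد", "economy"),
]


def tokenize(text):
    # Whitespace split only; real Persian text usually needs normalization first.
    return text.split()


def train(examples):
    """Count documents per label and token occurrences per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocab.add(token)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest smoothed log-probability."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore tokens never seen in training
            # Add-one (Laplace) smoothing over the vocabulary.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAIN)
print(predict(model, "فوتبال تیم ملی"))  # prints: sports
```

With a real dataset, the `TRAIN` list would be replaced by the (text, label) pairs loaded from the downloaded files, and a held-out split would be needed to measure accuracy.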

| Title | Description |
|---|---|
| News Category Dataset (bigdata-ir.com) [website] [download] | About twenty thousand news articles with thematic classification and hierarchical clustering. |
| News Categorization by Date [website] [download] | About four thousand Persian news articles categorized by date. |
| DK Dataset-2 User Comments | One hundred thousand user comments, with several comments per product. Uses of this data include natural language processing, classification based on comment quality, spam detection, and psychological analysis. |
| DK Dataset-4 Product Reviews Quality | The history of more than one hundred thousand products. Research proposals include anomaly detection, future price forecasting, statistical analysis of prices and their stability across categories, and the use of machine learning to identify incorrect prices set by vendors. |
| DK Dataset-5 Products List | One hundred thousand products and their categories. Suggested applications are category prediction, anomaly detection, categorization error detection, duplicate detection, and dynamic categorization using data attributes. |