Text Classification methods have been improving at an unparalleled speed in the last decade thanks to the success brought about by deep learning. Historically, state-of-the-art approaches have been developed for and benchmarked against English datasets, while other languages have had to catch up and deal with inevitable linguistic challenges. This paper offers a survey with practical and linguistic connotations, showcasing the complications and challenges tied to the application of modern Text Classification algorithms to languages other than English. We engage this subject from the perspective of the Italian language, and we discuss in detail issues related to the scarcity of task-specific datasets, as well as the issues posed by the computational cost of modern approaches. We substantiate this by providing an extensively researched list of available datasets in Italian and comparing it with a similarly researched list for French.

Data Availability Statement: All data and code used for the experimental session have been linked in the article, except for Reuters Corpus Volume 1 and 2, which, as a dataset owned by a third party, cannot be shared but must be requested from the National Institute of Standards and Technology (NIST). The Reuters Corpus Volume 1/2 is a collection of Reuters News stories that can be used for research purposes. Volume 1 contains only English articles, while Volume 2 contains articles in 13 different languages. Articles are annotated with various hierarchical labels, including topic codes, which have been used as a general descriptor of the article's content. All information on how to request this data is described on the following website. Once the data has been obtained, the procedure to generate the dataset and reproduce the presented experiments is described in the code repository available here.