DEVELOPMENT OF LEXICA-SEMANTIC MARKING OF VERBS IN THE NATIONAL CORPUS OF THE KAZAKH LANGUAGE: WORLD EXPERIENCE, CLASSIFICATION, MARKUP IN THE CORPUS
DOI:
https://doi.org/10.48371/PHILS.2022.66.3.010
Keywords:
corpus linguistics, markup, lexica-semantic classification, verb, semantics, meaning, category, linguistic annotation
Abstract
The article addresses the development of lexical-semantic markup, one of the main types of markup in the world practice of corpus building. In particular, it reviews the work of domestic and foreign scholars on computational linguistics and lexical-semantic classification, presents the stages of creating the lexical-semantic markup of verbs in the National Corpus of the Kazakh language, and explains its practical basis.
The accelerated development of information technology requires every branch of science, including linguistics, to master electronic resources. Corpus linguistics is the field of linguistics that studies and implements the computational processing of language. Building the National Corpus of the Kazakh language relies on developing markup layers that automatically analyze each level of the language. One of the more complex layers in the linguistic annotation of words is lexical-semantic markup. Compared to the corpora of Russian, Kalmyk and other languages, the lexical-semantic markup in the National Corpus of the Kazakh language goes deeper into the meaning of the word, down to the level of the seme. As a result, it distinguishes 72 small (individual) lexical-semantic groups. This allows users to find the information they need more accurately. The interface to the markup system should be simple and understandable for any user, whether a specialist in this field or a specialist from another area who is just learning to use it. Accordingly, the lexical-semantic groups are given short and specific names.
The corpus base includes 18,200 verbs, whose shades of meaning are being studied. In the course of the study, it was proposed to characterize verbs by five different features within the lexical-semantic framework. First, by word-formation character: simple or complex, root or derived. Second, by the lexical-grammatical categories of transitivity (transitive or intransitive) and of positive or negative form. By connotation, verbs are classified as positive, negative, or neutral. To capture verb meaning in more depth, verbs are further divided, according to their shared and distinguishing semes, into large (lexical-semantic) and small (semantic) groups.
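As an illustration only, the features listed above could be stored per verb in a record like the one sketched below. This is a minimal sketch, not the corpus's actual schema: the class name VerbAnnotation, the function find_verbs, the field names, the three sample verbs, and the group labels ("motion", "speech" and so on) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerbAnnotation:
    """Hypothetical lexical-semantic record for one verb form."""
    form: str          # verb form as it appears in the corpus
    structure: str     # "simple" or "complex"
    derivation: str    # "root" or "derived"
    transitivity: str  # "transitive" or "intransitive"
    polarity: str      # "positive" or "negative" form
    connotation: str   # "positive", "negative" or "neutral"
    large_group: str   # large (lexical-semantic) group, illustrative label
    small_group: str   # small (semantic) group, illustrative label


# Illustrative entries only; the group labels are not taken from the corpus.
SAMPLE = [
    VerbAnnotation("жүгіру", "simple", "root", "intransitive",
                   "positive", "neutral", "motion", "fast motion"),
    VerbAnnotation("айту", "simple", "root", "transitive",
                   "positive", "neutral", "speech", "saying"),
    VerbAnnotation("айтпау", "simple", "root", "transitive",
                   "negative", "neutral", "speech", "saying"),
]


def find_verbs(records, **criteria):
    """Return the records whose fields match every given criterion."""
    return [
        record for record in records
        if all(getattr(record, field) == value
               for field, value in criteria.items())
    ]


if __name__ == "__main__":
    # Example query: transitive verbs in the (hypothetical) "speech" group.
    for record in find_verbs(SAMPLE, transitivity="transitive",
                             large_group="speech"):
        print(record.form, record.polarity)
```

Keeping each feature as a separate field means a search can combine any subset of them, which matches the stated goal of letting users narrow results down to a specific small group.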
The article was written within the framework of the research project BR11765619 «Development of the National Corpus of the Kazakh language as information-innovation state language base: research and training internet resource».