Yatsko's Computational Linguistics Laboratory

Here you can buy V. A. Yatsko's academic papers.

Each paper costs $11.

These works are sold by the author.

To get a paper please contact Dr Yatsko at


The following papers are available now.

1. A Classification of linguistic technologies. April 2019.

The paper suggests criteria for classification of linguistic technologies, such as type of input and output data, level of language system, medium and form of communication, underlying algorithms, and intended user audience. It describes distinctive features of linguistic programs, applications, and systems that take natural language text as their input. The paper introduces the concept of the linguistic information system and proposes the principles of universality and complementarity for its development. The systemic description of linguistic resources and technologies is important for creating a theoretical framework to configure linguistic software. The author discusses differences between the terms "computational linguistics" and "natural language processing".

2. Simple probabilities and methods for information processing. January 2019

The paper gives a generalized review of the main notions of probability theory and describes the distinctive features of simple probabilities. Classic, ranked, and reduced formulas representing simple probabilities are distinguished. Special emphasis is placed on the use of simple probabilities to smooth differences in the sizes of statistical samples. The Russian version of the paper was published in Nauchno-technicheskay informatsia. Ser. 2, 2018, No 8, p.1-7. Cite this paper as: Yatsko, V.A. Simple Probabilities and Methods for Information Processing. In Nauchno-technicheskay informatsia. Ser. 2, 2018, No 8, p.1-7.

3. Informatics, information science, and computer science. November 2018. In: Scientific and Technical Information Processing 45(4):235-240. DOI: 10.3103/S014768821804008

In this paper, I draw a distinction between information and computer sciences in terms of their objects and directions of research. In terms of the operating mode, I distinguish among automatic, automated, semi-automatic, and assistant systems and show that in each application domain there can be different configurations of these systems at different periods of time. In addition, I analyze the engineering, linguistic, and mathematical aspects of domain-specific research that falls within the scope of informatics, formulate some problems, and discuss promising directions.

4. Another tagger. July 2018

The paper describes a hybrid part-of-speech tagger for Russian designed to support an opinion mining system. The tagger is based on an extensive morphological dictionary and Bayesian analysis. The paper suggests principles and methodologies for the development of taggers for morphologically rich languages, such as the principle of lexical and morphological distribution, the principle of consecutive priority of lexical and morphological classes, and the principle of morphological variation, as well as lexical, lemma-based, and affixal methodologies for part-of-speech recognition. Much attention is paid to the description of instances of sound alteration and deletion characteristic of contemporary Russian. A generalized 44-step algorithm of the tagger's functioning is described. The algorithm was implemented in a stand-alone application distributed as freeware. The Russian version of the paper (shorter and outdated) was published in Integratsia Nauk journal (p. 78-83) and is available at http://in-sc.ru/d/1942991/d/vypusk_1014.pdf

Key words: principles and methods for Russian parts-of-speech recognition, opinion mining system, morphological dictionary, lexical and morphological classes, morphological variation, part-of-speech tagger's algorithm.

5. Bayes theorem and a methodology for prediction of US presidential elections results. November 2017.

The paper develops a methodology for predicting US presidential election results. The methodology involves the following procedures. 1) Select a time horizon for statistical analysis. 2) Select parameters for statistical analysis. 3) Study the configuration of parameters during the election year for which the forecast is made and extrapolate this configuration to previous years. 4) Apply Bayes theorem using statistical data about previous elections to calculate probabilities for the GOP and Democrats. 5) If some or all parameters are not known, make a projection for the given year based on the results of previous campaigns, developing optimistic, pessimistic, and intermediate scenarios. Based on this methodology, the paper first makes a prediction for the 2016 US presidential election, taking its results as reference data, and then formulates a projection for 2020 based on optimistic and pessimistic scenarios for Republicans and Democrats. The Russian version of the paper was published in Integratsia Nauk journal, issue 10(14), 2017, pp.16-22.
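As a rough illustration of procedure 4, Bayes theorem can be applied to counts taken from past elections. The sketch below is not the paper's model: the single parameter E and all counts are invented placeholders, and the function name `posterior` is hypothetical.

```python
def posterior(prior_h, likelihood_h, likelihood_not_h):
    """Return P(H | E) by Bayes theorem for a binary hypothesis H."""
    evidence = prior_h * likelihood_h + (1 - prior_h) * likelihood_not_h
    return prior_h * likelihood_h / evidence

# Placeholder counts, NOT real election statistics.
gop_wins, dem_wins = 10, 10   # past elections won by each party
gop_wins_with_e = 6           # GOP wins in years when parameter E held
dem_wins_with_e = 3           # Democratic wins in years when E held

prior_gop = gop_wins / (gop_wins + dem_wins)
p_gop = posterior(
    prior_gop,
    gop_wins_with_e / gop_wins,   # P(E | GOP win)
    dem_wins_with_e / dem_wins,   # P(E | Democratic win)
)
p_dem = 1 - p_gop
```

With several parameters, the same update could be repeated for each one, assuming the parameters are treated as independent.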

6. The principles for the investigation of the historical development of computer science. July 2017.  In: Scientific and Technical Information Processing 44(3):207-214. DOI: 10.3103/S0147688217030108

The paper proposes the principles of publicity of historical process, consequency, and paradigmality. Depending on their significance, historical events are classified into critical, epochal, key, and occasional events. Local and global events are differentiated and possible variants of representation of local events at the global level are determined. The paper describes variants of correlation between the new paradigm and the old one, such as total rejection, partial rejection, mergence, and co-existence. It proves that the last variant is more characteristic of computer science. An attempt is made to distinguish “computer science” from “information science.”

7. Distinctive features of the structure of linguistic ontology. June 2017. In: Automatic Documentation and Mathematical Linguistics 51(3):149-158. DOI: 10.3103/S0005105517030128

This paper describes a methodology for developing a linguistic ontology as a component of a system for automatic analysis of customer opinions about commercial products. The fundamental principles of building ontologies of this type are substantiated, which include the following: the relationship between ontology and grammar; distinguishing parametric and evaluative terms in its structure and classification of evaluative terms into syntactic and semantic ones; the binary relationship between syntactic and semantic terms; the gradation scale of the intensity of evaluations. The cases of the homonymy and synonymy of evaluative terms are analyzed for the first time based on Russian data.

8. Evaluation of the efficiency of the chi-square metric. July 2016. In Automatic Documentation and Mathematical Linguistics 50(4):173-178. DOI:

The efficiency of using the chi-square metric to weight terms in text documents is evaluated. The procedure includes the selection and advanced processing of class C and ~C texts, compilation of a reference dictionary and calculation of scores for all the terms in the dictionary, calculation of χ2 coefficients for terms from a class C text, and calculation of the general efficiency factor as the sum of the coefficients found for the terms from the reference dictionary. Weighting by the χ2 formula, by the odds-ratio (OR) formula, and on the basis of probabilistic variables is analyzed and compared. It was found that the best result is yielded by the OR-based weighting.
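For readers unfamiliar with the two formulas, here is a minimal sketch of the textbook χ2 and odds-ratio term weights computed from a 2x2 contingency table. The abstract does not give the exact variants used in the paper, so the formulas, the function names, and the sample counts below are assumptions made for illustration.

```python
def chi_square(a, b, c, d):
    """Textbook chi-square score of a term for class C.

    a: class C texts containing the term
    b: class ~C texts containing the term
    c: class C texts lacking the term
    d: class ~C texts lacking the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column of the table is empty")
    return n * (a * d - b * c) ** 2 / denom

def odds_ratio(a, b, c, d):
    """Odds of the term occurring in C divided by its odds in ~C."""
    if b * c == 0:
        raise ValueError("odds ratio is undefined when b or c is zero")
    return (a * d) / (b * c)

# Invented counts for one term over 100 texts.
chi2 = chi_square(30, 10, 20, 40)
oratio = odds_ratio(30, 10, 20, 40)
```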

9. The Methodology of Symmetric Weighting of Sentences. February 2016.

The paper describes the methodology of symmetric weighting of sentences, which involves the creation of a dictionary and the calculation of connections between sentences. It demonstrates opportunities for applying this methodology to text summarization and authorship attribution. It develops an original methodology to compare the results of symmetric weighting with the results of Copernic Summarizer and the AutoSummarize function in MS Word. Based on standard deviation, symmetric weighting can be used for the purpose of authorship attribution. This paper is the English translation of the Russian paper, which is available at http://lamb.viniti.ru/sid2/sid2free?sid2=J14210360. Original text formatting and pagination retained. Cite this paper as: Yatsko, V.A. (2016) The methodology of symmetric weighting of sentences. Naucno-technicheskaya informatsia. Ser.2. [Scientific and Technical Information. Series 2.], 2:36-41.

10. Automatic text classification method based on Zipf's law. June 2015. In Automatic Documentation and Mathematical Linguistics 49(3):83-88. DOI: 10.3103/S0005105515030048

This paper describes a method for automatic text classification based on analyzing the deviation of the word distribution from Zipf’s law, combined with the zonal data processing approach. Deviation is understood as the difference between the actual numerical score of a word calculated based on its frequency and its score calculated according to Zipf’s law. The proposed method involves the division of input and reference texts into J0, J1, and J2 zones, and the creation of a numerical series using the words that are contained in the J0 zone. The constructed numerical series shows the difference between the real scores of words and the scores calculated according to Zipf’s law. The proposed method can significantly reduce text dimensionality and thus improve the running speed of automatic text classification.
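The notion of deviation from Zipf's law can be illustrated with a short sketch. It assumes that a word's score is its raw frequency and that the Zipf score of the word at rank r is the top frequency divided by r; the paper's actual scoring and the division into J0, J1, and J2 zones are not reproduced here. The function name `zipf_deviations` is hypothetical.

```python
from collections import Counter

def zipf_deviations(tokens):
    """Return (word, actual frequency - Zipf-predicted frequency) by rank."""
    if not tokens:
        return []
    ranked = Counter(tokens).most_common()   # sorted by descending frequency
    top_freq = ranked[0][1]
    return [
        (word, freq - top_freq / rank)       # assumed Zipf score: f(1) / rank
        for rank, (word, freq) in enumerate(ranked, start=1)
    ]

series = zipf_deviations("a a a a b b c".split())
```

The resulting numerical series could then be compared between an input text and reference texts, which is the step the abstract describes.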

11. Zonal Text Processing. June 2015.  Digital Scholarship in the Humanities.  Volume 31, Issue 4, December 2016, Pages 773–781, https://doi.org/10.1093/llc/fqv022 

The article describes a methodology of zonal text processing based on an interpretation of Bradford's law in terms of geometric progression. The methodology involves dividing the text into three zones (J0, J1, J2) and finding their composition. To verify the value of the Bradford multiplier, two methods that evaluate the distribution of stop words across the three zones are used. The concept of zonal-correlational processing, which implies contrastive analysis of the J1 zones of two or more texts for the purpose of authorship attribution and classification, is formulated and tested. To address the problem of differences in text sizes, the concept of logarithmic equalizing is proposed.

12. Computational linguistics or linguistic informatics? May 2014. Automatic Documentation and Mathematical Linguistics 48(3). DOI:  10.3103/S0005105514030042

The concept of “linguistic informatics” is introduced in order to refer to a scientific domain that studies the distribution patterns of text information, as well as problems, principles, methods, and algorithms applied for the development of linguistic software and hardware. The key terms and concepts in the related field are investigated; a classification of linguistic software is introduced.

13. Positional-semantic approach to proper names recognition

The paper demonstrates how proper names recognition affects the task of anaphora and co-reference resolution and suggests a positional-semantic algorithm for recognition of anthroponyms. The algorithm involves generation of a list of tokens spelt with initial capital letter and its step-by-step filtering by matching against four types of dictionaries. The first dictionary comprises personal names; the second and the third ones consist of words denoting a person's title, position, degree; the fourth dictionary consists of verbs denoting states, processes, and actions characteristic of human beings, such as performative verbs, mental verbs, verbs denoting feelings and emotions, and sense perception. The suggested algorithm doesn't require POS-tagging or a grammar and can be employed in information retrieval and text summarization without affecting speed of text processing.

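The dictionary-based filtering can be sketched as follows. The abstract does not specify how positions are checked, so this sketch assumes that a capitalized token is accepted if it is a known personal name, if it directly follows a title or position word, or if it directly precedes a "human" verb. The four tiny dictionaries and the function name are hypothetical, and sentence boundaries are ignored.

```python
import re

# Hypothetical miniature dictionaries, for illustration only.
NAMES = {"john", "mary"}
TITLES = {"mr", "dr", "professor"}
POSITIONS = {"president", "director"}
HUMAN_VERBS = {"said", "thinks", "believes", "felt"}

def find_anthroponyms(text):
    """Return capitalized tokens that pass the dictionary filters."""
    tokens = re.findall(r"[A-Za-z]+", text)
    found = []
    for i, token in enumerate(tokens):
        if not token[0].isupper():
            continue                          # keep only capitalized tokens
        low = token.lower()
        if low in TITLES or low in POSITIONS:
            continue                          # a title is not itself a name
        prev_word = tokens[i - 1].lower() if i > 0 else ""
        next_word = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
        if (low in NAMES
                or prev_word in TITLES
                or prev_word in POSITIONS
                or next_word in HUMAN_VERBS):
            found.append(token)
    return found

names = find_anthroponyms(
    "Dr Yatsko said that the Paper describes London. Mary agreed."
)
```

No POS tags or grammar rules are used; each candidate is decided by set lookups on the token and its two neighbours.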
14. Automatic genre recognition and adaptive text summarization. June 2010. Automatic Documentation and Mathematical Linguistics 44(3):111-120. DOI: 10.3103/S0005105510030027

This paper describes an experimental method for automatic text genre recognition based on 45 statistical, lexical, syntactic, positional, and discursive parameters. The suggested method includes: (1) the development of software permitting heterogeneous parameters to be normalized and clustered using the k-means algorithm; (2) the verification of parameters; (3) the selection of the parameters that are the most significant for scientific, newspaper, and artistic texts using two-factor analysis algorithms. Adaptive summarization algorithms have been developed based on these parameters.

15. A method for evaluating modern systems of automatic text summarization. June 2007. Automatic Documentation and Mathematical Linguistics 41(3):93-103. DOI: 10.3103/S0005105507030041  

Four modern systems of automatic text summarization are tested on the basis of a model vocabulary composed by subjects. Distribution of terms of the vocabulary in the source text is compared with their distribution in summaries of different length generated by the systems. Principles for evaluation of the efficiency of the current systems of automatic text summarization are described.

16. TF*IDF Revisited. 2013.

This paper focuses on the advantages of term weighting according to the TF*IDF formula and on some problems that researchers face, specifically the ambiguity of requirements for the size and structure of a collection of text documents (corpus). Three interpretations of the formula are discussed, and the assumption that the results of computations depend on the genre of the texts in the collection is tested experimentally. The experiment demonstrated that contrastive matching of the analyzed text against a corpus of texts belonging to a different genre yields good results.
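For orientation, one common textbook form of the formula is sketched below: relative term frequency multiplied by the natural logarithm of the inverse document frequency. This is only one possible reading and is not claimed to match any of the three interpretations discussed in the paper; the function name and the toy corpus are assumptions.

```python
import math

def tf_idf(term, doc, corpus):
    """Weight of `term` in `doc` (a non-empty token list) against `corpus`."""
    tf = doc.count(term) / len(doc)                 # relative term frequency
    df = sum(1 for d in corpus if term in d)        # documents with the term
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)          # tf * idf

# Toy corpus of two tokenized documents.
corpus = [["zipf", "law", "zipf"], ["law", "text"]]
weights = {t: tf_idf(t, corpus[0], corpus) for t in set(corpus[0])}
```

A term that occurs in every document of the collection gets weight zero, which is why the choice of the collection matters for the result.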

17. A Semi-automatic text summarization system. 2005. Conference on Speech and Computer, 17-19 October, 2005, Patras, Greece, pp. 283-288.

The paper outlines characteristics of contemporary automatic text summarization systems and describes the architecture and functioning of PASS, a semi-automatic text summarization system designed to be used in foreign language teaching to improve learners' speech skills. A generalized scheme of the functioning of educational systems based on client-server architecture is presented.

18. Textual deep structure.  1998. In Proceedings of TSD'98. 

Textual deep structure is characterized by synchronic, diachronic, and causative-consecutive logical relations that constitute the relational aspect of discourse, which should be differentiated from the communicative aspect comprising lexical and grammatical manifestations of the logical relations in surface discourse structure. Three variants of correlation between deep and surface structures can be distinguished: 1) non-correspondence (contradiction); 2) correspondence; 3) inexplicability of logical relations in the communicative aspect of discourse.

19. Compositional modeling: logical and linguistic principles of a new method of scientific-information activity. 1995. In Automatic Documentation and Mathematical Linguistics 29(6):23-30.

The article deals with a new method of natural language academic discourse analysis. The method of compositional modeling includes four procedures: interpretation, reduction, normalization, and canonization. Implementation of these procedures allows for making up a compositional model of discourse reflecting logical relations between utterances within types of speech (narration, description, and reasoning). In the article, linguistic structure of reasoning is revealed and its classification is given. 

20.  Logical-semantic aspects of the concept of alienated knowledge. 1993.  In Automatic Documentation and Mathematical Linguistics 27(4):28-35

The paper investigates the notion of alienated knowledge based on the differentiation between three roles of the speaker in an utterance: subject of modal attitude, subject of reference, and subject of nomination. If the speaker fulfills all roles, the utterance presents his/her personal knowledge. If one of the roles is not fulfilled by the speaker, the utterance expresses alienated knowledge. The degree of alienation depends on the number of roles not fulfilled by the speaker, the greatest degree of alienation being characteristic of utterances in which the speaker fulfills none of the three roles. Lexical and grammatical manifestations of the three roles are described. The author conducts a contrastive analysis of utterances with similar lexical and grammatical structures in an academic paper and in its summary to find differences in their logical and semantic structures.
