  • In this paper we propose a natural approach to characterizing genomic sequences, based on occurrences of fixed-length words (strings over the alphabet {A,C,G,T}) from a sufficiently large set W of words that are, in the general case, arbitrary. According to our approach, any genomic sequence can be characterized by a histogram of frequencies of imperfect matches of words from the set W; this histogram is called a compositional spectrum (CS). The specificity of CSs is manifested in a reasonable similarity of spectra obtained on different stretches of the same genome and, simultaneously, in a broad range of dissimilarities between spectral characteristics of different genomes. The proposed approach may have various applications in intra- and intergenomic sequence comparisons.

Valery M. Kirzhner, Abraham B. Korol, Alexander Bolshoy and Eviatar Nevo (2002) Compositional spectrum - revealing patterns for genomic sequence characterization and comparison. Physica A, 312, 447-457. (http://www.sciencedirect.com/science/journal/03784371) (Article.cs)
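
As a rough illustration of the compositional spectrum described in the abstract above, the following Python sketch counts imperfect occurrences of each word of a set W in a sequence and normalizes the counts into a histogram. The function names, the use of Hamming distance as the criterion for an imperfect match, and the `max_mismatches` parameter are assumptions made for this example; the abstract does not specify these details.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def compositional_spectrum(seq, words, max_mismatches=1):
    """Return normalized frequencies of imperfect word matches in seq.

    Assumption for this sketch: all words share one length, and a window
    of seq counts as an occurrence of a word when the two differ in at
    most max_mismatches positions.
    """
    k = len(words[0])
    counts = Counter()
    # Slide a window of length k along the sequence.
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        for w in words:
            if hamming(window, w) <= max_mismatches:
                counts[w] += 1
    total = sum(counts.values())
    # Report frequencies in the order of the word set.
    return [counts[w] / total if total else 0.0 for w in words]
```

For instance, `compositional_spectrum("ACGTACGT", ["ACG", "TTT"])` returns `[1.0, 0.0]`, because with one allowed mismatch only the two exact `ACG` windows match. This brute-force scan compares every window with every word, so it is meant only to make the definition concrete, not to scale to whole genomes.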


  • We introduce a novel, linguistic-like method of genome analysis. We propose a natural approach to characterizing genomic sequences based on occurrences of fixed-length words from a predefined, sufficiently large set of words (strings over the alphabet [A, C, G, T]). A measure based on this approach is called a compositional spectrum; it is a histogram of imperfect word occurrences. Our results indicate that the compositional spectrum is an overall characteristic of a long sequence, i.e., a complete genome or an uninterrupted part of a chromosome. This attribute is manifested in the similarity of spectra obtained on different stretches of the same genome and, simultaneously, in a broad range of dissimilarities between spectral representations of different genomes. Imperfect matching makes the approach highly flexible, so sets of relatively long words can be considered. The proposed approach may have various applications in intra- and intergenomic sequence comparisons.

V.M. Kirzhner, E. Nevo, A.B. Korol and A. Bolshoy (2003) One promising approach to a large-scale comparison of genomic sequences. Acta Biotheoretica 51, 2, 73-89. (Article.comp)


  • With the availability of genome sequences, new phylogenetic reconstructions become possible that can reveal genomic relationships among organisms. According to the compositional-spectra (CS) approach proposed in our previous studies, any genomic sequence can be characterized by a distribution of frequencies of imperfect matching of words (oligonucleotides). In the current application of CS analysis, we attempted to analyze the cluster structure of genomes across life. Compositional spectra show a clear three-group clustering of the compared prokaryotic and eukaryotic genomes. Unexpectedly, this grouping differs substantially from the classical Universal Tree of Life structure represented by the common kingdoms known as Eubacteria, Archaebacteria, and Eukarya. The revealed CS clustering is highly stable, which putatively reflects its objective nature; its biological significance remains enigmatic and may result from convergent evolution driven by ecological selection. We believe that our approach provides a new and wider (compared to traditional methods) perspective for extracting genomic information of high evolutionary relevance.

Valery Kirzhner, Alexander Bolshoy, Zeev Volkovich, Abraham Korol, Eviatar Nevo (2005) Large scale genome clustering across life based on a linguistic approach. BioSystems 81, 3, 208-222. (Article.clst)


  • This paper is devoted to techniques for clustering texts based on the comparison of vocabularies of N-grams. In contrast to the regular N-grams approach, the proposed method is based on calculating imperfect occurrences of N-grams in a text, allowing up to a given number of mismatches. We demonstrate that such an approach substantially improves the resolving capacity of the N-grams method for DNA texts. Additionally, we discuss a scheme for the combined use of different types of clustering techniques to verify the partition quality.

Z. Volkovich, V. Kirzhner, A. Bolshoy, A. Korol and E. Nevo (2005) The Method of N-grams in Large-Scale Clustering of DNA texts. Pattern Recognition 38, 11, 1902-1912. (Article.ngram)


  • Several species-specific characteristics of genome organization that are superimposed on its coding aspects were proposed earlier, including genome signature, genome accent and compositional spectrum. These notions could be considered as representatives of genome dialect. Within the proteobacteria, we measured several representatives of genome dialect: the relative abundance of dinucleotides, or genome signature; the profiles of occurrence of 10-nucleotide "words" (compositional spectra); and the profiles of occurrence of 20-nucleotide words using a degenerate two-letter alphabet (purine-pyrimidine compositional spectra). Here, we show that the evolutionary distances between DNA repair and recombination orthologous enzymes (especially those of the nucleotide excision repair system) are highly correlated with purine-pyrimidine compositional spectra and genome signature distances. Orthologous proteins involved in structural or metabolic processes (control group) have significantly lower correlations of their evolutionary distances with the purine-pyrimidine compositional spectra and genome signature distances. We hypothesize that the high correlation of the evolutionary distances of the DNA repair orthologous enzymes with their genome's dialect is a result of the coevolution of the structures of DNA repair enzymes and genome dialects. Species genome dialects could be substantially influenced by the function of DNA polymerase I (the major bacterial DNA repair polymerase). This might cause the differentiation of species genome dialects to correlate with evolutionary changes in DNA polymerase I. Simultaneously, the structures of DNA repair-recombination enzymes might be evolutionarily sensitive and responsive to changes in the structure of their substrate, the DNA (including changes represented by genome dialect differentiation). We further discuss the rationale and mechanisms of the hypothesized coevolution.
We suggest that stress might be an important cause of changes in the repair-recombination genes and the genome dialect, and the trigger of the aforementioned coevolution process. Other triggers might be massive horizontal gene transfer and ecological selection.

A. Paz, V. Kirzhner, E. Nevo, and A. Korol (2005) Coevolution of DNA-Interacting Proteins and Genome "Dialect". Molecular Biology and Evolution 23(1):56-64. (http://mbe.oxfordjournals.org/cgi/content/abstract/msj007v1)


  • In this study, we have calculated distances between genomes based on our previously developed compositional spectra (CS) analysis. The study was conducted using genomes of 39 species of Eukarya, Eubacteria, and Archaea. Based on CS distances, we produced two different consensus dendrograms, one for the four-letter alphabet and one for the two-letter (purine-pyrimidine) alphabet. A comparison of the structure obtained using the purine-pyrimidine alphabet with the standard three-kingdom (3K) scheme reveals substantial similarity. Surprisingly, this is not the case when the same procedure is based on the four-letter alphabet. In this situation, we also found three main clusters, but they differ from those in the 3K scheme. In particular, one of the clusters includes Eukarya, thermophilic bacteria, and a part of the considered Archaea species. We speculate that the key factor in the latter classification (based on the A-T-G-C alphabet) is related to ecology: two ecological parameters, temperature and oxygen, distinctly explain the clustering revealed by compositional spectra in the four-letter alphabet. Therefore, we assume that this result reflects two interdependent processes: evolutionary divergence and superimposed ecological convergence of the genomes, although another process, horizontal transfer, cannot be excluded as an important contributing factor.

Valery Kirzhner, A. Paz, Z. Volkovich, E. Nevo and A. Korol (2007) Different Clustering of Genomes Across Life Using the A-T-C-G and Degenerate R-Y Alphabets: Early and Late Signaling on Genome Evolution? Journal of Molecular Evolution 64, 4:448-456.
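
The degenerate two-letter alphabet mentioned in the abstract above groups the purines (A, G) into one symbol and the pyrimidines (C, T) into another. A minimal Python sketch of that recoding follows; the function name and the choice to leave unrecognized symbols (such as N) unchanged are assumptions made for this example.

```python
# Purines (A, G) map to R; pyrimidines (C, T) map to Y.
RY_TABLE = str.maketrans("AGCT", "RRYY")


def to_purine_pyrimidine(seq):
    """Recode a DNA string into the two-letter R-Y alphabet.

    Characters other than A, C, G, T are left unchanged (an assumption
    of this sketch, not something stated in the abstract).
    """
    return seq.upper().translate(RY_TABLE)
```

For example, `to_purine_pyrimidine("ACGTTGCA")` returns `"RYRYYRYR"`. A spectrum over the two-letter alphabet can then be computed on the recoded string with words drawn from {R, Y}.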


  • In the present paper, 188 prokaryote genomes are classified by separately calculating the compositional spectra for the coding and the non-coding parts of the genomes. For each subsequence, the compositional spectrum is transformed into the corresponding point in a vector space. This enables the categorization of genomes into meaningful groups by a formal method. Repeated clustering performed for the coding and the non-coding genome parts makes it possible to estimate the true number of genome clusters. The method we propose is based on a new application of external cluster validation indexes and on the misclassified quantities obtained in the process of repeated clustering. In addition, we have constructed an embedding of the data into an appropriate Euclidean space based only on the distances between compositional spectra. Biological evaluation of the results obtained for the 4-letter and the 2-letter alphabets substantiates the appropriateness of the resulting cluster-based classification.

Z. Volkovich, V. Kirzhner, Z. Barzily, S. Hosid and K. Korenblat (2010) A Linguistic Approach to Classification of Bacterial Genomes. Pattern Recognition 43, 3: 1083-1093.
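
The abstract above refers to external cluster validation indexes, which score how well two partitions of the same items agree. It does not say which indexes were used, so the Python sketch below uses the Rand index purely as a familiar example of this kind of index; the function name and the idea of comparing a coding-based partition with a non-coding-based one are assumptions made for illustration.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two partitions agree.

    A pair agrees when both partitions put the two items in the same
    cluster, or both put them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0  # zero or one item: the partitions trivially agree
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)
```

The index depends only on which items are grouped together, not on the label values, so `rand_index([0, 0, 1, 1], [1, 1, 0, 0])` returns `1.0`.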


  • This book deals with methods of text comparison that are based on different techniques for converting a text, be it a genetic text or a text of some other type, into a distribution on a certain finite support. Such a distribution is usually referred to as a “spectrum”. The measure of dissimilarity of two texts is formally expressed as a certain “distance” between the spectra of these texts. Such a definition implies that the similarity of the texts results from the similarity of the random processes generating the texts.

Alexander Bolshoy, Zeev Volkovich, Valery Kirzhner, Zeev Barzily. Genome Clustering: From Linguistic Models to Classification of Genetic Texts (Studies in Computational Intelligence), Springer, 2010.
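
To make the notion of a “distance” between spectra concrete, the Python sketch below compares two spectra given as frequency lists over the same finite support. The description above does not name a particular distance, so the L1 (sum of absolute differences) distance used here is an assumption chosen for simplicity, as is the function name.

```python
def l1_distance(spectrum_p, spectrum_q):
    """Sum of absolute differences between two spectra.

    Both spectra are assumed to be frequency lists over the same support,
    listed in the same order.
    """
    if len(spectrum_p) != len(spectrum_q):
        raise ValueError("spectra must be defined on the same support")
    return sum(abs(p - q) for p, q in zip(spectrum_p, spectrum_q))
```

Identical spectra are at distance 0, and two normalized spectra with no common support, such as `[1.0, 0.0]` and `[0.0, 1.0]`, are at the maximum distance of 2.0.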


V. Kirzhner, S. Frenkel and A. Korol (2011) Minimal-Dot Plot: "Old Tale in New Skin" about Sequence Comparison. Information Sciences 181, 8: 1454-1462.


V. Kirzhner, S. Frenkel and A. Korol (2011) Non-alignment comparison of human and high primate genomes. arXiv:1111.6172v1 [q-bio.GN]


V. Kirzhner, S. Frenkel and A. Korol (2012) Organizational Heterogeneity of Vertebrate Genomes. PLoS ONE 7(2):1-15.

See also: List of publications (current millennium)